The first-ever virtual Spark + AI Summit, the premier event for data teams — data scientists, engineers and analysts — is this week, with attendees tuning in from all over the world to share best practices, discover new technologies, network and learn. We are excited to have Microsoft as a Diamond sponsor, bringing Microsoft and Azure Databricks customers together for a lineup of great keynotes and sessions.
Rohan Kumar, Corporate Vice President of Azure Data, returns as a keynote speaker for the third year in a row, along with presenters from a number of Azure Databricks customers including Starbucks, Credit Suisse, CVS, ExxonMobil, Mars, Zurich North America and Atrium Health. Below are some of the top sessions to add to your agenda:
KEYNOTE
How Starbucks is achieving its 'Enterprise Data Mission' to enable data and ML at scale and provide world-class customer experiences
Starbucks During the WEDNESDAY MORNING KEYNOTE, 8:30 AM - 10:30 AM (PDT)
Vishwanath Subramanian, Director of Data and Analytics Engineering, Starbucks
At Starbucks, we make sure that everything we do is viewed through the lens of humanity – from our commitment to the highest-quality coffee in the world, to the way we engage with our customers and communities to do business responsibly. A key aspect of delivering those world-class customer experiences is data. This talk highlights the Enterprise Data Analytics mission at Starbucks, which enables decisions powered by data at tremendous scale. This includes everything from processing data at petabyte scale with governed processes, to deploying platforms at the speed of business, to enabling ML across the enterprise. This session will detail how Starbucks has built world-class enterprise data platforms to drive world-class customer experiences.
KEYNOTE
Responsible ML – Bringing Accountability to Data Science
Microsoft During the THURSDAY MORNING KEYNOTE, 9:00 AM - 10:30 AM (PDT)
Rohan Kumar, Corporate Vice President of Azure Data, Microsoft
Sarah Bird, AI Research and Products, Microsoft
Responsible ML is one of the most talked-about topics in AI at the moment. With the growing importance of ML, it is even more important for us to exercise ethical AI practices and ensure that the models we create live up to the highest standards of inclusiveness and transparency. Join Rohan Kumar as he talks about how Microsoft brings cutting-edge research into the hands of customers to help them be more accountable for their models and responsible in their use of AI. For the AI community, this is an open invitation to collaborate and contribute to shaping the future of Responsible ML.
KEYNOTE
How Credit Suisse is Leveraging Open Source Data and AI Platforms to Drive Digital Transformation, Innovation and Growth
Credit Suisse During the THURSDAY MORNING KEYNOTE, 9:00 AM - 10:30 AM (PDT)
Anurag Sehgal, Managing Director, Credit Suisse Global Markets
Despite the increasing embrace of big data and AI, most financial services companies still experience significant challenges around data types, privacy and scale. Credit Suisse is overcoming these obstacles by standardizing on open, cloud-based platforms, including Azure Databricks, to increase the speed and scale of operations and democratize ML across the organization. Now, Credit Suisse is leading the way by successfully employing data and analytics to drive digital transformation, deliver new products to market faster, and fuel business growth and operational efficiency.
Automating Federal Aviation Administration's (FAA) System Wide Information Management (SWIM) Data Ingestion and Analysis
Microsoft, Databricks and U.S. DOT WEDNESDAY, 12:10 PM (PDT)
The System Wide Information Management (SWIM) Program is a National Airspace System (NAS)-wide information system that supports Next Generation Air Transportation System (NextGen) goals. SWIM facilitates the data-sharing requirements of NextGen, providing its digital data-sharing backbone. The SWIM Cloud Distribution Service (SCDS) is a Federal Aviation Administration (FAA) cloud-based service that provides publicly available FAA SWIM content to FAA-approved consumers via Solace JMS messaging. In this session, we will showcase the work we did at USDOT-BTS to automate the required infrastructure, configuration, ingestion and analysis of public SWIM data sets.
How Azure and Databricks Enabled a Personalized Experience for Customers and Patients at CVS Health
CVS Health WEDNESDAY, 2:30 PM (PDT)
CVS Health delivers millions of offers to over 80 million customers and patients on a daily basis to improve the customer experience and put patients on a path to better health. In 2018, CVS Health embarked on a journey to personalize the customer and patient experience through machine learning on the Microsoft Azure Databricks platform. This presentation will discuss how the Azure Databricks environment enabled rapid in-market deployment of the first machine learning model, within six months, on billions of transactions using Apache Spark. It will also cover several use cases in which this has delivered immediate value for the business, including test-and-learn experimentation on how best to personalize content for customers, along with lessons learned on the journey through the evolving landscape of cloud computing and machine learning in a dynamic healthcare environment.
Productionizing Machine Learning Pipelines with Databricks and Azure ML
ExxonMobil WEDNESDAY, 2:30 PM (PDT)
Deployment of modern machine learning applications can require a significant amount of time, resources, and experience to design and implement – thus introducing overhead for small-scale machine learning projects.
In this tutorial, we present a reproducible framework for quickly jumpstarting data science projects using Databricks and Azure Machine Learning workspaces that enables easy production-ready app deployment for data scientists in particular. Although the example presented in the session focuses on deep learning, the workflow can be extended to other traditional machine learning applications as well.
The tutorial will include sample code with templates, a recommended project organization structure and tools, along with key learnings from our experience deploying machine learning pipelines into production and distributing a repeatable framework within our organization.
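As one illustration of how Databricks and Azure Machine Learning workspaces can be wired together, experiment tracking from a Databricks notebook can be pointed at an AML workspace via MLflow. This is a minimal sketch, not the presenters' framework; the workspace details, experiment name and logged values are placeholders:

```python
# Minimal sketch: route MLflow tracking from Databricks to an Azure ML workspace.
# Assumes the azureml-mlflow package is installed; workspace details are placeholders.
import mlflow
from azureml.core import Workspace

ws = Workspace.get(name="my-aml-workspace",
                   subscription_id="<subscription-id>",
                   resource_group="<resource-group>")

# Point MLflow at the AML workspace so runs appear in both Databricks and AML.
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
mlflow.set_experiment("jumpstart-demo")

with mlflow.start_run():
    mlflow.log_param("model_type", "cnn")
    mlflow.log_metric("val_accuracy", 0.91)  # illustrative value only
```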
Cloud and Analytics—From Platforms to an Ecosystem
Zurich North America WEDNESDAY, 3:05 PM (PDT)
Zurich North America is one of the largest providers of insurance solutions and services in the world, with customers representing a wide range of industries, from agriculture to construction, and more than 90 percent of the Fortune 500. Data science is at the heart of Zurich's business, with a team of 70 data scientists working on everything from optimizing claims-handling processes to protecting against the next risk to revamping the suite of data and analytics offerings for customers.
In this presentation, we will discuss how Zurich North America has implemented a scalable, self-service data science ecosystem built around Databricks that optimizes and scales the activities in the data science project lifecycle, and how it integrates the Azure data lake with analytical tools to streamline machine learning and predictive analytics efforts.
Building the Petcare Data Platform using Delta Lake and ‘Kyte’: Our Spark ETL Pipeline
Mars THURSDAY, 12:10 PM (PDT)
At Mars Petcare (in a division known as Kinship Data & Analytics), we are building out the Petcare Data Platform – a cloud-based data lake solution. Leveraging Microsoft Azure, we were faced with important decisions around tools and design. We chose Delta Lake as the storage layer to build out our platform and bring insight to the science community across Mars Petcare. We leveraged Spark and Databricks to build 'Kyte', a bespoke pipeline tool that has massively accelerated our ability to ingest, cleanse and process new data sources from across our large and complicated organisation. Building on this, we have started to use Delta Lake for our ETL configurations and have built a bespoke UI for monitoring and scheduling our Spark pipelines. Find out why we chose a Spark-heavy ETL design and a Delta Lake-driven platform, and why we are committing to Spark and Delta Lake as the core of our platform to support our mission: Making a Better World for Pets!
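For readers new to Delta Lake as a storage layer, the core ingest pattern looks roughly like the sketch below. This is a minimal illustration only, not the Kyte pipeline itself; the paths, column names and partitioning choice are assumptions:

```python
# Minimal Delta Lake ingest step on Databricks (PySpark); all names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided as `spark` on Databricks

# Read raw landed files, apply a light cleansing step, then append to a Delta table.
raw = (spark.read
       .option("header", "true")
       .csv("/mnt/raw/petcare/visits/"))

cleansed = (raw
            .dropDuplicates(["visit_id"])
            .withColumn("ingest_date", F.current_date()))

(cleansed.write
 .format("delta")
 .mode("append")
 .partitionBy("ingest_date")
 .save("/mnt/curated/petcare/visits"))
```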
Leveraging Apache Spark for Large Scale Deep Learning Data Preparation and Inference
Microsoft THURSDAY, 3:05 PM (PDT)
To scale out deep learning training, a popular approach is to use distributed deep learning frameworks to parallelize processing and computation across multiple GPUs/CPUs. These frameworks work well when input training data elements are independent, allowing parallel processing to start immediately. However, the preprocessing and featurization steps, crucial to deep learning development, might involve complex business logic with computations across multiple data elements that the standard distributed frameworks cannot handle efficiently. These preprocessing and featurization steps are where Spark can shine, especially with the upcoming support in version 3.0 for binary data formats commonly found in deep learning applications. The first part of this talk will cover how pandas UDFs, together with Spark's support for binary data and TensorFlow's TFRecord format, can be used to speed up deep learning preprocessing and featurization. The second part will focus on techniques for efficiently performing batch scoring on large data volumes with deep learning models when real-time scoring methods do not suffice. Spark 3.0's new pandas UDF features that are helpful for deep learning inference will also be covered.
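To make the pattern concrete, here is a minimal sketch of batch inference over image files using Spark 3.0's binary file source and a pandas UDF. The model path, image size and preprocessing are assumptions for illustration, not the speakers' code:

```python
# Sketch: batch scoring of image files with Spark's binaryFile source and a pandas UDF.
# Model path, image size and preprocessing are hypothetical.
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import FloatType

spark = SparkSession.builder.getOrCreate()

# Spark 3.0's binary file source yields path, modificationTime, length and content columns.
images = spark.read.format("binaryFile").load("/mnt/data/images/*.jpg")

@pandas_udf(FloatType())
def score(content: pd.Series) -> pd.Series:
    import tensorflow as tf
    # Loaded once per pandas batch here; in practice, cache the model per worker.
    model = tf.keras.models.load_model("/dbfs/models/classifier")

    def preprocess(raw: bytes) -> np.ndarray:
        img = tf.io.decode_jpeg(raw, channels=3)
        return (tf.image.resize(img, (224, 224)) / 255.0).numpy()

    batch = np.stack([preprocess(b) for b in content])
    preds = model.predict(batch)  # assumes a single-output model
    return pd.Series(preds[:, 0].astype("float32"))

scored = images.withColumn("score", score("content"))
```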
All In – Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databricks) – A Real World Case Study
Atrium Health THURSDAY, 3:40 PM (PDT)
Molecular profiling provides precise and individualized cancer treatment options and decision points. By assessing DNA, RNA, proteins, etc., clinical teams are able to understand the biology of the disease and provide specific treatment plans for oncology patients. An integrated database with demographic, clinical and molecular data was created to summarize individualized genomic reports. Oncologists are able to review the reports and receive assistance interpreting results and potential treatment plans. The architecture supporting the current environment includes WASB storage, bash/cron/PowerShell, Hive and Office 365 (SharePoint). Via an automated process, personalized genomics data is delivered to physicians. As we supported this environment, we noted unique challenges and brainstormed a plan for the next generation of this critical business pipeline.
After researching different platforms, we felt that Databricks would allow us to cut costs, standardize our workflow and easily scale for a large organization. This presentation will detail some of the challenges with the previous environment, why we chose Apache Spark and Databricks, our migration plans and lessons learned, the new technologies used after the migration (Data Factory/Databricks, Power Apps/Power Automate/Logic Apps, Power BI), and how the business has been impacted post-migration. Migration to Databricks was critical for our organization due to the time sensitivity of the data and our organizational commitment to personalized treatment for oncology patients.
SparkCruise: Automatic Computation Reuse in Apache Spark
Microsoft FRIDAY, 10:35 AM (PDT)
Queries in production workloads and interactive data analytics are often overlapping, i.e., multiple queries share parts of the computation. These redundancies increase processing time and total cost for the user. To reuse computations, many big data processing systems support materialized views. However, it is challenging to manually select common computations in the workload given the size and evolving nature of query workloads. In this talk, we will present SparkCruise, an automatic computation reuse system developed for Spark. It can automatically detect overlapping computations in the past query workload and enable automatic materialization and reuse in future Spark SQL queries.
SparkCruise requires no active involvement from the user, as materialization and reuse are applied automatically in the background as part of query processing. We can perform all these steps without changing the Spark code, demonstrating the extensibility of the Spark SQL engine. SparkCruise has been shown to improve the overall runtime of TPC-DS queries by 30%. Our talk will be divided into three parts. First, we will explain the end-to-end system design, with a focus on how we added workload awareness to the Spark query engine. Then, we will demonstrate all the steps, including analysis, feedback, materialization and reuse, on a live Spark cluster. Finally, we will show the workload insights notebook: a Python notebook that displays information from the workload's query plans in a flat table. This table helps users and administrators understand the characteristics of their workloads and the cost/benefit tradeoff of enabling SparkCruise.
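For rough intuition about what SparkCruise automates, the sketch below shows the manual version of the same idea in plain PySpark: two queries share a join, so the shared subexpression is materialized once and reused by both. This is an illustration only, not the SparkCruise API; the table, column and path names are made up:

```python
# Hand-rolled computation reuse, i.e. the manual version of what SparkCruise automates.
# Table names, columns and the output path are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

sales = spark.table("sales")
customers = spark.table("customers")

# Shared subexpression appearing in multiple queries: sales enriched with customer data.
enriched = sales.join(customers, "customer_id")

# Materialize the common computation once; SparkCruise infers this opportunity
# automatically from past query plans and applies it in the background.
enriched.write.format("parquet").mode("overwrite").save("/mnt/materialized/enriched_sales")
reused = spark.read.parquet("/mnt/materialized/enriched_sales")

# Downstream queries read the materialized result instead of recomputing the join.
by_region = reused.groupBy("region").agg({"amount": "sum"})
by_segment = reused.groupBy("segment").count()
```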
Deploy and Serve Model from Azure Databricks onto Azure Machine Learning
Microsoft FRIDAY, 11:10 AM (PDT)
We demonstrate how to deploy a PySpark-based multi-class classification model, trained on Azure Databricks, onto Azure Kubernetes Service (AKS) using Azure Machine Learning (AML), and expose it as a web service. This presentation covers the end-to-end development cycle, from training the model to using it in a web application.

Machine learning problem formulation: Current solutions for detecting the semantic types of tabular data mostly rely on dictionaries/vocabularies, regular expressions and rule-based lookups. However, these solutions are (1) not robust to dirty and complex data and (2) not generalizable to diverse data types. We formulate this as a machine learning problem by training a multi-class classifier to automatically predict the semantic type of tabular data.

Model training on Azure Databricks: We chose Azure Databricks to perform featurization and model training using PySpark SQL and the Machine Learning Library (MLlib). To speed up featurization, we register the featurization functions as PySpark user-defined functions (UDFs) so that they are distributed across the cluster. For model training, we chose Random Forest as the classification algorithm and optimized the model hyperparameters using PySpark MLlib.

Model deployment using Azure Machine Learning: Azure Machine Learning provides reusable and scalable capabilities for managing the lifecycle of machine learning models. We developed an end-to-end deployment pipeline on Azure Machine Learning covering model preparation, compute initialization, model registration and web service deployment.

Serving as a web service on Azure Kubernetes Service: AKS provides fast response and autoscaling capabilities for serving models as web services, together with security authorization. We customized the AKS cluster with a PySpark runtime to support PySpark-based featurization and model scoring. Our model and scoring service are deployed onto an AKS cluster and served as HTTPS endpoints with both key-based and token-based authentication.
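The registration and deployment steps described above look roughly like the following sketch using the Azure Machine Learning SDK (v1, azureml-core). The workspace configuration, model path, entry script, environment file and cluster name are assumptions for illustration, not the speakers' code:

```python
# Hedged sketch of AML model registration and AKS deployment (azureml-core, SDK v1).
# Paths, names and the score.py entry script are hypothetical.
from azureml.core import Workspace, Environment
from azureml.core.model import Model, InferenceConfig
from azureml.core.compute import ComputeTarget
from azureml.core.webservice import AksWebservice

ws = Workspace.from_config()  # assumes a config.json for the AML workspace

# Register the trained PySpark model artifacts (e.g. a saved MLlib PipelineModel).
model = Model.register(workspace=ws,
                       model_path="outputs/rf_pipeline_model",
                       model_name="semantic-type-classifier")

# score.py (not shown) would load the PipelineModel and expose init()/run().
env = Environment.from_conda_specification("pyspark-scoring", "conda.yml")
inference_config = InferenceConfig(entry_script="score.py", environment=env)

# Deploy to an existing AKS cluster with key-based auth enabled.
aks_target = ComputeTarget(workspace=ws, name="aks-inference")
deploy_config = AksWebservice.deploy_configuration(cpu_cores=2, memory_gb=4,
                                                   auth_enabled=True)
service = Model.deploy(ws, "semantic-type-svc", [model],
                       inference_config, deploy_config, aks_target)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)
```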
We look forward to connecting with you at Spark + AI Summit! If you have questions about Azure Databricks or Azure service integrations, please visit the Microsoft Azure virtual booth at Spark + AI Summit.
For more information about Azure Databricks, go to www.databricks.com/azure