Denny Lee is a Developer Advocate at Databricks. He is a hands-on distributed systems and data sciences engineer with extensive experience developing internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premises and cloud environments. He also has a Master's in Biomedical Informatics from Oregon Health & Science University and has architected and implemented powerful data solutions for enterprise healthcare customers. His current technical focuses include Distributed Systems, Apache Spark, Deep Learning, Machine Learning, and Genomics.
As ML-driven innovations are propelled by the self-service capabilities in the Enterprise Data and Analytics Platform (EDAP), teams face a significant entry barrier and productivity issues in moving from POCs to operating ML-powered apps at scale in production. This talk follows one team's journey in using the Starbucks AI foundational capabilities in EDAP to deploy, manage, and operate ML models as secure and scalable cognitive services that have the potential to power internet-scale inference for use cases and applications.
With the current COVID-19 pandemic impacting many aspects of our lives, understanding the data and models around COVID-19 is ever more crucial. Understanding the potential number of cases informs the guidance around our policies (whether more hospital ICU beds are needed, when to ease stay-at-home orders, when to open schools, etc.). In this session, we will focus on some exploratory data analysis to understand the accuracy of these models. We will then use machine learning models to improve them.
Many organizations using machine learning face challenges storing and versioning their complex ML data, as well as the large number of models generated from that data. To simplify this process, organizations tend to start building their own customized "ML platforms." However, even such platforms are limited to a few supported algorithms, and they tend to be strongly coupled with the companies' internal infrastructure. MLflow, an open-source project designed to standardize and unify the machine learning process, and Delta Lake, an open-source storage layer that brings reliability to data lakes, both originated at Databricks and can be used together to provide reliable, full data lineage across the machine learning life cycle.
In this talk, we will give a detailed introduction to two popular features: MLflow Model Registry and Delta Lake Time Travel, as well as how they can work together to help create a full data lineage in machine learning pipelines.
MLflow Model Registry provides a suite of APIs and an intuitive UI for organizations to register and share new versions of models, as well as to perform lifecycle management on their existing models. It is seamlessly integrated with the existing MLflow tracking component, so each registered model can be traced back to the original run where its artifacts were generated and to the version of the source code for that run, giving a complete lineage of the lifecycle for all models. It can also be integrated with existing ML pipelines to deploy the latest version of a model to production.
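The lineage idea described above can be illustrated with a small, self-contained sketch. This is a toy in-memory model, not the MLflow API: the `ToyRegistry` class, its method names, the model name `loan_risk`, the run IDs, the commit strings, and the stage labels are all hypothetical and chosen only for illustration. It shows how each registered version can carry a pointer back to the run and source version that produced it, and how a pipeline might look up the latest version in a given stage.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: int
    run_id: str          # tracking run that produced the model artifacts
    source_commit: str   # source code version recorded for that run
    stage: str = "None"  # hypothetical lifecycle label


class ToyRegistry:
    """Toy stand-in for a model registry; NOT the MLflow API."""

    def __init__(self):
        self._versions = {}  # model name -> list of ModelVersion

    def register(self, name, run_id, source_commit):
        """Add a new version of a model; versions start at 1."""
        versions = self._versions.setdefault(name, [])
        mv = ModelVersion(name, len(versions) + 1, run_id, source_commit)
        versions.append(mv)
        return mv

    def transition(self, name, version, stage):
        """Move one version to a new lifecycle stage."""
        versions = self._versions[name]
        updated = replace(versions[version - 1], stage=stage)
        versions[version - 1] = updated
        return updated

    def latest(self, name, stage):
        """Return the newest version in the given stage, or None."""
        matches = [mv for mv in self._versions.get(name, []) if mv.stage == stage]
        return matches[-1] if matches else None


registry = ToyRegistry()
registry.register("loan_risk", run_id="run-001", source_commit="abc123")
second = registry.register("loan_risk", run_id="run-002", source_commit="def456")
registry.transition("loan_risk", second.version, "Production")

# A deployment step can resolve the production model and its lineage.
prod = registry.latest("loan_risk", "Production")
```

Here `prod.run_id` and `prod.source_commit` play the role of the trace back to the original run and source version; in a real registry those links come from the tracking component rather than from arguments supplied by hand.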
Delta Lake Time Travel capabilities automatically version the big data that you store in your data lake as you write into a Delta table or directory. You can access any historical version of the data with a version number or a timestamp. This temporal data management simplifies your data pipeline by making it easy to audit, roll back data in case of accidental bad writes or deletes, and reproduce experiments and reports.
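As a rough illustration of version-on-write and reads by version number or timestamp, here is a toy, in-memory sketch. It does not call Delta Lake (where, to the best of our knowledge, snapshots are selected with reader options such as `versionAsOf` and `timestampAsOf`); the `VersionedTable` class, its methods, and the sample rows are hypothetical and exist only to show the idea.

```python
from datetime import datetime, timedelta


class VersionedTable:
    """Toy append-only table that keeps one snapshot per committed write."""

    def __init__(self):
        # Each entry is (commit_timestamp, full snapshot of rows after the write).
        self._commits = []

    def write(self, rows, timestamp):
        """Commit a write and return the new version number (starting at 0)."""
        previous = self._commits[-1][1] if self._commits else []
        self._commits.append((timestamp, previous + list(rows)))
        return len(self._commits) - 1

    def read(self, version=None, timestamp=None):
        """Read the latest snapshot, a given version, or the snapshot as of a timestamp."""
        if not self._commits:
            raise LookupError("table has no commits")
        if version is not None:
            return list(self._commits[version][1])
        if timestamp is not None:
            # Pick the last commit made at or before the requested time.
            eligible = [rows for ts, rows in self._commits if ts <= timestamp]
            if not eligible:
                raise LookupError("no version exists at or before that timestamp")
            return list(eligible[-1])
        return list(self._commits[-1][1])


t0 = datetime(2020, 6, 1, 9, 0)
table = VersionedTable()
v0 = table.write([{"id": 1, "amount": 10}], timestamp=t0)
# Simulate an accidental bad write one hour later.
v1 = table.write([{"id": 2, "amount": -999}], timestamp=t0 + timedelta(hours=1))

good_rows = table.read(version=v0)                                 # by version number
rows_at_0930 = table.read(timestamp=t0 + timedelta(minutes=30))    # by timestamp
latest_rows = table.read()                                         # includes the bad row
```

Reading `version=v0` or any timestamp before the bad write recovers the earlier state, which is the basis for the audit, rollback, and reproducibility uses mentioned above. A real storage layer would record changes in a transaction log rather than copying full snapshots as this toy does.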
A live demo will be provided to show how the above features from MLflow and Delta Lake can work together to help create a full data lineage through life cycles of a machine learning pipeline.
Seattle Children's is dedicated to providing the best medical care possible through strategies which include researchers and clinicians working alongside each other to improve our understanding of pediatric diseases. Full realization of this relationship requires systems and processes designed to enable the capture, discovery, and effective communication of knowledge and information. So how do we enable the translation of knowledge and expertise, generated by our scientists and clinicians, to improve patient care?
In this talk, we will discuss how we are building a loosely coupled framework composed of MLflow, Vega-Lite, and other open source tools as part of our knowledge capture, management, and communication strategy. We will demonstrate how we leverage the MLflow Model Registry to capture visualizations in a way that makes them discoverable by and shareable with clinicians.
Running a global, world-class business with data-driven decision making requires ingesting and processing diverse sets of data at tremendous scale. How does a company achieve this while ensuring quality and honoring its commitment as a responsible steward of data? This session will detail how Starbucks has embraced big data, building robust, high-quality pipelines for faster insights to drive world-class customer experiences.
Kick off your Spark + AI Summit week with Databricks Developer Advocates Denny and Jules! While bantering over some brew, they will (1) take a peek at a few tracks coming up as part of this virtual and global conference, (2) give a quick update on developments in Delta Lake, Apache Spark, Koalas, and MLflow, and (3) show you how to get involved as a contributor for Delta Lake, Apache Spark, Koalas, and MLflow. Laurence from TensorFlow will talk about the latest developments in TensorFlow 2.x, how to analyze and classify images, and in particular, how TF 2.x can be used to classify images for diabetic retinopathy.
Instead of better understanding and optimizing their machine learning models, data scientists spend a majority of their time training and iterating through different models, even in cases where the data is reliable and clean. Important aspects of creating an ML model include (but are not limited to) data preparation, feature engineering, identifying the correct models, training (and continuing to train), and optimizing the models. This process can be (and often is) laborious and time-consuming.
In this session, we will explore this process and then show how the AutoML toolkit (from Databricks Labs) can significantly simplify and optimize machine learning. We will demonstrate all of this using financial loan risk data, with code snippets and notebooks that will be free to download.
In addition to the many data engineering initiatives at Starbucks, we are also working on many interesting data science initiatives. The business scenarios involved in our deep learning initiatives range from planogram analysis (the layout of our stores for efficient partner and customer flow) to predicting product pairings (e.g., purchase a caramel macchiato and perhaps you would like a caramel brownie) via the product components using graph convolutional networks, among others. For this session, we will be focusing on how we can run distributed Keras (TensorFlow backend) training to perform image analytics. This will be combined with MLflow to showcase the data science lifecycle and how Databricks + MLflow simplifies it.