Machine Learning Data Lineage with MLflow and Delta Lake


Many organizations using machine learning struggle to store and version their complex ML data, as well as the large number of models generated from that data. To simplify this process, organizations tend to build their own customized ‘ML platforms.’ However, even such platforms are limited to a few supported algorithms, and they tend to be tightly coupled to each company’s internal infrastructure. MLflow, an open-source project designed to standardize and unify the machine learning process, and Delta Lake, an open-source storage layer that brings reliability to data lakes, both originated at Databricks and can be used together to provide reliable, full data lineage across the machine learning life cycle.

In this talk, we will give a detailed introduction to two popular features: MLflow Model Registry and Delta Lake Time Travel, as well as how they can work together to help create a full data lineage in machine learning pipelines.

MLflow Model Registry provides a suite of APIs and an intuitive UI for organizations to register and share new versions of models, as well as to perform lifecycle management on their existing models. It is seamlessly integrated with the existing MLflow tracking component, so each model version can be traced back to the original run that generated its artifacts and to the version of the source code used for that run, giving a complete lineage for all models. It can also be integrated with existing ML pipelines to deploy the latest version of a model to production.
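As a minimal sketch of that trace-back, the function below looks up a registered model version and follows its run ID to the originating run. It assumes `client` behaves like `mlflow.tracking.MlflowClient`; the function name `model_lineage`, the returned dictionary layout, and the model name in the usage comment are illustrative choices, not part of the talk.

```python
def model_lineage(client, name, version):
    """Trace a registered model version back to the run that produced it.

    `client` is assumed to behave like mlflow.tracking.MlflowClient.
    """
    # Registry entry: holds the artifact location and the originating run ID.
    mv = client.get_model_version(name, version)
    # Tracking entry: holds the parameters and tags recorded during training.
    run = client.get_run(mv.run_id)
    return {
        "model": f"{mv.name}/v{mv.version}",
        # Stage may be absent or deprecated depending on the MLflow release.
        "stage": getattr(mv, "current_stage", None),
        "artifact_source": mv.source,
        "run_id": mv.run_id,
        # MLflow sets this tag when a run is started from inside a git repo;
        # it is None otherwise.
        "git_commit": run.data.tags.get("mlflow.source.git.commit"),
        "params": dict(run.data.params),
    }


# Usage, assuming a reachable tracking server and a registered model with
# the hypothetical name "churn-model":
#   from mlflow.tracking import MlflowClient
#   print(model_lineage(MlflowClient(), "churn-model", "1"))
```

Passing the client in as an argument keeps the sketch independent of how the tracking server is configured.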

Delta Lake Time Travel capabilities automatically version the big data that you store in your data lake as you write into a Delta table or directory. You can access any historical version of the data with a version number or a timestamp. This temporal data management simplifies your data pipeline by making it easy to audit, roll back data in case of accidental bad writes or deletes, and reproduce experiments and reports.
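A minimal sketch of reading a historical snapshot, using the `versionAsOf` and `timestampAsOf` reader options that Delta Lake exposes for time travel. It assumes `spark` is an active SparkSession with Delta Lake configured; the helper names `time_travel_options` and `read_snapshot`, and the path in the usage comment, are hypothetical.

```python
def time_travel_options(version=None, timestamp=None):
    """Build the Delta reader options for exactly one time-travel selector."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return {"versionAsOf": str(version)}
    return {"timestampAsOf": timestamp}


def read_snapshot(spark, path, version=None, timestamp=None):
    """Load a Delta table as it existed at a given version or timestamp.

    `spark` is assumed to be a SparkSession with Delta Lake available.
    """
    opts = time_travel_options(version, timestamp)
    return spark.read.format("delta").options(**opts).load(path)


# Usage, with a hypothetical table path:
#   df_v3 = read_snapshot(spark, "/data/events", version=3)
#   df_old = read_snapshot(spark, "/data/events", timestamp="2020-01-01")
```

Rejecting calls that give both selectors, or neither, avoids silently reading the latest data when a specific snapshot was intended.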

A live demo will be provided to show how the above features from MLflow and Delta Lake can work together to help create a full data lineage through life cycles of a machine learning pipeline.
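One way the two features could be tied together, sketched here under stated assumptions and not taken from the demo itself, is to record the Delta table path and version as parameters of the MLflow run that trains on them. The function name `log_training_snapshot` and the parameter keys `delta_path` and `delta_version` are hypothetical; `mlflow_api` is expected to be the `mlflow` module, passed in as an argument so the sketch does not depend on a configured tracking server.

```python
def log_training_snapshot(mlflow_api, table_path, table_version):
    """Start a run and record which Delta snapshot it trains on.

    `mlflow_api` is assumed to be the `mlflow` module (or an equivalent
    object offering start_run and log_param).
    """
    with mlflow_api.start_run() as run:
        # Hypothetical parameter keys; any consistent naming would work.
        mlflow_api.log_param("delta_path", table_path)
        mlflow_api.log_param("delta_version", table_version)
        # Training and model logging would go here, reading the same
        # table version that was just recorded.
        return run.info.run_id


# Usage, with a hypothetical table path:
#   import mlflow
#   run_id = log_training_snapshot(mlflow, "/data/events", 3)
```

With the version stored on the run, anyone starting from a registered model can recover the run, read its parameters, and reload the exact data snapshot that was used for training.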


 
About Richard Zang

Databricks

Richard Zang is a software engineer on the ML Platform team at Databricks. Richard has great interest and extensive experience in building data-intensive enterprise applications. Before Databricks, he worked at Hortonworks on Apache Ambari, and prior to that he worked at OpenText Analytics building its BI visualization suite. Richard holds an MS in Computer Science from the University of Chicago and a BE in Software Engineering from Sun Yat-Sen University.

About Denny Lee

Databricks

Denny Lee is a Developer Advocate at Databricks. He is a hands-on distributed systems and data sciences engineer with extensive experience developing internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premises and cloud environments. He also has a Master of Biomedical Informatics from Oregon Health & Science University and has architected and implemented powerful data solutions for enterprise healthcare customers. His current technical focuses include Distributed Systems, Apache Spark, Deep Learning, Machine Learning, and Genomics.