Data Versioning and Reproducible ML with DVC and MLflow


Machine Learning development involves comparing models and storing the artifacts they produce. We often compare several algorithms to select the best-performing ones, and we assess different hyperparameters to fine-tune the model. Git helps us store multiple versions of our code, but we also need to keep track of the datasets we are using. This is important not only for audit purposes but also for assessing the performance of models at a later time. Git is the standard code versioning tool in software development; it can be used to store datasets, but it is not an optimal solution for that purpose.

An alternative is Data Version Control (DVC). Despite its name, it is not just a data versioning tool: it also enables model and pipeline tracking. It runs on top of Git, which makes it easy to learn for Git users. At the same time, it overcomes Git's limitations with big files by storing them remotely (e.g. Azure, S3) and keeping only their metadata in Git.
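For instance, a dataset tracked by DVC can later be retrieved by Git revision through the dvc.api Python module. The sketch below assumes a hypothetical repository URL, file path, and tag; only the small .dvc metadata file lives in Git, while the data itself is pulled from the configured remote.

import dvc.api

# Read one version of a DVC-tracked file. The .dvc metadata committed to Git
# tells DVC which blob to fetch from the remote storage (S3, Azure, ...).
with dvc.api.open(
    "data/train.csv",                        # hypothetical DVC-tracked path
    repo="https://github.com/org/project",   # hypothetical Git repository
    rev="v1.0",                              # any Git commit, branch, or tag
    mode="r",
) as f:
    print(f.readline())                      # e.g. inspect the header row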

MLflow is a tool that integrates easily with your model code and can track dependencies, model parameters, metrics, and artifacts. Every run is linked to its corresponding Git commit. Once the model is trained, MLflow can package it in different flavors (e.g. Python/R function, H2O, Spark, TensorFlow…) ready to be deployed. Like DVC, MLflow runs alongside Git: while MLflow helps you manage the Machine Learning lifecycle, DVC helps you manage your datasets.
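As an illustration, the following sketch logs a tracking run around a scikit-learn model (the dataset, hyperparameters, and metric are hypothetical). MLflow records the parameters, the metric, and the model packaged in its sklearn flavor, and tags the run with the current Git commit when executed inside a Git repository.

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, random_state=0)   # toy dataset
params = {"C": 1.0, "max_iter": 200}                         # hypothetical hyperparameters

with mlflow.start_run(run_name="toy-logreg"):
    model = LogisticRegression(**params).fit(X, y)
    mlflow.log_params(params)                                # track hyperparameters
    mlflow.log_metric("train_accuracy", accuracy_score(y, model.predict(X)))
    mlflow.sklearn.log_model(model, "model")                 # package in the sklearn flavor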

In this tutorial, we will learn how to leverage the capabilities of these powerful tools. We will walk through a toy ML project and look at sample code showing how to increase the reproducibility of each step.

Speaker: Petra Kaferle Devisschere

About Petra Kaferle Devisschere

Adaltas

Petra Kaferle Devisschere is a Data Scientist at Adaltas with 13 years of experience in Data Analytics. Prior to Adaltas, she worked in several research laboratories, where she developed tools and protocols to standardize and accelerate data analyses. She processed data originating from various sources: retrieving, cleaning, integrating, analyzing, and visualizing it. Following a specialization in Data Science and AI, she became interested in data throughout its life cycle, up to industrialization and exploitation.