Michael Shtelma

Senior Solutions Architect, Databricks

Databricks Senior Solutions Architect and former Teradata data engineer, focused on operationalizing machine learning workloads in the cloud.

Past sessions

Data & ML projects bring many new complexities beyond the traditional software development lifecycle. Unlike software projects, they cannot be abandoned after they have been successfully delivered and deployed; they must be continuously monitored to verify that model performance still satisfies all requirements. New data with new statistical characteristics can arrive at any time, breaking our pipelines or degrading model performance. These qualities of data & ML projects make continuous testing and monitoring of our models and pipelines a necessity.

In this talk we will show how CI/CD Templates can simplify these tasks: bootstrap a new data project within a minute, set up a CI/CD pipeline using GitHub Actions, and implement integration tests on Databricks. All this is possible because of the conventions introduced by CI/CD Templates, which help automate deployment and testing of arbitrary data pipelines and ML models.
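The GitHub Actions setup described above can be sketched as a minimal workflow file. This is an illustrative assumption, not the templates' actual output: the secret names, Python version, and test path are all hypothetical placeholders.

```yaml
# Illustrative sketch only; secret names and the test entry point are assumptions.
name: ci
on: [push]
jobs:
  integration-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Install project
        run: pip install -e .
      - name: Run integration tests against Databricks
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
        run: python -m pytest tests/integration
```

The key design point is that workspace credentials live in repository secrets, so the same workflow runs unchanged for every contributor and branch.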

Speakers: Michael Shtelma and Ivan Trusov

Data & ML projects bring many new complexities beyond the traditional software development lifecycle. Unlike software projects, they cannot be abandoned after they have been successfully delivered and deployed; they must be continuously monitored to verify that model performance still satisfies all requirements. New data with new statistical characteristics can arrive at any time, breaking our pipelines or degrading model performance.

These qualities of data & ML projects make continuous testing and monitoring of our models and pipelines a necessity. In this talk we will show how CI/CD Templates can simplify these tasks: bootstrap a new data project within a minute, set up a CI/CD pipeline using GitHub Actions, and implement integration tests on Databricks. All this is possible because of the conventions introduced by CI/CD Templates, which help automate deployment and testing of arbitrary data pipelines and ML models.

The CI/CD templates are used by Runtastic to automate the deployment of their Databricks pipelines. During this webinar, Emanuele Viglianisi, Data Engineer at Runtastic, will show how Runtastic uses the CI/CD templates in day-to-day development to run, test, and deploy pipelines directly from the PyCharm IDE to Databricks. Emanuele will present the challenges Runtastic faced and how the team solved them by integrating the CI/CD templates into its workflow.

Speakers: Michael Shtelma and Emanuele Viglianisi

ML development brings many new complexities beyond the traditional software development lifecycle. Unlike software projects, ML projects cannot be abandoned after they have been successfully delivered and deployed; they must be continuously monitored to verify that model performance still satisfies all requirements. In most ML use cases we have to deal with updates to our training set, which can influence model performance. In addition, most models require certain data pre- and post-processing at runtime, which makes the deployment process even more challenging. In this talk, we will show how MLflow can be used to build an automated CI/CD pipeline that deploys a new version of a model, and the code around it, to production. We will also show how the same approach can be used in a training pipeline that retrains the model when new data arrives and deploys the new version if it satisfies all requirements.
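The retrain-and-promote gate described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the metric name and improvement threshold are made up, and a real pipeline would read evaluation results from a tracking server rather than from in-memory dicts.

```python
def should_promote(candidate_metrics, production_metrics,
                   metric="rmse", min_improvement=0.0):
    """Promote the retrained model only if it beats production.

    For an error metric like RMSE, lower is better: the candidate must
    improve on the production score by at least `min_improvement`.
    """
    return candidate_metrics[metric] <= production_metrics[metric] - min_improvement

# New data arrived, the model was retrained, and both versions were evaluated:
production = {"rmse": 0.80}
candidate = {"rmse": 0.72}

if should_promote(candidate, production, min_improvement=0.05):
    print("deploy new model version")   # e.g. register it as the new production model
else:
    print("keep current production model")
```

Making the gate an explicit, testable function is what lets the same check run both in the CI/CD pipeline and in the scheduled retraining job.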

Summit Europe 2019: Managing the Complete Machine Learning Lifecycle with MLflow (EU)

October 16, 2019 05:00 PM PT

ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools, and parameters to get the best results, and they need to track this information to reproduce their work. In addition, developers need to use many distinct systems to productionize models.

To address these challenges, Databricks unveiled MLflow last year, an open source project that aims to simplify the entire ML lifecycle. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models so they can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.

In the past year, the MLflow community has grown quickly: over 120 contributors from more than 40 companies have contributed code to the project, and over 200 companies are using MLflow.

In this tutorial, we will show you how MLflow can help you:

  • Keep track of experiment runs and results across frameworks.
  • Execute projects remotely on a Databricks cluster and quickly reproduce your runs.
  • Quickly productionize models using Databricks production jobs, Docker containers, Azure ML, or Amazon SageMaker.

We will demo the building blocks of MLflow as well as the most recent additions since the 1.0 release.

What you will learn:

  • Understand the three main components of open source MLflow (MLflow Tracking, MLflow Projects, MLflow Models) and how each helps address challenges of the ML lifecycle.
  • How to use MLflow Tracking to record and query experiments: code, data, config, and results.
  • How to use the MLflow Projects packaging format to reproduce runs on any platform.
  • How to use the MLflow Models general format to send models to diverse deployment tools.

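As a toy illustration of the record-and-query pattern that MLflow Tracking automates, the sketch below logs each run as a JSON file and queries for the best one. All names here are invented for the example; the real API uses calls such as `mlflow.log_param` and `mlflow.log_metric` against a tracking server.

```python
import json
import pathlib
import tempfile
import uuid


class TinyTracker:
    """Toy stand-in for an experiment tracker: one JSON file per run."""

    def __init__(self, root):
        self.root = pathlib.Path(root)

    def log_run(self, params, metrics):
        # Persist the run so results survive the process, like a tracking server.
        run_id = uuid.uuid4().hex
        path = self.root / f"{run_id}.json"
        path.write_text(json.dumps({"params": params, "metrics": metrics}))
        return run_id

    def best_run(self, metric, mode="min"):
        # Query all recorded runs and return the one with the best metric value.
        runs = [json.loads(p.read_text()) for p in self.root.glob("*.json")]
        pick = min if mode == "min" else max
        return pick(runs, key=lambda r: r["metrics"][metric])


tracker = TinyTracker(tempfile.mkdtemp())
tracker.log_run({"alpha": 0.5}, {"rmse": 0.72})
tracker.log_run({"alpha": 0.1}, {"rmse": 0.65})
best = tracker.best_run("rmse", mode="min")
print(best["params"])  # {'alpha': 0.1}
```

The point of recording parameters and metrics together per run, as above, is that "which configuration produced the best result?" becomes a query instead of archaeology.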
Prerequisites:

  • A fully charged laptop (8–16 GB memory) with Chrome or Firefox
  • Python 3 and pip pre-installed
  • Pre-register for a Databricks Standard Trial
  • Basic knowledge of the Python programming language
  • Basic understanding of machine learning concepts
