What’s New with Databricks Machine Learning

May 27, 2021 11:00 AM (PT)


In this session, the Databricks product team provides a deeper dive into the machine learning announcements. Join us for a detailed demo that gives you insights into the latest innovations that simplify the ML lifecycle — from preparing data and discovering features to training and managing models in production.

In this session watch:
Clemens Mewald, Director of Product Management, Data Science and Machine Learning, Databricks
Kasey Uhlenhuth, Sr. Product Manager, Databricks

 

Transcript

Clemens Mewald: All right. Welcome to a deep dive on what’s new with Databricks Machine Learning. If you’ve seen the keynote earlier today, you’ve already seen the exciting new announcements that we’ve made. In this talk we’ll have much more time to dive into each one of the components of Databricks Machine Learning. And I’ll also cover the new components that we announced earlier today.
This is the overview of Databricks Machine Learning: a data-native and collaborative solution for the full machine learning lifecycle. So I’m going to be walking through each one of these boxes and talk about what we cover in them, what functionality is available, and some roadmap items of what’s coming in the future. As mentioned, AutoML and the feature store are two brand new components that we’re adding today.
Now, before we go through each one of these boxes, let me cover how you actually get access to Databricks Machine Learning. We’ve added a brand new persona-based navigation to Databricks. You can select between data science and engineering, machine learning, and SQL analytics. This gives you access to a purpose-built surface for each member of the data team. If you care about machine learning, you can set the navigation to the machine learning configuration to have access to the specific components. But if you don’t care about machine learning and you’re a SQL analyst, you can click on SQL analytics, and you have a purpose-built surface only for that purpose. And if you’re in the Databricks Machine Learning view, this new dashboard gives you access to all of the machine learning related assets and resources, like the most recently used feature tables, models, or experiments, and entry points to our new AutoML product as well.
So coming back to this overview, let’s start with data, which is the most important piece of any machine learning platform. Of course, Databricks Machine Learning is built on an open data lake foundation with Delta Lake. Now that has a lot of implications for scalability and what types of data you can access. And really what we’re talking about here is the data prep and data versioning component in Databricks Machine Learning.
So let’s quickly dive into this topic. Delta Lake provides a lot of powerful capabilities for machine learning. The most important one is that it can ingest any type of data, at any scale, from any source. So you may imagine you have images, tabular data, audio, maybe even video coming in from all kinds of different sources, and Delta Lake really enables you to ingest all of this data and process it at high scale.
Of course, that works across clouds and all of the different Delta Lake products that you’re familiar with. And specifically for machine learning, Delta Lake provides optimized performance for reading from and writing to Delta tables. And because Delta provides ACID transaction guarantees, you can be sure that your data for machine learning is always available at consistent quality, so that you can train on high quality data and don’t waste training time on data that doesn’t conform to your expectations.
And as you’ll see later, we have integrated Delta with MLflow. Specifically, we use the Delta data versioning feature that powers time travel to provide you with full lineage and governance in MLflow. So when you train a model and log the information to MLflow, we track exactly which Delta table you used and which version of that Delta table, which comes in handy for reproducibility.
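As a small illustration, reading a pinned version of a Delta table via time travel looks roughly like this (the path and version number are hypothetical, and `spark` is the session that Databricks notebooks predefine):

```python
# Read a specific, pinned version of a Delta table so the exact training data
# can be reproduced later (hypothetical path and version).
training_df = (
    spark.read.format("delta")
    .option("versionAsOf", 12)
    .load("/mnt/lake/churn_features")
)
```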
So jumping out of the data aspect to the data science workspace that makes Databricks Machine Learning collaborative. One of the most important parts of building a product for every member of the data team is providing multi-language support. Databricks notebooks provide support for Scala, SQL, Python, and R, all within the same notebook. So the first cell in your notebook can be a SQL query to load data and the next one can be Python to train a machine learning model.
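For example, one notebook might mix a SQL cell and a Python cell roughly like this (a sketch with hypothetical table and column names, assuming a Databricks notebook where `spark` is predefined):

```python
# Cell 1 (SQL, via the %sql magic command):
# %sql
# CREATE OR REPLACE TEMP VIEW training_data AS
# SELECT age, income, label FROM prod.events WHERE event_date >= '2021-01-01'

# Cell 2 (Python): read the view created above and train a simple model
from sklearn.linear_model import LogisticRegression

pdf = spark.table("training_data").toPandas()
model = LogisticRegression().fit(pdf[["age", "income"]], pdf["label"])
```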
Now this multi-language support really comes into play when you can actually collaborate with multiple people. So Databricks notebooks can be shared within a data team and provide cloud-native collaboration features such as commenting, co-presence, and co-editing. So you can really collaborate with your colleagues in real time. And finally, and most importantly, MLflow experiment tracking is deeply integrated with the Databricks notebook. So there is an experiment sidebar that shows you the latest runs that have been tracked from this notebook, with all of the parameters, metrics, and models. And it’s also integrated with code versioning in the notebook. So each time you create an MLflow run from a Databricks notebook, we snapshot the version of the code, which is used later for reproducibility.
And of course, during experimentation you can iterate quickly on new code, but for production workloads, ideally you want to check it into Git repositories. So we provide a native integration through our new Repos feature with the most popular Git providers, which allows you to check the code into a Git repository, pull changes from it, and integrate with your CI/CD system.
From the data science workspace, let’s go to the core of Databricks Machine Learning, where you train the models. At the center of this platform is really the machine learning runtime, which provides a DevOps-free environment for optimized machine learning. In this screenshot you can see configuring a Databricks cluster to use the GPU runtime, where all of the GPU drivers are pre-installed and pre-configured so that you can utilize GPUs to train your machine learning models.
The machine learning runtime packages up all of the most popular machine learning toolkits and makes sure that they’re well tested and can run at scale. So we include TensorFlow, Keras, PyTorch, scikit-learn, XGBoost, and a much, much longer list of the most popular frameworks. You can just use the machine learning runtime and get going without worrying about setting up the environment. Of course, we support distributed machine learning natively, so we also package up a popular library called Horovod that allows you to distribute TensorFlow and PyTorch models.
And finally, we also integrate with popular hyperparameter tuning libraries, such as Hyperopt, which is natively built into the machine learning runtime. And the machine learning runtime also contains our new AutoML product that we announced today. So let’s jump into the AutoML product and cover it in more detail.
When we look at AutoML, we really see it treated as an opaque box, which often leads to a lot of different problems. The way that we frame this is that there are a lot of different people who do machine learning, at different levels of familiarity with machine learning libraries. ML experts and researchers really want full flexibility. And we like to use a driving analogy for this: really what they want is a manual transmission, three pedals and a gear shifter.
Developers and engineers may not be formally trained in machine learning, but they can still do data science and train machine learning models. They just want the most tedious things taken care of for them. So the analogy in driving is an automatic gearbox, so that they don’t have to worry about shifting and which gear they’re in.
And last but not least, there are citizen data scientists, where there’s a promise of full automation. And the reason why I like this analogy is that fully autonomous cars just don’t quite work yet. They work really well in very constrained environments, but once they get out of their depth, and if there are no controls so that you can take over, all you can do is open a door and leave.
And that’s what we see happen with a lot of the products that are targeted at this fully autonomous level in data science and machine learning: they work really well in demos and POCs, but then you reach their limits and there’s no way out. So we approach AutoML in a very different way.
So our AutoML product is embedded in our experiment management product, and there are different entry points, but the default entry point will be the creation of a new experiment. Then there’s a UI-based workflow that takes you through configuring the AutoML run, augmenting your data with different tables from the feature store, which we’ll discuss later, and training, evaluating, and applying these models, all within the UI.
Now, this addresses the fully autonomous, UI-only level. However, if you want more flexibility, or if you have an expert you’re collaborating with, our AutoML product generates an MLflow experiment and runs within that experiment to track all the information. In addition, it also generates code for each one of the models that it produced. And that code is the native code that you would have written if you had written that model by hand.
So let’s say the best model from the AutoML run is an XGBoost model. The notebook that is generated for this XGBoost model is native XGBoost code that loads the data, trains the model, and adds it to the model registry. And we also add things such as model interpretability to help you understand your model. So now you can actually “break glass” into one layer of abstraction below, look at the code and customize it, or pass it on to an expert who can apply their domain knowledge to improve it, which provides you with this level of control within the AutoML product.
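The generated notebook itself isn’t reproduced here, but the kind of native XGBoost code it contains would look roughly like this sketch (the dataset, parameters, and registered model name are hypothetical stand-ins):

```python
import mlflow
import mlflow.xgboost
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Stand-in for the Delta table AutoML would actually read
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run():
    # Plain XGBoost code, the same code you would have written by hand
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dval = xgb.DMatrix(X_val, label=y_val)
    params = {"objective": "binary:logistic", "max_depth": 6, "eta": 0.1}
    booster = xgb.train(params, dtrain, num_boost_round=200, evals=[(dval, "validation")])

    # Log the trained model and register it so it shows up in the model registry
    mlflow.xgboost.log_model(booster, "model", registered_model_name="automl_best_candidate")
```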
So now that we talked about model training and AutoML, let’s talk about the other new component, the feature store. Feature stores really solve a key problem: reusing features that are being built for machine learning models, and helping with the deployment path. In this graph, which was also used in the keynote, you can see how we go from raw data, apply some transformations, and then train machine learning models.
The first component of the feature store is what we call the Feature Registry. That really takes care of tracking what feature tables you have, what features are in those feature tables and what their schema is, the upstream and downstream lineage, which I’ll discuss in more detail in a minute, what code was run to produce these features, and versioning. And that is also the component that powers our UI, which I’ll show you in screenshots on a later slide.
Once you define your features and they are added to the Feature Registry, there are two different access patterns. The first one is batch, for high throughput. So you can train your models directly on features from the feature store. And this is really important because, once you train models directly from the feature store, we can actually retain that metadata within the model, and I’ll discuss that in more detail on a later slide. As for storing the features offline and training the models, one key aspect of our feature store is that it was co-designed with Delta Lake. So the feature store really inherits all the benefits of Delta Lake, most importantly that the features are stored in an open format that can be accessed through native libraries and APIs. And of course, everything about Delta Lake in terms of versioning also applies to our feature store.
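A rough sketch of that batch training path using the Databricks Feature Store client is shown below; the table, feature, key, and model names are hypothetical, `labels_df` is an assumed Spark DataFrame of labels, and the exact API may differ slightly between releases:

```python
import mlflow.sklearn
from databricks.feature_store import FeatureStoreClient, FeatureLookup
from sklearn.ensemble import RandomForestClassifier

fs = FeatureStoreClient()

# Declare which features to join in from a registered feature table
feature_lookups = [
    FeatureLookup(
        table_name="recommender.user_features",
        feature_names=["num_purchases_7d", "avg_basket_value"],
        lookup_key="user_id",
    )
]

# labels_df (assumed) is a Spark DataFrame with user_id and the label column
training_set = fs.create_training_set(
    df=labels_df,
    feature_lookups=feature_lookups,
    label="purchased",
)
training_df = training_set.load_df().toPandas()

model = RandomForestClassifier().fit(
    training_df.drop(columns=["user_id", "purchased"]), training_df["purchased"]
)

# Logging through the feature store client embeds the feature lookup metadata in the model
fs.log_model(
    model,
    artifact_path="model",
    flavor=mlflow.sklearn,
    training_set=training_set,
    registered_model_name="purchase_propensity",
)
```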
The second access pattern is online, for low latency. The feature store guarantees that the features that are written for training and for online serving are consistent, which removes the risk of online/offline skew. And of course, if you read the features from an online store, it removes the need to implement those features on the client.
So that already significantly simplifies the client, but the client still needs to deal with the complexity of knowing which versions of which features it needs to request from the feature store to pass to a specific version of the machine learning model. So what we have done in our feature store, and the reason why we co-designed it with MLflow, is that we actually store the information about which features a model understands in the model itself. So if you train a model using features from the feature store, the MLflow model format carries forward the information about which features came from the feature store, which versions, and how to look them up. So at model deployment time, the client can be completely oblivious to the fact that there is a feature store in the first place and just send the raw inputs to the model.
And then the model deals with extracting or looking up the features from the feature store. That significantly simplifies the deployment of these machine learning models. And it also means that you can actually update features in the feature store without touching the client, which accelerates model deployment and makes it much, much easier.
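On the batch side, scoring a feature-store-aware model can then look roughly like this (again a sketch with hypothetical names, and `batch_df` is an assumed DataFrame of lookup keys):

```python
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# batch_df only needs the lookup keys (e.g. user_id); the feature values are
# joined in automatically based on the metadata stored with the model
predictions = fs.score_batch(
    model_uri="models:/purchase_propensity/Production",
    df=batch_df,
)
```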
So this is a screenshot of the UI that is powered by the Feature Registry, and I just want to highlight the lineage aspects. In the UI you can see here that we have upstream data sources. What that means is that we automatically track which data sources are used to compute a specific feature. That is not only important for lineage and governance, but also for feature reuse. So now data scientists can come to the feature store UI and find all of the feature tables that are computed based on the raw data that they’re planning to use. And this is a much better way of actually discovering features than just trying to guess the name that someone else gave a feature table.
So this is upstream lineage. We also implemented downstream lineage. On the feature table detail page, you can see for each feature where that feature is being used. We track exactly which models, endpoints, jobs, and notebooks are using a specific feature. Now, as you may be able to guess, that’s really important so that we can actually answer the full end-to-end lineage question: what raw data do you have? What are the features that are being computed on that raw data? What are the models that consume those specific features? And where are those models being deployed?
But it also helps with answering questions such as: is it safe to deprecate this feature table? If you’re planning to deprecate a feature table that still has consumers, you should probably first check with those consumers and see if it’s okay to turn the table down, because doing so will most likely break the consumers that rely on those features.
So now that we’ve gone from model training to how we make sure that the features are available at training time and online, let’s look at the different deployment options. Databricks Machine Learning provides a model registry, based on the MLflow Model Registry, out of the box. And because we use the MLflow model format, we have all of the model deployment options available to us: MLflow is open source, and the open source project provides a lot of flexibility for model deployment. So as you can see, you can easily build a Docker container and deploy it on your own infrastructure. You can easily deploy models as a Spark UDF, as I’ll show in a minute. There are a lot of different integrations with online inference services. And because the MLflow model just wraps up the native framework formats, you can also, of course, pull out those formats and run them on on-device runtimes, such as ONNX and TensorFlow Lite.
So let’s look at two specific deployment use cases in more detail. The first one is batch scoring. This is the line of code, or the two lines of code, that you have to write to apply an MLflow model as a Spark UDF. And here you can actually see that we’re using the model registry: we’re requesting the production version of a registered model, referenced by name, and applying it as a UDF to a Spark DataFrame. That’s how simple it is.
Now this is the line of code that you would write if you wanted to apply an XGBoost model as a Spark UDF. This is the line of code that you would write to deploy a scikit-learn model as a Spark UDF. And this is the line of code that you would write to deploy a TensorFlow model as a Spark UDF. So just for dramatic effect, I’m going to go back. XGBoost, scikit-learn, TensorFlow. And yes, you may have noticed it. They’re all the same.
And this is one of the benefits of the MLflow model format: it has this python_function (pyfunc) flavor, which abstracts away which ML framework was used and exposes the model as a Python function. So in the deployment path, you don’t have to worry about which framework was used initially. And that’s extremely important, because if there are many people using different ML frameworks within your company, you don’t want to have to change your deployment every single time someone comes up with a new model in a new framework.
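Regardless of the underlying framework, the call looks roughly like this (a minimal sketch with a hypothetical registered model name; `spark` and `input_df` are assumed to exist in the notebook):

```python
import mlflow.pyfunc

# Load the Production version of a registered model as a Spark UDF; the same
# call works whether the underlying model is XGBoost, scikit-learn, or TensorFlow
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/my_model/Production")

scored_df = input_df.withColumn("prediction", predict_udf(*input_df.columns))
```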
And these, of course, are just three examples; MLflow supports many other ML frameworks as well. On the serving side of the options, we provide model serving out of the box in Databricks. It’s actually one-click deployment for MLflow models: you just go to the model registry and click Enable Serving, and we will expose all of the active versions of the model behind a REST endpoint. Here you can see a screenshot of the serving product, where you can see all of the active versions ready for serving. And because the serving product is integrated with the model registry, it knows which model versions have reached which deployment stage. So if you call the production endpoint, the request will always be sent to the active version that is marked Production in the model registry.
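Once serving is enabled, calling the endpoint is just an HTTPS request. The sketch below uses a hypothetical workspace URL, token, model name, and input schema, and the exact payload format accepted by the scoring server depends on the MLflow version:

```python
import json
import requests

# Hypothetical workspace host, token, and registered model name
url = "https://<databricks-instance>/model/my_model/Production/invocations"
headers = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

# Pandas "split" orientation payload, one common format for the MLflow scoring server
payload = {"columns": ["age", "income"], "data": [[42, 50000]]}

response = requests.post(url, headers=headers, data=json.dumps(payload))
print(response.json())
```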
So now that we’ve discussed all the different components to train your models, track your features, and deploy them, let’s look at the foundation that really forms the MLOps and governance layer in Databricks Machine Learning. What we’ve found is that MLOps is really the combination of DataOps, DevOps, and ModelOps. And on Databricks, we provide DataOps through Delta Lake and data versioning using the time travel feature, DevOps through the integration with Repos and Git providers, and ModelOps with model lifecycle management through MLflow.
So we’re going to look at each one of the components of MLflow, from the model format to tracking to the model registry and the deployment options, and at how we actually enable end-to-end MLOps and governance on Databricks by integrating all of these different components.
It’s anchored on a capability that we’re building that we call autologging. What you often see when you track your parameters and metrics with your machine learning framework is code like this, where you log all of the parameters, metrics, and the model to MLflow yourself. Of course, we can simplify this, because this code starts looking awfully similar for all of the different models that are being trained.
So we’ve introduced autologging, which provides an autolog API for all of the popular ML frameworks that takes care of all of this work for you. With one line of code, you get all of the metrics, all of the parameters, and the model artifact logged to MLflow. And because we provide many different autologging APIs, and I’ll show you later the different things that we can automatically log, we simplified it even further, so that you can just call mlflow.autolog() and we will log all of the information that we can for you.
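As a small illustration, enabling autologging for a scikit-learn model can look like this (the dataset and hyperparameters are just placeholders):

```python
import mlflow
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

mlflow.autolog()  # one line: enable autologging for all supported frameworks

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    # Parameters, training metrics, and the fitted model are logged automatically
    RandomForestRegressor(n_estimators=100, max_depth=6).fit(X_train, y_train)
```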
So taking this MLOps and governance layer and unpacking it into its components, we’ll walk through DataOps, experiment tracking, DevOps, and ModelOps to show you how we actually automate the full lifecycle. Starting with DataOps, we also automatically track what data you access in your MLflow run. This uses the Spark data source API, so it actually works for every Spark data source. If you read a CSV file using the Spark data source, we will keep track of the name of the CSV file that you accessed. But if you read a Delta table, we can not only tell you which Delta table you read, we can also tell you which version of the Delta table you used, because Delta has the time travel feature. And this actually provides you with end-to-end governance in terms of taking a model and finding out what data you used when you trained that model.
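A rough sketch of what this looks like with the MLflow Spark autologging API (the Delta path is hypothetical, and outside of Databricks this requires the MLflow Spark listener to be attached to the cluster):

```python
import mlflow
import mlflow.spark

# Record which Spark data sources (path, format, and, for Delta, the table
# version) are read during the run, as tags on the active MLflow run
mlflow.spark.autolog()

with mlflow.start_run():
    df = spark.read.format("delta").load("/mnt/lake/events")  # hypothetical path
    # ... train on df; the Delta path and version are captured automatically
```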
We also capture which features the model actually takes in. Of course, this is important for the feature store to know exactly what features are being fed in, but it’s also important for many other use cases, such as being able to tell whether two model versions are compatible based on their schemas. And this works transparently across the different ML frameworks. So it doesn’t matter if you use TensorFlow, PyTorch, scikit-learn, or SparkML, the schema will always be available in MLflow.
This is the autologging that I just showed you, which is specific to the machine learning frameworks. For all of these frameworks, we keep track of the parameters, the metrics, and the model artifact itself without you having to specify any of that. You just have to call mlflow.autolog().
As mentioned earlier, in the machine learning runtime we also package up hyperparameter tuning libraries. So if you run hyperparameter tuning on Databricks, we automatically keep track of all the different trials and the parameters that you used. And in MLflow, we actually visualize this using a parallel coordinates plot. Again, all of this is automated and works out of the box without you having to specify anything.
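A minimal Hyperopt sketch on Databricks, where SparkTrials fans the trials out across the cluster and each trial is tracked as a child MLflow run; the objective function here is a stand-in rather than a real training job:

```python
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK

search_space = {
    "max_depth": hp.quniform("max_depth", 2, 10, 1),
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
}

def objective(params):
    # Stand-in: train a model with `params` and return the validation loss
    loss = (params["max_depth"] - 6) ** 2 + params["learning_rate"]
    return {"loss": loss, "status": STATUS_OK}

best = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,
    max_evals=32,
    trials=SparkTrials(parallelism=8),  # distribute trials across the cluster
)
```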
And then we’re also in the process of adding automated model interpretability using the SHAP library under the hood. So depending on the model type, we will give you feature importances. On the right side, you can see feature importances for, I believe, a tree-based model. And on the left side, you can see that if you have an image model, we can give you plots that tell you which areas of an image contribute to the classification. In this case, it’s an aircraft classification model.
And then on the DevOps front, of course, we keep track of the exact code that you used when you trained a model, the hardware configuration of the cluster, and the environment configuration. You may be able to tell where this is going, and you’ll see it in a couple of slides. And on the ModelOps front, we have the model registry, which is really becoming the GitHub for machine learning models and keeps track of all of the models and their versions. And we provide stage-based access controls in the model registry to facilitate the handoff in the deployment process.
So on this slide, you can see the stage requests and stage transitions, which again depend on your access controls. If you don’t have access to move a model to production, all you can do is request the transition to production. Then someone else who has the appropriate access controls gets notified and can approve or reject your request. And we keep a full audit log of those activities, so we can basically tell when changes to your model happened, when it was moved to production, and when it was archived again.
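If you do have the required permissions, the same transition can also be done programmatically through the MLflow client; a minimal sketch with a hypothetical model name and version:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Promote version 3 to Production and archive whatever was in Production before
client.transition_model_version_stage(
    name="purchase_propensity",
    version=3,
    stage="Production",
    archive_existing_versions=True,
)
```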
Then, as you saw earlier, deploying a model as a REST endpoint from there is a one-click process. The model serving part, of course, is integrated with the model registry, understands the stage tags, and exposes the model as a REST endpoint. And it also keeps a full audit log of what happened to the model and when it was deployed.
And last but not least, here’s an example of a streaming model quality monitoring application that actually computes a streaming RMSE to tell you the model quality as new requests and predictions come in, and that also alerts based on that quality. There’s much more investment that we’re going to be making in model quality monitoring in the future, but this is just one example.
So finally, as you may be able to guess, the result of all of this is full end-to-end governance and reproducibility. Because once you have the code that you used, the data that you used, the cluster configuration, and the environment specification, we can give you a single button that reproduces a run and gives you the exact environment that you used when you originally trained the model.
So in summary, this is Databricks Machine Learning, a data-native and collaborative solution for the full machine learning lifecycle. It contains all of the different capabilities for you to go from data to model training, to evaluation and deployment of these models in all different environments, with full MLOps and governance that stretches from DataOps and the versioning of data, to DevOps and the versioning of code, to ModelOps and the versioning and deployment of models, all on an open data lakehouse foundation that is built on Delta Lake.
Of course, as always, there’s much more content. So please feel free to go to databricks.com/ml to learn more.

Clemens Mewald

Clemens Mewald leads the product team for Machine Learning and Data Science at Databricks. Previously, he spent four years on the Google Brain team building AI infrastructure for Alphabet, where his p...

Kasey Uhlenhuth

Kasey Uhlenhuth is a product manager on the machine learning team at Databricks. Before Databricks, she worked on the Visual Studio and C# team at Microsoft building developer productivity tools. Kase...