Solving a data science problem is about more than making a model. It entails data cleaning, exploration, modeling and tuning, production deployment, and workflows governing each of these steps. In this simple example, we’ll take a look at how health data can be used to predict life expectancy. It will start with data engineering in Apache Spark, data exploration, model tuning and autologging with hyperopt and MLflow. It will continue with examples of how the model registry governs model promotion, and simple deployment to production with MLflow as a job or REST endpoint. This tutorial will cover the latest innovations from MLflow 1.12.
Speaker: Sean Owen
– Hello, this is Sean Owen. I’m a principal solutions architect at Databricks, and I focus on machine learning and data science here at Databricks. And I’d like to tell you today about what it looks like to do machine learning and data science on Databricks from end-to-end, from the data, all the way through the modeling, to production. And along the way, we’ll see how this works in Databricks with some of the tools you may have heard of, like Spark and Delta, but also of course, particularly MLflow. The data science life cycle has a couple steps at least. So there’s probably three main parts here. There is the data engineering where we take the raw data and we normalize it, we filter it, we extract what is a good representation for the analyst to use. And then the data science takes over, we analyze that data, we explore it, we build models, we select models. And once we’ve selected the best model, we need to get it out to production in some sense. And the nice thing about Databricks and these tools we’re gonna talk about is they support all steps of this workflow. So I wanna show you today how this all fits together. Within Databricks, of course, we’re gonna see today the workspace, which is the piece you interact with as a user, but there’s a couple of interesting pieces under the hood, Spark and Delta, which I alluded to, but also primarily MLflow. I wanna show you how MLflow supports users working in the workspace to implement a data science pipeline, from data all the way to production. So I wanna go ahead and jump into a demo here. And for those that have not seen this particular demo, I hope you enjoy it, for those that have, there are a couple of new elements here from maybe the last time you’ve seen it in MLflow, and I’ll highlight those along the way. And with that, let me jump to the demo itself. Okay, let’s take a look at this demo here. And this is my way of showing you what I think a simple data science problem would look like, solved end-to-end on Databricks. So this is Databricks, for those that haven’t seen it, it’s a hosted by web based service where you interact with a notebook-like environment here. So we can write documentation in Markdown, you can write code, which you’ll see in a second. And the one thing you don’t really have to care too much about is a compute resources. So here I’m connected to an existing cluster. If it wasn’t didn’t exist, I could create a pretty simple cluster. These clusters scale up and they scale down without you having to think too much about it. So Spark is available, a lot of common libraries are available here, and I can just really focus on my work. So in this example, I’m gonna work on a problem that you’re probably not working on, but I hope you can extrapolate and insert your problem here, but to make it make sense, let me explain what we’re gonna look at. We’re gonna try and predict life expectancy as a function of health indicators from developed nations in the past 10 or 20 years or so. And we’re gonna grab some data from the World Health Organization, the World Bank, and Our World in Data, and use it to build a model that predicts life expectancy and see what that tells us about life expectancy. And then we’re gonna try and move that to production, for example. So let’s get started. We began, as I said in the world of the data engineer, that’s the first thing we have to do, we have to engineer the the data. And the data here comes in the form of CSV files, they’re not that big. None of this would be any different if the data source were Parquet files or a database or whatever, Spark can read all of it. And I just wanna call out that the data engineers, for example, could work in a language like Scala, they can use SQL, they can use Spark directly. And that does not mean that data scientists have to. We’ll see later that we’re gonna switch into Python even within the same notebook. And that’s fine, you can mix and match that within Databricks on top of Spark. No problem. So the details of the ETL are, frankly, not that interesting. I’ve never met a CSV file that was perfectly formatted. And in this case, we have to do a little bit of work here to massage the data, to get in the right form here. We are trying to head for a representation like this. So after a little bit of manipulation, a little bit of ETL, we’re getting something a little more like what we’re gonna need for the modeling problem here, we’re getting a table that’s country, year, and all these features and values. These are health indicator values, these are demographic values over time. And you can see even here that I’ve switched into SQL because I can easily query data frames that I’m creating with Spark, with Scala, in SQL, if I like. So I can, for example, also switch in that view, that table view into a built-in visualization view. So same query, but here I’m looking at life expectancy as a function of time and country. And this is using the built-in visualization in Databricks based on Plotly. So you can see, for example, already that there an outlier here. The life expectancy for one country here is low and declining, whereas others are higher and increasing, and this country is the United States. So you might wonder what’s different about the US that would explain that difference in life expectancy. So the data engineers may continue here, they’re gonna grab a different dataset And you may have noticed along the way, I’m actually saving off some of these health indicator codes and their descriptions as a table, because I’m gonna need that later to understand what these column names mean. Let me come back to this in a second, because it’s important that I’m writing them as a Delta table. More ETL of a second data source, we get it into a similar format, country, year, a bunch of values here, and we’re gonna join these a little bit later. Last, not least, we’re going to grab a dataset of drug overdose deaths by country and year, because that may be relevant. And given these three data sets, the data engineering workflow ends right about here. We’re gonna combine them all using SQL here into one dataset and write that out as a Delta table. And I do wanna note that over here we’ve actually mixed and matched the Spark APIs, SQL, and language native ETFs. And that’s easy to do in Spark, that’s easy to do in Databricks. So it’s just that whatever transformation you need to express, you can express it in the most natural language that makes sense, and just get on with creating the ETL job. But this, for this example, I’m gonna skip a little bit past the data engineering and assume that this is where the workflow ends. We joined all these tables, we have all these features, and we’re gonna present them to the data scientist for analysis, for modeling and so on. And we’re gonna do this by creating a Delta table, or writing this data as a Delta table, and we’re actually registering it as well in the metastore. So if we do this, we will see that in the data tab, here in Databricks, these several tables I’m running show up, and that helps with discoverability, for example. And if we clicked into, for example, the descriptions table, this is this lookup table I was writing. You’ll see, for example, the schema, sure, sample data, but also a history. So Delta tables, unlike most data sets can be updated, they can be modified, and as they are, there’s a transaction history that’s created. So for example later, I could go back and query the results, the state of this table as of a previous point in time, as of a previous version. And maybe that doesn’t help too much for this particular lookup table, but it might make a big difference for this input table, this so-called Silver table that we are creating here for the data scientists. For example, with Delta the data scientists can come back later and query the data as of two weeks ago, to see what the data looked like when they built their model. Delta also ensures that this data set is transactional, so if this ETL job is running one night while the data science inference job is running, no problem, that job was gonna not read the data in an inconsistent state, it’s gonna read a consistent version of the data. There’s a couple of other interesting things here, like for example, if the data engineering pipeline had a bug and we wrote some bad data here into this input table, well, we could easily roll it back, and we could rerun the ETL pipeline to update it as well. So although Delta is just a storage engine, it still is valuable to data scientists because it offers some features that make this table, this interface between data engineering and data science, more reliable and more robust. So let’s move on to the data science here. So the data scientists might pick up here, maybe in a separate notebook, although we’re here in the same notebook and they could easily use Python, for example. So we’ve switched to Python here, and that’s no problem. This data set is equally readable as a table in PySpark. So we read this data, and maybe look at some summary statistics here. We see that, for example, some of these features are entirely missing, you know, they’re present in no rows. Some of the features are present only in a few of the years and countries as well. So we might figure we need to do some imputations and inference of the missing values as part of our pipeline here. So let’s go ahead and do that as data scientists Now we could do that in PySpark, no problem, but I’m doing it in pandas here, just to make a point that for data scientists who are used to pandas, and are used to the whole Python ecosystem, no problem. You can use pandas here pretty easily with Spark. And here for example, I’m gonna take a Spark data frame, bring it down to one machine as a pandas data frame, and then use pandas to forward and backfill the missing values by country and by year. I could have done it in Spark, no problem but I’m just making the point that I can easily do that and then send it back to Spark. And this would not be a good idea if the dataset was large, but if it’s not, this is totally legitimate. And I’ll show you a couple of other ways that pandas interacts and inter-operates with Spark later. So as data scientists, maybe the next step is a little bit of exploratory data analysis, maybe we wanna take this, a couple of key features here and look at how they correlate. We might wanna make some plots here, and here I’m gonna use Seaborn to create these various pair plots between a couple of key features. So Seaborn is a common library for visualization of Python, and it’s already built into the machine learning runtime in Databricks. You don’t have to install it, if it wasn’t installed, it would be easy to install, it’s no problem, but things like Seaborn that you might commonly use are just already there, they just work. So I create this pair plot and we see things like, well, maybe GDP here is correlated with spending on healthcare, that makes sense. And we also see that deaths from opioid overdoses is, well, there’s a clear outlying trend here, and this series here in brown is the United States. So we might file that away as well, and maybe that’s an explanation of why the US life expectancy is lower, we need to investigate that. Moving along, maybe there’s a little more featurization that has to happen between the Silver table and the final Gold table of featurized data that we’re gonna build a model on, that we’re going to run the model on in production as well to do inference. And here there’s just a little bit of hand coding that has to happen, so it’s not very complicated, but you can imagine this could be more complicated in general. And so this might be an additional data pipeline you need to run repeatedly to get from Silver data to this Gold table of featurized data, that’s not just what you built your model on, but what you can apply the model to in production. And we’re gonna register that as a Delta table as well. So now we’re almost ready for a bit of modeling. So before we get to the modeling, I wanna introduce one other library here, called Koalas. And this is for the pandas users out there. So pandas is a common data manipulation tool in the Python ecosystem. And for those that know pandas, but want to use the power of Spark to distribute computation across a cluster, that’s where Koalas comes in. So you can import Koalas, and load data from Spark as a Koalas data frame, and then you get an object that is manipulatable as if it’s a pandas data frame, you can apply pandas syntax to it. But all these operations are really happening on this distributed dataset. So here, I’m just doing something pretty simple. I am splitting the data by year into a train and test set, so we’re gonna train up to 2014, and we’re gonna test after 2014, but you could do more elaborate things with Koalas. But here I’m just using it to show that you can use pandas-like syntax to manipulate data, even on top of Spark. But this data set is actually pretty small, So we’re gonna actually realize it as a pandas dataset and move on to modeling. Maybe wonder well, hm, is that interesting here? I thought Databricks was about scale, it was about Spark, and distributed cluster in computing. And this data set is small enough that building a model on it does not really require a cluster to build the model. But that doesn’t mean Spark’s irrelevant here. So for example, given this pandas data frame, we could write a little bit of code here to train a model, a regressor with XGBoost, to predict life expectancy as a function of these thousand 1600 parameters or so, and that’s great. That does not need Spark though, because this dataset, small XGBoost can handle this perfectly well on its own. But we do not, we don’t build one model. When we build models, we really build hundreds of them, because we’re not sure what the best settings of all these hyper-parameters are. Maximum depth, learning rates, regularization, and so on. And normally we would use sci-kit, we’d use spark.ml, to do a grid search, do a random search over all these parameter combinations, but really no matter what, we’re building hundreds of models. And that can happen in parallel, and because that can happen in parallel, can happen on a Spark cluster. So that’s why, for example in Databricks, we ship and support a tool called Hyperopt, it’s an open source tool for parameter tuning. So Hyperopt can help us figure out what the best values are of these hyper-parameters, we give it ranges and let it explore. And there’s three things that make Hyperopt appealing in this case. Number one, it’s Bayesian optimizer, so it conducts an intelligent search over the space, it’s not gonna bother looking at parameter combinations that haven’t worked out well, it’s gonna focus its attention on the combinations that seem to be working well and do a deeper dive. Second thing it can do is integrate with Spark. So for example, if we use the Spark integration here then when we run this, it’s going to actually use the Spark cluster to build these models in parallel, not necessarily on the driver, and we get some speed up there, even though the models themselves, these XGBoost models do not know about Spark at all. So there’s still some value for Spark, even if your modeling process does not use it directly. So as we run a tool like Hyperopt and try to let it optimize, find the best model, now we’ll see that its best loss goes down down, down. It’s trying different hyper-parameters, the best loss, lower is better, is going down, down, down. And the other interesting thing this does over time is to record what it does with Hyperopt, sorry, with MLflow, excuse me. So as MLflow, Hyperopt runs, it’s actually logging everything it does to MLflow. MLflow is this open source model tracking and deployment framework from Databricks. And the good news about MLflow is it integrates with a lot of tools and it can do a lot of stuff for you automatically. So here we really didn’t have to do anything with MLflow, but yet we got all these results here in the experiment for the notebook. I think this is more interesting if we pop this out, let me switch over here. And so we can see here, for example, all the 96 or so runs that MLflow created here. We can click into any of them and we can see, for example, in this instance, for some hyper parameters, it achieved a certain loss value, and so on. This view is not that interesting, I think for parameter tuning, it’s more interesting to, for example, select all these and compare all the 96 runs that Hyperopt created. And we can do this with MLflow, so what you’re seeing right now is the MLflow tracking server, it’s integrated into Databricks, you can run it on your own though, this is actually open source. And we might do a parallel coordinate plot and see, for example, that some of the best runs here with the lowest loss, we can see what their hyper parameters look like, and maybe, you know, figure out that well, we probably should have looked at lower learning rates, higher min child weights and so on. But back to the story here, we might take the results of this process, the best hyper-parameters and go ahead and run one more model, and log it with MLflow, on all the data with the best hyper-parameters. And we’re gonna do it manually here, but it’s really not that hard. So even if you do it manually, you import MLflow, you start a run, that’s the end of logging, you build your model, you log whatever parameters you want, and you log the model, which is really important here, we’re logging this XGBoost booster, and maybe here we also log a feature explanation plot by a tool called CHAP. I’m gonna show you the plot later. I will tell you that by the time you see this, I believe MLflow version 1.12 will support doing this automatically. So even though I had to write a bit of code to do this, you will get the plot I’m showing you now for free. So if we do this, we get this run here. And we can see we logged, you know, what were the best parameters here, and also the artifact itself, some of the environment requirements, so this requires xgboost–1.1.1 in MLflow. We even logged an example of the input to the model, which I’ll show you in a minute. And we logged this plot here, this feature explanation plot, which I will actually probably show you in a second here in the notebook. So we could declare victory and send this artifact here, this pickled file, to some dev ops person to deploy and wish them good luck, but we probably want a little more process around that. So getting to production means managing the promotion process at least. So that’s where the model registry comes in, and if you see the models tab in Databricks, that’s the model registry. And so this particular model is probably just the latest instance of many different models that we’ve built that implements this service we’re trying to build, this life expectancy predictor. And so we’ve registered this model card, called gartner_2020, this is for Gartner. And we can see, for example, in our notebook, all the various versions of this model that have existed over time. And so let’s just focus on the active one. So as we start this example, maybe version 30 is the current production model, and we’ve just created version 31, and we’ve registered that as the current staging candidate of this model. Maybe we think we’ve done better here, and we wanna propose this as maybe the next production model. So that’s what we do here with MLflow. Now, a couple of things could happen here. Maybe an automated process takes over, a web hook can trigger that to run a test on the staging candidate and if the model looks good, we flip the bit and promote it to production. But maybe, certainly for narrative purposes here, maybe we wanna do a manual sign-off process. So a data science manager could come in, load from the registry, the latest staging candidate, excuse me, and unpack those artifacts and take a look at the plots. Just to answer the question, according to the SHAP, this model says that the factor that most influences predicted life expectancy is mortality rate from cancer and diabetes. And for countries and years where that mortality is high, life expectancy is lower by about almost one and a half years, and where it’s low, it’s higher by almost a year, which you know, makes sense, of course. But what we don’t see in here is drug overdoses, so apparently that’s not really an explanatory factor, it’s really just the deaths from cancer and diabetes, darkly. So after some more analysis, maybe the data science manager decides that’s fine, we can promote this to production. You can do that with the API, you can do it with the UI as well. So I could’ve gone in here and, you know, promoted this model to production. And of course this is all permissionable, so I can control who is allowed to register new versions, who’s allowed to transition them, who’s allowed to transition to production and so on. So let’s get to production then. So normally, production’s kind of a hard part. So we have a model that someone created in the lab and they send it over to dev ops engineers to implement in production, and that production environment may be totally different. So one thing MLflow does really well is to try to translate that model that you built in the lab to an artifact that’s usable in production. So we built an XGBoost booster, but what we probably need, maybe for a batch scoring of the model is a Spark UDF, and MLflow can do that for you. So for example, this could be my production job, this cell here. I load from the MLflow registry, the latest production version of this model, not as an XGBoost booster, but as a Spark UDF, this becomes a function that I can apply to data with Spark at scale, and then I read the latest featurized data that’s landed in my Gold table, maybe now some data has arrived for 2017 and 18, and I apply it with Spark and that’s it, it’s not more complicated than that. And I can, for example, join this with the actuals from previous years and recreate this plot I showed you earlier to show the extrapolation here. So these are the predicted life expectancies according to the model for 2017 and 2018. And this would not have been different if this were a streaming job, no problem, if data’s arriving in a stream, you could also do the same thing. This could also be a SQL UDF as well. So you can load these models as Spark ETFs that are usable in Spark SQL as if they’re SQL functions as well. Another interesting thing you can do with models is deploy them as real-time services. So you can have, MLflow wrap-up your service in a Docker container, it exposes it as a Rest API that responds to JSON formatted requests, you send it inputs, you get an output back. And that can be deployed to your choice of cluster manager, Kubernetes, Azure ML, SageMaker, but it’s also deployable in Databricks. So I’ve done that here to my model, let me open the serving tab here. And so for testing for low volume usage, so you can actually just deploy this within Databricks as a Rest end point here, and here’s the URLs, and I can grab, let me just quickly grab a bit of inputs, JSON formatted input that would work in this model, and I believe, I hope this’ll work. Yes. So given that a first line of input in the current JSON format, I can send it to the model and it gives me back a particular life expectancy of about almost 80 years here. So if you wanna deploy your models as a service, no problem, MLflow can do that for you. Last, not least, maybe production in MLflow means not a batch scoring job, not a rest API, but some kind of dashboard for business users. So maybe we wanna give to business users some tool where they can explore what the model’s saying and maybe evaluate some what if scenarios. So let’s take a look at what life expectancy would’ve looked like in the US over the years had that mortality rate from cancer and diabetes been different. So if it were low or high, how would that have affected the life expectancy according to the model? So we can write some code here to load the model to run this series of predictions to create this heat map here, but in Databricks, we can also export this as a dashboard, like this. And maybe this is easier to share with business users, it’s simpler than the whole notebook. But we can also instrument this with some widgets, as here, and we can make the plot update as we update the widget. So maybe you wanna drill a little more, let me change this to 17, and as I change it, it reruns the cells, it reruns all the scenarios and re plots this, and I see a slightly different plot here. So that really concludes my demo here from end-to-end. We took a couple of datasets, we did some feature engineering with Spark in Databricks. We managed those features with Delta, we did data science and used some third party plotting libraries and so on to explore the data. We used tools like Hyperopt to parallelize our model selection process with XGBoost on top of Spark. And once we got that best model, we logged it with MLflow and we use demo flow and the registry to manage its transition from say, test or staging to production. And once we got to production, we showed how you can grab that model and simply deploy it as a batch job, as a streaming job, as a rest API, or even maybe as a dashboard as well. So I hope this has been helpful, I hope that you can see how this might apply, this pattern might apply it to whatever models you are going to build, and I hope you try MLflow and get started today. Thank you.
Sean is a principal solutions architect focusing on machine learning and data science at Databricks. He is an Apache Spark committer and PMC member, and co-author Advanced Analytics with Spark. Previously, he was director of Data Science at Cloudera and an engineer at Google.