Machine learning models are only as good as the quality of data and the size of datasets used to train the models. Data has shown that data scientists spend around 80% of their time on preparing and managing data for analysis and 57% of the data scientists regard cleaning and organizing data as the least enjoyable part of their work. This further validates the idea of MLOps and the need for collaboration between data scientists and data engineers.
During the crucial phase of data acquisition and preparation, data scientists identify what types of (trusted) datasets are needed to train models and work closely with data engineers to acquire data from viable data sources.
Another important aspect of the ML lifecycle is experimentation–where data scientists take sufficient subsets of (trusted) datasets and create several models in a rapid, iterative manner. And without proper industry standards, data scientists have to rely on manual tracking of models, inputs, hyperparameters, outputs and any other such artifacts throughout the model experimentation and development process.
In this talk, you learn how to automate these crucial tasks using StreamSets and MLflow on Databricks.
Speaker: Dash Desai
– Hi, welcome to Model Experiments Tracking and Registration using MLflow and Databricks. My name is Dash Desai and I’m director of Platform and Technical Evangelism at StreamSets. Today we’re gonna talk about model experiments and automating some of the tasks that are involved. For example, data acquisition preparation and being able to track the model experiments. And we’re also gonnalook at hands-on demo that I’ve prepared for you guys today. So let’s get started. Machine learning models are only as good as the quality of data and the size of datasets used to train those models. And data has shown that data scientists spent around 80% of their time prepping and managing data for analysis and about 57% of the data scientists regard cleaning and organizing data as the least enjoyable part of their work. This further validates the idea of MLOps and the need for collaboration between data scientists and data engineers. And doing this crucial phase of data acquisition and preparation data scientists usually identify what types of datasets are needed to train models and work closely with engineers to acquire the data from viable data sources. Now let’s talk a little bit about model experiments tracking and registration. Experimentation is a big precursor to building models where data scientists take subsets of data sets and create several models in rapid iterative manner. But without proper industry standards, they have to rely on manual tracking of models, inputs, hyper parameters, outputs, and any other such artifacts throughout the model experimentation and development process. Now this can result in a very long model deployment and release cycles, which effectively prevents companies from adapting to dynamic changes, gaining competitive advantage and in some cases also staying in compliance with changing governance and regulations. Now let’s see how StreamSets can help. Some of the common data sources we’re acquiring data sets for data science projects include Amazon S3, Azure Blob Storage, Google Cloud Storage Kafka Hadoop on perimeter cloud data warehouses. The StreamSets provides easy to use graphical user interface for building smart data pipelines for both streaming and batch data flows. And you can build these smart data pipelines for fast data ingestion of large amounts of data from distributed systems, as well as all the common data sources that we just reviewed previously. Another aspect of data ingestion process is the storage. In some cases companies may already have a data Lake or a data warehouse, and in some cases they may need to build one. It seems that smart data pipelines are capable of connecting to existing data lakes and their warehouses as well as have built in capabilities of creating new ones. As part of building these smart data pipelines data engineers can also perform some of the key transformations needed by data scientists. Some of the common transformations required during data preparation include data type conversion, renaming features, merging data sets, repositioning. Data CEDAR for a Mac conversion from JSON to Parquet for example for efficient downstream analysis in Apache Spark and so on. All of these transformations and many more are rarely supported by StreamSets their ops platform. So as we just discussed, StreamSets is a modern data integration platform for building smart data pipelines. They’re really easy to build using drag and drop. There are cloud and platform agnostic. You have hundreds of connectors to choose from. The pipelines are really easy to scale and port without having to rewrite any of the parts of the pipelines. They’re extensible and you can also automate them real easily. And they’ll flow as an open-source platform for building end to end machine learning life cycles. And data breaks as you all know is a unified data analytics platform that provides a fully managed Apache Spark and MLflow as a service. Let’s see how everything comes together. Now before we dive into the demo, I just wanted to set the stage. This is a transformer pipeline. Transformer is a modern transformation and spark ETL engine that’s part of the StreamSet’s data ops platform. And with using MLflow and Databricks, this creates a powerful and seamless solution because transformer can run on Databricks clusters and Databricks comes bundled with Mlflow server. So the pipeline shown below is designed to load training data from S3, perform transformations, like remove row ID, rename target column from mdev to label, which is kind of required by spark ML, train gradient boosted ression model in Pi Spark. We partitioned the data so it’s all in a single file and archived the training data in S3. I would like to go back to this for a second. This is where all the magic happens. Not only we are creating a model experiment, but we’re also integrating with ML flow to track and version model training code data, hyper parameters, as well as registered and manage models in a central repository in MLflow, right from transformer. This is critical for retraining models as well as to reproduce experiments. And we’ll look at the code in more detail during the demo. Okay, let’s get our hands dirty. So today we will look at some data ingestion and preparation pipelines. Then we will run a couple of pipelines that will create model experiments in ML low and Databricks. And finally, I’ll show you how to create these smart data pipelines. So before I switch screens, I just wanna mention that we’re gonna be moving from tabs to tabs between different technologies so please bear with me. All right, let’s get started. Let’s start with this one. This pipeline is designed to run on EMR. It ingests raw logs, transference them and stores them in different formats, caring various use cases and personas. And if you notice down below, there’s a preview of data being displayed against live cluster. A very unique feature in the platform. As you can see it’s displaying all the columns currently being ingested along with their data types. And you can step through different transformations to see what exactly is happening. So in this case some of the columns are being renamed. For example, the number of purchases is being renamed to this. And similarly number of views to this column. You can also write your custom code to make some of the adjustments that you need to make. In this case, we’re removing coats around item column using Scala. And this is what the code looks like. And finally the data is stored in Parquet, back in S3 and also in Redshift for use cases where you’d like to run some SQL queries. Next up is sales insights. This pipeline is designed to run on Spark for Azure HD Insight. It takes raw sales data in CSV format transforms it, does some aggregations and stores the prep data back in Azure Data Lake Storage. Again, down below you can see the preview of data against the live cluster, and we could step through each stage to understand what exactly is happening. In this case we’re removing employee ID in order number because they may not be required downstream. We’re also calculating revenue based on units sold and unit price using SQL expression. And notice that there’s two different paths. The clean version of the data is stored in parquet by region in ADLS gen 2 as well as aggregations are being made and stored in JSON format. Again, these things depend on use cases, but this is just to show you how you can design these pipelines that can cater to different use cases as well as personas. Next one is data preparation on Google Dataproc. This pipeline is designed to run a Dataproc and it takes historical data and prepares it for a training of model. Let’s step through the data transformations. Using this processor we’ll filtering out some of the data that we don’t really need during training the model. And this processor is removing future information that’s present in historical data, incident count. Using Spark SQL expressions, average response time is being converted from string to double. And this is because that’s the target that the model is gonna train on to predict. And finally the data is stored in BitQuery. In this particular case, the data is being stored in BitQuery so you can actually take the data set and build a model using RML in Google Cloud. Next up is Twitter to Kafka. This pipeline is designed to take Twitter data in real time, transform it and send it to a Kafka cluster. Now if you look at the preview data down below Twitter API returns the responses in a list. Let’s check through all the transformations to see how we can prep this data for downstream analysis or model experiments. So the first thing that happens is every single tweet record in the list is converted into its own record using field pivitor okay. And then conditionally it’s routed to different paths based on if the tweet was already deleted. This processor is parsing the XML within source field so that we have all the attributes laid out. And then this processor is designed to reduce the payload and we flattened the nested structures, renamed some of the fields and then finally this is what gets stored in Kafka. Okay. And one final one that I wanted to share with you guys is taking real-time data from Twitter and storing that in S3. Again very similar transformations, but maybe applicable to a different use case where you would like to have as much data as possible for downstream analysis and building models. So that was just to show you how you can create these different pipelines that are cloud agnostic, platform agnostic to ingest data ,prep data, which is a very important step in a machine learning life cycle. Now let’s get to the heart of this presentation. Here are the two pipelines we’re gonna be working with today. Let’s look at the first one. Now this is the same pipeline you saw earlier in my slides. Basically we’re loading some data from S3. We’re prepping it to create a model, then archive that data back in S3. Let’s go ahead and preview and see what the data looks like. As you can see down below the preview is actually happening against live Databricks cluster that the pipeline is designed to run on. It will just take a couple more seconds. There you go. So you might recognize this data set based on the column names. This is actually a very popular and famous Boston housing dataset. Using this processor, we’re removing C zero column, which is almost like a real ID and not really required to train the model. We’re also renaming the target column to label, which is the requirement of Spark ML. And then this is where all our code is to train the model. To summarize this code in Pi Spark, it inputs libraries including MLflow. It converts features into vectors, splits the data into training and testing to evaluate the model. It performs cross validation and produces the best possible model based on the data and the hyper parameters provided, then it tracks features and hyper-parameters in Mlflow. Registers the model and finally, based on certain conditions, it transitions the experiment from staging to production. Let’s get back to our pipeline. So now what we’re gonna is actually run this pipeline as a job. Jobs allow you to run multiple instances of the same pipeline with different parameters. Here we go. Okay. All right, so we’re in action. This pipeline is actually going to get some data from S3 bucket, which is this one right here. It’s got one file with all the data. This is what the data looks like. And the pipeline will take a couple of minutes to run. In the meantime, let’s look at MLflow in Databricks. This is where all the models are registered through the pipeline. And these are all the sample runs from before. Once the pipeline runs, we should see another version of model right in here. Now, within MLflow, you can compare different model versions by clicking on these check boxes and hitting compare. This will show you the parameters that were used to train the model. Okay? And these are all the attributes that have been logged through the transformer pipeline as we saw in our Pi Spark code. Okay, now let’s go ahead and check our pipeline run. Here we go. Version 11. And this is where you can see all the details about the run itself. Okay, these are the features that we had logged through our Pi Spark code, hyper parameters. These are the metrics R2 and RMSC that were logged via Pi Spark code in our pipeline. And here’s the model that was registered during this run. Now, let’s see how we can automate the process of training the model if we have more data to train. And we’re gonna do that using the second job. I’m gonna open that in a second separate window so we can have both of them side by side. So you can create these pipelines where you can orchestrate different jobs, check for statuses and take action depending on the status of the job. So basically what’s happening in this pipeline is it’s looking for data in the same bucket as before. If there is data, it starts the model training job that we just looked at. Then it addresses some of the attributes, the metadata, and then checks the status of the job. Depending on the status, it either sends it a success notification or a failure notification. And this pipeline is designed to run in it always on mode versus the other pipeline, which runs in batch mode. Okay, so there’s no need to keep the training job running after it’s complete. Okay, and until there’s no, there’s more data to be trained. All right, so I’m gonna ahead and start this job. Doing so will kick off our training job just like before, automatically. Okay, there we go. Right. And this is an orchestrator pipeline. So not really a good way to understand what’s happening but basically you can see that this is a job that’s running right now, right? And that’s our training job. And once this pipeline or job completes, it’s gonna create another version of the model, in MLflow. Okay. These are all the runs and experiments. Let’s go back to the other view models and we should see version 12 once the pipeline completes training the model. And again, this is a repeat just because I wanna show how you can automate this. So what we’re gonna do next is upload a different dataset, which will wreak off the trading job. All right, so I’m going to go ahead and upload a second dataset data and then Boston housing paid. All the defaults are good. Okay. Let’s go back. And there we go. So it should have kicked off the training job automatically. There it is. All right. And if we look in here, we should have version 11. That was from the first run. 12 is the one that we just logged and we should have another one for the second set of data. Okay. So I’m not gonna ahead and wait for everything to complete, but let’s check the status of the pipeline, again should have completed. And then we should see model 13. There you go. Oh, cool, so it basically passed our, it satisfied the condition that we have in our code and that’s why this particular model has been promoted from staging to production. That’s pretty cool. Let’s compare the two models and see why that might be. You can see that on the latest version of the model, we have a better R2 score compared to the previous version. Nice. Okay, so in the last section, I’m gonna show you how to design these pipelines. Go ahead and create a pipeline and we’ll give it a name. Next we’ll pick the version of data collector we wanna use for this pipeline create, and this will present a blank canvas. This is where you design your pipelines. So the very first thing you wanna do is select your data source. Lots of different options from Amazon S3 to Google Pops Up up to Azure Data Lake Storage, what have you. Okay. Let’s just go ahead and pick one at random the query. And then depending on your use case, you can select one or more processors that will transform your data. Again, lots of different options to choose from. Okay. Let’s just go ahead and pick one at random and then pick another one. Okay. And then finally you would select your destination where you want the data to be stored. Again, lots of different options. Okay. Let’s just go ahead and select one at random, right? So it’s as easy as that. And as far as what you need to configure, obviously depends on the origins, the data sources, the destinations and the processing you need to do for your data sets. Okay. So hopefully that was fun, educational and you learned something today. StreamSets is not a machine learning platform, but it does provide important capabilities and extensibility that it can help and expedite operations that some of the most crucial stages of the machine learning life cycle and MLOps, along with technologies like Databricks and Mlflow. Please don’t forget to provide your feedback. It’s really, really important to me as well as the summit organizers. Thank you so much, really, really appreciate your time and I’ll be happy to take any questions that you may have.
Dash Desai, Director of Platform and Technical Evangelism at StreamSets, has 18+ years of hands-on software and data engineering background. With recent experience in Big Data, Data Science, and Machine Learning, Dash applies his technical skills to help build solutions that solve business problems and surface trends that shape markets in new ways.
Dash has worked for global enterprises and tech startups in agile environments as an engineer and a solutions architect. As a Platform and Technical Evangelist, he is passionate about evaluating new ideas to help articulate how technology can address a given business problem. He also enjoys writing technical blog posts, hands-on tutorials, and conducting technical workshops.