A Collaborative Data Science Development Workflow

May 26, 2021 05:00 PM (PT)


Collaborative data science workflows have several moving parts, and many organizations struggle with developing an efficient and scalable process. Our solution consists of data scientists individually building and testing Kedro pipelines and measuring performance using MLflow tracking. Once a strong solution is created, the candidate pipeline is trained on cloud-agnostic, GPU-enabled containers. If this pipeline is production worthy, the resulting model is served to a production application through MLflow.

In this session watch:
Nicholas Hale, Data Scientist, Trillion Technology Solutions

 

Transcript

Nick Hale: Hi. I’m Nick Hale, senior researcher at Trillion Technology Solutions out of the Washington DC area. Today, I’m going to go over a collaborative and scalable machine learning workflow that we’ve developed here at Trillion. It allows individual developers on our teams to contribute to a given machine learning project, compare their methods and models with one another, prototype models and train them on the cloud, and deploy those models in a production setting with a seamless integration of parts.
And we want this capability to be able to continuously train models and deploy them whenever new data is received as well. So, a brief overview of what I’ll talk about today: we’ll go over the workflow and the components involved in it, discuss the architecture of how all the components connect to one another to achieve this end of collaborative machine learning, and then I’ll walk through a sample project, as if we were starting a machine learning project from scratch, and show exactly how to use the workflow.
Some of the objectives we were looking to get out of this workflow: allow a team of machine learning researchers to develop a new machine learning project and collaborate independently of one another, whether they’re on an individual developer cloud instance or a local laptop; allow them to prototype new models without using lots of resources, working on a small set of data to save costs; and then be able to scale that up to large-dataset training and, ultimately, model production.
We also wanted easy experiment tracking and model deployment through this workflow, and a flexible framework that can be used with any cloud platform. So, some of the contributions we’ll talk about today: how to use Git version control with multiple data scientists working together, how to handle branches, specifically feature branches, and how to scale data science pipelines collaboratively. I’ll also show how the method is tightly integrated at each step, from developing data science code, to experiment tracking, to model deployment.
Here are some of the core components we’re using in the workflow. We have Docker, which is a containerization framework; Kedro, which we use to create our data science pipelines; and MLflow, which is used for machine learning project management, model tracking, experiment tracking, parameter logging, et cetera. Databricks is used to integrate our compute with our data stores and lets us process large data jobs in a very easy, integrated way. We have PySpark as well, which allows us to send Python jobs to Spark clusters for easy big data processing. And we have a cloud service; this pipeline and workflow work with AWS, Google Cloud, or Azure all the same.
So, let’s get into the components. Some of you may have heard of Docker or used it quite a bit; it’s sort of the standard for containerization. What we’re using it for here is environment consistency between developers and between training job execution environments. Docker is an open platform for developing and running applications, and it provides lightweight virtual environments called containers that are loosely isolated from the host machine’s operating system.
Our workflow uses standardized container development and deployment, allowing for consistency across machine learning libraries and other software dependencies. So, all of these libraries and dependencies are identical for each developer and each machine and environment we’re running on. For machine learning pipelines, our main library is Kedro, which provides an open source framework for building and running modular data science code. It was created by QuantumBlack, a company owned by McKinsey.
The core idea behind Kedro is that it creates directed acyclic graphs, which are composed of functions and datasets connected together to create a pipeline that can be executed in one process. Nodes in the graph are the functions that drive the pipeline; data transformations, model training, and feature engineering or selection are examples of node functions we might use.
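To make that concrete, here is a minimal sketch of how a plain Python function becomes a Kedro node; the function and dataset names (raw_transactions, clean_transactions, params:cleaning) are illustrative, not from the talk:

```python
# A minimal sketch of wrapping a plain function as a Kedro node.
# Dataset and parameter names below are hypothetical examples.
import pandas as pd
from kedro.pipeline import node


def clean_transactions(raw: pd.DataFrame, params: dict) -> pd.DataFrame:
    """Drop rows with missing values and keep only the configured columns."""
    return raw.dropna()[params["columns"]]


clean_node = node(
    func=clean_transactions,
    inputs=["raw_transactions", "params:cleaning"],  # datasets/params from the catalog
    outputs="clean_transactions",
    name="clean_transactions_node",
)
```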
Kedro provides a very nice data engineering convention to track data transformations across local and cloud data stores. So, we have the same data engineering done on local machines and in the cloud, and we can easily track each type of dataset we’re using throughout a given pipeline. Here’s an example of that data engineering convention. There are different layers, starting from the raw data with no processing whatsoever. That goes into an intermediate layer with maybe some cleaning, cleansing, or a bit of data wrangling done.
Then there’s a primary layer, which is the fully cleaned data where you’d want to start the machine learning process, and a feature layer holding features that have been engineered, or sets of features engineered directly at that layer. There’s the model input layer, which holds datasets that go directly into the models; the models themselves, which can be saved as pickle files or other formats; the model outputs, which are predictions or other artifacts; and then a reporting layer that captures everything that happened during training runs, such as cross-validation reports.
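As a rough illustration of that layer convention, here is a small Kedro DataCatalog sketch; the dataset names and file paths are hypothetical, and the import paths assume a Kedro 0.17-era install (newer releases moved these datasets into a separate package):

```python
# A minimal sketch of the layer convention using Kedro's Python DataCatalog API.
# Dataset names and file paths are hypothetical; import paths assume Kedro ~0.17.
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet
from kedro.extras.datasets.pickle import PickleDataSet

catalog = DataCatalog(
    {
        # raw layer: data exactly as received, no processing
        "raw_transactions": CSVDataSet(filepath="data/01_raw/transactions.csv"),
        # primary layer: fully cleaned data, ready to start the ML process
        "clean_transactions": CSVDataSet(filepath="data/03_primary/transactions.csv"),
        # model input layer: datasets that go directly into the models
        "model_input_table": CSVDataSet(filepath="data/05_model_input/table.csv"),
        # models layer: the trained model itself, saved as a pickle
        "trained_model": PickleDataSet(filepath="data/06_models/model.pkl"),
    }
)
```

The same catalog entries can point at S3 paths instead of local ones, which is how the convention stays identical between local machines and the cloud.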
On the right-hand side, we have an example of the directed acyclic graph that Kedro provides. You can see that, at the top, we start with a dataset and some parameters that go into a function called process data. That function is executed, and its outputs are the X and y splits of the training set. Then some more parameters are hooked in, and those go into a function called train model. A test set is created for X and y as well; those are the other arcs going across on the right-hand side.
Another function called train model is executed, and its output is the model artifact, which goes into a function called predict() along with the X test dataset, and the predictions are output from that. Then, at the end, you can see the predictions and y test going into a reporting function for the accuracy. That’s just an example of what Kedro allows you to do in one modular pipeline.
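Here is a hedged sketch of how that graph could be expressed as a Kedro pipeline; the function bodies are simplified stand-ins, and the dataset and parameter names are illustrative rather than the exact ones on the slide:

```python
# A simplified sketch of the DAG described above: process data -> train model
# -> predict -> report accuracy. Dataset and parameter names are illustrative.
from kedro.pipeline import Pipeline, node
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def process_data(data, parameters):
    """Split the raw table into train/test features and targets."""
    X = data[parameters["features"]]
    y = data[parameters["target"]]
    return train_test_split(X, y, test_size=parameters["test_size"])


def train_model(X_train, y_train, parameters):
    """Fit a simple classifier on the training split."""
    model = LogisticRegression(max_iter=parameters["max_iter"])
    return model.fit(X_train, y_train)


def predict(model, X_test):
    """Generate predictions for the held-out features."""
    return model.predict(X_test)


def report_accuracy(predictions, y_test):
    """Report accuracy of the predictions against the held-out targets."""
    print(f"Accuracy: {accuracy_score(y_test, predictions):.3f}")


pipeline = Pipeline(
    [
        node(process_data, ["example_data", "parameters"],
             ["X_train", "X_test", "y_train", "y_test"]),
        node(train_model, ["X_train", "y_train", "parameters"], "model"),
        node(predict, ["model", "X_test"], "predictions"),
        node(report_accuracy, ["predictions", "y_test"], None),
    ]
)
```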
Another key component of the workflow is MLflow, an open source platform for managing machine learning life cycles. It allows for experiment tracking and easy reproducibility, deployment, and model registration. It lets multiple users connect to an MLflow instance, run their experiments in different compute environments, and log all the experiments together. MLflow also allows for easy deployment of models to web applications through REST APIs to serve predictions, say if you’re doing machine learning in a web application. And it allows models and parameters to be easily versioned and tracked, so we can always reproduce those models.
Here’s an example of the MLflow server user interface. We have a lot of experiments being run in the main section, showing the different runs, the users, and the parameters used to train each of those models. It logs all of that for you automatically. You can quickly turn on MLflow logging in your code and it’ll do all of this for you, handling the logging automatically. So, you can have multiple users training their models and logging them together here, and we can easily track and compare the outputs of multiple users.
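As a minimal sketch, turning on that logging looks roughly like this; the tracking URI, experiment name, and values are assumptions for illustration:

```python
# A minimal sketch of turning on MLflow logging around a training run.
# Tracking URI, experiment name, and logged values are placeholders.
import mlflow

# Point at the Databricks-hosted tracking server (or a self-hosted MLflow URL).
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/demand-forecast")   # hypothetical experiment name

with mlflow.start_run(run_name="feature-branch-prototype"):
    mlflow.log_param("model_type", "logistic_regression")
    mlflow.log_param("max_iter", 500)
    mlflow.log_metric("accuracy", 0.87)            # placeholder value
```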
Another really key component is Databricks, which we’re using for ML lifecycle integration. It’s a platform that enables seamless integration of data science code, experiment tracking from MLflow like we just saw, and the cloud resources we want to tap into, as well as data, of course. With Databricks, data scientists can run their Kedro pipelines, as we previously saw, on compute clusters via Databricks Connect, which is a Spark client library that connects local development environments to Databricks clusters.
So, we can send Spark jobs quickly to Databricks through Databricks Connect. PySpark is what we’re using for big data processing; it’s a Python library that ships data science jobs to Spark clusters running on Databricks. It works with Databricks Connect, so we can quickly translate our pipelines into Spark jobs on a local machine and then send them to the cloud to run training, or what have you, with very little friction.
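Here is a rough sketch of what that looks like from a developer’s machine, assuming databricks-connect has already been installed and configured with `databricks-connect configure`; the S3 path and column names are hypothetical:

```python
# A hedged sketch of running a Spark job on a Databricks cluster from a local
# machine via Databricks Connect (classic). Assumes databricks-connect is
# installed and configured; the S3 path and column names are hypothetical.
from pyspark.sql import SparkSession

# With Databricks Connect installed, this session is backed by the remote cluster.
spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3://my-bucket/data/03_primary/transactions/")
daily_totals = df.groupBy("date").sum("amount")
daily_totals.show(5)
```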
Spark usually has a different framework for programming and constructing datasets. Koalas is a library that implements the native pandas DataFrame API on top of Spark, which means very little learning curve for a new data scientist who’s familiar with pandas but not with Spark; they can come in and start creating pipelines the same way they’ve always learned to do it.
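For example, a minimal Koalas sketch looks almost exactly like pandas; the file path and column names here are hypothetical (in newer Spark releases, Koalas lives on as pyspark.pandas):

```python
# A minimal sketch of pandas-style syntax executed on Spark via Koalas.
# File path and column names are hypothetical.
import databricks.koalas as ks

kdf = ks.read_csv("s3://my-bucket/data/03_primary/transactions.csv")
kdf["amount_usd"] = kdf["amount"] * 1.1            # element-wise ops, pandas-style
summary = kdf.groupby("customer_id")["amount_usd"].sum()
print(summary.head())
```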
Now I’ll show you a bit of the architecture that goes into this workflow. At the top, we have our Git repositories and Docker images, where we use the same Docker image for every developer. They pull that Docker image to their local computer, and they can also install what we call a custom scaffold repository. A scaffold consists of proprietary, custom ML functions that we’ve built and use over and over, and it also pins all the dependencies to the same versions to keep them consistent across all developers.
Then there’s the ML project we’re working on. In this scenario, say we’re working on a new ML project: we initialize it as a Kedro project, which Kedro allows you to do, and that sets up the data engineering convention and a bunch of different files you need to run the pipelines. Once individual developers are set up that way, they develop inside their Docker container. On the bottom left, you can see they’ll have ML libraries in there, the ability to build Kedro pipelines, PySpark and Databricks Connect to send jobs to the cloud, as well as Koalas, so work can be sent to Spark just as if you were working with a pandas DataFrame.
Once we’re ready to send those Spark jobs, they can easily be sent to the cloud where Databricks is running, through Databricks Connect. On the right-hand side, say we’re working on Amazon, we have our S3 bucket, where we’ve adopted the Kedro data engineering convention as well, identical to the individual machines for this same project.
In our cloud environment, we have Databricks, which allows connections to our cloud storage like S3, or to databases. We have MLflow, which is easily connected to Databricks, and we have our training compute and deployment compute. Once we run our jobs, we do the training and can quickly deploy. We can do continuous training and deploy models to production applications directly through commands from our local environments. So, once the cloud is up and running in this format, we can just continue to develop locally, send these commands to the cloud, and it’ll do all of the training and compute functions with no friction.
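As a hedged sketch of that deployment path, a production application could pull a registered model back out of MLflow like this; the model name and stage are illustrative:

```python
# A hedged sketch of loading a registered model from MLflow for inference.
# The registered model name, stage, and input columns are hypothetical.
import mlflow.pyfunc
import pandas as pd

model = mlflow.pyfunc.load_model("models:/demand_forecast/Production")

new_data = pd.DataFrame({"feature_a": [1.0], "feature_b": [2.0]})  # placeholder input
print(model.predict(new_data))
```

The same registry URI can also be handed to `mlflow models serve` to stand up a REST endpoint for predictions.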
Now I’ll walk through an example scenario, as if we’re starting a brand new machine learning project that requires some research to get to a workable model we want to put into production. We start with project setup: we initialize a Kedro project template, which is step one at the top here. We clone the repositories and the necessary libraries like Databricks Connect and our dependencies, and build the development and training containers.
On the Gitflow side, we initialize master and development branches as part of the project setup. In step two, compute-wise, we’re on our local machines or individual cloud instances, and on the data engineering side we’re prototyping pipelines with pandas and other scientific machine learning libraries in Python. Without going out and using cloud resources, we’re getting a sense of which approaches to the modeling look promising. We can log these local experiments in MLflow in the same manner.
For this part of the process, each data scientist has their own feature branch, and all of their individual experiments are logged there, so they can do whatever experimentation they want on their own feature branch. The next step is to go to the cloud, where we’ll be working with Koalas and PySpark. At this point we’ve done enough experimentation that we want to train on lots of data and do a full training run on all of it, rather than prototyping.
From the individual experiments, we found a promising solution, and we can train this candidate model on all the data. We do that in the cloud through Databricks Connect, using Koalas, and we can register that model in MLflow as well. Then we commit that code to the development branch and add any new functions to our scaffold library. In step four, we select a final model that we want to put into production.
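A minimal sketch of that registration step, with a toy dataset standing in for the full training data and a hypothetical registry name:

```python
# A minimal sketch of logging and registering a candidate model in MLflow.
# The toy dataset and registered model name are placeholders; registering
# assumes a registry-capable tracking server such as Databricks.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)   # stand-in for the full dataset
candidate = LogisticRegression(max_iter=500).fit(X, y)

with mlflow.start_run(run_name="full-data-training"):
    mlflow.log_metric("train_accuracy", candidate.score(X, y))
    mlflow.sklearn.log_model(
        sk_model=candidate,
        artifact_path="model",
        registered_model_name="demand_forecast",  # hypothetical registry name
    )
```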
And we’ll use the same Databricks Connect with the inference compute, and on the engineering side, we’re still using Koalas and PySpark. And we’re going to deploy that final model through MLflow and Databricks, and commit that code to master, sort of a production 1.0, if you will.
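Promoting the chosen version in the MLflow Model Registry might look roughly like this; the model name and version number are illustrative:

```python
# A hedged sketch of promoting a registered model version to Production in
# the MLflow Model Registry. Name and version are illustrative.
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name="demand_forecast",   # hypothetical registered model name
    version=3,                # the version chosen after comparing experiments
    stage="Production",
)
```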
In step five, we can introduce continuous retraining. With that final model in place, when new data is received in the production environment, we can retrain the model by automating that final pipeline, and automate the retraining, logging, and reporting as well, whenever new data is received.
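A simplified, hypothetical sketch of that retraining hook, where run_final_pipeline stands in for executing the production Kedro pipeline:

```python
# A simplified, hypothetical sketch of the continuous-retraining step: when new
# data lands, re-run the final training pipeline and log the run to MLflow.
import mlflow


def run_final_pipeline(new_data_path: str) -> dict:
    """Stub: in practice this would execute the production Kedro pipeline."""
    return {"accuracy": 0.90}  # placeholder metrics


def retrain_on_new_data(new_data_path: str) -> None:
    """Re-run the final pipeline on newly received data and log the results."""
    with mlflow.start_run(run_name="scheduled-retraining"):
        mlflow.log_param("new_data_path", new_data_path)
        for name, value in run_final_pipeline(new_data_path).items():
            mlflow.log_metric(name, value)
```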
So, that constitutes the full workflow we’re using here at Trillion Technology Solutions. I’m excited to take your questions on this; there should be a lot of discussion, since I think many of you are working with these tools as well. I’m interested to hear your feedback, how you’re using some of these tools together, and what systems you’ve designed for working with multiple data scientists on a project. So, thank you very much.

Nicholas Hale

Nick Hale is a Senior R&D Specialist at Trillion Technology Solutions. Nick leads AI/ML initiatives at Trillion to support public sector customers. He has worked on AI/ML related Department of Defen...