Deploying machine learning models from training to production requires companies to deal with the complexity of moving workloads through different pipelines and re-writing code from scratch. Yaron Haviv will explain how to automatically transfer machine learning models to production by running Spark as a microservice for inferencing, achieving auto-scaling, versioning and security. He will demonstrate how to feed feature vectors aggregated from multivariate real-time and historical data to machine learning models and serverless functions for real-time dashboards and actions.
– Hi everyone. My name is Yaron. I’m CTO and Co-founder of Iguazio, a data science platform company. What I’m going to talk about in this presentation and demonstrate is how to accelerate production of machine learning and data science workloads using microservices architecture.
First before we dive into the solution let’s understand the problem. The key problem today for many organizations is how to move from research environments where they do experiments of machine learning workloads usually working off a small data set using CSV Excel spreadsheets, running some analysis and data science on their Jupyter notebooks and finally doing some iterative work and throwing that into production. The problem is that when you’re going to real production, the work does look pretty different. You have real data which runs at scale. Data may be streaming, may arrive from operational databases to ETL, you may need integration with real APIs that produce additional data. Once you’ve brought all the data into the system you have to run preparation at scale, not on CSVs on Excel spreadsheet but on rather large files or actually work against databases in order to denormalize and create meaningful features off the data and that requires larger scale and more distributed systems. Once you’ve done that you want to run machine learning and even at that phase you want to apply automation, you want to use every time you change the model, you change the data, you see a drift happening to your model. You want to retrain and maybe even with multiple versions of parameters and algorithms in order to guarantee that you have the maximum accuracy. So that needs to go through automated mechanisms and again work on large data sets. And finally when you have the model, you have to serve that model in production against real time data, real features that come from streams, from transactions, from user requests and you want to be able to monitor your models in real time, elastically scale them as needed and increase a feedback loop back into the training and data preparation. When you’re starting to do that, you understand that the basic principles of developing in research essentially doing Jumbo, Jupyter notebooks with lot of code are not really applicable to production. This is where usually large teams of people come into the picture, DevOps, analog, data engineers, ML engineers whatever in order to deliver everything and in many cases they happen to rewrite the code from scratch. So what we want to try and do is be able to build architecture that encompass the production requirements already in day one. So now just to understand further what are those things that you need to apply to a workload before you make it production worthy. So this, in this diagram you can see that once you finish developing the code, it’s not enough to have code. The next thing you need to do is essentially package it, you know create docker containers and make and build script and test script and then you need to think about scaling out your code. And there are different scaling strategies depending on your workload. If it’s data analytics or machine learning or deep learning and once you’ve done that you have to apply optimizations to make sure that it performs pretty well because this model serving function may sit behind an API gateway and serve real customer requests. If it takes too long, the latency is too high, it may not respond in time or maybe it does not have the right throughput to support your business application. And in order to provide the production aware system then you have to apply monitoring, observability, login, security, versioning for everything so you can further explain and and understand what’s going on in production and you know what you want to apply in order not to do everything manually, you want to apply automated processes so CI/CD you know, continuous integration, continuous development, grading workflows that automate the work of productizing work or generating those models and datasets that are relevant, doing rolling upgrades so you don’t have downtime, potentially doing A/B testing and Canaries. So there are many operational aspects to building a machine-learning pipeline and workflow and what you end up doing with a couple of data scientists for a few weeks, you may have with what we hear is people spend you know 12 to 18 months trying to productize those artifacts because they need larger teams with different disciplines that work on that problem. What we’re trying to show here is how to automate most of that workflow so we can rapidly go into production.
So if we’re zooming into what is a machine learning pipeline? We have three different layers. We have the data layer, we have the computation layer and the pink one is served automation and orchestration layer.
So usually in the pipeline you have different forms of data. You’re starting with row features that are being ingested through external sources using ETLs or streaming or various other mechanisms. Once you have the base features, you need to run analysis on those data sets and the analysis will produce a new denormalized data set or a clean or aggregated data set that we want to use for the training and so here in the preparation we may use tools like Spark or analysis or other Python tools or various database capabilities. Once we do that, we apply training. And in the training we want to try and do parallelism, we want to try different forms of training, different algorithms, different parameter combinations in order to see which one results with the best accuracy or best performance. Sometimes it has the best accuracy but it lowers our performance significantly or it takes a lot of time to compute. That it’s not really worth the extra accuracy. On the other tier, once you’ve done the training you want to run validation and the validation will take a bigger data set and compress that with the model that we just created, validated it’s working, performing et cetera. And once we have that ready we move into deployment. So this… When you do deployment, you actually deploy two different things. You deploy the model but you may also need to deploy APIs that go and query the data, bring the feature vector and essentially launch a call against the model. Those two things, the deployment of the model and the APIs, they also generate a lot of telemetry and a lot of information that’ll later be used to help us monitor the models in production, try and detect drift, detect all sorts of anomalies, monitor performance behavior of our model et cetera which also feeds back into the beginning ’cause if we find some changes in the model then we may wanna do retraining and collect new data et cetera. Above that layer of computation, we need an orchestration, we need to monitor experiments, monitor the data changes. We want to orchestrate, we don’t want to run everything manually, we want some automation systems to build those pipelines for us and monitor your activities as I mentioned before. So in our solution we have those three layers. One comprised of set of managed data services and the other one comprised of a set of serverless function engine that will describe based on few tools that can run different engines whether it’s Spark or a function more oriented towards real-time and above that a set of orchestration tools based on several open source projects.
Okay. So if we need to build a stack, one of the recommended mechanism is to essentially leverage Kubernetes which is a generic micro-services orchestration framework and on top of that Kubernetes layer, we wanna plug all those different services including Spark and potentially Presto and TensorFlow, Horvod, Jupyter notebooks, service engines like Nuclio et cetera. This layer has to use some data. After all were processing some data so we may have a combination of on one end a data like in the cloud we may use S3 or objects or on Prem, we may use some fire solution like NFS or a dupe HDFS and in parallel we may want databases to do real-time, workloads and querying you know, key value stores et cetera. At the end on top, we need a framework to also run the AutoML you know, run various experiments automatically, we need a way to track the experiments, the data that is resulting from the experiments et cetera. We need a mechanisms to manage the mega data of all of our data sets and the features through a feature store and we need a workflow engine. There is one pretty good in Kubernetes called Kubeflow which is very tightly integrated into Kubernetes. So that is essentially a recommended architecture for building your micro services.
So we talked before about the concept of serverless. So serverless was invented few years ago by Amazon with introduction of lambda. Serverless is trying to enable two main things. One is write resource elasticity which translates to lower cost because you’re only using what you need and on the same time also translates to scaling, scaling out and being able to handle any scale of data and computation needs. The other aspect of serverless is essentially automating all those different tasks that I mentioned before. You don’t build containers, you write the function and that automatically runs on some container that’s generated for you, everything is auto scaling, it’s monitoring the workload automatically logging and it has more rigid security and all the things are automated so we don’t have to deal with those aspects but serverless also have downsides and I’ve listed here in this slide. So you can see serverless is usually event driven, very short lifespan on functions but when we drive the machine learning worker it may work for half an hour and also scaling is done through serverless using load balancer techniques while for example you take spark thence the spark scaling is out for RDDs and shuffles and reuse or in machine learning and deep learning using hyper parameters to run multiple permutation so that’s a different way of scaling the workload. In data science you have state, you have data and usually service is stateless. And finally the inputs and outputs of a function in those kind of data intensive job is more like a job meaning a parameter data set… You input a set of parameters to provide some input data set then you output with a set of results and some output data sets. So we need a different API to do serverless function but the general idea could stay the same of having essentially elastic scaling and doing fully automated deployment and operation. So with that we introduce an open-source framework, two actually one called Nuclio…
One called Nuclio, the other one called MLRun which working together and the general idea is to extract the computation resources and computation resource could be as far cluster, could be a dust cluster if you need, could be Horovod if you’re using deep learning or Nuclio engine which is used for mostly for real-time surveying, suggestion, et cetera And you take your code, you plug it into that runtime but you need an automation framework to do all those things that serverless do, to build containers automatically, auto scale the workloads, collect inputs and outputs, do logging, monitoring et cetera. So we came up with this concept of machine learning functions which is essentially doing all that work for you. Think of it as when you have a function, what you wanna do is inject the set of inputs, parameters, secrets, credentials in case you’re using accessing some databases or external connections and some data sets then you wanna run the workload and finally generate outputs from that function which could be results you know, accuracy et cetera. It could be operational data, logs, monitoring et cetera or could be data itself. And those functions need to plug into a framework for workflow management and work for castration so you can compose different workflows out of those individual function objects.
Okay. And one of the tools that we use for composition is an open source tool called Kubeflow, originally designed by Google which allows you to compose if you’re familiar with the Apache Airflow or Jenkins and the DevOs world that you can essentially go and compose a layout, a graph, a dag and say for example go bring the data, finish bringing the data, run some data preparation, run training validation, deploy with a bunch of conditions so you can use Kubeflow and if you combine Kubeflow as a workflow engine with serverless frameworks like MLRun and Nuclio then you can essentially go and create something which is far more automated and deliver those pipelines without all the DevOps work associated with that.
So if we’re looking into a workflow, how would such a workflow, development workflow work? So usually you start with writing some code. So you write code in your local Jupyter or PyCharm or in a managed Jupyter notebook or some other notebook framework. While you’re doing that, you may still wanna track everything you do, runs, experimenting, you need some SDK for that like MLflow or MLRun has its own mechanism for tracking experiments. Once I wanna move to running the job and the cluster I may need to also specify some resources for example I need GPUs, I want to an amount of CPU, I need dependencies on certain images. Maybe I need to mount some data volumes into my work routes to make my work load work against something, I need to specify that. And then there’s an automated machine or any serverless engine that will do the build for me and the deployment on the cluster. So I can deploy on the cluster, I can run out a scale for example I could write a spark job or any other Python job on my notebook. It adds some configuration and metadata for the runtime on that job for you. I need this image, I need this amount of CPUs or GPUs and launch. This launch will trigger a build process and a deployment process and my job can now run distributed on the cluster and we’re gonna see that in a minute, how I can run whether it’s a spark job or regular Python job on the cluster distributed with just simply just submitting the work. Once I finish experimenting with my function unit testing that I wanna publish that into a source code repository like GitHub for example potentiating some documentation so if I later on wanna go back into that function and use it, I know what it’s doing, what’s the API et cetera or I may wanna compose a bigger workflow using Kubeflow pipelines which essentially does a multi-stage process you know, for young preferred data, training, validate et cetera. When I’m building the pipeline, I may also take functions that were built before either by me or by other members of the team. So this function of programming is allowing me to break my code to smaller pieces that are managed, could be version, could be deployed you know, in a more easy way and are fully documented and I could use experiment tracking for the entire system to see what’s going on. So that’s one type of workflow. Another automated workflow that is more, what is referred to as Gitbox, Git Based Operations or the traditional CI/CD workflow is that instead of publishing your code at the end, you first you write some code and you provide some manifest, some configuration and in publish that into a source control system like Git for example. This publishing, this pull request generate the triggers which automatically takes your code, take the dependencies, build the images, runs automatically the pipeline, reports on the results and responds back into your pull request, responds back with all the results of the training or the workflow and you can take it for that.
That’s the benefit of that. It’s more production oriented because you can track the entire process. You could see who published, you can have reviewers that approve the process et cetera but those two processes will coexist. Usually in the initial development of a project, we’ll start with a more interactive way and as we introduce more numbers to the team and we grow, we have to move into more and more automated CI/CD based pipelines for our work.
And we’ll see how we can essentially combine both in an automated session. So with all that theory let’s look at the actual deployment example from one of our users called Payoneer. They’re doing a fraud prevention. They’re a payment company, a pretty large. They wanted to do two things. They wanna move from fraud detection to prevention meaning becoming more real-time and this provides a lot of value to their business. On the second aspect, they wanna move from to a micro-services which cuts the time for them to productize any new software artifact, any new thing bring it much faster to production. So they’ve chosen to move from a traditional active architecture which is Hadoop based to an architecture of micro-service.
So their original solution… By the way there’s a full case study. Anyone that will send me an email I can send him the case study on that. Their original solution based on the Duke was essentially doing ETL jobs from their database, running various analytics workload and then using Spark, sorry. Using our server to do predictions. The process because it was very battery and it took about 40 minutes from the minute this transaction were written to the database until they could this decide if they wanna block the fraudulent account and over the fraudulent transactions. And that’s really challenging because in 40 minutes you can empty the bank. The other challenge that working with Hadoop, it was very rigid. The process was very resource intensive and they wanted to move to more agile automated microservices architecture, pretty much like in the cloud.
So with that they changed into a new architecture based on real-time and micro services. So the general idea is that the database transactions are streamed, they’re not batch. They’re streamed into serverless functions that collect this stream, my serverless function gets intercepted stream they run first analysis or enrichment or rather aggregations on the data and ingest that into a real-time feature stored. In parallel we have spark functions that run analytics functions on a bigger window and trying to work and create additional insights from the data. We have now an enriched data set from historical data and real-time data so we have functions that essentially go and do model training, in this case use scikit-learn in a distributed fashion. And once we have those models, we can start serving those streaming events. So all those streaming events that are coming and getting contextualized and enriched in real time into an enriched feature vector are going into an increasing serverless function. This serverless function decides if to block the account or not and it essentially just writes into the database telling it the database block this account because there is a fraud and so this entire process from getting the data out of the database on the first transaction that is suspicious in fraud until the account was actually blocked, this entire process took about 12 seconds moving from 40 minutes. Why? Because it’s much easier to do that, those kind of things with serverless functions and streaming or message queues in this example and you can see that we combine traditional tools like spark alongside with others or more Python or other languages in a micro services and containers as workload. The major advantage beyond cutting the time to detection is also moving to an automated development and production environment using this serverless approach. If we wanna change a piece of code in any of those series, we just go and update something and push deploy and gets deployed into production. There’s no need for long build processes et cetera. It’s very automated workflow and in a rolling average it will actually without any incurring any downside it will automatically update in production. So this customer managed to move from quarterly a release cycle to a weekly release cycle because it’s so easy and automated to release new versions into production and it’s rather safe because they can always pull back and load the previous version. Now once we created a model, it’s not enough. We need to monitor our model in order to identify if the model is bad, if it’s good, if there is a drift, if there’s a problem with performance maybe we need to switch back to an older model or retrain in order to creates ensembles. So we also need a real-time system to monitor our model in production. So using this serverless or micro service architecture, once we’ve done the training and we created the serving API and model, though serving API and model also generate stream, stream of data with all the transactions of inferencing that entered that system. There is a micro service that listens on that stream and starts analyzing that stream, creating time series aggregations you know, which will show later on in a dashboard like the and essentially micro batching those streams into bigger parquet files and there is another function that periodically goes and analyze that stream of data, compares it with reference data in order to detect that there is drift. If there is a drift through a simple API call, it could launch, remodify ana assemble or do any corrective measurement. So it’s now not only the problem of creating the training and serving the model, we also need to be able to manage the model and monitor the model in production. Okay. So with that let’s go and see a demo of this architecture in action. So what we see is the… What we wanna show you is how it’s relatively easy to go and start with a notebook and write some code and turn that code into production without the local work. So we see here a notebook running on our manage platform
but that could also be running on your laptop. There’s a full documentation on that, that is using a project called MLRun in GitHub, you know.
MLRun has a repository in Git that you can just go and there’s a lot of demos, end to end demos of different use cases, image classification, face recognition you know, time series analysis. You can just load the project and go with it and start you know, customer churn prediction is an entire projects with many different complex steps that could be you know, tried out through that project. So let’s take this simple example of a project that’s just using the Iris dataset. So what we said before that we need to provide a manifest along with our code so that our code is the code that is running but sometimes we need some packages or we need some images or resources. So we can just annotate those in our notebook and give some commands to serve a compiler that will come later that those are packages that are required for production or this is the image we’re gonna use as a baseline and a docker image or potentially again multiple other configuration. Then we could just go and run or write our code and we could instrument our code, we could also add things like logging, artifacts or data set or models or just logging and notifications. I’m grabbing some secrets and credentials. There’re all sorts of things that are provided through the wrapper that runs this code. And again, it’s very simple code. There’s a context object. There’re certain parameters that aren’t injected into my code and I could just go and run it as a data scientist and play with it.
At some point I wanna create a project that uses a bunch of functions so I can just go and create the new project and I’m gonna run my code in my notebook by just wrapping it with something that will sandbox my code and record everything that my code did. So I’m using run local and I can run it. At the same time exactly I can just go and run it on the cluster as if it was running on my notebook so that also allows me a concept of CPU or GPU as a service. I can just launch a job on the cluster. At some point I wanna turn my code in my notebook into a container object so that’s relatively easy. I just run a command called code to function, I give it a name runtime engine, again could be job, Kubernetes job or could be a spark job or other thing supported through that engine. I may wanna register my function to my project so that will appear later on as part of my project and maybe I wanna start doing other things like for example I wanna analyze my data. I’m gonna use a library function called Described with a Son tag and because that function is already showing me how to go and analyze the features and the clustering and all that of my projects. So I’m just gonna load this function. I don’t really remember how to use it so I just ask for documentation and you see I understand what this function is doing is visualization and data analysis on the data set. I can provide it all sorts of parameters and then I just wanna run this function. I loaded it from the internet with some version. I wanna run it with certain parameters, what is the label columns? What is the input data which is essentially the output of my previous job that was running? Just pointing to it. Where do I wanna drop my artifacts? And I’m just gonna launch this job. Now if you’re familiar with Kubernete, you’ll see that actually in the background it’s creating a job in the cluster but all those outputs that are generated from the function that are going back to my Jupyter so I have the experience of running from a notebook but in parallel it’s actually running on a distributed cluster and recording all those actions. At the same point I wanna create a bigger pipeline that involves scikit-learn and classification with AutoML and potentially testing my classifier using validation, distributed validation techniques and creating a model serving function. And after I created the model serving function, I wanna test that it’s actually working and performing porter. I didn’t wanna write all this ’cause I’m too lazy so I’ll just import functions that do all those things for me and I’ll add them to my project and the only thing I need to do now is define a DSL, a graph that tells me you know random ingestion, forest ingestion good you know run the analysis. Once you finish the analysis, run the training and once you finish the training with you know various algorithm options ’cause it’s AutoML, it will have run through all those, run the validation. The validation will use the data and the test set and the model from the training and so on. The serving function we use the model from the training and the testing, the live testing function will use the addresses and other things from the deployment functions. So like essentially define the graph and I could just go and execute that graph by just saying project run, okay and what it does it just, it creates a queue flow, the Kubernetes automated workflow with all of those things that I just described. So you see I have a deployment, I’m building my containers for the set up and then I’m getting some data. Data is presented here. I’m summarizing my data, I’m running analytics and feature analysis on my data so you can see the behavior of my data, the features. Again all sorts of charts automatically generated for me. I wanna run training so again I can see our C curves and various charts that are generated through the training. The training is also doing AutoML and so I can actually see that it’s ran multiple algorithms on the same data set. Once it’s done I see a testing on the model with again different test set, it’s deploying my model and after deploying my model it’s actually testing that my deployed model is performing, this is like a latency curve of my model. So I fully automated my entire pipeline by writing a single notebook, I created an end-to-end project, something that would take months and months and it all scales on a horizontal scaling cluster. Now it’s not limited to just those simple Python functions, you could do exactly the same with also Spark. So I’m on run support Spark as a runtime engine and I could define a Spark job. In fact this is a task that for example reads a couple of data sets and run some query and some joins on those two data set so I define parameters to my function which are the source data sets, I define a sequel query which I wanna run and I have some spark code that is in the background. I can run this thing using a local what we call local run-time so I’ll run it locally on my notebook environment. Great! It works. And then I can run it on a distributed as a serverless spark function on a distributed cluster by just changing my runtime client to Spark. Now it’s essentially choosing Kubernetes operators if you’re familiar with that term. Essentially it’s going to launch a spark cluster for me and run the job on that spark cluster and I could customize the resource requirements and dependencies and chores and all of that. Again with this mechanism, you see that I can fully automate my productization pipeline, I can get really nice performance. All those functions that I’m building are part of Nuclio, again an open source project where I could just go and do those functions, dig into the code of the functions and you know I can attach those functions not only as batch, I can make them work off streams like for example Kafka stream, et cetera.
And just spend a few words, a few configuration elements on that. And the other thing I mentioned before is beyond just doing all those things manually, we can also work for everything in a Git based workflow. So we can for example publish our code, this notebook that I just mentioned. We can just publish it in to Git, to GitHub in this example. The code is also generating what we call project demo file, it’s the project configuration file of all those saying that I did those functions that I attached some other configuration and what they could do is every time I change the code through a pull request, let’s take a full request example, I can ask the Git to just run experiments on that workflow of my repository. This trigger will automatically launch a CI step so automatically MLRun will kick in, it will automatically respond. That’s a comment made by the bot. It’s automatically responding. I’m starting to run this pipeline that we just showed you. When the pipeline finishes, it will automatically plot a result. So this way we managed to create, to document the internal workflow, we can have reviewers, we can have a much more controlled workflow for our development. So without a lot of extra work, we just converted something that we worked on interactively into a fully automated CI/CD pipeline for machine learning, okay? So with that, thank you everyone.
I hope this demo was clear. Anyone that wants to ask more questions, you can reach me out. In my Twitter is my full name Yaron Haviv.
Yaron Haviv is a serial entrepreneur who has deep technological experience in the fields of big data, cloud, storage and networking. Prior to Iguazio, Yaron was the Vice President of Datacenter Solutions at Mellanox, where he led technology innovation, software development and solution integrations. He was also the CTO and Vice President of R&D at Voltaire, a high-performance computing, IO and networking company. Yaron is a CNCF member and one of the authors in the CNCF working group. He presented in various events including KubeCon + CloudNativeCon, Spark + AI Summit and Strata.