Model reproducibility is becoming the next frontier for successful AI model building and deployment, in both research and production scenarios. In this talk we will show you how to build reproducible AI models and workflows using PyTorch and MLflow that can be shared across your teams, adding traceability and speeding up collaboration on AI projects.
Speaker: Geeta Chauhan
– Good morning, everyone. Today I will be talking about reproducible AI using PyTorch and MLflow. My name is Geeta Chauhan and I'm in AI PyTorch Partner Engineering at Facebook. The agenda today is to go over the PyTorch community growth, then dive into the reproducible AI challenges, look at the solution using MLflow and PyTorch, and then the path to production. The PyTorch community is growing at a very rapid clip. We now have over 1,600 contributors to PyTorch; this includes contributors from companies like Microsoft, Facebook, Google, Uber, MIT, and CMU. That has been more than 50% year-over-year growth. And we have a very active user forum as well, with 34,000-plus active users participating, so we are very happy to see this. If you take a look at the papers published on the Papers with Code site, you will see that more than 48% of the papers being published use PyTorch now, which is great growth. Inside Facebook, the data in our ML pipelines is growing very rapidly. In 2015, 30% of the data in our warehouse was being used for machine learning; now more than 50% is being used for machine learning. At the same time, we had a 2x growth in the overall size of our data warehouse, so if you do the math, you will see that this is more than a 3x growth overall for the ML data. The number of software engineers who are building these machine learning models is also increasing: we have seen a 2x growth in the unique number of users running the models. And the complexity of the workloads is increasing, and the number of workloads is increasing, so as a result the overall compute load on our clusters has increased over 8x. So let's dive into some of the reproducible AI challenges. Unlike traditional software, machine learning has a very continuous and iterative process for building out the models.
One is optimizing for a metric like accuracy, and the quality of the data is what determines how good your models will be. You have to tune the parameters, and experiment tracking is very difficult. You run into challenges like your data changing over time, which causes model drift and poor performance of the models. One has to compare and combine many different libraries and models to get to the optimal performance, and due to the diverse deployment environments, packaging the models and running them for inference is still a big challenge. In the past few years, there have been a lot of issues with the reproducibility of the results of research papers: just because the data is missing, or the model weights or the scripts are missing, it is very difficult to reproduce the exact same results as in the research paper that was published. I've seen similar challenges on the production side, where the hyper-parameters changed, or the features or the data on which the model was originally trained were not available. The vocabulary that was used for an NLP model, for example, got lost. And the people who had originally created these models are no longer with the company, so as these models keep running in production, it is very difficult to go and build a new version of them. Looking at some of these challenges on the research side, a reproducibility checklist was created, and at NeurIPS 2019 a reproducibility challenge was run. Just by introducing that reproducibility checklist, there was a big improvement: 75% of the papers submitted at NeurIPS had code with them. A total of 173 papers were submitted as part of this challenge, which was a 92% increase compared to ICLR the year before. The reproducibility checklist contains things like dependencies: does the model repository have all the instructions for how to set up the environment?
Does it include the training scripts and the evaluation scripts with which the model was trained and evaluated? Does it have the pre-trained models, and are the scripts available for the results, the tables, and the plots of the main published results? At Facebook we have been looking at how to simplify and improve this experience, so we just launched the integration of arXiv with Papers with Code. You can now get the code corresponding to a paper right from the arXiv site: you click on the Code tab and you get the code, so you no longer have to hunt for the code corresponding to a paper. This is a huge step forward in moving towards reproducible research. Now let's look at the solution on the production side. By combining MLflow with PyTorch, we can get to reproducibility for models that are deployed in production. MLflow comes with a set of features for experiment tracking, MLflow Projects, the Model Registry, and model deployment, and by integrating PyTorch into each of these components, you now get reproducibility for PyTorch models running on MLflow. The features we are launching: PyTorch auto logging, examples with MLflow Projects, TorchScripted versions of the models, and the ability to save and load extra artifacts. We are launching a new TorchServe deployment plugin as well. So let's look at the MLflow auto logging. The auto logging feature has been implemented using the PyTorch Lightning training loop. All you have to do is import the auto logging module, write your training loop script as usual, and call the auto log function, which will log the parameters. By default, it logs things like the hyper-parameters for learning rate, the model summary, the optimizer name, and things like that. You can control callbacks like the early stopping callback, with parameters such as min delta, and you can log every N iterations. You can also have user-defined metrics, like F1 score and test accuracy.
So this here is an example of what the model experiment comparison looks like in MLflow; this is for a fine-tuning example across different iterations, where you can select all the experiment trials, compare them, and get the metrics. On the artifacts side, we have enhanced the MLflow PyTorch save model function to add the ability to save extra artifacts. These can be extra files for NLP models, like a vocabulary, or the requirements for running the model. We've also added support for TorchScripted models: all you need to do is convert your model into TorchScript, so you call torch.jit.script to convert the model, and then when you call the MLflow PyTorch log model function, you can save the scripted model. TorchScript is an optimized version of the model, which can run in a Python-free process, and is what we recommend for models in production. And if you need to load the model, it is again exactly the same command for loading the model, the MLflow PyTorch load model function. We recently launched TorchServe for serving models in production. This was a co-development with AWS. TorchServe comes out of the box with many common handlers for the different use cases, like image segmentation and text classification. You can add your own custom handlers, and you can get started easily with the defaults that are provided. You can serve multiple models on the same model server. It supports model versioning, so you can go back to an earlier version of a model. You can do automatic batching of the inferences. You get all the logging capabilities with the common metrics, and we have support for custom metrics. We've added integrations like SageMaker and Kubernetes, and there's a very robust HTTP API for management. So with the MLflow deployment plugin, we have now created a very easy way to deploy these models as part of your MLflow project itself.
So all you need to do is call the MLflow deployments create and predict commands, and then you can launch the predictions on your model. Both the CLI and Python API versions of this are supported. You can deploy TorchServe either on your local machine or a remote machine, and you can run your inferences once you deploy your models out of the model repository. Okay, so let's dive into a demo. I have the MLflow UI already running on the machine, and we will start an MLflow run so I can show you what the project looks like. We have the MLflow project set up with the parameters that will be called; then all you have to do is call mlflow run to launch it. While we are waiting for the run to execute, let me show you what some of these other runs look like. Just before the presentation, I ran this same version. As you can see, all the parameters are logged automatically when the functions are called, so we include things like batch size, epochs, learning rate, and optimizer name. You get all the artifacts associated with the model: you get the actual model file itself, you get the MLmodel file with all the parameters inside it, and you get the model summary with all the layers. Let's look at some of the more complex models that we have been running across our team. As you can see, there are lots of experiment runs over here; you can select multiple iterations and do the comparison. You can get the contour plot, you can do a parallel coordinates plot, et cetera. And now you can see a new model experiment run got added, and you can see all the parameters that were logged along with the model. So let me walk you through now what the research-to-production cycle at Facebook looks like. We often start a new idea out of a paper.
And now with the integration, it is a lot easier to just start with the code for the paper from Papers with Code. We author the models, then do the training, the evaluation, and a lot of parameter sweeps to get to the optimal version of the model. Once we are satisfied with what the model looks like, we deploy it to a small subset of our users at a small scale, collect the metrics, and analyze those. And once we are happy with the results, then we start the process of productionizing the model, which is exporting to TorchScript, doing all the validations, doing all the performance tuning. Then we get to a TorchScripted version of the model that gets deployed on our C++ inference backend. All of this is now enabled through the MLflow integrations at the different points that you saw: from Papers with Code you can get the code as a starting point; as you're running and building your model, all the model experiment runs get saved on the MLflow experiment tracking server, and you can save the versions of the models in the Model Registry. Once you are ready to optimize and deploy your model to production, you can TorchScript your model and save that in the Model Registry, and finally you can package it, bundle it, and deploy it with the MLflow TorchServe plugin. We are continuing to do more development with MLflow and PyTorch, so in the future you can expect integration for model interpretability with Captum, hyper-parameter optimization with Ax and BoTorch, and many more examples. And here are some references for when you get started: PyTorch 1.7 just came out; the reproducibility checklist would be a good thing to review as you're looking at reproducibility for your own models and workflows; and there is the arXiv Papers with Code blog that talks about the whole process.
You will notice that NeurIPS 2020 has yet another reproducibility challenge with the papers over there; please participate in that. The MLflow PyTorch auto log is available in the MLflow GitHub under mlflow.pytorch, and the deployment plugin is going to be in a separate repo, mlflow/torchserve. We will be releasing a bunch of Medium articles and blog posts to go along with this on the PyTorch Medium site. So now I will open it up for questions; feel free to reach out to me on LinkedIn or over email if you would like any follow-up.
Geeta Chauhan leads AI Partnership Engineering at Facebook AI with expertise in building resilient, anti-fragile, large scale distributed platforms for startups and Fortune 500s. As a core member of the PyTorch team, she leads TorchServe and many partner collaborations for building a strong PyTorch ecosystem and community.
She is the winner of the Women in IT – Silicon Valley – CTO of the Year 2019 award and a trusted advisor to investment firms on technology due diligence during mergers and acquisitions and venture funding, and has led over a dozen due diligence projects across the US and Europe with total investments of ~$200 million.
She is an ACM Distinguished Speaker and thought leader on topics ranging from Ethics in AI and Deep Learning to Blockchain and IoT. She is passionate about promoting the use of AI for Good and mentors startups at CleanTech Open.