MLflow at Company Scale


50k runs, millions of metrics, parameters and tags, with bursts of up to 20k QPS. That's the volume of data managed by our MLflow tracking servers this year at Criteo. In this talk, you will learn how we set up a shared instance of MLflow at company scale. We will present our contributions to the SQLAlchemyStore to make it responsive at this scale, and how we turned MLflow into a production-ready system: how we scaled a shared instance horizontally on a Mesos cluster, our monitoring system based on Prometheus, integration with the company Single Sign-On (SSO) authentication, and how our data scientists register their runs from the largest Hadoop cluster in Europe.

Speaker: Jean-Denis Lesage


– Hello, my name is Jean-Denis. I'm the lead of this project at Criteo, and today I'll explain to you how we set up a shared instance of MLflow at Criteo. First I will explain why we chose MLflow, and after that we'll go into the technical details: how we set up the instance, and how we monitor the runs. Then I will present some additions we developed this year: UI additions, and some plugins. So, Criteo is an AdTech company and we have used machine learning a lot for years now. Here are some figures about our usage of machine learning at Criteo: thousands of models in production, a lot of predictions, with strict latency constraints. We generate one petabyte of logs every day. And to improve our models, we do offline experiments; I will talk about these offline experiments in this presentation. We developed our own framework for machine learning a few years ago, and because we have more use cases, and we want to try other machine learning techniques, like deep learning, we moved to MLflow. So why MLflow? The first reason is that it's multi-framework. With MLflow you can use TensorFlow, PyTorch, scikit-learn, and we can even integrate our own framework. Another point is that it can run on any kind of infrastructure. At Criteo, we run our own infrastructure with a Hadoop cluster and Mesos clusters, so MLflow must run on this. Another good point with MLflow is that it is extensible with plugins, and it is open source, so it lowers the risk: if there is something missing or we want to add a feature, we can contribute. So we set up the MLflow instance; we started the project in June last year. At the beginning we identified a risk, because we wanted to use it as a shared service, and MLflow was not designed for this usage. It is more one MLflow per project, or one MLflow per team. I will explain to you why we wanted one central service.
It was a discussion we had during the kickoff of our project. Sometimes one MLflow per team or project is the better choice. But we selected this central service because, first, we can factorize the maintenance of the service: there is one team, my team, running the service, and the users just use it. They don't have to configure or set it up. Another important point: we think it improves collaboration and communication between teams, because there is one single place to store all machine learning results. Everyone can see them, it's a way for teams to communicate, and we really think it improves collaboration. But there are some drawbacks with this choice. First, there are more constraints on the service: more users, more queries, so we have to scale up the service. And there is no isolation, which means one team can shut down the service and impact the whole company. More specifically, about how we set up MLflow at Criteo: we have a big Hadoop cluster for all our offline jobs, so all machine learning jobs must run on our Hadoop cluster. The storage for the artifacts is HDFS, the filesystem of the Hadoop ecosystem. For long-running applications, we use an Apache Mesos cluster, so our MLflow servers are running on Mesos, using Gunicorn. And the tracking server uses the SQL Tracking Store, more specifically MariaDB. After this introduction, I will explain to you how we scaled this shared MLflow instance. One year ago, with the first versions of MLflow, we had a lot of stability issues. It was quite easy to set up the service, but with the first users, we had some interruptions of service. Here I show you some pictures from our private Slack channel, and you can see, every day, "MLflow is down", or "is there an issue with SQL?" In the first versions of MLflow, when something went wrong, there was this Niagara Falls picture, so, well, we saw a lot of this.
So the service was very unstable at the beginning. Basically, in the first version of the SQL store, everything was done in the Python server. When the browser requested a few runs, say 100, the service needed to select all the runs from MariaDB, so a lot of rows, do the filtering in Python, which is quite slow, and the pagination also in Python. So it loaded tons of rows just to return 100. With this design there are several issues. First, it's slow, because everything is done in Python. And it's unstable, because all the runs of one experiment must be loaded in memory, so there is a risk of an out-of-memory error, and basically our outages were caused by these out-of-memory exceptions. So now the problem was identified, and we tried to solve it. The solution to scale the SQL store was to move filtering and pagination inside the SQL queries. So we did this first pull request, which moved all this Python code into SQLAlchemy queries, and it solved our memory issue. Basically, after this pull request, we had no more instability in the server, and it also sped up the server a lot: 10 times on our use cases. But the problem with joining everything in SQL is the MLflow data model: you have one table for runs, one table for metrics, one for parameters, one for tags, and you have to join everything to do a query, so it can consume a lot of memory. So there were also other pull requests, done by MLflow core developers, to reduce the memory usage, and basically the idea is to load the attributes lazily. Another example of an optimization we did on the SQLAlchemyStore was the latest metrics. In MLflow, every metric has a timestamp, and you can have time series, but most of the time in the UI you only see the latest metric, and the computation of the latest metric was done in Python. So, same issue as before: it was slow.
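The idea behind that first pull request can be sketched in a few lines. This is not MLflow's actual code, just a minimal illustration (using SQLite and an invented `runs` table) of why pushing filtering and pagination into SQL avoids loading every run into Python memory:

```python
# Sketch: push filtering and pagination into SQL instead of doing them in Python.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (run_id INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?)",
    [(i, "FINISHED" if i % 2 == 0 else "FAILED") for i in range(10_000)],
)

# Before: fetch everything, then filter and paginate in Python (slow, memory-heavy).
def search_runs_python(page_size=100, offset=0):
    rows = conn.execute("SELECT run_id, status FROM runs ORDER BY run_id").fetchall()
    finished = [r for r in rows if r[1] == "FINISHED"]
    return finished[offset:offset + page_size]

# After: the database filters and paginates; only one page is materialized.
def search_runs_sql(page_size=100, offset=0):
    return conn.execute(
        "SELECT run_id, status FROM runs WHERE status = ? "
        "ORDER BY run_id LIMIT ? OFFSET ?",
        ("FINISHED", page_size, offset),
    ).fetchall()

assert search_runs_python() == search_runs_sql()
```

In the real store the same shift happens inside SQLAlchemy query construction rather than raw SQL, but the effect is identical: the server's memory use no longer grows with the size of the experiment.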
So we moved the computation into SQL; that's the first pull request. But afterwards there was another pull request that improved this solution, done by MLflow core developers: because these latest metrics were used a lot in the UI, it was meaningful to create a dedicated table, and that brings an important speedup. More recently we did another optimization, and this one is not yet merged in MLflow. We have some experiments with a lot of columns, where a column is a metric, a tag, or a parameter. But most of the time, a user only needs a few of them; for example, if they want a plot, they only need two of these columns. So we introduced a new parameter in the search_runs function to whitelist the columns whose values we want. Basically, this change modifies the complexity of the method: the complexity becomes proportional to the number of requested columns. There's a plot: if you ask for 10 columns, you get your result immediately, and if you want all the columns, it takes seconds. So it improves the user experience with MLflow a lot. Here's an example of how we implemented this feature. The problem is that in SQLAlchemy you cannot fetch only a subset of attributes, so you have to create the SQL run object by hand. Basically, you first filter on the requested columns, then combine the result with the run, grouping by run ID, and for every run you recreate the SQL object. And basically, that's why this feature has not been merged: we have to bypass the ORM API, and that is probably not very easy to maintain.
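A simplified sketch of that column-whitelist idea, using SQLite and a toy `metrics` table rather than MLflow's real schema or ORM: only the requested metric keys are fetched from SQL, and the lightweight run objects are rebuilt by hand, so the cost grows with the number of requested columns, not the total number of columns:

```python
# Sketch: whitelist columns in search_runs so only requested keys leave the DB.
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (run_id TEXT, key TEXT, value REAL)")
conn.executemany("INSERT INTO metrics VALUES (?, ?, ?)", [
    ("run1", "loss", 0.3), ("run1", "auc", 0.91), ("run1", "lr", 0.01),
    ("run2", "loss", 0.2), ("run2", "auc", 0.93), ("run2", "lr", 0.01),
])

def search_runs(columns):
    """Fetch only whitelisted metric keys, then recreate run objects by hand."""
    placeholders = ",".join("?" * len(columns))
    rows = conn.execute(
        f"SELECT run_id, key, value FROM metrics WHERE key IN ({placeholders})",
        columns,
    )
    runs = defaultdict(dict)          # run_id -> {metric key: value}
    for run_id, key, value in rows:   # group by run ID, rebuild each run
        runs[run_id][key] = value
    return dict(runs)

print(search_runs(["loss"]))  # → {'run1': {'loss': 0.3}, 'run2': {'loss': 0.2}}
```

The hand-rebuilding step is the part the talk flags as hard to maintain: it sidesteps the ORM, which is why the change stayed in Criteo's fork.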
We have a Prometheus scraper that reads this endpoint, gets the metrics, and exports them to Grafana, and we can also add alerts. Basically, with this endpoint, we monitor the number of calls to each endpoint, plus the latency of these endpoints, and this enables us to implement our SLA. We check the availability, and it lets us understand the performance and the latency. You can also see the disk usage of our SQL server, plus the periodic jobs; I'll explain later what these periodic jobs are. But before that, another important choice we made was to reimplement the Gunicorn application. There is one provided with MLflow, but we decided to implement our own. It's quite easy: on this slide there is the minimal code to create your own Gunicorn application. But it brings a lot of value, because, first, we can implement hooks in Gunicorn, for example to close SQL connections cleanly on exit. Also, because we have access to the Flask application, we can extend it, add new endpoints with some Criteo business logic, and activate some Flask extensions, like authentication. I will explain later how we added authentication; it is also a way to create JavaScript applications that are able to query the MLflow API, so it's nice for extending the UI too. Another point is that we would like to automate things as much as possible, and remove all human action on production servers. One case was database migration: if, in the code, there is a change in the database schema, we have to run a migration script. There is a command for this in MLflow, and we would like the servers themselves to run it. The problem is that we have to guarantee that only one server is running the command. That's why we implemented a mutex in SQL, and it guarantees that only one server will do the migration. Because everything is automatic, we have to test before we migrate.
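The talk doesn't show the mutex code, but the idea can be sketched with a lock table whose primary-key constraint makes acquisition atomic. This illustration uses SQLite for portability; Criteo's setup was MariaDB, where a similar lock table (or `GET_LOCK()`) would play the same role:

```python
# Sketch: a SQL-based mutex so only one server runs the schema migration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE locks (name TEXT PRIMARY KEY)")

def try_acquire(conn, name):
    """Atomically insert the lock row; a second caller fails and backs off."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute("INSERT INTO locks (name) VALUES (?)", (name,))
        return True
    except sqlite3.IntegrityError:  # another server already holds the lock
        return False

def release(conn, name):
    with conn:
        conn.execute("DELETE FROM locks WHERE name = ?", (name,))

if try_acquire(conn, "db_migration"):
    # only the winning server would run the migration command here,
    # then release the lock for the next deployment
    release(conn, "db_migration")
```

Because every server talks to the same database, the database itself arbitrates which one wins, with no extra coordination service. The same lock is what the periodic maintenance jobs reuse later in the talk.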
So we have some integration tests based on Docker: in Docker, we set up a SQL server and try the migration. We also have a preprod cluster, so every day we deploy to the preprod cluster and check that the servers start. The benefit is that we can do continuous deployment, at least one release per week; we deploy the master branch of MLflow. And there is no more human action on production servers. We also have maintenance jobs, for example jobs that clean the database, reduce the number of files on HDFS using Hadoop archives, or kill stuck jobs. We want to automate all these jobs, so it's also our MLflow servers that run them periodically; we reused the mutex we implemented for migration to run these jobs. So we have periodic jobs that do the maintenance automatically. The last point is how we integrated authentication in MLflow. We did the integration with our SSO. It's quite difficult to do a pull request with this feature, because there is a lot of Criteo-specific code, but I will explain the steps to add authentication to your service. First, you have to manage a token in the API; in our case, we use a JWT token. You must also add a new endpoint in MLflow that redirects to the SSO service, so that a user who doesn't have a valid token can get a new one. Then the JavaScript must embed the token, kept in the session, as the user's token. The change in JavaScript is quite light. First, we store the token in local storage. If there is a token in local storage, we add it to the request payload; but if there is an authentication error from the servers, a 401, we redirect to the SSO to get a valid token. The SSO asks the user to log in through its form, and if the login is okay, we get back a valid JWT token.
As I said at the beginning, we use a Hadoop cluster for our machine learning jobs, and the Python or Java clients run on this Hadoop cluster. On the Hadoop cluster, authentication is done with Kerberos. So we need a mechanism to translate the Kerberos token into a JWT token to talk to MLflow. We changed the MLflow client to be a Kerberos client, so it can negotiate with the KDC, the Kerberos authentication server, and authenticate the user with Kerberos. With a valid Kerberos authentication, we call an internal service called JTC. This service is protected by Kerberos, but it can generate JWT tokens. So with a valid Kerberos token, I can do an HTTP/SPNEGO call to this service; if the call is valid, it returns a JWT token, and I can reuse this token with MLflow. This design is very nice because it also works with non-human users, like service accounts: because they have a valid Kerberos token, we can translate it into a JWT token, so we don't need a password or anything like that in this case. So, last part: I will show you some additions we did this year to MLflow. The first one is a UI one. It's about making it easier to write queries to filter runs. There is a query syntax in MLflow, but what we did is activate the search boxes in the grid; it's just a boolean to enable them. The problem is that if you just enable the boxes, the filtering is done in your browser, client-side, and what we want is to filter on the server side. So we changed the actions of these boxes to generate a query: a user can just type in a box, and it will automatically generate a query in the search textbox and trigger a search on the server side. Then we also developed several plugins, two plugins this year. The first one is mlflow-yarn.
So MLproject is a way to package your project in MLflow, and afterwards you can run it on several types of clusters. In MLflow, only a few execution backends are supported: local execution, Docker and Kubernetes, and Databricks if you have a Databricks account. In our case, we want to run on a Hadoop cluster; that's why we created this execution plugin to submit an MLproject to a Hadoop cluster. It's based on skein, a Python package developed by Jim Crist that can communicate with YARN, the resource manager on Hadoop. And on cluster-pack: cluster-pack is another project we developed at Criteo; there was a presentation about it this year. Basically, cluster-pack enables you to ship your Python dependencies onto a cluster. It's based on pex and conda-pack, it guarantees everything is consistent on your cluster, and there are some extra features like caching. So if you have a Hadoop cluster, you can test it by just doing pip install mlflow-yarn. Afterwards, in an mlflow run command, you can pass '-b yarn' and the project will be submitted to your Hadoop cluster. The second plugin we developed this summer, during an internship, is one related to Elasticsearch. We tested replacing the SQL server with Elasticsearch, and basically we had good first results, as you can see in the plot, but it's still experimental support. We don't use it yet in production, but we are open to contributions, so please visit the GitHub repository, there are some open tickets. You can test it, and if you want to help, you're welcome. So, conclusion. First, a summary of the code organization. We have a fork of MLflow; we deploy the criteo-master branch in production, and this branch contains the contributions I explained in this presentation, so you can test it. We also have a private git repository with our Gunicorn application configuration and everything that is Criteo-specific.
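For context on how such an execution backend plugs in: MLflow discovers project backends through a setuptools entry point in the `mlflow.project_backend` group. The sketch below shows the shape of the wiring; the module and function names are illustrative, not necessarily mlflow-yarn's actual ones:

```python
# Sketch of a setup.py registering an MLflow project execution backend.
# `mlflow run -b yarn` looks up the "yarn" name in the
# "mlflow.project_backend" entry-point group and calls the builder.
from setuptools import setup

setup(
    name="mlflow-yarn",
    packages=["mlflow_yarn"],
    install_requires=["mlflow", "skein", "cluster-pack"],
    entry_points={
        "mlflow.project_backend": [
            # hypothetical module path and builder function name
            "yarn=mlflow_yarn.yarn_backend:yarn_backend_builder",
        ]
    },
)
```

Once the package is installed, `mlflow run . -b yarn` resolves the backend by name, which is why a plain `pip install mlflow-yarn` is all a user needs.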
And we have our two plugin repositories. Just a final conclusion. We scaled up the SQLAlchemyStore by delegating as much computation as possible to SQL. We tried to automate as much as possible, to remove all human action on the production servers. Creating your own Gunicorn app is a very nice way to extend MLflow. MLflow is open: you can contribute, you can extend the Flask application, you can create your own plugins, so it's very extensible. It was an exciting journey to set up this instance; we covered a lot of topics. It was very fun, and it was great to collaborate with the community. And that's all. Thank you.

About Jean-Denis Lesage


I am a software engineer at Criteo AI Lab. I hold a PhD from Grenoble University (France) on Parallel Computing. My main interests are high performance computing and distributed applications development. I joined Criteo in 2018. My team develops tools to ease Machine Learning experimentation.