50k runs, millions of metrics, parameters, and tags, with bursts at 20k QPS: that is the volume of data managed by our MLflow tracking servers this year at Criteo. In this talk, you will learn how we set up a shared instance of MLflow at company scale. We will present our contributions to the SQLAlchemyStore that keep it responsive at this scale, and show how we turned MLflow into a production-ready system: how we scaled a shared instance horizontally on a Mesos cluster, how we monitor it with Prometheus, how we integrated it with the company Single Sign-On (SSO) authentication, and how our data scientists register their runs from the largest Hadoop cluster in Europe.
Speaker: Jean-Denis Lesage
I am a software engineer at Criteo AI Lab. I hold a PhD in parallel computing from Grenoble University (France). My main interests are high-performance computing and distributed application development. I joined Criteo in 2018. My team develops tools to ease machine learning experimentation.