Chongguang Liu - Databricks

Chongguang Liu

Technical lead, Société Générale

Chongguang is a technical lead at Société Générale. He has developed and deployed several Spark applications in the bank's production environment, and he has contributed to the Spark project on GitHub. Chongguang currently leads the company's initiative to productionize machine learning. He has installed MLflow on the bank's production infrastructure and built a CI/CD pipeline for deploying ML applications.

PAST SESSIONS

Machine Learning at Scale with MLflow and Apache Spark
Summit Europe 2019

Société Générale is one of the major banks in France and has many data science teams across the globe. After years of exploration and prototyping, it is time for the company to deploy machine learning projects to production at scale.

To achieve that goal, we have been working hard to define a standard process of collaboration between data engineers and data scientists. We have also designed and deployed an infrastructure for productionizing machine learning.

During this presentation, we will cover the following points of our adventure:
1. The difficulties we faced in putting ML applications into production, such as the lack of a model registry, the difficulty of deploying ML libraries to our Hadoop cluster, and collaboration between data scientists and data engineers.
2. How we deployed MLflow as a key technical component in our production Hadoop environment, given various security constraints.
3. How we built a CI/CD pipeline to deploy ML applications automatically. MLflow plays an important role in this pipeline.
4. A first concrete production project built on top of this infrastructure with MLflow, Spark Streaming, scikit-learn and CI/CD.

The key takeaways of this presentation are:
1. To productionize machine learning in a large organization like Société Générale, a process of collaboration must be clearly defined.
2. An ML model registry is key to ML productionization. MLflow is the best solution we found.
3. A CI/CD pipeline is essential to the success of a machine learning application.