Enabling a Scalable Data Science Pipeline with MLflow at Thermo Fisher Scientific

Thermo Fisher Scientific has one of the most extensive product portfolios in the industry, ranging from reagents to capital instruments, serving customers across biotechnology, pharmaceuticals, academia, and more. The amount of data needed to construct a comprehensive view of customer needs is massive, and to build a data science ecosystem capable of handling it, every step in that ecosystem, from data engineering through model development to delivery, has to be scalable.

With scalability in mind, Thermo Fisher partnered with Databricks to build an end-to-end data science pipeline that meets CI/CD standards, further augmenting our capabilities with the latest technologies such as MLflow, Spark ML, and Delta Lake. This talk summarizes our journey from the past to the current state, as well as a look ahead at the future of our platform.

Key takeaways:

  • Utilizing big data for machine learning requires not just machine learning knowledge but also the technical infrastructure to support continuous development, deployment, and delivery of machine learning models.
  • How you can build a scalable data science pipeline with the latest Databricks technologies.

About Allison Wu

Thermo Fisher

Allison is a data scientist on the Intelligence Generation team within Thermo Fisher's Data Science Center of Excellence, which establishes data science best practices and drives end-to-end data science model development. She graduated from UCSD with a Ph.D. in Bioinformatics and Systems Biology in 2016 and began her data science journey in Global Strategic Pricing at Thermo Fisher in 2018. Specializing in machine learning, she has developed models in fields such as imaging analysis, pricing optimization, and customer behavior prediction. Beyond developing machine learning models, her current focus is enabling end-to-end data science pipelines, from development and deployment to delivery and management in a production environment, using technologies such as MLflow, PySpark, Delta Lake, and Git.