Distributed Models Over Distributed Data with MLflow, Pyspark, and Pandas - Databricks

Does more data always improve ML models? Is it better to use distributed ML instead of single node ML?

In this talk I will show that while more data often improves DL models in high-variance problem spaces (with semi-structured or unstructured data) such as NLP, image, and video, more data does not significantly improve results in high-bias problem spaces, where traditional ML is more appropriate. Additionally, even in the deep learning domain, single-node models can still outperform distributed models via transfer learning.
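The high-bias claim can be illustrated with a small learning-curve sketch. Everything in it is an assumption made for illustration and not taken from the talk: the sine-shaped target, the noise level, the straight-line model, and the helper name `linear_test_mse`. It fits a deliberately too-simple model on growing training sets and reports held-out error, which should stay roughly flat once the model's bias dominates.

```python
import numpy as np


def linear_test_mse(n_train, seed=0):
    """Fit a straight line to noisy sin(x) data and return held-out MSE.

    Hypothetical helper for illustration only.
    """
    train_rng = np.random.default_rng(seed)
    test_rng = np.random.default_rng(seed + 1)  # same test set for every n_train

    x_train = train_rng.uniform(0.0, 2.0 * np.pi, n_train)
    y_train = np.sin(x_train) + train_rng.normal(0.0, 0.1, n_train)

    x_test = test_rng.uniform(0.0, 2.0 * np.pi, 5000)
    y_test = np.sin(x_test) + test_rng.normal(0.0, 0.1, 5000)

    # A degree-1 polynomial cannot represent a sine wave: a high-bias model.
    slope, intercept = np.polyfit(x_train, y_train, deg=1)
    predictions = slope * x_test + intercept
    return float(np.mean((predictions - y_test) ** 2))


# Held-out error for training sets spanning three orders of magnitude.
learning_curve = {n: linear_test_mse(n) for n in (100, 1_000, 10_000, 100_000)}
for n, mse in learning_curve.items():
    print(f"n_train={n:>7}  test MSE={mse:.4f}")
```

In this toy setting the error is bounded below by what a straight line can express, so adding rows changes little; a more flexible model would be needed to benefit from the extra data.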

Data scientists have several pain points: running many models in parallel, automating the experimental setup, and getting others (especially analysts) within an organization to use their models. Databricks solves these problems using pandas UDFs, the ML Runtime, and MLflow.
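A minimal sketch of the "many models in parallel" pattern follows. It is not the speaker's code: the column names (`group`, `x`, `y`), the toy data, and the function `fit_group` are hypothetical. The function takes one group's rows as a pandas DataFrame and returns a one-row summary, which is the shape PySpark's grouped `applyInPandas` expects. To stay runnable without a cluster, the sketch applies it locally with a pandas `groupby`.

```python
import numpy as np
import pandas as pd


def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Fit one least-squares line y ~ x for a single group; return one summary row."""
    slope, intercept = np.polyfit(pdf["x"], pdf["y"], deg=1)
    residuals = pdf["y"] - (slope * pdf["x"] + intercept)
    return pd.DataFrame(
        {
            "group": [pdf["group"].iloc[0]],
            "slope": [float(slope)],
            "intercept": [float(intercept)],
            "mse": [float(np.mean(residuals ** 2))],
        }
    )


# Toy data (an assumption): two groups whose true slopes are 2.0 and -1.0.
rng = np.random.default_rng(0)
data = pd.DataFrame(
    {
        "group": np.repeat(["a", "b"], 100),
        "x": rng.uniform(0.0, 10.0, 200),
    }
)
true_slope = np.where(data["group"] == "a", 2.0, -1.0)
data["y"] = true_slope * data["x"] + rng.normal(0.0, 0.1, 200)

# Local stand-in for the distributed step: one independent model per group.
results = pd.concat(
    [fit_group(group_df) for _, group_df in data.groupby("group")],
    ignore_index=True,
)
print(results)

# On a Spark DataFrame with the same columns, the equivalent call would be:
#   spark_df.groupBy("group").applyInPandas(
#       fit_group,
#       schema="group string, slope double, intercept double, mse double",
#   )
```

Because each group is fitted independently, Spark can schedule the groups across executors while each individual model still trains on a single node. Per-group parameters and metrics could also be logged to MLflow from inside the function, which is omitted here to keep the sketch dependency-light.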

About Thunder Shiviah

Databricks Solutions Architect and ex-McKinsey Machine Learning Engineer focused on productionizing machine learning at scale.