Thousands of data science jobs are going unfilled today as global demand for data science talent greatly outstrips supply. Every day, businesses pay the price of the data scientist shortage in missed opportunities and slow innovation. For organizations to realize the full potential of machine learning, data teams have to build hundreds of predictive models a year. Most enterprises deliver only a fraction of that number because their data science teams are understaffed.
Databricks can help data science teams be more productive by automating various steps of the data science workflow – including feature engineering, hyperparameter tuning, model search, and deployment – for a fully controlled and transparent augmented ML experience. This goes well beyond just automated model search, which is commonly referred to as AutoML.
Today's blog summarizes new and existing capabilities available on the Unified Analytics Platform for users at all levels of expertise, specifically:
Databricks Labs is a collection of projects created by engineers in the field to solve problems we see over and over again with our customers. With the AutoML Toolkit, the goal is to automate the building of ML pipelines, from feature transformations to hyperparameter tuning, model search, and finally inference, while still providing fine-grained control over the process.
This Databricks Labs project is an experimental end-to-end supervised learning solution for automating:
This solution can be used without writing code, or fine-tuned by experts as they see fit.
Data scientists looking to accelerate their workflows can also benefit from deeper integrations between Hyperopt, MLlib, and MLflow in the Databricks Runtime for ML for optimized and distributed hyperparameter and model search.
See for example how to track the results from hyperparameter tuning at scale on Databricks with enhanced Hyperopt and MLflow integration:
https://www.youtube.com/watch?v=b2KxgBjpe8M
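To make the idea of a combined hyperparameter and model search with tracked results more concrete, here is a minimal sketch in plain Python. It deliberately does not use Hyperopt or MLflow: it runs a simple random search over a joint space of model families and their hyperparameters, and keeps one record per trial. The search space, the `validation_loss` scoring function, and all names below are hypothetical stand-ins chosen for illustration. In practice, a library such as Hyperopt would decide which configuration to try next, and a tracking tool such as MLflow would store the per-trial records.

```python
import random

# Hypothetical search space: two model families, each with one hyperparameter range.
SEARCH_SPACE = {
    "ridge": {"alpha": (0.001, 10.0)},
    "tree": {"max_depth": (1, 12)},
}


def sample_config(rng):
    """Draw one model family and one setting of its hyperparameter."""
    model = rng.choice(sorted(SEARCH_SPACE))
    if model == "ridge":
        low, high = SEARCH_SPACE["ridge"]["alpha"]
        return {"model": "ridge", "alpha": rng.uniform(low, high)}
    low, high = SEARCH_SPACE["tree"]["max_depth"]
    return {"model": "tree", "max_depth": rng.randint(low, high)}


def validation_loss(config):
    """Toy stand-in for training a model and scoring it on held-out data."""
    if config["model"] == "ridge":
        return 0.30 + (config["alpha"] - 1.0) ** 2 / 100.0
    return 0.20 + abs(config["max_depth"] - 6) / 20.0


def run_search(n_trials, seed=0):
    """Random search: sample, score, and record every trial; return the best one."""
    rng = random.Random(seed)
    trials = []
    for trial_id in range(n_trials):
        config = sample_config(rng)
        loss = validation_loss(config)
        # One record per trial, analogous to logging params and a metric per run.
        trials.append({"trial": trial_id, "params": config, "loss": loss})
    best = min(trials, key=lambda t: t["loss"])
    return best, trials


if __name__ == "__main__":
    best, trials = run_search(n_trials=50)
    print(f"ran {len(trials)} trials")
    print(f"best: {best['params']} loss={best['loss']:.3f}")
```

Because every trial is recorded with its parameters and loss, the full search history can be inspected afterwards rather than only the winning configuration, which is the property that makes tracked tuning useful at scale.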
Here are some additional resources to learn more:
More advanced users also have the ability to run all AutoML steps on Databricks, from ETL to model training and inference, by leveraging the extensibility and built-in optimizations of the Unified Analytics Platform with popular open source libraries.
The Databricks Runtime for ML also provides a reliable and secure distribution of the most popular open source ML frameworks (e.g. TensorFlow, Keras, PyTorch, XGBoost, and scikit-learn) with out-of-the-box optimizations, integration with Horovod for distributed deep learning, and integration with MLflow for built-in experiment tracking and visualization of hyperparameter tuning results.
Below are additional resources to dive deeper:
Watch "Automating Predictive Modeling at Zynga with Pandas UDFs" for an example of a custom solution running on Databricks.
Visit https://www.databricks.com/product/automl to learn more and start a free trial of Databricks.