Databricks FAQ

AutoML on Databricks

Automating Machine Learning pipelines at scale

About Hyperopt and MLlib Integrations



Databricks AutoML features are available on both Azure Databricks and Databricks on AWS. To get started, please follow our instructions to sign up for a free trial.

The accuracy benefit from performing hyperparameter tuning depends on the model, hyperparameters, and other factors. You can expect to see the largest gains from initial hyperparameter tuning, with diminishing returns as you spend more time tuning. For example, the jump in accuracy from running Hyperopt with max_evals=50 will likely be much larger than the additional gain from increasing max_evals from 50 to 100.
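For illustration, here is a minimal Hyperopt sketch showing where max_evals fits; the objective below is a toy placeholder rather than real training code:

```python
from hyperopt import fmin, tpe, hp, Trials

def objective(C):
    # Placeholder: train and evaluate a model here and return the validation loss.
    return (C - 0.7) ** 2

search_space = hp.uniform("C", 0.0, 1.0)

trials = Trials()
best = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,
    max_evals=50,   # the first evaluations usually yield the largest accuracy gains
    trials=trials,
)
print(best)
```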

Yes. Our Distributed Hyperopt + MLflow feature applies to single-node machine learning training code and is agnostic to the underlying ML library. Hyperopt can take in a user function containing single-machine scikit-learn, TensorFlow, or other ML code. For distributed machine learning training, consider using Apache Spark MLlib instead, which is automatically tracked in MLflow on Databricks.
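As a hedged sketch, the pattern looks like the following: SparkTrials distributes single-machine scikit-learn training across the cluster, and on Databricks each trial is tracked in MLflow automatically. The dataset and search ranges here are illustrative.

```python
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

def objective(params):
    # Single-machine scikit-learn code; SparkTrials runs many of these in parallel.
    clf = RandomForestClassifier(
        n_estimators=int(params["n_estimators"]),
        max_depth=int(params["max_depth"]),
    )
    score = cross_val_score(clf, X, y, cv=3).mean()
    return {"loss": -score, "status": STATUS_OK}

space = {
    "n_estimators": hp.quniform("n_estimators", 10, 200, 10),
    "max_depth": hp.quniform("max_depth", 2, 10, 1),
}

best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=32, trials=SparkTrials(parallelism=4))
```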

We are in the middle of open-sourcing distributed Hyperopt on Apache Spark via “SparkTrials.” Automated tracking to MLflow remains a Databricks-specific feature.

Conditional hyperparameter tuning refers to tuning in which the search for some hyperparameters depends on the values of other hyperparameters. For example, when tuning regularization for a linear model, one might search over one range of the regularization parameter “lambda” for L2 regularization but a different range of “lambda” for L1 regularization. This technique helps with model search since different models have different hyperparameters. For instance, for a classification problem you may consider choosing between logistic regression and random forests. In the same Hyperopt search, you could test both algorithms, searching over the different hyperparameters associated with each algorithm, for example, regularization for logistic regression and the number of trees for random forests.
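A sketch of such a conditional search space is shown below: the hyperparameters searched depend on which model type Hyperopt picks. The dataset and ranges are illustrative.

```python
from hyperopt import fmin, tpe, hp, STATUS_OK
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# hp.choice selects a model type; each branch carries its own hyperparameters.
space = hp.choice("classifier", [
    {"type": "logreg",
     "C": hp.loguniform("logreg_C", -4, 2)},                        # regularization strength
    {"type": "rf",
     "n_estimators": hp.quniform("rf_n_estimators", 20, 200, 10)},  # number of trees
])

def objective(params):
    if params["type"] == "logreg":
        clf = LogisticRegression(C=params["C"], max_iter=1000)
    else:
        clf = RandomForestClassifier(n_estimators=int(params["n_estimators"]))
    return {"loss": -cross_val_score(clf, X, y, cv=3).mean(), "status": STATUS_OK}

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
```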

Our integrations of MLflow with MLlib and Hyperopt automatically choose the best model and structure runs with a parent-child hierarchy. To be clear, there are two parts to this integration, which handle different aspects: (a) MLflow is used simply for logging and tracking, whereas (b) MLlib and Hyperopt contain the tuning logic that chooses the best model. Therefore, it is MLlib and Hyperopt which compare models, select the best model, and decide how to track models as MLflow runs.
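The run structure itself is plain MLflow. Here is a minimal sketch of the parent-child pattern, with a placeholder training function standing in for the real tuning logic:

```python
import mlflow

def train_and_evaluate(params):
    # Placeholder for real training code; returns a fake validation score.
    return 1.0 - 1.0 / (1 + params["max_depth"])

candidate_params = [{"max_depth": d} for d in (2, 4, 8)]

with mlflow.start_run(run_name="tuning-session"):
    best_score, best_params = float("-inf"), None
    for params in candidate_params:
        with mlflow.start_run(nested=True):   # child run nested under the parent
            mlflow.log_params(params)
            score = train_and_evaluate(params)
            mlflow.log_metric("val_accuracy", score)
            if score > best_score:
                best_score, best_params = score, params
    # The tuning logic (here, this loop; in the integrations, MLlib or Hyperopt)
    # decides which model is best; MLflow only records parameters and metrics.
    mlflow.log_metric("best_val_accuracy", best_score)
    mlflow.log_param("best_max_depth", best_params["max_depth"])
```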

Yes, but the logic for slowing down would need to be in your custom ML code. Some deep learning libraries support decaying the learning rate during training: e.g., https://www.tensorflow.org/api_docs/python/tf/train/exponential_decay.
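As a hedged example, here is the TensorFlow 2.x Keras counterpart of the linked API; the schedule shrinks the learning rate geometrically as training progresses, and the values are illustrative.

```python
import tensorflow as tf

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=1000,   # every 1000 steps...
    decay_rate=0.9,     # ...multiply the learning rate by 0.9
)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
```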

The features will not be logged by default, but you could add custom MLflow logging code to record feature names. To do that, we recommend logging feature names under the main tuning run. If the features form a long list of names, it is best to log them as an MLflow tag or artifact, since those support longer values than MLflow params in Databricks.
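A minimal sketch, with illustrative feature names:

```python
import json
import mlflow

feature_names = ["age", "income", "num_purchases"]   # illustrative names

with mlflow.start_run(run_name="tuning-session"):
    # Short lists fit in a tag...
    mlflow.set_tag("feature_names", ",".join(feature_names))
    # ...longer lists are better logged as an artifact.
    with open("/tmp/feature_names.json", "w") as f:
        json.dump(feature_names, f)
    mlflow.log_artifact("/tmp/feature_names.json")
```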

You can easily install third-party libraries such as Featuretools to automate feature engineering, and log the generated features to MLflow.
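For example, a hedged sketch with Featuretools (parameter names follow the 1.x API and may differ in other versions), logging the generated feature names to MLflow:

```python
import featuretools as ft
import mlflow
import pandas as pd

df = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [10.0, 20.0, 5.0]})

es = ft.EntitySet(id="sales")
es.add_dataframe(dataframe_name="transactions", dataframe=df, index="customer_id")

# Deep Feature Synthesis generates candidate features automatically.
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_dataframe_name="transactions")

with mlflow.start_run():
    mlflow.set_tag("generated_features", ",".join(str(f) for f in feature_defs))
```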

There are many types of transfer learning, so it is hard to give a single answer. The most relevant type for this topic is using the results from tuning hyperparameters for one model to warm-start tuning for another model. MLflow can help with this by providing a knowledge repository of past hyperparameters and performance, helping a user select reasonable hyperparameters and ranges to search over in the future. Currently, this application of past results to new tuning runs must be done manually.
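One manual workflow looks like the following sketch: query earlier tuning runs from MLflow, then center the new search range on the best previous value. The experiment ID, parameter, and metric names are illustrative assumptions.

```python
import math
import mlflow
from hyperopt import hp

# Query past tuning runs, sorted by the validation metric.
runs = mlflow.search_runs(
    experiment_ids=["1"],                      # illustrative experiment ID
    order_by=["metrics.val_accuracy DESC"],
    max_results=5,
)
best_C = float(runs.loc[0, "params.C"])        # assumes a param named "C" was logged

# Warm-start the new search by narrowing the range around the previous best value.
space = {"C": hp.loguniform("C", math.log(best_C / 10), math.log(best_C * 10))}
```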

Not at the moment.

Not at the moment. We will add it if there is enough customer demand.


About Databricks + DataRobot Integration


Yes, the Databricks + DataRobot integration is API-driven, which means you can automate the entire pipeline to prep new data on a schedule and kick off model re-training via Databricks' built-in scheduler and automated Jobs capabilities.
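As a hedged sketch of the scheduling piece, a retraining notebook can be registered as a scheduled job through the Databricks Jobs REST API (2.1 shown here); the workspace URL, token, notebook path, and cluster ID are placeholders.

```python
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<personal-access-token>"

job_spec = {
    "name": "nightly-retraining",
    "tasks": [{
        "task_key": "retrain",
        "existing_cluster_id": "<cluster-id>",
        "notebook_task": {"notebook_path": "/Repos/ml/retrain_model"},
    }],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",   # every night at 02:00
        "timezone_id": "UTC",
    },
}

resp = requests.post(f"{host}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=job_spec)
print(resp.json())
```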

The deployment of the model is not real-time. Can you also show a real-time example?

It should be possible to export the model via the API and deploy it as soon as training is finished.

You can use MLeap to export/import Python models and use them with Databricks.

Yes, the integration API allows a DataFrame to be passed directly to DataRobot. An example notebook is available in our documentation.
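A hedged sketch of the hand-off from a Databricks notebook: the Project.create call is an assumption about the DataRobot Python client, so check your client version's documentation for the exact signature; the table name and credentials are placeholders.

```python
import datarobot as dr

dr.Client(endpoint="https://app.datarobot.com/api/v2", token="<api-token>")

spark_df = spark.table("training_data")   # `spark` is provided in Databricks notebooks
pandas_df = spark_df.toPandas()           # the DataRobot client consumes pandas DataFrames

project = dr.Project.create(sourcedata=pandas_df,          # assumed client call
                            project_name="databricks-handoff")
```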
