FAQ – Databricks

Machine Learning FAQ

Databricks Runtime ML is built on top of Databricks Runtime and inherits all of its capabilities. It is our recommended environment for developing machine learning and deep learning applications, and it offers a pre-configured, optimized set of popular libraries and frameworks.

Getting started is easy. Select a version of Databricks Runtime ML from the drop-down list when you create a new cluster in Databricks. You can then start developing ML applications right away, using the preinstalled ML libraries.
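For teams that script cluster creation instead of using the drop-down list, the same choice is typically expressed as the `spark_version` field of a cluster specification. The sketch below only builds such a request body and does not call any service. The helper name `build_ml_cluster_spec`, the runtime key `5.3.x-ml-scala2.11`, and the node type `i3.xlarge` are illustrative assumptions; check the runtime keys and node types that your own workspace lists before using them.

```python
import json


def build_ml_cluster_spec(cluster_name, runtime_key, node_type_id, num_workers):
    """Build a cluster specification that selects a Databricks Runtime ML version.

    Hypothetical helper: it only assembles a dictionary. The caller is
    responsible for sending it to the workspace and for supplying a
    runtime key that the workspace actually offers.
    """
    if num_workers < 0:
        raise ValueError("num_workers must be zero or greater")
    return {
        "cluster_name": cluster_name,
        # The runtime key plays the role of the drop-down selection in the UI.
        "spark_version": runtime_key,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
    }


# Placeholder values, assumed for illustration only.
spec = build_ml_cluster_spec(
    cluster_name="ml-experiments",
    runtime_key="5.3.x-ml-scala2.11",
    node_type_id="i3.xlarge",
    num_workers=2,
)

if __name__ == "__main__":
    print(json.dumps(spec, indent=2))
```

Keeping the runtime key as a parameter makes it easy to move the same specification to a newer Databricks Runtime ML release.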

Databricks Runtime ML includes a variety of popular machine learning packages, which are updated regularly to include new features and fixes. A subset of popular packages is designated as top-tier. For these packages, Databricks provides a faster update cadence, updating to the latest upstream release with each Runtime release (barring dependency conflicts). Databricks also provides advanced support, testing, and embedded optimizations for top-tier packages.

Databricks Runtime 5.3 for Machine Learning includes the following packages:

Top-tier packages:

  • TensorFlow / TensorBoard
  • spark-tensorflow-connector
  • PyTorch
  • Horovod / HorovodRunner
  • GraphFrames

Other provided packages:

  • Keras
  • spark-xgboost
  • MLeap
  • scikit-learn
  • Pandas
  • Deep Learning Pipelines for Apache Spark
  • TensorFrames
  • Etc.

You can find a detailed list of libraries and their versions in the release notes.

    You can sign up for the free 14-day standard trial or for our Community Edition at http://databricks.com/try.

    Yes. You can install additional libraries by following the steps in our documentation. We are also working to let users install Python packages via Conda, the industry-standard Python package manager that we use in Databricks Runtime ML.

    There is no difference. Our strategy is to provide a seamless experience across clouds, so we are committed to product parity across Azure and AWS.

    There are two primary reasons. First, since we introduced Databricks Runtime for Machine Learning in preview in June 2018, we’ve witnessed exponential adoption in terms of both total workloads and the number of users. Close to 1,000 organizations have tried Databricks Runtime ML preview versions over the past ten months, and total workloads more than tripled in the first two months of 2019. Second, to meet the rapidly growing demand, we continued to improve our integration and testing to arrive at a robust cadence of updating and adding packages in Databricks Runtime ML. The combination of positive feedback and a robust release cadence led us to make Databricks Runtime for Machine Learning generally available (GA) for our customers.

    There are three key benefits to Databricks Runtime for Machine Learning: usability, reliability, and performance.

    First, all of the packages identified in “Supported Packages in DBR ML” come pre-configured in Databricks Runtime ML. Users can start developing machine learning applications right away without having to configure the environments themselves. In addition, we are constantly evaluating popular ML libraries and looking for ways to make them easier to use, especially in a distributed system. HorovodRunner, which we released in Databricks Runtime 5.0 ML, is one example.
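As a rough illustration of the usability point, the sketch below hands a training function to HorovodRunner when the `sparkdl` package is importable, and otherwise runs the same function in the local process. The names `train` and `run_distributed` are hypothetical, and the training function is a plain-Python placeholder that minimizes a one-variable quadratic; a real distributed job would initialize Horovod and build a model inside `train`. The behavior of the returned value on a cluster is an assumption that should be checked against the HorovodRunner documentation for your runtime version.

```python
def train(learning_rate=0.1, steps=50):
    """Placeholder training loop: gradient descent on f(w) = (w - 3)^2.

    A real HorovodRunner job would call Horovod's init function here and
    train a deep learning model instead of this toy objective.
    """
    w = 0.0
    for _ in range(steps):
        gradient = 2.0 * (w - 3.0)
        w -= learning_rate * gradient
    return w


def run_distributed(main, num_workers, **kwargs):
    """Run `main` through HorovodRunner if available, else locally.

    Hypothetical wrapper: outside Databricks Runtime ML the import is
    expected to fail, and the function simply runs in this process.
    """
    try:
        from sparkdl import HorovodRunner
    except ImportError:
        return main(**kwargs)
    # np is the number of parallel worker processes to launch.
    return HorovodRunner(np=num_workers).run(main, **kwargs)


if __name__ == "__main__":
    print(run_distributed(train, num_workers=2, learning_rate=0.1, steps=50))
```

The point of the pattern is that the training function itself stays an ordinary Python function, so it can be developed and debugged on a single machine before it is scaled out.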

    Second, Databricks Runtime ML provides a stable environment for developing ML applications. Databricks engineering runs daily integration tests against Databricks Runtime ML and stress-tests every library before adding it to, or updating it in, Databricks Runtime ML. We have continued to expand our test suites and have incorporated feedback from almost 1,000 organizations to make ML workflows run smoothly.

    Lastly, Databricks Runtime ML includes performance improvements beyond what is available in “off the shelf” open-source versions of several libraries. When running Apache Spark MLlib logistic regression and tree classifiers in Databricks Runtime for ML, we observed a speed-up of about 40% in Spark Performance Tests compared to Apache Spark 2.4.0. In addition, GraphFrames in Databricks Runtime ML runs 2-4 times faster and supports even bigger graphs than open-source GraphFrames. These performance improvements are available only in Databricks.

    Each Databricks Runtime for ML supports both Python 2 and Python 3. Users simply select the Python version from the drop-down list when they create a new cluster. We expect to phase out support for Python 2 over time, but not before Databricks Runtime 6.0 ML (expected to be released in the second half of 2019). Databricks Runtime ML versions before 6.0 will continue to support both Python 2 and Python 3.

    As of Databricks Runtime 5.3 ML, CUDA Toolkit 9.2 is installed. The preinstalled GPU-accelerated libraries include cuDNN 7.2.1 and NCCL 2.2. Databricks Runtime 5.3 ML also includes GPU versions of TensorFlow, XGBoost, and PyTorch, which use the GPU as an accelerator to increase performance.
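One way to confirm that the GPU builds can actually see a device is to ask each framework directly. The sketch below is a minimal check, assuming it runs in a notebook attached to a GPU cluster; `gpu_report` is a hypothetical helper name. It uses `torch.cuda.is_available()` and `tf.test.is_gpu_available()`, the latter being the TensorFlow 1.x-era call, and records `None` for any framework that is not installed so the same code also runs on a machine without these libraries.

```python
import importlib


def gpu_report():
    """Report whether PyTorch and TensorFlow can see a GPU.

    Values are True or False when the framework is installed, and None
    when the framework cannot be imported in the current environment.
    """
    report = {}

    try:
        torch = importlib.import_module("torch")
        report["pytorch"] = bool(torch.cuda.is_available())
    except ImportError:
        report["pytorch"] = None

    try:
        tf = importlib.import_module("tensorflow")
        # Deprecated in TensorFlow 2.x, but matches the 1.x releases
        # that shipped with runtimes of this generation.
        report["tensorflow"] = bool(tf.test.is_gpu_available())
    except ImportError:
        report["tensorflow"] = None

    return report


if __name__ == "__main__":
    for framework, has_gpu in sorted(gpu_report().items()):
        print(framework, has_gpu)
```

A `False` on a GPU cluster usually points to a mismatch between the selected runtime (CPU versus GPU) and the instance type, which is worth checking before debugging the model code.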

    Currently, Databricks Runtime ML does not support Docker. However, Docker support is an active area of development, and we expect to roll out this feature in Q3 2019.
