Ray is a prominent compute framework for running scalable AI and Python workloads, offering a variety of distributed machine learning tools, large-scale hyperparameter tuning capabilities, reinforcement learning algorithms, model serving, and more. Similarly, Apache Spark™ provides a wide variety of high-performance algorithms for distributed machine learning through Spark MLlib and deep integrations with machine learning frameworks including XGBoost, TensorFlow, and PyTorch. In order to build the best models, machine learning practitioners frequently need to explore multiple algorithms, often requiring the use of multiple platforms including both Ray and Spark. Today, with the release of Ray version 2.3.0, we are excited to announce that Ray workloads are now supported on Databricks and Spark standalone clusters, dramatically simplifying model development across both platforms.
Create a Ray cluster on Databricks or Spark
To start Ray on your Databricks or Spark cluster, simply install the latest version of Ray and call the ray.util.spark.setup_ray_cluster()
function, specifying the number of Ray workers and the compute resource allocation. Any Databricks cluster with Databricks Runtime version 12.0 or above is supported, as well as any Spark cluster running version 3.3 or above. For example, the following code installs Ray in a Databricks notebook and initializes a Ray cluster with two worker nodes:
# Install Ray with the ‘default’, ‘rllib’, and 'tune' extensions for
# Ray dashboard, reinforcement learning, and tuning support
%pip install ray[default,rllib,tune]>=2.3.0
from ray.util.spark import setup_ray_cluster
setup_ray_cluster(num_worker_nodes=2)
With just a few lines of code, you have created a Ray cluster and are ready to start training models.
Train models with Ray Train and Ray RLlib
Now that you've started a Ray cluster, it's time to harness the power of distributed machine learning to build a model. All Ray applications and Ray-integrated machine learning algorithms are supported on Databricks and Spark clusters without any modifications. For example, you can use the Ray Train API in your Databricks notebook to easily distribute your XGBoost model training, reducing training time and improving model accuracy:
# Install xgboost-ray for distributed XGBoost training on Ray
%pip install xgboost-ray
import pandas as pd
import ray.data
from ray.air.config import ScalingConfig
from ray.train.xgboost import XGBoostTrainer
from sklearn.datasets import fetch_california_housing
housing_dataset = fetch_california_housing(as_frame=True)
housing_df = pd.concat(
[housing_dataset.data, housing_dataset.target], axis=1
)
trainer = XGBoostTrainer(
scaling_config=ScalingConfig(num_workers=2),
label_column="MedHouseVal",
num_boost_round=20,
params={
"objective": "reg:squarederror",
"eval_metric": ["logloss", "error"],
},
datasets={"train": ray.data.from_pandas(housing_df)}
)
training_result = trainer.fit()
Ray also provides native support for reinforcement learning. For example, you can run the following Ray RLlib code in your Databricks notebook to train a PPO reinforcement learning algorithm in the Taxi Gymnasium environment:
from ray.rllib.algorithms.ppo import PPOConfig
config = ( # 1. Configure the algorithm,
PPOConfig()
.environment("Taxi-v3")
.rollouts(num_rollout_workers=2)
.framework("tf2")
.training(model={"fcnet_hiddens": [64, 64]})
.evaluation(evaluation_num_workers=1)
)
algo = config.build() # 2. build the algorithm,
for _ in range(3):
print(algo.train()) # 3. train it,
algo.evaluate() # 4. and evaluate it.
For additional model training information and examples, check out the Ray Train documentation and the Ray RLlib documentation.
Find optimal models with Ray Tune
To improve the quality of your models, you can also leverage Ray Tune to explore thousands of model parameter configurations in parallel at scale. For example, the following code uses Ray Tune to optimize a scikit-learn classification model:
# Install the scikit-learn integration for Ray Tune
%pip install tune-sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from ray.tune.sklearn import TuneGridSearchCV
X, y = load_iris(return_X_y=True)
parameter_grid = {"alpha": [1e-4, 1e-1, 1], "epsilon": [0.01, 0.1]}
tune_search = TuneGridSearchCV(
SGDClassifier(), parameter_grid, max_iters=10
)
tune_search.fit(X, y)
best_model = tune_search.best_estimator
More information and examples about model tuning on Ray, including the use of Ray with MLflow, is available in the Ray Tune documentation.
View the Ray dashboard
Throughout model development, you can monitor the progress of your Ray machine learning tasks and the health of your Ray nodes using the Ray dashboard. When you create your Ray cluster, the ray.util.spark.setup_ray_cluster()
displays a link to the Ray dashboard.
The Ray dashboard provides a comprehensive view of Ray cluster's nodes, actors, metrics, and event logs. You can easily view resource utilization metrics for individual nodes and aggregate metrics across all nodes. For more information about the Ray dashboard, visit the Ray dashboard documentation.
Get started with Ray on Databricks or Spark today
With the availability of Ray 2.3.0, you can start running Ray applications on your Databricks or Spark clusters today. If you're a Databricks customer, simply create a Databricks cluster with version 12.0 or higher of the Databricks Runtime and check out the Ray on Databricks documentation to get started. Finally, instructions for launching Ray on a standalone Spark cluster are provided in the Ray on Spark documentation, and you can visit https://docs.ray.io/en/latest/ to learn more about machine learning on Ray.
We are very excited about this step forward in interoperability for distributed machine learning and look forward to powering your Ray applications on Apache Spark™ and Databricks!