Moritz Meister is a Software Engineer at Logical Clocks AB, the developers of Hopsworks. Moritz has a background in Econometrics and is holding MSc degrees in Computer Science from Politecnico di Milano and Universidad Politecnica de Madrid. He has previously worked as a Data Scientist on projects for Deutsche Telekom and Deutsche Lufthansa in Germany, helping them to productionize machine learning models to improve customer relationship management.
Utilizing accelerators in Apache Spark presents opportunities for significant speedup of ETL, ML and DL applications. In this deep dive, we give an overview of accelerator aware task scheduling, columnar data processing support, fractional scheduling, and stage level resource scheduling and configuration. Furthermore, we dive into the Apache Spark 3.x RAPIDS plugin, which enables applications to take advantage of GPU acceleration with no code change. An explanation of how the Catalyst optimizer physical plan is modified for GPU aware scheduling is reviewed. The talk touches upon how the plugin can take advantage of the RAPIDS specific libraries, cudf, cuio and rmm to run tasks on the GPU. Optimizations were also made to the shuffle plugin to take advantage of the GPU using UCX, a unified communication framework that addresses GPU memory intra and inter node. A roadmap for further optimizations taking advantage of RDMA and GPU Direct Storage are mentioned. Industry standard benchmarks and runs on production datasets will be shared.
For the past two years, the open-source Hopsworks platform has used Spark to distribute hyperparameter optimization tasks for Machine Learning. Hopsworks provides some basic optimizers (gridsearch, randomsearch, differential evolution) to propose combinations of hyperparameters (trials) that are run synchronously in parallel on executors as map functions. However, many such trials perform poorly, and we waste a lot of CPU and harware accelerator cycles on trials that could be stopped early, freeing up the resources for other trials.
In this talk, we present our work on Maggy, an open-source asynchronous hyperparameter optimization framework built on Spark that transparently schedules and manages hyperparameter trials, increasing resource utilization, and massively increasing the number of trials that can be performed in a given period of time on a fixed amount of resources. Maggy is also used to support parallel ablation studies using Spark. We have commercial users evaluating Maggy and we will report on the gains they have seen in reduced time to find good hyperparameters and improved utilization of GPU hardware. Finally, we will perform a live demo on a Jupyter notebook, showing how to integrate maggy in existing PySpark applications.