Xianyang is a software engineer at Intel. He has over 4 years of experience with big data and distributed systems. He is also an active contributor for Apache Spark, Ray, and other distributed frameworks. Now he works on building end-to-end solutions of distributed data preprocessing and model training with Spark and Ray.
May 27, 2021 03:50 PM PT
A large-scale end-to-end data analytics and AI pipeline usually involves data processing frameworks such as Apache Spark for massive data preprocessing, and ML/DL frameworks for distributed training on the preprocessed data. A conventional approach is to use two separate clusters and glue multiple jobs. Other solutions include running deep learning frameworks in an Apache Spark cluster, or use workflow orchestrators like Kubeflow to stitch distributed programs. All these options have their own limitations. We introduce Ray as a single substrate for distributed data processing and machine learning. We also introduce RayDP which allows you to start an Apache Spark job on Ray in your python program and utilize Ray's in-memory object store to efficiently exchange data between Apache Spark and other libraries. We will demonstrate how this makes building an end-to-end data analytics and AI pipeline simpler and more efficient.