Kai Huang is a software engineer at Intel. His work focuses mainly on developing and supporting deep learning frameworks on Apache Spark. He has helped many enterprise customers build optimized end-to-end data analytics and AI solutions on big data platforms. He is a main contributor to the open source big data and AI projects Analytics Zoo (https://github.com/intel-analytics/analytics-zoo) and BigDL (https://github.com/intel-analytics/BigDL).
With the rapid evolution of AI in recent years, we need to embrace advanced and emerging AI technologies to gain insights and make decisions based on massive amounts of data. Ray (https://github.com/ray-project/ray) is a fast and simple framework, open-sourced by UC Berkeley RISELab, designed for building advanced AI applications in a distributed fashion. However, it is not straightforward for Ray to work directly with big data, especially data from real-life production environments. Instead of running big data applications and AI applications on two separate systems, we introduce RayOnSpark, which allows users to run Ray programs directly on big data platforms. In this session, we will discuss our implementation of RayOnSpark in detail. You will gain an intuitive understanding of how to run various emerging AI applications (including distributed training of deep neural networks, scalable AutoML for time series prediction, and distributed reinforcement learning) on Apache Hadoop/YARN clusters using Ray and RayOnSpark. In addition, RayOnSpark allows Ray programs to be seamlessly integrated with Apache Spark data processing pipelines and to run directly on in-memory Spark RDDs or DataFrames, eliminating expensive data transfer overhead between systems.