Apache Arrow-Based Unified Data Sharing and Transferring Format Among CPUs and Accelerators

CPU technologies have scaled well over the past years through more complex architecture designs, wider execution pipelines, more cores per processor, and higher frequencies. Accelerators, however, deliver more computational power and higher throughput at lower cost in their dedicated domains, which has led to growing use in Spark. Yet when accelerators are integrated into Spark, a common outcome is that the huge performance gains promised by micro-benchmarks translate into little actual performance boost.

One reason is the cost of transferring data between the JVM and the accelerator. The other is that the accelerator lacks information about how it is used in Spark. In this research, we investigate using an Apache Arrow-based dataframe as the unified data sharing and transfer format between the CPU and accelerators, and we make the hardware and software stack dataframe-aware by design. In this way, Spark and accelerator designs integrate seamlessly and get close to the promised performance.
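To make the idea concrete, here is a minimal sketch using the Apache Arrow Java API (not code from the talk): it builds a columnar vector in off-heap memory and exposes the raw buffer address. Since Arrow defines a standard, language-neutral columnar layout, an accelerator driver could in principle consume this buffer directly instead of paying JVM serialization costs; the address/length hand-off at the end is a hypothetical integration point, not an existing Arrow device API.

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class ArrowShareSketch {
    public static void main(String[] args) {
        // Arrow vectors live in off-heap memory in a standardized columnar
        // layout, so they are not subject to JVM object serialization.
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             IntVector values = new IntVector("values", allocator)) {

            values.allocateNew(4);
            for (int i = 0; i < 4; i++) {
                values.set(i, i * 10);  // populate on the CPU/JVM side
            }
            values.setValueCount(4);

            // Raw off-heap address and length of the data buffer. A native
            // accelerator runtime could read (or DMA) from this region
            // without any copy through the JVM heap; this hand-off is a
            // hypothetical integration point, not part of the Arrow API.
            long address = values.getDataBuffer().memoryAddress();
            long length = values.getDataBuffer().capacity();

            System.out.printf("Arrow buffer @ 0x%x, %d bytes%n", address, length);
        }
    }
}
```

The design point is that the same bytes are valid Arrow data for C++, CUDA, or FPGA toolchains alike, so no accelerator-specific serialization format has to be invented on either side of the transfer.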

 



About Binwei Yang

Binwei Yang is a big data analytics architect at Intel, focusing on performance optimization of big data software, accelerator design and utilization in big data frameworks, and the integration of big data and HPC frameworks. Prior to this role, Binwei worked in Intel's microarchitecture team, focusing on performance simulation and analysis.

About Carson Wang

Carson Wang is a software engineering manager in Intel's data analytics software group, where he focuses on optimizing popular big data and machine learning frameworks and driving efforts to build a converged big data and AI platform. He has created and led several open source projects, including RayDP (Spark on Ray), OAP MLlib (a highly optimized Spark MLlib), the Spark adaptive query execution engine, HiBench (a big data micro-benchmark suite), and more.