High-Performance Python On Spark

This talk will examine ongoing work to integrate the Spark and Python ecosystems more closely, with the goal of making analytics more accessible, scalable, and fast for Python users. In particular, I will look at performance and usability questions that arise when scaling up single-machine workloads built on Python libraries like pandas and scikit-learn with Spark DataFrames, MLlib, and other Spark tools. With regard to data access and computation performance, I will look at efforts to use the Apache Arrow in-memory columnar layout to take better advantage of pandas’s optimized data wrangling algorithms and Python’s general high-performance computing tools in a Spark context.