Internals of Speeding up PySpark with Arrow


In the early days of Apache Spark, using Python with Spark was an exercise in patience. Data moved back and forth between Python and Scala, being serialised constantly. Leveraging SparkSQL and avoiding UDFs made things better, as did the steady improvement of the optimisers (Catalyst and Tungsten). Since Spark 2.3, however, PySpark has sped up tremendously thanks to the addition of the Arrow serialisers. In this talk you will learn how the Spark Scala core communicates with the Python processes, how data is exchanged between the two sub-systems, and what development efforts are current or underway to make it as fast as possible.


See More Spark + AI Summit Europe 2019 Videos

About Ruben Berenguel

Hybrid Theory

Ruben Berenguel is the lead data engineer at Hybrid Theory, as well as an occasional contributor to Spark (especially PySpark). He holds a PhD in Mathematics and moved to data engineering, where he works mostly with Scala, Python and Go, designing and implementing big data pipelines in London and Barcelona.