Getting The Best Performance With PySpark - Databricks

This talk assumes a basic understanding of Spark and goes beyond the standard introduction to explore what makes PySpark fast and how best to scale PySpark jobs. If you use Python and Spark together and want faster jobs, this is the talk for you. It covers a number of important topics for writing scalable Apache Spark programs, from RDD reuse to considerations for working with key/value data, including why avoiding groupByKey is important. It also covers Python-specific considerations, such as the difference between DataFrames/Datasets and traditional RDDs in Python, and explores some tricks for intermixing Python and JVM code in cases where the performance overhead is too high.
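To make the groupByKey point concrete, here is a minimal sketch in plain Python (not Spark itself) that models a commonly cited reason for avoiding it: groupByKey sends every record across the shuffle, while reduceByKey combines values within each partition first. The sample data and the helper names `group_then_sum` and `combine_then_reduce` are hypothetical, chosen only for illustration; in PySpark the two approaches correspond roughly to `rdd.groupByKey().mapValues(sum)` and `rdd.reduceByKey(add)`.

```python
from collections import defaultdict
from operator import add

# Hypothetical input: two "partitions" of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1), ("b", 1)],
]


def group_then_sum(parts):
    """Model of groupByKey followed by a sum: every record is shuffled."""
    shuffled = [pair for part in parts for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    totals = {key: sum(values) for key, values in grouped.items()}
    return totals, len(shuffled)


def combine_then_reduce(parts, func):
    """Model of reduceByKey: combine inside each partition, then shuffle."""
    shuffled = []
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        # Only one record per key per partition crosses the shuffle.
        shuffled.extend(local.items())
    totals = {}
    for key, value in shuffled:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals, len(shuffled)


group_totals, group_shuffled = group_then_sum(partitions)
reduce_totals, reduce_shuffled = combine_then_reduce(partitions, add)
```

Both functions produce the same totals, but the second moves fewer records between partitions. The gap widens as the number of records per key grows, which is why the advice matters most for large or skewed key/value data.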

Learn more:

  • Developing Custom Machine Learning Algorithms in PySpark
  • Best Practices for Running PySpark
  • Introducing Pandas UDF for PySpark