Join operations in Apache Spark is often a biggest source of performance problems and even full-blown exceptions in Spark. After this talk, you will understand the two most basic methods Spark employs for joining dataframes – to the level of detail of how Spark distributes the data within the cluster. You’ll also find out how to work out common errors and even handle the trickiest corner cases we’ve encountered! After this talk, you should be able to write performance joins in Spark SQL that scale and are zippy fast!
This session will cover different ways of joining tables in Apache Spark.
A ShuffleHashJoin is the most basic way to join tables in Spark – we’ll diagram how Spark shuffles the dataset to make this happen.
A BroadcastHashJoin is also a very common way for Spark to join two tables under the special condition that one of your tables is small.
Dealing with Key Skew in a ShuffleHashJoin
Key Skew is a common source of slowness for a Shuffle Hash Join – we’ll describe what this is and how you might work around this.
Cartesian Joins is a hard problem – we’ll describe why it’s difficult as well as what you need to do to make that work and what to look out for.
One to Many Joins
When a single row in one table can match to many rows in your other table, the total number of output rows in your joined table can be really high. We’ll let you know how to deal with this.
If you aren’t joining two tables strictly by key, but instead checking on a condition for your tables, you may need to provide some hints to Spark SQL to get this to run well. We’ll describe what you can do to make this work.
Vida is currently a Solutions Engineer at Databricks where her job is to onboard and support customers using Spark on Databricks Cloud. In her past, she worked on scaling Square's Reporting Analytics System. She first began working with distributed computing at Google, where she improved search rankings of mobile-specific web content and built and tuned language models for speech recognition using a year's worth of Google search queries. She's passionate about accelerating the adoption of Apache Spark to bring the combination of speed and scale of data processing to the mainstream.