Skewed data is the enemy when joining tables in Spark. It shuffles a large proportion of the data onto a few overloaded nodes, bottlenecking Spark's parallelism and causing out-of-memory errors. The go-to answer is the broadcast join: leave the large, skewed dataset in place and transmit the smaller table to every machine in the cluster for joining. But what happens when your second table is too large to broadcast and does not fit into memory? Or, even worse, when a single key is bigger than the total memory of an executor?

First, we will introduce the problem. Second, we will explain the current ways of fighting it, and why those solutions are limited. Finally, we will demonstrate a new technique, the iterative broadcast join, developed while processing ING Bank's global transaction data. This technique, implemented on top of the Spark SQL API, allows multiple large and highly skewed datasets to be joined successfully, while retaining a high level of parallelism. This is something that is not possible with existing Spark join types.
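The abstract does not spell out how the iterative broadcast join works, but the name suggests splitting the second table into broadcast-sized chunks, joining each chunk map-side against the large table, and unioning the partial results. Below is a minimal plain-Python sketch of that idea under those assumptions; the function name, chunking scheme, and data layout are illustrative, not the speakers' actual Spark implementation.

```python
def iterative_broadcast_join(large, medium, num_chunks):
    """Inner-join two lists of (key, value) rows on key.

    large  -- the big, possibly skewed table (never shuffled here)
    medium -- a table too large to broadcast in one piece
    """
    # Split the medium table into broadcast-sized chunks
    # (in Spark this would be one broadcast per iteration).
    chunks = [medium[i::num_chunks] for i in range(num_chunks)]
    result = []
    for chunk in chunks:
        # "Broadcast" one chunk: build an in-memory lookup table.
        lookup = {}
        for key, value in chunk:
            lookup.setdefault(key, []).append(value)
        # Map-side join: each row of the large table probes the
        # broadcast chunk, so the skewed table is never shuffled.
        for key, left_value in large:
            for right_value in lookup.get(key, []):
                result.append((key, left_value, right_value))
    # Union of the per-chunk partial results.
    return result
```

Because every row of the large table stays on its own partition and only the small chunks move, skewed keys no longer concentrate work on a few nodes, at the cost of scanning the large table once per chunk.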
Session hashtag: #EUde11
Rob is a Solution Architect and developer for Big Data applications, with 10 years' experience in the financial services and counter-fraud domains. He is currently working with ING Netherlands in their Advanced Analytics team. The team is an "internal start-up" that aims to change the way the bank operates through data-driven analytics and machine learning.
Fokko Driesprong is a data engineer at GoDataDriven. What he enjoys most is writing scalable code in functional languages (preferably Scala), and he loves to play with big data processing platforms. Besides consulting, he contributes to open source projects including Apache Spark, Apache Flink, Apache Airflow, and Druid.