There are many different aggregate operators in Spark SQL. They range from the very basic groupBy and not so basic groupByKey that shines bright in Apache Spark Structured Streaming’s stateful aggregations, including the more advanced cube, rollup and pivot to my beloved windowed aggregations. It’s unbelievable how different the performance characteristic they have, even for the same use cases.
What is particularly interesting is the comparison of the simplicity and performance of windowed aggregations vs groupBy. And that’s just Spark SQL alone. Then there is Spark Structured Streaming that has put groupByKey operator at the forefront of stateful stream processing (and to my surprise as the performance might not be that satisfactory).
This deep-dive talk is going to show all the different use cases for the aggregate operators and functions as well as their performance differences in Spark SQL 2.2 and beyond. Code and fun included!
Session hashtag: #EUdd5
Jacek Laskowski, an independent consultant, software engineer and trainer focusing exclusively on Apache Spark and Apache Kafka (with Scala and sbt, and as much as necessary with Apache Mesos, Hadoop YARN, and DC/OS). He is best known by the gitbooks at https://jaceklaskowski.gitbooks.io about Apache Spark, Spark Structured Streaming, and Apache Kafka. Find me at https://twitter.com/jaceklaskowski.