Spark Structured Streaming in Apache Spark 2.2 comes with quite a few unique Catalyst operators, most notably stateful streaming operators, and three different output modes. Understanding how Spark Structured Streaming manages intermediate state between triggers and how that affects performance is paramount. After all, you use Apache Spark to process huge amounts of data, which alone can be tricky to get right, and Spark Structured Streaming adds a streaming factor that, for a given structured query, can make the data even bigger due to state management.
This deep-dive talk is going to show you what is included in the execution diagrams, logical and physical plans, and metrics on the Details for Query page of the SQL tab.
The talk will also explain the other parts of the SQL tab and the subpages with details for streaming queries.
The talk is going to answer the following questions:
* What do the blue boxes represent on the Details for Query page in the SQL tab?
* What does the black popup window tell me when I hover over a blue box on the Details for Query page in the SQL tab?
* What’s under the Details section at the bottom of the Details for Query page in the SQL tab?
* Why does a single streaming query execute many queries, as shown in the SQL tab?
* What are the Spark jobs on the Spark Jobs page in the Jobs tab?
* Why would a single query execution lead to zero or more Spark jobs? How does the translation happen?
* Why are there shuffles/exchanges in the execution plan of a streaming aggregation query?
* and more!
Jacek Laskowski is an independent consultant, software engineer, and trainer focusing exclusively on Apache Spark and Apache Kafka (with Scala and sbt, and, as much as necessary, Apache Mesos, Hadoop YARN, and DC/OS). He is best known for his gitbooks at https://jaceklaskowski.gitbooks.io about Apache Spark, Spark Structured Streaming, and Apache Kafka. Find him at https://twitter.com/jaceklaskowski.