In Spark SQL, the physical plan provides the fundamental information about how a query is executed. The objective of this talk is to build understanding of and familiarity with query plans in Spark SQL, and to use that knowledge to achieve better performance of Apache Spark queries. We will walk you through the most common operators you might find in a query plan and explain the information each one carries that helps in understanding the details of execution. If you understand the query plan, you can look for its weak spots and try to rewrite the query to obtain a better plan that leads to more efficient execution. The content of this talk is based on the Spark source code, but it also reflects real-life queries that we run in our daily data processing. We will show examples of query plans and explain how to interpret them and what information can be taken from them. We will also describe what happens under the hood when the plan is generated, focusing mainly on the physical planning phase.
David is a data scientist at Socialbakers. He works with Spark on a daily basis, processing data at different scales, from a few GBs up to tens of TBs. He also does query optimization and helps productionize various ETL pipelines. David enjoys preparing and delivering Spark trainings and workshops, and has trained several company teams in Spark, including data engineers, analysts, and researchers. David received his Ph.D. from Charles University in Prague in 2015.