Implementing an efficient Spark application with the goal of maximal performance often requires knowledge that goes beyond the official documentation. Understanding Spark’s internal processes and features can help you design queries that align with its internal optimizations and thus execute efficiently. In this talk we will focus on some internal features of Spark SQL that are not well described in the official documentation, explaining them through basic examples and sharing performance tips along the way.
We want to demystify Spark’s behavior in situations where the documentation does not provide sufficient explanation. The content is based mostly on knowledge gathered from studying the Spark source code and on our experience with daily data processing. We will talk about topics and pitfalls that we encountered and solved, either in our real-life queries or when helping the community by answering questions on Stack Overflow. The talk is intended for anyone who wants to learn how Spark SQL works under the hood and how to use that knowledge to achieve better performance of Spark queries.
Speaker: David Vrba
David is a senior machine learning engineer at Socialbakers. He works with Spark on a daily basis, processing data at scales from a few GBs up to tens of TBs. He also optimizes queries for maximal performance and helps productionize various ETL pipelines and ML applications. David enjoys preparing and delivering Spark trainings and workshops and has trained several teams in Spark, including data engineers, analysts, and researchers. David received his Ph.D. from Charles University in Prague in 2015.