David Vrba

Senior Machine Learning Engineer, Socialbakers a.s.

David is a senior machine learning engineer at Socialbakers. He works with Spark on a daily basis, processing data at scales ranging from a few GBs up to tens of TBs. He also optimizes queries with the goal of achieving maximal performance and helps productionize various ETL pipelines and ML applications. David enjoys preparing and delivering Spark trainings and workshops, and has trained several teams in Spark, including data engineers, analysts, and researchers. David received his Ph.D. from Charles University in Prague in 2015.

Past sessions

Summit Europe 2020 Spark SQL Beyond Official Documentation

November 17, 2020 04:00 PM PT

Implementing an efficient Spark application with the goal of maximal performance often requires knowledge that goes beyond the official documentation. Understanding Spark's internal processes and features can help you design queries in alignment with internal optimizations and thus achieve high efficiency during execution. In this talk we will focus on some internal features of Spark SQL which are not well described in the official documentation, with a strong emphasis on explaining these features through basic examples while sharing some performance tips along the way.

We want to demystify Spark's behavior in various situations where the documentation does not provide sufficient explanation. The content is based mostly on knowledge gathered from studying the Spark source code and on experience from our daily data processing. We will talk about topics and pitfalls that we encountered and solved either in our real-life queries or when helping the community by answering questions on Stack Overflow. The talk is intended for anyone who wants to learn how Spark SQL works under the hood and how to use that knowledge to achieve better performance of Spark queries.

Speaker: David Vrba

Summit Europe 2019 Physical Plans in Spark SQL—continues

October 15, 2019 05:00 PM PT

In Spark SQL the physical plan provides the fundamental information about the execution of the query. The objective of this talk is to convey an understanding of and familiarity with query plans in Spark SQL, and to use that knowledge to achieve better performance of Apache Spark queries. We will walk you through the most common operators you might find in the query plan and explain relevant information that can be useful for understanding the details of the execution. If you understand the query plan, you can look for weak spots and try to rewrite the query to achieve a more optimal plan that leads to more efficient execution.

The main content of this talk is based on the Spark source code, but it will reflect real-life queries that we run while processing data. We will show some examples of query plans and explain how to interpret them and what information can be taken from them. We will also describe what is happening under the hood when the plan is generated, focusing mainly on the physical planning phase. In general, in this talk we want to share what we have learned from both the Spark source code and the real-life queries that we run in our daily data processing.

Summit Europe 2019 Physical Plans in Spark SQL

October 15, 2019 05:00 PM PT

In Spark SQL the physical plan provides the fundamental information about the execution of the query. The objective of this talk is to convey an understanding of and familiarity with query plans in Spark SQL, and to use that knowledge to achieve better performance of Apache Spark queries. We will walk you through the most common operators you might find in the query plan and explain relevant information that can be useful for understanding the details of the execution. If you understand the query plan, you can look for weak spots and try to rewrite the query to achieve a more optimal plan that leads to more efficient execution.

The main content of this talk is based on the Spark source code, but it will reflect real-life queries that we run while processing data. We will show some examples of query plans and explain how to interpret them and what information can be taken from them. We will also describe what is happening under the hood when the plan is generated, focusing mainly on the physical planning phase. In general, in this talk we want to share what we have learned from both the Spark source code and the real-life queries that we run in our daily data processing.