Jacek Laskowski - Databricks

Jacek Laskowski

Independent Consultant, 4Quant / ETH Zurich

Jacek is an independent consultant who offers development and training services for Apache Spark (and Scala, sbt with a bit of Hadoop YARN, Apache Kafka, Apache Hive, Apache Mesos, Akka Actors/Stream/HTTP, and Docker). He leads the Warsaw Scala Enthusiasts and Warsaw Spark meetups. His latest project is to gain an in-depth understanding of Apache Spark, with notes published at https://jaceklaskowski.gitbooks.io/mastering-apache-spark/.

UPCOMING SESSIONS

Bucketing in Spark SQL 2.3

Bucketing is an optimization technique in Spark SQL that uses buckets and bucketing columns to determine data partitioning. When applied properly, bucketing can lead to join optimizations by avoiding shuffles (aka exchanges) of the tables participating in the join. The talk will give you the necessary information so you can use bucketing to optimize Spark SQL structured queries.
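
To make the idea concrete, here is a minimal conceptual sketch in plain Python. It is not the Spark SQL API and not taken from the talk: the names `bucketize`, `bucketed_join`, `NUM_BUCKETS` and the sample tables are hypothetical. It only illustrates why two tables bucketed the same way on the join column can be joined bucket by bucket, without moving rows between buckets.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen for illustration only


def bucketize(rows, num_buckets=NUM_BUCKETS):
    """Assign each row to a bucket by hashing its first field (the bucketing column)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[hash(row[0]) % num_buckets].append(row)
    return buckets


def bucketed_join(left_buckets, right_buckets):
    """Join bucket i of the left table only with bucket i of the right table.

    Rows with equal keys always land in the same bucket id, so no row has to
    be compared with (or moved to) any other bucket.
    """
    joined = []
    for bucket_id, left_rows in left_buckets.items():
        right_rows = right_buckets.get(bucket_id, [])
        for left_row in left_rows:
            for right_row in right_rows:
                if left_row[0] == right_row[0]:
                    joined.append((left_row[0], left_row[1], right_row[1]))
    return sorted(joined)


# Made-up sample data: (customer_id, item) and (customer_id, name)
orders = [(1, "book"), (2, "pen"), (5, "lamp"), (6, "desk")]
customers = [(1, "Ala"), (5, "Ola"), (6, "Ela"), (7, "Jan")]

result = bucketed_join(bucketize(orders), bucketize(customers))
print(result)
```

The join result is the same as a plain nested-loop join over the whole tables; the difference is that each bucket pair can be processed independently, which is the property that lets a query engine skip the shuffle.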

Deep Dive into Query Execution in Spark SQL 2.3
Summit Europe 2018

If you want to get even slightly better performance out of your structured queries (regardless of whether they are batch or streaming), you have to peek at the foundations of the Dataset API, starting with QueryExecution. That's where every structured query ends up and where my talk starts. The talk will show you what stages a structured query has to go through before execution in Spark SQL. I'll be talking about the different phases of query execution and the logical and physical optimizations. I'll show the different optimizations in Spark SQL 2.3 and how to write one yourself (in Scala).
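
As a rough illustration of what a logical optimization is, the following plain-Python sketch applies one rewrite rule (constant folding) to a tiny expression tree until the tree stops changing. It is a toy model, not Spark's Catalyst API and not the Scala code from the talk: the classes `Literal`, `Column`, `Add` and the functions `fold_constants` and `optimize` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Add:
    left: object
    right: object


def fold_constants(expr):
    """Rule: rewrite Add(Literal, Literal) into a single Literal, bottom-up."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            return Literal(left.value + right.value)
        return Add(left, right)
    return expr  # literals and columns are left unchanged


def optimize(expr, rules, max_iterations=10):
    """Apply all rules repeatedly until a pass changes nothing (a fixed point)."""
    for _ in range(max_iterations):
        new_expr = expr
        for rule in rules:
            new_expr = rule(new_expr)
        if new_expr == expr:
            break
        expr = new_expr
    return expr


# price + (1 + 2) should become price + 3
plan = Add(Column("price"), Add(Literal(1), Literal(2)))
optimized = optimize(plan, [fold_constants])
print(optimized)
```

The shape is the point here: an optimization is a function from tree to tree, and the optimizer is a loop that keeps applying such functions while they still make progress.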

Deep Dive into Query Execution in Spark SQL 2.3
Summit 2018

If you want to get even slightly better performance out of your structured queries (regardless of whether they are batch or streaming), you have to peek at the foundations of the Dataset API, starting with QueryExecution. That's where every structured query ends up and where my talk starts. The talk will show you what stages a structured query has to go through before execution in Spark SQL. I'll be talking about the different phases of query execution and the logical and physical optimizations. I'll show the different optimizations in Spark SQL 2.3 and how to write one yourself (in Scala).

PAST SESSIONS

What Lies Beneath Apache Spark’s RDD API (Using Spark-shell and WebUI)
Summit East 2016

The talk introduces Spark from the near-low-level details of RDDs and the jobs that are triggered by actions. It's a deep dive into what happens after a simple spark-shell execution and how Spark distributes tasks amongst executors. It will also demonstrate the difference between Spark's local mode and clusters, and how stages are created for a Spark user program, using the Spark shell and web UI. It should be as useful for developers as for administrators who would like to dig deeper into Apache Spark under the surface of the RDD API. The approach is to demonstrate what is behind a simple "spark-shell --master" and to learn Spark from another, non-API perspective. The talk is a sort of summary of what I learnt about the architecture of Apache Spark from reviewing Spark's source code and writing the notes at https://jaceklaskowski.gitbooks.io/mastering-apache-spark/.

Deep Dive into Monitoring Spark Applications (Using Web UI and SparkListeners)
Summit Europe 2016

During the presentation you will learn about the architecture of Spark's web UI and the different SparkListeners that sit behind it to support its operation. You will learn what information about Spark applications the Spark UI presents and how to read it to understand the performance of your Spark applications. This talk will demo sample Spark snippets (using spark-shell) to showcase the hidden gems of the Spark UI, like queues in FAIR scheduling mode, SQL queries or Streaming jobs.

From Basic to Advanced Aggregate Operators in Apache Spark SQL 2.2 by Examples and their Catalyst Optimizations – continues
Summit Europe 2017

There are many different aggregate operators in Spark SQL. They range from the very basic groupBy and the not-so-basic groupByKey, which shines bright in Apache Spark Structured Streaming’s stateful aggregations, through the more advanced cube, rollup and pivot, to my beloved windowed aggregations. It’s unbelievable how different their performance characteristics are, even for the same use cases. What is particularly interesting is the comparison of the simplicity and performance of windowed aggregations vs groupBy. And that’s just Spark SQL alone. Then there is Spark Structured Streaming, which has put the groupByKey operator at the forefront of stateful stream processing (to my surprise, since the performance might not be that satisfactory). This deep-dive talk is going to show all the different use cases for the aggregate operators and functions as well as their performance differences in Spark SQL 2.2 and beyond. Code and fun included! Session hashtag: #EUdd5
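
The difference in shape between a groupBy aggregation and a windowed aggregation can be sketched in a few lines of plain Python. This is a conceptual illustration only, not Spark SQL code and not an example from the talk; the function names `group_by_sum` and `windowed_sum` and the sample data are hypothetical. It shows the semantics, not the performance characteristics the talk compares.

```python
from collections import defaultdict

# Made-up sample rows: (city, amount)
sales = [("Warsaw", 10), ("Warsaw", 30), ("Zurich", 5)]


def group_by_sum(rows):
    """groupBy-style aggregation: one output row per group."""
    totals = defaultdict(int)
    for city, amount in rows:
        totals[city] += amount
    return sorted(totals.items())


def windowed_sum(rows):
    """Window-style aggregation: every input row is kept and gets its group's total."""
    totals = dict(group_by_sum(rows))
    return [(city, amount, totals[city]) for city, amount in rows]


grouped = group_by_sum(sales)
windowed = windowed_sum(sales)
print(grouped)   # [('Warsaw', 40), ('Zurich', 5)]
print(windowed)  # [('Warsaw', 10, 40), ('Warsaw', 30, 40), ('Zurich', 5, 5)]
```

A groupBy collapses each group to a single row, while a window keeps the original rows and attaches the aggregate to each of them, which is why the two can serve the same use case with very different query plans.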

From Basic to Advanced Aggregate Operators in Apache Spark SQL 2.2 by Examples and their Catalyst Optimizations
Summit Europe 2017

There are many different aggregate operators in Spark SQL. They range from the very basic groupBy and the not-so-basic groupByKey, which shines bright in Apache Spark Structured Streaming’s stateful aggregations, through the more advanced cube, rollup and pivot, to my beloved windowed aggregations. It’s unbelievable how different their performance characteristics are, even for the same use cases. What is particularly interesting is the comparison of the simplicity and performance of windowed aggregations vs groupBy. And that’s just Spark SQL alone. Then there is Spark Structured Streaming, which has put the groupByKey operator at the forefront of stateful stream processing (to my surprise, since the performance might not be that satisfactory). This deep-dive talk is going to show all the different use cases for the aggregate operators and functions as well as their performance differences in Spark SQL 2.2 and beyond. Code and fun included! Session hashtag: #EUdd5

BoF Discussion - Apache Spark Meetup Organizers
Summit Europe 2017

Today, there are 625 Spark meetups with 430K members around the globe. How can we work, share, collaborate, and promote speakers and sessions? This BoF is for any Spark Meetup organizer, attendee, or speaker, and for anyone interested in sharing ideas for better sharing of and collaboration on tech talks and content.

Monitoring Structured Streaming Applications Using Web UI
Summit Europe 2017

Spark Structured Streaming in Apache Spark 2.2 comes with quite a few unique Catalyst operators, most notably stateful streaming operators and three different output modes. Understanding how Spark Structured Streaming manages intermediate state between triggers and how it affects performance is paramount. After all, you use Apache Spark for processing huge amounts of data, which alone can be tricky to get right, and Spark Structured Streaming adds the streaming factor that, depending on the structured query, can make the data even bigger due to state management. This deep-dive talk is going to show you what is included in execution diagrams, logical and physical plans, and metrics in the SQL tab's Details for Query page. The talk will also explain the other parts of the SQL tab and the subpages with details for streaming queries. The talk is going to answer the following questions:
- What do blue boxes represent in the Details for Query page in the SQL tab?
- What does the black popup window tell me when hovering over a blue box in the Details for Query page in the SQL tab?
- What’s under the Details section at the bottom of the Details for Query page in the SQL tab?
- Why does a single streaming query execute many queries as shown in the SQL tab?
- What are the Spark jobs in the Spark Jobs page in the Jobs tab?
- Why would a single query execution lead to zero or more Spark jobs? How does the translation happen?
- Why are there shuffles/exchanges in an execution plan for a streaming aggregation query?
and more!

Learn more:
  • Deep Dive into Monitoring Spark Applications (Using Web UI and SparkListeners)
  • A Deep Dive Into Structured Streaming
  • Structured Streaming In Apache Spark