Jacek Laskowski

Independent Consultant

Jacek Laskowski is an independent consultant, software engineer and trainer focusing exclusively on Apache Spark and Apache Kafka (with Scala and sbt, and, as much as necessary, with Apache Mesos, Hadoop YARN, and DC/OS). He is best known for his gitbooks at https://jaceklaskowski.gitbooks.io about Apache Spark, Spark Structured Streaming, and Apache Kafka. Find him at https://twitter.com/jaceklaskowski.

SESSIONS

What Lies Beneath Apache Spark’s RDD API (Using Spark-shell and WebUI)

This talk introduces Spark through the near-low-level details of RDDs and the jobs that actions trigger. It is a deep dive into what happens after a simple spark-shell execution and how Spark distributes tasks among executors. It also demonstrates the difference between Spark's local mode and clusters, and how stages are created for a Spark user program, using the Spark shell and web UI. It should be as useful for developers as for administrators who would like to dig deeper into Apache Spark beneath the surface of the RDD API. The approach is to demonstrate what is behind a simple "spark-shell --master" and to learn Spark from another, non-API perspective. The talk is a sort of summary of what I learnt about the architecture of Apache Spark from reviewing Spark's source code and writing the notes at https://jaceklaskowski.gitbooks.io/mastering-apache-spark/.

Deep Dive into Monitoring Spark Applications (Using Web UI and SparkListeners)

During the presentation you will learn about the architecture of Spark's web UI and the different SparkListeners that sit behind it to support its operation. You will learn what information about Spark applications the web UI presents and how to read it to understand the performance of your Spark applications. The talk will demo sample Spark snippets (using spark-shell) to showcase the hidden gems of the web UI, like pools in FAIR scheduling mode, SQL queries, and Streaming jobs.

From Basic to Advanced Aggregate Operators in Apache Spark SQL 2.2 by Examples and their Catalyst Optimizations – continues

There are many different aggregate operators in Spark SQL. They range from the very basic groupBy and the not-so-basic groupByKey, which shines in Apache Spark Structured Streaming’s stateful aggregations, through the more advanced cube, rollup and pivot, to my beloved windowed aggregations. It’s unbelievable how different their performance characteristics are, even for the same use cases. What is particularly interesting is the comparison of the simplicity and performance of windowed aggregations vs groupBy. And that’s just Spark SQL alone. Then there is Spark Structured Streaming, which has put the groupByKey operator at the forefront of stateful stream processing (to my surprise, as the performance might not be that satisfactory). This deep-dive talk is going to show all the different use cases for the aggregate operators and functions as well as their performance differences in Spark SQL 2.2 and beyond. Code and fun included! Session hashtag: #EUdd5
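To make the groupBy vs windowed aggregation contrast concrete, here is a minimal sketch in plain Python. It is not Spark code: the sample rows and the function names group_by_sum and window_sum are made up for illustration, and the sketch only mimics the shape of the results, not Spark's execution or performance.

```python
from collections import defaultdict

# Hypothetical sample rows: (department, employee, salary)
rows = [
    ("eng", "ann", 100),
    ("eng", "bob", 80),
    ("ops", "cat", 60),
]

def group_by_sum(rows):
    """Mimics a groupBy on department with a sum: one output row per group."""
    totals = defaultdict(int)
    for dept, _name, salary in rows:
        totals[dept] += salary
    return sorted(totals.items())

def window_sum(rows):
    """Mimics a sum over a window partitioned by department:
    every input row is kept and gains its partition's total."""
    totals = dict(group_by_sum(rows))
    return [(dept, name, salary, totals[dept]) for dept, name, salary in rows]

print(group_by_sum(rows))
print(window_sum(rows))
```

The grouped result collapses the three rows into two, while the windowed result keeps all three and attaches the per-department total to each, which is the difference in simplicity the comparison above refers to.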

From Basic to Advanced Aggregate Operators in Apache Spark SQL 2.2 by Examples and their Catalyst Optimizations

There are many different aggregate operators in Spark SQL. They range from the very basic groupBy and the not-so-basic groupByKey, which shines in Apache Spark Structured Streaming’s stateful aggregations, through the more advanced cube, rollup and pivot, to my beloved windowed aggregations. It’s unbelievable how different their performance characteristics are, even for the same use cases. What is particularly interesting is the comparison of the simplicity and performance of windowed aggregations vs groupBy. And that’s just Spark SQL alone. Then there is Spark Structured Streaming, which has put the groupByKey operator at the forefront of stateful stream processing (to my surprise, as the performance might not be that satisfactory). This deep-dive talk is going to show all the different use cases for the aggregate operators and functions as well as their performance differences in Spark SQL 2.2 and beyond. Code and fun included! Session hashtag: #EUdd5
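As a rough illustration of how rollup and cube differ, the plain-Python sketch below (not Spark code) lists the grouping sets each operator aggregates over, following the standard SQL semantics of ROLLUP and CUBE. The function names and the column names country and city are hypothetical.

```python
from itertools import combinations

def rollup_grouping_sets(cols):
    """Grouping sets for rollup: every prefix of cols, down to the grand total ()."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

def cube_grouping_sets(cols):
    """Grouping sets for cube: every subset of cols, largest subsets first."""
    return [
        subset
        for size in range(len(cols), -1, -1)
        for subset in combinations(cols, size)
    ]

cols = ["country", "city"]  # hypothetical grouping columns
print(rollup_grouping_sets(cols))
print(cube_grouping_sets(cols))
```

For n grouping columns, rollup yields n + 1 grouping sets and cube yields 2^n, which is one reason the two can behave so differently on the same data.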

BoF Discussion-Apache Spark Meetup Organizers

Today there are 625 Spark meetups with 430K members around the globe. How can we work, share, collaborate, and promote speakers and sessions? This BoF is for any Spark Meetup organizer, attendee, or speaker, and for anyone interested in sharing ideas for better sharing of and collaboration on tech talks and content.

Monitoring Structured Streaming Applications Using Web UI

Spark Structured Streaming in Apache Spark 2.2 comes with quite a few unique Catalyst operators, most notably stateful streaming operators, and three different output modes. Understanding how Spark Structured Streaming manages intermediate state between triggers, and how that affects performance, is paramount. After all, you use Apache Spark for processing huge amounts of data, which alone can be tricky to get right, and Spark Structured Streaming adds a streaming factor that, depending on the structured query, can make the data even bigger due to state management. This deep-dive talk is going to show you what is included in the execution diagrams, logical and physical plans, and metrics on the Details for Query page of the SQL tab. The talk will also explain the other parts of the SQL tab and the subpages with details for streaming queries. The talk is going to answer the following questions:

* What do the blue boxes represent on the Details for Query page in the SQL tab?
* What does the black popup window tell me when I hover over a blue box on the Details for Query page in the SQL tab?
* What’s under the Details section at the bottom of the Details for Query page in the SQL tab?
* Why does a single streaming query execute many queries, as shown in the SQL tab?
* What are the Spark jobs on the Spark Jobs page in the Jobs tab?
* Why would a single query execution lead to zero or more Spark jobs? How does the translation happen?
* Why are there shuffles/exchanges in an execution plan for a streaming aggregation query?
* and more!