Vida is currently a Solutions Engineer at Databricks, where she onboards and supports customers using Spark on Databricks Cloud. Previously, she worked on scaling Square’s Reporting Analytics System. She first began working with distributed computing at Google, where she improved search rankings for mobile-specific web content and built and tuned language models for speech recognition using a year’s worth of Google search queries. She’s passionate about accelerating the adoption of Apache Spark to bring its combination of speed and scale in data processing to the mainstream.
Spark can analyze data stored in files in many different formats: plain text, JSON, XML, Parquet, and more. But just because you can get a Spark job to run on a given input format doesn’t mean you’ll get the same performance from all of them. In fact, the performance difference can be quite substantial. This talk will cover some common data input formats and the nuances of working with each one. The goal of the talk is to help Spark programmers make more conscious, informed decisions about how to store their data. Here are examples of topics that will be covered in the talk:
– Issues you’ll encounter when processing excessively large XML input files.
– Why choose Parquet files for Spark SQL?
– How coalescing many small files may give you better performance.
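To make the Parquet point concrete, here is a minimal pure-Python sketch of the difference between row-oriented and column-oriented storage. It is a toy analogy, not how Spark or Parquet are actually implemented: in a real columnar file, a query that aggregates one column reads only that column's bytes from disk, while a row-oriented format forces it to deserialize every full row.

```python
# Toy dataset: three rows with three fields each.
rows = [
    {"id": 1, "name": "a", "score": 10},
    {"id": 2, "name": "b", "score": 20},
    {"id": 3, "name": "c", "score": 30},
]

def sum_score_row_store(rows):
    # Row-oriented layout: each record is stored whole, so even a
    # single-column aggregate pays to scan past "id" and "name" too.
    return sum(r["score"] for r in rows)

# Column-oriented layout: each column's values are stored contiguously.
columns = {field: [r[field] for r in rows] for field in rows[0]}

def sum_score_column_store(columns):
    # Only the "score" column needs to be touched at all.
    return sum(columns["score"])
```

Both functions return the same answer; the difference in a real file format is how many bytes must be read to get there.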
This session will cover a series of problems that are well solved with Apache Spark, as well as problems that require additional technologies to implement correctly. Here’s an example outline of some of the topics that will be covered in the talk:
Problems that are a perfect fit for Apache Spark: 1) Analyzing a large set of data files. 2) Doing ETL on a large amount of data. 3) Applying machine learning and data science to a large dataset. 4) Connecting BI/visualization tools to Apache Spark to analyze large datasets internally.
Examples of problems that Apache Spark is not optimized for: 1) Random access, frequent inserts, and updates of rows in SQL tables. Databases have better performance for these use cases. 2) Incremental updates from databases into Spark. It’s not performant to update Spark SQL tables backed by files in place. Instead, you can use message queues with Spark Streaming, or do an incremental select, to keep your Spark SQL tables up to date with your production databases. 3) External reporting with many concurrent requests. While Spark’s ability to cache your data in memory allows fast interactive querying, Spark is not designed to be optimal for serving many concurrent requests. If you have many concurrent users to support, it’s better to use Spark to ETL your data into summary tables or some other format in a traditional database that serves your reports. 4) Searching content. A Spark job can certainly be written to filter or search for any content you’d like, but Elasticsearch is a specialized engine designed to return search results more quickly than Spark.
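The "incremental select" pattern mentioned above can be sketched in plain Python. This is an illustrative simulation under assumed names (`incremental_sync`, a monotonically increasing `id` acting as the high-water mark), with lists standing in for a source database table and a file-backed Spark SQL table; it is not a Spark API.

```python
def incremental_sync(source_rows, target_rows, last_seen_id):
    """Pull only rows newer than the high-water mark from the source,
    append them to the target, and advance the mark.

    This avoids re-reading (or rewriting) the whole table on every
    refresh, which is the expensive part for file-backed tables.
    """
    new_rows = [r for r in source_rows if r["id"] > last_seen_id]
    target_rows.extend(new_rows)
    # If nothing was new, the high-water mark stays where it was.
    new_mark = max((r["id"] for r in new_rows), default=last_seen_id)
    return target_rows, new_mark

# Source table has grown to three rows; target has only seen id 1.
source = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
target = [{"id": 1, "v": "a"}]
target, mark = incremental_sync(source, target, last_seen_id=1)
```

After the sync, the target holds all three rows and the mark has advanced to 3, so the next refresh will copy nothing until new rows arrive.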
Join operations in Apache Spark are often the biggest source of performance problems, and even full-blown exceptions, in Spark. After this talk, you will understand the two most basic methods Spark employs for joining DataFrames, down to the level of detail of how Spark distributes the data within the cluster. You’ll also find out how to work around common errors and even handle the trickiest corner cases we’ve encountered! After this talk, you should be able to write performant joins in Spark SQL that scale and are zippy fast! This session will cover the following ways of joining tables in Apache Spark:
ShuffleHashJoin - A ShuffleHashJoin is the most basic way to join tables in Spark; we’ll diagram how Spark shuffles the dataset to make this happen.
BroadcastHashJoin - A BroadcastHashJoin is also a very common way for Spark to join two tables, under the special condition that one of your tables is small.
Dealing with Key Skew in a ShuffleHashJoin - Key skew is a common source of slowness for a ShuffleHashJoin; we’ll describe what it is and how you might work around it.
CartesianJoin - A cartesian join is a hard problem; we’ll describe why it’s difficult, what you need to do to make it work, and what to look out for.
One to Many Joins - When a single row in one table can match many rows in your other table, the total number of rows in your joined output can be very high. We’ll let you know how to deal with this.
Theta Joins - If you aren’t joining two tables strictly by key, but instead checking a condition across your tables, you may need to provide some hints to Spark SQL to get this to run well. We’ll describe what you can do to make this work.
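The two basic strategies above can be illustrated with a pure-Python sketch. This is a conceptual model of the algorithms, not Spark's actual implementation: a shuffle join moves both sides so that matching keys land in the same partition, while a broadcast join ships the small side whole to every worker and never shuffles the large side.

```python
from collections import defaultdict

def shuffle_hash_join(left, right, num_partitions=4):
    """Join two lists of (key, value) pairs by first 'shuffling' rows
    into partitions by key hash, then hash-joining inside each
    partition (the ShuffleHashJoin idea)."""
    left_parts, right_parts = defaultdict(list), defaultdict(list)
    for k, v in left:
        left_parts[hash(k) % num_partitions].append((k, v))
    for k, v in right:
        right_parts[hash(k) % num_partitions].append((k, v))
    out = []
    for p in range(num_partitions):
        built = defaultdict(list)          # build a hash table per partition
        for k, v in right_parts[p]:
            built[k].append(v)
        for k, v in left_parts[p]:         # probe with the other side
            for w in built[k]:
                out.append((k, v, w))
    return out

def broadcast_hash_join(large, small):
    """When one side is small, 'broadcast' it everywhere as a hash map
    and stream the large side past it with no shuffle at all."""
    lookup = defaultdict(list)
    for k, v in small:
        lookup[k].append(v)
    return [(k, v, w) for k, v in large for w in lookup[k]]
```

Both functions produce the same joined rows; the difference in a real cluster is how much data moves over the network to make the match happen.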
Efficient data access is one of the key factors in a high-performance data processing pipeline. The layout of data values in the filesystem often has a fundamental impact on the performance of data access. In this talk, we will share insights into how data layout affects the performance of data access. We will first explain how modern columnar file formats like Parquet and ORC work, and how to use them efficiently to store data values. Then, we will present our best practices for storing datasets, including guidelines on choosing partitioning columns and deciding how to bucket a table. Session hashtag: #SFexp20
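Partitioning and bucketing can be sketched in plain Python. This is a hedged toy model, not Spark's storage layer: a dict of Hive-style `date=...` directory names stands in for the filesystem, and partition pruning means a filter on the partitioning column lets the reader skip whole directories instead of scanning every file.

```python
# Toy 'filesystem': one entry per Hive-style partition directory.
partitions = {
    "date=2020-01-01": [("u1", 5), ("u2", 7)],
    "date=2020-01-02": [("u1", 3)],
}

def read_with_pruning(partitions, wanted_date):
    """A filter on the partitioning column only opens the matching
    directory; all other partitions are never read at all."""
    return partitions.get("date=" + wanted_date, [])

def bucket(rows, num_buckets=2):
    """Bucketing pre-hashes rows by key at write time, so a later join
    or aggregation on that key can reuse the layout instead of
    shuffling the data again."""
    buckets = [[] for _ in range(num_buckets)]
    for k, v in rows:
        buckets[hash(k) % num_buckets].append((k, v))
    return buckets
```

In both cases the win is the same: deciding the layout once, at write time, so that many later reads do less work.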