Rose Toomey

Software Engineer, Coatue Management

Rose Toomey is a software engineer at Coatue Management. Rose is responsible for the care and feeding of Spark data pipelines. Previously, Rose was the Lead API Developer at Gemini Trust and a Director of Engineering at Novus Partners.

UPCOMING SESSIONS

Care and Feeding of Catalyst Optimizer (Summit 2020)

You've seen the technical deep dives on Spark's Catalyst query optimizer. You understand how to fix joins, how to find common traps in a logical query plan. But what happens when you're alone with Spark UI and the cluster goes idle for 40 minutes? How can you diagnose what's gone wrong with your query and fix it? Spark SQL's ease of use can have a deceptively steep operational curve. Queries can look innocent but cause issues that require a sophisticated understanding of Spark internals to diagnose and solve. A tour through puzzles and edge cases, this talk challenges us to a better practical understanding of Spark's Catalyst Optimizer:

  • Everything about how you - and the optimizer - reason about UDFs is based on the idea that they're cheap to run. What if they're not? Betrayed by salt, a surprising source of skew!
  • What do you do when Spark's codegen stage generates a method that exceeds the JVM's 64KB bytecode limit? What's really going on, and is it possible to fix it other than just disabling whole-stage codegen?
  • How can tuning the JVM code cache improve your Spark application's performance? (See the sketch after this list.)
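A minimal sketch, assuming Spark 2.4+ in Scala, of the knobs behind those last two questions. Everything here is illustrative: spark.sql.codegen.hugeMethodLimit is an internal config (the bytecode size at which Spark abandons whole-stage codegen for a single stage rather than the whole query), and the 512m code cache size is a placeholder to tune against your own workload.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug._ // adds debugCodegen() to Datasets

val spark = SparkSession.builder()
  .appName("codegen-and-code-cache-sketch")
  // Per-stage fallback: if a generated method's bytecode exceeds this limit,
  // Spark falls back to interpreted execution for that stage only.
  // (Internal config; default 65535 bytes.)
  .config("spark.sql.codegen.hugeMethodLimit", "65535")
  // The blunt instrument, kept here only as a last resort:
  // .config("spark.sql.codegen.wholeStage", "false")
  // Give the JIT more room so hot generated methods are not flushed from the
  // code cache. Usually set at spark-submit time rather than in code.
  .config("spark.executor.extraJavaOptions",
    "-XX:ReservedCodeCacheSize=512m -XX:+UseCodeCacheFlushing")
  .getOrCreate()

val counts = spark.range(0L, 1000000L)
  .selectExpr("id", "id % 7 AS bucket")
  .groupBy("bucket")
  .count()

// Print the Java source Spark generates for each whole-stage-codegen subtree,
// which is how you spot a method creeping toward the 64KB limit.
counts.debugCodegen()
```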

PAST SESSIONS

Apache Spark At Scale in the Cloud (Summit Europe 2019)

Using Apache Spark to analyze large datasets in the cloud presents a range of challenges. Different stages of your pipeline may be constrained by CPU, memory, disk and/or network IO. But what if all those stages have to run on the same cluster? In the cloud, you have limited control over the hardware your cluster runs on.

You may have even less control over the size and format of your raw input files. Performance tuning is an iterative and experimental process. It's frustrating with very large datasets: what worked great with 30 billion rows may not work at all with 400 billion rows. But with strategic optimizations and compromises, 50+ TiB datasets can be no big deal.

Using the Spark UI and simple metrics, explore how to diagnose and remedy issues in your jobs:

  • Sizing the cluster based on your dataset (shuffle partitions)
  • Ingestion challenges - well begun is half done (globbing S3, small files)
  • Managing memory (sorting out GC - when to go parallel, when to go G1, when off-heap can help you)
  • Shuffle (give a little to get a lot - configs for better out-of-the-box shuffle; see the config sketch after this list)
  • Spill (partitioning for the win)
  • Scheduling (FAIR vs FIFO, is there a difference for your pipeline?)
  • Caching and persistence (it's the cost of doing business, so what are your options?)
  • Fault tolerance (blacklisting, speculation, task reaping)
  • Making the best of a bad deal (skew joins, windowing, UDFs, very large query plans; see the salted-join sketch after this list)
  • Writing to S3 (dealing with write partitions, HDFS and s3DistCp vs writing directly to S3)
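For reference, a minimal sketch (Scala, placeholder values throughout) of the configuration surface those bullets cover. None of the numbers are recommendations; they only show which setting governs which behaviour on a Spark 2.4-era cluster, where blacklisting still goes by that name.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-at-scale-config-sketch")
  // Sizing: shuffle partitions scaled to the dataset, not left at the default 200.
  .config("spark.sql.shuffle.partitions", "4000")
  // Shuffle: spend a little memory to get far fewer fetch round trips.
  .config("spark.shuffle.file.buffer", "1m")
  .config("spark.reducer.maxSizeInFlight", "96m")
  // Memory: G1 for large heaps, off-heap storage to take pressure off the GC.
  .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "8g")
  // Scheduling: FAIR matters when concurrent jobs share one application.
  .config("spark.scheduler.mode", "FAIR")
  // Fault tolerance: blacklisting, speculation, task reaping.
  .config("spark.blacklist.enabled", "true")
  .config("spark.speculation", "true")
  .config("spark.speculation.quantile", "0.9")
  .config("spark.task.reaper.enabled", "true")
  .getOrCreate()
```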
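And for the skew joins mentioned above, a minimal sketch of a salted join, assuming a large table `skewed` with a few hot keys and a smaller table `dims`; the function name and salt factor are illustrative, not anything prescribed by the talk.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Spread hot join keys across saltBuckets shuffle partitions by joining on
// (key, salt) instead of key alone.
def saltedJoin(skewed: DataFrame, dims: DataFrame, key: String, saltBuckets: Int): DataFrame = {
  // Large, skewed side: tag every row with a random salt in [0, saltBuckets).
  val saltedLeft = skewed.withColumn("salt", (rand() * saltBuckets).cast("int"))

  // Smaller side: replicate each row once per salt value so every
  // (key, salt) combination still finds its match.
  val saltedRight = dims.withColumn("salt", explode(array((0 until saltBuckets).map(lit): _*)))

  saltedLeft.join(saltedRight, Seq(key, "salt")).drop("salt")
}
```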