Care and Feeding of Catalyst Optimizer

You’ve seen the technical deep dives on Spark’s Catalyst query optimizer. You understand how to fix joins and how to find common traps in a logical query plan. But what happens when you’re alone with the Spark UI and the cluster goes idle for 40 minutes? How can you diagnose what’s gone wrong with your query and fix it? Spark SQL’s ease of use can hide a deceptively steep operational learning curve. Queries can look innocent yet cause issues that require a sophisticated understanding of Spark internals to diagnose and solve. A tour through puzzles and edge cases, this talk challenges us to build a better practical understanding of Spark’s Catalyst optimizer:

  • Everything about how you – and the optimizer – reason about UDFs is based on the idea they’re cheap to run. What if they’re not? Betrayed by salt, a surprising source of skew!
  • What do you do when Spark’s codegen stage generates a method that exceeds 64 KB? What’s really going on, and is it possible to fix it other than by just disabling whole-stage codegen?
  • How can tuning the JVM code cache improve your Spark application’s performance?

About Rose Toomey


Rose Toomey joined Bloomberg as a senior software developer in the AI Group in April 2020. Previously, she worked as a senior software engineer at Coatue Management, Lead API Developer at Gemini Trust, and Director of Engineering at Novus Partners.