Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn - Databricks

We all dread “Lost task” and “Container killed by YARN for exceeding memory limits” messages in our scaled-up Spark-on-YARN applications. Even answering the question “How much memory did my application use?” is surprisingly tricky in the distributed YARN environment. Sqrrl has developed a testing framework for observing vital statistics of Spark jobs, including executor-by-executor memory and CPU usage over time for both the JVM and Python portions of PySpark YARN containers. This talk will detail the methods we use to collect, store, and report Spark-on-YARN resource usage. This information has proved invaluable for performance and regression testing of the Spark jobs in Sqrrl Enterprise.

Learn more:

  • Spark on YARN: a Deep Dive
  • Spark-on-YARN: The Road Ahead
  • Productionizing a 24/7 Spark Streaming service on YARN
    About Ruslan Vaulin

    Ruslan Vaulin is a Senior Data Scientist at Sqrrl Data Inc. He is an expert in time-series analysis, anomaly detection, machine learning, and Bayesian statistics. At Sqrrl Data, he develops algorithms for detecting cyber-security threats and cyber attacks. Prior to joining Sqrrl Data Inc., Ruslan Vaulin was a research scientist at the MIT-LIGO Laboratory, Massachusetts Institute of Technology, developing algorithms for detecting gravitational-wave signals. He is a co-author of the LIGO paper reporting the discovery of the first gravitational-wave signal.

    About Ed Barnes

    Edwin Barnes is the performance architect at Sqrrl Data. He leads efforts in the design and deployment of large-scale performance testing systems for high-throughput distributed data processing and analytics applications. Previously, he worked at a range of top tech and data companies, including Vertica Systems, Unidesk, and Dataupia. He has extensive experience in cloud computing and is a leading expert in quality assurance and performance optimization of distributed systems.