Ruslan Vaulin is a Senior Data Scientist at Sqrrl Data Inc. He is an expert in time-series analysis, anomaly detection, machine learning, and Bayesian statistics. At Sqrrl Data, he develops algorithms for detecting cybersecurity threats and cyber attacks.
Prior to joining Sqrrl Data Inc., Ruslan Vaulin was a research scientist at the MIT-LIGO Laboratory at the Massachusetts Institute of Technology, where he developed algorithms for detecting gravitational-wave signals. He is a co-author of the paper reporting LIGO’s discovery of the first gravitational-wave signal (https://www.ligo.caltech.edu/news/ligo20160211, http://journals.aps.org/prl/abstract/10.1103/PhysRevLett.116.241103).
We all dread the “Lost task” and “Container killed by YARN for exceeding memory limits” messages in our scaled-up Spark on YARN applications. Even answering the question “How much memory did my application use?” is surprisingly tricky in a distributed YARN environment. Sqrrl has developed a testing framework for observing the vital statistics of Spark jobs, including executor-by-executor memory and CPU usage over time for both the JDK and Python portions of PySpark YARN containers. This talk will detail the methods we use to collect, store, and report Spark on YARN resource usage. This information has proved invaluable for performance and regression testing of the Spark jobs in Sqrrl Enterprise.
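
To make the “How much memory did my application use?” question concrete, here is a minimal sketch of one way to aggregate time-stamped memory samples per executor. It is not Sqrrl's framework: the function name `peak_container_memory`, the sample tuple layout, and the sample values are all hypothetical and chosen only for illustration. The sketch assumes that each container's memory at a given instant is the sum of its JDK-side and Python-side processes, and that samples from both processes share timestamps.

```python
from collections import defaultdict


def peak_container_memory(samples):
    """Return the peak combined memory per executor.

    `samples` is an iterable of hypothetical tuples
    (timestamp, executor_id, process_kind, rss_bytes),
    where process_kind is e.g. "jvm" or "python".
    """
    # Sum the memory of all processes in a container at each timestamp.
    totals = defaultdict(int)
    for timestamp, executor_id, _process_kind, rss_bytes in samples:
        totals[(executor_id, timestamp)] += rss_bytes

    # Keep the largest combined value seen for each executor.
    peaks = {}
    for (executor_id, _timestamp), total in totals.items():
        peaks[executor_id] = max(peaks.get(executor_id, 0), total)
    return peaks


# Made-up sample values, purely for illustration.
samples = [
    (0, "exec-1", "jvm", 900), (0, "exec-1", "python", 300),
    (1, "exec-1", "jvm", 1000), (1, "exec-1", "python", 100),
    (0, "exec-2", "jvm", 500),
    (1, "exec-2", "jvm", 400), (1, "exec-2", "python", 250),
]
peaks = peak_container_memory(samples)
```

In this made-up data the combined peak for `exec-1` is 1200, which is lower than the 1300 obtained by adding the separate JDK and Python peaks. That gap is one reason per-process maxima alone do not answer the question, and why samples need to be aligned in time.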