Venkata Krishnan Sowrirajan

Architect, LinkedIn

Venkata is a Senior Software Engineer at LinkedIn working on a big data platform based on Apache Spark. He previously worked at Qubole, building big data platforms on public clouds, primarily with Apache Spark. Before entering the industry, he earned an MS in Computer Science from Arizona State University.

Past sessions

Apache Spark provides a ‘speculative execution’ feature to handle slow tasks in a stage caused by environment issues such as a slow network or slow disks. If one task in a stage is running slowly, the Spark driver can launch a speculation task for it on a different host. Between the regular task and its speculation task, Spark later takes the result from whichever one completes successfully first and kills the slower one.

When we first enabled the speculation feature by default for all Spark applications on a 10K+ node cluster at LinkedIn, we observed that the default values of Spark’s speculation configuration parameters did not work well for LinkedIn’s batch jobs. For example, the system launched too many fruitless speculation tasks (i.e., tasks that were later killed), and the speculation tasks did not help shorten shuffle stages. To reduce the number of fruitless speculation tasks, we identified the root causes, enhanced the Spark engine, and tuned the speculation parameters carefully. We analyzed the number of speculation tasks launched, the ratio of fruitful to fruitless speculation tasks, and their corresponding CPU and memory resource consumption in gigabyte-hours. On a large, heavily utilized multi-tenant cluster, we were able to reduce average job response times by 13%, decrease the standard deviation of job elapsed times by 40%, and lower total resource consumption by 24%. In this talk, we will share our experience enabling speculative execution to achieve a good reduction in job elapsed times while keeping the overhead minimal.
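For reference, speculation in stock Spark is governed by a small set of configuration parameters. The sketch below shows where they are set; the values are illustrative placeholders, not the tuned settings the talk arrives at.

    import org.apache.spark.sql.SparkSession

    // Illustrative sketch of the stock speculation knobs discussed above.
    // The values shown are placeholders, not the settings LinkedIn settled on.
    val spark = SparkSession.builder()
      .appName("speculation-tuning-sketch")
      // Enable speculative execution (disabled by default).
      .config("spark.speculation", "true")
      // How often the driver checks running tasks for speculation candidates.
      .config("spark.speculation.interval", "1s")
      // A task becomes a candidate once its runtime exceeds this multiple of
      // the median runtime of completed tasks in the same stage.
      .config("spark.speculation.multiplier", "3")
      // Fraction of tasks in a stage that must finish before speculation kicks in.
      .config("spark.speculation.quantile", "0.9")
      .getOrCreate()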

In this session, watch:
Venkata Krishnan Sowrirajan, Architect, LinkedIn
Ron Hu, Senior Staff Software Engineer, LinkedIn


Summit Europe 2019 Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters

October 15, 2019 05:00 PM PT

Adding nodes at runtime (upscaling) to an already running Spark-on-YARN cluster is fairly easy. But taking these nodes away (downscaling) later, when the workload is low, is a difficult problem. To remove a node from a running cluster, we need to make sure that it is no longer being used for either compute or storage.

But on production workloads, we see that many of the nodes can't be taken away, because:

  1. Nodes are running some containers even though they are not fully utilized, i.e., containers are fragmented across different nodes. For example, each node runs only 1-2 containers/executors although it has the resources to run 4.
  2. Nodes hold shuffle data on their local disks that will later be consumed by Spark applications running on the cluster. In this case, the Resource Manager will never decide to reclaim these nodes, because losing shuffle data could lead to costly recomputation of stages.

In this talk, we will talk about how we can improve downscaling in Spark-on-YARN clusters under the presence of such constraints. We will cover changes in scheduling strategy for container allocation in YARN and Spark task scheduler which together helps us achieve better packing of containers. This makes sure that containers are defragmented on fewer set of nodes and thus some nodes don't have any compute. In addition to this, we will also cover enhancements to Spark driver and External Shuffle Service (ESS) which helps us to proactively delete shuffle data which we already know has been consumed. This makes sure that nodes are not holding any unnecessary shuffle data - thus freeing them from storage and hence available for reclamation for faster downscaling.