Eric Kaczmarek is a Senior Java Performance Architect in the Software Solution Group. He has been at Intel for over 20 years. For the better part of the last 10 years, he has focused on optimizing the Java Virtual Machine for Intel Architectures. Because of his deep and broad Java Virtual Machine expertise, Eric leads the effort to enable and optimize Big Data frameworks such as Hadoop* and HBase* for Intel-based platforms. He earned a BS degree in Computer Science and Engineering from the University of California, Los Angeles (UCLA).
Spark nodes are shifting from commodity hardware to more powerful systems with larger memory (200GB+). Because Spark is an in-memory computing framework, popular wisdom has it that large Java heaps result in long garbage collection (GC) pauses that slow down Spark’s overall throughput. Through several case studies using large Java heaps, we will show that it is possible to maintain low GC pauses for better application throughput. In this presentation, we introduce the HotSpot G1 collector as the best GC for Spark solutions running in large-memory environments. We first discuss HotSpot G1 internal operations and several tuning flags. Those flags can be used to set the desired GC pause target, change adaptive GC thresholds, and adjust GC activities at runtime. We will then present several case studies from a Spark graph computing application running with an 80GB+ heap to show how we can tune those flags to remove unpredictable and protracted GC pauses for better application throughput.
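As a rough illustration of the kind of flags the abstract refers to, the sketch below assembles a G1 option string and passes it to Spark executors through spark.executor.extraJavaOptions. The HotSpot flags and Spark property names are real, but the helper functions g1_executor_options and spark_submit_args are hypothetical, and the numeric values are placeholders rather than settings recommended in the talk.

```python
import shlex


def g1_executor_options(pause_target_ms=200, ihop_percent=45, conc_gc_threads=None):
    """Build a JVM option string that enables G1 and sets common tuning flags.

    The defaults mirror HotSpot's own defaults (200 ms pause target, 45% IHOP);
    they are not values taken from the presentation.
    """
    if not 0 <= ihop_percent <= 100:
        raise ValueError("ihop_percent must be between 0 and 100")
    flags = [
        "-XX:+UseG1GC",                                        # select the G1 collector
        f"-XX:MaxGCPauseMillis={pause_target_ms}",             # desired pause target (a goal, not a guarantee)
        f"-XX:InitiatingHeapOccupancyPercent={ihop_percent}",  # heap occupancy that starts a concurrent cycle
    ]
    if conc_gc_threads is not None:
        flags.append(f"-XX:ConcGCThreads={conc_gc_threads}")   # threads used for concurrent marking
    return " ".join(flags)


def spark_submit_args(executor_memory, jvm_options):
    """Return spark-submit arguments that apply the JVM options to every executor."""
    return [
        "--conf", f"spark.executor.memory={executor_memory}",
        "--conf", f"spark.executor.extraJavaOptions={jvm_options}",
    ]


if __name__ == "__main__":
    # Illustrative values only: an 80g executor heap with an earlier concurrent cycle.
    opts = g1_executor_options(pause_target_ms=500, ihop_percent=35, conc_gc_threads=8)
    print("spark-submit " + shlex.join(spark_submit_args("80g", opts)))
```

Lowering InitiatingHeapOccupancyPercent makes concurrent marking start earlier, which is one way to adjust an adaptive threshold; whether that helps a given workload has to be confirmed from GC logs.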
Spark data processing is shifting from on-premises deployments to cloud services to take advantage of their horizontal resource scalability, better data accessibility, and easier manageability. However, fully utilizing the computational power, fast storage, and networking offered by cloud services can be challenging without a deep understanding of workload characteristics and proper software optimization expertise. In this presentation, we will use a Spark-based programming framework, the Genome Analysis Toolkit version 4 (GATK4, under development), as an example to present a process for configuring and optimizing an efficient Spark cluster on Google Cloud to speed up genome data processing. We will first introduce an in-house data profiling framework named PAT and discuss how to use it to quickly establish the best combination of VM configurations and Spark configurations to fully utilize cloud hardware resources and Spark computational parallelism. In addition, we use PAT and other data profiling tools to identify and fix software hotspots in the application. We will show a case study in which we identify a thread scalability issue with the Java instanceof operator. The fix, made in the Scala language, substantially improves the performance of GATK4 and other Spark-based workloads.