Yingqi (Lucy) Lu is a Senior Software Performance Engineer in the Software Solution Group. She has been at Intel for over 8 years, working on performance optimization of virtualization, power efficiency, web servers, and the Java Virtual Machine. She currently focuses on enabling and optimizing Big Data frameworks such as Hadoop* and Spark* for Intel Architecture. She earned an MS degree in Computer Science from the University of Colorado at Boulder.
Spark data processing is shifting from on-premises clusters to cloud services to take advantage of the cloud's horizontal resource scalability, better data accessibility, and easier manageability. However, fully utilizing the compute power, fast storage, and networking that cloud services offer can be challenging without a deep understanding of workload characteristics and software optimization expertise. In this presentation, we use a Spark-based programming framework, the Genome Analysis Toolkit version 4 (GATK4, under development), as an example to present a process for configuring and optimizing an efficient Spark cluster on Google Cloud to speed up genome data processing. We first introduce PAT, a data profiling framework developed in-house, and discuss how to use it to quickly establish the combination of VM and Spark configurations that best utilizes cloud hardware resources and Spark's computational parallelism. In addition, we use PAT and other profiling tools to identify and fix software hotspots in the application. We present a case study in which we identify a thread scalability issue with the Java instanceof operator. The fix, made in the Scala language, substantially improves the performance of GATK4 and other Spark-based workloads.