Kira Lindke is a mathematician turned application developer and data specialist for the Enterprise Performance Management project at IBM. Her primary project is an application for automation of data validation against various systems. Her day-to-day work involves extending and optimizing Spark jobs and working with IBM’s financial analysts to better understand the data and ways to enhance it. Prior to joining IBM, Kira worked for the Department of Defense as a data scientist.
Apache Spark's defaults provide decent performance for large data sets, but they leave room for significant gains if you can tune parameters to your resources and job. We'll dive into some best practices extracted from solving real-world problems, and the steps we took as we added resources. Topics include garbage collector selection, serialization, tuning the number of workers and executors, partitioning data, looking at skew and partition sizes, scheduling pools and the fair scheduler, and Java heap parameters. We'll also cover reading the Spark UI execution DAG to identify bottlenecks and solutions, optimizing joins, partitionBy, Spark SQL for rollups, best practices, and things to avoid if possible.
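As a rough illustration of the kind of tuning the talk covers, the sketch below collects a few real Spark configuration properties into a dictionary and renders them as spark-submit arguments. The values are placeholders chosen for illustration, not recommendations from the talk, and the names ILLUSTRATIVE_SPARK_CONF and to_spark_submit_args are hypothetical helpers, not part of Spark.

```python
# Illustrative values only: the right numbers depend on cluster size and job.
ILLUSTRATIVE_SPARK_CONF = {
    # Serialization: Kryo instead of the default Java serialization.
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    # Garbage collector selection for executor JVMs (G1 here as an example).
    "spark.executor.extraJavaOptions": "-XX:+UseG1GC",
    # Number and size of executors; executor heap size is set via memory.
    "spark.executor.instances": "10",
    "spark.executor.cores": "4",
    "spark.executor.memory": "8g",
    # Number of partitions Spark SQL uses after a shuffle.
    "spark.sql.shuffle.partitions": "400",
    # Fair scheduling across concurrent jobs in one application.
    "spark.scheduler.mode": "FAIR",
}


def to_spark_submit_args(conf):
    """Render a settings dict as a flat list of spark-submit arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    print(" ".join(to_spark_submit_args(ILLUSTRATIVE_SPARK_CONF)))
```

Keeping the settings in one place like this makes it easier to change a single parameter between runs and compare the effect in the Spark UI.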