Kaushik Tadikonda is a software engineer for Enterprise Performance Management at IBM, where he builds applications that identify problems with ETL pipelines. His day-to-day work involves optimizing Spark jobs and designing, deploying, and monitoring infrastructure. He is keenly interested in understanding how things work at a low level, which often causes more problems than it solves.
Apache Spark's defaults provide decent performance for large data sets, but they leave room for significant gains if you can tune parameters to match your resources and workload. We'll dive into best practices extracted from solving real-world problems, along with the steps we took as we added resources: garbage collector selection, serialization, tuning the number of workers and executors, partitioning data, handling skew, choosing partition sizes, scheduling pools and the fair scheduler, and Java heap parameters. We'll also cover reading the Spark UI execution DAG to identify bottlenecks and their solutions, optimizing joins and partitioning, using Spark SQL for rollups, and practices best avoided where possible.
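To make the kind of tuning the abstract describes concrete, here is a minimal sketch (my own illustration, not material from the talk) of a spark-defaults.conf fragment touching several of the knobs mentioned: garbage collector selection, serialization, executor sizing, shuffle partitioning, and the fair scheduler. The specific values are placeholders, not recommendations.

```properties
# Select the G1 garbage collector for executor JVMs (GC selection)
spark.executor.extraJavaOptions   -XX:+UseG1GC

# Kryo serialization is typically faster and more compact than Java serialization
spark.serializer                  org.apache.spark.serializer.KryoSerializer

# Executor sizing: example values only; tune to the cluster's cores and memory
spark.executor.instances          8
spark.executor.cores              4
spark.executor.memory             8g

# Shuffle partition count often needs adjusting away from the default of 200
spark.sql.shuffle.partitions      400

# FAIR scheduling lets concurrent jobs share executors via scheduling pools
spark.scheduler.mode              FAIR
```

The same keys can equally be passed per job with `spark-submit --conf key=value`, which is often preferable when different jobs in the pipeline need different settings.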