This talk will present a technical “deep dive” into Spark that focuses on its internal architecture. The content will be geared towards those already familiar with the basic Spark API who want to gain a deeper understanding of how it works and become advanced users or Spark developers.
We will walk through the major internal components of Spark: the RDD data model, the scheduling subsystem, and Spark’s internal block-store service. For each component we’ll describe its architecture and role in job execution. We’ll also provide examples of how higher-level libraries like Spark SQL and MLlib interact with the core Spark API.
Throughout the talk we’ll cover advanced topics like data serialization, RDD partitioning, and user-defined RDDs, with a focus on actionable advice that users can apply to their own workloads.
Aaron Davidson is an Apache Spark committer and software engineer at Databricks. His Spark contributions include standalone master fault tolerance, shuffle file consolidation, the Netty-based block transfer service, and the external shuffle service. At Databricks, he leads the Performance and Storage team, working on the Databricks File System (DBFS) and automating the cloud infrastructure.