Apache Spark 2.0 laid the architectural foundations for structure in Spark: unified high-level APIs, Structured Streaming, and the underlying performance components such as the Catalyst optimizer and the Tungsten engine. Since then, Spark community contributors have continued to build new features and fix numerous issues in the Spark 2.1 and 2.2 releases.
In the same spirit, Apache Spark 2.3 introduces new features and resolves over 1300 JIRA issues. In this talk, we want to share with the community some salient features of Spark 2.3:
Kubernetes Scheduler Backend
PySpark Performance and Enhancements
Continuous Structured Streaming Processing
DataSource v2 APIs
Spark History Server Performance Enhancements
Session hashtag: #DevSAIS16
Sameer Agarwal is a Spark Committer and Tech Lead on the Data Platform team at Facebook, where he works on building distributed systems and databases that scale across geo-distributed clusters of tens of thousands of machines. Before Facebook, Sameer led the open-source Apache Spark team at Databricks. He received his PhD in Databases from the UC Berkeley AMPLab, where he worked on BlinkDB, an approximate query engine for Spark.