Anton is a Spark contributor and a Software Engineer at Apple. He has been dealing with the internals of Spark for the last 3 years. At Apple, Anton is working on an elastic, on-demand, secure, and fully managed Spark as a service. Prior to joining Apple, he optimized and extended a proprietary Spark distribution at SAP. Anton holds a Master’s degree in Computer Science from RWTH Aachen University.
May 26, 2021 11:30 AM PT
More and more companies are adopting Spark 3 to benefit from enhancements and performance optimizations such as adaptive query execution and dynamic partition pruning. During this process, organizations consider migrating their data sources to the newly added Catalog API (aka Data Source V2), which provides a better way to develop reliable and efficient connectors. Unfortunately, a few limitations prevent connectors from realizing the full potential of the Catalog API. One of them is the inability to control the distribution and ordering of incoming data, which has a profound impact on the performance of data sources.
This talk will be useful for developers and data engineers who either develop their own data sources or work with existing ones in Spark. The presentation will start with an overview of the Catalog API introduced in Spark 3, followed by its benefits and current limitations compared to the old Data Source API. The main focus will be on an extension to the Catalog API developed in SPARK-23889, which lets implementations control how Spark distributes and orders incoming records before passing them to the sink.
The extension allows data sources not only to reduce the memory footprint during writes but also to co-locate data for faster queries and better compression. Apart from that, the introduced API paves the way for more advanced features like partitioned joins.
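To make the effect of distribution and ordering on writes more concrete, here is a minimal sketch in plain Python. It is a toy model, not the Spark API: the helper names (`round_robin`, `cluster_by_key`, `sort_within_tasks`, `files_written`) and the sample records are hypothetical, and the sink is assumed to keep a single writer open per task and to start a new file whenever the partition key changes.

```python
import zlib


def round_robin(records, num_tasks):
    """Spread records over tasks with no regard to their partition key."""
    tasks = [[] for _ in range(num_tasks)]
    for i, rec in enumerate(records):
        tasks[i % num_tasks].append(rec)
    return tasks


def cluster_by_key(records, num_tasks):
    """Send all records with the same partition key to the same task."""
    tasks = [[] for _ in range(num_tasks)]
    for rec in records:
        # crc32 is used as a deterministic stand-in for a hash partitioner.
        tasks[zlib.crc32(rec[0].encode()) % num_tasks].append(rec)
    return tasks


def sort_within_tasks(tasks):
    """Order each task's records by partition key."""
    return [sorted(task, key=lambda rec: rec[0]) for task in tasks]


def files_written(task_records):
    """Count files produced by a task that keeps one writer open and
    rolls over to a new file every time the partition key changes."""
    files = 0
    current_key = None
    for key, _ in task_records:
        if key != current_key:
            files += 1
            current_key = key
    return files


# Hypothetical input: 12 records cycling through 3 daily partitions.
days = ["2021-05-01", "2021-05-02", "2021-05-03"]
records = [(days[i % 3], i) for i in range(12)]

unordered = sum(files_written(t) for t in round_robin(records, 2))
ordered = sum(
    files_written(t) for t in sort_within_tasks(cluster_by_key(records, 2))
)
print(unordered, ordered)
```

In this toy setup the unordered write produces 12 files, while the clustered and sorted write produces 3, one per partition. A real sink could instead keep one writer open per partition it has seen, which trades the small files for a larger memory footprint; requesting a distribution and ordering up front avoids both.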
April 24, 2019 05:00 PM PT
Apple leverages Apache Spark for processing large datasets to power key components of Apple's production services. The majority of users rely on Spark SQL to benefit from state-of-the-art optimizations in Catalyst and Tungsten. As there are multiple APIs to interact with Spark SQL, users have to choose carefully which one to use. While DataFrames and SQL are widely used, they lack type safety, so analysis errors such as invalid column names or types are not detected at compile time. DataFrames also lack the ability to apply the same functional constructs as on RDDs. Datasets expose a type-safe API and support user-defined closures, at the cost of performance.
This talk will explain cases when Spark SQL cannot optimize typed Datasets as much as it can optimize DataFrames. We will also present an effort to use bytecode analysis to convert user-defined closures into native Catalyst expressions. This helps Spark avoid the expensive conversion between the internal format and JVM objects and leverage more Catalyst optimizations. As a consequence, we can bridge the gap in performance between Datasets and DataFrames, so that users do not have to sacrifice the benefits of Datasets for performance reasons.
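The gap between closures and native expressions can be illustrated with a small sketch in plain Python. This is a toy model and not Catalyst: the classes `Col`, `Lit`, and `GreaterThan` and the function `referenced_columns` are hypothetical names chosen for illustration. The idea shown is that a planner can inspect an expression tree to find out which columns a predicate needs, whereas a user-defined closure is opaque and forces it to assume every column is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Lit:
    value: object


@dataclass(frozen=True)
class GreaterThan:
    left: object
    right: object


def referenced_columns(predicate, all_columns):
    """Return the columns a scan must read to evaluate the predicate."""
    if callable(predicate):
        # Opaque closure: the planner cannot look inside it, so it has to
        # assume the closure may touch every column of the row.
        return set(all_columns)
    if isinstance(predicate, Col):
        return {predicate.name}
    if isinstance(predicate, GreaterThan):
        return (
            referenced_columns(predicate.left, all_columns)
            | referenced_columns(predicate.right, all_columns)
        )
    # Literals reference no columns.
    return set()


schema = ["id", "age", "name", "address"]

# The same filter, once as an inspectable expression and once as a closure.
expression = GreaterThan(Col("age"), Lit(21))
closure = lambda row: row["age"] > 21

needed_for_expression = referenced_columns(expression, schema)
needed_for_closure = referenced_columns(closure, schema)
print(sorted(needed_for_expression), sorted(needed_for_closure))
```

With the expression form, only `age` has to be read; with the closure, all four columns do. Turning a closure into an equivalent expression, which the talk proposes to do via bytecode analysis, would let the planner treat both forms the same way.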