More and more companies are adopting Spark 3 to benefit from enhancements and performance optimizations such as adaptive query execution and dynamic partition pruning. During this process, organizations consider migrating their data sources to the newly added Catalog API (also known as Data Source V2), which provides a better way to develop reliable and efficient connectors. Unfortunately, a few limitations still prevent unleashing the full potential of the Catalog API. One of them is the inability to control the distribution and ordering of incoming data, which has a profound impact on the performance of data sources.
This talk will be useful for developers and data engineers who either build their own data sources or work with existing ones in Spark. The presentation will start with an overview of the Catalog API introduced in Spark 3, followed by its benefits and current limitations compared to the old Data Source API. The main focus will be on an extension to the Catalog API developed in SPARK-23889, which lets implementations control how Spark distributes and orders incoming records before passing them to the sink.
The extension allows data sources not only to reduce their memory footprint during writes but also to co-locate data for faster queries and better compression. Beyond that, the introduced API paves the way for more advanced features such as partitioned joins.
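As a rough illustration of what SPARK-23889 enables, a write implementation can declare a required distribution and ordering that Spark enforces before handing records to the sink. The sketch below uses simplified stand-in types in place of the real `org.apache.spark.sql.connector.write` and `org.apache.spark.sql.connector.distributions` interfaces so it compiles without a Spark dependency; the method names follow the actual API (`requiredDistribution`, `requiredOrdering`, `requiredNumPartitions`), but the types and the example sink are illustrative only.

```scala
// Simplified stand-ins for Spark's connector interfaces (illustrative only;
// the real types live in org.apache.spark.sql.connector.write and
// org.apache.spark.sql.connector.distributions).
case class SortOrder(column: String, ascending: Boolean = true)

sealed trait Distribution
case class ClusteredDistribution(clusteringColumns: Seq[String]) extends Distribution

trait RequiresDistributionAndOrdering {
  // How incoming records must be distributed across tasks.
  def requiredDistribution: Distribution
  // How records must be ordered within each task.
  def requiredOrdering: Seq[SortOrder]
  // 0 means "let Spark pick the number of partitions".
  def requiredNumPartitions: Int = 0
}

// A hypothetical sink that wants rows clustered by date and sorted by
// (date, customer_id) within each partition, so every task writes
// well-ordered, well-compressed files without buffering rows itself.
object ExampleWrite extends RequiresDistributionAndOrdering {
  override def requiredDistribution: Distribution =
    ClusteredDistribution(Seq("date"))
  override def requiredOrdering: Seq[SortOrder] =
    Seq(SortOrder("date"), SortOrder("customer_id"))
}
```

With the real API, Spark inspects these requirements during planning and inserts the necessary shuffle and sort before invoking the writer, which is exactly how the extension removes per-sink buffering and re-sorting.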
Anton is a Spark contributor and a Software Engineer at Apple. He has been dealing with the internals of Spark for the last 3 years. At Apple, Anton is working on an elastic, on-demand, secure, and fu...