Defining customized scalable aggregation logic is one of Apache Spark’s most powerful features. User Defined Aggregate Functions (UDAF) are a flexible mechanism for extending both Spark data frames and Structured Streaming with new functionality ranging from specialized summary techniques to building blocks for exploratory data analysis. And yet as powerful as they are, UDAFs prior to Spark 3.0 have had subtle flaws that can undermine both performance and usability.
In this talk Erik will tell the story about how he met UDAFs and fell in love with their powerful features. He’ll describe how he faced challenges with the UDAF design and its performance properties and how, with the help of the Apache Spark community, he eventually fixed the UDAF design in Spark 3.0 and fell in love all over again. Along the way you’ll learn about how User Defined Aggregation works in Spark, how to write your own UDAF library and how Spark’s newest UDAF features improve both usability and performance. You’ll also hear how Spark’s code review process made these new features even better and learn tips for successfully shepherding a large feature into the Apache Spark upstream community.
Erik Erlandson is a Software Engineer at Red Hat, where he investigates analytics use cases and scalable deployments for Apache Spark in the cloud. He also consults on internal data science and analytics projects. Erik is a contributor to Apache Spark and other open source projects in the Spark ecosystem, including the Spark on Kubernetes community project, Algebird and Scala.