Over the years, Facebook has used Hive as the primary query engine for our data engineers. Hive uses a SQL-like query language called HQL, but its list of built-in User Defined Functions (UDFs) did not always satisfy our customers' requirements, and as a result an extensive list of custom UDFs was developed over time. As we started migrating pipelines from Hive to Spark SQL, a number of custom UDFs turned out to be incompatible with Spark, and many others performed poorly. In this talk, we will first take a deep dive into how Hive UDFs work with Spark. We will then share the challenges we overcame on the way to supporting 99.99% of the custom UDFs in Spark.
Sergey Makagonov is a software engineer on the Big Compute team at Facebook. Sergey is passionate about building large-scale distributed systems to solve real-world problems. Prior to Facebook, he worked as a software engineer at Ipsy, where he scaled the personalization platform of the subscription service using Apache Spark. Sergey obtained a Master's degree in Computer Science from Kazakh-British Technical University.
Xin Yao is a tech lead at Pinterest. Xin is passionate about building and scaling distributed systems. Xin was previously on the Spark team at Facebook, focusing on improving Spark performance and efficiency. Before that, Xin worked as a Senior Software Engineer at Hulu, where he maintained Spark and built and scaled multiple batch and real-time pipelines. Xin received his Master's degree from Beijing University of Posts and Telecommunications in 2013.