Over the years, Facebook has used Hive as the primary query engine for our data engineers. Hive uses a SQL-like query language called HQL, but its list of built-in User Defined Functions (UDFs) did not always satisfy our customers' requirements, so an extensive list of custom UDFs was developed over time. As we started migrating pipelines from Hive to Spark SQL, a number of custom UDFs turned out to be incompatible with Spark, and many others performed poorly. In this talk, we will first take a deep dive into how Hive UDFs work with Spark. We will then share the challenges we overcame on the way to supporting 99.99% of the custom UDFs in Spark.
Sergey Makagonov is a software engineer on the Big Compute team at Facebook. Sergey is passionate about building large-scale distributed systems to solve real-world problems. Prior to Facebook, he worked as a software engineer at Ipsy, where he scaled the personalization platform of the subscription service using Apache Spark. Sergey obtained a Master's degree in Computer Science from Kazakh-British Technical University.
Xin Yao is a software engineer on the Spark team at Facebook. Before Facebook, Xin worked as a senior software engineer at Hulu, where he built the real-time ETL pipeline and scaled the data warehouse. Xin received his Master's degree from Beijing University of Posts and Telecommunications in 2013.