Shivangi is a Senior Engineering Leader at Informatica focused on using Spark for data engineering.
She has 10+ years of experience with distributed processing, database systems, Hadoop, Spark, and other big data technologies. More recently, she has focused on optimizing Spark for processing Informatica data pipelines. This involves, among other things, enhancing open source Spark technology and creating Spark++. She has a Master's in Computer Science from the University of Southern California.
November 17, 2020 04:00 PM PT
User Defined Functions (UDFs) are an important feature of Spark SQL that extends the language with custom constructs. UDFs are very useful for extending the Spark vocabulary, but they come with significant performance overhead. They are black boxes to the Spark optimizer and block several helpful optimizations, such as WholeStageCodegen and null optimization. String functions also carry a heavy processing cost because they require UTF-8 to UTF-16 conversions, which slow down Spark jobs and increase memory requirements. In this talk, we will go over how we at Informatica optimized UDFs to be as performant as Spark native functions in both time and memory, and how we allowed these functions to participate in Spark's optimization steps.
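The string conversion cost mentioned above can be illustrated with a minimal sketch. It is not Informatica's implementation and does not use Spark. It assumes the engine holds string values as UTF-8 bytes (as Spark's internal UTF8String does) and that the UDF needs UTF-16 text (as a JVM String does). Python strings are not UTF-16, so an explicit `utf-16-le` encoding stands in for the JVM String here. The function names `udf_style_upper` and `utf16_size` are hypothetical.

```python
# Sketch only: models the per-row UTF-8 <-> UTF-16 round trip of a string UDF.
# Assumption: the engine stores strings as UTF-8 bytes and the UDF wants UTF-16.


def udf_style_upper(utf8_value: bytes) -> bytes:
    """Hypothetical UDF path: convert, transform, convert back, for every row."""
    # UTF-8 -> UTF-16 copy (stand-in for building a JVM String).
    utf16_value = utf8_value.decode("utf-8").encode("utf-16-le")
    # The actual user logic runs on the UTF-16 representation.
    transformed = utf16_value.decode("utf-16-le").upper()
    # UTF-16 -> UTF-8 copy so the engine can store the result again.
    return transformed.encode("utf-8")


def utf16_size(utf8_value: bytes) -> int:
    """Number of bytes the same value occupies once converted to UTF-16."""
    return len(utf8_value.decode("utf-8").encode("utf-16-le"))


row = "spark udf".encode("utf-8")
result = udf_style_upper(row)
utf8_bytes = len(row)
utf16_bytes = utf16_size(row)
print(result, utf8_bytes, utf16_bytes)
```

For ASCII-heavy data, the UTF-16 copy takes twice the bytes of the UTF-8 original, and the round trip is repeated for every row. A function that operates directly on the UTF-8 bytes would avoid both copies.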
Speaker: Shivangi Srivastava