Hyukjin is a software engineer at Databricks, working on many different areas in Spark such as Spark SQL, PySpark, SparkR, etc. He is an Apache Spark committer and mainly focuses on the on the open source community in Apache Spark such as helping discuss and review many features and changes.
In the past several years, the pandas UDFs are perhaps the most important changes to Spark for Python data science. However, these functionalities have evolved organically, leading to some inconsistencies and confusions among users. In Spark 3.0, the pandas UDFs was redesigned by leveraging type hints. By using Python type hints, you can naturally express pandas UDFs without requiring such as the evaluation type. Also, pandas UDFs are now more 'Pythonic' and let themselves define what the UDF is supposed to input and output with the clear definition. Moreover, it allows many benefits such as easier static analysis. In this talk, I will introduce the redesigned pandas UDFs with type hints in Spark 3.0 with a technical overview.
Apache Spark already has a vectorization optimization in many operations, for instance, internal columnar format, Parquet/ORC vectorized read, Pandas UDFs, etc. Vectorization improves performance greatly in general. In this talk, the performance aspect of SparkR will be discussed and vectorization in SparkR will be introduced with technical details. SparkR vectorization allows users to use the existing codes as are but boost the performance around several thousand present faster when they execute R native functions or convert Spark DataFrame to/from R DataFrame.