Hyukjin is a Databricks software engineer, Apache Spark PMC member and committer, working on many different areas in Apache Spark such as Spark SQL, PySpark, SparkR, etc. He is also one of the top contributors in Koalas. He mainly focuses on development, helping discussions, and reviewing many features and changes in Apache Spark and Koalas.
In the past several years, the pandas UDFs are perhaps the most important changes to Apache Spark for Python data science. However, these functionalities have evolved organically, leading to some inconsistencies and confusions among users. In Apache Spark 3.0, the pandas UDFs were redesigned by leveraging type hints. By using Python type hints, you can naturally express pandas UDFs without requiring such as the evaluation type. Also, pandas UDFs are now more ‘Pythonic’ and let themselves define what the UDF is supposed to input and output with the clear definition. Moreover, it allows many benefits such as easier static analysis. In this talk, I will introduce the redesigned pandas UDFs with type hints in Apache Spark 3.0 with a technical overview.
Apache Spark already has a vectorization optimization in many operations, for instance, internal columnar format, Parquet/ORC vectorized read, Pandas UDFs, etc. Vectorization improves performance greatly in general. In this talk, the performance aspect of SparkR will be discussed and vectorization in SparkR will be introduced with technical details. SparkR vectorization allows users to use the existing codes as are but boost the performance around several thousand present faster when they execute R native functions or convert Spark DataFrame to/from R DataFrame.