Apache Spark already applies vectorization in many places, for instance its internal columnar format, vectorized Parquet/ORC reads, and Pandas UDFs. In general, vectorization improves performance greatly. This talk discusses the performance of SparkR and introduces vectorization in SparkR with technical details. SparkR vectorization lets users keep their existing code as is while running roughly several thousand percent faster when they execute native R functions or convert a Spark DataFrame to or from an R DataFrame.
Hyukjin is a software engineer at Databricks, working on many different areas of Spark such as Spark SQL, PySpark, and SparkR. He is an Apache Spark committer and focuses mainly on the Apache Spark open source community, for example by helping to discuss and review features and changes.