This talk summarizes recent work in the Apache Spark developer community to enhance columnar storage in Spark 2.3. Columnar storage is an efficient format that keeps the values of a column in consecutive memory locations. Earlier versions of Spark used columnar storage in only a few places, and only as an internal data structure. Spark 2.3 published the abstract class ColumnVector as a public API and uses it to support several columnar storages efficiently, with significant performance improvements. Before Spark 2.3, columnar storage was used for reading Apache Parquet and for building the table cache in programs written in SQL, DataFrame, or Dataset, e.g. df.cache.
These columnar storages were accessed through different internal APIs, and this difference made the table cache inefficient. Because Spark 2.3 defines ColumnVector as a public API, it can read data in Apache Arrow and Apache ORC through ColumnVector without extra data conversion or copying. PySpark before Spark 2.3 incurred a large overhead for serialization and deserialization; Spark 2.3 eliminates this overhead by exchanging data with pandas through Apache Arrow, which improves PySpark performance. Spark 2.3 also accesses the columnar storage for the table cache through ColumnVector without copying data, which improves table cache performance.
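The idea of one access API over several columnar backends can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it is not Spark's ColumnVector (a Java abstract class), and the names `ColumnReader`, `ArrayBackedColumn`, `NullableListColumn`, and `column_sum` are hypothetical. Only the accessor style (`is_null_at`, `get_long`) loosely mirrors ColumnVector methods such as isNullAt and getLong.

```python
from abc import ABC, abstractmethod
from array import array


class ColumnReader(ABC):
    """Hypothetical read-only view over one column (not a Spark class)."""

    @abstractmethod
    def num_rows(self) -> int: ...

    @abstractmethod
    def is_null_at(self, row_id: int) -> bool: ...

    @abstractmethod
    def get_long(self, row_id: int) -> int: ...


class ArrayBackedColumn(ColumnReader):
    """Values stored contiguously in a typed array; this backend has no nulls."""

    def __init__(self, values: array) -> None:
        self._values = values  # held by reference, not copied

    def num_rows(self) -> int:
        return len(self._values)

    def is_null_at(self, row_id: int) -> bool:
        return False

    def get_long(self, row_id: int) -> int:
        return self._values[row_id]


class NullableListColumn(ColumnReader):
    """A second backend with a different layout: values plus a null mask."""

    def __init__(self, values: list, nulls: list) -> None:
        self._values = values
        self._nulls = nulls

    def num_rows(self) -> int:
        return len(self._values)

    def is_null_at(self, row_id: int) -> bool:
        return self._nulls[row_id]

    def get_long(self, row_id: int) -> int:
        return self._values[row_id]


def column_sum(col: ColumnReader) -> int:
    """Consumer written once against the interface; skips null entries."""
    total = 0
    for i in range(col.num_rows()):
        if not col.is_null_at(i):
            total += col.get_long(i)
    return total


dense = ArrayBackedColumn(array("q", [1, 2, 3, 4]))
sparse = NullableListColumn([10, 0, 30], [False, True, False])
print(column_sum(dense), column_sum(sparse))
```

The consumer never converts one backend into the other; each backend exposes its existing buffer through the shared interface, which is the property the abstract attributes to ColumnVector.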
Here are the takeaways of this talk:
(1) ColumnVector in Spark 2.3 is a public API for columnar storage, used to exchange data with other columnar storages.
(2) Spark 2.3 uses ColumnVector to exchange data with the widely used columnar storages Apache Arrow and Apache ORC at low overhead, which improves performance.
(3) Spark 2.3 and later versions improve the performance of PySpark by using pandas with Apache Arrow.
(4) Spark 2.3 and later versions use ColumnVector for the table cache, which improves its performance.
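Takeaway (3) rests on replacing row-at-a-time serialization with columnar batches. The sketch below shows only the difference in shape, using the Python standard library: it does not use Spark, pandas, or Arrow, and all function names are hypothetical. Per-value pickling stands in for a row-based exchange, and a single typed buffer stands in for a columnar batch.

```python
import pickle
from array import array


def serialize_per_row(values):
    """One pickle call per value, a stand-in for row-at-a-time exchange."""
    return [pickle.dumps(v) for v in values]


def deserialize_per_row(blobs):
    """One unpickle call per value."""
    return [pickle.loads(b) for b in blobs]


def serialize_columnar(values):
    """One contiguous buffer of C doubles for the whole column."""
    return array("d", values).tobytes()


def deserialize_columnar(payload):
    """Rebuild the column from the single buffer."""
    out = array("d")
    out.frombytes(payload)
    return out.tolist()


column = [float(i) for i in range(1000)]

row_blobs = serialize_per_row(column)
batch = serialize_columnar(column)

# Both paths round-trip the same data.
assert deserialize_per_row(row_blobs) == column
assert deserialize_columnar(batch) == column

print(len(row_blobs), "row messages vs 1 columnar buffer of", len(batch), "bytes")
```

The row path produces one message per value, each with its own framing, while the columnar path produces a single buffer that a receiver can interpret in place. This is an analogy for the overhead described above, not a measurement of PySpark itself.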