Koalas is an open-source project that aims to bridge the gap between big data and small data for data scientists and to simplify Apache Spark for people who are already familiar with the pandas library in Python. pandas is the standard tool for data science and is typically the first step in exploring and manipulating a data set, but it does not scale well to big data. With Koalas, data scientists can use the same APIs as in pandas, but at scale on PySpark. In this talk, I introduce Koalas and its updates, show some comparisons between pandas and Koalas, and then take a deep dive into its internal structures and how it works with Spark.
Takuya Ueshin is a software engineer at Databricks and an Apache Spark committer and PMC member. His main interests are in Spark SQL internals, a.k.a. Catalyst, and in PySpark. He is one of the major contributors to the Koalas project.