Koalas: How Well Does Koalas Work?

May 26, 2021 04:25 PM (PT)

Koalas is an open source project that provides pandas APIs on top of Apache Spark. Pandas is the standard tool for data science and is typically the first step in exploring and manipulating a data set, but it does not scale well to big data. Koalas fills the gap by providing pandas-equivalent APIs that run on Apache Spark.

Many other libraries also try to scale the pandas APIs, such as Vaex and Modin. Dask is one of them and is very popular among pandas users; it runs on its own cluster, much as Koalas runs on top of a Spark cluster. In this talk, we will introduce Koalas and its current status, and compare Koalas with Dask, including benchmarks.
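
To make the idea of "pandas-equivalent APIs" concrete, here is a minimal sketch (not taken from the talk) in which the same helper is applied to a pandas DataFrame and, optionally, to a Koalas DataFrame. The helper names `mean_by_group` and `run_on_koalas` and the sample data are hypothetical and chosen only for illustration; the Koalas part assumes the `koalas` package and a working PySpark installation, so it is wrapped in a function that is not called by default.

```python
import pandas as pd


def mean_by_group(df, key, value):
    """Average `value` per `key`.

    Uses only groupby/mean, which both pandas and Koalas DataFrames expose.
    """
    return df.groupby(key)[value].mean()


# Hypothetical sample data, small enough to fit in memory.
pdf = pd.DataFrame(
    {
        "species": ["koala", "koala", "panda"],
        "weight_kg": [9.0, 11.0, 100.0],
    }
)

# Plain pandas: runs on a single machine.
pandas_result = mean_by_group(pdf, "species", "weight_kg")


def run_on_koalas(pandas_df):
    """Run the same helper on Spark through Koalas.

    Assumes `koalas` and PySpark are installed; not called by default.
    """
    import databricks.koalas as ks

    kdf = ks.from_pandas(pandas_df)  # distribute the data as a Koalas DataFrame
    result = mean_by_group(kdf, "species", "weight_kg")
    # Row order is not guaranteed on Spark, so sort before collecting.
    return result.sort_index().to_pandas()
```

The point of the sketch is that `mean_by_group` does not change between the two cases; only the DataFrame passed to it does.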

In this session watch:
Takuya Ueshin, Software Engineer, Databricks
Xinrong Meng, Developer, Databricks

 

Takuya Ueshin

Takuya Ueshin is a software engineer at Databricks and an Apache Spark committer and PMC member. His main interests are in Spark SQL internals, a.k.a. Catalyst, and also PySpark. He is one of the...

Xinrong Meng

Xinrong is a software engineer at Databricks. Her main interests are in Koalas and PySpark. She is one of the major contributors to the Koalas project.