Takuya Ueshin

Software Engineer, Databricks

Takuya Ueshin is a software engineer at Databricks, and an Apache Spark committer and a PMC member. His main interests are in Spark SQL internal, a.k.a. Catalyst, and also PySpark. He is one of the major contributors of the Koalas project.

Past sessions

Summit 2021 Koalas: How Well Does Koalas Work?

May 26, 2021 04:25 PM PT

Koalas is an open source project that provides pandas APIs on top of Apache Spark. Pandas is the standard tool for data science and it is typically the first step to explore and manipulate a data set, but pandas does not scale well to big data. Koalas fills the gap by providing pandas equivalent APIs that work on Apache Spark.

There are also many libraries trying to scale pandas APIs, such as Vaex, Modin, and so on. Dask is one of them and very popular among pandas users, and also works on its own cluster similar to Koalas which is on top of Spark cluster.In this talk, we will introduce Koalas and its current status, and the comparison between Koalas and Dask, including benchmarking.

In this session watch:
Takuya Ueshin, Software Engineer, Databricks
Xinrong Meng, Developer, Databricks

[daisna21-sessions-od]

Summit Europe 2020 Koalas: Interoperability Between Koalas and Apache Spark

November 18, 2020 04:00 PM PT

Koalas is an open source project that provides pandas APIs on top of Apache Spark. pandas is a Python package commonly used among data scientists, but it does not scale out in a distributed manner. Koalas fills the gap by providing pandas equivalent APIs that work on Apache Spark. Koalas is useful for not only pandas users but also PySpark users. For example, PySpark users can visualize their data directly from their PySpark DataFrame via the Koalas plotting APIs such as plotting. In addition, Koalas users can leverage PySpark specific APIs such as higher-order functions and a rich set of SQL APIs. In this talk, we will focus on the PySpark aspect and the interaction between PySpark and Koalas in order for PySpark users to leverage their knowledge of Apache Spark in Koalas.

Speakers: Takuya Ueshin and Haejoon Lee

Koalas is an open-source project that aims at bridging the gap between big data and small data for data scientists and at simplifying Apache Spark for people who are already familiar with pandas library in Python. Pandas is the standard tool for data science and it is typically the first step to explore and manipulate a data set, but pandas does not scale well to big data. With Koalas, data scientist can use the same APIs as pandas' but at scale with PySpark. In this talk, I introduce Koalas and its updates, and also show some comparisons between pandas and Koalas, then deep-dive into its internal structures and how it works with Spark.

Summit Europe 2019 Koalas: Making an Easy Transition from Pandas to Apache Spark

October 15, 2019 05:00 PM PT

In this talk, we present Koalas, a new open-source project that aims at bridging the gap between the big data and small data for data scientists and at simplifying Apache Spark for people who are already familiar with the pandas library in Python.

Pandas is the standard tool for data science in python, and it is typically the first step to explore and manipulate a data set by data scientists. The problem is that pandas does not scale well to big data. It was designed for small data sets that a single machine could handle.

When data scientists work today with very large data sets, they either have to migrate to PySpark to leverage Spark or downsample their data so that they can use pandas. This presentation will give a deep dive into the conversion between Spark and pandas dataframes.

Through live demonstrations and code samples, you will understand: - how to effectively leverage both pandas and Spark inside the same code base - how to leverage powerful pandas concepts such as lightweight indexing with Spark - technical considerations for unifying the different behaviors of Spark and pandas