Pandas is the de facto standard single-node DataFrame implementation in Python. As datasets grow larger, however, its performance becomes a limiting factor. Spark, on the other hand, has become a very popular choice for analyzing large datasets in the past few years. Because there is an API gap between pandas and Spark, users who switch from pandas to Spark often need to rewrite their programs. Ibis is a library designed to bridge the gap between local execution (pandas) and cluster execution (BigQuery, Impala, etc.). In this talk, we will introduce a Spark backend for Ibis and demonstrate how users can move between pandas and Spark with the same code.
Two Sigma Investments
Li Jin is a software engineer at Two Sigma. Li focuses on building high-performance data analysis tools with Python and Spark for financial data. Li is a co-creator of Flint, a time series analysis library on Spark. Previously, Li worked on building large-scale task scheduling systems. In his spare time, Li loves hiking, traveling, and winter sports.