Koalas: Pandas on Apache Spark - Databricks

In this tutorial we will present Koalas, a new open-source project that we announced at the Spark + AI Summit in April. Koalas is a Python package that implements the pandas API on top of Apache Spark, making the pandas API scalable to big data. Using Koalas, data scientists can make the transition from a single machine to a distributed environment without needing to learn a new framework.

We will demonstrate the functionality added to Koalas since its initial release, discuss its roadmap, and explain how we think Koalas could become the standard API for large-scale data science.

What you will learn:

  • How to get started with Koalas
  • How to transition easily from pandas to Koalas on Apache Spark
  • Similarities between the pandas and Koalas APIs for DataFrame transformation and feature engineering
  • How single-machine pandas compares with the distributed environment of Koalas
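
As a rough illustration of the pandas-to-Koalas transition, the sketch below writes one function using only DataFrame calls that exist in both libraries, then runs it on a pandas DataFrame. The function names, column names, and toy data are hypothetical and invented for this example. The Koalas path is wrapped in a helper with a lazy import, since it assumes `pip install koalas` and a working Spark installation.

```python
import pandas as pd


def mean_by_key(df):
    """Average `value` per `key`, using only calls shared by pandas and Koalas."""
    return df.groupby("key")["value"].mean().sort_index()


# Hypothetical toy data, invented for this example.
pdf = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1.0, 2.0, 3.0, 4.0]})

# Single-machine pandas.
pandas_result = mean_by_key(pdf)


def mean_by_key_on_spark(pandas_df):
    """Run the same function on a Koalas DataFrame backed by Spark.

    The import is lazy because it assumes Koalas and Spark are installed.
    """
    import databricks.koalas as ks

    kdf = ks.from_pandas(pandas_df)      # convert to a distributed Koalas DataFrame
    return mean_by_key(kdf).to_pandas()  # collect the small result back into pandas
```

With Koalas and Spark available, `mean_by_key_on_spark(pdf)` should return the same per-key averages as `pandas_result`; the body of `mean_by_key` does not change between the two environments.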

Prerequisites:

  • A fully charged laptop (8-16 GB of memory) with Chrome or Firefox
  • Python 3 and pip pre-installed
  • Koalas installed from PyPI (pip install koalas)
  • Read the Koalas documentation


About Tim Hunter

Tim Hunter is a software engineer at Databricks and contributes to the Apache Spark MLlib project, as well as the GraphFrames, TensorFrames and Deep Learning Pipelines libraries. He has been building distributed Machine Learning systems with Spark since version 0.2, before Spark was an Apache Software Foundation project.

About Brooke Wenig

Brooke Wenig is the Machine Learning Practice Lead at Databricks. She advises customers on machine learning pipelines and implements those pipelines for them, and she teaches customers how to use Spark for Machine Learning and Deep Learning. She received an MS in Computer Science from UCLA with a focus on distributed machine learning. She speaks Mandarin Chinese fluently and enjoys cycling.