Make your PySpark Data Fly with Arrow! - Databricks

In the big data world, it's not always easy for Python users to move huge amounts of data around. Apache Arrow defines a common format for data interchange, while Arrow Flight, introduced in version 0.11.0, provides a means to move that data efficiently between systems. Arrow Flight is a framework for Arrow-based messaging built with gRPC. It enables data microservices in which clients produce and consume streams of Arrow data and share them over the wire. In this session, I'll give a brief overview of Arrow Flight from a Python perspective and show that it's easy to build high-performance connections when systems can talk Arrow. I'll also cover some ongoing work on using Arrow Flight to connect PySpark with TensorFlow, two systems with great Python APIs but very different underlying data representations.

About Bryan Cutler

Bryan Cutler is a software engineer at IBM's Spark Technology Center, where he works on big data analytics and machine learning systems. He is a contributor to Apache Spark in the areas of ML, SQL, Core, and Python, and a committer for the Apache Arrow project. His interests are in pushing the boundaries of software to build high-performance tools that are also a snap to use.