Spark + Parquet In Depth - Databricks



What if you could get the simplicity, convenience, interoperability, and storage niceties of an old-fashioned CSV with the speed of a NoSQL database and the storage requirements of a gzipped file? Enter Parquet.

At The Weather Company, Parquet files are a quietly awesome and deeply integral part of our Spark-driven analytics workflow. Using Spark + Parquet, we’ve built a blazing fast, storage-efficient, query-efficient data lake and a suite of tools to accompany it.

We will give a technical overview of how Parquet works and how recent improvements from Tungsten enable Spark SQL to take advantage of this design, providing fast queries by overcoming two major bottlenecks of distributed analytics: communication costs (I/O-bound) and data decoding (CPU-bound).

Learn more:

  • Reading Parquet Files
  • Spark SQL: Another 16x Faster After Tungsten
About Robbie Strickland

Robbie has been involved in the big data community for the last seven years, and he was an early Spark adopter back in 2014. He has contributed to a number of projects, including Apache Cassandra and the Cassandra Spark connector, and is the author of Cassandra High Availability. At IBM, Robbie leads a group that includes the Spark Technology Center, as well as BigInsights and other data processing technologies that power the Watson Data Platform.

About Emily Curtin

Emily is a Software Engineer at The Weather Company (now IBM), working on the data engineering platform team. She lives in her hometown of Atlanta, GA, with her husband, where she can often be found on the Chattahoochee River in a kayak.