What if you could get the simplicity, convenience, interoperability, and storage niceties of an old-fashioned CSV with the speed of a NoSQL database and the storage requirements of a gzipped file? Enter Parquet.
At The Weather Company, Parquet files are a quietly awesome and deeply integral part of our Spark-driven analytics workflow. Using Spark + Parquet, we’ve built a blazing-fast, storage-efficient, query-efficient data lake and a suite of tools to accompany it.
We will give a technical overview of how Parquet works and how recent improvements from Tungsten enable Spark SQL to take advantage of this design. Together, they provide fast queries by overcoming two major bottlenecks of distributed analytics: communication costs (I/O-bound) and data decoding (CPU-bound).
Robbie has been involved in the big data community for the last seven years, and he was an early Spark adopter back in 2014. He has contributed to a number of projects, including Apache Cassandra and the Cassandra Spark connector, and is the author of Cassandra High Availability. At IBM, Robbie leads a group that includes the Spark Technology Center, as well as BigInsights and other data processing technologies that power the Watson Data Platform.
Emily is a Software Engineer at the IBM Spark Technology Center. She lives in her hometown of Atlanta, GA, with her husband, and can often be found kayaking on the Chattahoochee River.