Emily Curtin - Databricks

Emily Curtin

Software Engineer, The Weather Company / IBM

Emily is a Software Engineer at The Weather Company (now IBM) working on the data engineering platform team. She lives in her hometown of Atlanta, GA with her husband where she can often be found on the Chattahoochee river in a kayak.



Spark + Parquet In Depth
Summit East 2017

What if you could get the simplicity, convenience, interoperability, and storage niceties of an old-fashioned CSV with the speed of a NoSQL database and the storage requirements of a gzipped file? Enter Parquet. At The Weather Company, Parquet files are a quietly awesome and deeply integral part of our Spark-driven analytics workflow. Using Spark + Parquet, we’ve built a blazing-fast, storage-efficient, query-efficient data lake and a suite of tools to accompany it. We will give a technical overview of how Parquet works and how recent improvements from Tungsten enable Spark SQL to take advantage of this design to provide fast queries by overcoming two major bottlenecks of distributed analytics: communication costs (I/O bound) and data decoding (CPU bound).

Learn more:
  • Reading Parquet Files
  • Spark SQL: Another 16x Faster After Tungsten

  • Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark
    Summit Europe 2017

    spark-bench is an open-source benchmarking tool, and it's also so much more. spark-bench is a flexible system for simulating, comparing, testing, and benchmarking Spark applications and Spark itself. spark-bench began as a benchmarking suite to get timing numbers on very specific algorithms, mostly in the machine learning domain. Since then it has morphed into a highly configurable and flexible framework suitable for many use cases. This talk will discuss the high-level design and capabilities of spark-bench before walking through some major, practical use cases. Use cases include, but are certainly not limited to: regression testing changes to Spark; comparing performance of different hardware and Spark tuning options; simulating multiple notebook users hitting a cluster at the same time; comparing parameters of a machine learning algorithm on the same set of data; providing insight into bottlenecks through use of compute-intensive and I/O-intensive workloads; and, yes, even benchmarking. In particular, this talk will address the use of spark-bench in developing new features for Spark core. Session hashtag: #EUeco8