Ryan Blue, Netflix

Ryan Blue works on open source projects at Netflix, including Apache Spark, Avro, and Parquet.

Past sessions

Summit 2019: Improving Apache Spark’s Reliability with DataSourceV2

April 23, 2019 05:00 PM PT

DataSourceV2 is Spark's new API for working with data from tables and streams, but "v2" also includes a set of changes to SQL internals, the addition of a catalog API, and changes to the DataFrame read and write APIs. This talk will cover the context for those additional changes and how "v2" will make Spark more reliable and predictable for building enterprise data pipelines. This talk will include:
- Problem areas where the current behavior is unpredictable or unreliable
- The new standard SQL write plans (and the related SPIP)
- The new table catalog API and a new Scala API for table DDL operations (and the related SPIP)
- Netflix's use case that motivated these changes
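
As a flavor of where these changes surface for users, here is a minimal sketch of a catalog-aware write using the DataFrameWriterV2 API that later shipped with DataSourceV2 in Spark 3.x; the catalog name, its implementation class, and the table identifier are hypothetical.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("dsv2-sketch")
      // Register a v2 catalog named "demo"; com.example.DemoCatalog stands in
      // for any TableCatalog implementation.
      .config("spark.sql.catalog.demo", "com.example.DemoCatalog")
      .getOrCreate()

    val df = spark.range(100).withColumnRenamed("id", "event_id")

    // v2 writes are resolved through the catalog, so CREATE, APPEND, and
    // REPLACE have standard, predictable semantics instead of behavior that
    // varies by data source.
    df.writeTo("demo.db.events").create()  // CTAS through the catalog
    df.writeTo("demo.db.events").append()  // append to the existing table
    // df.writeTo("demo.db.events").createOrReplace()  // atomic replace where supported

Unlike the v1 path (df.write.format(...).save(...)), the table's identity and DDL semantics here come from the catalog, which is what makes write behavior predictable across sources.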

Summit 2019: Migrating to Apache Spark at Netflix

April 23, 2019 05:00 PM PT

In the last two years, Netflix has seen a mass migration to Spark from Pig and other MapReduce engines. This talk will focus on the challenges of that migration and the work that has made it possible, including contributions Netflix has made to Spark to enable wider adoption and ongoing projects to make Spark appeal to a broader range of analysts, beyond data and ML engineers.

Summit 2017: Improving Apache Spark with S3

June 6, 2017 05:00 PM PT

Netflix's Big Data Platform team manages a data warehouse in Amazon S3 with over 60 petabytes of data and writes hundreds of terabytes of data every day. At this scale, output committers that create extra copies or can't handle task failures are no longer practical. This talk will explain the problems caused by the available committers when writing to S3, and show how Netflix solved the committer problem.
In this session, you'll learn:
- Some background about Spark at Netflix
- About output committers, and how both Spark and Hadoop handle failures
- How HDFS and S3 differ, and why committers designed for HDFS don't work well on S3
- A new output committer that uses the S3 multipart upload API (see the sketch after this list)
- How you can use this new committer in your Spark applications to avoid duplicating data
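
The core idea is that an S3 multipart upload is invisible to readers until it is explicitly completed, so tasks can stage their output as pending uploads and the job commit step simply completes them, with no copy or rename. Below is a minimal sketch of that lifecycle using the AWS SDK for Java (v1) from Scala; the bucket, key, and file paths are hypothetical, and the real committer additionally handles part sizing, retries, and concurrent task attempts.

    import java.io.File
    import scala.collection.JavaConverters._
    import com.amazonaws.services.s3.AmazonS3ClientBuilder
    import com.amazonaws.services.s3.model._

    val s3 = AmazonS3ClientBuilder.defaultClient()
    val (bucket, key) = ("my-warehouse", "db/table/part-00000.parquet") // hypothetical

    // Task attempt: start a multipart upload and upload the task's output.
    // Nothing becomes visible in S3 at this point.
    val uploadId = s3
      .initiateMultipartUpload(new InitiateMultipartUploadRequest(bucket, key))
      .getUploadId

    val data = new File("/tmp/part-00000.parquet") // the task's local output
    val etag = s3.uploadPart(new UploadPartRequest()
        .withBucketName(bucket).withKey(key)
        .withUploadId(uploadId)
        .withPartNumber(1)
        .withFile(data)
        .withPartSize(data.length()))
      .getPartETag

    // Job commit: completing the upload publishes the object, so commit is
    // a metadata operation rather than a copy of the data.
    s3.completeMultipartUpload(
      new CompleteMultipartUploadRequest(bucket, key, uploadId, List(etag).asJava))

    // On task or job failure, abort the pending upload instead; no partial
    // or duplicated data is ever visible:
    // s3.abortMultipartUpload(new AbortMultipartUploadRequest(bucket, key, uploadId))

To plug a committer like this into a Spark application's Parquet writes, the spark.sql.parquet.output.committer.class setting can point at the committer implementation, for example spark.sql.parquet.output.committer.class=com.example.S3MultipartOutputCommitter (the class name here is a placeholder).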

Session hashtag: #SFdev7