Ryan Blue works at Netflix on open source projects, including Spark, Avro, and Parquet.
DataSourceV2 is Spark's new API for working with data from tables and streams, but "v2" also includes a set of changes to SQL internals, the addition of a catalog API, and changes to the DataFrame read and write APIs. This talk will cover the context for those additional changes and how "v2" will make Spark more reliable and predictable for building enterprise data pipelines. This talk will include:
- Problem areas where the current behavior is unpredictable or unreliable
- The new standard SQL write plans (and the related SPIP)
- The new table catalog API and a new Scala API for table DDL operations (and the related SPIP), sketched after this list
- Netflix's use case that motivated these changes
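To make the write and catalog changes concrete, here is a minimal Scala sketch of a v2-style write. It assumes the shape these proposals eventually took in Spark 3.x (the DataFrameWriterV2 API and pluggable catalogs); the catalog name, catalog implementation class, and table name are placeholders for illustration, not details from the talk.

```scala
import org.apache.spark.sql.SparkSession

object V2WriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dsv2-write-sketch")
      // "demo" and the Iceberg catalog class are assumptions used only to show
      // how a v2 catalog is registered; any v2 catalog implementation works here.
      .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
      .getOrCreate()

    val df = spark.range(0, 1000).toDF("id")

    // v2 writes use explicit verbs backed by standard SQL write plans
    // instead of the ambiguous SaveMode behavior of the v1 writer.
    df.writeTo("demo.db.events").createOrReplace()

    // Appends go through the same validation rules as SQL INSERT INTO.
    df.writeTo("demo.db.events").append()

    spark.stop()
  }
}
```

Each verb maps to a single, well-defined logical plan, which is what makes the behavior predictable across sources.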
In the last two years, Netflix has seen a mass migration to Spark from Pig and other MapReduce engines. This talk will focus on the challenges of that migration and the work that has made it possible, including contributions Netflix has made to Spark to enable wider adoption and ongoing projects to make Spark appeal to a broader range of analysts beyond data and ML engineers.
Netflix's Big Data Platform team manages a data warehouse in Amazon S3 with over 60 petabytes of data and writes hundreds of terabytes of data every day. At this scale, output committers that create extra copies or can't handle task failures are no longer practical. This talk will explain the problems caused by the available committers when writing to S3 and show how Netflix solved the committer problem.
In this session, you'll learn:
- Some background about Spark at Netflix
- About output committers, and how both Spark and Hadoop handle failures
- How HDFS and S3 differ, and why HDFS committers don't work well
- A new output committer that uses the S3 multi-part upload API
- How you can use this new committer in your Spark applications to avoid duplicating data (a configuration sketch follows this list)
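As a rough illustration of that last point, the sketch below wires a multipart-upload-based committer into a Spark job. It assumes the S3A committers that Hadoop later shipped based on this work, plus Spark's cloud-integration module; the exact configuration keys, class names, bucket, and path are assumptions that depend on your Spark and Hadoop builds rather than something prescribed by the talk.

```scala
import org.apache.spark.sql.SparkSession

object S3CommitterSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("s3-committer-sketch")
      // Route SQL writes through a commit protocol that commits files with
      // S3 multipart uploads instead of HDFS-style directory renames.
      .config("spark.sql.sources.commitProtocolClass",
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
      .config("spark.sql.parquet.output.committer.class",
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
      // The "directory" committer stages task output locally and completes the
      // multipart uploads at job commit.
      .config("spark.hadoop.fs.s3a.committer.name", "directory")
      .getOrCreate()

    // The bucket and path are placeholders.
    spark.range(0, 1000)
      .write
      .mode("append")
      .parquet("s3a://example-bucket/warehouse/events")

    spark.stop()
  }
}
```

Because a multipart upload is not visible until it is completed at job commit, failed or speculative tasks never leave partial or duplicate files in the table location.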
Session hashtag: #SFdev7