If you are running Apache Spark in cloud environments, object stores, such as Amazon S3 or Azure WASB, are a core part of your system. What you can’t do is treat them like “just another filesystem”: do that and things will, eventually, go horribly wrong. This talk looks at the object stores in the cloud infrastructures, including their underlying architectures, compares them to what a “real filesystem” is expected to do, and shows how to use object stores efficiently and safely as sources and destinations of data. It goes into depth on recent “S3a” work, showing improvements in performance, security, functionality and measurement, and demonstrating how to make best use of it from a Spark application. If you are planning to deploy Spark in the cloud, or are doing so today, this is information you need to understand. The performance of your code and the integrity of your data depend on it.
Steve Loughran works at Hortonworks on leading-edge Hadoop applications, most recently on high-performance Amazon S3 storage support in Hadoop and Spark, as well as long-lived YARN services. He's the author of Ant in Action, a member of the Apache Software Foundation, and a committer on the Hadoop core since 2009. Prior to joining Hortonworks in 2012, he was a Research Scientist at HP Laboratories. He lives and works in Bristol, England. For fun he falls off bicycles in the local woodland.