Robust and Scalable ETL Over Cloud Storage with Spark


The majority of reported Spark deployments are now in the cloud. In such an environment, it is preferable for Spark to access data directly from object storage services such as Amazon S3, thereby decoupling storage from compute. However, object stores such as S3 have limitations. Chained or concurrent ETL jobs often run into issues on S3 because file listings can be inconsistent and there is no atomic rename. Metadata performance also becomes a bottleneck when jobs run over many thousands to millions of files.
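
To make the rename problem concrete, here is a minimal sketch in Python. It is not Spark's or S3's actual implementation: `ToyObjectStore`, `rename_prefix_steps`, and `observe_partial_commit` are hypothetical names invented for illustration. It assumes the common pattern in which a job writes to a temporary location and then "commits" by renaming it. On a POSIX file system that is a single `os.rename` call; on a flat key/value store a rename has to be emulated as a copy and delete per object, so a concurrent reader that lists the destination can see a partially committed output.

```python
import os
import tempfile


def commit_by_rename(tmp_dir: str, final_dir: str) -> None:
    """On a POSIX file system, renaming a directory is one atomic step."""
    os.rename(tmp_dir, final_dir)


class ToyObjectStore:
    """Hypothetical in-memory stand-in for a flat key/value object store."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def list(self, prefix):
        # Returns a new sorted list, so callers may mutate the store while iterating.
        return sorted(k for k in self.objects if k.startswith(prefix))

    def rename_prefix_steps(self, src, dst):
        """Emulate a rename as copy-then-delete per object.

        Yields after each object is moved, which marks the points at which
        another job could list the destination prefix.
        """
        for key in self.list(src):
            new_key = dst + key[len(src):]
            self.objects[new_key] = self.objects[key]  # copy
            del self.objects[key]                      # delete
            yield new_key


def observe_partial_commit():
    """Record how many output objects a reader would see after each move."""
    store = ToyObjectStore()
    for i in range(3):
        store.put(f"_tmp/part-{i}", b"rows")
    seen = []
    for _ in store.rename_prefix_steps("_tmp/", "out/"):
        seen.append(len(store.list("out/")))
    return seen
```

In this toy model, `observe_partial_commit()` records intermediate states in which only some of the three parts are visible under `out/`. A downstream job that starts during one of those states would process incomplete input, which is the kind of failure chained ETL jobs can hit when the commit step is not atomic.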