Organizations commonly use Apache Spark to gain actionable insight from their large amounts of data. Often, these analytics are in the form of data processing pipelines, where there are a series of processing stages, and each stage performs a particular function, and the output of one stage is the input of the next stage. There are several examples of pipelines, such as log processing, IoT pipelines, and machine learning. The common attribute among different pipelines is the sharing of data between stages. It is also common for Spark pipelines to process data stored in the public cloud, such as Amazon S3, Microsoft Azure Blob Storage, or Google Cloud Storage. The global availability and cost effectiveness of these public cloud storage services make them the preferred storage for data. However, running pipeline jobs while sharing data via cloud storage can be expensive in terms of increased network traffic, and slower data sharing and job completion times. Using Alluxio, a memory speed virtual distributed storage system, enables sharing data between different stages or jobs at memory speed. By reading and writing data in Alluxio, the data can stay in memory for the next stage of the pipeline, and this result in great performance gains. In this talk, we discuss how Alluxio can be deployed and used with a Spark data processing pipeline in the cloud. We show how pipeline stages can share data with Alluxio memory for improved performance benefits, and how Alluxio can improves completion times and reduces performance variability for Spark pipelines in the cloud.
Session hashtag: #EUde5
Gene Pang is one of PMCs and maintainers of the Alluxio open source project and a founding member at Alluxio, Inc. He recently graduated with a Ph.D. from the AMPLab at UC Berkeley, working on distributed database systems. Before starting at Berkeley, he worked at Google and has an M.S. from Stanford University, and a B.S. from Cornell University.