IndexedRDD: Efficient Fine-Grained Updates for RDDs


Spark’s core abstraction is the RDD, an immutable distributed dataset. Spark requires immutability to enable dataset reuse, fault tolerance, and straggler mitigation. But new Spark applications such as streaming aggregation and incremental graph processing seem to need mutation: a new tweet requires updating a user’s tweet count; a new movie rating requires updating a small number of predictions. Existing solutions sacrifice either flexibility or efficiency. Bulk transformations are wasteful for small updates because they process the whole dataset to change a few records. Direct mutation sacrifices fault tolerance. Even complex solutions, such as storing data in a durable, atomically updated external database, encounter problems with dataset reuse and complex dependency graphs. This talk will introduce IndexedRDD, our solution for fine-grained RDD updates that retains all of Spark’s advantages. IndexedRDD uses a range of techniques from functional programming and versioned databases. We will describe its implementation, its solutions to GC overhead and memory constraints, and its performance.
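The general idea of updating an immutable dataset without copying all of it can be illustrated with a small persistent data structure. The sketch below is a plain-Python, path-copying binary search tree, not IndexedRDD's actual implementation (the abstract does not say which structure it uses); the names `Node`, `put`, and `get` and the sample tweet counts are hypothetical, chosen only for illustration. Each update returns a new version that shares every untouched subtree with the old one, so the old version stays valid for reuse or recovery.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """One immutable node of a persistent binary search tree (hypothetical)."""
    key: int
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def put(node, key, value):
    """Return a new tree with key set to value; the input tree is unchanged."""
    if node is None:
        return Node(key, value)
    if key < node.key:
        # Copy this node, rebuild only the left path, share the right subtree.
        return Node(node.key, node.value, put(node.left, key, value), node.right)
    if key > node.key:
        # Copy this node, rebuild only the right path, share the left subtree.
        return Node(node.key, node.value, node.left, put(node.right, key, value))
    # Key found: replace the value, share both children.
    return Node(key, value, node.left, node.right)


def get(node, key):
    """Look up key in the given version; return None if it is absent."""
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return None


# Version 1: tweet counts keyed by user id (made-up sample data).
v1 = None
for user_id, count in [(50, 3), (20, 7), (80, 1), (10, 4), (90, 2)]:
    v1 = put(v1, user_id, count)

# A new tweet from user 90: derive version 2 with one fine-grained update.
v2 = put(v1, 90, get(v1, 90) + 1)

print(get(v1, 90), get(v2, 90))  # old version still reads 2, new version reads 3
print(v2.left is v1.left)        # True: the untouched left subtree is shared
```

Only the nodes on the path from the root to the updated key are copied, so the cost of an update grows with the depth of the tree rather than with the size of the dataset. This tree is unbalanced for brevity; a practical structure would bound its depth.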