Random Walks on Large Scale Graphs with Apache Spark

Download Slides

Random Walks on graphs is a useful technique in machine learning, with applications in personalized PageRank, representational learning and others. This session will describe a novel algorithm for enumerating walks on large-scale graphs that benefits from the several unique abilities of Apache Spark. The algorithm generates a recursive branching DAG of stages that separates out the “closed” and “open” walks. Spark’s shuffle file management system is ingeniously used to accumulate the walks while the computation is progressing. In-memory caching over multi-core executors enables moving the walks several “steps” forward before shuffling to the next stage. See performance benchmarks, and hear about LinkedIn’s experience with Spark in production clusters. The session will conclude with an observation of how Spark’s unique and powerful construct opens new models of computation, not possible with state-of-the-art, for developing high-performant and scalable algorithms in data science and machine learning. Session hashtag: #SFds11

« back
About Min Shen

Min Shen is a tech lead at LinkedIn. His team's focus is to build and scale LinkedIn's general purpose batch compute engine based on Apache Spark. The team empowers multiple use cases at LinkedIn ranging from data explorations, data engineering, to ML model training. Prior to this, Min mainly worked on Apache YARN. He holds a PhD degree in Computer Science from University of Illinois at Chicago.