Random Walks on graphs is a useful technique in machine learning, with applications in personalized PageRank, representational learning and others. This session will describe a novel algorithm for enumerating walks on large-scale graphs that benefits from the several unique abilities of Apache Spark. The algorithm generates a recursive branching DAG of stages that separates out the “closed” and “open” walks. Spark’s shuffle file management system is ingeniously used to accumulate the walks while the computation is progressing. In-memory caching over multi-core executors enables moving the walks several “steps” forward before shuffling to the next stage. See performance benchmarks, and hear about LinkedIn’s experience with Spark in production clusters. The session will conclude with an observation of how Spark’s unique and powerful construct opens new models of computation, not possible with state-of-the-art, for developing high-performant and scalable algorithms in data science and machine learning. Session hashtag: #SFds11
Min Shen is an engineer on LinkedIn’s Hadoop infrastructure development team, where he builds services and tools to tackle scaling challenges in operating large-scale multi-tenancy Hadoop deployment. Recently, he has been helping with creating tools to support operating Spark at scale as well as developing and running Spark jobs easily at LinkedIn.