Rolf Jagerman - Databricks

Rolf Jagerman

PhD Student, University of Amsterdam

Rolf is a PhD candidate at the University of Amsterdam working on Online Machine-learned Ranking and Information Retrieval. He previously obtained an M.Sc. in Computer Science from ETH Zürich, where he wrote his thesis on Web-scale topic modeling with Spark using an asynchronous parameter server.



Glint: An Asynchronous Parameter Server for Spark - Spark Summit Europe 2016

Glint is an asynchronous parameter server implementation for Spark. A parameter server provides a shared interface to the values of a distributed vector or matrix. Users can query and update values without worrying about locking schemes, network communication, or synchronization. Parameter servers are widely used in large-scale machine learning tasks where the model is too large to fit on a single machine (e.g. topic modeling, deep learning).

Glint is designed specifically to interact with Spark. The parameter servers are easy to set up and use: you can spawn them on your existing Spark cluster with just two lines of code. We demonstrate this ease of use with a live demo in the Spark shell that outlines some of the basic functionality of the library.

By creating a Spark-compatible parameter server, we are able to implement LightLDA, a state-of-the-art LDA inference algorithm, in Spark. The resulting architecture allows for the computation of topic models that are several orders of magnitude larger, in both dataset size and number of topics, than what was achievable using existing Spark MLlib implementations.

Key takeaways: Glint is an easy-to-use parameter server implementation designed specifically for Spark. It is useful for large-scale machine learning tasks where the model is too large to fit on a single machine. Our experimental results on LDA topic modeling show increased scalability, in both dataset size and number of topics, compared to the default MLlib implementations.
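To make the idea of "a shared interface to the values of a distributed vector" concrete, here is a minimal, single-process sketch in Python. It is not Glint's API and does not use Spark: the class `ToyShardedVector`, its `pull` and `push` methods, and the modulo-based sharding rule are all assumptions made for illustration. Each shard stands in for one parameter server, and threads stand in for workers that update the vector without coordinating with each other.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ToyShardedVector:
    """Hypothetical in-process stand-in for a vector spread over parameter servers."""

    def __init__(self, size, num_shards):
        self.size = size
        self.num_shards = num_shards
        # Assumed layout: key k lives on shard k % num_shards,
        # at local offset k // num_shards.
        self.shards = [
            [0.0] * len(range(s, size, num_shards)) for s in range(num_shards)
        ]
        # One lock per shard, so callers never deal with locking themselves.
        self.locks = [threading.Lock() for _ in range(num_shards)]

    def _locate(self, key):
        if not 0 <= key < self.size:
            raise IndexError(f"key {key} out of range")
        return key % self.num_shards, key // self.num_shards

    def pull(self, keys):
        """Query the current values for the given keys."""
        values = []
        for key in keys:
            shard, offset = self._locate(key)
            with self.locks[shard]:
                values.append(self.shards[shard][offset])
        return values

    def push(self, keys, deltas):
        """Add each delta to the value stored under the matching key."""
        for key, delta in zip(keys, deltas):
            shard, offset = self._locate(key)
            with self.locks[shard]:
                self.shards[shard][offset] += delta


def worker(vector, keys):
    # Each worker pushes an increment of 1.0 for every key it touches.
    vector.push(keys, [1.0] * len(keys))


vector = ToyShardedVector(size=10, num_shards=3)
with ThreadPoolExecutor(max_workers=4) as pool:
    for _ in range(8):
        pool.submit(worker, vector, [0, 4, 9])

print(vector.pull([0, 1, 4, 9]))  # [8.0, 0.0, 8.0, 8.0]
```

Updates are expressed as additive deltas rather than overwrites, so the final state does not depend on the order in which workers run. A real parameter server would additionally route each request over the network to the machine that owns the shard.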