Han Wang is a Staff Engineer at Lyft, where he leads the company’s Spark machine learning projects. Before Lyft, he worked at Quantlab, Amazon, Microsoft and Hudson River Trading, focusing on distributed computation and machine learning problems.
If you have struggled to choose among computing and machine learning frameworks such as Spark, Dask, Scikit-learn and TensorFlow for your ETL and machine learning projects, have you thought about unifying them into one ecosystem? In this talk, we present a framework we developed for that purpose: Fugue. It is an abstraction layer on top of these frameworks, and it also provides a SQL-like language that can represent your pipelines from end to end and is highly extensible with Python. With Fugue, it is a lot easier and faster to create reliable, performant and portable pipelines than with native Spark, especially for non-expert users.
We will demonstrate how we implemented the Node2Vec algorithm on top of Fugue so that it can run on different computing frameworks. Using Spark as the backend, it can process graphs with 100 million vertices and 3 billion edges in a few hours.
We have also built a unified interactive environment based on Kubernetes, Spark and Fugue, and we will show the performance improvements seen in projects migrated to this system. We will also discuss future plans for the Fugue project, including Fugue ML and Fugue Streaming. Our goal is to create a unified ecosystem for distributed computing and machine learning.