Han Wang is a Staff Engineer at Lyft, leading the company's Spark Machine Learning projects. Before Lyft, he worked at Quantlab, Amazon, Microsoft and Hudson River Trading, focusing on distributed computation and Machine Learning problems.
While struggling to choose among Apache Spark, TensorFlow, Druid, Dask, Flink, etc. for your ETL and Machine Learning projects, have you considered unifying them into one ecosystem? In this talk, we introduce such a framework we developed -- Fugue. It is an abstraction layer on top of these different solutions, providing a SQL-like language that can represent your pipelines end to end and is highly extensible with Python. With the Fugue framework, it is much easier and faster to create reliable, performant and maintainable pipelines than with native Spark, especially for entry-level users. Within a unified environment based on Spark on Kubernetes (K8s), we will show how a user can work in a dedicated, customized workspace for interactive development (Jupyter notebooks), batch processing and near real-time streaming jobs.
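To make the Python extensibility concrete, here is a minimal sketch of the pattern the talk describes: business logic written as an ordinary pandas function, which an abstraction layer like Fugue can then run on a distributed engine. The column names are illustrative, and the commented-out call assumes the `fugue` package and a Spark environment are available.

```python
import pandas as pd

# A plain pandas function with no Spark or Fugue dependency.
# Fugue can apply it unchanged on a distributed engine.
def add_total(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["total"] = out["fare"] + out["tip"]
    return out

rides = pd.DataFrame({"fare": [10.0, 20.0], "tip": [1.5, 3.0]})
print(add_total(rides)["total"].tolist())  # [11.5, 23.0]

# Sketch of running the same function via Fugue on Spark
# (assumes fugue and pyspark are installed):
#   from fugue import transform
#   result = transform(rides, add_total, schema="*,total:double",
#                      engine="spark")
```

The point of the design is that the function itself stays engine-agnostic: only the outer call decides whether it runs locally on pandas or distributed on Spark.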
We will demo how to instantly update dependencies and instantly start and stop an on-demand Spark K8s cluster in our environment, even with a large number of pods. We have also developed Fugue extensions for Kinesis and Kafka, so that developers can connect to those resources and write Spark streaming pipelines in the same framework. Moreover, real-time services using Flink are also compatible with this framework. Fugue also provides an abstraction layer on top of Spark ML, scikit-learn, XGBoost, TensorFlow, Hyperopt, etc. for machine learning pipelines. Users can do distributed training, hyperparameter tuning and inference on both deep learning and classical models through the same interfaces. Finally, we will talk about our extensive testing on Spark 3.0 (preview version) and how we achieved significant performance gains without compatibility issues.
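A common shape for the distributed-training use case mentioned above is training one model per data partition through a single per-group function. The sketch below shows that pattern locally with pandas and NumPy; the group column and data are illustrative, and under a framework like Fugue the same function could be applied per partition on Spark through the same interface.

```python
import numpy as np
import pandas as pd

# Fit one tiny linear model per group. Locally this runs via groupby;
# distributed, the same function could run once per partition.
def fit_slope(df: pd.DataFrame) -> pd.DataFrame:
    slope, intercept = np.polyfit(df["x"], df["y"], 1)
    return pd.DataFrame({"region": [df["region"].iloc[0]],
                         "slope": [slope],
                         "intercept": [intercept]})

data = pd.DataFrame({
    "region": ["a"] * 3 + ["b"] * 3,
    "x": [0, 1, 2, 0, 1, 2],
    "y": [0.0, 2.0, 4.0, 1.0, 1.0, 1.0],
})
models = pd.concat(fit_slope(g) for _, g in data.groupby("region"))
print(models)
```

Because the training logic is just a function on a DataFrame, the same code path can serve classical models here and deep learning models elsewhere, which is the "same interfaces" idea in the abstract.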