Dr. Jintao Zhang holds a PhD in machine learning and has worked at various internet companies, with extensive experience delivering end-to-end solutions to large-scale machine learning problems. His interests focus on research, development, and infrastructure for distributed systems, big data, deep learning, and machine learning.
May 28, 2021 11:40 AM PT
When machine learning models are productionized, they commonly take the form of workflows with multiple tasks, managed by a task scheduler such as Airflow or Prefect. Traditionally, each task within the same workflow uses similar computing frameworks (e.g., Python, Spark, PyTorch) in the same backend computing environment (e.g., AWS EMR, Google Dataproc) with globally fixed settings (e.g., instance types, cores, memory).
In complicated use cases, such traditional workflows are highly inefficient in both resources and runtime, so it is desirable to use different computing frameworks, in different computing environments, within the same workflow. Such workflows can be called superworkflows. Fugue is an open-source abstraction layer on top of different computing frameworks that provides uniform interfaces to these frameworks without exposing the complexities associated with them. In this sense, Fugue can be viewed as a superframework.
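The idea of a uniform interface over multiple backends can be sketched in plain Python. The snippet below is an illustrative toy, not Fugue's actual API: the `transform` dispatcher, the engine names, and the partition-level execution model are all simplifications for exposition.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def _native_engine(parts: List[list], fn: Callable) -> list:
    # Run the per-partition function serially in the local process.
    return [fn(p) for p in parts]


def _threaded_engine(parts: List[list], fn: Callable) -> list:
    # Run the same function across partitions in parallel threads,
    # standing in for a distributed backend such as Spark or Dask.
    with ThreadPoolExecutor() as ex:
        return list(ex.map(fn, parts))


ENGINES = {"native": _native_engine, "threaded": _threaded_engine}


def transform(parts: List[list], fn: Callable, engine: str = "native") -> list:
    """Apply the same partition-level logic on any registered backend."""
    return ENGINES[engine](parts, fn)


# Business logic is written once, independent of the execution engine.
def normalize(part: list) -> list:
    total = sum(part)
    return [x / total for x in part]


data = [[1, 1, 2], [2, 3, 5]]
print(transform(data, normalize, engine="native"))    # [[0.25, 0.25, 0.5], [0.2, 0.3, 0.5]]
print(transform(data, normalize, engine="threaded"))  # same result, different backend
```

Because the business logic never references the engine, switching backends is a one-argument change, which is the essential property a superframework needs.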
In addition, Kubernetes (K8S) is a container orchestration system that makes it easy to create different computing environments (e.g., Spark, PyTorch) from different Docker images, since everything in K8S is containerized. It is therefore natural to combine K8S and Fugue to create superworkflows for complicated machine learning problems. In this talk, we use a popular graph embedding algorithm, Node2Vec, as an example to illustrate how to create an efficient superworkflow using Fugue and K8S on very large graphs with hundreds of millions of vertices and edges.
We also demonstrate how to partition the whole Node2Vec process into multiple tasks based on their complexity and parallelism. Benchmark testing is conducted to compare performance and resource efficiency. Finally, this superworkflow concept generalizes easily to other deep learning problems.
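For reference, the computational core of Node2Vec is a second-order biased random walk controlled by a return parameter p and an in-out parameter q; the walks are then fed to a skip-gram model to learn embeddings. A minimal single-walk sketch (the graph representation and function name are illustrative, not from the talk's implementation):

```python
import random


def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=random):
    """One second-order biased random walk over an adjacency dict.

    adj: dict mapping each vertex to a list of its neighbors.
    p:   return parameter (small p -> walks tend to backtrack).
    q:   in-out parameter (small q -> walks explore outward, DFS-like).
    """
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = adj[cur]
        if not nbrs:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            # First step has no "previous" node, so sample uniformly.
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for nxt in nbrs:
            if nxt == prev:            # distance 0 from prev: return
                weights.append(1.0 / p)
            elif nxt in adj[prev]:     # distance 1 from prev: stay local
                weights.append(1.0)
            else:                      # distance 2 from prev: move outward
                weights.append(1.0 / q)
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk


# Tiny example graph: a triangle with a pendant vertex.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(node2vec_walk(adj, start=0, length=10, p=0.5, q=2.0))
```

On graphs with hundreds of millions of vertices, generating these walks dominates the cost and parallelizes per vertex, while the downstream embedding training wants GPU workers — exactly the mix of task profiles that motivates a superworkflow.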
June 25, 2020 05:00 PM PT
While struggling to choose among different computing and machine learning frameworks such as Spark, Dask, scikit-learn, and TensorFlow for your ETL and machine learning projects, have you considered unifying them into one ecosystem? In this talk, we present such a framework we developed: Fugue. It is an abstraction layer on top of different frameworks that also provides a SQL-like language capable of representing pipelines end to end, and it is highly extensible with Python. With the Fugue framework, it is much easier and faster to create reliable, performant, and portable pipelines than with native Spark, especially for non-expert users.
We will demonstrate how we implemented the Node2Vec algorithm on top of Fugue, so that it can run on different computing frameworks and can process graphs with 100 million vertices and 3 billion edges in a few hours using Spark as the backend.
We have also built a unified interactive environment based on Kubernetes, Spark, and Fugue, and will demonstrate significant performance improvements for projects migrated to this system. We will also discuss future plans for the Fugue project, including Fugue ML and Fugue Streaming. Our goal is to create a unified ecosystem for distributed computing and machine learning.