Jintao Zhang - Databricks

Software Engineer, Machine Learning, Lyft Inc.

Dr. Jintao Zhang received his PhD in machine learning from the University of Kansas in August 2012. Since then, he has worked in data science, machine learning, and engineering roles at various companies. Dr. Zhang has extensive experience using Spark and other distributed computing platforms to productionize machine learning pipelines end to end.


Unifying Apache Spark and Non-Spark Ecosystems for Full-Spectrum Big Data Analytics (Summit 2020)

While struggling to choose among Apache Spark, TensorFlow, Druid, Dask, Flink, etc., for your ETL and machine learning projects, have you considered unifying them into one ecosystem? In this talk, we will present such a framework that we developed: Fugue. It is an abstraction layer on top of these different solutions, providing a SQL-like language that can represent your pipelines end to end and that is highly extensible in Python. With the Fugue framework, it is much easier and faster to create reliable, performant, and maintainable pipelines than with native Spark, especially for entry-level users. Within a unified environment based on Spark on Kubernetes (K8s), we will show how a user can use a custom environment for interactive development (Jupyter notebooks), batch processing, or near-real-time streaming jobs in a dedicated workspace.

We will demo how to instantly update dependencies and how to instantly start and stop an on-demand Spark-on-K8s cluster in our environment, even with a large number of pods. We have also developed Fugue extensions for Kinesis and Kafka, so that developers can connect to those resources and write Spark streaming pipelines within the same framework. Moreover, real-time services using Flink are also compatible with this framework. Fugue also provides an abstraction layer on top of Spark ML, scikit-learn, XGBoost, TensorFlow, Hyperopt, etc., for machine learning pipelines. Users can do distributed training, hyperparameter tuning, and inference on deep learning and classical models through the same interfaces. Finally, we will discuss our extensive testing on Spark 3.0 (preview) and how we achieved significant performance gains without compatibility issues.