Andy Feng is a VP Architecture at Yahoo, leading the architecture and design of big data and machine learning initiatives. He’s architected major platforms for personalization, ads serving, NoSQL, and cloud infrastructure.
TensorFlowOnSpark (TFoS) was open sourced in Q1 2017, and it has gained strong adoption within the Spark community for running TensorFlow training and inferencing jobs on Spark clusters. At Spark Summit 2017, we explained how TFoS enables Python applications to conduct distributed TensorFlow training and inference efficiently by leveraging key built-in capabilities of PySpark and TensorFlow. In this talk, we cover the major enhancements of TFoS in recent months. We will introduce a new Scala API for users who want to integrate previously trained models into an existing Scala/Spark workflow. We will describe a new Python API for Spark ML pipelines to train all types of TensorFlow models, and conduct inference/featurization without any custom code. Additionally, we will cover the support for TensorFlow Keras API, and TensorFlow Datasets.
Yahoo’s Audience Expansion (AEX) uses advanced custom Perform A-Like (PAL) modeling to find new, incremental users who exhibit online behavior that is similar to an advertiser’s existing customers or converters. Over the years, Yahoo has developed a sophisticated AEX pipeline based on Hadoop streaming, and refreshes audience models periodically. In this talk, we will present our recent effort to migrate AEX pipeline from Hadoop streaming to Spark. We aim to reduce audience model to be refreshed at least 2x faster. We came up an innovative migration solution that requires no code changes at all. We’ll present some Spark scalability enhancements to enable having over 50k partitions to process 3TB data without mapper memory issue or shuffle IO issue. Our Spark enhancements reduce memory consumption in reducer by 30x with minimum compression overhead. We have also improved Spark to tolerate faulures better, and identified several useful performance tuning tricks for improving resource utilization (CPU, memory and IO).
Deep learning is a critical capability for gaining intelligence from datasets. Many existing frameworks require a separated cluster for deep learning, and multiple programs have to be created for a typical machine learning pipeline. The separated clusters require large datasets to be transferred between clusters, and introduce unwanted system complexity and latency for end-to-end learning. Yahoo introduced CaffeOnSpark to alleviate those pain points and bring deep learning onto Hadoop and Spark clusters. By combining salient features from deep learning framework Caffe and big-data framework Apache Spark, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. The framework is complementary to non-deep learning libraries MLlib and Spark SQL, and its data-frame style API provides Spark applications with an easy mechanism to invoke deep learning over distributed datasets. Its server-to-server direct communication (Ethernet or InfiniBand) achieves faster learning and eliminates scalability bottleneck. As a distributed extension of Caffe, CaffeOnSpark supports neural network model training, testing, and feature extraction. Caffe users can now perform distributed learning using their existing LMDB data files and minorly adjusted network configuration. Our early benchmark indicates 18x speedup for deep networks. CaffeOnSpark has been in use by Yahoo for image search, content classification and several other use cases. Recently, we have released CaffeOnSpark at github.com/yahoo/CaffeOnSpark under Apache 2.0 License. In this talk, we will provide a technical overview of CaffeOnSpark, its API and deployment on a private cloud or public cloud (AWS EC2). We will share our experience on applying CaffeOnSpark to various use cases, and discuss potential areas of community collaboration.
In recent releases, TensorFlow has been enhanced for distributed learning and HDFS access. Outside of the Google cloud, however, users still needed a dedicated cluster for TensorFlow applications. There are several community projects wiring TensorFlow onto Apache Spark clusters. While these approaches are a step in the right direction, they are limited to support synchronous distributed learning only, and don’t allow TensorFlow servers to communicate with each other directly. This session will introduce a new framework, TensorFlowOnSpark, for scalable TensorFlow learning, which will be open sourced in Q1 2017. This new framework enables easy experimentation for algorithm designs, and supports scalable training and inferencing on Spark clusters. It supports all TensorFlow functionalities, including synchronous & asynchronous learning, model & data parallelism and TensorBoard. It provides architectural flexibility for data ingestion to TensorFlow (pushing vs. pulling) and network protocols (gRPC and RDMA) for server-to-server communication. Its Python API makes the integration with existing Spark libraries like MLlib easy. The speakers will walk through multiple examples to outline these key capabilities, and share benchmark results about scalability. Learn how, with a few lines of code changes, an existing TensorFlow algorithm can be transformed into a scalable application. You'll also be given tangible takeaways on how deep learning could be easily conducted on cloud or on-premise with a new framework. Session hashtag: #SFdev9