Jason Dai is a senior principal engineer and CTO of Big Data Technologies at Intel, responsible for leading the global engineering teams (in both Silicon Valley and Shanghai) on the development of advanced data analytics and machine learning. He is the creator of BigDL and Analytics Zoo, a founding committer and PMC member of Apache Spark, and a mentor of Apache MXNet. For more details, please see https://jason-dai.github.io/
This talk presents how we accelerated deep learning processing from preprocessing to inference and training on Apache Spark in SK Telecom. In SK Telecom, we have half a Korean population as our customers. To support them, we have 400,000 cell towers, which generates logs with geospatial tags. With these logs, we can analyze network quality for a certain cell tower and estimate real-time population in the region by counting the number of connected devices to the cell tower. In order to predict network quality for a cell and population for a certain region, we developed a deep learning based prediction model, which requires to process almost 2 million logs every second and produce prediction results for each cell tower and region. To efficiently handle this huge computation, we focused on optimizing deep learning data pipeline.
First, we tried to optimize deep learning preprocessing by using a new in-memory data store for Apache Spark called FlashBase. Preprocessing is done by reading the ingested data from FlashBase and main operations are processed as Spark's RDD transformation operations, while some of the aggregation operations are pushed down to FlashBase and these operations are accelerated by using vector processing with Intel's MKL and AVX-512.
Second, the preprocessed results as Spark's RDD format are directly delivered to an open source Analytics and AI Platform called Analytics Zoo without any data conversion. Lastly, Analytics Zoo takes the RDD as its input and executes deep learning inference and training operations using TensorFlow models (within Spark's executors in parallel). These operations are processed by using Intel's MKL and AVX-512 vectorized operations. By doing so, we could create orders of magnitude faster data pipeline for deep learning based on Spark and Intel Cascade-lake CPUs than the legacy architecture with pure Pandas and Tensorflow.
While GraphX provides nice abstractions and dataflow optimizations for parallel graph processing on top of Spark, there are still many challenges in applying it to an Internet-scale, production setting (e.g., graph algorithms and underlying frameworks optimized for billions of graph edges and 1000s of iterations). In this talk, we will present our efforts in building real-world, large-scale graph analysis applications using GraphX for some of the largest organizations/websites in the world, including both algorithm level and framework level optimizations (e.g., minimizing graph state replications, optimizing long RDD lineages, etc.)
There are increasing interest and applications for running deep learning on Apache Spark platform (e.g., BigDL, TensorFrames, Caffe/TensorFlow-on-Spark, etc.) in the community. In this BoF discussion, we would like to cover related topics such as experience and wish list, best practices and pitfalls, architectural tradeoffs, etc., for running deep learning on Spark.
BigDL is a distributed deep learning framework for Apache Spark open sourced by Intel. BigDL helps make deep learning more accessible to the Big Data community, by allowing them to continue the use of familiar tools and infrastructure to build deep learning applications. With BigDL, users can write their deep learning applications as standard Spark programs, which can then directly run on top of existing Spark or Hadoop clusters. In this session, we will introduce BigDL, how our customers use BigDL to build End to End ML/DL applications, platforms on which BigDL is deployed and also provide an update on the latest improvements in BigDL v0.1, and talk about further developments and new upcoming features of BigDL v0.2 release (e.g., support for TensorFlow models, 3D convolutions, etc.).