Jason Dai is a senior principal engineer and CTO of Big Data Technologies at Intel, responsible for leading the global engineering teams (in both Silicon Valley and Shanghai) on the development of advanced data analytics and machine learning. He is the creator of BigDL and Analytics Zoo, a founding committer and PMC member of Apache Spark, and a mentor of Apache MXNet. For more details, please see https://jason-dai.github.io/.
With the rapid evolution of AI in recent years, we need to embrace advanced and emerging AI technologies to gain insights and make decisions based on massive amounts of data. Ray (https://github.com/ray-project/ray) is a fast and simple framework open-sourced by UC Berkeley RISELab particularly designed for easily building advanced AI applications in a distributed fashion. Nevertheless, it is not straightforward for Ray to directly deal with big data, especially the data from real-life production environment. Instead of running big data applications and AI applications on two separate systems, we hereby introduce our work for RayOnSpark, which could gracefully allow users to run Ray programs on big data platforms. In this session, we will discuss our implementation of RayOnSpark in detail. You will have an intuitive understanding on how to run various emerging AI applications (including distributed training of deep neural networks, scalable AutoML for time series prediction, distributed reinforcement learning, etc.) on Apache Hadoop/YARN clusters by utilizing Ray and RayOnSpark. In addition, RayOnSpark allows Ray programs to be seamlessly integrated with Apache Spark data processing pipelines and directly run on in-memory Spark RDDs or DataFrames to eliminate expensive data transfer overhead among different systems.
This talk presents how we accelerated deep learning processing from preprocessing to inference and training on Apache Spark in SK Telecom. In SK Telecom, we have half a Korean population as our customers. To support them, we have 400,000 cell towers, which generates logs with geospatial tags. With these logs, we can analyze network quality for a certain cell tower and estimate real-time population in the region by counting the number of connected devices to the cell tower. In order to predict network quality for a cell and population for a certain region, we developed a deep learning based prediction model, which requires to process almost 2 million logs every second and produce prediction results for each cell tower and region. To efficiently handle this huge computation, we focused on optimizing deep learning data pipeline.
First, we tried to optimize deep learning preprocessing by using a new in-memory data store for Apache Spark called FlashBase. Preprocessing is done by reading the ingested data from FlashBase and main operations are processed as Spark's RDD transformation operations, while some of the aggregation operations are pushed down to FlashBase and these operations are accelerated by using vector processing with Intel's MKL and AVX-512.
Second, the preprocessed results as Spark's RDD format are directly delivered to an open source Analytics and AI Platform called Analytics Zoo without any data conversion. Lastly, Analytics Zoo takes the RDD as its input and executes deep learning inference and training operations using TensorFlow models (within Spark's executors in parallel). These operations are processed by using Intel's MKL and AVX-512 vectorized operations. By doing so, we could create orders of magnitude faster data pipeline for deep learning based on Spark and Intel Cascade-lake CPUs than the legacy architecture with pure Pandas and Tensorflow.
Time Series Forecasting is widely used in real world applications, such as network quality analysis in Telcos, log analysis for data center operations, predictive maintenance for high-value equipment, etc. Classical time series forecasting methods (such as autoregression and exponential smoothing) often involve making assumptions the underlying distribution of the data, while new machine learning methods, especially neural networks often perceive time series forecasting as a sequence modeling problem and have recently been applied to these problems with success (e.g.,  and ). However, building the machine learning applications for time series forecasting can be a laborious and knowledge-intensive process. In order to provide an easy-to-use time series forecasting toolkit, we have applied Automated Machine Learning (AutoML) to time series forecasting. The toolkit is built on top of Ray (a distributed framework for emerging AI applications open-sourced by UC Berkeley RISELab), so as to automate the process of feature generation and selection, model selection and hyper-parameter tuning in a distributed fashion. In this talk we will share how we build the AutoML toolkit for time series forecasting, as well as real-world experience and 'war stories' of earlier users (such as Tencent). References:
While GraphX provides nice abstractions and dataflow optimizations for parallel graph processing on top of Spark, there are still many challenges in applying it to an Internet-scale, production setting (e.g., graph algorithms and underlying frameworks optimized for billions of graph edges and 1000s of iterations). In this talk, we will present our efforts in building real-world, large-scale graph analysis applications using GraphX for some of the largest organizations/websites in the world, including both algorithm level and framework level optimizations (e.g., minimizing graph state replications, optimizing long RDD lineages, etc.)
There are increasing interest and applications for running deep learning on Apache Spark platform (e.g., BigDL, TensorFrames, Caffe/TensorFlow-on-Spark, etc.) in the community. In this BoF discussion, we would like to cover related topics such as experience and wish list, best practices and pitfalls, architectural tradeoffs, etc., for running deep learning on Spark.
BigDL is a distributed deep learning framework for Apache Spark open sourced by Intel. BigDL helps make deep learning more accessible to the Big Data community, by allowing them to continue the use of familiar tools and infrastructure to build deep learning applications. With BigDL, users can write their deep learning applications as standard Spark programs, which can then directly run on top of existing Spark or Hadoop clusters. In this session, we will introduce BigDL, how our customers use BigDL to build End to End ML/DL applications, platforms on which BigDL is deployed and also provide an update on the latest improvements in BigDL v0.1, and talk about further developments and new upcoming features of BigDL v0.2 release (e.g., support for TensorFlow models, 3D convolutions, etc.).