A Ph.D. and software engineer working on data engineering for big data and machine learning systems. Has published papers in top-tier conferences and journals, including VLDB, IEEE TKDE, and Information Systems. Project Leader of FlashBase (a distributed in-memory DBMS optimized for DRAM/SSDs) at the SKT Software R&D Center.
This talk presents how we accelerated deep learning processing, from preprocessing to inference and training, on Apache Spark at SK Telecom. At SK Telecom, half of the Korean population are our customers. To support them, we operate 400,000 cell towers, which generate logs with geospatial tags. With these logs, we can analyze network quality for a given cell tower and estimate the real-time population of a region by counting the number of devices connected to each cell tower. To predict network quality for a cell and population for a region, we developed a deep learning based prediction model, which requires processing almost 2 million logs every second and producing prediction results for each cell tower and region. To handle this huge computation efficiently, we focused on optimizing the deep learning data pipeline.
First, we optimized deep learning preprocessing by using a new in-memory data store for Apache Spark called FlashBase. Preprocessing reads the ingested data from FlashBase. The main operations run as Spark RDD transformations, while some of the aggregation operations are pushed down to FlashBase, where they are accelerated by vector processing with Intel's MKL and AVX-512.
Second, the preprocessed results, in Spark's RDD format, are delivered directly to an open source analytics and AI platform called Analytics Zoo without any data conversion. Lastly, Analytics Zoo takes the RDD as its input and executes deep learning inference and training operations using TensorFlow models, in parallel within Spark's executors. These operations use Intel's MKL and AVX-512 vectorized operations. By doing so, we built a deep learning data pipeline based on Spark and Intel Cascade Lake CPUs that is orders of magnitude faster than the legacy architecture with pure Pandas and TensorFlow.
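The aggregation pushdown described above can be illustrated with a minimal, framework-free sketch. It does not use FlashBase's or Spark's API: the `Partition` class, `partial_count`, `merge_partials`, and the sample log tuples are all hypothetical, and plain Python stands in for the vectorized MKL/AVX-512 kernels. The sketch also assumes that each (cell, device) pair is stored in exactly one partition, so partial counts can simply be added. The only point it makes is that each storage partition returns a small per-cell partial aggregate, and the compute layer merges those partials instead of pulling every raw log.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Hypothetical log record: (cell_id, device_id). Real logs carry more fields.
Log = Tuple[str, str]


@dataclass
class Partition:
    """Stand-in for one storage partition holding raw logs (assumption)."""
    logs: List[Log]

    def partial_count(self) -> Counter:
        # Pushed-down aggregation: count distinct devices per cell locally,
        # so only one small counter leaves the partition.
        distinct_pairs = set(self.logs)
        return Counter(cell_id for cell_id, _ in distinct_pairs)


def merge_partials(partials: Iterable[Counter]) -> Counter:
    """Compute-side merge of the per-partition partial aggregates."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # adds counts key by key
    return total


# Illustrative data only; each (cell, device) pair lives in one partition.
partitions = [
    Partition([("cell-A", "dev-1"), ("cell-A", "dev-2"), ("cell-A", "dev-1")]),
    Partition([("cell-B", "dev-3"), ("cell-B", "dev-4"), ("cell-B", "dev-5")]),
]

population = merge_partials(p.partial_count() for p in partitions)
print(dict(population))
```

If the same device could appear for the same cell in several partitions, adding counts would overcount, and the partials would need to carry device sets or a sketch structure instead.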
In this talk, we will present how we analyze, predict, and visualize network quality data, as a Spark AI use case in a telecommunications company. SK Telecom is the largest wireless telecommunications provider in South Korea, with 300,000 cells and 27 million subscribers. These 300,000 cells generate data every 10 seconds, amounting to 60TB and 120 billion records per day.
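As a rough sanity check on these figures, the short calculation below derives the implied ingest rate and average record size. It assumes decimal units (1 TB = 10^12 bytes) and that both the 60TB and the 120 billion records refer to one day of data; the variable names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400 seconds
cells = 300_000
report_interval_s = 10
records_per_day = 120_000_000_000         # 120 billion
bytes_per_day = 60 * 10**12               # 60 TB in decimal units (assumption)

# Number of 10-second reporting intervals in a day.
reports_per_cell_per_day = SECONDS_PER_DAY // report_interval_s   # 8,640

# Sustained ingest rate implied by the daily record count.
records_per_second = records_per_day / SECONDS_PER_DAY

# Average size of one record.
avg_record_bytes = bytes_per_day / records_per_day

# Average number of records one cell contributes per 10-second interval.
records_per_cell_report = records_per_day / (cells * reports_per_cell_per_day)

print(f"{records_per_second:,.0f} records/s")
print(f"{avg_record_bytes:.0f} bytes/record")
print(f"{records_per_cell_report:.1f} records per cell per interval")
```

Under these assumptions the store has to absorb about 1.39 million records per second, at roughly 500 bytes per record and about 46 records per cell in each 10-second interval.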
To address the previous problems of Spark on HDFS, we developed a new data store for SparkSQL, consisting of Redis and RocksDB, that allows us to distribute and store this data in real time and analyze it right away. Not satisfied with only analyzing network quality in real time, we went on to predict network quality in the near future in order to quickly detect and recover from network device failures, by designing a network signal pattern-aware DNN model and a new in-memory data pipeline from Spark to TensorFlow.
In addition, by integrating Apache Livy and MapboxGL with SparkSQL and our new store, we built a geospatial visualization system that shows the current population and signal strength of 300,000 cells on a map in real time.
-How we utilize Redis & RocksDB to store tremendous amounts of data efficiently.
-The architecture of Spark Data Source for Redis: filter out irrelevant Redis keys using filter pushdown.
-How we reduce memory usage of the Spark driver and prevent its OutOfMemoryError.
-A better model for network quality prediction than an RNN.
-How we train a prediction model for the network quality of 300,000 cells, each of which has different signal patterns.
-How we visualize geospatial data: customized logical plan for spatial query aggregation & pushdown.
-How we optimize spatial queries: aggregation pushdown and vectorized aggregation using SIMD.
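The filter pushdown mentioned in the list above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the actual Spark Data Source for Redis: the key layout `<table>:<date>:<cell_id>`, the in-memory `store` dictionary, and the `prune_keys` and `scan` helpers are all hypothetical. The sketch assumes that partition column values are encoded in the key, so equality filters can discard irrelevant keys before any value is read.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical key layout: "<table>:<date>:<cell_id>" (assumption).
# A dictionary stands in for the key-value store; values are sample readings.
store: Dict[str, List[int]] = {
    "signal:2020-01-01:cell-A": [-80, -82],
    "signal:2020-01-01:cell-B": [-95],
    "signal:2020-01-02:cell-A": [-79],
}


def prune_keys(keys: Iterable[str],
               date: Optional[str] = None,
               cell_id: Optional[str] = None) -> List[str]:
    """Keep only keys whose encoded partition values match the filters."""
    selected = []
    for key in keys:
        _, key_date, key_cell = key.split(":")
        if date is not None and key_date != date:
            continue
        if cell_id is not None and key_cell != cell_id:
            continue
        selected.append(key)
    return selected


def scan(data: Dict[str, List[int]],
         date: Optional[str] = None,
         cell_id: Optional[str] = None) -> Dict[str, List[int]]:
    """Read values only for keys that survive the pushed-down filters."""
    keys = prune_keys(data.keys(), date=date, cell_id=cell_id)
    return {key: data[key] for key in keys}


result = scan(store, date="2020-01-01", cell_id="cell-A")
print(result)
```

In this toy setup only one of the three keys is read for the query, which is the effect filter pushdown aims for: the less data leaves the store, the less the Spark side has to deserialize and hold in memory.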