Graduated as a BS in Seoul National University, Dooyoung Hwang worked in LG electronics as a senior engineer for 8 years and is contributor for Android Frameworks. Now as an software engineer for SK Telecom, his work is focusing on optimizing spark queries and building ETL infrastructure for network analysis system.
In this talk, we will present how we analyze, predict, and visualize network quality data, as a spark AI use case in a telecommunications company. SK Telecom is the largest wireless telecommunications provider in South Korea with 300,000 cells and 27 million subscribers. These 300,000 cells generate data every 10 seconds, the total size of which is 60TB, 120 billion records per day.
In order to address previous problems of Spark based on HDFS, we have developed a new data store for SparkSQL consisting of Redis and RocksDB that allows us to distribute and store these data in real time and analyze it right away, We were not satisfied with being able to analyze network quality in real-time, we tried to predict network quality in near future in order to quickly detect and recover network device failures, by designing network signal pattern-aware DNN model and a new in-memory data pipeline from spark to tensorflow.
In addition, by integrating Apache Livy and MapboxGL to SparkSQL and our new store, we have built a geospatial visualization system that shows the current population and signal strength of 300,000 cells on the map in real time.
-The architecture of our How we utilize Redis & RocksDB in order to store tremendous data in an efficient way.
-The architecture of Spark Data Source for Redis: filter out irrelevant Redis keys using filter pushdown.
-How we reduce memory usage of Spark driver and prevent its OutOfMemoryError.
-Better prediction model for network quality prediction than RNN.
-How we train a prediction model for network quality of 300,000 cells each of which has different signal patterns.
-How we visualize in geospatial data: Customized logical plan for spatial query aggregation & pushdown
-How we optimize Spatial query: aggregation pushdown and vectorized aggregation using SIMD