Sridhar Alla currently works as the Director of Big Data Solutions and Architecture at Comcast, where he has delivered several key solutions, including the XFinity personalization platform, ClickthruAnalytics, and the Correlation platform. Sridhar started his career in network appliances, working on NAS and caching technologies. He also served as the CTO of the security company eIQNetworks, where he merged the concepts of big data and security products. He holds patents on very-large-scale processing algorithms and caching.
Comcast provides personalized recommendations to its customers on the X1 Platform. Our initial implementation was built on the Hadoop MapReduce framework using a batch computation model. When we wanted to explore how to offer real-time recommendations, we looked to Spark because of its greater computational efficiency and the ease of developing both streaming and batch processing solutions from the same code base. In this talk, we describe how we re-implemented our recommendation data pipeline on the Spark framework to support use cases that require integrating incoming streams of data in real time, with a latency of seconds. Specifically, at Comcast we deal with billions of machine-generated events amounting to hundreds of gigabytes per day, and to compute recommendations for users with low latency we needed a faster system than the batch-oriented MapReduce framework. Spark allowed us to consume events quickly by taking advantage of the intermediate state of results held in its in-memory cache. As a result, we no longer had to rerun the complete pipeline every few hours, which had become infeasible as the number of events kept growing. In summary, our experience shows that Spark lets us compute recommendation results much faster, thanks to its in-memory caching, while also significantly accelerating the development process.
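The contrast between rerunning the whole pipeline and updating cached intermediate state can be illustrated with a minimal sketch. It is plain Python rather than Spark, and it is not the pipeline described in the talk: the event shape `(user_id, item_id)`, the class `IncrementalState`, and the count-based "recommendation" are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical event shape: (user_id, item_id). The real event schema is not
# described in the abstract.
Event = Tuple[str, str]


def recompute_from_scratch(all_events: Iterable[Event]) -> Dict[str, Counter]:
    """Batch model: rebuild every per-user count from the full event history."""
    state: Dict[str, Counter] = defaultdict(Counter)
    for user, item in all_events:
        state[user][item] += 1
    return state


class IncrementalState:
    """Streaming model: keep per-user counts in memory and fold in new events."""

    def __init__(self) -> None:
        self.counts: Dict[str, Counter] = defaultdict(Counter)

    def update(self, micro_batch: Iterable[Event]) -> None:
        # Only the new events are touched; earlier results stay cached.
        for user, item in micro_batch:
            self.counts[user][item] += 1

    def top_items(self, user: str, n: int = 3) -> List[str]:
        # .get avoids creating an empty entry for users never seen.
        user_counts = self.counts.get(user, Counter())
        return [item for item, _ in user_counts.most_common(n)]


if __name__ == "__main__":
    state = IncrementalState()
    state.update([("u1", "a"), ("u1", "b"), ("u2", "c")])
    state.update([("u1", "a"), ("u2", "c"), ("u2", "d")])
    print(state.top_items("u1", 1))
```

Each call to `update` costs time proportional to the new micro-batch only, whereas `recompute_from_scratch` grows with the full history, which is the scaling problem the abstract attributes to the batch approach.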
Businesses are accumulating large amounts of data from disparate sources and storing it in Hadoop for further exploration, data mining, and deterministic and predictive analysis using a variety of approaches and algorithms. However, leveraging the rich, validated open source libraries in R is a challenge because of the massive dataset sizes in Hadoop. We discuss how we solved a common anomaly detection problem on petabytes of data with Hidden Markov Models, using R on Hadoop.
Abstract: Comcast collects significant amounts of data, ranging from customer usage clickstreams to customer contact events such as telesales and emails. The sheer volume and variety of the data, at the velocity it arrives, make straightforward data science algorithms impractical and make it hard to keep up with the goals of the organization. In this talk we will discuss how we are tackling one of the more challenging topics, anomaly detection:
1. Anomaly detection is often tied to fraud detection, but the use cases go well beyond this. At Comcast, we use anomaly detection to identify internet usage patterns, customer activity anomalies, and changes and errors in the hardware supporting the backbone. R over Hadoop is used to quickly build models and analyze petabytes of data in our Hadoop cluster.
2. Spark has significantly changed the Hadoop paradigm, making it possible to build faster, more scalable applications and analytical tools. SparkR in particular has changed how we approach the problem. At Comcast, we are using Spark to provide faster, more flexible processing of the data and to enable anomaly detection for a wide variety of use cases.
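One common way to use a Hidden Markov Model for anomaly detection is to score each observation sequence by its log-likelihood under a model of normal behavior and flag sequences that score unusually low. The sketch below shows that idea with the scaled forward algorithm in plain Python. It is not the R-on-Hadoop implementation from the talk: the two-state model, its probabilities, the symbol encoding (0 for normal usage, 1 for a spike), and the threshold are invented for illustration, and a real model would be fitted to data.

```python
import math
from typing import List, Sequence


def hmm_log_likelihood(
    obs: Sequence[int],
    start: List[float],
    trans: List[List[float]],
    emit: List[List[float]],
) -> float:
    """Scaled forward algorithm: returns log P(obs | model).

    start[s]     probability of beginning in state s
    trans[p][s]  probability of moving from state p to state s
    emit[s][o]   probability of state s emitting symbol o
    """
    if not obs:
        return 0.0
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under the model
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_lik


def is_anomalous(obs: Sequence[int], threshold: float = -0.5) -> bool:
    """Flag a sequence whose average per-observation log-likelihood is low.

    The model parameters and the threshold are illustrative assumptions.
    """
    start = [0.9, 0.1]
    trans = [[0.95, 0.05], [0.5, 0.5]]
    emit = [[0.9, 0.1], [0.2, 0.8]]
    return hmm_log_likelihood(obs, start, trans, emit) / len(obs) < threshold


if __name__ == "__main__":
    print(is_anomalous([0, 0, 0, 0, 0, 0]))
    print(is_anomalous([1, 1, 1, 1, 1, 1]))
```

Normalizing by sequence length keeps scores comparable across sequences of different lengths, so a single threshold can be applied to all of them.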
Almost all organizations now need data science, and once the algorithm has been chosen, the main challenge is to scale it up and make it operational. Comcast uses several tools and technologies, such as Python, R, SAS, and H2O. In this session, they'll show how many common use cases rely on the same common algorithms, such as Logistic Regression, Random Forest, Decision Trees, Clustering, and NLP. Apache Spark has several machine learning algorithms built in and scales very well. Hence, at Comcast, they built a platform that provides data science as a service (DSaaS) on top of Spark, with a REST API for controlling and submitting jobs, so that most users are spared from writing (and repeating) code and can focus on the actual requirements instead. Learn how they solved some of the problems of establishing feature vectors, choosing algorithms, and then deploying models into production. They'll also showcase their use of Scala, R, and Python to implement models in the language of choice while still deploying quickly into production on 500-node Spark clusters. Session hashtag: #SFeco19
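As a rough illustration of submitting jobs through a REST API instead of writing code, the sketch below validates a JSON job request before it would be handed to a cluster. Nothing here reflects Comcast's actual API: the field names (`algorithm`, `input_path`, `feature_columns`, `params`), the `JobSpec` type, and the set of supported algorithm identifiers are hypothetical, chosen only to mirror the algorithms named in the abstract.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Hypothetical identifiers for the algorithms named in the abstract.
SUPPORTED_ALGORITHMS = {
    "logistic_regression",
    "random_forest",
    "decision_tree",
    "clustering",
}


@dataclass
class JobSpec:
    """A validated job request (hypothetical shape, for illustration only)."""

    algorithm: str
    input_path: str
    feature_columns: List[str]
    params: Dict[str, Any] = field(default_factory=dict)


def parse_job_request(body: str) -> JobSpec:
    """Parse and validate the JSON body of a job-submission request."""
    payload = json.loads(body)
    required = ("algorithm", "input_path", "feature_columns")
    missing = [key for key in required if key not in payload]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")
    if payload["algorithm"] not in SUPPORTED_ALGORITHMS:
        raise ValueError(f"unsupported algorithm: {payload['algorithm']}")
    if not payload["feature_columns"]:
        raise ValueError("feature_columns must not be empty")
    return JobSpec(
        algorithm=payload["algorithm"],
        input_path=payload["input_path"],
        feature_columns=list(payload["feature_columns"]),
        params=dict(payload.get("params", {})),
    )


if __name__ == "__main__":
    request_body = json.dumps(
        {
            "algorithm": "random_forest",
            "input_path": "/data/example",  # placeholder path
            "feature_columns": ["f1", "f2"],
            "params": {"num_trees": 50},
        }
    )
    print(parse_job_request(request_body))
```

Validating the request at the API boundary means a user only has to describe the algorithm, data, and features, and malformed requests are rejected before any cluster resources are used.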