Petabyte Scale Anomaly Detection Using R & Spark – Databricks

Businesses are accumulating large volumes of data from disparate sources and storing it in Hadoop for further exploration, data mining, and deterministic and predictive analysis using a variety of approaches and algorithms. However, applying the rich, validated open source libraries in R is a challenge because of the massive dataset sizes in Hadoop. We discuss how we solved a common anomaly detection problem on petabytes of data by running Hidden Markov Models in R on Hadoop.

Abstract: Comcast collects significant amounts of data, ranging from customer usage clickstreams to customer contact events such as telesales and emails. The sheer volume and variety of the data, and the velocity at which it arrives, make straightforward data science algorithms impractical and make it hard to keep up with the goals of the organization. In this talk we discuss how we are tackling one of the more challenging topics, anomaly detection:

1. Anomaly detection is often tied to fraud detection, but the use cases go well beyond this. At Comcast, we use anomaly detection to identify, among other things, internet usage patterns, customer activity anomalies, and changes and errors in the hardware supporting the backbone. R over Hadoop is used to quickly build models and analyze petabytes of data in our Hadoop cluster.
2. Spark has significantly changed the Hadoop paradigm, making it possible to build faster, more scalable applications and analytical tools. SparkR in particular has changed how we approach the problem. At Comcast, we use Spark to provide faster, more flexible processing of the data, which enables anomaly detection for a wide variety of use cases.
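To make the Hidden Markov Model idea mentioned above a little more concrete, here is a minimal sketch of how an HMM can score a sequence for anomalies. It is not the speakers' implementation: the talk describes R on Hadoop and Spark, while this sketch uses plain Python for illustration. The two hidden states ("quiet" and "busy"), the three usage levels (0 = low, 1 = medium, 2 = high), and every probability in `START`, `TRANS`, and `EMIT` are hypothetical values chosen only for the example. The scoring technique is the standard scaled forward algorithm: a sequence that the model finds unlikely gets a higher score.

```python
import math

# Hypothetical 2-state model: state 0 = "quiet", state 1 = "busy".
# Observations are usage levels: 0 = low, 1 = medium, 2 = high.
START = [0.8, 0.2]                      # initial state probabilities
TRANS = [[0.9, 0.1], [0.2, 0.8]]        # TRANS[i][j] = P(next state j | state i)
EMIT = [[0.7, 0.25, 0.05],              # EMIT[i][k] = P(observation k | state i)
        [0.1, 0.3, 0.6]]


def sequence_log_likelihood(obs, start, trans, emit):
    """Return log P(obs | model) using the scaled forward algorithm."""
    if not obs:
        raise ValueError("obs must contain at least one observation")
    n_states = len(start)
    # alpha[i] is proportional to P(state i at time t, observations up to t)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step through the transition matrix, then
            # weight by the probability of emitting the new observation.
            alpha = [
                sum(alpha[j] * trans[j][i] for j in range(n_states))
                * emit[i][obs[t]]
                for i in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")        # sequence is impossible under the model
        log_lik += math.log(scale)      # accumulate log of the scaling factors
        alpha = [a / scale for a in alpha]  # renormalise to avoid underflow
    return log_lik


def anomaly_score(obs, start, trans, emit):
    """Negative log-likelihood per observation; higher means more unusual."""
    return -sequence_log_likelihood(obs, start, trans, emit) / len(obs)


TYPICAL = [0, 0, 0, 1, 0, 0]   # mostly low usage
UNUSUAL = [2, 0, 2, 0, 2, 0]   # rapid swings between high and low usage

if __name__ == "__main__":
    print("typical:", anomaly_score(TYPICAL, START, TRANS, EMIT))
    print("unusual:", anomaly_score(UNUSUAL, START, TRANS, EMIT))
```

In practice the model parameters would be fitted to historical data rather than written by hand, and a threshold on the score would be chosen from the distribution of scores on known-normal sequences. Because each sequence is scored independently, this kind of scoring lends itself to being distributed across a cluster.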

About Sridhar Alla

Sridhar Alla currently works as the Director of Big Data Solutions and Architecture at Comcast, where he has delivered several key solutions, such as the XFinity personalization platform, ClickthruAnalytics, and the Correlation platform. Sridhar started his career in network appliances, working on NAS and caching technologies. He also served as the CTO of security company eIQNetworks, where he merged the concepts of big data and security products. He holds patents on very large scale processing algorithms and caching.

About Kiran Muglurmath

Kiran Muglurmath is the Executive Director of Big Data Analytics at Comcast, where he manages a team of data scientists and big data engineers for machine learning, data mining and predictive analytics. Prior to Comcast, Kiran was a consulting big data platform architect and data scientist at T-Mobile and Boeing. He holds an MBA from the Kellogg School at Northwestern University, and a Computer Science degree from Bangalore University.