Kiran Muglurmath is the Executive Director of Big Data Analytics at Comcast, where he manages a team of data scientists and big data engineers working on machine learning, data mining and predictive analytics. Prior to Comcast, Kiran was a consulting big data platform architect and data scientist at T-Mobile and Boeing. He holds an MBA from the Kellogg School at Northwestern University and a Computer Science degree from Bangalore University.
Businesses are accumulating large volumes of data from disparate sources and storing it in Hadoop for exploration, data mining, and deterministic and predictive analysis using a variety of approaches and algorithms. However, the massive dataset sizes in Hadoop make it a challenge to leverage the rich, validated open source libraries in R. We discuss how we solved a common anomaly detection problem on petabytes of data with Hidden Markov Models, using R on Hadoop.

Abstract: Comcast collects significant amounts of data, ranging from customer usage clickstreams to customer contact events such as telesales and emails. The sheer volume and variety of the data, at the velocity it arrives, makes straightforward data science algorithms impractical and makes it hard to keep up with the goals of the organization. In this talk we will discuss how we are tackling one of the challenging topics, anomaly detection:

1. Anomaly detection is often tied to fraud detection, but the use cases go well beyond this. At Comcast, we use anomaly detection to identify internet usage patterns, customer activity anomalies, and changes and errors in the hardware supporting the backbone. R over Hadoop is used to quickly build models and analyze petabytes of data in our Hadoop cluster.
2. Spark has significantly changed the Hadoop paradigm, making it possible to build faster, more scalable applications and analytical tools. SparkR especially has changed how we approach the problem. At Comcast, we are using Spark to provide faster, more flexible processing of the data and to enable anomaly detection for a wide variety of use cases.
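The abstract above does not say how the Hidden Markov Models were applied, so the following is only a minimal sketch of one common approach: score fixed-size windows of a discretized usage series by their average log-likelihood under an HMM and flag the windows the model finds unlikely. It is written in plain Python rather than the R-on-Hadoop stack the talk describes, and every name and number in it (`START`, `TRANS`, `EMIT`, the window size, the threshold) is a hand-picked assumption for illustration, not a value from Comcast.

```python
import math

def forward_log_likelihood(obs, start, trans, emit):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the scaled forward algorithm."""
    n_states = len(start)
    # Joint probability of each state and the first observed symbol.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transition matrix, then weight by emission.
            alpha = [
                emit[s][obs[t]] * sum(alpha[p] * trans[p][s] for p in range(n_states))
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under the model
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_lik

def flag_anomalous_windows(series, window, threshold, start, trans, emit):
    """Return (start_index, score) for each non-overlapping window whose
    per-symbol log-likelihood falls below the threshold."""
    flagged = []
    for i in range(0, len(series) - window + 1, window):
        chunk = series[i:i + window]
        score = forward_log_likelihood(chunk, start, trans, emit) / window
        if score < threshold:
            flagged.append((i, score))
    return flagged

# Hypothetical two-state model: state 0 = quiet, state 1 = busy.
# Symbols: 0 = low usage, 1 = medium usage, 2 = high usage.
START = [0.9, 0.1]
TRANS = [[0.95, 0.05],
         [0.10, 0.90]]
EMIT = [[0.70, 0.25, 0.05],
        [0.05, 0.25, 0.70]]

# Quiet usage, then rapid low/high flapping, then quiet usage again.
series = [0, 0, 1, 0, 0, 0,
          0, 2, 0, 2, 0, 2,
          0, 0, 0, 1, 0, 0]

flagged = flag_anomalous_windows(series, window=6, threshold=-1.0,
                                 start=START, trans=TRANS, emit=EMIT)
for index, score in flagged:
    print(f"window starting at {index}: avg log-likelihood {score:.3f}")
```

In this toy model the states are "sticky", so a window that flips between low and high usage on every step scores much worse than a steady one. Each window is scored independently from the start distribution, which is a simplification; in practice the parameters would be learned from historical data and the threshold tuned per use case.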
Almost all organizations now have a need for data science, and once the algorithm has been chosen, the main challenge is to scale it up and make it operational. Comcast uses several tools and technologies, such as Python, R, SAS and H2O. In this session, they'll show how many common use cases rely on standard algorithms such as Logistic Regression, Random Forest, Decision Trees, Clustering and NLP. Apache Spark has several machine learning algorithms built in and has excellent scalability. Hence, at Comcast, they built a platform that provides data science as a service (DSaaS) on top of Spark, with a REST API for controlling and submitting jobs, so that most users are spared from writing the same code repeatedly and can focus on the actual requirements instead. Learn how they solved some of the problems of establishing feature vectors, choosing algorithms and then deploying models into production. They'll also showcase their use of Scala, R and Python to implement models in the language of choice while still deploying quickly into production on 500-node Spark clusters.

Session hashtag: #SFeco19