Detecting outliers and anomalies in data is one of the most common tasks that the working data scientist is asked to do. The task is especially common, and especially challenging, with fast streaming data coming from many IoT sources. Despite this, library support for problems of this variety is woefully lacking, and data scientists are often forced to go to research papers and implement their own solutions. This talk will cover using Spark Streaming coupled with a novel algorithmic approach to detecting outliers at scale, using a composition of distributional sketches as well as more classical techniques, along with off-the-shelf UI components, to demonstrate how this common but challenging task might be accomplished for IoT data as well as more traditional streaming data.
I am a committer, data scientist, and PMC member on the Apache Metron project, working on the engineering team at Hortonworks. In the past, I've worked as an architect and senior engineer at a healthcare informatics startup spun out of the Cleveland Clinic, as a developer at Oracle, and as a Research Geophysicist in the Oil & Gas industry. I specialize in writing software and solving problems where there are scalability concerns due to large amounts of traffic or large amounts of data. I have a particular passion for data science problems and anything mathematical.