In this talk, we’ll discuss the challenges of analyzing large-scale time series data sets and introduce the Spark-TS library. Whether we need to build models over data coming in every second from thousands of sensors or dig into the histories of millions of financial instruments, large-scale time series data shows up in a variety of domains. Time series data has an innate structure not found in other data sets, and thus presents both unique challenges and opportunities. The open source Spark-TS library provides both Scala and Python APIs for munging, manipulating, and modeling time series data on top of Spark. We’ll cover its core concepts, like the TimeSeriesRDD and DateTimeIndex, as well as some of the statistical modeling functionality it provides on top of them.
Sandy is a senior data scientist at Cloudera, focusing on Apache Spark and its ecosystem, and an author of the recent O'Reilly publication "Advanced Analytics with Spark." He's a Spark committer and member of the Apache Hadoop project management committee. He graduated Phi Beta Kappa from Brown University.