Sandy is a senior data scientist at Cloudera, focusing on Apache Spark and its ecosystem, and an author of the recent O’Reilly publication “Advanced Analytics with Spark.” He’s a Spark committer and member of the Apache Hadoop project management committee. He graduated Phi Beta Kappa from Brown University.
Want to prepare a dataset with MapReduce and Pig, query it with Impala, and fit a model to it with Spark? To run these alongside each other and share resources across them in real time? CDH recently added the capability to dynamically schedule Impala work alongside MapReduce, centrally managed by YARN. Moving beyond static allocations allows users to think of resources in terms of workloads that may span processing paradigms, and allows cluster operators to allocate portions of their cluster to units within their organization instead of to processing frameworks. With Spark joining MapReduce and Impala as a core data processing framework, we would like to extend this resource management vision to Spark. The existing Spark on YARN work is a strong step in this direction, but allowing Spark applications to fluidly grab and release resources through YARN will require additional work in both Spark and YARN: for example, resizable containers in YARN and off-heap memory in Spark that can be given back to the OS. The talk will discuss the current state of resource management on Hadoop, how Spark currently fits in, the work needed to share resources fluidly between Spark and other processing frameworks on YARN, and the kinds of pipelines and mixed workloads that this resource sharing will enable.
Spark’s YARN support allows scheduling Spark workloads on Hadoop alongside a variety of other data-processing frameworks. The talk will be a deep dive into the architecture and uses of Spark on YARN. We’ll cover the intersection between Spark and YARN’s resource management models. Attention will also be given to the different supported deploy modes and best operational practices. Finally, we’ll also discuss roadmap items, such as executor container resizing and integration with YARN’s application history store.
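For context on the deploy modes mentioned above, a Spark application is launched on YARN through spark-submit in either cluster mode (the driver runs inside a YARN container) or client mode (the driver runs on the submitting machine). The flags below are standard spark-submit options; the jar and class names are placeholders. A minimal sketch:

```
# Cluster mode: the driver itself runs in a YARN container,
# so the client machine can disconnect after submission.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 2g \
  --class com.example.MyApp \   # placeholder application class
  my-app.jar                    # placeholder application jar

# Client mode: the driver runs locally (useful for interactive
# work), while executors still run in YARN containers.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class com.example.MyApp \
  my-app.jar
```

Note that older Spark releases expressed the same choice as `--master yarn-cluster` or `--master yarn-client` rather than a separate `--deploy-mode` flag.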
Having collected large amounts of data, organizations are keen on data science and big learning. However, transitioning from “the lab” to a production-ready, large-scale operational analytics system remains a difficult and ad hoc endeavor, especially when real-time answers are required. In addition to scalable and performant model-building capabilities, a production system needs to be able to update and serve models reliably and in real time. The Oryx open source project provides simple, large-scale machine learning infrastructure for both building and serving predictive models. While Oryx has previously relied on custom MapReduce jobs for its model-building component, Apache Spark and its ML library provide a promising alternative. The talk will discuss our recent work in transitioning Oryx’s model-building component to run on Spark and leverage algorithms from MLlib, the reasons why Spark is well suited for the task, and the general anatomy of production large-scale machine learning infrastructure.
In this talk, we'll discuss the challenges of analyzing large-scale time series data sets and introduce the Spark-TS library. Whether we need to build models over data coming in every second from thousands of sensors or dig into the histories of millions of financial instruments, large-scale time series data shows up in a variety of domains. Time series data has an innate structure not found in other data sets, and thus presents both unique challenges and opportunities. The open source Spark-TS library provides both Scala and Python APIs for munging, manipulating, and modeling time series data on top of Spark. We'll cover its core concepts, like the TimeSeriesRDD and DateTimeIndex, as well as some of the statistical modeling functionality it provides on top of them.