On-Premise Spark-as-a-Service for Swedish Researchers


Since April 2016, Spark-as-a-service has been available to researchers in Sweden from the Swedish ICT SICS Data Center at www.hops.site. Spark applications can either be deployed as jobs (batch or streaming) or written and run directly from Apache Zeppelin. Our platform builds on Hops, a new distribution of Hadoop with a distributed metadata architecture, which includes a frontend called Hopsworks with support for project-based multi-tenancy and first-class datasets. Spark applications run within a project on a YARN cluster, with the novel property that they are metered and charged to projects. Projects are securely isolated from each other and include support for project-specific storage on HDFS and project-specific Kafka topics, both of which are protected from access by users who are not members of the project. Researchers work in an entirely UI-driven environment on a platform that is open-source. In this talk, we will discuss the challenges in building a metered version of Spark-as-a-Service for YARN, our experiences with Spark-on-YARN, and some of the possibilities that Hopsworks opens up for building secure, multi-tenant Spark applications on a shared cluster. We will also discuss the experiences of our users (over 100 as of June 2016): how they manage their YARN and HDFS quotas, the patterns by which they share datasets between projects, and our novel solutions for helping researchers debug and optimize Spark applications.
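To make the idea of metering applications and charging them to projects more concrete, here is a minimal Python sketch. It is not the Hopsworks implementation: the `Project` record, the `charge` and `can_submit` helpers, and the cost model (vcores multiplied by seconds multiplied by a price) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Project:
    """Hypothetical project record: a name, its members, and a remaining compute quota."""
    name: str
    members: set = field(default_factory=set)
    quota_vcore_seconds: float = 0.0


def charge(project: Project, vcores: int, seconds: float, price: float = 1.0) -> float:
    """Deduct the cost of a container's runtime from the project's quota.

    Cost model (an assumption, not taken from the talk): vcores * seconds * price.
    Returns the remaining quota, which may become zero or negative.
    """
    cost = vcores * seconds * price
    project.quota_vcore_seconds -= cost
    return project.quota_vcore_seconds


def can_submit(project: Project, user: str) -> bool:
    """Allow a new application only for project members while quota remains."""
    return user in project.members and project.quota_vcore_seconds > 0


# Example: a project with one member and one hour of single-vcore time.
demo = Project("genomics", members={"alice"}, quota_vcore_seconds=3600.0)
remaining = charge(demo, vcores=4, seconds=600.0)  # charges 4 * 600 = 2400
```

In this sketch, membership and quota are checked together at submission time, so a non-member is rejected regardless of quota and a member is rejected once the project's quota is exhausted. A real system would also need to decide what happens to applications that are already running when the quota runs out.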

About Jim Dowling

Jim Dowling is a native of Dublin (Ireland) and an Associate Professor at the School of Information and Communications Technology in the Department of Software and Computer Systems at KTH Royal Institute of Technology, a Senior Researcher at SICS RISE, and CEO of Logical Clocks AB. He received his Ph.D. in Distributed Systems from Trinity College Dublin (2005) and worked at MySQL AB (2005-2007). He is a distributed systems researcher whose interests lie in high-performance, large-scale distributed computer systems. He is the lead architect of Hops Hadoop (www.hops.io), the world's most scalable Hadoop distribution.