Evan Chan

Data Architect, Apple

Evan loves to design, build, and improve bleeding-edge distributed data and backend systems using the latest in open source technologies. He has led the design and implementation of multiple big data platforms based on Storm, Spark, Kafka, Cassandra, and Scala/Akka, including a columnar real-time distributed query engine. He is an active contributor to the Apache Spark project, a DataStax Cassandra MVP, and co-creator and maintainer of the open-source Spark Job Server. He is a big believer in GitHub, open source, and meetups, and has given talks at various conferences including Spark Summit, Cassandra Summit, FOSS4G, and Scala Days.


Spark Query Service (Job Server) at Ooyala

We would like to share with you the innovative ways that we use Spark at Ooyala, together with Apache Cassandra, to tackle interactive analytics and OLAP applications. In particular, we are turning Spark into a service with our Spark Job Server. The job server has been a big help to our development efforts, providing a single REST API for:

  • enabling interactive query jobs in long-running SparkContexts with shared RDD data
  • submitting and managing Spark Jobs on both standalone and Mesos clusters
  • tracking and serializing job status, progress, and job results
  • providing a programmatic API for job management scripts and query servers
  • cancelling problematic jobs
We believe the job server could be a significant help to Spark developer productivity everywhere.
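
To give a flavor of what a job looks like under this API, here is a minimal word-count sketch against the classic job-server job trait, as documented in the project README; exact package and method signatures may vary across versions.

```scala
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{SparkJob, SparkJobValid, SparkJobInvalid, SparkJobValidation}

// A minimal word-count job written against the classic job-server trait.
object WordCountJob extends SparkJob {

  // Called before runJob, so malformed requests fail fast with a clear error.
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    if (config.hasPath("input.string")) SparkJobValid
    else SparkJobInvalid("Missing config parameter input.string")

  // The return value is serialized by the job server and sent back via the REST API.
  override def runJob(sc: SparkContext, config: Config): Any =
    sc.parallelize(config.getString("input.string").split(" ").toSeq)
      .countByValue()
}
```

Because the job server owns the SparkContext and calls validate before runJob, the job author only writes the algorithm; deployment, context management, and result serialization are handled by the service.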

Spark Job Server: Easy Spark Job Management

As your enterprise starts deploying more and more Spark jobs, you will discover many common issues: deploying job jars; managing job lifecycles and progress; serializing and processing job results; keeping track of failures, job statuses, and jars; and managing Spark contexts for fast interactive jobs. Also, every job is an application with its own interface and parameters; submitting and running jobs Hadoop-style just doesn’t work. Our open-source Spark Job Server offers a RESTful API for managing Spark jobs, jars, and contexts, turning Spark into an easy-to-use service and offering a uniform API for all jobs. We will discuss our job server, its APIs, and current and upcoming features in much greater detail. Learn how the Spark Job Server can turn Spark into an easy-to-use service for your organization. As a developer, learn how the job server lets you focus on the job algorithm instead of on nitty-gritty infrastructure details.
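
As a rough illustration of the REST workflow, the sketch below uploads a jar, starts a job, and notes how to poll or cancel it. The host, port, jar path, and class names are hypothetical; the routes follow the job server's documented /jars, /jobs, and /contexts resources, for which the curl examples in the project README are the canonical reference.

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.file.{Files, Paths}
import scala.io.Source

// A rough sketch of driving the job server's REST API from Scala.
// Host, port, jar path, and class names here are hypothetical.
object JobServerClient {
  val base = "http://localhost:8090"

  private def request(method: String, path: String,
                      body: Array[Byte] = Array.empty): String = {
    val conn = new URL(base + path).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod(method)
    if (body.nonEmpty) {
      conn.setDoOutput(true)
      conn.getOutputStream.write(body)
    }
    Source.fromInputStream(conn.getInputStream).mkString
  }

  def main(args: Array[String]): Unit = {
    // 1. Upload the job jar once; later jobs reference it by app name.
    request("POST", "/jars/wordcount",
            Files.readAllBytes(Paths.get("target/wordcount.jar")))

    // 2. Start a job; the JSON response includes a job ID.
    val started = request("POST",
      "/jobs?appName=wordcount&classPath=example.WordCountJob",
      "input.string = a b c a b".getBytes("UTF-8"))
    println(started)

    // 3. Poll status and results with GET /jobs/<jobId>;
    //    cancel a problematic job with DELETE /jobs/<jobId>.
  }
}
```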

Productionizing Spark and the Spark REST Job Server

This is a two-part talk. The first part covers general deployment, configuration, and application-running tips for Apache Spark, drawn from my personal experience setting up and running Spark clusters since the early days of version 0.9. Should one deploy using Spark Standalone mode, Mesos, or YARN? What about DataStax DSE and other options such as EMR? What are the important considerations when configuring Spark and working with jars and dependencies? We will cover all this and more, including tips for running and debugging Spark applications. The second part covers the Spark Job Server, a leading option for running and managing Spark jobs as a REST service. With it, you get automatic job status and configuration logging to a database. We go in depth on using the Job Server, in particular as a way to share Spark RDDs amongst logical jobs for low-latency queries; another interesting use case is SQL/DataFrame queries on Spark Streaming data. Learn about productionizing Spark and running it as a REST service with the Spark Job Server!
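
To illustrate the RDD-sharing idea, here is a hedged sketch of two logical jobs running in the same long-lived context, using the job server's named-RDD support; the trait and method names follow the project docs, while the RDD name and config keys are invented for illustration.

```scala
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

// Loads data once and caches it under a well-known name in the shared context.
object LoadTableJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    val events = namedRdds.getOrElseCreate("events") {
      sc.textFile(config.getString("input.path")).cache()
    }
    events.count()
  }
}

// A later logical job in the same context: low latency because it only
// filters the already-cached RDD, with no re-read from storage.
object QueryTableJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    val events = namedRdds.get[String]("events").get
    events.filter(_.contains(config.getString("query"))).count()
  }
}
```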

700 Queries Per Second with Updates: Spark As A Real-Time Web Service

Apache Spark has taken over machine learning and exploratory analytics, but it is not often thought of as a platform capable of delivering sub-second, web-speed concurrent queries. Spark DataFrames offer in-memory caching, but the cache cannot be updated and is designed mostly for full table scans. This talk focuses on two important innovations, based on work from the FiloDB project: updatable in-memory columnar storage, and enabling Spark for concurrent, web-speed (sub-second) queries.

  • Spark SQL has much lower latency than you thought: 15ms and up!
  • A deep dive into Spark's cached RDDs and cached DataFrames
  • Re-inventing columnar storage for updates and filtering: learning lessons from the NoSQL world
  • How in-memory storage changes the game
  • Flexible and fine-grained filtering in two dimensions
  • Achieving concurrency with proper data modeling, partitioning/filtering, and the FAIR scheduler
  • Customizing JOIN query planning to achieve sub-second four-table JOINs
  • Speeding up smart city, real-time geospatial, time series, dashboard, and other applications

Key take-away: updatable columnar technology provides real benefits for a variety of real-time/streaming/dashboard/consumer apps. Combining storage technology, good data modeling, filtering, the FAIR scheduler, and good deployment practices enables concurrent, web-speed use of Spark as a SQL engine.
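
To make the concurrency point concrete, here is a minimal, generic sketch (not FiloDB-specific) of serving concurrent queries over a cached DataFrame with Spark's FAIR scheduler; the pool names and query shapes are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Concurrent sub-second queries over a cached DataFrame using the FAIR scheduler.
object ConcurrentQueries {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("concurrent-queries")
      .config("spark.scheduler.mode", "FAIR")  // round-robin across jobs, not FIFO
      .getOrCreate()

    val df = spark.range(0, 10000000).toDF("id").cache()
    df.count()  // materialize the cache up front

    // Each "web request" runs on its own thread in its own scheduler pool,
    // so short queries are not stuck behind long-running ones.
    val queries = (1 to 8).map { i =>
      Future {
        spark.sparkContext.setLocalProperty("spark.scheduler.pool", s"pool-$i")
        df.filter(s"id % 100 = $i").count()
      }
    }
    Await.result(Future.sequence(queries), 5.minutes).foreach(println)
  }
}
```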

BoF: Real-time / Low-latency Apache Spark

Let’s get together to discuss usage of Spark for real-time / low-latency scenarios. Share your experiences and let’s help each other learn!