Apache Spark has taken over machine learning and exploratory analytics, but it is not often thought of as a platform capable of delivering sub-second, web-speed concurrent queries. Spark DataFrames offer in-memory caching, but the cache cannot be updated and is designed mostly for full table scans. This talk focuses on two important innovations: updatable in-memory columnar storage, and how to enable Spark for concurrent web-speed (sub-second) queries, based on work from the FiloDB project.
– Spark SQL has much lower latency than you thought – 15ms and up!
– A deep dive into Spark’s cached RDDs and cached DataFrames
– Re-inventing columnar storage for updates and filtering: learning lessons from the NoSQL world
– How in-memory storage changes the game
– Flexible and fine-grained filtering in two dimensions
– Achieving concurrency with proper data modeling, partitioning/filtering, and the FAIR scheduler
– Customizing JOIN query planning to achieve 4-table subsecond JOINs
– Speeding up smart city, real-time geospatial, time series, dashboards, and other applications
Key take-away: Updatable columnar technology provides real benefits for a variety of real-time/streaming/dashboard/consumer apps. Combining storage technology, good data modeling, filtering, the FAIR scheduler, and good deployment practices enables concurrent, web-speed use of Spark as a SQL engine.
Evan loves to design, build, and improve bleeding-edge distributed data and backend systems using the latest in open source technologies. He has led the design and implementation of multiple big data platforms based on Storm, Spark, Kafka, Cassandra, and Scala/Akka, including a columnar real-time distributed query engine. He is an active contributor to the Apache Spark project, a DataStax Cassandra MVP, and co-creator and maintainer of the open-source Spark Job Server. He is a big believer in GitHub, open source, and meetups, and has given talks at various conferences including Spark Summit, Cassandra Summit, FOSS4G, and Scala Days.