Spark and Couchbase—Augmenting the Operational Database with Spark

For an operational database, Spark is like Batman’s utility belt: it handles a variety of important tasks, from data cleanup and migration to analytics and machine learning, that make the operational database much more powerful than it would be on its own. In this talk, I’ll describe the Couchbase Spark Connector, which lets you easily integrate Spark with Couchbase Server, an open source distributed NoSQL document database that provides low-latency data management for large-scale, interactive online applications. Both Spark and Couchbase are memory-centric systems, so when used correctly they can be insanely fast. We’ll cover common use cases for Spark and Couchbase, then the basics of creating, persisting, and consuming RDDs and DataFrames from Couchbase’s key/value and SQL interfaces.

Advanced topics include:

- Best practices and gotchas when working with DataFrames, especially related to schema inference in Spark and the latest Couchbase N1QL describe / infer
- How the Couchbase Spark Connector optimizes work with key/value RDDs and Couchbase’s key/value interfaces
- How and why to create Spark Streams from Couchbase Database Change Protocol streams (memory-to-memory streams that are used to replicate data between nodes and services)
- Performance tuning: topology awareness in Couchbase and locality in Spark, plus SparkSQL, predicate pushdown, and in-memory indexing

About Michael Nitschinger

Michael is a Senior Software Engineer at Couchbase. He is the architect and maintainer of the Couchbase Java SDK, one of the first completely reactive database drivers on the JVM. He also authored and maintains the Couchbase Spark Connector. Michael is active in the open source community, a core member of the Netty project, and also contributes to various other projects like RxJava.