Will is Director of Product Management at Couchbase, where he focuses on analytics, Spark, Kafka, and search. He’s responsible for interoperability between Couchbase and other big data technologies. Previously, he was a product manager on the big data platform team at HP, a Senior Director of Product Management for SAP HANA, and the Senior Director of SAP Research’s global Big Data program, which focused on big data and machine learning.
For an operational database, Spark is like Batman's utility belt: it handles a variety of important tasks, from data cleanup and migration to analytics and machine learning, that make the operational database much more powerful than it would be on its own. In this talk, we describe the Couchbase Spark Connector, which lets you easily integrate Spark with Couchbase Server, an open source distributed NoSQL document database that provides low-latency data management for large-scale, interactive online applications. We'll start with common use cases for Spark and Couchbase, then cover the basics of creating, persisting, and consuming RDDs and DataFrames from Couchbase's key/value and SQL interfaces. Advanced topics include:
• Best practices and gotchas when working with DataFrames, especially related to schema inference in Spark and the latest Couchbase N1QL describe / infer
• How the Couchbase Spark Connector optimizes work with key/value RDDs and Couchbase's key/value interfaces
• How and why to create Spark Streams from Couchbase Database Change Protocol streams (memory-to-memory streams that are used to replicate data between nodes and services)
• Performance tuning: topology awareness in Couchbase and locality in Spark
• Spark SQL, predicate pushdown, and in-memory indexing