How to Connect Spark to Your Own Datasource

Spark may have taken the big data world by storm by being super fast and easy to use. However, by design, Spark is not a datastore and only supports a limited number of sources for data. So how can you integrate your datasource with Spark? In this talk we’ll look at how to successfully write your own Spark connector. We’ll look in depth at the lessons learnt writing a new Spark Connector for MongoDB, and how you can apply those lessons to any potential data source as you build your own connector. At the core of Spark is the RDD, so the first step in building your connector is being able to create an RDD and partition data efficiently. Initially, it’s easiest to focus on Scala, but we’ll look at how to expand and support Java at the same time, and why it’s a good idea. We’ll look at how you can test the code and prove the connector works before expanding it to other Spark features. The next step is to expose the connector to Spark’s fastest-growing features: Spark Streaming and Spark SQL. Once we have a fully functioning Spark Connector for the JVM, we’ll look at how easy it is to extend it to support Python and R. Finally, we’ll look at how best to publish your connector so the world can find it and use it.
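
To make the partitioning step concrete, here is a minimal sketch of one way a connector might divide a source into slices that can be read in parallel. It is not taken from the talk or from the MongoDB connector: the names KeyRangePartition and split_key_range are hypothetical, and the sketch assumes the source has an integer key whose values are spread fairly evenly, which is often not true of real data.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class KeyRangePartition:
    """Hypothetical description of one slice of the source: keys in [lower, upper)."""
    index: int
    lower: int
    upper: int


def split_key_range(min_key: int, max_key: int, num_partitions: int) -> List[KeyRangePartition]:
    """Split the half-open range [min_key, max_key) into contiguous, near-equal slices."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if max_key < min_key:
        raise ValueError("max_key must not be smaller than min_key")

    # Spread any remainder over the first slices so sizes differ by at most one key.
    base, extra = divmod(max_key - min_key, num_partitions)
    partitions = []
    lower = min_key
    for index in range(num_partitions):
        size = base + (1 if index < extra else 0)
        partitions.append(KeyRangePartition(index, lower, lower + size))
        lower += size
    return partitions


if __name__ == "__main__":
    for partition in split_key_range(0, 10, 3):
        print(partition)
```

In a real connector each slice would become one Spark partition, and the task that computes it would query the source only for keys inside its own bounds. Splitting on evenly sized key ranges is the simplest choice; a source with skewed keys would probably need sampling or statistics from the datastore instead.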

About Ross Lawley

Ross Lawley is a polyglot JVM engineer for MongoDB and focuses on user-facing code that connects MongoDB to user applications. Ross maintains the Java and Scala drivers for MongoDB as well as the Spark Connector and other libraries. Previously, Ross worked on modernizing the Java driver, bringing asynchronous support and reactive extensions support to the community. In the real world, he's a massive rugby and real ale fan.