This is a repost of a blog from our friends at MongoDB. Sam is the Product Manager for Developer Experience at MongoDB based in New York.
We’ve added an example of the connector in the Databricks environment as a notebook.
We are delighted to announce general availability of the new, native MongoDB Connector for Apache Spark. It provides higher performance, greater ease of use, and access to more advanced Spark functionality than other connectors. With certification from Databricks, the company founded by the team that started the Spark research project at UC Berkeley that later became Apache Spark, developers can focus on building modern, data driven applications, knowing that the connector provides seamless integration and complete API compatibility between Spark processes and MongoDB.
Written in Scala, Apache Spark’s native language, the Connector provides a more natural development experience for Spark users. The connector exposes all of Spark’s libraries, enabling MongoDB data to be materialized as DataFrames and Datasets for analysis with machine learning, graph, streaming and SQL APIs, further benefiting from automatic schema inference.
The Connector also takes advantage of MongoDB’s aggregation pipeline and rich secondary indexes to extract, filter, and process only the range of data it needs – for example, analyzing all customers located in a specific geography. This is very different from simple NoSQL data stores that do not offer either secondary indexes or in-database aggregations. In these cases, Apache Spark would need to extract all data based on a simple primary key, even if only a subset of that data is required for the Spark process. This means more processing overhead, more hardware, and longer time-to-insight for the analyst.
To maximize performance across large, distributed data sets, the Spark connector is aware of data locality in a MongoDB cluster. RDDs are automatically processed on workers co-located with the associated MongoDB shard to minimize data movement across the cluster. The nearest read preference can be used to route Spark queries to the closest physical node in a MongoDB replica set, thus reducing latency.
“Users are already combining Apache Spark and MongoDB to build sophisticated analytics applications. The new native MongoDB Connector for Apache Spark provides higher performance, greater ease of use, and access to more advanced Apache Spark functionality than any MongoDB connector available today.” Reynold Xin, co-founder and chief architect of Databricks
To demonstrate how to use the connector, we’ve created a tutorial that uses MongoDB together with Apache Spark’s machine learning libraries to build a movie recommendation system. This example presumes you have familiarity with Spark. If you are new to Spark but would like to learn the basics of using Spark and MongoDB together, we encourage you to check out our new MongoDB University Course.
You can explore the tutorial in a Databricks notebook here.