The New MongoDB Connector for Apache Spark In Action: Building a Movie Recommendation Engine

Published: July 21, 2016


This is a repost of a blog from our friends at MongoDB. Sam is the Product Manager for Developer Experience at MongoDB based in New York.

We've added an example of the connector in the Databricks environment as a notebook.

We are delighted to announce general availability of the new, native MongoDB Connector for Apache Spark. It provides higher performance, greater ease of use, and access to more advanced Spark functionality than other connectors. The connector is certified by Databricks, the company founded by the team that started the Spark research project at UC Berkeley that later became Apache Spark. With that certification, developers can focus on building modern, data-driven applications, knowing that the connector provides seamless integration and complete API compatibility between Spark processes and MongoDB.

Written in Scala, Apache Spark’s native language, the Connector provides a more natural development experience for Spark users. The connector exposes all of Spark’s libraries, enabling MongoDB data to be materialized as DataFrames and Datasets for analysis with machine learning, graph, streaming and SQL APIs, further benefiting from automatic schema inference.

The Connector also takes advantage of MongoDB’s aggregation pipeline and rich secondary indexes to extract, filter, and process only the range of data it needs – for example, analyzing all customers located in a specific geography. This is very different from simple NoSQL data stores that do not offer either secondary indexes or in-database aggregations. In these cases, Apache Spark would need to extract all data based on a simple primary key, even if only a subset of that data is required for the Spark process. This means more processing overhead, more hardware, and longer time-to-insight for the analyst.
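As a rough illustration of the kind of filter that can be evaluated inside MongoDB rather than in Spark, the sketch below builds an aggregation pipeline as plain Python data. The `$match` and `$project` stages are standard MongoDB aggregation operators. The helper `customers_in_country` and the field names (`address.country`, `name`, `total_spend`) are hypothetical and chosen only for this example. How a pipeline is handed to the connector depends on the connector version, so that step is not shown.

```python
from typing import Any, Dict, List


def customers_in_country(country: str) -> List[Dict[str, Any]]:
    """Build an aggregation pipeline that keeps only one country's customers.

    Field names are assumptions for illustration, not part of any real schema.
    """
    return [
        # $match runs first, so an index on address.country can be used
        # and non-matching documents never leave the database.
        {"$match": {"address.country": country}},
        # $project trims each document to the fields the analysis needs.
        {"$project": {"_id": 0, "name": 1, "total_spend": 1}},
    ]


pipeline = customers_in_country("DE")
```

Placing `$match` first is the important design choice here: it lets the database narrow the data set before anything is shipped to a Spark worker.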

To maximize performance across large, distributed data sets, the Spark connector is aware of data locality in a MongoDB cluster. RDDs are automatically processed on workers co-located with the associated MongoDB shard to minimize data movement across the cluster. The nearest read preference can be used to route Spark queries to the closest physical node in a MongoDB replica set, thus reducing latency.
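The read preference is typically set through the MongoDB connection string. The following is a minimal sketch that assembles such a string in Python. `readPreference=nearest` is a standard MongoDB connection string option. The helper `mongo_uri`, the host names, and the `movies` database are assumptions made for this example, and the exact configuration key the connector reads the URI from is not shown because it varies by connector version.

```python
from typing import List
from urllib.parse import urlencode


def mongo_uri(hosts: List[str], database: str,
              read_preference: str = "nearest") -> str:
    """Assemble a MongoDB connection string with a read preference.

    Hosts and database are hypothetical placeholders.
    """
    # Encode the option as a query string on the connection URI.
    query = urlencode({"readPreference": read_preference})
    return f"mongodb://{','.join(hosts)}/{database}?{query}"


uri = mongo_uri(["host1:27017", "host2:27017"], "movies")
print(uri)  # mongodb://host1:27017,host2:27017/movies?readPreference=nearest
```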

To demonstrate how to use the connector, we’ve created a tutorial that uses MongoDB together with Apache Spark’s machine learning libraries to build a movie recommendation system. This example presumes you have familiarity with Spark. If you are new to Spark but would like to learn the basics of using Spark and MongoDB together, we encourage you to check out our new MongoDB University Course.

You can explore the tutorial in a Databricks notebook here.
