by Michael Berk and Patrick Leahey
Databricks has joined forces with the Virtue Foundation through Databricks for Good, a grassroots initiative providing pro bono professional services to drive social impact. Through this partnership, the Virtue Foundation will advance its mission of delivering quality healthcare worldwide by optimizing a cutting-edge data infrastructure.
The Virtue Foundation utilizes both static and dynamic data sources to connect doctors with volunteer opportunities. To ensure data remains current, the organization’s data team implemented API-based data retrieval pipelines. While the extraction of basic information such as organization names, websites, phone numbers, and addresses is automated, specialized details like medical specialties and regions of activity require significant manual effort. This reliance on manual processes limits scalability and reduces the frequency of updates. Additionally, the dataset’s tabular format presents usability challenges for the Foundation’s primary users, such as doctors and academic researchers.
In short, the Virtue Foundation aims to ensure its core datasets are consistently up-to-date, accurate, and readily accessible. To realize this vision, Databricks professional services designed and built the following components.
As depicted in the diagram above, we utilize a classic medallion architecture to structure and process our data. Our data sources include a range of API and web-based inputs, which we first ingest into a bronze landing zone via batch Spark processes. This raw data is then refined in a silver layer, where we clean and extract metadata via incremental Spark processes, typically implemented with structured streaming.
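For illustration, an incremental bronze-to-silver job of this shape can be expressed with Structured Streaming. The table names, payload fields, and checkpoint path below are placeholders rather than the Foundation's actual schema.

```python
# Minimal sketch of an incremental bronze -> silver job (hypothetical names).
# `spark` is the SparkSession provided in a Databricks notebook/job.
from pyspark.sql import functions as F

bronze_stream = spark.readStream.table("vf_bronze.raw_org_pages")  # raw API/web payloads

silver_stream = (
    bronze_stream
    .withColumn("ingested_at", F.current_timestamp())
    .withColumn("org_name", F.col("payload.name"))     # lightweight metadata extraction
    .withColumn("website", F.col("payload.website"))
    .dropDuplicates(["org_name", "website"])
)

(
    silver_stream.writeStream
    .option("checkpointLocation", "/Volumes/vf/checkpoints/silver_orgs")
    .trigger(availableNow=True)   # process all new data incrementally, then stop
    .toTable("vf_silver.org_pages")
)
```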
Once processed, the data is sent to two production systems. In the first, we create a robust, tabular dataset that contains essential information about hospitals, NGOs, and related entities, including their location, contact information, and medical specialties. In the second, we implement a LangChain-based ingestion pipeline that incrementally chunks and indexes raw text data into a Databricks Vector Search.
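As a simplified, non-incremental sketch of the second path, the snippet below chunks raw page text with LangChain's RecursiveCharacterTextSplitter and lands the chunks in a Delta table that a Vector Search Delta Sync index can keep indexed. The table, column, and index names are assumptions, not the production pipeline.

```python
# Sketch: chunk raw page text with LangChain and write the chunks to a Delta
# table that backs a Databricks Vector Search Delta Sync index.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

@F.udf(returnType=ArrayType(StringType()))
def chunk_text(text):
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    return splitter.split_text(text or "")

chunks = (
    spark.read.table("vf_silver.org_pages")
    .withColumn("chunk", F.explode(chunk_text("raw_text")))
    .withColumn("chunk_id", F.md5(F.concat_ws("::", F.col("url"), F.col("chunk"))))
    .select("chunk_id", "url", "chunk")
)

chunks.write.mode("overwrite").saveAsTable("vf_gold.org_text_chunks")
# A Delta Sync index created on vf_gold.org_text_chunks (with change data feed
# enabled) then indexes new chunks into Databricks Vector Search automatically.
```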
From a user perspective, these processed data sets are accessible through vfmatch.org and are integrated into a Retrieval-Augmented Generation (RAG) chatbot, hosted in the Databricks AI Playground, providing users with a powerful, interactive data exploration tool.
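To make the retrieval side concrete, a lookup against the index might look like the following; the endpoint and index names are hypothetical.

```python
# Sketch: retrieving supporting chunks for a user question from the index.
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()
index = vsc.get_index(
    endpoint_name="vf_vector_search",
    index_name="vf_gold.org_text_chunks_index",
)

results = index.similarity_search(
    query_text="Which hospitals in Kenya need ophthalmology volunteers?",
    columns=["chunk", "url"],
    num_results=5,
)
# The returned chunks are passed to the chatbot's LLM as grounding context.
```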
The vast majority of this project leveraged standard ETL techniques; however, a few intermediate and advanced techniques proved valuable in this implementation.
The Virtue Foundation uses MongoDB as the serving layer for their website. Connecting Databricks to an external database like MongoDB can be complex due to compatibility limitations—certain Databricks operations may not be fully supported in MongoDB and vice versa, complicating the flow of data transformations across platforms.
To address this, we implemented a bidirectional sync that gives us full control over how data from the silver layer is merged into MongoDB. This sync maintains two identical copies of the data, so changes in one platform are reflected in the other based on the sync trigger frequency. At a high level, there are two components:
1. Databricks to MongoDB: curated data from the silver layer is merged into MongoDB on each sync trigger, so the website serves the latest records.
2. MongoDB to Databricks: changes made in MongoDB are captured and applied back with a merge statement within forEachBatch(), keeping the Databricks tables updated with these changes.

This bidirectional setup ensures that data flows seamlessly between Databricks and MongoDB, keeping both systems up-to-date and eliminating data silos.
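A minimal sketch of the second component is shown below, assuming the MongoDB change feed has already been landed as a streaming DataFrame (mongo_changes_df); the table name, key column, and checkpoint path are placeholders.

```python
# Sketch of the MongoDB -> Databricks direction: each micro-batch of captured
# changes is upserted into the target Delta table.
from delta.tables import DeltaTable

def upsert_to_databricks(batch_df, batch_id):
    target = DeltaTable.forName(spark, "vf_silver.organizations")
    (
        target.alias("t")
        .merge(batch_df.alias("s"), "t.org_id = s.org_id")  # org_id is an assumed key
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    mongo_changes_df.writeStream
    .foreachBatch(upsert_to_databricks)
    .option("checkpointLocation", "/Volumes/vf/checkpoints/mongo_sync")
    .trigger(availableNow=True)
    .start()
)
```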
Thank you Alan Reese for owning this piece!
To streamline data integration, we implemented a GenAI-based approach for extracting and merging hospital information from blocks of website text. This process involves two key steps:

1. Entity extraction: an LLM parses raw website text into structured fields such as organization name, location, and medical specialties.
2. Entity matching: each newly extracted record is compared against existing entries so that the same hospital or NGO is merged rather than duplicated.
Traditionally, the matching step would have required fuzzy matching techniques and complex rule sets. However, by combining embedding distance with simple deterministic rules (for instance, an exact match on country), we were able to create a solution that is both effective and relatively simple to build and maintain.
For the current iteration of the product, the matching criteria combine an embedding-distance threshold with deterministic checks such as an exact match on country.
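A minimal sketch of this matching logic, with illustrative column names and an assumed similarity threshold:

```python
# Sketch: exact country match as a deterministic filter, then embedding cosine
# similarity. Column names and the 0.85 threshold are illustrative; embeddings
# are assumed to be precomputed vectors.
import numpy as np
import pandas as pd

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_candidates(new_orgs: pd.DataFrame, existing_orgs: pd.DataFrame,
                     threshold: float = 0.85) -> pd.DataFrame:
    # Deterministic rule: only compare records from the same country.
    pairs = new_orgs.merge(existing_orgs, on="country", suffixes=("_new", "_existing"))
    # Embedding rule: score each remaining pair by description similarity.
    pairs["similarity"] = pairs.apply(
        lambda r: cosine_similarity(r["embedding_new"], r["embedding_existing"]), axis=1
    )
    return pairs[pairs["similarity"] >= threshold]
```

In a setup like this, pairs above the threshold would be treated as the same entity and merged, while the remaining records are inserted as new entries.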
Thank you Patrick Leahey for the amazing design idea and implementing it end to end!
As mentioned, the broader infrastructure follows standard Databricks architecture and practices. Here’s a breakdown of the key components and the team members who made it all possible:
Through our collaboration with the Virtue Foundation, we’re demonstrating the potential of data and AI to create lasting global impact in healthcare. From data ingestion and entity extraction to Retrieval-Augmented Generation, each phase of this project is a step toward creating an enriched, automated, and interactive data marketplace. Our combined efforts are setting the stage for a data-driven future where healthcare insights are accessible to those who need them most.
If you have ideas on similar engagements with other global non-profits, let us know at [email protected].