Training AI models for real-world applications requires vast amounts of labeled data, which can be costly, time-consuming, and difficult to obtain at scale. Synthetic data generation in simulated environments offers a powerful alternative by enabling AI models to learn from physically accurate, controlled, and scalable virtual datasets before deployment.
Combining Omniverse Replicator, a core extension of Isaac Sim (a reference robotic simulation application), with the Databricks Data Intelligence Platform provides an end-to-end workflow for developing domain-specific AI models in industries like manufacturing, logistics, healthcare diagnostics, and robotics. By bringing together synthetic data generation, automated AI workflows, and scalable cloud infrastructure, organizations can accelerate AI development while reducing data acquisition challenges and improving model accuracy.
This blog explores the technical foundations of this integration and its real-world applications, and demonstrates how the collaboration between Databricks and NVIDIA is supercharging machine vision applications. By fusing the Databricks Data Intelligence Platform with NVIDIA's high-performance computing, enterprises can build, train, and deploy vision models faster than before.
The technical foundations of the integration start with a reference architecture that defines interfaces, data models, and communication protocols. Below is a generalized workflow that demonstrates the integration of applications developed with NVIDIA Omniverse and the Databricks Data Intelligence Platform to provide an end-to-end AI model training pipeline.
The steps within the workflow are as follows:
Within this architecture, Delta Lake is used as the integration layer between NVIDIA Omniverse and Databricks. We bridge the two platforms with a prototype custom writer, which allows an application developed with Omniverse to write synthetic data directly into the Lakehouse. With this approach, instead of writing the data to disk as PNG and NumPy files, Omniverse-powered applications write the generated synthetic images and corresponding metadata in Delta Lake format. The files land directly in cloud storage and are registered to Unity Catalog, where they are further processed in Databricks and made available for downstream model training.
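To make the writer idea more concrete, here is a minimal Python sketch of the general pattern: each generated frame becomes one flat row (image bytes plus metadata), and rows are handed to a sink in batches rather than saved as per-frame files. This is not the prototype writer described above. The class name `BufferedRowWriter`, the column names, and the `sink` callback are all hypothetical, and the actual Delta Lake append is assumed to live inside the sink and is not shown.

```python
import json
import zlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class BufferedRowWriter:
    """Hypothetical stand-in for a custom synthetic-data writer.

    Each frame is flattened into a single row so that images and their
    metadata travel together, instead of as separate PNG and NumPy files.
    """

    # Assumption: the sink appends a batch of rows to a table. In a real
    # pipeline it might convert rows to Arrow and append to a Delta table.
    sink: Callable[[List[Dict]], None]
    batch_size: int = 64
    _rows: List[Dict] = field(default_factory=list, init=False)
    _frame_id: int = field(default=0, init=False)

    def write(self, image_bytes: bytes, width: int, height: int,
              annotations: dict) -> None:
        """Buffer one frame; flush automatically when the batch is full."""
        self._rows.append({
            "frame_id": self._frame_id,
            "width": width,
            "height": height,
            "image": zlib.compress(image_bytes),  # compressed raw pixels
            "annotations": json.dumps(annotations, sort_keys=True),
        })
        self._frame_id += 1
        if len(self._rows) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Send any buffered rows to the sink and start a new batch."""
        if self._rows:
            self.sink(self._rows)
            self._rows = []


# Usage: collect batches in memory instead of a real table.
batches: List[List[Dict]] = []
writer = BufferedRowWriter(sink=batches.append, batch_size=2)
for _ in range(3):
    fake_pixels = bytes(4 * 4 * 3)  # a 4x4 RGB frame of zeros
    writer.write(fake_pixels, 4, 4, {"label": "scratch", "bbox": [0, 0, 2, 2]})
writer.flush()  # push the final partial batch
print([len(b) for b in batches])
```

Keeping the sink as an injected callback means the row-building logic can be exercised locally, while the storage-specific write stays in one place.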
The NVIDIA Omniverse and Databricks integration establishes a new paradigm for machine vision development encompassing synthetic data generation and easy-to-use, industrial-grade AI. Within manufacturing environments, defect detection models often encounter three primary challenges: identifying new defects, adapting to new products, and performing in diverse real-world environments.
To tackle these challenges, the NVIDIA Omniverse platform enables customers to build custom synthetic data generation pipelines. Developers can create entirely new camera angles, lighting conditions, and physical scenarios in their applications, significantly enhancing model robustness and adaptability beyond traditional augmentation methods such as rotating or brightening images.
Once image generation is automated, the synthetic data generation process becomes a tunable parameter within Databricks Managed MLflow. Its settings can be adjusted alongside traditional hyperparameters like learning rate and batch size. As you identify which variations impact model accuracy, you can refine your training approach to focus on the most effective combinations of synthetic data and hyperparameters while minimizing time spent on less productive configurations.
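As a rough illustration of treating synthetic data settings like hyperparameters, the Python sketch below builds one search space from both and evaluates every combination. It is a toy under stated assumptions: the parameter names (`lighting_lux`, `camera_angles`), the `toy_score` function, and the `log_run` callback are all hypothetical. In a tracked experiment, `log_run` could wrap MLflow calls such as `mlflow.log_params` and `mlflow.log_metric`; that wiring is assumed and not shown.

```python
import itertools
from typing import Callable, Dict, List, Optional, Tuple


def build_search_space(synthetic: Dict[str, list],
                       hyper: Dict[str, list]) -> List[Dict]:
    """Cartesian product of synthetic-data settings and hyperparameters."""
    merged = {f"synth.{k}": v for k, v in synthetic.items()}
    merged.update({f"train.{k}": v for k, v in hyper.items()})
    keys = list(merged)
    return [dict(zip(keys, combo))
            for combo in itertools.product(*(merged[k] for k in keys))]


def run_search(configs: List[Dict],
               train_and_eval: Callable[[Dict], float],
               log_run: Callable[[Dict, float], None]
               ) -> Optional[Tuple[Dict, float]]:
    """Evaluate each config, log it, and return the best (config, score)."""
    best = None
    for config in configs:
        score = train_and_eval(config)
        log_run(config, score)  # e.g. one tracked run per configuration
        if best is None or score > best[1]:
            best = (config, score)
    return best


def toy_score(config: Dict) -> float:
    """Made-up stand-in for 'generate data, train, return accuracy'."""
    return (config["synth.camera_angles"] / 16
            + config["synth.lighting_lux"] / 8000
            - abs(config["train.learning_rate"] - 1e-3) * 100)


configs = build_search_space(
    synthetic={"lighting_lux": [200, 800], "camera_angles": [4, 16]},
    hyper={"learning_rate": [1e-3, 1e-4], "batch_size": [32]},
)
history: List[Tuple[Dict, float]] = []
best = run_search(configs, toy_score, lambda c, s: history.append((c, s)))
print(len(configs), best[0])
```

A full grid is only the simplest option. The same structure works with random or Bayesian search once the logged runs show which synthetic variations matter.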
Treating synthetic data as a tunable parameter unlocks new use cases for manufacturers without disrupting actual operations:
These approaches enable manufacturers to train a broader variety of machine vision models to solve business problems proactively. Rare defects with data that was previously too sparse to train on can now be augmented with numerous realistic examples, allowing businesses to catch defects before they escape while preparing enterprises for the new age of Data Intelligence.
Siemens Healthineers, a joint healthcare customer of Databricks and NVIDIA, inspired this integration architecture after experiencing challenges with a fragmented workflow. One engineer generated synthetic data on-premises through an application developed with NVIDIA Omniverse, while another moved that data to the cloud for ML training and deployment in Databricks, which created delays.
By implementing Databricks Unity Catalog to centralize all data, functions, and models under a single governance framework and directly integrating the Omniverse platform’s synthetic data generation capabilities, the organization dramatically reduced model iteration cycles "from weeks to days," improved data integration and traceability, and accelerated time to market.
If you are attending NVIDIA GTC 2025, visit us at Databricks Booth #1733 or request a meeting with Databricks at GTC.
For more about NVIDIA Omniverse and the Databricks Data Intelligence Platform, please see the additional resources below:
NVIDIA Omniverse Website
Databricks Data Intelligence Platform Website
Databricks <> NVDA Partnership Announcement
Databricks - ML Ops Documentation
