Navigating Your Migration to Databricks: Architectures and Strategic Approaches
Summary
- There are two primary approaches to data warehouse migration on Databricks—ETL-First (back-to-front) and BI-First (front-to-back).
- An ETL-first migration involves creating a robust Lakehouse Data Model across the Bronze, Silver, and Gold layers. It ensures data governance and quality and promotes modernization but delays visible business outcomes, as BI and analytics benefits are realized only after the entire pipeline is built.
- A BI-First migration gives users early access to the new platform by modernizing BI systems first. Patterns like "Federate, Then Migrate" or "Replicate, Then Migrate" provide flexibility for phased migrations, ensuring quicker business value and alignment with evolving needs. Databricks features like Lakehouse Federation and LakeFlow Connect facilitate this approach.
In our previous blog, we explored the methodology recommended by our Professional Services teams for executing complex data warehouse migrations to Databricks. We highlighted the intricacies and challenges that can arise during such projects and emphasized the importance of making pivotal decisions during the migration strategy and design phase. These choices significantly influence both the migration's execution and the architecture of your target data platform. In this post, we dive into these decisions and outline the key data points necessary to make informed, effective choices throughout the migration process.
Migration strategy: ETL first or BI first?
Once you’ve established your migration strategy and designed a high-level target data architecture, the next decision is determining which workloads to migrate first. Two dominant approaches are:
- ETL-First Migration (Back-to-Front)
- BI-First Migration (Front-to-Back)
ETL-First Migration: Building the Foundation
The ETL-first, or back-to-front, migration begins by creating a comprehensive Lakehouse Data Model, progressing through the Bronze, Silver, and Gold layers. This approach involves three main activities: setting up data governance with Unity Catalog; ingesting data with tools like LakeFlow Connect, using techniques such as change data capture (CDC); and converting legacy ETL workflows and stored procedures into Databricks ETL. After rigorous testing, BI reports are repointed, and the AI/ML ecosystem is built on the Databricks Platform.
This strategy mirrors the natural flow of data—producing and onboarding data, then transforming it to meet use case requirements. It allows for a phased rollout of reliable pipelines and optimized Bronze and Silver layers, minimizing inconsistencies and improving the quality of data for BI. This is particularly useful for designing new Lakehouse data models from scratch, implementing Data Mesh, or redesigning data domains.
However, this approach often delays visible results for business users, whose budgets typically fund these initiatives. Migrating BI last means that improvements in performance, insights, and support for predictive analytics and GenAI projects may not materialize for months. Changing business requirements during migration can also create moving goalposts, affecting project momentum and organizational buy-in. The full benefits are only realized once the entire pipeline is completed and key subject areas in the Silver and Gold layers are built.
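To make the Bronze, Silver, and Gold progression concrete, here is a minimal, plain-Python sketch of the idea. It is an illustration only, not Databricks ETL code: the sample records, the `legacy_edw` source label, and the function names are all hypothetical, and a real pipeline would run on Delta tables and would typically quarantine bad rows rather than silently drop them.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw records as they might arrive from a legacy source system.
RAW_ORDERS = [
    {"order_id": "1", "region": "EMEA", "amount": "120.50", "order_date": "2024-01-03"},
    {"order_id": "2", "region": "emea", "amount": "79.50", "order_date": "2024-01-03"},
    {"order_id": "2", "region": "emea", "amount": "79.50", "order_date": "2024-01-03"},  # duplicate
    {"order_id": "3", "region": "APAC", "amount": "not-a-number", "order_date": "2024-01-04"},
    {"order_id": "4", "region": "APAC", "amount": "200.00", "order_date": "2024-01-04"},
]


def to_bronze(raw_records, source):
    """Bronze: keep records as received, adding only ingestion metadata."""
    return [{**record, "_source": source} for record in raw_records]


def to_silver(bronze_records):
    """Silver: enforce types, standardize values, drop duplicates and unparseable rows."""
    seen_ids = set()
    silver = []
    for record in bronze_records:
        try:
            amount = float(record["amount"])
        except ValueError:
            continue  # assumption: bad rows are skipped in this sketch
        if record["order_id"] in seen_ids:
            continue
        seen_ids.add(record["order_id"])
        silver.append({
            "order_id": int(record["order_id"]),
            "region": record["region"].upper(),
            "amount": amount,
            "order_date": date.fromisoformat(record["order_date"]),
        })
    return silver


def to_gold(silver_records):
    """Gold: aggregate into the shape a BI report consumes (revenue by region)."""
    totals = defaultdict(float)
    for record in silver_records:
        totals[record["region"]] += record["amount"]
    return dict(totals)


bronze = to_bronze(RAW_ORDERS, source="legacy_edw")
silver = to_silver(bronze)
gold = to_gold(silver)
```

The point of the sketch is the dependency order: the Gold aggregate that a BI report reads cannot exist until the Bronze and Silver steps upstream of it are in place, which is why business-visible results arrive last in an ETL-first migration.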
BI-First Migration: Delivering Immediate Value
The BI-first, or front-to-back, migration prioritizes the consumption layer. This approach gives users early access to the new data platform, showcasing its capabilities while migrating workflows that populate the consumption layer in a phased manner, either by use case or domain.
Key Product Features Enabling BI-First Migration
Two standout features of the Databricks Platform make the BI-first migration approach highly practical and impactful: Lakehouse Federation and LakeFlow Connect. These capabilities streamline the process of modernizing BI systems while ensuring agility, security, and scalability in your migration efforts.
- Lakehouse Federation: Unify Access Across Siloed Data Sources
Lakehouse Federation enables organizations to seamlessly access and query data across multiple siloed enterprise data warehouses (EDWs) and operational systems. It supports integration with major data platforms, including Teradata, Oracle, SQL Server, Snowflake, Redshift, and BigQuery.
- LakeFlow Connect:
LakeFlow Connect ingests and synchronizes data using Change Data Capture (CDC) technology. Instead of reloading full tables, it captures inserts, updates, and deletes at the source and applies them incrementally in Databricks, so the platform stays current with the source systems.
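As a rough illustration of what CDC-based incremental ingestion means in general, the sketch below applies a stream of change events to an in-memory copy of a table. This is a conceptual model only and says nothing about how LakeFlow Connect is implemented or configured: the event shape (`op`, `seq`, `row`), the `apply_changes` function, and the sample customer rows are all assumptions made for the example.

```python
def apply_changes(target, events, key="id"):
    """Apply CDC events to `target` (a dict keyed by primary key), in sequence order.

    Each event is a dict with:
      op  -- "insert", "update", or "delete"
      seq -- a monotonically increasing sequence number from the source log
      row -- the row image (for deletes, only the key column is required)
    """
    for event in sorted(events, key=lambda e: e["seq"]):
        primary_key = event["row"][key]
        if event["op"] == "delete":
            target.pop(primary_key, None)
        elif event["op"] in ("insert", "update"):
            target[primary_key] = event["row"]
        else:
            raise ValueError(f"unknown operation: {event['op']!r}")
    return target


# Hypothetical state of a replicated table after the initial snapshot load.
customers = {
    1: {"id": 1, "name": "Ada", "tier": "gold"},
    2: {"id": 2, "name": "Linus", "tier": "silver"},
}

# Hypothetical change events; they may arrive out of order, so they are sorted by seq.
events = [
    {"op": "delete", "seq": 3, "row": {"id": 2}},
    {"op": "insert", "seq": 1, "row": {"id": 3, "name": "Grace", "tier": "bronze"}},
    {"op": "update", "seq": 2, "row": {"id": 1, "name": "Ada", "tier": "platinum"}},
]

customers = apply_changes(customers, events)
```

Only the three changed rows are touched; the rest of the table is never re-read from the source, which is what makes this style of ingestion incremental.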
Patterns for BI-First Migration
By leveraging Lakehouse Federation and LakeFlow Connect, organizations can implement two distinct patterns for BI-first migration:
- Federate, Then Migrate:
Quickly federate legacy EDWs, expose their tables via Unity Catalog, and enable cross-system analysis. Incrementally ingest required data into Delta Lake, perform ETL to build Gold layer aggregates, and repoint BI reports to Databricks.
- Replicate, Then Migrate:
Use CDC pipelines to replicate operational and EDW data into the Bronze layer. Transform the data in Delta Lake and modernize BI workflows, unlocking siloed data for ML and GenAI projects.
Both patterns can be implemented use case by use case in an agile, phased approach. This ensures early business value, aligns with organizational priorities, and sets a blueprint for future projects. Legacy ETL can be migrated later, transitioning data sources to their true origins and retiring legacy EDW systems.
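One way to picture the use-case-by-use-case cutover in the "Federate, Then Migrate" pattern is as a routing table: each BI table reference resolves to the federated legacy source until that table has been migrated, and to the lakehouse copy afterwards. The sketch below is a hypothetical planning aid, not a Databricks API. The `MigrationRouter` class and the catalog names `legacy_edw_fed` and `lakehouse` are invented for illustration; the only platform convention it relies on is Unity Catalog's three-level `catalog.schema.table` naming.

```python
from dataclasses import dataclass, field


@dataclass
class MigrationRouter:
    """Tracks which tables have moved and resolves names for BI queries."""

    federated_catalog: str   # hypothetical catalog exposing the legacy EDW
    lakehouse_catalog: str   # hypothetical catalog holding migrated Delta tables
    migrated: set = field(default_factory=set)

    def mark_migrated(self, schema, table):
        """Record that a table now has a validated copy in the lakehouse."""
        self.migrated.add((schema, table))

    def resolve(self, schema, table):
        """Return the fully qualified name a report should query today."""
        if (schema, table) in self.migrated:
            catalog = self.lakehouse_catalog
        else:
            catalog = self.federated_catalog
        return f"{catalog}.{schema}.{table}"

    def progress(self, all_tables):
        """Fraction of the given (schema, table) pairs already migrated."""
        all_tables = list(all_tables)
        if not all_tables:
            return 0.0
        done = sum(1 for t in all_tables if t in self.migrated)
        return done / len(all_tables)


router = MigrationRouter(federated_catalog="legacy_edw_fed", lakehouse_catalog="lakehouse")
tables = [("sales", "orders"), ("sales", "customers")]

before = router.resolve("sales", "orders")   # still served by the federated source
router.mark_migrated("sales", "orders")      # the first use case is cut over
after = router.resolve("sales", "orders")    # now served from the lakehouse
```

Because reports keep the same schema and table names and only the catalog changes, each use case can be repointed on its own schedule while untouched tables continue to be served through federation.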
Conclusion
These migration strategies provide a clear path to modernizing your data platform with Databricks. By leveraging tools like Unity Catalog, Lakehouse Federation, and LakeFlow Connect, you can align your architecture and strategy with business goals while enabling advanced analytics capabilities. Whether you prioritize ETL-first or BI-first migration, the key is delivering incremental value and maintaining momentum throughout the transformation journey.