Two ways to use this template
1. Agent-driven: click "Copy prompt" below and paste it into Cursor, Claude Code, Codex, or any coding agent. The agent builds the app and asks questions along the way so the result is exactly what you want.
2. Manual: follow the steps below to set things up yourself, at your own pace.
Operational Data Analytics
End-to-end setup for analyzing operational database data in the lakehouse: Unity Catalog with external storage, Lakebase provisioning, Lakehouse Sync CDC replication, and a medallion architecture pipeline with silver and gold layers.
Prerequisites
This template creates a Unity Catalog catalog backed by an external S3 bucket and requires AWS-specific privileges both in Databricks and in AWS IAM.
- Databricks CLI authenticated. Run `databricks auth profiles` and confirm at least one profile shows `Valid: YES`. If none do, authenticate with `databricks auth login --host <workspace-url> --profile <PROFILE>`.
- AWS workspace. This template uses S3 and IAM; it does not apply to Azure or GCP workspaces. Confirm your workspace host is `*.cloud.databricks.com`.
- Unity Catalog enabled on the workspace. Run `databricks catalogs list --profile <PROFILE>` and confirm the command succeeds (the system catalogs `main` and `system` should appear). A `not enabled` error means the workspace is not attached to a Unity Catalog metastore.
- Metastore privileges to create credentials and catalogs. You need `CREATE STORAGE CREDENTIAL`, `CREATE EXTERNAL LOCATION`, and `CREATE_CATALOG` on the metastore. If any CLI call in this template returns `PERMISSION_DENIED`, ask your metastore admin to grant the missing privilege.
- An S3 bucket and IAM role in the same AWS account and region as your workspace. The template walks through pointing a storage credential at the IAM role; you must already have permissions in AWS to create the role and policy.
Verify these Databricks workspace features are enabled before starting. If any check fails, ask your workspace admin to enable the feature.
- Databricks CLI authenticated. Run `databricks auth profiles` and confirm at least one profile shows `Valid: YES`. If none do, authenticate with `databricks auth login --host <workspace-url> --profile <PROFILE>`.
- Lakebase Postgres available in the workspace. Run `databricks postgres list-projects --profile <PROFILE>` and confirm the command succeeds (an empty list is fine — you are about to create the first project). A `not enabled` or permission error means Lakebase is not available to this identity.
Verify these Databricks workspace features are enabled before starting. Lakehouse Sync is a Beta feature and has stricter workspace requirements than most other templates.
- Databricks CLI authenticated. Run `databricks auth profiles` and confirm at least one profile shows `Valid: YES`. If none do, authenticate with `databricks auth login --host <workspace-url> --profile <PROFILE>`.
- AWS workspace in an Autoscaling region. Lakehouse Sync is currently available in Beta on AWS only; Azure is not yet supported. Confirm your workspace host is `*.cloud.databricks.com` (AWS) rather than `*.azuredatabricks.net` (Azure).
- A Lakebase Autoscaling project with tables. Run `databricks postgres list-projects --profile <PROFILE>` and confirm your Autoscaling project appears. A `not enabled` error means Lakebase is unavailable to this identity. This template does not cover project creation — see Create a Lakebase Instance if you need one.
- Unity Catalog access. Run `databricks catalogs list --profile <PROFILE>` and confirm the destination catalog and schema you want to replicate into are present. You will select them when enabling Lakehouse Sync in Step 3.
This template creates a synced table that mirrors a Unity Catalog table into Lakebase Postgres. Verify these Databricks workspace features are enabled before starting.
- Databricks CLI authenticated. Run `databricks auth profiles` and confirm at least one profile shows `Valid: YES`. If none do, authenticate with `databricks auth login --host <workspace-url> --profile <PROFILE>`.
- Lakebase Autoscaling available. Run `databricks postgres list-projects --profile <PROFILE>` and confirm your Autoscaling project is listed. A `not enabled` error means Lakebase is not available to this identity.
- Project created via the `/database/` API (not the older `/postgres/` API). Programmatic synced-table creation via `databricks database create-synced-database-table` only works on projects created through the newer `/database/` API. If your Autoscaling project was created via the older `/postgres/` endpoint, the CLI path in Step 1 is not available yet and you must create synced tables through the Databricks UI (Catalog → source table → Create synced table). This gap is expected to close in a future release.
- Unity Catalog source table with a primary key. Run `databricks tables get <CATALOG>.<SCHEMA>.<SOURCE_TABLE> --profile <PROFILE>` and confirm at least one column is declared as the table's primary key. Synced tables reject sources without a PK.
- External-storage catalog for the source (currently required for Sync Tables). Sync Tables today requires the source UC catalog to use external storage. If your source catalog uses the metastore's default managed storage, complete the Unity Catalog Setup template first and move the source table into an external-storage catalog.
- Change Data Feed enabled on the source table (Triggered/Continuous mode only). Skip this check if you plan to use Snapshot mode. Otherwise run the `ALTER TABLE <catalog>.<schema>.<table> SET TBLPROPERTIES (delta.enableChangeDataFeed = true);` statement from Step 1 against your SQL warehouse.
This template builds a Lakeflow Declarative Pipeline on top of existing Lakehouse Sync CDC history tables. Verify these Databricks workspace features are enabled before starting.
- Databricks CLI authenticated. Run `databricks auth profiles` and confirm at least one profile shows `Valid: YES`. If none do, authenticate with `databricks auth login --host <workspace-url> --profile <PROFILE>`.
- Lakeflow Declarative Pipelines (serverless) available. Run `databricks pipelines list --profile <PROFILE>` and confirm the command succeeds (an empty list is fine). A permission or `not enabled` error means Lakeflow Pipelines is not available to this identity.
- Unity Catalog access with a writable destination schema. Run `databricks catalogs list --profile <PROFILE>` and confirm your destination catalog is listed. You will need `USE_CATALOG` on the catalog and `USE_SCHEMA` + `CREATE_TABLE` on the destination schema to publish silver and gold tables.
- Bronze CDC history tables already in Unity Catalog. Complete the Lakebase Change Data Feed (Autoscaling, Lakehouse Sync) template first so `lb_<entity>_history` tables exist in the bronze schema. This template reads from those tables; it does not create them.
Set Up Unity Catalog with External Storage
Create a Unity Catalog catalog backed by an external S3 bucket. For most use cases, the default metastore-managed storage works fine and requires no extra setup (just `databricks catalogs create <CATALOG_NAME>`). Use this template when you specifically need external storage.
Note: Sync Tables (syncing data from the lakehouse back to Lakebase) currently requires the source Unity Catalog catalog to use external storage. Default managed storage does not work with synced tables yet. If you plan to sync lakehouse data back to Lakebase, follow this template. This requirement is expected to be removed in a future release.
When to use this
- You plan to use Sync Tables to sync lakehouse data back to Lakebase (external storage is currently required for this)
- You want to control the S3 bucket location, encryption, and lifecycle policies
- You need cross-account or cross-workspace access to the underlying data
1. Create an IAM role for the storage credential
Create an IAM role in AWS that grants Databricks access to your S3 bucket. The trust policy must allow the Databricks account to assume the role.
Minimal IAM policy for the role:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListBucket",
"s3:GetBucketLocation"
],
"Resource": ["arn:aws:s3:::<BUCKET_NAME>", "arn:aws:s3:::<BUCKET_NAME>/*"]
}
]
}
Note the IAM role ARN (e.g., `arn:aws:iam::<ACCOUNT_ID>:role/<ROLE_NAME>`).
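The trust policy mentioned above can be sketched as follows. The principal ARN and external ID here are placeholders, not the real Databricks values — take the actual ones from the storage credential creation flow in the Databricks docs/UI — so treat this as an assumed shape rather than the exact policy:

```python
import json

# Hypothetical placeholders: substitute the real principal ARN and external ID
# shown during storage credential creation in Databricks.
DATABRICKS_PRINCIPAL_ARN = "arn:aws:iam::<DATABRICKS_ACCOUNT>:role/<DATABRICKS_ROLE>"
EXTERNAL_ID = "<YOUR_DATABRICKS_ACCOUNT_ID>"

# Trust policy allowing the Databricks principal to assume the role,
# scoped with an external-ID condition.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": DATABRICKS_PRINCIPAL_ARN},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": EXTERNAL_ID}},
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
```

Attach this as the role's trust relationship alongside the permissions policy above.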
2. Create a storage credential
Register the IAM role as a storage credential in Unity Catalog:
databricks storage-credentials create <CREDENTIAL_NAME> \
--json '{
"aws_iam_role": {
"role_arn": "arn:aws:iam::<ACCOUNT_ID>:role/<ROLE_NAME>"
}
}' --profile <PROFILE>
Verify:
databricks storage-credentials get <CREDENTIAL_NAME> --profile <PROFILE>
3. Create an external location
Map the S3 bucket path to the storage credential:
databricks external-locations create <LOCATION_NAME> \
s3://<BUCKET_NAME>/<PREFIX> \
<CREDENTIAL_NAME> \
--comment "External storage for analytics catalog" \
--profile <PROFILE>
Verify:
databricks external-locations get <LOCATION_NAME> --profile <PROFILE>
4. Create a catalog with external storage
Create the catalog and point its managed storage to the external location:
databricks catalogs create <CATALOG_NAME> \
--storage-root s3://<BUCKET_NAME>/<PREFIX> \
--comment "Catalog for operational and analytics data" \
--profile <PROFILE>
All managed tables created in this catalog store their data in the specified S3 path instead of the metastore default.
Verify:
databricks catalogs get <CATALOG_NAME> --profile <PROFILE>
5. Create a schema
Create a schema within the catalog for your tables:
databricks schemas create <SCHEMA_NAME> <CATALOG_NAME> \
--comment "Schema for lakehouse tables" \
--profile <PROFILE>
Verify:
databricks schemas list <CATALOG_NAME> --profile <PROFILE>
What you end up with
- A storage credential linked to your IAM role for S3 access
- An external location mapping an S3 path to the credential
- A Unity Catalog catalog storing managed table data in your S3 bucket
- A schema ready for tables from Lakebase Change Data Feed, Lakehouse Sync, or Lakeflow Pipelines
Troubleshooting
| Issue | Fix |
|---|---|
| `PERMISSION_DENIED` on credential creation | Confirm you have `CREATE STORAGE CREDENTIAL` on the metastore |
| `ACCESS_DENIED` on S3 during validation | Verify the IAM role trust policy allows Databricks to assume it |
| Bucket not found | Confirm the bucket exists in the same AWS region as the workspace |
| Catalog creation fails with storage root error | Verify the external location covers the specified S3 path |
References
- Create a storage credential and external location for S3
- Create catalogs
- External locations CLI
- Storage credentials CLI
Create a Lakebase Instance
Provision a managed Lakebase Postgres project on Databricks and collect the connection values needed by downstream templates.
1. Create a Lakebase project
Create a new Lakebase Postgres project. This provisions a managed Postgres cluster with a default branch and endpoint:
databricks postgres create-project <project-name> --profile <PROFILE>
2. Verify the project resources
Confirm the branch, endpoint, and database were created:
databricks postgres list-branches \
projects/<project-name> \
--profile <PROFILE> -o json
databricks postgres list-endpoints \
projects/<project-name>/branches/production \
--profile <PROFILE> -o json
databricks postgres list-databases \
projects/<project-name>/branches/production \
--profile <PROFILE> -o json
3. Note the connection values
Record these values from the command output above. They are required by the Lakebase Data Persistence template and other Lakebase-dependent templates:
| Value | JSON path | Used for |
|---|---|---|
| Endpoint host | `...status.hosts.host` | `PGHOST`, `lakebase.postgres.host` |
| Endpoint resource path | `...name` | `LAKEBASE_ENDPOINT`, `lakebase.postgres.endpointPath` |
| Database resource path | `...name` | `lakebase.postgres.database` |
| PostgreSQL database name | `...status.postgres_database` | `PGDATABASE`, `lakebase.postgres.databaseName` |
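The values in the table can be pulled out of the `-o json` output with a short script. The response shapes below are illustrative samples inferred from the JSON paths in the table, not the exact API schema, so adjust the key paths to what your CLI actually returns:

```python
# Illustrative samples only — shaped after the JSON paths in the table above,
# not the exact output schema of `databricks postgres list-endpoints` /
# `list-databases`.
endpoint = {
    "name": "projects/my-project/branches/production/endpoints/primary",
    "status": {"hosts": {"host": "ep-example-123.db.example.com"}},
}
database = {
    "name": "projects/my-project/branches/production/databases/main",
    "status": {"postgres_database": "databricks_postgres"},
}

# Collect the connection values downstream templates expect.
conn_values = {
    "PGHOST": endpoint["status"]["hosts"]["host"],
    "LAKEBASE_ENDPOINT": endpoint["name"],
    "lakebase.postgres.database": database["name"],
    "PGDATABASE": database["status"]["postgres_database"],
}
for key, value in conn_values.items():
    print(f"{key}={value}")
```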
Replicate Lakebase Tables to Unity Catalog with Lakehouse Sync
Replicate your Lakebase Autoscaling Postgres tables into Unity Catalog as managed Delta tables using Lakehouse Sync. CDC captures every row-level change and writes it as SCD Type 2 history, giving you a full audit trail queryable from the lakehouse.
This template is for Lakebase Autoscaling (projects/branches/endpoints with scale-to-zero).
When to use this
- You want to analyze operational data (orders, user activity, support tickets) in the lakehouse
- You need a historical record of every insert, update, and delete from your Postgres tables
- You want to join operational data with analytics data in Spark, SQL, or BI tools
- You need to feed Lakebase data into downstream pipelines or ML models
How it works
Note: Lakehouse Sync is currently in Beta on AWS only (all Autoscaling regions). Azure support is not yet available. It is a native Lakebase feature with no external compute, pipelines, or jobs required, and there is no incremental charge for replication beyond the underlying Lakebase compute and storage costs.
Lakehouse Sync uses Change Data Capture (CDC) to stream changes from Lakebase Postgres into Unity Catalog. For each synced table, a Delta history table is created:
lb_<table_name>_history
Each row includes metadata columns:
- `_change_type`: `insert`, `update_preimage`, `update_postimage`, or `delete`
- `_lsn`: Log Sequence Number for ordering changes
- `_commit_timestamp`: when the change was captured
1. Verify table replica identity
Lakehouse Sync requires the right replica identity for capturing changes. Connect to your Lakebase database and check:
SELECT n.nspname AS table_schema,
c.relname AS table_name,
CASE c.relreplident
WHEN 'd' THEN 'default'
WHEN 'n' THEN 'nothing'
WHEN 'f' THEN 'full'
WHEN 'i' THEN 'index'
END AS replica_identity
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
AND n.nspname = 'public'
ORDER BY n.nspname, c.relname;
If a table shows `default` or `nothing`, set it to `FULL`:
ALTER TABLE <table_name> REPLICA IDENTITY FULL;
2. Check for unsupported data types
SELECT c.table_schema, c.table_name, c.column_name, c.udt_name AS data_type
FROM information_schema.columns c
JOIN pg_catalog.pg_type t ON t.typname = c.udt_name
WHERE c.table_schema = 'public'
AND c.table_name IN (
SELECT tablename FROM pg_tables WHERE schemaname = c.table_schema
)
AND NOT (
c.udt_name IN (
'bool', 'int2', 'int4', 'int8', 'text', 'varchar', 'bpchar',
'jsonb', 'numeric', 'date', 'timestamp', 'timestamptz',
'real', 'float4', 'float8'
)
OR t.typcategory = 'E'
)
ORDER BY c.table_schema, c.table_name, c.ordinal_position;
If unsupported types appear, restructure those columns before enabling sync.
3. Enable Lakehouse Sync
Note: This step is not yet available via CLI or REST API and must be completed through the Databricks UI:
In Catalog, open your Autoscaling project → branch → Lakehouse Sync → Start Sync, then select the source database/schema, destination catalog/schema, and tables.
4. Monitor sync status
Check active syncs from Postgres (the wal2delta schema only exists after Lakehouse Sync has been enabled in Step 3):
SELECT * FROM wal2delta.tables;
5. Query the history tables
Latest state of each row
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY _lsn DESC) AS rn
FROM <catalog>.<schema>.lb_<table_name>_history
WHERE _change_type IN ('insert', 'update_postimage', 'delete')
)
WHERE rn = 1
AND _change_type != 'delete';
Full change history for a record
SELECT *
FROM <catalog>.<schema>.lb_<table_name>_history
WHERE id = 12345
ORDER BY _lsn;
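The latest-state query above can be mirrored in plain Python over a few toy CDC rows (all values invented) to show exactly how the dedup works:

```python
# Toy CDC history rows mimicking lb_<table_name>_history (values invented).
history = [
    {"id": 1, "status": "new",     "_change_type": "insert",           "_lsn": 100},
    {"id": 1, "status": "new",     "_change_type": "update_preimage",  "_lsn": 200},
    {"id": 1, "status": "shipped", "_change_type": "update_postimage", "_lsn": 200},
    {"id": 2, "status": "new",     "_change_type": "insert",           "_lsn": 150},
    {"id": 2, "status": "new",     "_change_type": "delete",           "_lsn": 300},
]

# Same logic as the ROW_NUMBER() query: keep state-bearing rows
# (insert/update_postimage) plus delete tombstones, take the highest
# LSN per primary key, then drop tombstones.
latest = {}
for row in history:
    if row["_change_type"] not in ("insert", "update_postimage", "delete"):
        continue  # preimages never represent current state
    if row["id"] not in latest or row["_lsn"] > latest[row["id"]]["_lsn"]:
        latest[row["id"]] = row

current = [r for r in latest.values() if r["_change_type"] != "delete"]
print(current)  # id=1 survives with status 'shipped'; id=2 was deleted
```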
6. Handle schema changes
If you need to change a synced table's schema in Postgres, use the rename-and-swap pattern:
CREATE TABLE users_v2 (
id INT PRIMARY KEY,
name TEXT,
new_column TEXT
);
ALTER TABLE users_v2 REPLICA IDENTITY FULL;
INSERT INTO users_v2 SELECT *, NULL FROM users;
BEGIN;
ALTER TABLE users RENAME TO users_backup;
ALTER TABLE users_v2 RENAME TO users;
COMMIT;
What you end up with
- Delta history tables in Unity Catalog (`lb_<table_name>_history`) with full SCD Type 2 change tracking
- Continuous replication. Changes stream from Postgres to Delta automatically.
- No external compute. Lakehouse Sync is a native Lakebase feature.
- Operational data queryable in Spark SQL, notebooks, BI tools, and downstream pipelines
Troubleshooting
| Issue | Fix |
|---|---|
| Table not appearing in sync | Ensure it has a primary key or REPLICA IDENTITY FULL |
| Unsupported data type error | Check column types with the query in Step 2 |
| Sync lag increasing | Check Lakebase endpoint health and compute scaling |
| Missing changes on update/delete | Verify REPLICA IDENTITY FULL; the default replica identity only captures PK columns |
Limitations
- AWS only. Lakehouse Sync Beta is available in all Autoscaling regions on AWS. Azure support is not yet available.
- No incremental charge. Replication cost is included in your Lakebase compute and storage.
- Works alongside synced tables. You can use Lakehouse Sync in a project/schema that also has synced tables.
Sync a Unity Catalog Table to Lakebase
Serve lakehouse data through Lakebase Autoscaling Postgres so your applications can query it with sub-10ms latency. This creates a synced table, a managed copy of your Unity Catalog table in Lakebase that stays up to date automatically.
This template is for Lakebase Autoscaling (projects/branches/endpoints with scale-to-zero). For Lakebase Provisioned (manually scaled instances), see the Provisioned Sync Tables template (coming soon).
When to use this
- Your app needs fast lookup-style queries against analytics data (user profiles, feature values, risk scores)
- You want to serve gold tables, ML outputs, or enriched records through a standard Postgres connection
- You need ACID transactions and sub-10ms reads alongside your operational state
Choose a sync mode
| Mode | Behavior | Best for |
|---|---|---|
| Snapshot | One-time full copy | Source changes >10% of rows per cycle, or source doesn't support CDF (views, Iceberg) |
| Triggered | Incremental updates on demand or on a schedule | Known cadence of changes, good cost/freshness balance |
| Continuous | Real-time streaming (seconds of latency) | Changes must appear in Lakebase near-instantly |
Triggered and Continuous modes require Change Data Feed (CDF) enabled on the source table. If it's not enabled, run:
ALTER TABLE <catalog>.<schema>.<table> SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
Sync throughput
Autoscaling CUs are physically 8x smaller than Provisioned CUs, so per-CU throughput differs:
| Mode | Rows/sec per CU |
|---|---|
| Snapshot (initial + full refresh) | ~2,000 |
| Triggered / Continuous (incremental) | ~150 |
A 10x speedup for large-table snapshot sync (writing Postgres pages directly, leveraging separation of storage and compute) is coming for Autoscaling only.
1. Create a synced table
databricks database create-synced-database-table \
--json '{
"name": "<CATALOG>.<SCHEMA>.<SYNCED_TABLE_NAME>",
"database_instance_name": "<INSTANCE_NAME>",
"logical_database_name": "<POSTGRES_DATABASE>",
"spec": {
"source_table_full_name": "<CATALOG>.<SCHEMA>.<SOURCE_TABLE>",
"primary_key_columns": ["<PRIMARY_KEY_COLUMN>"],
"scheduling_policy": "<SNAPSHOT|TRIGGERED|CONTINUOUS>",
"create_database_objects_if_missing": true
}
}' --profile <PROFILE>
If your Lakebase database is registered as a Unity Catalog catalog, you can omit `database_instance_name` and `logical_database_name`.
Verify:
databricks database get-synced-database-table <CATALOG>.<SCHEMA>.<SYNCED_TABLE_NAME> --profile <PROFILE>
Important: If your Autoscaling project was created via the `/postgres/` API (not `/database/`), programmatic synced table creation is not yet available via CLI. Use the Databricks UI as a fallback: in Catalog, select the source table → Create synced table, then choose your Lakebase project, branch, sync mode, and pipeline. This gap is expected to close soon.
2. Configure pipeline reuse
How you set up pipelines depends on your sync mode:
| Sync mode | Recommendation | Why |
|---|---|---|
| Continuous | Reuse a pipeline across ~10 tables | Cost-advantageous (e.g., 1 pipeline for 10 tables ≈ $204/table/month vs $2,044/table/month for individual pipelines) |
| Snapshot / Triggered | Separate pipelines per table | Allows re-snapshotting individual tables without impacting others |
3. Schedule ongoing syncs
The initial snapshot runs automatically on creation. For Snapshot and Triggered modes, subsequent syncs need to be triggered.
Note: Table-update triggers for sync pipelines are not yet available via CLI and must be configured through the Databricks UI: Workflows → create/open a job → add a Database Table Sync pipeline task → Schedules & Triggers → add a Table update trigger pointing to your source table.
Trigger a sync update programmatically via the Databricks SDK:
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
table = w.database.get_synced_database_table(
name="<CATALOG>.<SCHEMA>.<SYNCED_TABLE_NAME>"
)
pipeline_id = table.data_synchronization_status.pipeline_id
w.pipelines.start_update(pipeline_id=pipeline_id)
4. Query the synced data in Postgres
Once synced, the table is available in Lakebase Postgres. The Unity Catalog schema becomes the Postgres schema:
SELECT * FROM "<schema>"."<synced_table_name>" WHERE "user_id" = 12345;
Connect with any standard Postgres client (psql, DBeaver, your application's Postgres driver).
What you end up with
- A synced table in Unity Catalog that tracks the sync pipeline
- A read-only Postgres table in Lakebase that your apps can query with sub-10ms latency
- A managed Lakeflow pipeline that keeps the data in sync based on your chosen mode
- Up to 16 connections per sync to your Lakebase database
Important constraints
- Primary key is mandatory. Synced tables always require a primary key. It enables efficient point lookups and incremental updates. Rows with nulls in PK columns are excluded from the sync.
- Duplicate primary keys fail the sync unless you configure a
timeseries_keyfor deduplication (latest value wins per PK). Using a timeseries key has a performance penalty. - Schema changes: For Triggered/Continuous mode, only additive changes (e.g., adding a column) propagate. Dropping or renaming columns requires recreating the synced table.
- FGAC tables: Direct sync of Fine-Grained Access Control tables fails. Workaround: create a view (
SELECT * FROM table), then sync the view in Snapshot mode. Caveat: runs as the sync creator and only sees their visible rows. - Connection limits: Autoscaling supports up to 4,000 concurrent connections (varies by compute size). Each sync uses up to 16 connections.
- Read-only in Postgres: Synced tables should only be read from Postgres. Writing to them interferes with the sync pipeline.
Cost guidance
Cost formula: [Rows / (Speed × CUs × 3600)] × DLT hourly rate. The bracketed term is the sync runtime in hours.
Example costs (181M rows, 1 CU, $2.80/hr DLT rate):
| Mode | Monthly cost |
|---|---|
| Snapshot (daily) | ~$2,110 |
| Triggered (daily, 5% changes) | ~$1,407 |
| Continuous (10 tables, 1 pipeline) | ~$204/table |
| Continuous (1 table, 1 pipeline) | ~$2,044 |
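The Snapshot and Triggered rows can be reproduced from the formula, assuming one run per day over a 30-day month and the per-CU speeds from the throughput table (the Continuous rows also depend on pipeline sharing and uptime, so they are not recomputed here):

```python
DLT_RATE = 2.80        # $/hr, example rate from the text
CUS = 1
ROWS = 181_000_000     # example table size from the text

def monthly_cost(rows_per_run, speed_rows_per_sec, runs_per_month=30,
                 cus=CUS, rate=DLT_RATE):
    """[Rows / (Speed x CUs x 3600)] x DLT hourly rate, summed over a month."""
    hours_per_run = rows_per_run / (speed_rows_per_sec * cus * 3600)
    return hours_per_run * rate * runs_per_month

snapshot = monthly_cost(ROWS, 2000)         # daily full copy at ~2,000 rows/s
triggered = monthly_cost(ROWS * 0.05, 150)  # 5% of rows change daily, ~150 rows/s

print(f"Snapshot (daily):  ~${snapshot:,.0f}")   # ~$2,112/month
print(f"Triggered (daily): ~${triggered:,.0f}")  # ~$1,408/month
```

The results land within rounding of the ~$2,110 and ~$1,407 figures in the table.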
Troubleshooting
| Issue | Fix |
|---|---|
| CDF not enabled warning | Run ALTER TABLE ... SET TBLPROPERTIES (delta.enableChangeDataFeed = true) on the source |
| Schema not visible in create dialog | Confirm you have USE_SCHEMA and CREATE_TABLE on the target schema |
| Null bytes in string columns | Clean source data: SELECT REPLACE(col, CAST(CHAR(0) AS STRING), '') AS col FROM table |
| Sync failing | Check the pipeline in the synced table's Overview tab for error details |
| FGAC table sync fails | Create a view over the table and sync the view in Snapshot mode |
| Duplicate primary key failure | Add a timeseries_key to deduplicate (latest wins) |
Medallion Architecture from CDC History Tables
Transform CDC history tables produced by Lakehouse Sync into a medallion architecture with bronze, silver, and gold layers using Lakeflow Declarative Pipelines. This turns raw change-data-capture records into clean, business-ready analytics tables in Unity Catalog.
When to use this
- You have Lakehouse Sync CDC history tables (`lb_<table>_history`) in Unity Catalog from a Lakebase operational database
- You want to build a layered data architecture (bronze → silver → gold) on top of operational data
- You need clean current-state views, deduplication, and business aggregations for BI, ML, or Genie analytics
- You want automated, incremental pipeline refreshes instead of manual SQL queries
How the layers map to CDC data
| Layer | Purpose | Source | Output |
|---|---|---|---|
| Bronze | Raw CDC records with full history | Lakehouse Sync lb_<table>_history tables | No transformation needed; these tables already exist |
| Silver | Current state of each record, deduplicated and cleaned | Bronze history tables | One materialized view per entity with latest state only |
| Gold | Business aggregations and domain-specific metrics | Silver tables | Materialized views with aggregations, joins, and KPIs |
1. Scaffold a pipeline project
Use the Databricks CLI to scaffold a Lakeflow Declarative Pipelines project:
databricks bundle init lakeflow-pipelines \
--config-file <(echo '{"project_name": "operational_analytics", "language": "sql", "serverless": "yes"}') \
--profile <PROFILE> < /dev/null
Enter the project directory:
cd operational_analytics
2. Configure the pipeline catalog and schema
Edit resources/operational_analytics.pipeline.yml to target your Unity Catalog schema:
resources:
pipelines:
operational_analytics:
name: operational_analytics
catalog: <CATALOG_NAME>
schema: <SCHEMA_NAME>
development: true
serverless: true
libraries:
- file:
path: src/
The pipeline publishes all datasets to <CATALOG_NAME>.<SCHEMA_NAME> by default.
3. Build the silver layer: current state from CDC
For each entity, create a SQL file in src/ that extracts the latest state from the bronze CDC history table. The silver layer deduplicates by primary key and excludes deleted records.
Create src/silver_<entity>.sql (e.g., src/silver_orders.sql):
CREATE OR REFRESH MATERIALIZED VIEW silver_<entity>
COMMENT "Current state of <entity> records, deduplicated from CDC history"
AS
SELECT * EXCEPT (rn, _change_type, _lsn, _commit_timestamp)
FROM (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY <primary_key>
ORDER BY _lsn DESC
) AS rn
FROM <CATALOG_NAME>.<BRONZE_SCHEMA>.lb_<entity>_history
WHERE _change_type IN ('insert', 'update_postimage', 'delete')
)
WHERE rn = 1
AND _change_type != 'delete'
Replace <primary_key> with the entity's primary key column(s), <CATALOG_NAME>.<BRONZE_SCHEMA> with the catalog and schema where Lakehouse Sync writes the history tables, and <entity> with the table name.
Repeat for each entity you want in the silver layer.
4. Build the gold layer: business aggregations
Gold layer tables are materialized views that aggregate, join, or reshape silver tables for specific analytics use cases.
Create src/gold_<metric>.sql (e.g., src/gold_daily_order_summary.sql):
CREATE OR REFRESH MATERIALIZED VIEW gold_daily_order_summary
COMMENT "Daily order counts and revenue by status"
AS
SELECT
DATE_TRUNC('day', created_at) AS order_date,
status,
COUNT(*) AS order_count,
SUM(total_amount) AS total_revenue
FROM silver_orders
GROUP BY DATE_TRUNC('day', created_at), status
Gold tables read from silver tables within the same pipeline. Use GROUP BY, JOIN, window functions, or any SQL to build the business view you need.
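To make the aggregation concrete, here is the same grouping expressed over a few toy `silver_orders` rows in plain Python (column values invented):

```python
from collections import defaultdict
from datetime import datetime

# Toy silver_orders rows (invented values).
silver_orders = [
    {"created_at": datetime(2024, 5, 1, 9, 30), "status": "paid",    "total_amount": 20.0},
    {"created_at": datetime(2024, 5, 1, 14, 0), "status": "paid",    "total_amount": 30.0},
    {"created_at": datetime(2024, 5, 2, 8, 15), "status": "pending", "total_amount": 10.0},
]

# Equivalent of GROUP BY DATE_TRUNC('day', created_at), status.
summary = defaultdict(lambda: {"order_count": 0, "total_revenue": 0.0})
for row in silver_orders:
    key = (row["created_at"].date(), row["status"])
    summary[key]["order_count"] += 1
    summary[key]["total_revenue"] += row["total_amount"]

for (order_date, status), agg in sorted(summary.items()):
    print(order_date, status, agg["order_count"], agg["total_revenue"])
```

Each `(order_date, status)` key corresponds to one row of `gold_daily_order_summary`.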
5. Add data quality expectations (optional)
Add expectations to silver or gold tables to enforce data quality constraints:
CREATE OR REFRESH MATERIALIZED VIEW silver_<entity> (
CONSTRAINT valid_primary_key EXPECT (<primary_key> IS NOT NULL) ON VIOLATION DROP ROW,
CONSTRAINT valid_timestamp EXPECT (created_at IS NOT NULL) ON VIOLATION DROP ROW
)
COMMENT "Current state of <entity> records with quality enforcement"
AS
SELECT ...
Expectations catch data issues early and can either warn, drop bad rows, or fail the pipeline update.
6. Deploy and run the pipeline
Validate, deploy, and run:
databricks bundle validate --profile <PROFILE>
databricks bundle deploy -t dev --profile <PROFILE>
databricks bundle run operational_analytics -t dev --profile <PROFILE>
Monitor the pipeline in the Databricks UI under Workflows → Pipelines.
7. Schedule ongoing refreshes
Add a job to refresh the pipeline on a schedule. Create resources/operational_analytics_job.job.yml:
resources:
jobs:
operational_analytics_job:
trigger:
periodic:
interval: 1
unit: HOURS
tasks:
- task_key: refresh_pipeline
pipeline_task:
pipeline_id: ${resources.pipelines.operational_analytics.id}
Deploy the schedule:
databricks bundle deploy -t dev --profile <PROFILE>
8. Query the results
Silver and gold tables are standard Unity Catalog tables. Query them from any connected tool:
-- Current state of an entity
SELECT * FROM <CATALOG_NAME>.<SCHEMA_NAME>.silver_orders WHERE customer_id = 12345;
-- Business aggregation
SELECT * FROM <CATALOG_NAME>.<SCHEMA_NAME>.gold_daily_order_summary ORDER BY order_date DESC;
Use these tables as sources for Genie spaces, dashboards, notebooks, or ML pipelines.
What you end up with
- Bronze layer. Lakehouse Sync CDC history tables (already exist, no pipeline needed).
- Silver layer. Deduplicated current-state materialized views per entity.
- Gold layer. Business aggregations and metrics as materialized views.
- Scheduled pipeline. Lakeflow Declarative Pipeline refreshing silver and gold layers incrementally.
- Unity Catalog tables. All layers queryable via SQL, Spark, BI tools, and Genie.
Agent skill recommendations
For implementing each layer, the following Databricks agent skills provide detailed guidance:
| Skill | Use for |
|---|---|
| `databricks-pipelines` | Lakeflow Declarative Pipeline syntax, dataset types, deployment workflow |
| `databricks-core` | CLI authentication, profile management, data exploration |
| `databricks-lakebase` | Lakebase project and branch management, Postgres access |
Troubleshooting
| Issue | Fix |
|---|---|
| Silver table returns no rows | Verify the bronze history table has data: SELECT COUNT(*) FROM lb_<entity>_history |
| `TABLE_OR_VIEW_NOT_FOUND` for bronze table | Use the fully-qualified name: `<CATALOG>.<SCHEMA>.lb_<entity>_history` |
| Gold aggregation includes deleted records | Confirm the silver layer filters _change_type != 'delete' |
| Pipeline fails on deploy | Run databricks bundle validate first to catch config errors |
| Incremental refresh not picking up changes | Verify Lakehouse Sync is active and the bronze table is updating |