In the previous article, Prescriptive Guidance for Implementing a Data Vault Model on the Databricks Lakehouse Platform, we explained the core concepts of Data Vault and provided guidance on using it on Databricks. Many customers in the field have asked for examples and an easy-to-follow implementation of Data Vault on the Lakehouse.
In this article, we dive deeper into how to implement a Data Vault on the Databricks Lakehouse Platform and provide a working example that loads an EDW Data Vault model in real time using Delta Live Tables.
Here are the high-level topics we will cover in this blog:
- Why Data Vault
- Data Vault in Lakehouse
- Implementing a Data Vault Model in Databricks Lakehouse
- Conclusion
1. Why Data Vault
The main goal of Data Vault is to build a scalable, modern data warehouse. At its core, it uses Hubs, Satellites and Links to model the business world, which enables a data model and architecture that is stable (Hubs) yet flexible (Satellites) and resilient to change. Hubs contain business keys that are unlikely to change unless the core business changes; the associations between Hubs, captured in Links, form the skeleton of the Data Vault model; and Satellites contain the contextual attributes of a Hub and can be created and extended very easily.
Please refer to the diagram below for a high-level design of the Data Vault model, which offers three key benefits by design:
- It enables efficient parallel loading of the enterprise data warehouse, because there are fewer dependencies between the tables of the model; as shown below, the hubs and satellites for customer, product and order can all be loaded in parallel.
- It preserves a single version of the facts in the raw vault, since the methodology recommends insert-only loads and keeps the source metadata in each table.
- New Hubs and Satellites can be added to the model incrementally, enabling faster time to market for data asset delivery.
2. Data Vault in Lakehouse
The Databricks Lakehouse Platform supports the Data Vault model very well; please refer to the diagram below for a high-level architecture of a Data Vault model on the Lakehouse. The robust and scalable Delta Lake storage format enables customers to build a raw vault, where unmodified data is stored, and a business vault, where business rules and transformations are applied if required. Both align with the design described earlier and therefore inherit the benefits of a Data Vault model.
3. Implementing a Data Vault Model in Databricks Lakehouse
Based on the design in the previous section, loading the hub, satellite and link tables is straightforward. All ETL loads can happen in parallel because they don't depend on each other; for example, the customer and product hub tables can be loaded at the same time as each has its own business key, and the customer_product_link table, customer satellite and product satellite can be loaded in parallel as well since all the required attributes are available from the source.
Overall Data Flow
Please refer to the high-level data flow demonstrated in the Delta Live Tables pipeline below. For our example we use the TPC-H data that is commonly used for decision-support benchmarks. The data is loaded into the bronze layer first and stored in Delta format, and then used to populate the Raw Vault objects (e.g. the hubs and satellites of customer and orders). Business Vault objects are built on top of the Raw Vault objects, and data mart objects (e.g. dim_customer, dim_orders, fact_customer_order) are created for reporting and analytics consumption.
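The bronze object raw_customer that the Raw Vault reads from is created in the companion notebook. As a minimal sketch of what that ingestion could look like, the statement below uses Auto Loader in Delta Live Tables SQL; the source path and reader options are placeholders for illustration, not the settings from the original pipeline.
-- Hypothetical sketch: ingest the TPC-H customer files into a bronze streaming table
-- with Auto Loader. Path and reader options are placeholders; mapping the raw columns
-- to names such as c_custkey is omitted for brevity.
CREATE OR REFRESH STREAMING LIVE TABLE raw_customer
COMMENT "Bronze customer data ingested incrementally with Auto Loader (illustrative sketch)"
AS SELECT
*,
current_timestamp AS ingest_ts
FROM cloud_files(
"/databricks-datasets/tpch/data-001/customer/", -- placeholder source path
"csv",
map("header", "false", "sep", "|")
)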
Raw Vault
The Raw Vault is where we store the Hub, Satellite and Link tables, which contain the raw data and maintain a single version of truth. As shown below, we create a view raw_customer_vw based on raw_customer and use the hash expression sha1(UPPER(TRIM(c_custkey))) to generate hash columns that can be used for existence checks or comparison where required.
-- create raw customer view and add hash columns for checking existence or comparison
CREATE STREAMING LIVE VIEW raw_customer_vw
COMMENT "RAW Customer Data View"
AS SELECT
sha1(UPPER(TRIM(c_custkey))) as sha1_hub_custkey,
sha1(concat(UPPER(TRIM(c_name)),UPPER(TRIM(c_address)),UPPER(TRIM(c_phone)),UPPER(TRIM(c_mktsegment)))) as hash_diff,
current_timestamp as load_ts,
"Customer Source" as source,
c_custkey,
c_name,
c_address,
c_nationkey,
c_phone,
c_acctbal,
c_mktsegment,
c_comment
FROM STREAM(LIVE.raw_customer)
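These hash columns drive the downstream loads: sha1_hub_custkey becomes the key of the hub and satellite, and hash_diff condenses the descriptive attributes into a single value. As a hypothetical illustration (not part of the pipeline), a query like the one below could use hash_diff to find customers whose attributes have changed since the satellite was last loaded, assuming the satellite and a staged copy of the incoming data are published to a queryable schema.
-- Hypothetical ad hoc check, not part of the DLT pipeline: list customers whose
-- descriptive attributes changed (same business key, different hash_diff).
-- staged_customer is a placeholder table with the same hash columns as raw_customer_vw.
SELECT
stg.c_custkey,
stg.hash_diff AS new_hash_diff,
sat.hash_diff AS old_hash_diff
FROM staged_customer stg
JOIN sat_customer sat
ON stg.sha1_hub_custkey = sat.sha1_hub_custkey
WHERE stg.hash_diff <> sat.hash_diff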
Once the raw customer view is created, we use it to create the customer hub and customer satellite tables with the code examples below. In Delta Live Tables, you can also easily set up data quality expectations (e.g. CONSTRAINT valid_sha1_hub_custkey EXPECT (sha1_hub_custkey IS NOT NULL) ON VIOLATION DROP ROW) that define how the pipeline handles rows violating the expectation. Here we drop any row that does not have a valid business key; other violation actions are sketched right after the hub table definition below.
-- create hub customer table from the raw customer view
CREATE OR REFRESH STREAMING LIVE TABLE hub_customer(
sha1_hub_custkey STRING NOT NULL,
c_custkey BIGINT NOT NULL,
load_ts TIMESTAMP,
source STRING,
CONSTRAINT valid_sha1_hub_custkey EXPECT (sha1_hub_custkey IS NOT NULL) ON VIOLATION DROP ROW,
CONSTRAINT valid_custkey EXPECT (c_custkey IS NOT NULL) ON VIOLATION DROP ROW
)
COMMENT " HUb CUSTOMER TABLE"
AS SELECT
sha1_hub_custkey,
c_custkey,
load_ts,
source
FROM
STREAM(live.raw_customer_vw)
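Besides ON VIOLATION DROP ROW, Delta Live Tables expectations can also keep violating rows while recording them in the pipeline's data quality metrics (the default when no ON VIOLATION clause is given), or fail the update entirely. The table below is a hypothetical, stricter variant of the hub load (hub_customer_strict is not part of the original pipeline) that shows both alternatives.
-- Hypothetical stricter variant of the hub load, for illustration only.
CREATE OR REFRESH STREAMING LIVE TABLE hub_customer_strict(
sha1_hub_custkey STRING NOT NULL,
c_custkey BIGINT NOT NULL,
load_ts TIMESTAMP,
source STRING,
-- Fail the whole pipeline update if a row arrives without a hash key
CONSTRAINT valid_sha1_hub_custkey EXPECT (sha1_hub_custkey IS NOT NULL) ON VIOLATION FAIL UPDATE,
-- No ON VIOLATION clause: keep the row and only record the violation in metrics
CONSTRAINT valid_custkey EXPECT (c_custkey IS NOT NULL)
)
COMMENT "Hypothetical stricter variant of the customer hub"
AS SELECT
sha1_hub_custkey,
c_custkey,
load_ts,
source
FROM
STREAM(live.raw_customer_vw)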
-- create satellite customer table from raw customer view
CREATE OR REFRESH STREAMING LIVE TABLE sat_customer(
sha1_hub_custkey STRING NOT NULL,
c_name STRING,
c_address STRING,
c_nationkey BIGINT,
c_phone STRING,
c_acctbal DECIMAL(18,2),
c_mktsegment STRING,
hash_diff STRING NOT NULL,
load_ts TIMESTAMP,
source STRING NOT NULL,
CONSTRAINT valid_sha1_hub_custkey EXPECT (sha1_hub_custkey IS NOT NULL) ON VIOLATION DROP ROW
)
COMMENT " SAT CUSTOMER TABLE"
AS SELECT
sha1_hub_custkey,
c_name,
c_address,
c_nationkey,
c_phone,
c_acctbal,
c_mktsegment,
hash_diff,
load_ts,
source
FROM
STREAM(live.raw_customer_vw)
Hubs and Satellites for the other objects are loaded in a similar way. For Link tables, here is an example that populates lnk_customer_orders based on raw_orders_vw.
-- create customer orders table from the raw orders view
CREATE OR REFRESH STREAMING LIVE TABLE lnk_customer_orders
(
sha1_lnk_customer_order_key STRING NOT NULL,
sha1_hub_orderkey STRING,
sha1_hub_custkey STRING,
load_ts TIMESTAMP NOT NULL,
source STRING NOT NULL
)
COMMENT " LNK CUSTOMER ORDERS TABLE "
AS SELECT
sha1_lnk_customer_order_key,
sha1_hub_orderkey,
sha1_hub_custkey,
load_ts,
source
FROM
STREAM(live.raw_orders_vw)
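The raw_orders_vw view referenced above is defined in the companion notebook alongside raw_customer_vw. A minimal sketch of what it could look like is shown below, following the same hashing pattern; the exact column list and hash composition are assumptions for illustration.
-- Hypothetical sketch of the raw orders view, mirroring raw_customer_vw:
-- hub keys hashed from the business keys, and a link key hashed from the
-- combination of the order and customer business keys.
CREATE STREAMING LIVE VIEW raw_orders_vw
COMMENT "RAW Orders Data View (illustrative sketch)"
AS SELECT
sha1(UPPER(TRIM(o_orderkey))) as sha1_hub_orderkey,
sha1(UPPER(TRIM(o_custkey))) as sha1_hub_custkey,
sha1(concat(UPPER(TRIM(o_orderkey)),UPPER(TRIM(o_custkey)))) as sha1_lnk_customer_order_key,
current_timestamp as load_ts,
"Order Source" as source,
o_orderkey,
o_custkey,
o_orderstatus,
o_totalprice,
o_orderdate,
o_orderpriority,
o_clerk,
o_shippriority,
o_comment
FROM STREAM(LIVE.raw_orders)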
Business Vault
Once the hubs, satellites and links are populated in the Raw Vault, Business Vault objects can be built on top of them. This is where additional business or transformation rules are applied to the data objects to make them easier to consume at a later stage. Here is an example of building sat_orders_bv, in which order_priority_tier is added to enrich the orders object in the Business Vault.
-- create satellite order table in business vault from the satellite orders table in raw vault
CREATE OR REFRESH LIVE TABLE sat_orders_bv
(
sha1_hub_orderkey STRING NOT NULL,
o_orderstatus STRING,
o_totalprice DECIMAL(18,2),
o_orderdate DATE,
o_orderpriority STRING,
o_clerk STRING,
o_shippriority INT,
order_priority_tier STRING,
source STRING NOT NULL
)
COMMENT " SAT Order Business Vault TABLE "
AS SELECT
sha1_hub_orderkey AS sha1_hub_orderkey,
o_orderstatus AS o_orderstatus,
o_totalprice AS o_totalprice,
o_orderdate AS o_orderdate,
o_orderpriority AS o_orderpriority,
o_clerk AS o_clerk,
o_shippriority AS o_shippriority,
CASE WHEN o_orderpriority IN ('2-HIGH', '1-URGENT') AND o_totalprice >= 225000 THEN 'Tier-1'
WHEN o_orderpriority IN ('3-MEDIUM', '2-HIGH', '1-URGENT') AND o_totalprice BETWEEN 120000 AND 225000 THEN 'Tier-2'
ELSE 'Tier-3'
END order_priority_tier,
source
FROM
live.sat_orders
Data Mart
Finally, we see customers loading Data Vault point-in-time views and data marts for easy consumption in the last layer, where the main focus is ease of use and good read performance. For most simple tables it is enough to create views on top of the Hubs and Satellites, or you can even load a proper star-schema dimensional model in the final layer. Here is an example that creates a customer dimension as a view, dim_customer, which others can use to simplify their queries.
-- create customer dimension as view in data mart from the hub and satellite customer table, ref nation and ref region table
CREATE LIVE VIEW dim_customer
AS
SELECT
sat.sha1_hub_custkey AS dim_customer_key,
sat.source AS source,
sat.c_name AS c_name ,
sat.c_address AS c_address ,
sat.c_phone AS c_phone ,
sat.c_acctbal AS c_acctbal,
sat.c_mktsegment AS c_mktsegment,
sat.c_nationkey AS c_nationkey,
sat.load_ts AS c_effective_ts,
-- derived
nation.n_name AS nation_name,
region.r_name AS region_name
FROM LIVE.hub_customer hub
INNER JOIN LIVE.sat_customer sat
ON hub.sha1_hub_custkey = sat.sha1_hub_custkey
LEFT OUTER JOIN LIVE.ref_nation nation
ON (sat.c_nationkey = nation.n_nationkey)
LEFT OUTER JOIN LIVE.ref_region region
ON (nation.n_regionkey = region.r_regionkey)
One common issue with Data Vault is that queries can end up with too many joins, especially for a complex query or a fact that requires attributes from many tables. The recommendation from Databricks is to pre-join the tables and store calculated metrics where required so they don't have to be rebuilt on the fly again and again. Here is an example that creates a fact table, fact_customer_order, from multiple joins and stores it as a table for repeatable queries from business users.
-- create fact customer order table in data mart from the lnk_customer_orders, dim_order, dim_customer, ref_nation and ref_region
CREATE OR REFRESH LIVE TABLE fact_customer_order
AS
SELECT
dim_customer.dim_customer_key,
dim_orders.dim_order_key,
nation.n_nationkey AS dim_nation_key,
region.r_regionkey AS dim_region_key,
dim_orders.o_totalprice AS total_price,
dim_orders.o_orderdate AS order_date
FROM LIVE.lnk_customer_orders lnk
INNER JOIN LIVE.dim_orders dim_orders
ON lnk.sha1_hub_orderkey = dim_orders.dim_order_key
INNER JOIN LIVE.dim_customer dim_customer
ON lnk.sha1_hub_custkey = dim_customer.dim_customer_key
LEFT OUTER JOIN LIVE.ref_nation nation
ON dim_customer.c_nationkey = nation.n_nationkey
LEFT OUTER JOIN LIVE.ref_region region
ON nation.n_regionkey = region.r_regionkey
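The dim_orders view joined above is built in the companion notebook in the same way as dim_customer. A minimal sketch is shown below, assuming it exposes the hub hash key as the dimension key and reads its attributes from the Business Vault satellite; hub_orders and the exact column list are assumptions for illustration.
-- Hypothetical sketch of the orders dimension, mirroring dim_customer.
CREATE LIVE VIEW dim_orders
AS
SELECT
sat.sha1_hub_orderkey AS dim_order_key,
sat.o_orderstatus AS o_orderstatus,
sat.o_totalprice AS o_totalprice,
sat.o_orderdate AS o_orderdate,
sat.o_orderpriority AS o_orderpriority,
sat.o_clerk AS o_clerk,
sat.o_shippriority AS o_shippriority,
sat.order_priority_tier AS order_priority_tier,
sat.source AS source
FROM LIVE.hub_orders hub
INNER JOIN LIVE.sat_orders_bv sat
ON hub.sha1_hub_orderkey = sat.sha1_hub_orderkey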
Delta Live Tables Pipeline Setup
All of the code above can be found here. Customers can easily orchestrate the whole data flow with a Delta Live Tables pipeline; the configuration below is how I set up the pipeline in my environment. See DLT Configuration for more details on how to set up a Delta Live Tables pipeline in your workflow if required.
{
"id": "6835c6ad-42a2-498d-9037-25c9d990b380",
"clusters": [
{
"label": "default",
"autoscale": {
"min_workers": 1,
"max_workers": 5,
"mode": "ENHANCED"
}
}
],
"development": true,
"continuous": false,
"channel": "CURRENT",
"edition": "ADVANCED",
"photon": false,
"libraries": [
{
"notebook": {
"path": "/Repos/prod/databricks-lakehouse/lakehouse-buildout/data-vault/TPC-DLT-Data-Vault-2.0"
}
}
],
"name": "DLT Data Vault",
"storage": "dbfs:/pipelines/6835c6ad-42a2-498d-9037-25c9d990b380",
"configuration": {
"pipelines.enzyme.mode": "advanced",
"pipelines.enzyme.enabled": "true"
},
"target": "leo_lakehouse"
}
4. Conclusion
In this blog, we learned about core Data Vault modeling concepts and how to implement them using Delta Live Tables. The Databricks Lakehouse Platform supports various modeling methods in a reliable, efficient and scalable way, while Databricks SQL - our serverless data warehouse - allows you to run all your BI and SQL applications on the Lakehouse. To see all of the above examples in a complete workflow, please take a look at this example.
Please also check out our related blogs:
- Five Simple Steps for Implementing a Star Schema in Databricks With Delta Lake
- Data Modeling Best Practices & Implementation on a Modern Lakehouse
- What's a Dimensional Model and How to Implement It on the Databricks Lakehouse Platform