Dimensional modeling is a time-tested approach to building analytics-ready data warehouses. While many organizations are shifting to modern platforms like Databricks, these foundational techniques still apply.
In Part 1, we designed our dimensional schema. In Part 2, we built ETL pipelines for dimension tables. Now in Part 3, we implement the ETL logic for fact tables, emphasizing efficiency and integrity.
In the first blog, we defined the fact table, FactInternetSales, as shown below. Compared to our dimension tables, the fact table is relatively narrow in terms of record length, containing only foreign key references to our dimension tables, our fact measures, our degenerate dimension fields, and a single metadata field:
NOTE: In the example below, we’ve altered the CREATE TABLE statement from our first post to include the foreign key definitions instead of defining these in separate ALTER TABLE statements. We’ve also included a primary key constraint on the degenerate dimension fields to make their role in this fact table more explicit.
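A minimal sketch of what that revised definition might look like, wrapped in spark.sql() so every example in this post stays in a single language. The column, constraint, and dimension table names are assumptions following the AdventureWorks-style schema used throughout this series, and several measures and foreign keys are abbreviated; note that primary and foreign key constraints on Databricks are informational rather than enforced.

```python
spark.sql("""
  CREATE TABLE IF NOT EXISTS FactInternetSales (
    -- foreign key references to our dimension tables
    OrderDateKey INT NOT NULL,
    DueDateKey INT NOT NULL,
    ShipDateKey INT NOT NULL,
    CustomerKey BIGINT NOT NULL,
    ProductKey BIGINT NOT NULL,
    -- degenerate dimension fields
    SalesOrderNumber STRING NOT NULL,
    SalesOrderLineNumber INT NOT NULL,
    -- fact measures (abbreviated for readability)
    OrderQuantity INT,
    UnitPrice DECIMAL(19,4),
    SalesAmount DECIMAL(19,4),
    -- metadata field supporting incremental extracts
    LastModifiedDateTime TIMESTAMP,
    -- primary key on the degenerate dimension fields
    CONSTRAINT pk_factinternetsales PRIMARY KEY (SalesOrderNumber, SalesOrderLineNumber),
    -- foreign keys defined here rather than in separate ALTER TABLE statements
    CONSTRAINT fk_orderdate FOREIGN KEY (OrderDateKey) REFERENCES DimDate (DateKey),
    CONSTRAINT fk_duedate   FOREIGN KEY (DueDateKey)   REFERENCES DimDate (DateKey),
    CONSTRAINT fk_shipdate  FOREIGN KEY (ShipDateKey)  REFERENCES DimDate (DateKey),
    CONSTRAINT fk_customer  FOREIGN KEY (CustomerKey)  REFERENCES DimCustomer (CustomerKey),
    CONSTRAINT fk_product   FOREIGN KEY (ProductKey)   REFERENCES DimProduct (ProductKey)
  )
""")
```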
The table definition is fairly straightforward, but it’s worth taking a moment to discuss the LastModifiedDateTime metadata field. While fact tables are relatively narrow in terms of field count, they tend to be very deep in terms of row count. Fact tables often house millions, if not billions, of records derived from high-volume operational activities. Instead of attempting to reload the table with a full extract on each ETL cycle, we will typically limit our efforts to new records and those that have been changed.
Depending on the source system and its underlying infrastructure, there are many ways to identify which operational records need to be extracted with a given ETL cycle. Change data capture (CDC) capabilities implemented on the operational side are the most reliable mechanisms. But when these are unavailable, we often fall back to timestamps recorded with each transaction record as it is created and modified. The approach is not bulletproof for change detection, but as any experienced ETL developer will attest, it’s often the best we’ve got.
NOTE: The introduction of Lakeflow Connect provides an interesting option for performing change data capture on relational databases. This capability is in preview at the time of this writing. Still, as the capability matures and expands to support more and more RDBMSs, we expect this to provide an effective and efficient mechanism for incremental extracts.
In our fact table, the LastModifiedDateTime field captures such a timestamp value recorded in the operational system. Before extracting data from our operational system, we will review the fact table to identify the latest value for this field we’ve recorded. That value will be the starting point for our incremental (aka delta) extract.
The high-level workflow for our fact ETL will proceed as follows:

1. Determine the latest LastModifiedDateTime value recorded in the fact table.
2. Extract new and modified records from the operational system into our staging schema.
3. Insert any late-arriving members found in the staged data into the relevant dimension tables.
4. Look up the surrogate key values for the business keys in the staged data.
5. Publish the staged records to the fact table, updating or inserting records as needed.
To make this workflow easier to digest, we’ll describe its key phases in the following sections. Unlike in the post on dimension ETL, we will implement our logic for this workflow using a combination of SQL and Python, choosing whichever language makes each step most straightforward to implement. Again, one of the strengths of the Databricks Platform is its support for multiple languages. Instead of treating language selection as an all-or-nothing choice made at the top of an implementation, we will show how data engineers can quickly pivot between the two within a single implementation.
Our workflow's first two steps focus on extracting new and newly updated information from our operational system. In the first step, we do a simple lookup of the latest recorded value for LastModifiedDateTime. If the fact table is empty, as it should be upon initialization, we define a default value that’s far enough back in time that we believe it will capture all the relevant data in the source system:
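A sketch of how that lookup might be expressed, assuming the FactInternetSales table defined above; the fallback literal is an assumption, and any value far enough back in time to capture all relevant source data will do:

```python
# find the latest change timestamp we've already loaded; fall back to an
# early default when the fact table is empty
last_modified = spark.sql("""
    SELECT
      COALESCE(MAX(LastModifiedDateTime), CAST('1970-01-01' AS TIMESTAMP)) AS last_modified
    FROM FactInternetSales
""").first()["last_modified"]
```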
We can now extract the required data from our operational system using that value. While this query includes quite a bit of detail, focus your attention on the WHERE clause, where we employ the last observed timestamp value from the previous step to retrieve the individual line items that are new or modified (or associated with sales orders that are new or modified):
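A hedged sketch of the extract and its landing in the staging schema follows. The source table names (here, a federated view of the operational order header and detail tables), the staging table name, and the column aliases are illustrative assumptions; the WHERE clause is the important part, limiting the pull to line items that are new or modified, or that belong to orders that are new or modified, since the last observed timestamp.

```python
spark.sql(f"""
    CREATE OR REPLACE TABLE staging.internet_sales AS
    SELECT
      h.SalesOrderNumber,
      d.SalesOrderLineNumber,
      CAST(h.OrderDate AS DATE)  AS OrderDateAlternateKey,
      CAST(h.DueDate AS DATE)    AS DueDateAlternateKey,
      CAST(h.ShipDate AS DATE)   AS ShipDateAlternateKey,
      h.CustomerID               AS CustomerAlternateKey,
      d.ProductID                AS ProductAlternateKey,
      -- additional business keys and measures follow the same pattern
      d.OrderQty                 AS OrderQuantity,
      d.UnitPrice,
      d.LineTotal                AS SalesAmount,
      GREATEST(h.ModifiedDate, d.ModifiedDate) AS LastModifiedDateTime
    FROM source.salesorderheader h
    INNER JOIN source.salesorderdetail d
      ON h.SalesOrderID = d.SalesOrderID
    WHERE h.ModifiedDate > '{last_modified}'
       OR d.ModifiedDate > '{last_modified}'
""")
```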
As before, the extracted data is persisted to a table in our staging schema, only accessible to our data engineers, before proceeding to subsequent steps in the workflow. If we have any additional data cleansing to perform, we should do so now.
The typical sequence in a data warehouse ETL cycle is running our dimension ETL workflows and then our fact workflows shortly after. By organizing our processes this way, we can better ensure all the information required to connect our fact records to dimension data will be in place. However, there is a narrow window within which new dimension-oriented data can arrive in the source and be referenced by a fact-relevant transactional record before our dimension ETL has had a chance to process it. That window increases should we have a failure in the overall ETL cycle that delays fact data extraction. And, of course, there can always be referential failures in source systems that allow questionable data to appear in a transactional record.
To insulate ourselves from this problem, we will insert into a given dimension table any business key values found in our staged fact data but not in the set of current (unexpired) records for that dimension. This approach will create a record with a business (natural) key and a surrogate key that our fact table can reference. These records will be flagged as late arriving if the targeted dimension is a Type-2 SCD so that we can update appropriately on the next ETL cycle.
To get us started, we will compile a list of key business fields in our staging data. Here, we are exploiting strict naming conventions that allow us to identify these fields dynamically:
NOTE: We are switching to Python for the following code examples. Databricks supports the use of multiple languages, even within the same workflow. In this example, Python gives us a bit more flexibility while still aligning with SQL concepts, making this approach accessible to more traditional SQL developers.
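Under the naming convention assumed in these examples, business key columns in the staging table end in AlternateKey, and date-oriented keys end in DateAlternateKey; the discovery might then look like this:

```python
# load the staged fact data and split its business key columns into
# date-oriented keys and all other business keys
staged = spark.table("staging.internet_sales")

date_keys  = [c for c in staged.columns if c.endswith("DateAlternateKey")]
other_keys = [c for c in staged.columns if c.endswith("AlternateKey") and c not in date_keys]

print(date_keys)   # e.g. ['OrderDateAlternateKey', 'DueDateAlternateKey', 'ShipDateAlternateKey']
print(other_keys)  # e.g. ['CustomerAlternateKey', 'ProductAlternateKey']
```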
Notice that we have separated our date keys from the other business keys. We’ll return to those in a bit, but for now, let’s focus on the non-date (other) keys in this table.
For each non-date business key, we can use our field and table naming conventions to identify the dimension table that should hold that key and then perform a left anti-join (similar to a NOT IN() comparison but supporting multi-column matching if needed) to identify any values for that column in the staging table but not in the dimension table. When we find an unmatched value, we simply insert it into the dimension table with the appropriate setting for the IsLateArriving field:
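A sketch of that check, continuing from the snippet above. The dimension table names are derived from the assumed naming convention (e.g., CustomerAlternateKey maps to DimCustomer), and the IsCurrent and IsLateArriving flag columns, as well as the identity-based surrogate key, are assumptions about how the dimensions from Part 2 are structured.

```python
from pyspark.sql import functions as F

for key_col in other_keys:

    dim_table = "Dim" + key_col.replace("AlternateKey", "")   # e.g. DimCustomer

    # business key values present in staging but absent from the current
    # (unexpired) records of the dimension
    late_arrivals = (
        staged
          .select(key_col)
          .where(F.col(key_col).isNotNull())
          .distinct()
          .join(
              spark.table(dim_table).where("IsCurrent = 1").select(key_col),
              on=key_col,
              how="left_anti",
          )
    )
    late_arrivals.createOrReplaceTempView("late_arrivals")

    # insert placeholder members, flagged so they can be revisited on the next
    # ETL cycle; the surrogate key is assumed to be an identity column that is
    # populated automatically on insert
    spark.sql(f"""
        INSERT INTO {dim_table} ({key_col}, IsCurrent, IsLateArriving)
        SELECT {key_col}, TRUE, TRUE FROM late_arrivals
    """)
```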
This logic would work fine for our date dimension references if we only wanted to ensure our fact records linked to valid entries. However, many downstream BI systems implement logic that requires the date dimension to house a continuous, uninterrupted series of dates between the earliest and latest values recorded. Should we encounter a date before or after the range of values in the table, we need not only to insert the missing member but also to create any additional entries required to preserve an unbroken range. For that reason, we need slightly different logic for any late-arriving dates:
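A sketch of that date handling, continuing from the snippets above and assuming DimDate carries one row per calendar date, a FullDateAlternateKey date column, a smart surrogate key (DateKey) of the form yyyyMMdd, and the same IsLateArriving flag used elsewhere:

```python
from pyspark.sql import functions as F

# earliest and latest dates referenced by the staged fact records, across all
# date-oriented business keys
lo, hi = (
    staged
      .select(F.explode(F.array(*[F.col(c) for c in date_keys])).alias("d"))
      .agg(F.min("d"), F.max("d"))
      .first()
)

# widen that range by whatever the dimension already holds so the series
# remains unbroken
dim_lo, dim_hi = (
    spark.table("DimDate")
      .agg(F.min("FullDateAlternateKey"), F.max("FullDateAlternateKey"))
      .first()
)
start = min(d for d in (lo, dim_lo) if d is not None)
end   = max(d for d in (hi, dim_hi) if d is not None)

# sequence() builds the unbroken series of dates, explode() turns the array
# into rows, and the NOT IN filter keeps only the dates still missing
spark.sql(f"""
    INSERT INTO DimDate (DateKey, FullDateAlternateKey, IsLateArriving)
    SELECT
      CAST(date_format(d, 'yyyyMMdd') AS INT) AS DateKey,  -- smart key derived from the date
      d                                       AS FullDateAlternateKey,
      TRUE                                    AS IsLateArriving
    FROM (SELECT explode(sequence(DATE'{start}', DATE'{end}', interval 1 day)) AS d) AS dates
    WHERE d NOT IN (SELECT FullDateAlternateKey FROM DimDate)
""")
```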
If you have not worked much with Databricks or Spark SQL, the query at the heart of this last step is likely unfamiliar. The sequence() function builds an array of values between a specified start and stop. We can then explode that array (using the explode() function) so that each element forms a row in a result set. From there, we simply compare the required range to what’s in the dimension table to identify which elements need to be inserted. Because the surrogate key in this dimension is implemented as a smart key derived from the date itself, that insertion also ensures our fact records will have something to reference.
Now that we can be confident that all business keys in our staging table can be matched to records in their corresponding dimensions, we can proceed with the publication to the fact table.
The first step in this process is to look up the foreign key values for these business keys. This can be done as part of a single publication step, but the large number of joins in such a query often makes that approach challenging to maintain. For this reason, we might take the less efficient but easier-to-comprehend-and-modify approach of looking up foreign key values one business key at a time and appending those values to our staging data:
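A sketch of those lookups, continuing from the snippets above and relying on the same assumed naming conventions. Each non-date business key resolves against the current record of its owning dimension, while the date keys all resolve against the role-playing DimDate by matching on the calendar date.

```python
staged_with_keys = spark.table("staging.internet_sales")

# non-date business keys: join to the current record of the owning dimension
for key_col in other_keys:
    dim_table = "Dim" + key_col.replace("AlternateKey", "")       # e.g. DimCustomer
    surrogate = key_col.replace("AlternateKey", "Key")            # e.g. CustomerKey
    dim = (
        spark.table(dim_table)
          .where("IsCurrent = 1")
          .select(key_col, surrogate)
    )
    staged_with_keys = staged_with_keys.join(dim, on=key_col, how="left")

# date business keys: the surrogate is the smart key held by DimDate
dim_date = spark.table("DimDate").select("FullDateAlternateKey", "DateKey")
for key_col in date_keys:
    surrogate = key_col.replace("AlternateKey", "Key")            # e.g. OrderDateKey
    staged_with_keys = staged_with_keys.join(
        dim_date.withColumnRenamed("FullDateAlternateKey", key_col)
                .withColumnRenamed("DateKey", surrogate),
        on=key_col,
        how="left",
    )
```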
Again, we are exploiting naming conventions to make this logic more straightforward to implement. Because our date dimension is a role-playing dimension and therefore follows a more variable naming convention, we implement slightly different logic for those business keys.
At this point, our staging table houses business keys and surrogate key values along with our measures, degenerate dimension fields, and the LastModifiedDateTime value extracted from our source system. To make publication more manageable, we should align the available fields with those supported by the fact table. To do that, we need to drop the business keys:
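A short sketch of that alignment, under the same assumptions as the snippets above:

```python
# drop the business key columns so that what remains matches the fact table:
# surrogate keys, degenerate dimension fields, measures and LastModifiedDateTime
fact_ready = staged_with_keys.drop(*date_keys, *other_keys)
fact_ready.createOrReplaceTempView("fact_ready")
```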
NOTE: The source dataframe is defined in the previous code block.
With the fields aligned, the publication step is straightforward. We match our incoming records to those in the fact table based on the degenerate dimension fields, which serve as a unique identifier for our fact records, and then update or insert values as needed:
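A sketch of that publication step, assuming the aligned fact_ready view from the previous snippet now carries exactly the fact table’s columns:

```python
# merge keyed on the degenerate dimension fields, updating existing line items
# and inserting new ones
spark.sql("""
    MERGE INTO FactInternetSales AS tgt
    USING fact_ready AS src
      ON  tgt.SalesOrderNumber     = src.SalesOrderNumber
      AND tgt.SalesOrderLineNumber = src.SalesOrderLineNumber
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```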
We hope this blog series has been informative to those seeking to build dimensional models on the Databricks Platform. We expect that many experienced with this data modeling approach and the ETL workflows associated with it will find Databricks familiar, accessible and capable of supporting long-established patterns with minimal changes compared to what may have been implemented on RDBMS platforms. Where changes emerge, such as the ability to implement workflow logic using a combination of Python and SQL, we hope that data engineers will find this makes their work more straightforward to implement and support over time.
To learn more about Databricks SQL, visit our website or read the documentation. You can also check out the product tour for Databricks SQL. If you want to migrate your existing warehouse to a high-performance, serverless data warehouse with a great user experience and lower total cost, Databricks SQL is the solution, and you can try it for free.