If you are already familiar with terms used in data warehousing, you can skip this part.
Tracking Changes In Dimensions- Type 2 change management in a typical SQL environment
In a typical SQL scenario, changing dimension data is pretty straightforward. When a dimension value changes, start a transaction, update the end_date column of the currently effective record, store the new record with the same start_date, and commit the transaction. If the dimension is deleted, simply update the end date of the current record, signifying that there is no dimension after that.
To illustrate this in an example, let’s imagine a dimension of salespeople. Each row represents a person: it has a synthetic key (primary key), a unique code (natural key), a name, and a region.
While the version column can be implemented just fine, the effective date columns are problematic:
Moreover, there are special requirements for it to work.
As you can see, it is quite restrictive, not to mention the fact that it is still in Beta, which means no SLA, and no guarantee that it won’t be changed or scrapped altogether.
Despite its lack of a DML feature, BigQuery is a really great tool. So instead of moving back to a traditional SQL system, I decided to implement versioning relying on BigQuery’s strengths.
Instead of a Type 2 table, this solution is based on the Type 4 history table (with a deleted column). This is the only table we need; there is no current table in this scenario. Rules for filling the table are:
In this table, a query to get salespeople data at 2010–01–01 would look like this:
First, we search for the records valid at the time (start_date is the latest value before the specified time, sub-query A) which we then join to all records in the table that do not signify a deleted record to get all non-deleted records valid at the time provided.
We can transform our history table into a proper, effective date table using a BigQuery view:
This query left joins the table with a query of itself that renames the start_date to end_date. The join only allows rows where the start_date is before the end_date. Of all these rows, we select the one with the earliest end_date. This ensures there is no overlap. Next, we remove the rows that are marked as deleted. The resulting view is a proper Type 2 table with effective dates.
One problem with this view is that the currently valid row has null as an end date, and the BigQuery BETWEEN operator doesn’t like null values. To solve this problem, we wrap the end_date column with an IFNULL(expr, null_result) function and a date in the far future:
I’ve shown you how to implement Type 2 effective date versioning for Dimensions using self-joins in BigQuery. In the next post of this series, I’ll show you another way to do it, and then we’ll go over the performance of these solutions.