Data Warehouse in BigQuery — Tracking Changes In Dimensions


Peter Farkas | March 8, 2017

5 Minutes Read

Tracking Changes In Dimensions - Definitions

If you are already familiar with terms used in data warehousing, you can skip this part.

  • Dimension: “A dimension is a structure that categorizes facts and measures in order to enable users to answer business questions.” (Wikipedia) Dimensions are data that you usually filter on and group by when you create your queries.
  • Slowly changing dimension (SCD): Some dimensions remain constant (like time, for example), while others change over time. The latter are called SCDs even though they could change daily. An example of a change: a salesperson is re-assigned to a different region. In that case, if you want a report on a region’s sales performance, you have to consider this person for part of the time, while excluding the data created when they were assigned to a different region.
  • Change management (CM): There are many ways you can represent a change in a dimension: new row, a new column, overwrite, etc.
  • Type 2 CM: This type of CM creates a record for every version of the dimension, identified either by a version column or by start and end-date columns.
  • Type 4 CM: This type of CM is also called a “history table CM.” There are two tables in this scenario: a “current” table, which contains the latest data on each dimension, and a history table, where all versions reside. This history table is like the Type 2 table, except there is only one date column (start date). If your dimension values can be deleted, add a deleted column as well.
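
To make the Type 2 and Type 4 structures concrete, here is a minimal schema sketch in generic SQL (the table and column names are illustrative assumptions, modeled on the salespeople example used later in this post):

  -- Type 2: every version of a row carries its own validity interval
  -- (a version number column could be used instead of the two dates)
  CREATE TABLE salesperson_dim (
    id         INTEGER,      -- synthetic key
    code       VARCHAR(10),  -- natural key
    name       VARCHAR(100),
    region     VARCHAR(50),
    start_date DATE,         -- when this version becomes effective
    end_date   DATE          -- when it stops being effective (NULL = current)
  );

  -- Type 4: a history table; each version only records when it started,
  -- and a flag marks versions that represent a deletion
  -- (a full Type 4 design also keeps a separate "current" table)
  CREATE TABLE salesperson_history (
    id         INTEGER,
    code       VARCHAR(10),
    name       VARCHAR(100),
    region     VARCHAR(50),
    start_date DATE,
    deleted    BOOLEAN
  );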

Tracking Changes In Dimensions - Type 2 change management in a typical SQL environment

In a typical SQL environment, changing dimension data is pretty straightforward. When a dimension value changes, start a transaction, update the end_date column of the currently effective record to the date of the change, insert the new record with that same date as its start_date, and commit the transaction. If the dimension value is deleted, simply update the end date of the current record without inserting a new one, signifying that there is no valid version after that date.

To illustrate this in an example, let’s imagine a dimension of salespeople. Each row represents a person: it has a synthetic key (primary key), a unique code (natural key), a name, and a region.

[Images: the example salespeople dimension table]
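
In a traditional SQL database, the change is one short transaction, along the lines of this sketch (the table name salesperson_dim, the column names, and all values are illustrative assumptions based on the example above; the synthetic key is omitted for brevity):

  -- Re-assign salesperson 'S042' to region 'West', effective 2016-05-01
  BEGIN TRANSACTION;

  -- close the currently effective version
  UPDATE salesperson_dim
     SET end_date = DATE '2016-05-01'
   WHERE code = 'S042'
     AND end_date IS NULL;

  -- open the new version, starting where the old one ended
  INSERT INTO salesperson_dim (code, name, region, start_date, end_date)
  VALUES ('S042', 'Jane Doe', 'West', DATE '2016-05-01', NULL);

  COMMIT;

If either statement fails, the transaction rolls back and the previous state is restored, which is exactly the guarantee BigQuery does not give us.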

Data Warehouse in BigQuery - Problems with BigQuery

While the version column can be implemented just fine, the effective date columns are problematic:

  • There are no transactions in BigQuery. We can't update a record and insert a new one atomically; if the second operation fails, there is no automatic rollback.
  • Data Manipulation Language (DML) support is very new: it was only added to the BigQuery feature list in August 2016.

Moreover, there are special requirements for DML to work:

  • It only works with standard SQL queries (not legacy SQL).
  • The table cannot have a REQUIRED field.
  • The table cannot have a streaming buffer.
  • Only one DML job can run at a time.

As you can see, it is quite restrictive, not to mention the fact that it is still in Beta, which means no SLA, and no guarantee that it won’t be changed or scrapped altogether.

Data Warehouse in BigQuery - Workaround

Despite these DML limitations, BigQuery is a really great tool. So instead of moving back to a traditional SQL system, I decided to implement versioning relying on BigQuery's strengths.

Instead of a Type 2 table, this solution is based on the Type 4 history table (with a deleted column). This is the only table we need; there is no current table in this scenario. Rules for filling the table are:

  • New data version: Insert the new data into the table with the correct start_date and deleted = false.
  • Deleted data: Insert a row with the deletion date as its start_date and deleted = true. Any required field can be copied from the previous state. (Both cases are sketched below.)
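
The two rules translate to appended rows along these lines. The table path, column names, and values are illustrative assumptions; the rows are shown as INSERT statements for readability, but in practice they would be appended with a load job or a streaming insert:

  -- New version: salesperson 'S042' moves to region 'West' on 2016-05-01
  INSERT INTO `my_project.dwh.salesperson_history` (code, name, region, start_date, deleted)
  VALUES ('S042', 'Jane Doe', 'West', DATE '2016-05-01', false);

  -- Deletion: salesperson 'S042' leaves on 2017-02-01; required fields are
  -- copied from the previous state, only start_date and deleted change
  INSERT INTO `my_project.dwh.salesperson_history` (code, name, region, start_date, deleted)
  VALUES ('S042', 'Jane Doe', 'West', DATE '2017-02-01', true);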

[Image: the example salespeople history table]

In this table, a query to get salespeople data at 2010-01-01 would look like this:
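
The sketch below is in BigQuery standard SQL; the table path my_project.dwh.salesperson_history is an assumption, and start_date is assumed to be a DATE column:

  -- all salespeople as they were on 2010-01-01
  SELECT t.code, t.name, t.region, t.start_date
  FROM `my_project.dwh.salesperson_history` AS t
  JOIN (
    -- sub-query A: for each salesperson, the latest version
    -- starting on or before the requested date
    SELECT code, MAX(start_date) AS start_date
    FROM `my_project.dwh.salesperson_history`
    WHERE start_date <= DATE '2010-01-01'
    GROUP BY code
  ) AS a
    ON t.code = a.code AND t.start_date = a.start_date
  WHERE t.deleted = false   -- drop salespeople whose latest version is a deletion
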
First, sub-query A finds, for each salesperson, the version valid at the given time: the latest start_date on or before the specified date. We then join this back to the table and keep only the rows that are not marked as deleted, which gives us every non-deleted record valid at the time provided.

Tracking Changes In Dimensions - Type 2 effective date view

We can transform our history table into a proper effective-date table using a BigQuery view:
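
A sketch of the SELECT such a view could be defined on (same assumed table path and columns as above; code identifies the salesperson across versions):

  SELECT
    t.code,
    t.name,
    t.region,
    t.start_date,
    MIN(nxt.end_date) AS end_date   -- start of the earliest later version; NULL for the current one
  FROM `my_project.dwh.salesperson_history` AS t
  LEFT JOIN (
    -- the table joined to itself, with start_date renamed to end_date
    SELECT code, start_date AS end_date
    FROM `my_project.dwh.salesperson_history`
  ) AS nxt
    ON t.code = nxt.code AND t.start_date < nxt.end_date
  WHERE t.deleted = false           -- drop the versions that mark a deletion
  GROUP BY t.code, t.name, t.region, t.start_date
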
This query left joins the table to a copy of itself in which start_date is renamed to end_date. The join pairs each row with every later version of the same salesperson (same code, start_date before end_date); of those, we keep the earliest end_date, so the validity intervals never overlap. Finally, we remove the rows that are marked as deleted. The resulting view is a proper Type 2 table with effective dates.

One problem with this view is that the currently valid row has null as its end date, and the BigQuery BETWEEN operator doesn't play well with null values. To solve this, we wrap the end_date column in an IFNULL(expr, null_result) call that substitutes a date in the far future for the null:
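
The same sketch with the open end date replaced (the far-future date 9999-12-31 is an arbitrary choice):

  SELECT
    t.code,
    t.name,
    t.region,
    t.start_date,
    IFNULL(MIN(nxt.end_date), DATE '9999-12-31') AS end_date
  FROM `my_project.dwh.salesperson_history` AS t
  LEFT JOIN (
    SELECT code, start_date AS end_date
    FROM `my_project.dwh.salesperson_history`
  ) AS nxt
    ON t.code = nxt.code AND t.start_date < nxt.end_date
  WHERE t.deleted = false
  GROUP BY t.code, t.name, t.region, t.start_date

With this in place, a filter such as WHERE DATE '2010-01-01' BETWEEN start_date AND end_date also matches the currently valid rows.
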
Summary

I’ve shown you how to implement Type 2 effective date versioning for Dimensions using self-joins in BigQuery. In the next post of this series, I’ll show you another way to do it, and then we’ll go over the performance of these solutions.
