Data Warehouse in BigQuery: Dimensions – Part 1


Péter Farkas, Software Developer @Aliz Technologies

In this series of blog posts, I’ll describe the difficulties we encountered while building a data warehouse on BigQuery. This first post is about tracking changes in dimensions.

Definitions

If you are already familiar with terms used in data warehousing, you can skip this part.

  • Dimension: “A dimension is a structure that categorizes facts and measures in order to enable users to answer business questions.” (Wikipedia) Dimensions are data that you usually filter on and group by when you create your queries.
  • Slowly changing dimension (SCD): Some dimensions remain constant (time, for example), while others change over time. The latter are called SCDs, even though “slowly” is relative: they can change daily. An example of a change: a salesperson is reassigned to a different region. In that case, if you want a report on a region’s sales performance, you have to include this person’s data for part of the period, while excluding the data created when they were assigned to a different region.
  • Change management (CM): There are many ways you can represent change in a dimension: new row, new column, overwrite, etc.
  • Type 2 CM: This type of CM creates a record for every version of the dimension, identified either by a version column or by start and end-date columns.
  • Type 4 CM: This type of CM is also called a “history table CM.” There are two tables in this scenario: a “current” table, which contains the latest data of each dimension, and a history table, where all versions reside. This history table is like the Type 2 table, except there is only one date column (start date). If your dimension values can be deleted, add a deleted column as well.

Type 2 change management in a typical SQL environment

In a typical SQL scenario, changing dimension data is pretty straightforward. When a dimension value changes, start a transaction, update the end_date column of the currently effective record, store the new record with the same start_date, and commit the transaction. If the dimension is deleted, simply update the end date of the current record, signifying that there is no dimension after that.

To illustrate this in an example, let’s imagine a dimension of sales people. Each row represents a person: it has a synthetic key (primary key), a unique code (natural key), a name, and a region.
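To make the transaction concrete, here is a hedged sketch in generic ANSI-style SQL; the table layout, column types, and values are illustrative assumptions, and the syntax assumes a transactional engine such as PostgreSQL, not BigQuery:

```sql
-- Illustrative Type 2 dimension table (names and types are assumptions).
CREATE TABLE sales_people (
  id         BIGINT PRIMARY KEY,    -- synthetic key
  code       VARCHAR(16) NOT NULL,  -- natural key
  name       VARCHAR(128),
  region     VARCHAR(64),
  start_date TIMESTAMP NOT NULL,
  end_date   TIMESTAMP              -- NULL while the row is current
);

-- Reassign salesperson 'SP-042' to a new region, atomically.
BEGIN;

-- Close the currently effective version.
UPDATE sales_people
SET end_date = '2010-01-01'
WHERE code = 'SP-042' AND end_date IS NULL;

-- Insert the new version, starting where the old one ended.
INSERT INTO sales_people (id, code, name, region, start_date, end_date)
VALUES (1043, 'SP-042', 'Jane Doe', 'EMEA', '2010-01-01', NULL);

COMMIT;
```

If anything between BEGIN and COMMIT fails, the whole change rolls back, so the table never contains a half-applied version. It is exactly this guarantee that BigQuery does not give us, as described below.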

Problems with BigQuery

While the version column can be implemented just fine, the effective date columns are problematic:

  • There are no transactions in BigQuery. We can’t update a record and insert a new one atomically; if the second operation fails, there is no automatic rollback.
  • Data Manipulation Language (DML) support is very limited. DML was only added to the BigQuery feature list in August 2016, and there are special requirements for it to work:

> It only works with standard SQL queries.

> The table cannot have a REQUIRED field.

> The table cannot have a streaming buffer.

> Only one DML job can run at a time.

As you can see, it is quite restrictive, not to mention the fact that it is still in Beta, which means no SLA, and no guarantee that it won’t be changed or scrapped altogether.

Workaround

Despite its lack of a DML feature, BigQuery is a really great tool. So instead of moving back to a traditional SQL system, I decided to implement versioning relying on BigQuery’s strengths.

Instead of a Type 2 table, this solution is based on the Type 4 history table (with a deleted column). This is the only table we need; there is no current table in this scenario. Rules for filling the table are:

  • New data version: Insert the new data into the table with the correct start_date and deleted = false.
  • Deleted data: Insert a row with the start_date as the delete date and deleted = true. Any required field can be copied from the previous state.
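Under these rules, the workaround needs nothing but inserts. A minimal sketch of both cases (the values are hypothetical; shown as standard SQL INSERT statements for readability, though in practice these rows would typically arrive via load jobs or the streaming API rather than DML):

```sql
-- New version of an existing salesperson: just append a row.
INSERT INTO `sales_people` (id, code, name, region, start_date, deleted)
VALUES (1043, 'SP-042', 'Jane Doe', 'EMEA', TIMESTAMP('2010-01-01'), false);

-- Deletion: append a tombstone row with deleted = true;
-- required fields are copied from the previous state.
INSERT INTO `sales_people` (id, code, name, region, start_date, deleted)
VALUES (1044, 'SP-042', 'Jane Doe', 'EMEA', TIMESTAMP('2011-06-15'), true);
```

Because nothing is ever updated in place, a failed insert leaves the table in a consistent state: the previous version simply remains the latest one.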

In this table, a query to get sales people data as of 2010-01-01 would look like this:

SELECT B.*
FROM (
  SELECT code, MAX(start_date) AS start_date
  FROM `sales_people`
  WHERE start_date <= TIMESTAMP('2010-01-01')
  GROUP BY code
) AS A
JOIN `sales_people` AS B
  ON A.code = B.code AND A.start_date = B.start_date
WHERE B.deleted = false

First, sub-query A finds, for each code, the latest start_date at or before the specified time. We then join these keys back to the full table to retrieve the matching records, and finally filter out rows marked as deleted, leaving all non-deleted records valid at the time provided.
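As a side note, the same point-in-time lookup can be written without the explicit self-join by ranking the versions with an analytic function. A sketch in standard SQL (assuming the same table layout as above), which should return the same rows as the query above:

```sql
SELECT * EXCEPT(rn)
FROM (
  SELECT *,
         -- Rank each code's versions, newest first.
         ROW_NUMBER() OVER (PARTITION BY code
                            ORDER BY start_date DESC) AS rn
  FROM `sales_people`
  WHERE start_date <= TIMESTAMP('2010-01-01')
)
WHERE rn = 1 AND deleted = false  -- keep only the latest non-deleted version
```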

Type 2 effective date view

We can transform our history table into a proper, effective date table using a BigQuery view:

SELECT A.id, A.code, A.name, A.region, A.start_date, MIN(B.end_date) AS end_date
FROM `sales_people` AS A
LEFT JOIN ( SELECT code, start_date AS end_date FROM `sales_people` ) AS B
ON A.code = B.code AND A.start_date < B.end_date
WHERE A.deleted = FALSE
GROUP BY A.id, A.start_date, A.code, A.name, A.region
ORDER BY A.id, A.start_date

This query left joins the table with a query of itself that renames start_date to end_date. The join only matches rows where A’s start_date is before B’s end_date; of all the matches for a given row, MIN selects the earliest end_date, which ensures there is no overlap. Finally, we remove the rows that are marked as deleted. The resulting view is a proper Type 2 table with effective dates.

One problem with this view is that the currently valid row has null as an end date, and BigQuery’s BETWEEN operator doesn’t play well with null values. To solve this, we wrap the end_date column in IFNULL(expr, null_result), substituting a date in the far future for null:

SELECT A.id, A.code, A.name, A.region, A.start_date, IFNULL(MIN(B.end_date), TIMESTAMP('9999-01-01')) AS end_date
FROM `sales_people` AS A
LEFT JOIN ( SELECT code, start_date AS end_date FROM `sales_people` ) AS B
ON A.code = B.code AND A.start_date < B.end_date
WHERE A.deleted = FALSE
GROUP BY A.id, A.start_date, A.code, A.name, A.region
ORDER BY A.id, A.start_date

Summary

I’ve shown you how to implement Type 2 effective date versioning for Dimensions using self-joins in BigQuery. In the next post of this series, I’ll show you another way to do it, and then we’ll go over the performance of these solutions.
