
Google BigQuery materialized view test drive

Published on
April 7, 2022
Author
Péter Farkas
Software Architect

Preamble

I tested Google BigQuery materialized views against the documentation. While most of the documented functionality and limitations are accurate, there are a few gotchas you need to be aware of.

Setup

I’ll use these tables and views for the rest of the post:

Example BQ schema:
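A minimal base table consistent with the rest of the post might look like this; all table and column names here are hypothetical stand-ins, not the original schema:

```sql
-- Hypothetical base table (names and types are illustrative only)
CREATE TABLE `my_project.my_dataset.events` (
  event_id   STRING,
  event_type STRING,
  amount     NUMERIC,
  created_at TIMESTAMP
)
PARTITION BY DATE(created_at);
```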

Example view:
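A materialized view over such a base table could be declared like this (table and column names are hypothetical); note that only aggregates whose partial results can be combined are allowed:

```sql
-- Hypothetical materialized view over the base table
CREATE MATERIALIZED VIEW `my_project.my_dataset.events_by_type` AS
SELECT
  event_type,
  COUNT(*)    AS event_count,
  SUM(amount) AS total_amount
FROM `my_project.my_dataset.events`
GROUP BY event_type;
```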

Data generation

In addition to static data, I also tested the view with data streaming constantly into the base table. This is super simple using a Google Dataflow template, which lets you generate large streaming buffers quickly (10k rows/sec if needed). I recommend it if you need to test something with a large sample of streaming data. (The target can be Google BigQuery, Pub/Sub, or GCS.)

The schema used for the Dataflow pipeline:
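Assuming the Streaming Data Generator template, the message schema is a JSON document with generator placeholders; a sketch along these lines could work, though the field names and the exact placeholder functions are assumptions:

```json
{
  "event_id": "{{uuid()}}",
  "event_type": "{{random(\"view\", \"click\", \"purchase\")}}",
  "amount": {{integer(1, 100)}}
}
```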

Findings

  • Materialized views cannot be created from the UI or the API, only through DDL statements.
  • When querying materialized views, Google BigQuery also runs the query on the “partition” of the base table that has not yet been cached, so you will always get an accurate result.
  • For this reason, only aggregate functions whose partial results can be combined are usable (e.g. MAX, SUM, AVG, but not RANK).
  • For the same reason, analytic functions are out
  • Query rewrite works some of the time. This feature means that if you query the base table and the result can be computed (at least partially) from the materialized view, the Google BigQuery engine rewrites the query to use it. If you want to take advantage of this feature, be aware of the following:
  • The query (or part of it) has to be a “subset” of the materialized view: you can only filter on columns available to the view, and you have to do the same (or more) aggregations as the view (if any). As an example, the first query was always rewritten, while the second was picked up only about 50% of the time:
  • The only difference between these queries is the location of the WHERE clause. In the first, I shifted the filter into the full-table-scan portion, so the JOIN table can be swapped with the materialized view directly. In the second, the filter is on the whole joined table; the engine gets confused by this and scans the base table for both parts of the query.
  • Materialized views are refreshed within 5 minutes of a change to the base table, but at most once every 30 minutes by default. This can be changed with DDL to anywhere between 1 minute and 7 days. For a constantly streaming use case, it might be prudent to tune the refresh down (to between 5 and 10 minutes). This, of course, depends on the amount of data ingested, the query complexity, and the pricing.
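The two queries compared above can be sketched as a hypothetical pair; the dimension table, column names, and filter value are assumptions and the original queries are not reproduced, but the difference in WHERE placement mirrors the description:

```sql
-- Variant 1: filter inside the aggregating subquery, so the aggregation
-- maps directly onto the materialized view; this shape was always rewritten.
SELECT d.label, t.total_amount
FROM (
  SELECT event_type, SUM(amount) AS total_amount
  FROM `my_project.my_dataset.events`
  WHERE event_type = 'purchase'
  GROUP BY event_type
) AS t
JOIN `my_project.my_dataset.event_labels` AS d
  ON d.event_type = t.event_type;

-- Variant 2: the same filter applied to the joined result; this shape
-- was picked up by the rewrite only about half the time.
SELECT d.label, t.total_amount
FROM (
  SELECT event_type, SUM(amount) AS total_amount
  FROM `my_project.my_dataset.events`
  GROUP BY event_type
) AS t
JOIN `my_project.my_dataset.event_labels` AS d
  ON d.event_type = t.event_type
WHERE t.event_type = 'purchase';
```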
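The refresh cadence is controlled through DDL options; for example, tightening it for a streaming use case could look like this (the view name is hypothetical):

```sql
ALTER MATERIALIZED VIEW `my_project.my_dataset.events_by_type`
SET OPTIONS (
  enable_refresh = true,        -- keep automatic refresh enabled
  refresh_interval_minutes = 5  -- down from the 30-minute default
);
```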

Conclusion

Where should you use materialized views? Unfortunately, not as a general “cache this query result” mechanism, because of their limitations: the reduced set of SQL functions they accept, and the cap on the number of materialized views per table (up to 20 within a single dataset; query rewrite also works only within the dataset). The following scenarios are worth considering:

  • Caching a compatible query that has to be performant.
  • Unfortunately, a type 2 history view is incompatible because it uses analytic functions.
  • Caching “base” or common aggregations for BI uses. For example, if a Data Studio chart has the option of displaying the data with three different aggregations, it might be worth creating three materialized views. Any filters or sorting applied to the BI charts are then run on these views instead of the full table. This can be a powerful tool for both cost savings and increased performance, provided that the materialized view is much smaller than the base table.
  • You can set the clustering of a materialized view, so you can create views for common aggregation/sorting scenarios for even larger gains.
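A clustered materialized view for such a scenario could be declared as follows; the names and the clustering column are hypothetical:

```sql
-- Cluster the view by the column most BI filters hit
CREATE MATERIALIZED VIEW `my_project.my_dataset.daily_totals`
CLUSTER BY event_type
AS
SELECT
  DATE(created_at) AS day,
  event_type,
  SUM(amount) AS total_amount
FROM `my_project.my_dataset.events`
GROUP BY day, event_type;
```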