Google BigQuery materialized view test drive

Peter Farkas | April 7, 2022

3 Minutes Read

Preamble

I tested BigQuery materialized views against the documentation. While the documented functionality and limitations are mostly accurate, there are a few gotchas you need to be aware of.

Setup

I’ll use these tables and views for the rest of the post:

Example BQ schema:
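The exact schema doesn't matter much; a small events table along these lines is enough to follow the rest of the post (the dataset, table, and column names, such as mv_test.events and user_id, are illustrative):

    CREATE TABLE mv_test.events (
      user_id    STRING,
      event_type STRING,
      amount     FLOAT64,
      created_at TIMESTAMP
    );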

Example view:
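A simple aggregating materialized view over that table (again, the names are illustrative):

    CREATE MATERIALIZED VIEW mv_test.events_by_user AS
    SELECT
      user_id,
      COUNT(*)        AS event_count,
      SUM(amount)     AS total_amount,
      MAX(created_at) AS last_event_at
    FROM mv_test.events
    GROUP BY user_id;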

Data generation

In addition to static data, I also tested the view with data streaming constantly into the base table. This is super simple using a Google Dataflow template, which lets you build up large streaming buffers quickly (10k rows/sec if needed). I recommend it if you need to test something with a large sample of streaming data. (The target can be BigQuery, Pub/Sub, or GCS.)

The schema used for the Dataflow pipeline:
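A generator schema of roughly this shape, matching the illustrative columns above, does the job. The placeholders follow the json-data-generator syntax the template consumes; the exact function names and the date format may need adjusting for your template version:

    {
      "user_id": "{{uuid()}}",
      "event_type": "{{random("click", "view", "purchase")}}",
      "amount": {{double(1, 500)}},
      "created_at": "{{date()}}"
    }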

Findings

  • Materialized views cannot be created from the UI or the API, only through DDL
  • When querying materialized views, BigQuery also runs the query on the “partition” of the base table that has not yet been cached, so you will always get an accurate result.
    • For this reason, only aggregate functions whose “partial” results can be combined are supported (e.g. max, sum, avg, but not rank)
    • For the same reason, analytic functions are out
  • Query rewrite works some of the time. This feature means that if you query the base table, and the result can be computed (at least partially) based on the materialized view, the BigQuery engine will rewrite the query to use it. If you want to take advantage of this feature, be aware of the following:

    • The query (or a part of it) has to be a “subset” of the materialized view. This means you can only filter on columns available in the view, and you have to apply the same (or further) aggregation as the view does (if any). As an example, the first query below was always rewritten, while the second was picked up only about 50% of the time.
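    Sketched with the illustrative tables from the setup above, the two queries looked roughly like this:

      -- Query 1: the filter is pushed into the raw-scan side; the aggregating
      -- subquery matches the materialized view, so it gets rewritten.
      SELECT e.user_id, e.event_type, a.total_amount
      FROM (
        SELECT user_id, event_type
        FROM mv_test.events
        WHERE created_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
      ) AS e
      JOIN (
        SELECT user_id, SUM(amount) AS total_amount
        FROM mv_test.events
        GROUP BY user_id
      ) AS a USING (user_id);

      -- Query 2: the same logic, but the filter sits on the joined result;
      -- this one was only rewritten about half the time.
      SELECT e.user_id, e.event_type, a.total_amount
      FROM mv_test.events AS e
      JOIN (
        SELECT user_id, SUM(amount) AS total_amount
        FROM mv_test.events
        GROUP BY user_id
      ) AS a USING (user_id)
      WHERE e.created_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY);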

    The only difference between these queries is the location of the WHERE clause. In the first, the filter is pushed down into the full-table-scan portion, so the aggregating subquery can be swapped for the materialized view directly. In the second, the filter is applied to the whole joined result; this confuses the engine, and it scans the base table for both parts of the query.

  • Materialized views are refreshed within 5 minutes of a change to the base table, but at most once every 30 minutes by default. The refresh interval can be changed with DDL to anywhere between 1 minute and 7 days. For a constantly streaming use case, it might be prudent to tune the refresh interval down (to between 5 and 10 minutes). This, of course, depends on the amount of data ingested, the query complexity, and the pricing.
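    For instance, bringing the illustrative view's refresh interval down to 10 minutes:

      ALTER MATERIALIZED VIEW mv_test.events_by_user
      SET OPTIONS (
        enable_refresh = true,
        refresh_interval_minutes = 10
      );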

Conclusion

Where should you use materialized views? Unfortunately, not as a general-purpose “cache this query result” tool, because of their limitations: the reduced set of SQL functions they accept, and the limit on the number of materialized views a table can have (up to 20, inside a single dataset; query rewrite also only works within that dataset). The following scenarios are worth considering:

  • Caching a compatible query that has to be performant.
    • Unfortunately, a type 2 history view is incompatible because it uses analytic functions.
  • Caching “base” or common aggregations for BI uses. For example, if a Data Studio chart has the option of displaying the data with three different aggregations, it might be worth creating three materialized views. Any filters or sorting applied to the BI charts are then run on these views instead of the full table. This can be a powerful tool for both cost savings and increased performance, provided that the materialized view is much smaller than the base table.
    • You can also set clustering on a materialized view, so you can create views for common aggregation/sorting scenarios for even larger gains.
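    For example, a clustered variant of the illustrative view, grouped by user and event type and clustered on event_type:

      CREATE MATERIALIZED VIEW mv_test.events_by_user_type
      CLUSTER BY event_type
      AS
      SELECT user_id, event_type, SUM(amount) AS total_amount
      FROM mv_test.events
      GROUP BY user_id, event_type;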
