Peter Farkas | April 7, 2022
3 Minutes Read
Preamble
I have tested the BigQuery materialized views against the documentation. While most of the functionality and limitations are accurate, there are a few gotchas you need to be aware of.
Setup
I’ll use these tables and views for the rest of the post:
Example BQ schema:
Example view:
Data generation
In addition to static data, I also tested the view with constant streaming data into the base table. This is super simple using a Google Dataflow template. This template allows you to generate large streaming buffers quickly (10k rows/sec if needed). I recommend it, if you need to test something with a large sample of streaming data. (The target can be BigQuery, Pub/Sub, or GCS.)
The schema used for the Dataflow pipeline:
Findings
Query rewrite works some of the time. This feature means that if you query the base table, and the result can be computed (at least partially) based on the materialized view, the BigQuery engine will rewrite the query to use it. If you want to take advantage of this feature, be aware of the following:
The query (or part of) has to be a “subset” of the materialized view. Meaning you can only filter on columns available to the view; you also have to do the same (or more) aggregations as the view (if any). As an example, the first query was always rewritten; while the second was picked up only about 50% of the time:
The only difference between these queries is the location of the WHERE clause. In the first, I shifted it to the full table scan portion, while the JOIN table can be swapped with the materialized view directly. In the second example, the filter is on the whole joined table; the engine gets confused about this, and scans the table for both parts of the query.
Materialized views are refreshed within 5 minutes of a table change, but only once every 30 minutes. This can be changed with DDL to between 1 minute and 7 days. For a constantly streaming use case, it might be prudent to tune the refresh down (to between 5 and 10 minutes). This, of course, depends on the amount of data ingested, the query complexity, and the pricing.
Conclusion
Where should you use materialized views? Unfortunately, not as a standard “cache this query result” for any use case because of its limitations: both the reduced SQL functions it can accept, as well as the limit on the number of materialized views a table can have (up to 20 inside a single dataset; query rewrite only works inside the dataset). The following scenarios are worth considering:
AutoML: An Introduction To Get You Started
2020-01-29
4 Minutes Read
AutoML is an exciting new trend in the Machine Learning (ML) industry. It revolves around analyzing data automatically and getting meaningful insights with minimum effort. By using the ingested data it is also capable of building models which can later be used as predictors for new data points.
Employee well-being initiatives: Creating an engaged workforce
2019-05-08
5 Minutes Read
In my previous blog post, I shared how important it is for us to provide a flexible and healthy working environment for our employees. In addition to having an open policy on home office, we feel that as an employer, it’s our responsibility to help our team maintain their physical and mental health.
Part 1: Customer lifetime value estimation via probabilistic modeling
2022-02-16
15 Minutes Read
Customer lifetime value (CLV) is the total worth of a customer to a company over the length of their relationship. In practice, this "worth" can be defined as revenue, profit, or any metric of an analyst's choosing.