OUR LATEST NEWS

How to Boost your BigQuery Performance

hero

Csaba Kassai | November 17, 2016

4 Minutes Read

As we prepare for our workshop A Big Data Adventure in Google Cloud Platform at Crunch Data Engineering and Analytics Conference, we have started to collect some tips and tricks when it comes to using Google’s Big Data platform. In our previous blog post, we presented some advice about controlling your BigQuery (BQ) costs. Now I will share some methods that can help to boost the performance, making your work with BigQuery more effective. Read this blog post and find out how to boost BigQuery Performance.

Divide and Conquer to boost BigQuery performance

Divide datasets by frequently used filtering attributes. BigQuery is a columnar database without indexes so it always does a full table scan on the required columns. If you break down your dataset into smaller parts, BQ has to read fewer data. This leads to faster and cheaper queries.

Let’s assume that you have a dataset which contains data about people. For example, name or birthplace. You also have a use-case where you are only interested in the names and birthplaces of people who were born between a specific date range. Let’s say between 2001.09.24 and 2001.09.25.

This illustration below shows in green how much data BigQuery needs to read to answer this question if you have a monolith table and if you have wisely divided your data. The difference can be significant in favor of the divided table.

BigQuery needs to answer a question

There are two methods of dividing your dataset in BigQuery.

The first approach is called sharding, i.e., you insert the data into multiple tables with the same schema divided by one of the attributes. In this example it means five shards:

Using this technique, you need to query the union of the appropriate tables:

BigQuery  appropriate tables

When you have more than a few shards (and usually you do), typing the name of all the necessary shards in the query editor can be time-consuming. Fortunately, BigQuery provides wildcard tables. By applying them, our query looks like this:

BigQuery wildcard  tables

The other way is to create tables partitioned in the traditional way. The main difference between the two methods is that in the case of partitioning, you only have one table and logical partitions. Due to this, you don’t need to use the wildcard functions. The following query will get us the required results from a partitioned table called “person”:

Bigquery partitioned table

The disadvantage of this technique is that currently partitioning is only possible by date.

Boost Bigquery performance: Denormalize

When it comes to database schema design, most of us were trained to use normalized schemas. The main reasons to normalize are to avoid consistency issues caused by data duplication and to save storage space. But in the case of BigQuery, things work a little bit differently. First of all, data are often immutable and storage is quite cheap. So it’s usually wise to pre-join datasets into homogeneous tables. With this denormalization, you exchange compute resources for storage resources with storage being more performant and cost-effective. To easily handle these denormalized schemas, BigQuery provides nested, repeated data structures.

Reuse and share

This last section is not about how to boost BigQuery performance but about your performance: how you can use BigQuery more effectively.

It often happens that I need some information from BigQuery. Then I realize that I have already run the same or a similar query in the past. As an engineer, I want to reuse rather than re-create, so I start looking.

BigQuery reuse

The first place to look is Query History on the BigQuery UI in the upper left corner, although it has limitations both in terms of the number of stored queries and storage time. Also, it can be frustrating when you need to search in hundreds of queries. Then, when you finally find the one you need, you realize you could have been much faster by simply rewriting it.

The Save Query function can help resolve these weaknesses. By clicking on the button you can name your query to help you find it later. Also saved queries won’t be automatically deleted.

Reuse BigQuery

You can not only reuse your query, but you can also reuse the results of a query. It’s possible to export a query result into another table or even into a spreadsheet. You can download it in CSV and JSON formats. These options can be found above the panel which contains the results of the query.Reuse BigQuery-results

It would be possible to fill several pages with performance and usage tips. However, using these couple of best practices and features will help you achieve a significant performance boost.

 If you want to know more, come to our workshop A Big Data Adventure in Google Cloud Platform, at Crunch Data Engineering and Analytics Conference.

How to Power Banking Services with Google Cloud

clock

2022-06-23

4 Minutes Read

Here’s why your company should be using the Google Cloud features to power banking services and how it makes things easy for financial service organizations.

Google Cloud Infrastructure Modernization - Stay Agile With An Open Architecture

clock

2022-02-17

3 Minutes Read

Learn more about how Google Cloud infrastructure modernization solutions can help your business to become more competitive.

Google BigQuery materialized view test drive

clock

2022-04-07

3 Minutes Read

I have tested the BigQuery materialized views against the documentation. While most of the functionality and limitations are accurate, there are a few gotchas you need to be aware of.

hero
logo

Ready for the future? Let’s talk!

Reach out, and let’s take your business to the next level.

image

By clicking submit below, you consent to allow Aliz.ai to store and process the personal information submitted above and share information about our products and services, as well as other content that may be of interest to you. For more information, please review our Privacy Policy. You may unsubscribe at any time. Your data will not be passed on to third parties.

I agree to receive other communications from Aliz.ai.

badge

New opportunities with cloud solutions!

Aliz is a proud Google Cloud Partner with specializations in Infrastructure, Data Analytics, Cloud Migration and Machine Learning. We deliver data analytics, machine learning, and infrastructure solutions, off the shelf, or custom-built on GCP using an agile, holistic approach.

logo

© Copyright 2022 Aliz Tech Kft.

logologologologo