How to boost BigQuery performance

0
14

How to boost BigQuery performance

Csaba Kassai — Software Architect and Google Big Data expert @Aliz Technologies

As we prepare for our workshops, we started to collect some tips and tricks when it comes to using Google’s Big Data platform. In our previous blog post we presented some advice about controlling your BigQuery (BQ) costs. Now I will share some methods that can help to boost the performance, making your work with BQ more effective.

Divide and Conquer

Divide datasets by frequently used filtering attributes. BQ is a columnar database without indexes so it always does a full table scan on the required columns. If you break down your dataset into smaller parts, BQ has to read less data, which leads to faster and cheaper queries.

Let’s assume that you have a dataset which contains data about people, like name, birth place, etc. You also have a use-case where you are only interested in the names and birthplaces of people who were born between a specific date range — let’s say between 2001.09.24 and 2001.09.25.

This illustration below shows in green how much data BQ needs to read to answer this question if you have a monolith table and if you have wisely divided your data. The difference can be significant in favor of the divided table.

There are two methods of dividing your dataset in BQ.

The first approach is called sharding, i.e., you insert the data into multiple tables with the same schema divided by one of the attributes. In this example it means five shards:

  • person_20010923
  • person_20010924
  • person_20010925
  • person_20010926
  • person_20010927

Using this technique, you need to query the union of the appropriate tables:

SELECT name, birth_place

FROM person_20010924, person_20010925

When you have more than a few shards (and usually you do), typing the name of all the necessary shards in the query editor can be time-consuming. Fortunately, BQ provides wildcard tables. By applying them, our query looks like this:

SELECT name, birth_place

FROM person_*

WHERE _TABLE_SUFFIX BETWEEN ‘20010924’ AND ‘20010925’

The other way is to create tables partitioned in the traditional way. The main difference between the two methods is that in the case of partitioning, you have just one table and only logical partitions, so you don’t need to use the wildcard functions. The following query will get us the required results from a partitioned table called “person”:

SELECT name, birth_place

FROM person

WHERE _PARTITIONTIME BETWEEN TIMESTAMP(‘2001–09–24’) AND TIMESTAMP(‘2001–09–25’)

The disadvantage of this technique is that currently partitioning is only possible by date.

Denormalize

When it comes to database schema design, most of us were trained to use normalized schemas. The main reasons to normalize are to avoid consistency issues caused by data duplication, and to save storage space. But in the case of BQ, things work a little bit differently. First of all, these are no longer problems, since in practice data are often immutable and storage is quite cheap. So it’s usually wise to pre-join datasets into homogeneous tables. With this denormalization, you exchange compute resources for storage resources with storage being more performant and cost-effective. To easily handle these denormalized schemas, BQ provides nested, repeated data structures.

Reuse and share

This last section is not about query performance but about your performance: how you can use BQ more effectively.

It happens often that I need some information from BQ, but then I remember that I have already run the same or a similar query in the past. As an engineer, I want to reuse rather than re-create, so I start looking.

The first place to look is Query History on the BQ UI in upper left corner, although it has limitations both in terms of the number of stored queries and storage time. Also it can be frustrating when you need to search in hundreds of queries and, when you finally find the one you need, you feel that rewriting it would have been faster.

The Save Query function can help resolve these weaknesses. By clicking on the button you can name your query to help you find it later. Also saved queries won’t be automatically deleted.

You can not only reuse your query, but you can also reuse the results of a query. It’s possible to export a query result into another table or even into a spreadsheet, as you can download it in CSV and JSON formats. These options can be found above the panel which contains the results of the query.

It would be possible to fill several pages with performance and usage tips, but using these couple of best practices and features will help you achieve a significant performance boost.

View story at Medium.com

Leave a Reply