Reading Time: 4 Minutes ReadAs we prepare for our workshop A Big Data Adventure in Google Cloud Platform at Crunch Data Engineering and Analytics Conference, we have started to collect some tips and tricks when it comes to using Google’s Big Data platform. In our previous blog post, we presented some advice about controlling your BigQuery (BQ) costs. Now I will share some methods that can help to boost the performance, making your work with BQ more effective. Read this blog post and find out how to boost your BigQuery Performance.
Divide and ConquerDivide datasets by frequently used filtering attributes. BQ is a columnar database without indexes so it always does a full table scan on the required columns. If you break down your dataset into smaller parts, BQ has to read fewer data. This leads to faster and cheaper queries. Let’s assume that you have a dataset which contains data about people. For example name or birthplace. You also have a use-case where you are only interested in the names and birthplaces of people who were born between a specific date range. Let’s say between 2001.09.24 and 2001.09.25. This illustration below shows in green how much data BQ needs to read to answer this question if you have a monolith table and if you have wisely divided your data. The difference can be significant in favor of the divided table.
There are two methods of dividing your dataset in BQ.The first approach is called sharding, i.e., you insert the data into multiple tables with the same schema divided by one of the attributes. In this example it means five shards: