Analytic functions in Google BigQuery – Part 1: Basics

Analytic functions in Google BigQuery - Part 1: Basics
Analytic functions in Google BigQuery - Part 1: Basics

Google BigQuery, Google’s data warehouse solution, has many functions and capabilities. In Part 1 of a series of posts, we look at the definition of analytic functions, how they differ from aggregate functions, and the various uses they have in everyday scenarios.

Examples

First of all, let’s define our example table, Table 1, on which all queries in this post and their results are based.

Table: Orders

OrderId StoreId Price Date
1 1 23 2019-01-07
2 2 73 2019-01-07
3 1 123 2019-01-07
4 2 173 2019-01-07
5 1 45 2019-01-08
6 2 11 2019-01-08
7 1 145 2019-01-08
8 2 111 2019-01-08

Furthermore, all queries in this post are written in the BigQuery Standard SQL dialect.

Aggregate vs Analytic function

Those familiar with traditional SQL databases already know how aggregate functions work, but for the sake of completeness, I’ll include it here.

An aggregate function is a function that computes a single aggregate value over a group of rows. The syntax of this function is simple enough:

SELECT AVG(Price) FROM `Orders`

Each column that is not aggregated has to be present in the GROUP BY clause:

SELECT StoreId, AVG(Price) FROM `Orders` GROUP BY StoreId

In contrast, an analytic function is a function that computes an aggregate value for each row over a group of rows. The syntax of this function is as follows:

SELECT AVG(Price) OVER () FROM `Orders`

Because grouping is done through the analytic function’s own condition block, the GROUP BY clause is not necessary.

Clauses

PARTITION BY

PARTITION BY  is arguably the most important clause of an analytic function. Similar to the aggregate function’s GROUP BY clause, it limits the number of rows the function computes merely on a subset of all rows. While all non-aggregate columns must be included in a GROUP BY clause, there is no such restriction here.

Example 1

SELECT
  *,
  ABS(Price - AVG(Price) OVER (PARTITION BY StoreId)) AS PriceDifferenceFromAverage
FROM `Orders`
OrderId StoreId Price Date PDFA
1 1 23 2019-01-07 61
2 2 73 2019-01-07 19
3 1 123 2019-01-07 39
4 2 173 2019-01-07 81
5 1 45 2019-01-08 39
6 2 11 2019-01-08 81
7 1 145 2019-01-08 61
8 2 111 2019-01-08 19

This query computes the difference between the actual price and the store’s average price. Since we’re using an analytic function, we can include the OrderId and Date columns in our results without grouping by them. If we wanted to get the same result using aggregate functions, the query would look like this:

WITH averages AS (
  SELECT StoreId, AVG(Price) AS AvgPrice
  FROM `Orders`
  GROUP BY StoreId
)

SELECT
  o.*,
  ABS(o.Price - a.AvgPrice) AS PriceDifferenceFromAverage
FROM `Orders` o
JOIN averages a
USING (StoreId)

ORDER BY

Next, we have the ORDER BY clause. Similar to the ORDER BY at the (near) end of a query, its purpose is to order each partition prior to evaluating the analytic function. While functions such as SUM or AVG do not benefit from an ordered partition, there are certain functions where it is a must.

One such function is LAG. This function has one argument (a column name), and its return value is the value of the column from the previous row.

Example 2

WITH previous_order_date AS (
  SELECT
    *,
    LAG(Date) OVER (
      PARTITION BY StoreId
      ORDER BY Date, OrderId
    ) AS PreviousDate
  FROM `Orders`)

SELECT * EXCEPT (PreviousDate) FROM previous_order_date
WHERE PreviousDate IS NULL OR Date != PreviousDate
OrderId StoreId Price Date
1 1 23 2019-01-07
2 2 73 2019-01-07
5 1 45 2019-01-08
6 2 11 2019-01-08

This query returns the IDs of all orders that were processed first on any given date for each store. (We assume the Order with a lesser ID came in before an order with a greater ID.) The same problem can be solved with aggregate functions, but this method (again) joins a subset of the table with the table itself, which is much less performant than the analytic function we used.

WITH ids AS (
  SELECT MIN(OrderId) AS OrderId
  FROM `Orders`
  GROUP BY StoreId, Date
)

SELECT * FROM `Orders`
JOIN ids
USING (OrderId)

Conclusion

BigQuery’s analytic functions are powerful tools that can reduce difficult and expensive-to-compute queries to more simple solutions. Though we looked at the definition and semantics of analytic functions, we have yet to cover windowing and advanced use cases. Stay tuned for Part 2 of this blog series.

Interested in more about Analytic?