Google BigQuery, Google’s data warehouse solution, has many functions and capabilities. In Part 1 of a series of posts, we look at the definition of analytic functions, how they differ from aggregate functions, and the various uses they have in everyday scenarios.
First of all, let’s define our example table, Table 1, on which all queries in this post and their results are based.
Furthermore, all queries in this post are written in the BigQuery Standard SQL dialect.
Aggregate vs Analytic function
Those familiar with traditional SQL databases already know how aggregate functions work, but for the sake of completeness, I’ll include it here.
An aggregate function is a function that computes a single aggregate value over a group of rows. The syntax of this function is simple enough:
SELECT AVG(Price) FROM `Orders`
Each column that is not aggregated has to be present in the GROUP BY clause:
SELECT StoreId, AVG(Price) FROM `Orders` GROUP BY StoreId
In contrast, an analytic function is a function that computes an aggregate value for each row over a group of rows. The syntax of this function is as follows:
SELECT AVG(Price) OVER () FROM `Orders`
Because grouping is done through the analytic function’s own condition block, the GROUP BY clause is not necessary.
PARTITION BY is arguably the most important clause of an analytic function. Similar to the aggregate function’s GROUP BY clause, it limits the number of rows the function computes merely on a subset of all rows. While all non-aggregate columns must be included in a GROUP BY clause, there is no such restriction here.
SELECT *, ABS(Price - AVG(Price) OVER (PARTITION BY StoreId)) AS PriceDifferenceFromAverage FROM `Orders`
This query computes the difference between the actual price and the store’s average price. Since we’re using an analytic function, we can include the OrderId and Date columns in our results without grouping by them. If we wanted to get the same result using aggregate functions, the query would look like this:
WITH averages AS ( SELECT StoreId, AVG(Price) AS AvgPrice FROM `Orders` GROUP BY StoreId ) SELECT o.*, ABS(o.Price - a.AvgPrice) AS PriceDifferenceFromAverage FROM `Orders` o JOIN averages a USING (StoreId)
Next, we have the ORDER BY clause. Similar to the ORDER BY at the (near) end of a query, its purpose is to order each partition prior to evaluating the analytic function. While functions such as SUM or AVG do not benefit from an ordered partition, there are certain functions where it is a must.
One such function is LAG. This function has one argument (a column name), and its return value is the value of the column from the previous row.
WITH previous_order_date AS ( SELECT *, LAG(Date) OVER ( PARTITION BY StoreId ORDER BY Date, OrderId ) AS PreviousDate FROM `Orders`) SELECT * EXCEPT (PreviousDate) FROM previous_order_date WHERE PreviousDate IS NULL OR Date != PreviousDate
This query returns the IDs of all orders that were processed first on any given date for each store. (We assume the Order with a lesser ID came in before an order with a greater ID.) The same problem can be solved with aggregate functions, but this method (again) joins a subset of the table with the table itself, which is much less performant than the analytic function we used.
WITH ids AS ( SELECT MIN(OrderId) AS OrderId FROM `Orders` GROUP BY StoreId, Date ) SELECT * FROM `Orders` JOIN ids USING (OrderId)
BigQuery’s analytic functions are powerful tools that can reduce difficult and expensive-to-compute queries to more simple solutions. Though we looked at the definition and semantics of analytic functions, we have yet to cover windowing and advanced use cases. Stay tuned for Part 2 of this blog series.