Transitioning from Apache Hive to Google BigQuery often involves rethinking how we handle custom aggregations, particularly User-Defined Aggregate Functions (UDAFs). This journey, detailed in my previous posts 'How to migrate Hive UDFs, UDTFs, and UDAFs to BigQuery' and 'Step-by-Step Guide to Migrating Hive Custom Functions to BigQuery SQL UDFs', has taken us through the intricacies of transforming basic Hive functions into their BigQuery counterparts. Building on that foundation, this post delves deeper, focusing on how to migrate a sample Hive UDAF that calculates the median to BigQuery, showcasing how complex aggregations can be adapted to BigQuery’s advanced SQL capabilities.
Before we dive into the specifics, let's quickly remind ourselves what a UDAF is. A User-Defined Aggregate Function (UDAF) is a special type of function: unlike regular functions (UDFs), which accept a single row as input and emit a single row as output, a UDAF operates across multiple rows to aggregate data and return a single result. Common examples include calculating averages, sums, or more complex statistical measures across a set of rows. You can find a summary of the different kinds of custom functions in Hive here.
In Hive, a custom UDAF might be used to calculate the median of a set of values. Just to refresh, the median is a measure of central tendency that gives the middle value of a sorted list of numbers. If the list has an even number of observations, the median is the average of the two middle numbers.
Here’s how you might approach this in Hive with Java for floating point numbers:
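A simplified sketch using Hive's classic UDAF and UDAFEvaluator interface might look like the following; the class and method names (such as UDAFMedian) are illustrative, not the exact code from the linked class:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

public final class UDAFMedian extends UDAF {

  public static class MedianEvaluator implements UDAFEvaluator {

    // Aggregation buffer: every value seen so far by this task.
    private List<Double> values;

    public MedianEvaluator() {
      init();
    }

    // Reset the aggregation buffer.
    @Override
    public void init() {
      values = new ArrayList<>();
    }

    // Called once per input row.
    public boolean iterate(Double value) {
      if (value != null) {
        values.add(value);
      }
      return true;
    }

    // Emit the partial aggregation state for this task.
    public List<Double> terminatePartial() {
      return values;
    }

    // Merge a partial aggregation produced by another task.
    public boolean merge(List<Double> other) {
      if (other != null) {
        values.addAll(other);
      }
      return true;
    }

    // Compute the final median across all collected values.
    public Double terminate() {
      if (values.isEmpty()) {
        return null;
      }
      Collections.sort(values);
      int n = values.size();
      return (n % 2 == 1)
          ? values.get(n / 2)
          : (values.get(n / 2 - 1) + values.get(n / 2)) / 2.0;
    }
  }
}
```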
You can find the full Java class here. (Disclaimer: I did not test this Hive function; it is just for demonstration purposes.)
Reflecting on our previously outlined decision tree, the first step after establishing the necessity of the function is to search for an equivalent native function in BigQuery. Fortunately, in our scenario, BigQuery offers a convenient built-in function that suits our needs in most cases. We can leverage the APPROX_QUANTILES function to approximate the median, as demonstrated below:
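For example, with a hypothetical sample of nine values whose true median is 7:

```sql
-- Hypothetical sample data: nine values, so the exact median is the 5th value (7).
WITH sample_data AS (
  SELECT x
  FROM UNNEST([1, 3, 5, 6, 7, 9, 10, 11, 12]) AS x
)
SELECT
  -- APPROX_QUANTILES(x, 2) returns [approx. min, approx. median, approx. max].
  APPROX_QUANTILES(x, 2)[OFFSET(1)] AS approx_median
FROM sample_data;
```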
The result of this query is 7, which aligns perfectly with our expectations. But let's consider what happens when we apply this function to an even number of rows:
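Extending the hypothetical sample to ten values, so that the two middle values are 7 and 8:

```sql
-- Hypothetical sample data: ten values, the two middle ones being 7 and 8.
WITH sample_data AS (
  SELECT x
  FROM UNNEST([1, 3, 5, 6, 7, 8, 10, 11, 12, 13]) AS x
)
SELECT APPROX_QUANTILES(x, 2)[OFFSET(1)] AS approx_median
FROM sample_data;
```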
Interestingly, the result remains 7. According to our initial definition, for an even number of elements the median should be the average of the two middle numbers, which in this case would be (7 + 8) / 2 = 7.5. It’s important to note that, as the function's name suggests, APPROX_QUANTILES provides an approximation, not a guarantee of exact accuracy.
These nuances – the approximation and the handling of even-numbered datasets – might pose limitations depending on the precision required for your use case. If these constraints are significant for your application, you will need a different approach.
While the APPROX_QUANTILES function is useful for approximating the median, certain scenarios demand exact precision. To address this need, we can employ a combination of BigQuery’s ARRAY_AGG function and a custom UDF. This method ensures that we accurately calculate the median, even for datasets with an even number of rows.
Firstly, we use ARRAY_AGG to collect all the values into an array, ordering them to prepare for the median selection:
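Using the same hypothetical ten-value sample as above:

```sql
-- Collect all values into a single, ascending-sorted array.
WITH sample_data AS (
  SELECT x
  FROM UNNEST([1, 3, 5, 6, 7, 8, 10, 11, 12, 13]) AS x
)
SELECT ARRAY_AGG(CAST(x AS FLOAT64) ORDER BY x) AS sorted_values
FROM sample_data;
```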
This step results in a sorted array of values from the dataset.
Next, we define a SQL UDF that takes this sorted array and calculates the exact median:
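A sketch of such a UDF, written here as a temporary function for demonstration (it assumes the input array is already sorted in ascending order):

```sql
CREATE TEMP FUNCTION exact_median(sorted_values ARRAY<FLOAT64>)
RETURNS FLOAT64
AS (
  IF(
    MOD(ARRAY_LENGTH(sorted_values), 2) = 1,
    -- Odd number of elements: return the middle one.
    sorted_values[OFFSET(DIV(ARRAY_LENGTH(sorted_values), 2))],
    -- Even number of elements: average the two central ones.
    (sorted_values[OFFSET(DIV(ARRAY_LENGTH(sorted_values), 2) - 1)]
      + sorted_values[OFFSET(DIV(ARRAY_LENGTH(sorted_values), 2))]) / 2
  )
);
```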
This UDF, exact_median, checks the length of the array to determine if the number of elements is odd or even. It then calculates the median by either returning the middle element or averaging the two central elements.
Finally, the UDF is applied to the sorted array to obtain the median:
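Running the aggregation and the UDF together in the same script as the CREATE TEMP FUNCTION statement above:

```sql
WITH sample_data AS (
  SELECT x
  FROM UNNEST([1, 3, 5, 6, 7, 8, 10, 11, 12, 13]) AS x
)
SELECT exact_median(ARRAY_AGG(CAST(x AS FLOAT64) ORDER BY x)) AS median
FROM sample_data;
```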
The result is 7.5, as expected.
After defining our SQL UDF for precise median calculation, the next step is to deploy it in a robust and manageable way. Terraform, an infrastructure-as-code tool, offers a streamlined, version-controlled approach to deploying resources like BigQuery UDFs. Here’s how you can deploy the exact_median UDF using Terraform:
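Below is a minimal sketch (not the full module) based on the google_bigquery_routine resource; it assumes the Google provider is already configured and uses a placeholder dataset named udfs:

```hcl
resource "google_bigquery_dataset" "udfs" {
  dataset_id = "udfs" # placeholder dataset to hold the UDF
  location   = "EU"   # pick the location of your data
}

resource "google_bigquery_routine" "exact_median" {
  dataset_id   = google_bigquery_dataset.udfs.dataset_id
  routine_id   = "exact_median"
  routine_type = "SCALAR_FUNCTION"
  language     = "SQL"

  arguments {
    name      = "sorted_values"
    data_type = jsonencode({ typeKind = "ARRAY", arrayElementType = { typeKind = "FLOAT64" } })
  }
  return_type = jsonencode({ typeKind = "FLOAT64" })

  # Same expression as the SQL UDF defined earlier in the post.
  definition_body = <<-EOT
    IF(
      MOD(ARRAY_LENGTH(sorted_values), 2) = 1,
      sorted_values[OFFSET(DIV(ARRAY_LENGTH(sorted_values), 2))],
      (sorted_values[OFFSET(DIV(ARRAY_LENGTH(sorted_values), 2) - 1)]
        + sorted_values[OFFSET(DIV(ARRAY_LENGTH(sorted_values), 2))]) / 2
    )
  EOT
}
```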
You can find the full Terraform implementation here. By applying the root Terraform module, you should be able to deploy the function in your GCP environment. (If you do not have a GCP environment yet, here are the steps on how you can get started with GCP.)
After a successful ‘apply’, you can use the function like this:
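For example (the project and dataset names are placeholders for your own):

```sql
WITH sample_data AS (
  SELECT x
  FROM UNNEST([1, 3, 5, 6, 7, 8, 10, 11, 12, 13]) AS x
)
SELECT
  `my-project.udfs.exact_median`(ARRAY_AGG(CAST(x AS FLOAT64) ORDER BY x)) AS median
FROM sample_data;
```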
While the exact median implementation using ARRAY_AGG and a SQL UDF in BigQuery provides precision, it's important to be aware of its scalability limitations. The array created by ARRAY_AGG cannot grow without bound – BigQuery imposes a limit on the maximum size of an individual value, including an array. For extremely large datasets this method might therefore not be feasible, as the aggregated array could exceed the allowable size limit. Apply this exact median calculation judiciously, taking the size of your dataset into account.
In conclusion, this deep dive into migrating a specific Hive UDAF for calculating the median to BigQuery has shown us the versatility and precision that SQL UDFs and functions like ARRAY_AGG can offer. While we've addressed the challenges of achieving exact calculations, it's crucial to remember the scalability constraints inherent in this approach, especially for very large datasets. As we continue to navigate the nuances of transitioning from Hive to BigQuery, our journey into the realm of custom function migration is far from over. In the next installment of this series, we will explore the exciting territory of BigQuery Remote Functions. I'll be providing a working example to demonstrate how we can replace more complex Hive UDFs with BigQuery’s powerful remote function capabilities, offering insights into leveraging cloud-based solutions for even the most intricate data processing needs. Stay tuned for more practical solutions and expert tips to enhance your data migration strategy.