8 min read

Common HiveQL to BigQuery Migration Errors: A Detailed Exploration - Part 2

Published on
March 19, 2024
Author
Steve Ahlswede
Steve Ahlswede
Data Engineer
Subscribe to our newsletter
Subscribe
Common HiveQL to BigQuery Migration Errors: A Detailed Exploration - Part 2

In the last post we covered four common errors that we have had to contend with during our time migrating HiveQL to BigQuery during a data warehouse migration. There we covered ranking issues, issues with using the `TRIM` function, size of `NULL` arrays, and issues with the Gregorian calendar reform. In this post we will continue with a few more examples. 

Using Arrays in Group By

In Hive it is possible to use an array in a `GROUP BY` statement. 

However, in BigQuery this is not possible. Thus, a workaround is needed. One solution is to cast the array to a JSON string, as strings can be used in `GROUP BY`. You can wrap this in a CTE or subquery, and then convert the JSON string back to the original array afterwards

Using Aggregation Functions on STRUCT/RECORD

In Hive, it is possible to apply a function such as `MIN` OR `MAX` on a STRUCT. If we use `MIN` as an example, then the minimum STRUCT is determined based on the values of the sub-fields, where the first sub-field has precedence. So if we have two STRUCTs as such:

Then the `MIN` of the two would return

Because `abe` < `barry`. Now if we change the data such that the first field has a tie

Then the `MIN` would return 

This is because the first field was not enough to determine the minimum, and thus the second field is used.

However, BigQuery does not allow STRUCT fields to be used in aggregation functions, and thus a workaround is needed. One possible solution is to apply the same ranking logic using the `ROW_NUMBER()` function.

From the above you would get the following result

Replicating MONTHS_BETWEEN in BigQuery

In Hive there is a function called `MONTHS_BETWEEN`. This function calculates a floating point value of the number of months between two dates. For example, if two dates are exactly a half month apart, then it will return `0.5`. In BigQuery no such function exists, as the closest option is `DATE_DIFF`, which treats the difference in months as an integer, and thus if any date is within a different month from the original date, even if only 2 days apart (e.g. `2024-10-30` and `2024-11-1`). In order to mimic the original behaviour, we wrote a custom user-defined function (UDF).

Conclusion

While there are more small issues that we have encountered along the way of migrating HiveQL to BigQuery, we simply cannot cover every single one of them. We hope that these examples from parts one and two will be enough to give you an idea of the types of issues you could face, and how to solve them. Even better if they addressed the exact problems that you are currently stuck on.

Author
Steve Ahlswede
Data Engineer
Subscribe to our newsletter
Subscribe