Welcome back to the second installment of our series on data validation during data warehouse migrations. In our previous post, we gave a general overview of the data validation process, explaining why it matters for a seamless transition between platforms and covering the basic steps and strategies involved. Building on that foundation, we're now zeroing in on real-world experiences with data validation, using a HiveQL-to-BigQuery migration as a case study.
At this point in your data migration, you should have already translated your data pipeline SQL files from the source dialect into the target dialect. From here, you can run both the original and migrated pipelines. As mentioned in our previous post, we prefer to perform the data comparison by having both tables within either the source or target system. As we will be talking about Hive to BigQuery here, that means uploading your Hive table to BigQuery (or vice versa if you prefer). Now you are ready to start validating! Let's go over some of the more common errors that came up for us during a recent migration project.
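To make this concrete, here is a minimal sketch of one way to compare the two copies once they both live in BigQuery. The table names are placeholders, and this is just one simple approach; a dedicated validation tool or a column-wise aggregate comparison may serve you better, especially for tables containing `ARRAY` or `STRUCT` columns, which set operations like this cannot compare directly.

```sql
-- Rows present in the uploaded Hive copy but missing from the migrated table.
SELECT * FROM `project.dataset.orders_hive_upload`
EXCEPT DISTINCT
SELECT * FROM `project.dataset.orders_migrated`;

-- The reverse direction, catching rows that exist only in the migrated table.
SELECT * FROM `project.dataset.orders_migrated`
EXCEPT DISTINCT
SELECT * FROM `project.dataset.orders_hive_upload`;
```

If both queries return zero rows, the two tables contain the same set of distinct rows (exact duplicates would still need a count-based check).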
The `TRIM()` function looks straightforward to translate, since the two implementations behave the same way in the vast majority of cases. However, Hive's documentation states that it removes leading and trailing spaces from the string, whereas the BigQuery version removes whitespace. This means the BigQuery `TRIM` will strip spaces, newlines, vertical tabs, and so on. To the naked eye the data will look identical, but these subtle differences should be caught one way or another by your validation tool or query. The BigQuery `TRIM` function accepts an optional second argument specifying the set of characters to remove, so by setting it to `' '` you can mimic the behavior of the Hive `TRIM` function.
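For illustration, the snippet below contrasts the two behaviors in BigQuery; the column and table names are hypothetical.

```sql
SELECT
  TRIM(raw_value)       AS trimmed_all_whitespace,  -- BigQuery default: strips spaces, tabs, newlines, ...
  TRIM(raw_value, ' ')  AS trimmed_spaces_only      -- restricts TRIM to spaces, mimicking Hive's TRIM
FROM `project.dataset.example_table`;
```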
One of the most common errors we encountered stems from the use of `ROW_NUMBER() OVER()` to rank rows based on some field value(s). The BigQuery and Hive syntaxes are identical here; the problem arises when the field(s) used for ranking are not unique within a partition, because multiple rows can then share the same value for the ranking field. For example, suppose you are ranking student `names` based on their `grades` in a certain class. If two or more students have exactly the same grade, `ROW_NUMBER` will order the tied students according to some underlying shuffling of the data, and in that case Hive and BigQuery do not produce the same result. To perform the validation, you need to add more fields to the ordering until you have enough to obtain a unique ranking for each student. Perhaps the student names can serve as a secondary factor, so that grades remain the primary factor, but tied students are sorted alphabetically by name. This can of course introduce bias that leads to undesirable results in your data, so the choice of additional factors deserves some thought. Since this is only needed for the validation period, you can also revert to the original ranking once you are confident the migrated table's data is valid.
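Sticking with the hypothetical students example, here is a sketch of what the deterministic ordering could look like; the same `ORDER BY` works in both Hive and BigQuery.

```sql
SELECT
  name,
  grade,
  -- Ambiguous: students with equal grades are ordered by whatever shuffle the
  -- engine happens to produce, so Hive and BigQuery can rank them differently.
  ROW_NUMBER() OVER (PARTITION BY class_id ORDER BY grade DESC) AS ambiguous_rank,
  -- Deterministic: ties on grade are broken alphabetically by name, so both
  -- engines agree on the ranking during validation.
  ROW_NUMBER() OVER (PARTITION BY class_id ORDER BY grade DESC, name) AS deterministic_rank
FROM `project.dataset.student_grades`;
```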
The Hive function `SIZE` returns the number of elements in an array; the BigQuery counterpart is `ARRAY_LENGTH`. The two behave identically except when given a NULL input: `ARRAY_LENGTH` returns NULL, whereas the Hive function returns -1. We found this causing differences in the results of `WHERE` clauses such as `WHERE SIZE(array) < 100`, because the Hive code still selects rows where the array is NULL (since `-1 < 100`), whereas the BigQuery code does not (since `NULL < 100` does not evaluate to `TRUE`). In this case, you can add a temporary `CASE` expression in BigQuery that maps any NULL arrays to `-1`.
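A sketch of that workaround, with a hypothetical array column `tags`:

```sql
-- Hive keeps rows with a NULL array because SIZE(tags) returns -1, and -1 < 100.
-- BigQuery would drop them because ARRAY_LENGTH(NULL) is NULL, and NULL < 100 is not TRUE.
-- Mapping NULL back to -1 restores the Hive behavior for the duration of the validation.
SELECT *
FROM `project.dataset.example_table`
WHERE CASE
        WHEN ARRAY_LENGTH(tags) IS NULL THEN -1
        ELSE ARRAY_LENGTH(tags)
      END < 100;
```

`COALESCE(ARRAY_LENGTH(tags), -1) < 100` is an equivalent, more compact way to express the same mapping.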
Next is a problem that took some real digging to reveal the root cause. We were working with publication data stretching back centuries, to very early scientific works. One pipeline took Unix timestamps as input and converted them to human-readable timestamps. This worked fine for all but a few rows where the Unix timestamp was, for example, `-17429212800`. Here we had the exact same Unix timestamp being fed into the respective conversion functions (`FROM_UNIXTIME` in Hive and `TIMESTAMP_SECONDS` in BigQuery), yet the resulting values were nine days apart: `1417-01-01 00:00:00` in Hive and `1417-01-10 00:00:00` in BigQuery. More modern dates (e.g. `2012-10-13`) did not have this issue, and neither did dates such as `1600-01-01`, so we set about zeroing in on the point where the drift started. Eventually, we identified the exact Unix timestamps where the jump occurred: from `-12219212800` to `-12219312800`, where the former corresponds to `1582-10-15` and the latter to `1582-10-04` in Hive. In BigQuery, the same two timestamps showed no jump at all and simply fell on consecutive days (`1582-10-15` and `1582-10-14`).

At this point I was still a bit confused and ready to chalk this up as a bug in Hive. However, a colleague enlightened me: this moment corresponds to the Gregorian calendar reform, whereby no days exist between `1582-10-04` and `1582-10-15`. Hive honors the reform (using Julian dates before it), whereas BigQuery uses the proleptic Gregorian calendar, which applies the Gregorian rules even to dates before `1582-10-15`.

An interesting piece of knowledge, but how did we handle this during the validation? There were a couple of routes we could have taken. One option was to write a `CASE` statement in BigQuery that subtracts nine days from the timestamp if it falls before `1582-10-15`. However, after discussing it with the client, we decided it was sufficient to simply filter rows with timestamps before this date out of the validation. Had these timestamps had a larger impact on downstream tables, for instance by being used in a `WHERE` clause, leaving the inconsistencies in the table would have led to more headaches, and in that case we would recommend the `CASE` statement solution.
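For reference, here is a sketch of what the `CASE`-based adjustment could have looked like in BigQuery; the column `publication_ts` is hypothetical, and the fixed nine-day offset only matches the era our data fell into (the Julian/Gregorian gap grows by a day at certain century boundaries and is ten days at the 1582 reform itself).

```sql
SELECT
  CASE
    -- Shift pre-reform timestamps so BigQuery's proleptic Gregorian dates line up
    -- with the Julian dates Hive produces for this era.
    WHEN publication_ts < TIMESTAMP '1582-10-15' THEN TIMESTAMP_SUB(publication_ts, INTERVAL 9 DAY)
    ELSE publication_ts
  END AS publication_ts_aligned
FROM `project.dataset.publications`;

-- The simpler route we ultimately took: exclude pre-reform rows from the validation.
-- SELECT ... FROM `project.dataset.publications` WHERE publication_ts >= TIMESTAMP '1582-10-15';
```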
Wrap up
We have covered some very interesting cases here, but to avoid overwhelming you with too much at once, I will split this topic in two and share the remaining examples in the second part of this post. Hopefully these few examples are already of help to you, or at least made for an interesting read. See you in the next one soon!