
Bridging Complexities: Migrating Hive UDFs to BigQuery Remote Functions - Part 2

Published on February 28, 2024
Author: Csaba Kassai, Head of Data

In the previous part, we explored migrating complex Hive UDFs, like the ‘xml_product_info_extractor’, to BigQuery Remote Functions. This approach allows you to leverage existing Java code within a Cloud Function, enabling seamless integration with BigQuery's SQL environment. However, ensuring your function's reliability and performance requires thorough testing and deployment best practices. This part delves into these crucial aspects, focusing on reusing existing code for testing and deployment using Terraform.

Unit testing of the UDF

If you are lucky and the developers of the original Hive UDF followed development best practices, there are already unit tests you can reuse. These tests are valuable assets that can significantly shorten the testing phase of the migration. With some adjustments, they can be repurposed to validate the functionality of your Cloud Function before deployment.

Hive UDF unit tests

Here's a practical example of a unit test for our Hive UDF that processes XML data. The test follows the standard structure you'd expect, with a sketch of the code after the list:

  1. Setup: We start by creating an instance of our UDF using its default constructor.
  2. Input Preparation: Next, we prepare the input for our test by loading an XML file from our resources, mimicking the data our UDF would process in a real-world scenario.
  3. Execution: With our setup ready and input prepared, we invoke the evaluate method of our UDF, which is where the actual processing happens.
  4. Verification: Finally, we verify the outcome using JUnit's assert methods to ensure the UDF's output matches our expectations.
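Below is a sketch of what such a test might look like. The class name XmlProductInfoExtractor, the String-based evaluate signature, the resource file, and the expected output are illustrative assumptions rather than the exact code from the repository:

```java
import static org.junit.Assert.assertEquals;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.junit.Test;

public class XmlProductInfoExtractorTest {

  @Test
  public void evaluateExtractsProductInfoFromXml() throws IOException {
    // 1. Setup: instantiate the UDF with its default constructor
    XmlProductInfoExtractor udf = new XmlProductInfoExtractor();

    // 2. Input preparation: load a sample XML document from the test resources
    String xml = new String(
        Files.readAllBytes(Paths.get("src/test/resources/product.xml")),
        StandardCharsets.UTF_8);

    // 3. Execution: call the evaluate method, where the actual processing happens
    String result = udf.evaluate(xml);

    // 4. Verification: assert that the output matches the expected value
    //    (the expected JSON below is a placeholder)
    assertEquals("{\"product_id\":\"42\",\"name\":\"Widget\"}", result);
  }
}
```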

For simplicity, this example focuses on a single test case, but typically, you'd have multiple tests to cover different scenarios. The complete source code for these tests is available here. This example demonstrates a straightforward approach to testing Hive UDFs, ensuring they work correctly across various cases.

Testing the BigQuery UDF Cloud Functions

Keeping the BigQuery UDF in Java allows us to repurpose the existing Hive UDF test cases with minimal adjustments. The adaptation involves refining the test code to fit the Cloud Function's framework while preserving the core testing logic, as sketched after the list:

  1. Setup: We initialize by instantiating our Cloud Function class, similar to setting up the Hive UDF.
  2. Input Preparation: The same XML data from our test resources is utilized, but this time we format it into JSON objects that our Cloud Function expects.
  3. Execution: The process method is invoked with our prepared input, replicating the function's execution as it would occur in a live environment.
  4. Verification: The verification step remains consistent with our original approach, asserting the correctness of the function's output against expected values.
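Below is a sketch of the adapted test. It assumes the Cloud Function class is called XmlProductInfoExtractorFunction and exposes a process method that takes the parsed request JSON and returns the response JSON; the names, the method signature, and the expected values are assumptions:

```java
import static org.junit.Assert.assertEquals;

import com.google.gson.JsonArray;
import com.google.gson.JsonObject;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.junit.Test;

public class XmlProductInfoExtractorFunctionTest {

  @Test
  public void processReturnsOneReplyPerCall() throws IOException {
    // 1. Setup: instantiate the Cloud Function class
    XmlProductInfoExtractorFunction function = new XmlProductInfoExtractorFunction();

    // 2. Input preparation: wrap the same sample XML into the JSON shape
    //    BigQuery sends to remote functions: {"calls": [[<arg1>, ...], ...]}
    String xml = new String(
        Files.readAllBytes(Paths.get("src/test/resources/product.xml")),
        StandardCharsets.UTF_8);
    JsonArray call = new JsonArray();
    call.add(xml);
    JsonArray calls = new JsonArray();
    calls.add(call);
    JsonObject request = new JsonObject();
    request.add("calls", calls);

    // 3. Execution: invoke the processing logic with the prepared request
    JsonObject response = function.process(request);

    // 4. Verification: the single reply should match the Hive UDF's output
    //    (the expected JSON below is a placeholder)
    assertEquals(
        "{\"product_id\":\"42\",\"name\":\"Widget\"}",
        response.getAsJsonArray("replies").get(0).getAsString());
  }
}
```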

The complete code for these adapted tests is accessible here. This approach highlights the efficiency of leveraging existing tests with only necessary modifications for the Cloud Function, ensuring a smooth transition of your UDF's functionality to BigQuery with minimal effort.

Deployment to GCP

To get our UDF working in BigQuery, we need to deploy two key components. The first is the Cloud Function, which creates the necessary HTTP endpoint for BigQuery to access. This step is crucial for enabling our UDF's logic to be executed outside BigQuery's native environment. The second component is the BigQuery Remote Function definition, which essentially tells BigQuery how to find and use the Cloud Function. By setting up both these elements, we connect BigQuery with our external UDF, making it ready for use in our data processing workflows.

Cloud Functions

Deploying the Cloud Function involves a series of practical steps, starting from packaging the function to its deployment via Terraform. Initially, we create an uber-jar using Maven to bundle the function along with its dependencies, ensuring it's self-contained. This can be achieved by adding the Maven Assembly Plugin to your pom.xml and configuring it to create an uber-jar:
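A typical configuration, added to the <plugins> section of the pom.xml, looks roughly like this; the plugin version is an assumption, so pin whichever version your build already uses:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <version>3.6.0</version>
  <configuration>
    <descriptorRefs>
      <!-- Produces the additional *-jar-with-dependencies.jar artifact -->
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
  <executions>
    <execution>
      <id>make-assembly</id>
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```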

Incorporating this plugin into your pom.xml configures Maven to produce an additional jar during the build. This file, distinguished by the -jar-with-dependencies suffix in its name, bundles all the necessary dependencies, making it ideal for deployment. Once the uber-jar is built, the next step is to compress it into a zip file with a basic zip command, packaging it correctly for upload.

Once zipped, the file is uploaded to Cloud Storage, which can be accomplished using the ‘gcloud storage cp’ command:
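The two commands below illustrate the zip-and-upload step; the jar name, zip name, and bucket are placeholders:

```sh
# Package the uber-jar into a zip archive (file names are placeholders)
zip -j target/xml-udf-function.zip \
    target/xml-udf-function-1.0-jar-with-dependencies.jar

# Upload the archive to the Cloud Storage bucket Terraform will reference
gcloud storage cp target/xml-udf-function.zip \
    gs://your-udf-archive-bucket/functions/xml-udf-function.zip
```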

Here is the full shell script that executes these steps.

With the function's code securely stored in Cloud Storage, the final step involves deploying the Cloud Function using Terraform. In your Terraform configuration, define the Cloud Function resource, specifying the Cloud Storage bucket and the zip file's path as the source. Here's an example Terraform snippet:
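The snippet below is a sketch using the 2nd generation google_cloudfunctions2_function resource; the resource and variable names, the runtime version, and the entry-point class are assumptions:

```hcl
resource "google_cloudfunctions2_function" "xml_udf" {
  name     = "xml-product-info-extractor"
  project  = var.project_id
  location = var.region

  build_config {
    runtime = "java17"
    # For Java, the entry point is the fully qualified class implementing HttpFunction
    entry_point = "com.example.udf.XmlProductInfoExtractorFunction"
    source {
      storage_source {
        bucket = var.udf_archive_bucket
        object = "functions/xml-udf-function.zip"
      }
    }
  }

  service_config {
    available_memory = "512M"
    timeout_seconds  = 60
  }
}
```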

Executing ‘terraform apply’ will deploy the Cloud Function, making it accessible as an HTTP endpoint and ready for integration as a BigQuery Remote Function. This streamlined process ensures that your function is deployed with all necessary dependencies, ready to enhance your BigQuery data workflows with custom logic.

BigQuery Remote UDF

Following the successful deployment of our Cloud Function, the next critical step involves configuring the BigQuery environment to utilize the newly deployed UDF within SQL queries. This two-fold setup involves establishing a secure connection to the Cloud Function and defining the UDF within BigQuery.

Establishing a Connection to Cloud Functions

To facilitate communication between BigQuery and the Cloud Function, we first create a BigQuery connection resource. This connection acts as a secure conduit, allowing BigQuery to send data to and receive results from the Cloud Function.

In the Terraform configuration, the ‘google_bigquery_connection’ resource is defined to create this link. Here, we specify a unique connection_id, the Google Cloud Project id, and the geographical location where the Cloud Function is deployed. The cloud_resource block signifies that this connection is to a cloud resource, in our case, a Cloud Function.
With the connection, a dedicated service account is created, and this service account must be granted the Cloud Run Invoker role in order to access the function's endpoint. We can use the ‘google_project_iam_member’ Terraform resource to grant this role.
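A sketch of these two resources is shown below; the connection id and variable names are assumptions, and you could also scope the invoker role to the single Cloud Run service instead of the project:

```hcl
resource "google_bigquery_connection" "udf_connection" {
  connection_id = "xml-udf-connection"
  project       = var.project_id
  location      = var.region

  # Marks this as a connection to a Google Cloud resource (our Cloud Function)
  cloud_resource {}
}

# The service account created with the connection must be able to invoke the
# function's underlying Cloud Run service
resource "google_project_iam_member" "udf_invoker" {
  project = var.project_id
  role    = "roles/run.invoker"
  member  = "serviceAccount:${google_bigquery_connection.udf_connection.cloud_resource[0].service_account_id}"
}
```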

Creating the BigQuery UDF Definition

With the connection in place, we then proceed to define the BigQuery UDF itself. This involves specifying the UDF's name, the Cloud Function it should invoke (identified by its endpoint), and the function's parameters and return type.

The UDF definition is articulated using BigQuery's Data Definition Language (DDL) within a google_bigquery_job resource. This job executes a DDL query that creates or replaces the UDF in BigQuery, linking it to the Cloud Function via the previously established connection. The OPTIONS clause in the DDL statement specifies the Cloud Function's endpoint, ensuring that BigQuery knows where to send data for processing.

In the Terraform script, locals are used to organize and simplify the definition of the UDF's properties, such as its name (udf_name), the entry point in the Cloud Function (udf_entry_point), and the Cloud Storage bucket where the Cloud Function's code resides (udf_archive_bucket). The ddl_query local constructs the DDL statement using these properties.
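A simplified sketch of this part of the configuration is shown below; the dataset name, the UDF signature, and the exact set of locals are assumptions and are condensed compared to the locals described above:

```hcl
locals {
  udf_name = "xml_product_info_extractor"

  # DDL that registers the remote function and points it at the Cloud Function endpoint
  ddl_query = <<-EOT
    CREATE OR REPLACE FUNCTION `${var.project_id}.udf_dataset.${local.udf_name}`(product_xml STRING)
    RETURNS STRING
    REMOTE WITH CONNECTION `${var.project_id}.${var.region}.${google_bigquery_connection.udf_connection.connection_id}`
    OPTIONS (
      endpoint = "${google_cloudfunctions2_function.xml_udf.service_config[0].uri}"
    )
  EOT
}

resource "google_bigquery_job" "create_udf" {
  # Deriving the job id from the DDL makes Terraform re-run the job when the DDL changes
  job_id   = "create-${local.udf_name}-${substr(md5(local.ddl_query), 0, 8)}"
  project  = var.project_id
  location = var.region

  query {
    query          = local.ddl_query
    use_legacy_sql = false
    # DDL statements must not use a destination table, so the dispositions are left empty
    create_disposition = ""
    write_disposition  = ""
  }
}
```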

You can find the full Terraform code here. This configuration not only defines the UDF within BigQuery but also ties it to the Cloud Function through the connection, allowing for seamless data processing. Executing ‘terraform apply’ deploys these resources, and the UDF becomes available for use in BigQuery queries, bridging the gap between your complex data processing needs and BigQuery's powerful analytical capabilities.

After the successful deployment, the UDF can be used in SQL:
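Assuming the function was created in a dataset called udf_dataset and the source table exposes a product_xml column (both names are assumptions), a query might look like this:

```sql
SELECT
  product_xml,
  `your-project.udf_dataset.xml_product_info_extractor`(product_xml) AS product_info
FROM
  `your-project.raw_dataset.products`;
```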

Next Up: Data Validation during a DWH migration project

With this comprehensive guide on migrating complex Hive UDFs to BigQuery Remote Functions, we conclude our detailed exploration into enhancing BigQuery's capabilities with external custom logic. Through this series, we've navigated the intricacies of transferring the rich functionality of Hive's UDFs, ensuring their performance and reliability in a cloud-native environment with the help of Cloud Functions and Terraform. The journey from understanding the need for migration, through testing, to the final deployment illustrates a clear path for leveraging existing code and tests, thereby streamlining the migration process.

As we close this chapter on custom function migration, we look ahead to another critical aspect of data warehouse migration projects: data validation. Ensuring the accuracy and consistency of your data post-migration is paramount for maintaining the integrity of your analytical workflows. In our upcoming series, we will delve into strategies and tools for effective data validation, helping you secure confidence in your migrated data and optimize your decision-making processes. Stay tuned for insightful discussions on safeguarding data quality in your next big migration project.
