BigQuery compatible encryption in Java

hero

Duan Leu | June 23, 2022

5 Minutes Read

One of many nice things about Google BigQuery is that Google constantly adds new features. Recently, we had a look at BigQuery Column Encryption with Google Key Management Service (KMS) integration and it’s quite cool.

BigQuery offers multiple layers of security by default: encryption at rest and properly configured IAM policies prevent unauthorized parties from accessing your data. These measures provide strong enough guarantees for security for some of the use cases. In some industries, however, the defaut measures are not enough; for sensitive data such as PII (Personally Identifiable Information) an additional layer of security is required. BigQuery’s column encryption can help you in this case, BigQuery allows you to encrypt columns and decrypt encrypted columns in SQL using AEAD family functions like AEAD.ENCRYPT and AEAD.DECRYPT to minimize the risk of data exposure.

Regarding AEAD, we just need to know that it’s a strong cipher function that guarantees your data can never be exposed without the encryption key. Read more about AEAD encryption concepts.

But what if you have to encrypt information outside of BigQuery and later decrypt it using only native BigQuery SQL?

We actually had this challenge in a recent project, where our task was to build a real-time data ingestion pipeline from on-prem data sources into BigQuery using the Confluent tech-stack. Due to regulatory requirements, we had to encrypt certain sensitive fields before the change events reached the Kafka topic. We used custom Java Single Message Transform to fulfill this requirement, to encrypt customer data in the middle of the data ingestion pipeline.

In this post, we show how to encrypt a simple "hello world" string in Java to get the ciphertext, and decrypt that ciphertext in BigQuery.

To do this, we will use the Tink library to encrypt our plaintext in a way that BigQuery can understand and decrypt it using the native AEAD functions. From Tink’s documentation:

Tink is an open-source cryptography library written by cryptographers and security engineers at Google. Tink’s secure and simple APIs reduce common pitfalls through user-centered design, careful implementation and code reviews, and extensive testing.

Tink supports Java, Python, and C++ programming languages. In Java, we need to include the following dependencies:

  • com.google.crypto.tink:tink-gcpkms:1.6.1
  • com.google.crypto.tink:tink:1.6.1

Encrypt data using plaintext key

To replicate our example, you need to fulfill the following prerequisites:

Once you have the prerequisites ready, follow the steps below:

  1. First, create a encryption key using ‘tinkey’ utility (you can to install it by following guides of Tink documentation)

tinkey create-keyset --key-template AES256_GCM --out-format json --out plaintext-dek.json

  1. Use that plaintext-dek.json key to encrypt data in Java
  1. Take a note at the base64 output of the cipher text, and decrypt it using BigQuery SQL

BigQuery automatically logs all the queries, so adding the plaintext encryption key to the query directly means that the plaintext key can be found in the BigQuery logs. Query parameters can be used to avoid this security risk.

  1. Done, you’ll get the plaintext hello world back.

Encrypt data using an encrypted key protected by Google KMS

The approach using multiple layers of keys to encrypt data is called envelope encryption. A data encryption key (DEK) is used to encrypt the data itself, and the DEK is again encrypted by another key called a key encryption key (KEK). We still use the AEAD functions but this time the key is protected (encrypted) by Google KMS. Without appropriate permissions, such as Cloud KMS CryptoKey Encrypter/Decrypter on the key in KMS, no one can decrypt and use our key. You may be tempted to ask why we don’t use Google KMS to encrypt our data directly? Because it won’t work. Google KMS is not suitable for encrypting large volumes of data (the maximum size that you can encrypt using KMS is about 64KBs).

To execute the steps described you have to meet the following prerequisites (in addition to those mentioned in the previous example):

Follow these steps:

Follow these steps:

  1. Create a keyring named ‘demo-keyring-us’ in KMS.

gcloud kms keyrings create demo-keyring-us --location us --project <gcp-project-id>

  1. Create a key named ‘demo-key’ in KMS as KEK.

gcloud kms keys create demo-key --keyring demo-keyring-us --purpose encryption --location us --project <gcp-project-id>

  1. Generate the encrypted DEK using tinkey.

tinkey create-keyset --key-template AES256_GCM \ --master-key-uri "gcp-kms://projects/<gcp-project-name>/locations/us/keyRings/demo-keyring-us/cryptoKeys/demo-key" \ --out encrypted-dek.json

  1. Open encrypted-dek.json. Note the encryptedKeyset value. This is the base64 representation of the encrypted DEK keyset. We will use this during the decryption in BigQuery SQL later.

  2. Encrypt the plaintext in Java using the encrypted-dek.json file. To execute this Java code, you will need to authenticate properly to GCP. The easiest way to do this in a local environment is via gcloud auth application-default login.

  1. Finally, decrypt ciphertext with BigQuery SQL.

You can verify that we get our hello world plaintext back.

You can access theJava source code in the bq-column-level-encryption-example repository. Head over to Tink and BigQuery documentation for more advanced capabilities, such as key rotation.

5 Good Reasons to Move to a Cloud-based Data Warehouse

clock

2018-11-05

3 Minutes Read

Now, let me walk you through the benefits of cloud-based data warehouses, one by one.

Google BigQuery materialized view test drive

clock

2022-04-07

3 Minutes Read

I have tested the BigQuery materialized views against the documentation. While most of the functionality and limitations are accurate, there are a few gotchas you need to be aware of.

Part 1: Customer lifetime value estimation via probabilistic modeling

clock

2022-02-16

15 Minutes Read

Customer lifetime value (CLV) is the total worth of a customer to a company over the length of their relationship. In practice, this "worth" can be defined as revenue, profit, or any metric of an analyst's choosing.

hero
logo

Ready for the future? Let’s talk!

Reach out, and let’s take your business to the next level.

image

By clicking submit below, you consent to allow Aliz.ai to store and process the personal information submitted above and share information about our products and services, as well as other content that may be of interest to you. For more information, please review our Privacy Policy. You may unsubscribe at any time. Your data will not be passed on to third parties.

I agree to receive other communications from Aliz.ai.

logo

© Copyright 2022 ALIZ Tech Kft.
ALIZ is a proud Google Cloud Premier Partner with specializations in Data Analytics and Machine Learning. We deliver data analytics, machine learning, and infrastructure solutions, off the shelf, or custom-built on GCP using an agile, holistic approach.

logologologologo