Duan Leu | June 23, 2022
5 Minutes Read
One of many nice things about Google BigQuery is that Google constantly adds new features. Recently, we had a look at BigQuery Column Encryption with Google Key Management Service (KMS) integration and it’s quite cool.
BigQuery offers multiple layers of security by default: encryption at rest and properly configured IAM policies prevent unauthorized parties from accessing your data. These measures provide strong enough guarantees for security for some of the use cases. In some industries, however, the defaut measures are not enough; for sensitive data such as PII (Personally Identifiable Information) an additional layer of security is required. BigQuery’s column encryption can help you in this case, BigQuery allows you to encrypt columns and decrypt encrypted columns in SQL using AEAD family functions like AEAD.ENCRYPT and AEAD.DECRYPT to minimize the risk of data exposure.
Regarding AEAD, we just need to know that it’s a strong cipher function that guarantees your data can never be exposed without the encryption key. Read more about AEAD encryption concepts.
But what if you have to encrypt information outside of BigQuery and later decrypt it using only native BigQuery SQL?
We actually had this challenge in a recent project, where our task was to build a real-time data ingestion pipeline from on-prem data sources into BigQuery using the Confluent tech-stack. Due to regulatory requirements, we had to encrypt certain sensitive fields before the change events reached the Kafka topic. We used custom Java Single Message Transform to fulfill this requirement, to encrypt customer data in the middle of the data ingestion pipeline.
In this post, we show how to encrypt a simple "hello world" string in Java to get the ciphertext, and decrypt that ciphertext in BigQuery.
To do this, we will use the Tink library to encrypt our plaintext in a way that BigQuery can understand and decrypt it using the native AEAD functions. From Tink’s documentation:
Tink is an open-source cryptography library written by cryptographers and security engineers at Google. Tink’s secure and simple APIs reduce common pitfalls through user-centered design, careful implementation and code reviews, and extensive testing.
Tink supports Java, Python, and C++ programming languages. In Java, we need to include the following dependencies:
Encrypt data using plaintext key
To replicate our example, you need to fulfill the following prerequisites:
Once you have the prerequisites ready, follow the steps below:
tinkey create-keyset --key-template AES256_GCM --out-format json --out plaintext-dek.json
BigQuery automatically logs all the queries, so adding the plaintext encryption key to the query directly means that the plaintext key can be found in the BigQuery logs. Query parameters can be used to avoid this security risk.
Encrypt data using an encrypted key protected by Google KMS
The approach using multiple layers of keys to encrypt data is called envelope encryption. A data encryption key (DEK) is used to encrypt the data itself, and the DEK is again encrypted by another key called a key encryption key (KEK). We still use the AEAD functions but this time the key is protected (encrypted) by Google KMS. Without appropriate permissions, such as Cloud KMS CryptoKey Encrypter/Decrypter on the key in KMS, no one can decrypt and use our key. You may be tempted to ask why we don’t use Google KMS to encrypt our data directly? Because it won’t work. Google KMS is not suitable for encrypting large volumes of data (the maximum size that you can encrypt using KMS is about 64KBs).
To execute the steps described you have to meet the following prerequisites (in addition to those mentioned in the previous example):
Follow these steps:
Follow these steps:
gcloud kms keyrings create demo-keyring-us --location us --project <gcp-project-id>
gcloud kms keys create demo-key --keyring demo-keyring-us --purpose encryption --location us --project <gcp-project-id>
tinkey create-keyset --key-template AES256_GCM \ --master-key-uri "gcp-kms://projects/<gcp-project-name>/locations/us/keyRings/demo-keyring-us/cryptoKeys/demo-key" \ --out encrypted-dek.json
Open encrypted-dek.json. Note the encryptedKeyset value. This is the base64 representation of the encrypted DEK keyset. We will use this during the decryption in BigQuery SQL later.
Encrypt the plaintext in Java using the encrypted-dek.json file. To execute this Java code, you will need to authenticate properly to GCP. The easiest way to do this in a local environment is via gcloud auth application-default login.
You can verify that we get our hello world plaintext back.
5 Good Reasons to Move to a Cloud-based Data Warehouse
3 Minutes Read
Now, let me walk you through the benefits of cloud-based data warehouses, one by one.
Google BigQuery materialized view test drive
3 Minutes Read
I have tested the BigQuery materialized views against the documentation. While most of the functionality and limitations are accurate, there are a few gotchas you need to be aware of.
Part 1: Customer lifetime value estimation via probabilistic modeling
15 Minutes Read
Customer lifetime value (CLV) is the total worth of a customer to a company over the length of their relationship. In practice, this "worth" can be defined as revenue, profit, or any metric of an analyst's choosing.