One of many nice things about Google BigQuery is that Google constantly adds new features. Recently, we had a look at BigQuery Column Encryption with Google Key Management Service (KMS) integration and it’s quite cool.
BigQuery offers multiple layers of security by default: encryption at rest and properly configured IAM policies prevent unauthorized parties from accessing your data. These measures provide strong enough guarantees for security for some of the use cases. In some industries, however, the defaut measures are not enough; for sensitive data such as PII (Personally Identifiable Information) an additional layer of security is required. BigQuery’s column encryption can help you in this case, BigQuery allows you to encrypt columns and decrypt encrypted columns in SQL using AEAD family functions like AEAD.ENCRYPT and AEAD.DECRYPT to minimize the risk of data exposure
Regarding AEAD, we just need to know that it’s a strong cipher function that guarantees your data can never be exposed without the encryption key. Read more about AEAD encryption concepts.
But what if you have to encrypt information outside of BigQuery and later decrypt it using only native BigQuery SQL?
We actually had this challenge in a recent project, where our task was to build a real-time data ingestion pipeline from on-prem data sources into BigQuery using the Confluent tech-stack. Due to regulatory requirements, we had to encrypt certain sensitive fields before the change events reached the Kafka topic. We used custom Java Single Message Transform to fulfill this requirement, to encrypt customer data in the middle of the data ingestion pipeline.
In this post, we show how to encrypt a simple "hello world" string in Java to get the ciphertext, and decrypt that ciphertext in BigQuery.
To do this, we will use the Tink library to encrypt our plaintext in a way that BigQuery can understand and decrypt it using the native AEAD functions. From Tink’s documentation:
Tink is an open-source cryptography library written by cryptographers and security engineers at Google. Tink’s secure and simple APIs reduce common pitfalls through user-centered design, careful implementation and code reviews, and extensive testing
Tink supports Java, Python, and C++ programming languages. In Java, we need to include the following dependencies:
To replicate our example, you need to fulfill the following prerequisites:
Once you have the prerequisites ready, follow the steps below:
1. First, create a encryption key using ‘tinkey’ utility (you can to install it by following guides of Tink documentation)
tinkey create-keyset --key-template AES256_GCM --out-format json --out plaintext-dek.json
2. Use that plaintext-dek.json key to encrypt data in Java
3.Take a note at the base64 output of the cipher text, and decrypt it using BigQuery SQL
BigQuery automatically logs all the queries, so adding the plaintext encryption key to the query directly means that the plaintext key can be found in the BigQuery logs. Query parameters can be used to avoid this security risk.
4.Done, you’ll get the plaintext hello world back.
The approach using multiple layers of keys to encrypt data is called envelope encryption. A data encryption key (DEK) is used to encrypt the data itself, and the DEK is again encrypted by another key called a key encryption key (KEK). We still use the AEAD functions but this time the key is protected (encrypted) by Google KMS. Without appropriate permissions, such as Cloud KMS CryptoKey Encrypter/Decrypter on the key in KMS, no one can decrypt and use our key. You may be tempted to ask why we don’t use Google KMS to encrypt our data directly? Because it won’t work. Google KMS is not suitable for encrypting large volumes of data (the maximum size that you can encrypt using KMS is about 64KBs).
To execute the steps described you have to meet the following prerequisites (in addition to those mentioned in the previous example):
Follow these steps:
1.Create a keyring named ‘demo-keyring-us’ in KMS.
gcloud kms keyrings create demo-keyring-us --location us --project <gcp-project-id>
2.Create a key named ‘demo-key’ in KMS as KEK.
gcloud kms keys create demo-key --keyring demo-keyring-us --purpose encryption --location us --project <gcp-project-id>
3.Generate the encrypted DEK using tinkey.
tinkey create-keyset --key-template AES256_GCM \
--master-key-uri "gcp-kms://projects/<gcp-project-name>/locations/us/keyRings/demo-keyring-us/cryptoKeys/demo-key" \
--out encrypted-dek.json
4. Open encrypted-dek.json. Note the encryptedKeyset value. This is the base64 representation of the encrypted DEK keyset. We will use this during the decryption in BigQuery SQL later.
5. Encrypt the plaintext in Java using the encrypted-dek.json file. To execute this Java code, you will need to authenticate properly to GCP. The easiest way to do this in a local environment is via gcloud auth application-default login.
6. Finally, decrypt ciphertext with BigQuery SQL.
You can verify that we get our hello world plaintext back.
You can access the Java source code in the bq-column-level-encryption-example repository. Head over to Tink and BigQuery documentation for more advanced capabilities, such as key rotation.