Ivan Kobza | July 27, 2022
5 Minutes Read
Data warehouses often need ongoing data quality checks and monitoring. Because most of our projects use BigQuery as the data warehouse (DWH), the requirements for a data quality toolkit are as follows:
This case study uses Dataplex data quality tasks to fulfill these requirements.
Dataplex data quality tasks use the open-source CloudDQ engine under the hood. Validation results are stored in a target BigQuery dataset so that they can be easily accessed.
Dashboards in Data Studio can present the results visually. Cloud Monitoring can send alerts to various channels, including email for employee notifications and Pub/Sub for automation.
We used a subset of the publicly available NYC taxi trips dataset.
CloudDQ uses YAML config files to define validation rules.
Please check the CloudDQ reference for the full configuration format.
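To give a feel for how such a config is organized, here is a minimal sketch written as a Python dict that mirrors the YAML structure. The top-level keys (`row_filters`, `rule_dimensions`, `rules`, `rule_bindings`) and field names follow the CloudDQ layout as we understand it, so please verify them against the reference. The project, dataset, table, and column names are hypothetical, and `find_dangling_references` is an illustrative helper, not part of CloudDQ.

```python
# Sketch of a CloudDQ-style config as a Python dict (the real file is YAML).
# Entity URI, column name, and rule names are hypothetical placeholders.
config = {
    "row_filters": {
        "NONE": {"filter_sql_expr": "True"},  # no filtering: validate every row
    },
    "rule_dimensions": ["completeness", "correctness"],
    "rules": {
        "NOT_NULL_SIMPLE": {"rule_type": "NOT_NULL", "dimension": "completeness"},
        "POSITIVE_VALUE": {
            "rule_type": "CUSTOM_SQL_EXPR",
            "dimension": "correctness",
            "params": {"custom_sql_expr": "$column > 0"},
        },
    },
    "rule_bindings": {
        "TRIP_DISTANCE_CHECKS": {
            "entity_uri": "bigquery://projects/my-project/datasets/taxi/tables/trips",
            "column_id": "trip_distance",
            "row_filter_id": "NONE",
            "rule_ids": ["NOT_NULL_SIMPLE", "POSITIVE_VALUE"],
        },
    },
}


def find_dangling_references(cfg):
    """Return (binding_id, kind, missing_id) for ids a binding uses but the config never defines."""
    problems = []
    rules = cfg.get("rules", {})
    filters = cfg.get("row_filters", {})
    for binding_id, binding in cfg.get("rule_bindings", {}).items():
        filter_id = binding.get("row_filter_id")
        if filter_id not in filters:
            problems.append((binding_id, "row_filter", filter_id))
        for rule_id in binding.get("rule_ids", []):
            if rule_id not in rules:
                problems.append((binding_id, "rule", rule_id))
    return problems
```

A rule binding ties one column of one table to a list of rules and a row filter, so a quick local check like `find_dangling_references(config)` can catch typos in rule or filter ids before the task runs.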
In this example, we created a config file to find the following issues:
A scheduled task can be created in Dataplex to perform regular data quality validation (e.g., daily). Please refer to the Dataplex documentation for details on task creation.
There is also an option to trigger a Dataplex task from Cloud Composer. This may be useful if there is a requirement to implement a quality gate instead of monitoring data quality on a schedule.
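The decision logic behind such a quality gate can be sketched in plain Python. This is only an illustration under assumptions: it expects the validation results for the current run to be already fetched as a list of dicts, and the column names (`rule_binding_id`, `rule_id`, `failed_percentage`) are our assumption about the summary table layout. `QualityGateError` and `enforce_quality_gate` are hypothetical names, and the threshold uses whatever scale the summary table stores.

```python
class QualityGateError(Exception):
    """Raised when validation results breach the allowed failure threshold."""


def enforce_quality_gate(summary_rows, max_failed_percentage=0.0):
    """Raise QualityGateError if any rule fails more often than allowed.

    summary_rows: list of dicts, one per rule result for the current run.
    Returns the number of rows checked when the gate passes.
    """
    breaches = [
        row
        for row in summary_rows
        # Treat a missing or NULL percentage as zero failures.
        if (row.get("failed_percentage") or 0.0) > max_failed_percentage
    ]
    if breaches:
        details = ", ".join(
            f"{row['rule_binding_id']}/{row['rule_id']}={row['failed_percentage']}"
            for row in breaches
        )
        raise QualityGateError(f"Data quality gate failed: {details}")
    return len(summary_rows)
```

In a Composer DAG, a function like this could run in a Python task right after the Dataplex task finishes, so that a raised exception fails the task and stops downstream processing.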
Validation results are stored in a summary table in BigQuery.
Data Studio can be used to build dashboards on top of this table.
There is no alerting functionality in Dataplex data quality tasks. However, you can use Cloud Monitoring to send notifications to various channels.
One workaround is to create a scheduled query that regularly checks dq_summary for the check failures we want to alert on. When it finds such failures, the query appends a row to an alerting table.
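The selection step of that scheduled query can be illustrated on in-memory rows. The real workaround would be a BigQuery scheduled query in SQL; this Python sketch only shows the filtering idea. The column names (`rule_binding_id`, `rule_id`, `failed_count`) are our assumption about the dq_summary layout, and `build_alert_rows`, `watched_rule_ids`, and the alert row shape are hypothetical.

```python
def build_alert_rows(summary_rows, watched_rule_ids, detected_at):
    """Pick the failures we want to alert on and shape them as alerting-table rows.

    summary_rows: list of dicts read from dq_summary.
    watched_rule_ids: set of rule ids that should trigger an alert.
    detected_at: timestamp string stamped onto every alert row.
    """
    alerts = []
    for row in summary_rows:
        failed = row.get("failed_count") or 0  # NULL counts as no failures
        if row["rule_id"] in watched_rule_ids and failed > 0:
            alerts.append(
                {
                    "rule_binding_id": row["rule_binding_id"],
                    "rule_id": row["rule_id"],
                    "failed_count": failed,
                    "detected_at": detected_at,
                }
            )
    return alerts
```

Restricting alerts to a watched set of rules keeps noisy, low-priority checks in the dashboard without paging anyone.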
Then you can create a log-based alert policy with the following log filter:
Another option is to use the --summary_to_stdout flag in a Dataplex DQ task to publish the validation summary to stdout. Then you can again create a log-based alerting policy.
Dataplex DQ is a simple yet powerful option to implement data quality checks and quality gates. In addition, it is flexible and provides easily accessible results in BigQuery. Another benefit is that it’s serverless, so you don’t have to worry about managing infrastructure.