Aliz Team | January 12, 2023
4 Minutes Read
Data-first companies often face challenges when it comes to the democratization and scaling of their analytics. Data mesh architecture alleviates the bottlenecks that frequently occur when data is siloed and then loaded into a centralized data warehouse.
To see how these challenges can be overcome, let's first compare data mesh with the centralized approach and walk through the four principles of data mesh and the resulting architecture paradigm. After reviewing the benefits and drawbacks of the concept, we will share how our company, AlizTech, helped implement a data-mesh-based solution for one of our partners.
As the software engineering paradigm shifted from monolithic applications to microservices, the data industry responded to the weaknesses of traditional data lakes with data mesh. Not so long ago it was common practice to introduce a central data team, whose responsibility was to extract data from all the sources throughout the company and load it into a monolithic data warehouse for further analysis.
This approach, however, has limitations. For example, the throughput of the central data team can easily become a bottleneck that slows down the entire workflow, making it harder and harder to make timely data-driven decisions. Data mesh addresses this issue through decentralization, returning the responsibility for data management to the source teams. A central data team is expected to manage data it is barely familiar with, while in a mesh architecture, small, domain-specific teams can develop a deep understanding of the business aspects of the data they own.
Our data mesh/data lake comparison already touched on the first and second principles. Domain data owners are accountable for providing their data as a product, while the consumers of the data are considered customers. The data should be discoverable and explorable, but still trustworthy and secure. Owners should understand their consumers' needs, and autonomously serve those needs through extract-load-transform (ELT) pipelines.
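To make the data-as-a-product idea concrete, here is a minimal sketch of the kind of descriptor a domain team might publish so its product is discoverable across the mesh. All names and fields here are hypothetical illustrations, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Minimal descriptor a domain team might publish for discovery."""
    domain: str             # owning domain team
    name: str               # product name within the domain
    owner_email: str        # accountable owner for consumer questions
    schema_ref: str         # pointer to the published table schema
    sla_hours: int = 24     # freshness guarantee offered to consumers
    tags: list = field(default_factory=list)

    def qualified_name(self) -> str:
        # A unique, mesh-wide identifier for discovery
        return f"{self.domain}.{self.name}"

product = DataProduct(
    domain="logistics",
    name="orders_curated",
    owner_email="logistics-data@example.com",
    schema_ref="bq://project.logistics_curated.orders",
)
print(product.qualified_name())  # logistics.orders_curated
```

In practice such metadata would live in a shared catalog so consumers can find, evaluate, and trust a product without contacting the owning team first.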
The mortar that keeps the mesh together is a domain-agnostic data infrastructure, a platform for running pipelines and storage, and messaging. Domain teams do not have to manage the underlying technology stack and infrastructure; that is the platform team’s job. With available documentation, clear syntax and semantics, strict quality control, understandable metadata, and computational governance, the data platform ensures that everything is in place for self-service.
Finally, the principle of federated governance and security means a set of standards ensuring that all domain teams adhere to the organization’s rules and industry regulations. Governance and security should be unified and promoted through the entire mesh to achieve interoperability without duplicating efforts and reinventing the wheel.
The main benefits of the data mesh architecture are its scalability and agility. A well-implemented data mesh can greatly improve time-to-market, while its distributed nature guarantees the flexibility of the system and reduces the number of possible single points of failure. Because data mesh is so heavily reliant on unified documentation, it can also increase transparency throughout the organization.
On the other hand, the paradigm requires a high level of administration and forward planning. For startups and other small companies, this overhead may outweigh the benefits.
In this specific case, our customer is a huge food delivery company, whose business extends to various countries on multiple continents. The company's internal structure is segmented both by business function and by localization, while the domain teams mirror the different business areas.
Each domain team has their own Apache Airflow instance created from a unified template, where they can run their ELT pipelines. There are also several pipeline templates prepared for the most common tasks like data curation, documentation, and policy tag application. These pipelines only need configuration to get going. Whenever a staging area is needed for the ingested data, the domain teams can use their Google Cloud Storage (GCS) bucket to temporarily store their files.
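The "configuration only" idea behind these templates can be sketched as a shared template merged with a domain team's own settings. This is an illustrative simplification, not our actual template code, and all dataset and bucket names are made up:

```python
def build_pipeline_config(template: dict, overrides: dict) -> dict:
    """Merge a shared pipeline template with a domain team's overrides."""
    config = {**template, **overrides}
    # Fail fast if the domain team forgot a required setting
    missing = [key for key, value in config.items() if value is None]
    if missing:
        raise ValueError(f"missing required settings: {missing}")
    return config

# A shared curation template: None marks fields the domain team must fill in
CURATION_TEMPLATE = {
    "task": "curation",
    "source_dataset": None,
    "target_dataset": None,
    "staging_bucket": None,   # GCS bucket for temporary files
    "schedule": "@daily",
}

config = build_pipeline_config(
    CURATION_TEMPLATE,
    {
        "source_dataset": "logistics_raw",
        "target_dataset": "logistics_curated",
        "staging_bucket": "gs://logistics-staging",
    },
)
```

With this shape, a domain team only supplies its own datasets and bucket; the schedule, task type, and any defaults come from the shared template, which keeps pipelines uniform across teams.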
The core of the data platform is Google's data warehousing service, BigQuery, which is a serverless, columnar storage with extensive analytical capabilities. BigQuery supports ANSI SQL queries as well as ML, and it is well prepared to handle big data at petabyte scale, which made it a perfect fit for our business case.
Data loaded into BigQuery was separated by domain into multiple datasets. Each domain managed data in its own layer, also represented by datasets. For example, freshly ingested data landed in the raw dataset of the domain team. After deduplication and filtering, it ended up in either the reporting layer or the curated layer ready for the consumers.
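The deduplication step between the raw and curated layers can be expressed as a window-function query that keeps only the newest row per business key. Below is a hedged sketch of a helper that builds such a BigQuery query; the table and column names are hypothetical:

```python
def dedup_query(raw_table: str, curated_table: str,
                key: str, order_col: str) -> str:
    """Build a BigQuery statement that keeps the newest row per key."""
    return f"""
    CREATE OR REPLACE TABLE `{curated_table}` AS
    SELECT * EXCEPT (_rn)
    FROM (
      SELECT *, ROW_NUMBER() OVER (
        PARTITION BY {key} ORDER BY {order_col} DESC
      ) AS _rn
      FROM `{raw_table}`
    )
    WHERE _rn = 1
    """

sql = dedup_query(
    "project.logistics_raw.orders",
    "project.logistics_curated.orders",
    key="order_id",
    order_col="ingested_at",
)
```

A pipeline in the domain team's Airflow instance would submit a query like this to move filtered, deduplicated data from the raw dataset into the curated one.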
Data engineers had the freedom to use templated pipelines or implement custom solutions to bring data from various sources to the raw layer using their team’s Airflow instance, GCS bucket, and in some cases, even third-party services like Fivetran with Salesforce. To mention a few examples, we ingested data from PostgreSQL, Google Playstore, the backend APIs of various in-house and third-party applications, as well as from Amazon S3.
To ensure interoperability and easy data discovery between domain teams, we implemented a few unified standards, for example for managing personal data, access control, and documentation.
After the data engineers imported the data into the raw layer and ensured its quality with some filtering, the data analysts could use it to write their SQL queries to generate reports and curated tables that suited their dashboards and BI tools.
We used short-lived dbt instances to execute the queries: dbt is an open-source tool that can automatically generate documentation, model dependencies between queries, and run tests. In our case, dbt ran containerized in a Kubernetes pod started by an Airflow pipeline. We selected dbt for its built-in data testing capabilities and its easy templating and configuration, and because it let our analysts self-serve with little intervention from a data engineer. They could use familiar SQL and test their queries from their local machines, while only having to learn basic Jinja2 templating and a few Git commands.
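As an illustration of how little an analyst has to learn beyond SQL, a curated-layer dbt model might look like the following sketch. The source, table, and column names are hypothetical, and the quality filter is deliberately simple:

```sql
-- models/curated/orders_curated.sql (hypothetical dbt model)
{{ config(materialized='table') }}

select
    order_id,
    customer_id,
    order_total,
    ingested_at
from {{ source('logistics_raw', 'orders') }}  -- dbt resolves the dependency
where order_total is not null                 -- simple quality filter
```

Because the query references its input through dbt's `source()` (and other models through `ref()`), dbt can build the dependency graph, generate documentation, and run the model's tests automatically.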
This allowed us not only to shorten time-to-market, but also to reduce the engineering hours needed to maintain the data warehouse. Instead of adding new SQL queries and handling trivial tasks, data engineers could focus on further developments.
We compared the data mesh architecture to the monolithic, centrally managed data warehouse solution, and explored its advantages over the older paradigm. Data mesh replaced the central team with domain teams, giving them the opportunity to become deeply familiar with their field of business, while a unified data Infrastructure-as-a-Platform enabled self-service for them, and a set of standards and rules ensured interoperability between them.
Besides the theory, we also explored a data mesh implementation in practice, where we used Airflow, Kubernetes, GCS, and BigQuery as our shared data platform. Our common standards for managing personal data, access, and documentation allowed smooth collaboration between the domain teams, while dbt extended self-service to analysts with easy, templatable SQL queries. The result was an efficient, flexible structure that not only supported data-driven decisions, but also reduced the data engineering time needed.