
BigQuery: Migrating from on-prem PostgreSQL


Sandor Muranyi | April 25, 2019

4 Minutes Read

Moving Data from On-Prem to BigQuery

Nowadays, more and more organizations are using Google Cloud BigQuery to meet their need for a cloud-based enterprise data warehouse. If you’re one of them, one of the problems you might face is how to move your data from an on-premises (on-prem) solution to Google Cloud Platform without harming your current setup and production performance. In this post, we take a look at a very specific scenario where you (the client)

  • want to move to the cloud, because your business reports are very slow, even on a 500GB dataset.
  • currently have a small production server running PostgreSQL with a ~500GB-1TB database, two cores, 16GB of memory, and a high load.
  • also want to keep your data in PostgreSQL, so the best option is streaming (or mini-batch transfers) on top of the initial data export.
  • don’t want to export all of your data (but you’ve identified that as a possible future requirement).
  • might want to export the results of specific queries from PostgreSQL rather than dump everything.
  • have data that’s very complex.
  • use a VPN. Basically, you don’t want to change anything at the moment, but you would like to use BigQuery in parallel.

Our challenge is to create a pipeline that can stream the data out of PostgreSQL to BigQuery. The hard part is that we have to take the server’s limitations into account. The pipeline also has to handle the initial data loads and the ongoing streaming at the same time, with tables as large as ~200GB.

Architecture

  • Composer: Cloud Composer is a fully managed workflow orchestration service that empowers you to author, schedule, and monitor pipelines that span clouds and on-prem data centers. Built on the popular Apache Airflow open-source project and operated using the Python programming language, Cloud Composer is free from lock-in and easy to use.
  • Pub/Sub: Cloud Pub/Sub is a simple, reliable, and scalable foundation for stream analytics and event-driven computing, as well as a global messaging and event-ingestion service.
  • Google Cloud Storage: Google Cloud Storage (GCS) is a flexible, scalable, and durable storage option. You can read and write files to GCS buckets from almost anywhere, so you can use buckets as common storage between your instances, Google App Engine, your on-prem systems, and other cloud services.
  • BigQuery: BigQuery is Google’s serverless, highly scalable, enterprise data warehouse designed to make all your data analysts productive at an unmatched price-performance. Because there is no infrastructure to manage, you can focus on analyzing data to find meaningful insights using familiar SQL without the need for a database administrator.
  • Docker: Docker is a tool designed to make it easier to create, deploy, and run applications using containers. Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and ship it all out as one package.

[Architecture diagram: the PG2BQ (PostgreSQL-to-BigQuery) pipeline]

Explanation

  1. Composer regularly publishes a request via Pub/Sub containing the incremental query (based on timestamps).
  2. The Java Docker application receives it and starts the export.
  3. After the export is ready, the Java Docker application uploads the JSON file to GCS using gsutil.
  4. Then, the Java Docker application publishes a message to Pub/Sub that the task has been finished.
  5. When Composer receives that message, it picks up the JSON file from GCS and starts a load job into BigQuery (see the command-line sketch after this list).
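
To make the message flow concrete, here is a minimal command-line sketch of steps 1 and 5. In the actual pipeline, Composer publishes the request and runs the load job through the Pub/Sub and BigQuery APIs, and the Java Docker application performs the export; the topic name, message payload, dataset, and bucket below are illustrative assumptions, not details from the original setup.

  # Step 1 (done by Composer in the pipeline): publish an export request
  # that describes the incremental query window.
  gcloud pubsub topics publish export-requests \
    --message='{"table": "orders", "updated_after": "2019-04-01T00:00:00Z"}'

  # Step 5 (done by Composer once the "export finished" message arrives):
  # load the newline-delimited JSON file from GCS into BigQuery.
  bq load \
    --source_format=NEWLINE_DELIMITED_JSON \
    --autodetect \
    my_dataset.orders \
    gs://my-export-bucket/exports/orders_20190401.json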

Exporting JSON from PostgreSQL in an Efficient Way

Here, we review only the most interesting part of efficiently exporting JSON from PostgreSQL: the Java Docker application’s psql command that does the export. It’s challenging. Sometimes it has to export only a few megabytes, but when a new table is added, it might have to export 200GB+. And while it’s exporting, the database is still being used in production. So, we have to be as memory- and I/O-efficient as possible so as not to harm your application. The only way we can do this without consuming loads of memory is to use PostgreSQL’s COPY. COPY differs from other solutions (such as redirecting query results to a file) because it streams the data out to standard output or to a file, so a small amount of memory is enough. The problem is that COPY has no JSON output format; it only supports text, CSV, and binary output. And the problem with CSV is that the data in the database is so complex that CSV’s format can’t handle it. Therefore, we choose another output format: JSON.
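
For reference, a plain streaming export through psql looks roughly like the sketch below; the connection string and table name are placeholders, not details from the original setup. The next section shows how to turn this into JSON output.

  # COPY streams rows straight to stdout, so redirecting the output to a file
  # keeps the client's memory footprint small even for a 200GB+ table.
  psql "host=db.internal dbname=prod user=exporter" \
       -c "COPY orders TO STDOUT WITH CSV" > orders.csv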

Exporting to JSON with COPY

To solve this problem, the query itself has to return JSON. Then we only have to stream it out to a file or to stdout via COPY. It’s actually very easy:

  SELECT row_to_json(t) FROM (SELECT * FROM table) t

The problem with this, though, is that when it’s exported via COPY, COPY will ruin the JSON’s format, because COPY by default uses the double quote (") as the CSV quote character. The solution is a little bit hacky, but works really well:

  COPY (SELECT row_to_json(t) FROM (SELECT * FROM table) t) TO STDOUT WITH QUOTE E'\b' CSV

This replaces the default double quote with a backspace character during the quoting stage. This way, COPY won’t ruin the JSON structure, as it will use backspace characters instead of double quotes to quote the column. With this solution, we get a file that contains the JSON data in a single CSV column, wrapped in backspace characters. We simply have to strip those backspace characters from every row of the file:

  sed -i -e 's/^.\(.*\).$/\1/' file.json

Note: Depending on the data, we can replace the backspace with any other character that complies with the QUOTE parameter’s requirements. At this point, the data is ready to upload to Cloud Storage with gsutil.
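
Putting the pieces together, the exporter’s job (handled by the Java Docker application in this setup) boils down to something like the following shell sketch. The connection details, table, incremental filter, and bucket are assumptions for illustration; in the real pipeline they arrive in the Pub/Sub request.

  #!/usr/bin/env bash
  set -euo pipefail

  # Illustrative placeholders; the real values come from the Pub/Sub request.
  PG_CONN="host=db.internal dbname=prod user=exporter"
  TABLE="orders"
  SINCE="2019-04-01T00:00:00Z"
  OUT="export_${TABLE}.json"
  BUCKET="gs://my-export-bucket/exports"

  # Stream the incremental slice as one JSON document per row, quoting the
  # single CSV column with a backspace so the JSON itself is left untouched.
  psql "$PG_CONN" -v ON_ERROR_STOP=1 -c \
    "COPY (SELECT row_to_json(t)
           FROM (SELECT * FROM ${TABLE} WHERE updated_at > '${SINCE}') t)
     TO STDOUT WITH CSV QUOTE E'\b'" > "$OUT"

  # Strip the leading and trailing backspace quote characters from every row.
  sed -i -e 's/^.\(.*\).$/\1/' "$OUT"

  # Upload the newline-delimited JSON to Cloud Storage for the BigQuery load job.
  gsutil cp "$OUT" "${BUCKET}/${OUT}"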

The Good and the Bad

As always, there are good and bad things about this.

The Good

On the on-prem side, this solution is an efficient way to stream data out. Also, by using Pub/Sub we easily avoid any VPN/firewall issues, as we don’t need any network hacks. It’s a scalable, robust solution.

The Bad

In this solution, Composer provides the orchestration, but it was designed for much more complex scenarios. Here it’s a bit of overkill, as it spins up at least three machines.

Summary

In the end, we have a robust, scalable solution. It can export valid JSON from a 1TB database without consuming loads of memory or harming the database. These JSON files can then easily be loaded into BigQuery from GCS, ready for you to use.
