Moving Data from On-Prem to BigQuery
Nowadays, more and more organizations are using Google Cloud BigQuery to meet their need for a cloud-based enterprise data warehouse. If you’re one of them, one of the problems you might face is how to move your data from an on-premises (on-prem) solution to Google Cloud Platform without harming your current setup and production performance. In this post, we take a look at a very specific scenario where you (the client)
- want to move to the cloud, because your business reports are very slow, even on a 500GB dataset.
- currently have a small production server running PostgreSQL with a ~500GB-1TB database, two cores, 16GB of memory, and a high load.
- also want to keep your data in PostgreSQL, so the best option is streaming (or micro-batch transfers) on top of the initial data export.
- don’t want to export all of your data (but you’ve defined it as a possible requirement in the future).
- might want to export the results of specific queries from PostgreSQL rather than dump everything.
- have data that’s very complex.
- use VPN.
Basically, you don’t want to change anything at the moment, but you would like to use BigQuery in parallel.
Our challenge is to create a pipeline that can stream data out of PostgreSQL to BigQuery. The hard part is that we have to take the server’s limitations into account: the pipeline has to handle initial data loads and streaming at the same time, with table sizes up to ~200GB. To build it, we used the following tools:
- Composer: Cloud Composer is a fully managed workflow orchestration service that empowers you to author, schedule, and monitor pipelines that span clouds and on-prem data centers. Built on the popular Apache Airflow open-source project and operated using the Python programming language, Cloud Composer is free from lock-in and easy to use.
- Pub/Sub: Cloud Pub/Sub is a simple, reliable, scalable foundation for stream analytics/event-driven computing systems and it’s a global messaging and event ingestion service.
- Google Cloud Storage: Google Cloud Storage (GCS) is a flexible, scalable, and durable storage option. You can read and write files to GCS buckets from almost anywhere, so you can use buckets as common storage between your instances, Google App Engine, your on-prem systems, and other cloud services.
- BigQuery: BigQuery is Google’s serverless, highly scalable, enterprise data warehouse designed to make all your data analysts productive at an unmatched price-performance. Because there is no infrastructure to manage, you can focus on analyzing data to find meaningful insights using familiar SQL without the need for a database administrator.
- Docker: Docker is a tool designed to make it easier to create, deploy, and run applications using containers. Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and ship it all out as one package.
The pipeline works like this:
- Composer regularly publishes a request via Pub/Sub containing the incremental query (based on timestamps).
- The Java Docker application receives it and starts the export.
- After the export is ready, the Java Docker application uploads the JSON file to GCS using gsutil.
- Then, the Java Docker application publishes a message to Pub/Sub that the task has been finished.
- Composer receives that message, takes the JSON file from GCS, and starts a load job into BigQuery.
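The final load step that Composer triggers can be sketched with the bq command-line tool. The dataset, table, and bucket names below are made up for illustration, so the command is built and printed rather than executed (a real run would need gcloud credentials):

```shell
# Hypothetical load job Composer would start once the export lands in GCS.
# Dataset, table, and bucket names are invented; --autodetect lets BigQuery
# infer the schema from the newline-delimited JSON.
LOAD_CMD="bq load --source_format=NEWLINE_DELIMITED_JSON --autodetect my_dataset.my_table gs://my-export-bucket/exports/my_table.json"
echo "$LOAD_CMD"
```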
Exporting JSON from PostgreSQL in an Efficient Way
Here, we review only the most interesting part of efficiently exporting JSON from PostgreSQL: the Java Docker application’s psql command that does the export. It’s challenging: sometimes it has to export a few megabytes, but when a new table is added, it might have to export 200GB+. And while it’s exporting, the database is still being used in production. So, we have to be as memory- and IO-efficient as possible so as not to harm your application.
The only way we can do this without consuming loads of memory is to use PostgreSQL’s COPY.
COPY differs from other solutions (such as redirecting query results to a file) because it streams the data out to standard output or to a file, so a small amount of memory is enough. The problem is that its output formats are limited to text, CSV, and binary; none of them is JSON. And our data is so complex that the CSV format can’t handle it. Therefore, we chose another output format: JSON.
Exporting to JSON with COPY
To solve this problem, the query itself has to return JSON; then we only have to stream it out to a file or stdout via COPY. It’s actually very easy:
SELECT row_to_json(t) FROM (SELECT * FROM table) t
The problem with this, though, is that when it’s exported via COPY, COPY will mangle the JSON, because by default COPY uses the double quote (") as the CSV quote character: each double quote inside the JSON gets escaped by doubling it.
The solution is a little bit hacky, but works really well:
COPY (SELECT row_to_json(t) FROM (SELECT * FROM table) t) TO STDOUT WITH QUOTE E'\b' CSV
This replaces the default double quote with a backspace character as the CSV quote. This way, COPY won’t ruin the JSON structure: it leaves the double quotes inside the JSON alone and instead wraps each row in backspace characters (since each row contains commas), and backspaces don’t occur in row_to_json’s output. Using this solution, we get a file that contains one JSON document per line, wrapped in backspaces. We simply have to strip those backspace characters from every row:
sed -i -e 's/^.\(.*\).$/\1/' file.json
Note: Depending on the data, we can replace the backspace with any other character that complies with the QUOTE parameter’s requirements and doesn’t occur in the data itself.
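To see the cleanup in action without a live database, we can fake a two-row COPY output (each JSON row wrapped in backspace quotes, since the rows contain commas) and run the same sed command over it. The table rows here are made up for illustration:

```shell
# Simulated "COPY ... QUOTE E'\b' CSV" output: two JSON rows, each wrapped
# in backspace characters (\b). The rows themselves are invented sample data.
printf '\b{"id":1,"name":"alice"}\b\n\b{"id":2,"name":"bob"}\b\n' > file.json

# Strip the first and last character (the backspace "quotes") of every row
sed -i -e 's/^.\(.*\).$/\1/' file.json

cat file.json
```

After the sed pass, file.json holds one valid JSON document per line, which is exactly the newline-delimited JSON that BigQuery load jobs expect.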
At this point, the data is ready to upload with gsutil to Cloud Storage.
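Putting the pieces together, the export step inside the Java Docker application boils down to a command chain like the following. The host, database, table, and bucket names are hypothetical, and the chain is only printed here (via a heredoc) because actually running it needs a live PostgreSQL and gcloud credentials:

```shell
# End-to-end export sketch: COPY streams JSON rows out, sed strips the
# backspace "quotes" from the stream, gsutil uploads the result to GCS.
# Connection string, table, and bucket are made up for illustration.
EXPORT_SKETCH=$(cat <<'EOF'
psql "host=db.internal dbname=prod" -c \
  "COPY (SELECT row_to_json(t) FROM (SELECT * FROM my_table) t) TO STDOUT WITH QUOTE E'\b' CSV" \
  | sed -e 's/^.\(.*\).$/\1/' > my_table.json
gsutil cp my_table.json gs://my-export-bucket/exports/
EOF
)
echo "$EXPORT_SKETCH"
```

Note that here sed filters the stream directly instead of rewriting the file in place with -i; both approaches produce the same result.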
The Good and the Bad
As always, there are good and bad things about this.
On the on-prem side, this solution streams data out efficiently. Also, by using Pub/Sub, we easily avoid any VPN/firewall issues, as we don’t need any network hacks. It’s a scalable, robust solution.
In this solution, Composer provides the orchestration, but it was designed for more complex scenarios. Here it’s a little bit of overkill, as it spins up at least three machines.
In the end, we have a robust, scalable solution. It can export valid JSON from a 1TB database without consuming loads of memory or harming the database. These JSON files can then be easily loaded into BigQuery from GCS so you can start using them.