Moving Data from On-Prem to BigQuery
Nowadays, more and more organizations are using Google Cloud BigQuery to meet their need for a cloud-based enterprise data warehouse. If you’re one of them, one of the problems you might face is how to move your data from an on-premises (on-prem) solution to Google Cloud Platform without harming your current setup and production performance. In this post, we take a look at a very specific scenario where you (the client)
- want to move to the cloud, because your business reports are very slow, even on a 500GB dataset.
- currently have a small production server running PostgreSQL with a ~500GB-1TB database, two cores, 16GB of memory, and a high load.
- also want to keep your data in PostgreSQL, so the best option is streaming (or micro-batch transfers) on top of the initial data export.
- don’t want to export all of your data (but you’ve defined it as a possible requirement in the future).
- might want to export the results of specific queries from PostgreSQL rather than dump everything.
- have data that’s very complex.
- use VPN.
Basically, you don’t want to change anything at the moment, but you would like to use BigQuery in parallel.
Our challenge is to create a pipeline that can stream data out of PostgreSQL to BigQuery. The hard part is that we have to take the server’s limitations into account: the pipeline has to handle initial data loads and streaming at the same time, with table sizes up to ~200GB. To build it, we used the following tools:
- Composer: Cloud Composer is a fully managed workflow orchestration service that empowers you to author, schedule, and monitor pipelines that span clouds and on-prem data centers. Built on the popular Apache Airflow open-source project and operated using the Python programming language, Cloud Composer is free from lock-in and easy to use.
- Pub/Sub: Cloud Pub/Sub is a simple, reliable, scalable foundation for stream analytics/event-driven computing systems and it’s a global messaging and event ingestion service.
- Google Cloud Storage: Google Cloud Storage (GCS) is a flexible, scalable, and durable storage option. You can read and write files to GCS buckets from almost anywhere, so you can use buckets as common storage between your instances, Google App Engine, your on-prem systems, and other cloud services.
- BigQuery: BigQuery is Google’s serverless, highly scalable, enterprise data warehouse designed to make all your data analysts productive at an unmatched price-performance. Because there is no infrastructure to manage, you can focus on analyzing data to find meaningful insights using familiar SQL without the need for a database administrator.
- Docker: Docker is a tool designed to make it easier to create, deploy, and run applications using containers. Containers allow a developer to package up an application with all of the parts it needs, such as libraries and other dependencies, and ship it all out as one package.
The pipeline works like this:
- Composer regularly publishes a request via Pub/Sub containing the incremental query (based on timestamps).
- The Java Docker application receives it and starts the export.
- After the export is ready, the Java Docker application uploads the JSON file to GCS using gsutil.
- Then, the Java Docker application publishes a message to Pub/Sub that the task has been finished.
- Composer receives that message, takes the JSON file from GCS, and starts a load job into BigQuery.
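The final load step that Composer triggers can be sketched with the bq command-line tool. The dataset, table, and bucket names below are made up for illustration, so the command is built and printed rather than executed (a real run would need gcloud credentials):

```shell
# Hypothetical load job Composer would start once the export lands in GCS.
# Dataset, table, and bucket names are invented; --autodetect lets BigQuery
# infer the schema from the newline-delimited JSON.
LOAD_CMD="bq load --source_format=NEWLINE_DELIMITED_JSON --autodetect my_dataset.my_table gs://my-export-bucket/exports/my_table.json"
echo "$LOAD_CMD"
```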
Exporting JSON from PostgreSQL in an Efficient Way
Here, we review only the most interesting part of efficiently exporting JSON from PostgreSQL: the Java Docker application’s psql command that does the export. It’s challenging: sometimes it has to export a few megabytes, but when a new table is added, it might have to export 200GB+. And while it’s exporting, the database is still being used in production. So, we have to be as memory- and IO-efficient as possible so as not to harm your application.
The only way we can do this without consuming loads of memory is to use PostgreSQL’s COPY.
COPY differs from other solutions (such as redirecting query results to a file) because it streams the data out to standard output or to a file, so a small amount of memory is enough. The problem is that its output formats are limited to text, CSV, and binary; none of them is JSON. And our data is so complex that the CSV format can’t handle it. Therefore, we chose another output format: JSON.
Exporting to JSON with COPY
To solve this problem, the query itself has to return JSON; then we only have to stream it out to a file or stdout via COPY. It’s actually very easy:
SELECT row_to_json(t) FROM (SELECT * FROM table) t
The problem with this, though, is that when it’s exported via COPY, COPY will mangle the JSON, because by default COPY uses the double quote (") as the CSV quote character: each double quote inside the JSON gets escaped by doubling it.
The solution is a little bit hacky, but works really well:
COPY (SELECT row_to_json(t) FROM (SELECT * FROM table) t) TO STDOUT WITH QUOTE E'\b' CSV
This replaces the default double quote with a backspace character as the CSV quote. This way, COPY won’t ruin the JSON structure: it leaves the double quotes inside the JSON alone and instead wraps each row in backspace characters (since each row contains commas), and backspaces don’t occur in row_to_json’s output. Using this solution, we get a file that contains one JSON document per line, wrapped in backspaces. We simply have to strip those backspace characters from every row:
sed -i -e 's/^.\(.*\).$/\1/' file.json
Note: Depending on the data, we can replace the backspace with any other character that complies with the QUOTE parameter’s requirements and doesn’t occur in the data itself.
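To see the cleanup in action without a live database, we can fake a two-row COPY output (each JSON row wrapped in backspace quotes, since the rows contain commas) and run the same sed command over it. The table rows here are made up for illustration:

```shell
# Simulated "COPY ... QUOTE E'\b' CSV" output: two JSON rows, each wrapped
# in backspace characters (\b). The rows themselves are invented sample data.
printf '\b{"id":1,"name":"alice"}\b\n\b{"id":2,"name":"bob"}\b\n' > file.json

# Strip the first and last character (the backspace "quotes") of every row
sed -i -e 's/^.\(.*\).$/\1/' file.json

cat file.json
```

After the sed pass, file.json holds one valid JSON document per line, which is exactly the newline-delimited JSON that BigQuery load jobs expect.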
At this point, the data is ready to upload with gsutil to Cloud Storage.
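Putting the pieces together, the export step inside the Java Docker application boils down to a command chain like the following. The host, database, table, and bucket names are hypothetical, and the chain is only printed here (via a heredoc) because actually running it needs a live PostgreSQL and gcloud credentials:

```shell
# End-to-end export sketch: COPY streams JSON rows out, sed strips the
# backspace "quotes" from the stream, gsutil uploads the result to GCS.
# Connection string, table, and bucket are made up for illustration.
EXPORT_SKETCH=$(cat <<'EOF'
psql "host=db.internal dbname=prod" -c \
  "COPY (SELECT row_to_json(t) FROM (SELECT * FROM my_table) t) TO STDOUT WITH QUOTE E'\b' CSV" \
  | sed -e 's/^.\(.*\).$/\1/' > my_table.json
gsutil cp my_table.json gs://my-export-bucket/exports/
EOF
)
echo "$EXPORT_SKETCH"
```

Note that here sed filters the stream directly instead of rewriting the file in place with -i; both approaches produce the same result.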
The Good and the Bad
As always, there are good and bad things about this.
On the on-prem side, this solution streams data out efficiently. Also, by using Pub/Sub, we easily avoid any VPN/firewall issues, as we don’t need any network hacks. It’s a scalable, robust solution.
In this solution, Composer provides the orchestration, but it was designed for more complex scenarios. Here it’s a little bit of overkill, as it spins up at least three machines.
In the end, we have a robust, scalable solution. It can export valid JSON from a 1TB database without consuming loads of memory or harming the database. These JSON files can then be easily loaded into BigQuery from GCS so you can start using them.