Google BigQuery: Copying data between projects and locations

I’ve worked on a tool called Big Replicate that we’ve open-sourced: it helps copy and synchronise data between datasets, whether those datasets live in the same Google BigQuery project or in different ones.

I’m a huge fan of Google’s BigQuery product: a large-scale, affordable, hosted data warehouse.

Data is stored in Tables, with sets of tables stored within a Dataset (all of which are part of a Google Cloud Project). Datasets can be stored in a number of different locations: the US, EU and Asia.

We use BigQuery most often with our Google Analytics data. Google provide a daily export of all hit data as part of the paid-for version of the product, but that data is automatically placed in a dataset located in the US. If we want to join it with, for instance, our CRM data that’s hosted in the EU, we need to copy the Analytics data into the EU as well.

We’ve written a tool that automatically synchronises data from one dataset into another. It can copy data between datasets within the same project or between datasets in entirely different projects. Further, the datasets can be in different locations (allowing data to be copied from the US to the EU, for instance).

The code is available on GitHub: https://github.com/uswitch/big-replicate, with binaries also available.

For example, to synchronise any Google Analytics session tables:
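A sketch of what the command might look like. The flag names below (other than `--number`, which the text mentions) are assumptions on my part, so check the project’s README for the actual interface:

```shell
# Hypothetical invocation: all flag names except --number are illustrative.
java -cp big-replicate-standalone.jar \
  uswitch.big_replicate.sync \
  --source-project source-project-id \
  --source-dataset 98909919 \
  --destination-project destination-project-id \
  --destination-dataset 98909919 \
  --number 30
```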

The tool finds all tables that exist in the source dataset but not yet in the destination dataset, orders them by name in reverse lexicographic order (so table foo_20160712 is prioritised over foo_20160711), and copies the top --number tables. We use this to help migrate very large datasets over time, where we generally only need the more recent data immediately.
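The selection logic can be sketched in a few lines of Python. This is purely illustrative (the function name `pick_tables` is mine, not the tool's), but it captures the behaviour described above:

```python
def pick_tables(source_tables, destination_tables, number):
    """Choose which tables to copy: those present in the source but
    missing from the destination, newest (by name) first."""
    missing = set(source_tables) - set(destination_tables)
    # Reverse lexicographic order means date-suffixed tables like
    # foo_20160712 sort ahead of foo_20160711.
    return sorted(missing, reverse=True)[:number]

source = ["sessions_20160710", "sessions_20160711", "sessions_20160712"]
destination = ["sessions_20160710"]
print(pick_tables(source, destination, 1))  # ['sessions_20160712']
```

Because date-suffixed table names sort the same way lexicographically as chronologically, no date parsing is needed: a plain string sort already puts the most recent tables first.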