BigQuery Sink
Prerequisites, IAM setup, and full configuration reference for the Nanosync BigQuery sink connector.
Nanosync writes to BigQuery using the Storage Write API in committed-stream mode, which provides exactly-once delivery semantics. CDC events are upserted by primary key using `InsertOrUpdate` semantics — rows are merged on the configured `primary_keys`.
Prerequisites
- A GCP project with the BigQuery API enabled
- A BigQuery dataset created before starting the pipeline (Nanosync creates tables automatically, not datasets)
- A service account or Workload Identity with the required IAM roles
IAM permissions
Grant the following roles to the Nanosync service account:
| Role | Purpose |
|---|---|
| `roles/bigquery.dataEditor` | Create tables and insert rows via the Storage Write API |
| `roles/bigquery.jobUser` | Run jobs for schema operations (ALTER TABLE, CREATE TABLE) |
```shell
SA="nanosync@my-project.iam.gserviceaccount.com"

gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:${SA}" \
  --role="roles/bigquery.dataEditor"

gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:${SA}" \
  --role="roles/bigquery.jobUser"
```
Setup
1. Enable the BigQuery API:

   ```shell
   gcloud services enable bigquery.googleapis.com --project=my-project
   ```

2. Create a service account and key:

   ```shell
   gcloud iam service-accounts create nanosync \
     --display-name="Nanosync Replicator" \
     --project=my-project

   gcloud iam service-accounts keys create nanosync-bq-key.json \
     --iam-account=nanosync@my-project.iam.gserviceaccount.com
   ```

3. Grant IAM roles (see commands above).

4. Create the destination dataset:

   ```shell
   bq mk --dataset --location=US my-project:replication
   ```

   Nanosync creates tables inside the dataset automatically on first run.
On GKE, use Workload Identity instead of a key file. Annotate the Kubernetes service account with the GCP service account email — no `credentials_file` needed.
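For reference, the Workload Identity binding usually takes two steps. This sketch assumes a Kubernetes service account named `nanosync` in namespace `nanosync`; adjust the names to your cluster:

```shell
# Allow the Kubernetes SA to impersonate the GCP service account.
gcloud iam service-accounts add-iam-policy-binding \
  nanosync@my-project.iam.gserviceaccount.com \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:my-project.svc.id.goog[nanosync/nanosync]"

# Annotate the Kubernetes SA with the GCP service account email.
kubectl annotate serviceaccount nanosync \
  --namespace nanosync \
  iam.gke.io/gcp-service-account=nanosync@my-project.iam.gserviceaccount.com
```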
Connection configuration
```yaml
connections:
  - name: prod-bigquery
    type: bigquery
    properties:
      project_id: my-gcp-project
      dataset_id: replication
      # credentials_file: /path/to/nanosync-bq-key.json  # omit to use ADC
```
Pipeline configuration
```yaml
pipelines:
  - name: orders-to-bigquery
    source:
      connection: prod-postgres
      tables:
        - public.orders
        - public.order_items
    sink:
      connection: prod-bigquery
      properties:
        table_id: orders             # destination table name
        primary_keys: id             # comma-separated; required for CDC upsert/delete
        partition_type: DAY          # DAY | HOUR | MONTH | YEAR
        partition_field: created_at  # column to partition on; omit for ingestion-time partitioning
```
Sink properties
Connection
| Property | Default | Description |
|---|---|---|
| `project_id` | — | GCP project ID. Required. |
| `dataset_id` | — | BigQuery dataset name. Required. |
| `credentials_file` | — | Path to a service account JSON key. Defaults to `GOOGLE_APPLICATION_CREDENTIALS` or Application Default Credentials. |
Per-pipeline (set in `sink.properties`)
| Property | Default | Description |
|---|---|---|
| `table_id` | — | Destination BigQuery table name. Required. |
| `primary_keys` | — | Comma-separated list of primary-key columns (e.g. `id` or `tenant_id,id`). Required for CDC pipelines — used to merge upserts and apply deletes. |
| `partition_type` | `DAY` | Time-based partitioning granularity: `DAY`, `HOUR`, `MONTH`, or `YEAR`. Set to an empty string to disable partitioning. |
| `partition_field` | — | Column to partition on. If omitted, BigQuery uses ingestion time. Must be a `DATE`, `DATETIME`, or `TIMESTAMP` column. |
| `create_disposition` | `CREATE_IF_NEEDED` | What to do when the destination table does not exist: `CREATE_IF_NEEDED` (create it) or `CREATE_NEVER` (fail). |
| `max_staleness` | — | Maximum acceptable staleness for Storage Write API committed streams (e.g. `10m`). Reduces write amplification on high-frequency tables. Leave empty for default BigQuery behaviour. |
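Putting the optional properties together, a sink block might look like the following sketch (values are illustrative, not recommendations):

```yaml
sink:
  connection: prod-bigquery
  properties:
    table_id: orders
    primary_keys: tenant_id,id        # composite key: comma-separated, no spaces
    partition_type: MONTH             # coarser partitions for slowly growing tables
    partition_field: created_at
    create_disposition: CREATE_NEVER  # fail fast if the table was not pre-created
    max_staleness: 10m                # trade freshness for lower write amplification
```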
Table schema
Nanosync creates the destination table automatically when `create_disposition` is `CREATE_IF_NEEDED` (the default). The schema is inferred from the source table.
Type mapping:
| Source type | BigQuery type |
|---|---|
| `integer`, `bigint`, `int4`, `int8` | INT64 |
| `numeric`, `decimal` | NUMERIC |
| `float`, `double precision`, `float8` | FLOAT64 |
| `text`, `varchar`, `char`, `bpchar` | STRING |
| `boolean`, `bool` | BOOL |
| `timestamp`, `timestamptz` | TIMESTAMP |
| `date` | DATE |
| `time`, `timetz` | TIME |
| `jsonb`, `json` | JSON |
| `bytea` | BYTES |
| `uuid` | STRING |
| `array` | REPEATED field |
Nanosync appends the following CDC metadata columns to every row:
| Column | Type | Description |
|---|---|---|
| `_ns_op` | STRING | Operation: INSERT, UPDATE, or DELETE |
| `_ns_committed_at` | TIMESTAMP | Commit timestamp from the source |
| `_ns_table` | STRING | Fully qualified source table name |
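One way to verify the generated schema, including the appended metadata columns, is to query `INFORMATION_SCHEMA` after the first run. The project, dataset, and table names below follow the examples in this page:

```sql
SELECT column_name, data_type
FROM `my-project.replication.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'orders'
ORDER BY ordinal_position;
```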
How writes work
- Snapshot — rows are inserted in bulk via the Storage Write API using committed streams.
- CDC — each change event is written as an upsert (INSERT or UPDATE → `InsertOrUpdate` by `primary_keys`) or a delete. BigQuery does not support row-level deletes via the Storage Write API, so deletes are recorded as rows with `_ns_op = 'DELETE'`. Filter them in your queries or create a view:

  ```sql
  -- View that excludes deleted rows and keeps the latest version of each row
  CREATE OR REPLACE VIEW `my_project.replication.orders_current` AS
  SELECT * EXCEPT (_ns_op, _ns_committed_at, _row_num)
  FROM (
    SELECT *,
      ROW_NUMBER() OVER (PARTITION BY id ORDER BY _ns_committed_at DESC) AS _row_num
    FROM `my_project.replication.orders`
  )
  WHERE _row_num = 1 AND _ns_op != 'DELETE';
  ```

- Batching — Nanosync flushes when the first of these thresholds fires: 1,000 events, 4 MiB of data, or 100 ms since the last flush.
Limitations
- DELETE operations do not remove rows. BigQuery's Storage Write API does not support row-level deletes. Nanosync writes a sentinel row with `_ns_op = 'DELETE'`. Use the view pattern shown above to present a current-state view to consumers.
- Datasets must exist before the pipeline starts. Nanosync creates tables automatically but not datasets. Create the destination dataset manually with `bq mk --dataset`.
- Partitioning is set at table creation. Changing `partition_type` or `partition_field` after the table exists requires dropping and recreating the table.
- `ARRAY` and complex nested types from the source may not map cleanly. Verify the generated schema after the first pipeline run for tables with array or composite columns.
- BigQuery Storage Write API quota applies per project. High-throughput pipelines may require a quota increase — request one in the GCP console under BigQuery → Storage Write API.
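For the partitioning limitation, one re-partitioning approach is a copy-and-swap, sketched below for the orders example (pause the pipeline first; the `orders_repart` name is only illustrative):

```sql
-- Copy into a new table with the desired partitioning.
CREATE TABLE `my-project.replication.orders_repart`
PARTITION BY TIMESTAMP_TRUNC(created_at, MONTH)
AS SELECT * FROM `my-project.replication.orders`;

-- Swap the tables.
DROP TABLE `my-project.replication.orders`;
ALTER TABLE `my-project.replication.orders_repart`
RENAME TO orders;
```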
Monitoring
```shell
nanosync metrics pipeline orders-to-bigquery
```
Key Prometheus metrics:
| Metric | Description |
|---|---|
| `ns_pipeline_replication_lag_seconds` | End-to-end source-to-sink latency |
| `ns_sink_rows_written_total` | Rows committed to BigQuery |
| `ns_sink_write_errors_total` | Storage Write API errors |
Monitor Storage Write API quota in the GCP console under BigQuery → Storage Write API.
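The lag and error metrics can drive alerts. The Prometheus rule below is a sketch: the 300-second threshold and the `pipeline` label are assumptions to tune for your deployment:

```yaml
groups:
  - name: nanosync-bigquery
    rules:
      - alert: NanosyncBigQueryLagHigh
        # Assumed label: metrics carry a `pipeline` label with the pipeline name.
        expr: ns_pipeline_replication_lag_seconds{pipeline="orders-to-bigquery"} > 300
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "BigQuery sink replication lag above 5 minutes"
      - alert: NanosyncBigQueryWriteErrors
        expr: rate(ns_sink_write_errors_total{pipeline="orders-to-bigquery"}[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Storage Write API errors from Nanosync"
```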
Troubleshooting
quota exceeded errors on the Storage Write API
The project has exhausted its Storage Write API quota. Request a quota increase in the GCP console, or reduce throughput by increasing `max_staleness`.
`NOT_FOUND: Table ... not found` when `create_disposition` is `CREATE_NEVER`
The table does not exist and Nanosync is configured not to create it. Either pre-create the table or switch `create_disposition` to `CREATE_IF_NEEDED`.
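If you pre-create the table, its schema must match what Nanosync would generate, including the CDC metadata columns. A sketch for the orders example (the non-key source columns here are assumptions; use your actual source schema and the type-mapping table above):

```sql
CREATE TABLE IF NOT EXISTS `my-project.replication.orders` (
  id INT64,                    -- primary key from the source
  customer_id INT64,           -- assumed source column
  total NUMERIC,               -- assumed source column
  created_at TIMESTAMP,
  _ns_op STRING,               -- CDC metadata appended by Nanosync
  _ns_committed_at TIMESTAMP,
  _ns_table STRING
)
PARTITION BY TIMESTAMP_TRUNC(created_at, DAY);
```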
DELETE events not removing rows in BigQuery
BigQuery does not support row-level deletes via the Storage Write API. Nanosync records deletes as rows with `_ns_op = 'DELETE'`. Use a BigQuery view or scheduled query to filter these rows for your consumers.
Schema mismatch after a source DDL change
Nanosync detects schema drift and pauses the pipeline. Review the drift in the Nanosync UI, then resume. In widen mode, new columns are added automatically via `ALTER TABLE`.