BigQuery Sink
Prerequisites, IAM setup, and full configuration reference for the Nanosync BigQuery sink connector.
Nanosync writes to BigQuery using the Storage Write API in committed-stream mode, which provides exactly-once delivery semantics. CDC events are upserted by primary key using `InsertOrUpdate` semantics — rows are merged on the configured `primary_keys`.
Prerequisites
- A GCP project with the BigQuery API enabled
- A BigQuery dataset created before starting the pipeline (Nanosync creates tables automatically, not datasets)
- A service account or Workload Identity with the required IAM roles
IAM permissions
Grant the following roles to the Nanosync service account:
| Role | Purpose |
|---|---|
| `roles/bigquery.dataEditor` | Create tables and insert rows via the Storage Write API |
| `roles/bigquery.jobUser` | Run jobs for schema operations (ALTER TABLE, CREATE TABLE) |
```shell
SA="nanosync@my-project.iam.gserviceaccount.com"

gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:${SA}" \
  --role="roles/bigquery.dataEditor"

gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:${SA}" \
  --role="roles/bigquery.jobUser"
```
Setup
1. Enable the BigQuery API:

   ```shell
   gcloud services enable bigquery.googleapis.com --project=my-project
   ```

2. Create a service account and key:

   ```shell
   gcloud iam service-accounts create nanosync \
     --display-name="Nanosync Replicator" \
     --project=my-project

   gcloud iam service-accounts keys create nanosync-bq-key.json \
     --iam-account=nanosync@my-project.iam.gserviceaccount.com
   ```

3. Grant IAM roles (see commands above).

4. Create the destination dataset:

   ```shell
   bq mk --dataset --location=US my-project:replication
   ```

   Nanosync creates tables inside the dataset automatically on first run.
On GKE, use Workload Identity instead of a key file. Annotate the Kubernetes service account with the GCP service account email — no `credentials_file` needed.
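For reference, the Workload Identity binding usually takes two steps. This sketch assumes a Kubernetes service account named `nanosync` in namespace `nanosync`; adjust the names to your cluster:

```shell
# Allow the Kubernetes SA to impersonate the GCP service account.
gcloud iam service-accounts add-iam-policy-binding \
  nanosync@my-project.iam.gserviceaccount.com \
  --role="roles/iam.workloadIdentityUser" \
  --member="serviceAccount:my-project.svc.id.goog[nanosync/nanosync]"

# Annotate the Kubernetes SA with the GCP service account email.
kubectl annotate serviceaccount nanosync \
  --namespace nanosync \
  iam.gke.io/gcp-service-account=nanosync@my-project.iam.gserviceaccount.com
```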
Connection configuration
```yaml
connections:
  - name: prod-bigquery
    type: bigquery
    properties:
      project_id: my-gcp-project
      dataset_id: replication
      # credentials_file: /path/to/nanosync-bq-key.json  # omit to use ADC
```
Pipeline configuration
```yaml
pipelines:
  - name: orders-to-bigquery
    source:
      connection: prod-postgres
      tables:
        - public.orders
        - public.order_items
    sink:
      connection: prod-bigquery
      properties:
        table_id: orders             # destination table name
        primary_keys: id             # comma-separated; required for CDC upsert/delete
        partition_type: DAY          # DAY | HOUR | MONTH | YEAR
        partition_field: created_at  # column to partition on; omit for ingestion-time partitioning
```
Sink properties
Connection
| Property | Default | Description |
|---|---|---|
| `project_id` | — | GCP project ID. Required. |
| `dataset_id` | — | BigQuery dataset name. Required. |
| `credentials_file` | — | Path to a service account JSON key. Defaults to `GOOGLE_APPLICATION_CREDENTIALS` or Application Default Credentials. |
Per-pipeline (set in `sink.properties`)
| Property | Default | Description |
|---|---|---|
| `table_id` | — | Destination BigQuery table name. Required. |
| `primary_keys` | — | Comma-separated list of primary-key columns (e.g. `id` or `tenant_id,id`). Required for CDC pipelines — used to merge upserts and apply deletes. |
| `partition_type` | `DAY` | Time-based partitioning granularity: `DAY`, `HOUR`, `MONTH`, or `YEAR`. Set to an empty string to disable partitioning. |
| `partition_field` | — | Column to partition on. If omitted, BigQuery uses ingestion time. Must be a `DATE`, `DATETIME`, or `TIMESTAMP` column. |
| `create_disposition` | `CREATE_IF_NEEDED` | What to do when the destination table does not exist: `CREATE_IF_NEEDED` (create it) or `CREATE_NEVER` (fail). |
| `max_staleness` | — | Maximum acceptable staleness for Storage Write API committed streams (e.g. `10m`). Reduces write amplification on high-frequency tables. Leave empty for default BigQuery behaviour. |
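Putting the optional properties together, a sink block might look like the following sketch (values are illustrative, not recommendations):

```yaml
sink:
  connection: prod-bigquery
  properties:
    table_id: orders
    primary_keys: tenant_id,id        # composite key: comma-separated, no spaces
    partition_type: MONTH             # coarser partitions for slowly growing tables
    partition_field: created_at
    create_disposition: CREATE_NEVER  # fail fast if the table was not pre-created
    max_staleness: 10m                # trade freshness for lower write amplification
```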
Table schema
Nanosync creates the destination table automatically when `create_disposition` is `CREATE_IF_NEEDED` (the default). The schema is inferred from the source table.
Type mapping:
| Source type | BigQuery type |
|---|---|
| `integer`, `bigint`, `int4`, `int8` | INT64 |
| `numeric`, `decimal` | NUMERIC |
| `float`, `double precision`, `float8` | FLOAT64 |
| `text`, `varchar`, `char`, `bpchar` | STRING |
| `boolean`, `bool` | BOOL |
| `timestamp`, `timestamptz` | TIMESTAMP |
| `date` | DATE |
| `time`, `timetz` | TIME |
| `jsonb`, `json` | JSON |
| `bytea` | BYTES |
| `uuid` | STRING |
| `array` | REPEATED field |
Nanosync appends the following CDC metadata columns to every row:
| Column | Type | Description |
|---|---|---|
| `_ns_op` | STRING | Operation: INSERT, UPDATE, or DELETE |
| `_ns_committed_at` | TIMESTAMP | Commit timestamp from the source |
| `_ns_table` | STRING | Fully qualified source table name |
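One way to verify the generated schema, including the appended metadata columns, is to query `INFORMATION_SCHEMA` after the first run. The project, dataset, and table names below follow the examples in this page:

```sql
SELECT column_name, data_type
FROM `my-project.replication.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'orders'
ORDER BY ordinal_position;
```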
How writes work
- Snapshot — rows are inserted in bulk via the Storage Write API using committed streams.
- CDC — each change event is written as an upsert (INSERT or UPDATE → `InsertOrUpdate` by `primary_keys`) or a delete. BigQuery does not support row-level deletes via the Storage Write API, so deletes are recorded as rows with `_ns_op = 'DELETE'`. Filter them in your queries or create a view:

  ```sql
  -- View that excludes deleted rows and keeps the latest version of each row
  CREATE OR REPLACE VIEW `my_project.replication.orders_current` AS
  SELECT * EXCEPT (_ns_op, _ns_committed_at, _row_num)
  FROM (
    SELECT *,
      ROW_NUMBER() OVER (PARTITION BY id ORDER BY _ns_committed_at DESC) AS _row_num
    FROM `my_project.replication.orders`
  )
  WHERE _row_num = 1 AND _ns_op != 'DELETE';
  ```

- Batching — Nanosync flushes when the first of these thresholds fires: 1,000 events, 4 MiB of data, or 100 ms since the last flush.
Limitations
- DELETE operations do not remove rows. BigQuery's Storage Write API does not support row-level deletes. Nanosync writes a sentinel row with `_ns_op = 'DELETE'`. Use the view pattern shown above to present a current-state view to consumers.
- Datasets must exist before the pipeline starts. Nanosync creates tables automatically but not datasets. Create the destination dataset manually with `bq mk --dataset`.
- Partitioning is set at table creation. Changing `partition_type` or `partition_field` after the table exists requires dropping and recreating the table.
- `ARRAY` and complex nested types from the source may not map cleanly. Verify the generated schema after the first pipeline run for tables with array or composite columns.
- BigQuery Storage Write API quota applies per project. High-throughput pipelines may require a quota increase — request one in the GCP console under BigQuery → Storage Write API.
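For the partitioning limitation, one re-partitioning approach is a copy-and-swap, sketched below for the orders example (pause the pipeline first; the `orders_repart` name is only illustrative):

```sql
-- Copy into a new table with the desired partitioning.
CREATE TABLE `my-project.replication.orders_repart`
PARTITION BY TIMESTAMP_TRUNC(created_at, MONTH)
AS SELECT * FROM `my-project.replication.orders`;

-- Swap the tables.
DROP TABLE `my-project.replication.orders`;
ALTER TABLE `my-project.replication.orders_repart`
RENAME TO orders;
```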
Monitoring
```shell
nanosync metrics pipeline orders-to-bigquery
```
Key Prometheus metrics:
| Metric | Description |
|---|---|
| `ns_pipeline_replication_lag_seconds` | End-to-end source-to-sink latency |
| `ns_sink_rows_written_total` | Rows committed to BigQuery |
| `ns_sink_write_errors_total` | Storage Write API errors |
Monitor Storage Write API quota in the GCP console under BigQuery → Storage Write API.
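The lag and error metrics can drive alerts. The Prometheus rule below is a sketch: the 300-second threshold and the `pipeline` label are assumptions to tune for your deployment:

```yaml
groups:
  - name: nanosync-bigquery
    rules:
      - alert: NanosyncBigQueryLagHigh
        # Assumed label: metrics carry a `pipeline` label with the pipeline name.
        expr: ns_pipeline_replication_lag_seconds{pipeline="orders-to-bigquery"} > 300
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "BigQuery sink replication lag above 5 minutes"
      - alert: NanosyncBigQueryWriteErrors
        expr: rate(ns_sink_write_errors_total{pipeline="orders-to-bigquery"}[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Storage Write API errors from Nanosync"
```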
Troubleshooting
quota exceeded errors on the Storage Write API
The project has exhausted its Storage Write API quota. Request a quota increase in the GCP console, or reduce throughput by increasing `max_staleness`.
`NOT_FOUND: Table ... not found` when `create_disposition` is `CREATE_NEVER`
The table does not exist and Nanosync is configured not to create it. Either pre-create the table or switch `create_disposition` to `CREATE_IF_NEEDED`.
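If you pre-create the table, its schema must match what Nanosync would generate, including the CDC metadata columns. A sketch for the orders example (the non-key source columns here are assumptions; use your actual source schema and the type-mapping table above):

```sql
CREATE TABLE IF NOT EXISTS `my-project.replication.orders` (
  id INT64,                    -- primary key from the source
  customer_id INT64,           -- assumed source column
  total NUMERIC,               -- assumed source column
  created_at TIMESTAMP,
  _ns_op STRING,               -- CDC metadata appended by Nanosync
  _ns_committed_at TIMESTAMP,
  _ns_table STRING
)
PARTITION BY TIMESTAMP_TRUNC(created_at, DAY);
```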
DELETE events not removing rows in BigQuery
BigQuery does not support row-level deletes via the Storage Write API. Nanosync records deletes as rows with `_ns_op = 'DELETE'`. Use a BigQuery view or scheduled query to filter these rows for your consumers.
Schema mismatch after a source DDL change
Nanosync detects schema drift and pauses the pipeline. Review the drift in the Nanosync UI, then resume. In widen mode, new columns are added automatically via `ALTER TABLE`.