BigQuery Sink

Prerequisites, IAM setup, and full configuration reference for the Nanosync BigQuery sink connector.

Nanosync writes to BigQuery using the Storage Write API in committed-stream mode, which provides exactly-once delivery semantics. CDC events are upserted by primary key (InsertOrUpdate semantics): rows are merged on the configured primary_keys.
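
The merge behaviour can be illustrated with a small sketch (plain Python, not Nanosync code; it assumes an upsert replaces the previous version of the row, and models the *logical* effect of a delete even though, as described below, BigQuery records deletes as tombstone rows):

```python
# Simulate InsertOrUpdate merging on primary keys. "events" stand in for a
# stream of CDC events; the dict stands in for the current table state.

def apply_events(events, primary_keys):
    """Apply (op, row) CDC events; rows merge on the primary-key tuple."""
    table = {}
    for op, row in events:
        key = tuple(row[k] for k in primary_keys)
        if op == "DELETE":
            table.pop(key, None)          # logical delete
        else:                             # INSERT or UPDATE -> InsertOrUpdate
            table[key] = row
    return table

events = [
    ("INSERT", {"id": 1, "status": "new"}),
    ("INSERT", {"id": 2, "status": "new"}),
    ("UPDATE", {"id": 1, "status": "paid"}),   # merges on id
    ("DELETE", {"id": 2, "status": "new"}),
]
state = apply_events(events, primary_keys=["id"])
```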

Prerequisites

IAM permissions

Grant the following roles to the Nanosync service account:

Role                          Purpose
roles/bigquery.dataEditor     Create tables and insert rows via the Storage Write API
roles/bigquery.jobUser        Run jobs for schema operations (ALTER TABLE, CREATE TABLE)

SA="nanosync@my-project.iam.gserviceaccount.com"

gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:${SA}" \
  --role="roles/bigquery.dataEditor"

gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:${SA}" \
  --role="roles/bigquery.jobUser"

Setup

  1. Enable the BigQuery API

    gcloud services enable bigquery.googleapis.com --project=my-project
  2. Create a service account and key

    gcloud iam service-accounts create nanosync \
      --display-name="Nanosync Replicator" \
      --project=my-project
    
    gcloud iam service-accounts keys create nanosync-bq-key.json \
      --iam-account=nanosync@my-project.iam.gserviceaccount.com
  3. Grant IAM roles (see commands above)

  4. Create the destination dataset

    bq mk --dataset --location=US my-project:replication

    Nanosync creates tables inside the dataset automatically on first run.

On GKE, use Workload Identity instead of a key file: annotate the Kubernetes service account with the GCP service account email, and omit credentials_file.
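
For example, the Kubernetes service account manifest would carry an annotation like this (the account name and namespace are placeholders; adjust them to your cluster):

```yaml
# Kubernetes ServiceAccount for the Nanosync pods, bound to the GCP service
# account through GKE Workload Identity (no JSON key file required).
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nanosync            # placeholder: the KSA your Nanosync pods run as
  namespace: nanosync       # placeholder namespace
  annotations:
    iam.gke.io/gcp-service-account: nanosync@my-project.iam.gserviceaccount.com
```

The GCP service account additionally needs a roles/iam.workloadIdentityUser binding for this Kubernetes service account; see the GKE Workload Identity documentation for the exact command.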

Connection configuration

connections:
  - name: prod-bigquery
    type: bigquery
    properties:
      project_id:  my-project
      dataset_id:  replication
      # credentials_file: /path/to/nanosync-bq-key.json   # omit to use ADC

Pipeline configuration

pipelines:
  - name: orders-to-bigquery
    source:
      connection: prod-postgres
      tables:
        - public.orders
        - public.order_items
    sink:
      connection: prod-bigquery
      properties:
        table_id:      orders           # destination table name
        primary_keys:  id               # comma-separated; required for CDC upsert/delete
        partition_type: DAY             # DAY | HOUR | MONTH | YEAR
        partition_field: created_at     # column to partition on; omit for ingestion-time partitioning

Sink properties

Connection

Property          Default  Description
project_id        (none)   GCP project ID. Required.
dataset_id        (none)   BigQuery dataset name. Required.
credentials_file  (none)   Path to a service account JSON key. Defaults to GOOGLE_APPLICATION_CREDENTIALS or Application Default Credentials.

Per-pipeline (set in sink.properties)

Property            Default             Description
table_id            (none)              Destination BigQuery table name. Required.
primary_keys        (none)              Comma-separated list of primary key columns (e.g. id or tenant_id,id). Required for CDC pipelines; used to merge upserts and apply deletes.
partition_type      DAY                 Time-based partitioning granularity: DAY, HOUR, MONTH, or YEAR. Set to an empty string to disable partitioning.
partition_field     (ingestion time)    Column to partition on. If omitted, BigQuery uses ingestion time. Must be a DATE, DATETIME, or TIMESTAMP column.
create_disposition  CREATE_IF_NEEDED    What to do when the destination table does not exist: CREATE_IF_NEEDED (create it) or CREATE_NEVER (fail).
max_staleness       (BigQuery default)  Maximum acceptable staleness for Storage Write API committed streams (e.g. 10m). Reduces write amplification on high-frequency tables. Leave empty for default BigQuery behaviour.
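
Before deploying, the per-pipeline properties can be sanity-checked with a short script. This is a sketch, not part of Nanosync; it only encodes the rules stated in the table above:

```python
# Validate the documented sink properties before deploying a pipeline.
VALID_PARTITION_TYPES = {"DAY", "HOUR", "MONTH", "YEAR", ""}  # "" disables partitioning
VALID_DISPOSITIONS = {"CREATE_IF_NEEDED", "CREATE_NEVER"}

def validate_sink_properties(props: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    if not props.get("table_id"):
        errors.append("table_id is required")
    primary_keys = [c.strip() for c in props.get("primary_keys", "").split(",") if c.strip()]
    if not primary_keys:
        errors.append("primary_keys is required for CDC upsert/delete")
    if props.get("partition_type", "DAY") not in VALID_PARTITION_TYPES:
        errors.append("partition_type must be DAY, HOUR, MONTH, YEAR, or empty")
    if props.get("create_disposition", "CREATE_IF_NEEDED") not in VALID_DISPOSITIONS:
        errors.append("create_disposition must be CREATE_IF_NEEDED or CREATE_NEVER")
    return errors

problems = validate_sink_properties(
    {"table_id": "orders", "primary_keys": "tenant_id,id", "partition_type": "DAY"}
)
```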

Table schema

Nanosync creates the destination table automatically when create_disposition is CREATE_IF_NEEDED (the default). The schema is inferred from the source table.

Type mapping:

Source type                        BigQuery type
integer, bigint, int4, int8        INT64
numeric, decimal                   NUMERIC
float, double precision, float8    FLOAT64
text, varchar, char, bpchar        STRING
boolean, bool                      BOOL
timestamp, timestamptz             TIMESTAMP
date                               DATE
time, timetz                       TIME
jsonb, json                        JSON
bytea                              BYTES
uuid                               STRING
array                              REPEATED field
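
The scalar part of this mapping can be expressed as a lookup table (a sketch; array types become a REPEATED field of the mapped element type, which a flat lookup cannot express):

```python
# Scalar source-to-BigQuery type mapping, transcribed from the table above.
TYPE_MAP = {
    "integer": "INT64", "bigint": "INT64", "int4": "INT64", "int8": "INT64",
    "numeric": "NUMERIC", "decimal": "NUMERIC",
    "float": "FLOAT64", "double precision": "FLOAT64", "float8": "FLOAT64",
    "text": "STRING", "varchar": "STRING", "char": "STRING", "bpchar": "STRING",
    "boolean": "BOOL", "bool": "BOOL",
    "timestamp": "TIMESTAMP", "timestamptz": "TIMESTAMP",
    "date": "DATE",
    "time": "TIME", "timetz": "TIME",
    "jsonb": "JSON", "json": "JSON",
    "bytea": "BYTES",
    "uuid": "STRING",
}

def to_bigquery_type(source_type: str) -> str:
    """Map a scalar source type name to its BigQuery type."""
    return TYPE_MAP[source_type.lower()]
```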

Nanosync appends the following CDC metadata columns to every row:

Column             Type       Description
_ns_op             STRING     Operation: INSERT, UPDATE, or DELETE
_ns_committed_at   TIMESTAMP  Commit timestamp from the source
_ns_table          STRING     Fully-qualified source table name

How writes work

  1. Snapshot — rows are inserted in bulk via the Storage Write API using committed streams.

  2. CDC — INSERT and UPDATE events are applied as upserts (InsertOrUpdate on primary_keys). BigQuery does not support row-level deletes via the Storage Write API, so DELETE events are recorded as rows with _ns_op = 'DELETE'. Filter them in your queries or create a view:

    -- View that excludes deleted rows and keeps the latest version of each row
    CREATE OR REPLACE VIEW `my-project.replication.orders_current` AS
    SELECT * EXCEPT (_ns_op, _ns_committed_at, _ns_table, _row_num)
    FROM (
      SELECT *,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY _ns_committed_at DESC) AS _row_num
      FROM `my-project.replication.orders`
    )
    WHERE _row_num = 1 AND _ns_op != 'DELETE';
  3. Batching — Nanosync flushes when the first of these thresholds fires: 1,000 events, 4 MiB of data, or 100 ms since the last flush.
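
The flush rule in step 3 can be sketched as a simple predicate (thresholds taken from the text above; the buffer counters are stand-ins):

```python
# Flush when the FIRST of the three thresholds fires:
# 1,000 buffered events, 4 MiB of buffered data, or 100 ms since the last flush.
MAX_EVENTS = 1_000
MAX_BYTES = 4 * 1024 * 1024   # 4 MiB
MAX_AGE_MS = 100

def should_flush(n_events: int, n_bytes: int, ms_since_last_flush: float) -> bool:
    """True if any one of the batching thresholds has been reached."""
    return (
        n_events >= MAX_EVENTS
        or n_bytes >= MAX_BYTES
        or ms_since_last_flush >= MAX_AGE_MS
    )
```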

Limitations

  - DELETE events are not applied as physical deletes; they are recorded as rows with _ns_op = 'DELETE' (see How writes work).
  - Sustained throughput is bounded by the project's Storage Write API quota.
  - Source DDL changes pause the pipeline until the schema drift is reviewed (in widen mode, added columns are applied automatically).

Monitoring

Inspect pipeline metrics from the CLI:

nanosync metrics pipeline orders-to-bigquery

Key Prometheus metrics:

Metric                               Description
ns_pipeline_replication_lag_seconds  End-to-end source-to-sink latency
ns_sink_rows_written_total           Rows committed to BigQuery
ns_sink_write_errors_total           Storage Write API errors

Monitor Storage Write API quota in the GCP console under BigQuery → Storage Write API.
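
These metrics can back basic alerting. A sketch of a Prometheus alerting rule, assuming the metrics carry a pipeline label (the thresholds are arbitrary starting points; tune them to your SLO):

```yaml
groups:
  - name: nanosync-bigquery
    rules:
      - alert: NanosyncReplicationLagHigh
        expr: ns_pipeline_replication_lag_seconds{pipeline="orders-to-bigquery"} > 300
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag above 5 minutes on {{ $labels.pipeline }}"
      - alert: NanosyncSinkWriteErrors
        expr: rate(ns_sink_write_errors_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Storage Write API errors on {{ $labels.pipeline }}"
```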

Troubleshooting

quota exceeded errors on the Storage Write API
The project has exhausted its Storage Write API quota. Request a quota increase in the GCP console, or reduce throughput by increasing max_staleness.

NOT_FOUND: Table ... not found when create_disposition is CREATE_NEVER
The table does not exist and Nanosync is configured not to create it. Either pre-create the table or switch create_disposition to CREATE_IF_NEEDED.

DELETE events not removing rows in BigQuery
BigQuery does not support row-level deletes via the Storage Write API. Nanosync records deletes as rows with _ns_op = 'DELETE'. Use a BigQuery view or scheduled query to filter these rows for your consumers.
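
A scheduled cleanup query could take the shape below (a sketch: table and key names follow the orders example above, and it permanently removes rows, so verify it against a copy of the table before scheduling it):

```sql
-- Physically remove every version of a key whose most recent record is a
-- DELETE tombstone. Adapt the table name and key column to your pipeline.
DELETE FROM `my-project.replication.orders`
WHERE id IN (
  SELECT id
  FROM (
    SELECT id, _ns_op,
      ROW_NUMBER() OVER (PARTITION BY id ORDER BY _ns_committed_at DESC) AS rn
    FROM `my-project.replication.orders`
  )
  WHERE rn = 1 AND _ns_op = 'DELETE'
);
```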

Schema mismatch after a source DDL change
Nanosync detects schema drift and pauses the pipeline. Review the drift in the Nanosync UI, then resume. In widen mode, new columns are added automatically via ALTER TABLE.