Concepts

Core concepts and terminology used throughout Nanosync — pipelines, connections, CDC, snapshots, checkpoints, and more.

This page defines the terms used throughout Nanosync’s documentation. Read it once and the rest of the docs will make immediate sense.

Pipeline

A pipeline is the end-to-end replication job. It pairs one source connection with one sink connection, specifies which tables to replicate, and manages the full lifecycle: snapshot → CDC → checkpoint → resume.

Each pipeline is independent. You can run many pipelines against the same source database simultaneously — each gets its own replication slot and tracks its own position.

pipelines:
  - name: orders-to-bigquery   # unique name
    source:
      connection: prod-postgres
      tables: [public.orders]
    sink:
      connection: prod-bigquery
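
Because pipelines are independent, the same source connection can feed several of them at once. A sketch extending the example above — the prod-kafka connection and public.users table are hypothetical, and Kafka is one of the available sinks:

pipelines:
  - name: orders-to-bigquery
    source:
      connection: prod-postgres
      tables: [public.orders]
    sink:
      connection: prod-bigquery
  - name: users-to-kafka        # independent pipeline, same source
    source:
      connection: prod-postgres
      tables: [public.users]
    sink:
      connection: prod-kafka

Each of these two pipelines gets its own replication slot on prod-postgres and tracks its own position.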

Connection

A connection is a named, reusable set of credentials and endpoint configuration for a source database or sink. Connections are defined once and referenced by name across multiple pipelines.

connections:
  - name: prod-postgres       # referenced in pipelines
    type: postgres
    dsn: "postgres://..."

Separating connections from pipelines means you can rotate credentials in one place without touching any pipeline definition.

Source

A source is the database being replicated from. Nanosync reads changes from the source using the database’s native change stream — no polling, no triggers, no additional tables.

Currently available sources: PostgreSQL, SQL Server.

Sink

A sink is the destination where changes land. Nanosync writes to sinks using their native bulk-write APIs for maximum throughput.

Currently available sinks: BigQuery, Kafka, Local filesystem, AlloyDB, Cloud SQL, PostgreSQL, stdout.

CDC — Change Data Capture

CDC is the mechanism for capturing individual row-level changes (INSERT, UPDATE, DELETE) from a database as they happen, without polling or reading the full table.

Each database exposes CDC differently: PostgreSQL through logical decoding of the WAL via a replication slot, SQL Server through CDC capture tables or direct transaction log reads, MySQL through the binlog, and MongoDB through change streams.

CDC is the low-latency path. After the initial snapshot completes, all replication happens through CDC.
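The essence of CDC is that each event describes one row-level change, and replaying the events in order reproduces the table. A minimal illustration in plain Python — this is not Nanosync's implementation, just the shape of the idea, with a hypothetical event format:

```python
# Apply a stream of row-level CDC events to a replica keyed by primary key.
events = [
    {"op": "INSERT", "pk": 1, "row": {"id": 1, "status": "new"}},
    {"op": "UPDATE", "pk": 1, "row": {"id": 1, "status": "shipped"}},
    {"op": "INSERT", "pk": 2, "row": {"id": 2, "status": "new"}},
    {"op": "DELETE", "pk": 2, "row": None},
]

replica = {}  # pk -> current row
for ev in events:
    if ev["op"] == "DELETE":
        replica.pop(ev["pk"], None)  # remove the row if present
    else:                            # INSERT and UPDATE both upsert
        replica[ev["pk"]] = ev["row"]

print(replica)  # {1: {'id': 1, 'status': 'shipped'}}
```

Note that the full table is never scanned: only the four changed rows flow through, which is why CDC stays cheap regardless of table size.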

Snapshot

A snapshot is the initial full-table backfill that runs the first time a pipeline starts. Nanosync reads every row from each configured table and writes it to the sink before switching to CDC.

After the snapshot completes, the pipeline transitions to streaming CDC and never needs to do a full re-read unless the checkpoint is lost.

Checkpoint

A checkpoint is the persisted position marker that tells Nanosync exactly where it left off in the source’s change stream. After each batch is committed to the sink, the checkpoint is written to the embedded state store.

On restart, the pipeline reads the last checkpoint and resumes from that exact position — no duplicates, no data loss, no manual recovery.
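The commit order is what makes resume safe: write the batch to the sink first, then persist the position. A sketch using an embedded SQLite store — the table schema, function names, and the sample LSN value are illustrative, not Nanosync's actual internals:

```python
import sqlite3

# Embedded state store holding one checkpoint row per pipeline.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE checkpoints (pipeline TEXT PRIMARY KEY, position TEXT)")

def commit_batch(pipeline, batch, position):
    # 1) write the batch to the sink (omitted here), then
    # 2) persist the new position so a restart resumes exactly here.
    db.execute(
        "INSERT INTO checkpoints VALUES (?, ?) "
        "ON CONFLICT(pipeline) DO UPDATE SET position = excluded.position",
        (pipeline, position),
    )
    db.commit()

def resume_position(pipeline):
    row = db.execute(
        "SELECT position FROM checkpoints WHERE pipeline = ?", (pipeline,)
    ).fetchone()
    return row[0] if row else None  # None -> no checkpoint, start with a snapshot

commit_batch("orders-to-bigquery", [], "0/16B3748")  # a PostgreSQL-style LSN
print(resume_position("orders-to-bigquery"))  # 0/16B3748
```

Because the checkpoint is only advanced after the sink commit succeeds, a crash between the two steps can at worst re-deliver the last batch, never skip one.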

Source              Checkpoint identifier
PostgreSQL          LSN (Log Sequence Number)
SQL Server CDC      LSN watermark
SQL Server tlog     Transaction log LSN
MySQL               Binlog position or GTID set
MongoDB             Change Stream resume token
WAL — Write-Ahead Log

The WAL is PostgreSQL’s append-only change log. Every committed transaction is written to the WAL before it’s applied to the data files. Nanosync reads the WAL via a replication slot using the pgoutput logical decoding plugin.

The WAL is PostgreSQL-specific terminology. SQL Server has the transaction log, MySQL has the binlog — all serve the same purpose.

Replication Slot

A replication slot is a PostgreSQL construct that tracks how far a consumer (Nanosync) has read the WAL. The slot prevents PostgreSQL from discarding WAL segments that Nanosync hasn’t consumed yet.

Nanosync creates one slot per pipeline automatically (nanosync_slot_<pipeline-name>).

AdaptiveBuffer

The AdaptiveBuffer is Nanosync’s internal micro-batching layer between the CDC decoder and the sink writer. It accumulates Arrow records and flushes them to the sink in micro-batches, trading a small amount of buffering latency for far fewer, larger sink writes.
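A toy micro-batching buffer shows the pattern — this is an illustration under assumed size and age thresholds, not the AdaptiveBuffer's actual flush policy:

```python
import time

class MicroBatchBuffer:
    """Flush when the batch reaches max_rows, or when the oldest
    buffered record reaches max_age seconds, whichever comes first."""

    def __init__(self, sink, max_rows=3, max_age=0.5):
        self.sink, self.max_rows, self.max_age = sink, max_rows, max_age
        self.batch, self.first_at = [], None

    def add(self, record):
        if self.first_at is None:
            self.first_at = time.monotonic()  # age of oldest buffered record
        self.batch.append(record)
        if (len(self.batch) >= self.max_rows
                or time.monotonic() - self.first_at >= self.max_age):
            self.flush()

    def flush(self):
        if self.batch:
            self.sink.append(list(self.batch))  # one bulk write per batch
            self.batch, self.first_at = [], None

flushed = []          # stand-in for the sink writer
buf = MicroBatchBuffer(flushed)
for i in range(7):
    buf.add(i)
buf.flush()           # drain the remainder on shutdown
print(flushed)        # [[0, 1, 2], [3, 4, 5], [6]]
```

The size trigger keeps throughput high under load, while the age trigger bounds latency when the change stream is quiet.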

Arrow records

All events inside Nanosync flow as Apache Arrow columnar records. Arrow is a language-agnostic, zero-copy in-memory format. Using Arrow end-to-end means no JSON marshaling on the hot path and no row-by-row deserialization.
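The difference between row-oriented and columnar layout can be shown with plain Python — Nanosync uses Apache Arrow for this; the snippet below only illustrates the shape of the data, with made-up field names:

```python
# Row-oriented: one record per event, as a decoder might emit them.
rows = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 5},
    {"id": 3, "amount": 12},
]

# Columnar: one contiguous array per field, as Arrow stores a batch.
columns = {
    "id":     [r["id"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

print(columns["id"])           # [1, 2, 3]
print(sum(columns["amount"]))  # 27
```

Operating on whole columns at once is what lets a batch move from decoder to sink without per-row conversion.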

State store

The state store is an embedded SQLite database co-located with the Nanosync binary. It persists checkpoints, schema history, pipeline config, and run history. No external database is required.