Concepts

Core concepts and terminology used throughout Nanosync — pipelines, connections, CDC, snapshots, checkpoints, and more.

This page defines the terms used throughout Nanosync’s documentation. Read it once and the rest of the docs will make immediate sense.

Pipeline

A pipeline is the end-to-end replication job. It pairs one source connection with one sink connection, specifies which tables to replicate, and manages the full lifecycle: snapshot → CDC → checkpoint → resume.

Each pipeline is independent. You can run many pipelines against the same source database simultaneously — each gets its own replication slot and tracks its own position.

pipelines:
  - name: orders-to-bigquery   # unique name
    source:
      connection: prod-postgres
      tables: [public.orders]
    sink:
      connection: prod-bigquery
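
Because pipelines are independent, the same source connection can feed several of them at once. A sketch extending the example above — the prod-kafka connection and public.users table are hypothetical, and Kafka is one of the available sinks:

pipelines:
  - name: orders-to-bigquery
    source:
      connection: prod-postgres
      tables: [public.orders]
    sink:
      connection: prod-bigquery
  - name: users-to-kafka        # independent pipeline, same source
    source:
      connection: prod-postgres
      tables: [public.users]
    sink:
      connection: prod-kafka

Each of these two pipelines gets its own replication slot on prod-postgres and tracks its own position.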

Connection

A connection is a named, reusable set of credentials and endpoint configuration for a source database or sink. Connections are defined once and referenced by name across multiple pipelines.

connections:
  - name: prod-postgres       # referenced in pipelines
    type: postgres
    dsn: "postgres://..."

Separating connections from pipelines means you can rotate credentials in one place without touching any pipeline definition.

Source

A source is the database being replicated from. Nanosync reads changes from the source using the database’s native change stream — no polling, no triggers, no additional tables.

Currently available sources: PostgreSQL, SQL Server.

Sink

A sink is the destination where changes land. Nanosync writes to sinks using their native bulk-write APIs for maximum throughput.

Currently available sinks: BigQuery, Kafka, Local filesystem, AlloyDB, Cloud SQL, PostgreSQL, stdout.

CDC — Change Data Capture

CDC is the mechanism for capturing individual row-level changes (INSERT, UPDATE, DELETE) from a database as they happen, without polling or reading the full table.

Each database exposes CDC differently: PostgreSQL through logical decoding of the WAL via a replication slot, SQL Server through CDC capture tables or direct transaction log reads, MySQL through the binlog, and MongoDB through change streams.

CDC is the low-latency path. After the initial snapshot completes, all replication happens through CDC.
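The essence of CDC is that each event describes one row-level change, and replaying the events in order reproduces the table. A minimal illustration in plain Python — this is not Nanosync's implementation, just the shape of the idea, with a hypothetical event format:

```python
# Apply a stream of row-level CDC events to a replica keyed by primary key.
events = [
    {"op": "INSERT", "pk": 1, "row": {"id": 1, "status": "new"}},
    {"op": "UPDATE", "pk": 1, "row": {"id": 1, "status": "shipped"}},
    {"op": "INSERT", "pk": 2, "row": {"id": 2, "status": "new"}},
    {"op": "DELETE", "pk": 2, "row": None},
]

replica = {}  # pk -> current row
for ev in events:
    if ev["op"] == "DELETE":
        replica.pop(ev["pk"], None)  # remove the row if present
    else:                            # INSERT and UPDATE both upsert
        replica[ev["pk"]] = ev["row"]

print(replica)  # {1: {'id': 1, 'status': 'shipped'}}
```

Note that the full table is never scanned: only the four changed rows flow through, which is why CDC stays cheap regardless of table size.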

Snapshot

A snapshot is the initial full-table backfill that runs the first time a pipeline starts. Nanosync reads every row from each configured table and writes it to the sink before switching to CDC.

After the snapshot completes, the pipeline transitions to streaming CDC and never needs to do a full re-read unless the checkpoint is lost.

Checkpoint

A checkpoint is the persisted position marker that tells Nanosync exactly where it left off in the source’s change stream. After each batch is committed to the sink, the checkpoint is written to the embedded state store.

On restart, the pipeline reads the last checkpoint and resumes from that exact position — no duplicates, no data loss, no manual recovery.
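The commit order is what makes resume safe: write the batch to the sink first, then persist the position. A sketch using an embedded SQLite store — the table schema, function names, and the sample LSN value are illustrative, not Nanosync's actual internals:

```python
import sqlite3

# Embedded state store holding one checkpoint row per pipeline.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE checkpoints (pipeline TEXT PRIMARY KEY, position TEXT)")

def commit_batch(pipeline, batch, position):
    # 1) write the batch to the sink (omitted here), then
    # 2) persist the new position so a restart resumes exactly here.
    db.execute(
        "INSERT INTO checkpoints VALUES (?, ?) "
        "ON CONFLICT(pipeline) DO UPDATE SET position = excluded.position",
        (pipeline, position),
    )
    db.commit()

def resume_position(pipeline):
    row = db.execute(
        "SELECT position FROM checkpoints WHERE pipeline = ?", (pipeline,)
    ).fetchone()
    return row[0] if row else None  # None -> no checkpoint, start with a snapshot

commit_batch("orders-to-bigquery", [], "0/16B3748")  # a PostgreSQL-style LSN
print(resume_position("orders-to-bigquery"))  # 0/16B3748
```

Because the checkpoint is only advanced after the sink commit succeeds, a crash between the two steps can at worst re-deliver the last batch, never skip one.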

Source              Checkpoint identifier
PostgreSQL          LSN (Log Sequence Number)
SQL Server CDC      LSN watermark
SQL Server tlog     Transaction log LSN
MySQL               Binlog position or GTID set
MongoDB             Change Stream resume token
WAL — Write-Ahead Log

The WAL is PostgreSQL’s append-only change log. Every committed transaction is written to the WAL before it’s applied to the data files. Nanosync reads the WAL via a replication slot using the pgoutput logical decoding plugin.

The WAL is PostgreSQL-specific terminology. SQL Server has the transaction log, MySQL has the binlog — all serve the same purpose.

Replication Slot

A replication slot is a PostgreSQL construct that tracks how far a consumer (Nanosync) has read the WAL. The slot prevents PostgreSQL from discarding WAL segments that Nanosync hasn’t consumed yet.

Nanosync creates one slot per pipeline automatically (nanosync_slot_<pipeline-name>).

AdaptiveBuffer

The AdaptiveBuffer is Nanosync’s internal micro-batching layer between the CDC decoder and the sink writer. It accumulates Arrow records and flushes them to the sink in micro-batches, trading a small amount of buffering latency for far fewer, larger sink writes.
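A toy micro-batching buffer shows the pattern — this is an illustration under assumed size and age thresholds, not the AdaptiveBuffer's actual flush policy:

```python
import time

class MicroBatchBuffer:
    """Flush when the batch reaches max_rows, or when the oldest
    buffered record reaches max_age seconds, whichever comes first."""

    def __init__(self, sink, max_rows=3, max_age=0.5):
        self.sink, self.max_rows, self.max_age = sink, max_rows, max_age
        self.batch, self.first_at = [], None

    def add(self, record):
        if self.first_at is None:
            self.first_at = time.monotonic()  # age of oldest buffered record
        self.batch.append(record)
        if (len(self.batch) >= self.max_rows
                or time.monotonic() - self.first_at >= self.max_age):
            self.flush()

    def flush(self):
        if self.batch:
            self.sink.append(list(self.batch))  # one bulk write per batch
            self.batch, self.first_at = [], None

flushed = []          # stand-in for the sink writer
buf = MicroBatchBuffer(flushed)
for i in range(7):
    buf.add(i)
buf.flush()           # drain the remainder on shutdown
print(flushed)        # [[0, 1, 2], [3, 4, 5], [6]]
```

The size trigger keeps throughput high under load, while the age trigger bounds latency when the change stream is quiet.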

Arrow records

All events inside Nanosync flow as Apache Arrow columnar records. Arrow is a language-agnostic, zero-copy in-memory format. Using Arrow end-to-end means no JSON marshaling on the hot path and no row-by-row deserialization.
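The difference between row-oriented and columnar layout can be shown with plain Python — Nanosync uses Apache Arrow for this; the snippet below only illustrates the shape of the data, with made-up field names:

```python
# Row-oriented: one record per event, as a decoder might emit them.
rows = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 5},
    {"id": 3, "amount": 12},
]

# Columnar: one contiguous array per field, as Arrow stores a batch.
columns = {
    "id":     [r["id"] for r in rows],
    "amount": [r["amount"] for r in rows],
}

print(columns["id"])           # [1, 2, 3]
print(sum(columns["amount"]))  # 27
```

Operating on whole columns at once is what lets a batch move from decoder to sink without per-row conversion.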

State store

The state store is an embedded SQLite database co-located with the Nanosync binary. It persists checkpoints, schema history, pipeline config, and run history. No external database is required.