Observability

Prometheus metrics, Grafana dashboards, and alerting rules for Nanosync.

Nanosync exports metrics via OpenTelemetry, bridged to a Prometheus scrape endpoint at /metrics. A Grafana dashboard and alerting rules are included in the repository.

Prometheus

GET http://localhost:7600/metrics

Add to prometheus.yml:

scrape_configs:
  - job_name: nanosync
    static_configs:
      - targets: ['localhost:7600']
    metrics_path: /metrics
    scrape_interval: 15s

Kubernetes ServiceMonitor

helm upgrade nanosync deploy/helm/nanosync/ --set serviceMonitor.enabled=true
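If you manage Prometheus Operator resources by hand rather than through the chart, the equivalent ServiceMonitor looks roughly like this. The selector labels and port name below are assumptions about the chart's Service, not confirmed values — check the rendered Service before applying:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nanosync
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: nanosync   # assumed Service label
  endpoints:
    - port: http                         # assumed port name on the Service
      path: /metrics
      interval: 15s
```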

Metrics reference

Pipeline metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| ns_pipeline_events_per_second | Gauge | pipeline | EWMA throughput, events/s |
| ns_pipeline_replication_lag_seconds | Gauge | pipeline | Source-to-sink commit latency (primary SLO metric) |
| ns_pipeline_last_checkpoint_timestamp_seconds | Gauge | pipeline | Unix timestamp of last committed checkpoint |
| ns_pipeline_events_total | Counter | pipeline, table, op | Cumulative events processed |
| ns_pipeline_sink_errors_total | Counter | pipeline, error_type | Failed sink writes |
| ns_pipeline_state | Gauge | pipeline, state | 1 if the pipeline is in this state |
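As a sketch of how these metrics can be consumed programmatically, the snippet below parses the Prometheus text exposition format and flags pipelines breaching the 30 s lag SLO. The sample payload is illustrative, not real Nanosync output:

```python
import re

# A sample of what the /metrics endpoint might return for the lag gauge
# (values are illustrative).
SAMPLE = """\
ns_pipeline_replication_lag_seconds{pipeline="orders-to-bigquery"} 2.4
ns_pipeline_replication_lag_seconds{pipeline="users-to-s3"} 41.0
"""

# Matches one exposition line: metric name, pipeline label, sample value.
LINE = re.compile(r'(\w+)\{pipeline="([^"]+)"\}\s+([0-9.eE+-]+)')

def pipelines_over_slo(text: str, threshold: float = 30.0) -> list[str]:
    """Return pipelines whose replication lag exceeds the SLO threshold."""
    over = []
    for m in LINE.finditer(text):
        name, pipeline, value = m.group(1), m.group(2), float(m.group(3))
        if name == "ns_pipeline_replication_lag_seconds" and value > threshold:
            over.append(pipeline)
    return over

print(pipelines_over_slo(SAMPLE))  # ['users-to-s3']
```

In production you would scrape the live endpoint (or query Prometheus directly) instead of a hardcoded sample, but the parsing logic is the same.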

CDC metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| ns_cdc_table_events_total | Counter | pipeline, table, op | CDC events by operation (insert/update/delete) |
| ns_cdc_table_lag_seconds | Gauge | pipeline, table | Per-table replication lag |

Snapshot metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| ns_snapshot_rows_total | Counter | pipeline, table | Rows backfilled during initial snapshot |
| ns_snapshot_partitions_completed_total | Counter | pipeline, table | Completed snapshot partitions |
| ns_snapshot_phase | Gauge | pipeline | 1 during snapshot, 0 during CDC |
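To watch a backfill from Prometheus, queries along these lines work against the metrics above (standard PromQL; the windows are illustrative, not recommended values):

```promql
# Pipelines currently in the initial-snapshot phase
max by (pipeline) (ns_snapshot_phase) == 1

# Per-table backfill rate over the last 5 minutes
rate(ns_snapshot_rows_total[5m])
```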

SQL Server transaction log (tlog) metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| ns_tlog_read_lag_seconds | Gauge | pipeline | Age of oldest unprocessed transaction log record |
| ns_tlog_gaps_total | Counter | pipeline | LSN gap events that triggered a snapshot fallback |

System metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| ns_buffer_flush_total | Counter | pipeline, reason | AdaptiveBuffer flush events by reason |
| ns_buffer_size_bytes | Gauge | pipeline | Current buffer size in bytes |
| ns_worker_count | Gauge | — | Number of active Nanosync worker instances |

Grafana dashboard

Import the pre-built dashboard at deploy/dashboards/nanosync-overview.json:

curl -X POST \
  -H "Content-Type: application/json" \
  -d @deploy/dashboards/nanosync-overview.json \
  http://admin:password@localhost:3000/api/dashboards/import

Panels: replication lag (P50/P95/P99), throughput, snapshot progress, pipeline state, sink errors, worker fleet.
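If you prefer to build panels by hand, queries along these lines reproduce two of them. Note that ns_pipeline_replication_lag_seconds is a gauge, so percentiles are computed over its recent samples with quantile_over_time rather than histogram_quantile (the windows and quantile are illustrative):

```promql
# P99 replication lag over recent samples of the gauge
quantile_over_time(0.99, ns_pipeline_replication_lag_seconds[5m])

# Throughput per pipeline, events/s
sum by (pipeline) (rate(ns_pipeline_events_total[1m]))
```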


Alerting rules

deploy/alerts/nanosync.yaml:

groups:
  - name: nanosync
    rules:
      - alert: NanosyncReplicationLagHigh
        expr: ns_pipeline_replication_lag_seconds > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag > 30s on {{ $labels.pipeline }}"

      - alert: NanosyncReplicationLagCritical
        expr: ns_pipeline_replication_lag_seconds > 300
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Replication lag > 5m on {{ $labels.pipeline }}"

      - alert: NanosyncPipelineError
        expr: ns_pipeline_state{state="error"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Pipeline {{ $labels.pipeline }} is in error state"

      - alert: NanosyncSinkErrorsHigh
        expr: rate(ns_pipeline_sink_errors_total[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Sink errors on {{ $labels.pipeline }}: {{ $value }}/s"
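The NanosyncSinkErrorsHigh expression alerts on the per-second increase of a counter, not its absolute value. A rough sketch of that computation over a single window (illustrative only — PromQL's rate() additionally extrapolates to window boundaries and handles counter resets):

```python
def counter_rate(v0: float, v1: float, dt: float) -> float:
    """Per-second rate of a monotonically increasing counter over dt seconds.

    A decrease (counter reset) is clamped to zero here; PromQL's rate()
    instead reconstructs the pre-reset value.
    """
    return max(v1 - v0, 0.0) / dt

# 600 new sink errors over a 5-minute window -> 2 errors/s,
# above the 1/s threshold in NanosyncSinkErrorsHigh.
print(counter_rate(1000, 1600, 300))  # 2.0
```

Before loading the rules, promtool check rules deploy/alerts/nanosync.yaml validates the file's syntax.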

Structured logging

Nanosync emits structured logs in JSON (when piped) or colourised text (on TTY).

{"time":"2026-03-11T09:14:02Z","level":"INFO","msg":"checkpoint committed","pipeline":"orders-to-bigquery","lsn":"0/1A2B3C4","events":4231,"lag_ms":12}

nanosync start server --log-format json   # force JSON
nanosync start server --log-format text   # force text
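Because the logs are line-delimited JSON, they are easy to post-process. A minimal sketch that extracts the checkpoint fields from the sample line above:

```python
import json

# The sample checkpoint log line from above.
line = ('{"time":"2026-03-11T09:14:02Z","level":"INFO","msg":"checkpoint committed",'
        '"pipeline":"orders-to-bigquery","lsn":"0/1A2B3C4","events":4231,"lag_ms":12}')

record = json.loads(line)
if record["msg"] == "checkpoint committed":
    print(f'{record["pipeline"]}: {record["events"]} events, lag {record["lag_ms"]} ms')
```

The equivalent shell filter is jq 'select(.msg == "checkpoint committed")'.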

The web UI at http://localhost:7600/app/ provides basic monitoring during development — no Grafana setup needed.