Observability
Prometheus metrics, Grafana dashboards, and alerting rules for Nanosync.
Nanosync exports metrics via OpenTelemetry, bridged to a Prometheus scrape endpoint at /metrics. A Grafana dashboard and alerting rules are included in the repository.
Prometheus
GET http://localhost:7600/metrics
Add to prometheus.yml:
scrape_configs:
  - job_name: nanosync
    static_configs:
      - targets: ['localhost:7600']
    metrics_path: /metrics
    scrape_interval: 15s
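A scrape of /metrics returns standard Prometheus exposition-format text. As a quick sanity check, the payload can be parsed with a few lines of Python; the sample below is illustrative (the metric names come from the reference tables in this page, but the values are made up):

```python
# Minimal sketch: parse a Prometheus exposition-format payload, such as a
# scrape of /metrics would return. Sample values are illustrative only.
SAMPLE = """\
# HELP ns_pipeline_replication_lag_seconds Source-to-sink commit latency
# TYPE ns_pipeline_replication_lag_seconds gauge
ns_pipeline_replication_lag_seconds{pipeline="orders-to-bigquery"} 0.012
ns_pipeline_events_total{pipeline="orders-to-bigquery",table="orders",op="insert"} 4231
"""

def parse_metrics(text):
    """Return {series: value} for non-comment lines."""
    out = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        series, value = line.rsplit(" ", 1)
        out[series] = float(value)
    return out

metrics = parse_metrics(SAMPLE)
print(metrics['ns_pipeline_replication_lag_seconds{pipeline="orders-to-bigquery"}'])
```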
Kubernetes ServiceMonitor
helm upgrade nanosync deploy/helm/nanosync/ --set serviceMonitor.enabled=true
Metrics reference
Pipeline metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| ns_pipeline_events_per_second | Gauge | pipeline | EWMA throughput, events/s |
| ns_pipeline_replication_lag_seconds | Gauge | pipeline | Source-to-sink commit latency — primary SLO metric |
| ns_pipeline_last_checkpoint_timestamp_seconds | Gauge | pipeline | Unix timestamp of last committed checkpoint |
| ns_pipeline_events_total | Counter | pipeline, table, op | Cumulative events processed |
| ns_pipeline_sink_errors_total | Counter | pipeline, error_type | Failed sink writes |
| ns_pipeline_state | Gauge | pipeline, state | 1 if the pipeline is in this state |
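A few example PromQL queries over these metrics, useful as dashboard panels or ad-hoc checks (the queries are suggestions, not part of the shipped dashboard):

```promql
# Per-pipeline event rate over the last 5 minutes
sum by (pipeline) (rate(ns_pipeline_events_total[5m]))

# Worst replication lag across all pipelines (primary SLO signal)
max(ns_pipeline_replication_lag_seconds)

# Seconds since the last committed checkpoint, per pipeline
time() - ns_pipeline_last_checkpoint_timestamp_seconds
```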
CDC metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| ns_cdc_table_events_total | Counter | pipeline, table, op | CDC events by operation (insert/update/delete) |
| ns_cdc_table_lag_seconds | Gauge | pipeline, table | Per-table replication lag |
Snapshot metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| ns_snapshot_rows_total | Counter | pipeline, table | Rows backfilled during initial snapshot |
| ns_snapshot_partitions_completed_total | Counter | pipeline, table | Completed snapshot partitions |
| ns_snapshot_phase | Gauge | pipeline | 1 during snapshot, 0 during CDC |
SQL Server transaction log (tlog) metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| ns_tlog_read_lag_seconds | Gauge | pipeline | Age of oldest unprocessed transaction log record |
| ns_tlog_gaps_total | Counter | pipeline | LSN gap events that triggered a snapshot fallback |
System metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| ns_buffer_flush_total | Counter | pipeline, reason | AdaptiveBuffer flush events by reason |
| ns_buffer_size_bytes | Gauge | pipeline | Current buffer size in bytes |
| ns_worker_count | Gauge | — | Number of active Nanosync worker instances |
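The ns_pipeline_events_per_second gauge is described above as an EWMA (exponentially weighted moving average). A small sketch of how such an estimate behaves; the smoothing factor and update cadence here are illustrative, not Nanosync's internal values:

```python
# Sketch of an EWMA throughput estimate, as the ns_pipeline_events_per_second
# gauge is described. alpha and the 1-second tick are assumptions for the
# example, not Nanosync internals.

def ewma_update(prev, sample, alpha=0.3):
    """Blend a new per-interval rate sample into the running average."""
    return alpha * sample + (1 - alpha) * prev

rate = 0.0
for events_in_interval in [100, 120, 80, 110]:  # events per 1-second tick
    rate = ewma_update(rate, events_in_interval)

print(round(rate, 1))  # smoothed events/s after four ticks
```

Because recent samples are weighted more heavily, the gauge reacts to load changes without jittering on every tick.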
Grafana dashboard
Import the pre-built dashboard at deploy/dashboards/nanosync-overview.json:
curl -X POST \
-H "Content-Type: application/json" \
-d @deploy/dashboards/nanosync-overview.json \
http://admin:password@localhost:3000/api/dashboards/import
Panels: replication lag (P50/P95/P99), throughput, snapshot progress, pipeline state, sink errors, worker fleet.
Alerting rules
deploy/alerts/nanosync.yaml:
groups:
  - name: nanosync
    rules:
      - alert: NanosyncReplicationLagHigh
        expr: ns_pipeline_replication_lag_seconds > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag > 30s on {{ $labels.pipeline }}"
      - alert: NanosyncReplicationLagCritical
        expr: ns_pipeline_replication_lag_seconds > 300
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Replication lag > 5m on {{ $labels.pipeline }}"
      - alert: NanosyncPipelineError
        expr: ns_pipeline_state{state="error"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Pipeline {{ $labels.pipeline }} is in error state"
      - alert: NanosyncSinkErrorsHigh
        expr: rate(ns_pipeline_sink_errors_total[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Sink errors on {{ $labels.pipeline }}: {{ $value }}/s"
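The `for:` clause in these rules means an alert fires only after its expression has been continuously true for the stated duration; a brief dip below the threshold resets the timer. A sketch of that semantics for the lag rule (timestamps and lag samples are made up for the example):

```python
# Illustration of Prometheus `for:` semantics for NanosyncReplicationLagHigh:
# the alert fires only after lag exceeds the threshold continuously for the
# full duration. Sample data below is invented for the example.

THRESHOLD = 30.0   # seconds of lag (the rule's expr)
FOR_SECONDS = 300  # the rule's `for: 5m`

def alert_fires(samples, threshold=THRESHOLD, for_seconds=FOR_SECONDS):
    """samples: list of (unix_ts, lag_seconds) in time order."""
    pending_since = None
    for ts, lag in samples:
        if lag > threshold:
            if pending_since is None:
                pending_since = ts  # start of the continuous breach
            if ts - pending_since >= for_seconds:
                return True
        else:
            pending_since = None  # any dip below threshold resets the timer
    return False

spike = [(t, 45.0) for t in range(0, 241, 15)]      # 4 min above threshold
sustained = [(t, 45.0) for t in range(0, 301, 15)]  # 5 min above threshold
print(alert_fires(spike), alert_fires(sustained))
```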
Structured logging
Nanosync emits structured logs: JSON when output is piped (not a TTY), colourised text when attached to a TTY.
{"time":"2026-03-11T09:14:02Z","level":"INFO","msg":"checkpoint committed","pipeline":"orders-to-bigquery","lsn":"0/1A2B3C4","events":4231,"lag_ms":12}
nanosync start server --log-format json # force JSON
nanosync start server --log-format text # force text
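Because each log line is a single JSON object, records are straightforward to consume in scripts. A minimal sketch parsing the sample checkpoint record shown above:

```python
import json

# Parse a Nanosync JSON log record (the sample line from above).
line = ('{"time":"2026-03-11T09:14:02Z","level":"INFO",'
        '"msg":"checkpoint committed","pipeline":"orders-to-bigquery",'
        '"lsn":"0/1A2B3C4","events":4231,"lag_ms":12}')

record = json.loads(line)
print(record["pipeline"], record["events"], record["lag_ms"])
```

The same approach works for filtering a log stream, e.g. keeping only records where lag_ms exceeds a budget.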
The web UI at http://localhost:7600/app/ provides basic monitoring during development — no Grafana setup needed.