Mirror of https://github.com/fosrl/newt.git (synced 2026-03-27 04:56:41 +00:00)
Add WebSocket and proxy lifecycle metrics
@@ -6,20 +6,20 @@ This document captures the current state of Newt metrics, prioritized fixes, and
- Export: Prometheus exposition (default), optional OTLP (gRPC)
- Existing instruments:
- Sites: newt_site_registrations_total, newt_site_online (0/1), newt_site_last_heartbeat_seconds
- Sites: newt_site_registrations_total, newt_site_online (0/1), newt_site_last_heartbeat_timestamp_seconds
- Tunnel/Traffic: newt_tunnel_sessions, newt_tunnel_bytes_total, newt_tunnel_latency_seconds, newt_tunnel_reconnects_total
- Connection lifecycle: newt_connection_attempts_total, newt_connection_errors_total
- Operations: newt_config_reloads_total, newt_restart_count_total, newt_build_info
- Operations: newt_config_reloads_total, newt_restart_count_total, newt_config_apply_seconds, newt_cert_rotation_total
- Operations: newt_config_reloads_total, process_start_time_seconds, newt_build_info
- Operations: newt_config_reloads_total, process_start_time_seconds, newt_config_apply_seconds, newt_cert_rotation_total
- Build metadata: newt_build_info
- Control plane: newt_websocket_connect_latency_seconds, newt_websocket_messages_total
- Proxy: newt_proxy_active_connections, newt_proxy_buffer_bytes, newt_proxy_async_backlog_bytes, newt_proxy_drops_total
- Control plane: newt_websocket_connect_latency_seconds, newt_websocket_messages_total, newt_websocket_connected, newt_websocket_reconnects_total
- Proxy: newt_proxy_active_connections, newt_proxy_buffer_bytes, newt_proxy_async_backlog_bytes, newt_proxy_drops_total, newt_proxy_accept_total, newt_proxy_connection_duration_seconds, newt_proxy_connections_total
- Go runtime: GC, heap, goroutines via runtime instrumentation
2) Main issues addressed now
- Attribute filter (allow-list) extended to include site_id and region in addition to existing keys (tunnel_id, transport, protocol, direction, result, reason, error_type, version, commit).
- site_id and region propagation: site_id is now attached as a metric label across newt_*; region is added as a metric label when set. Both remain resource attributes for consistency with OTEL.
- site_id and region propagation: site_id/region remain resource attributes. Metric labels mirror them on per-site gauges and counters by default; set `NEWT_METRICS_INCLUDE_SITE_LABELS=false` to drop them for multi-tenant scrapes.
- Label semantics clarified:
- transport: control-plane mechanism (e.g., websocket, wireguard)
- protocol: L4 payload type (tcp, udp)
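
The allow-list and the site-label toggle described in this hunk can be sketched with plain maps. This is a minimal illustration only: the real filter is an OpenTelemetry view configured in `buildMeterProvider`, and `allowedKeys` and `filterLabels` are hypothetical names. The sketch also ignores the exception for per-site gauges.

```go
package main

import (
	"fmt"
	"sort"
)

// allowedKeys mirrors the allow-list named in the text. It is an illustration,
// not the structure Newt uses internally.
var allowedKeys = map[string]bool{
	"tunnel_id": true, "transport": true, "protocol": true, "direction": true,
	"result": true, "reason": true, "error_type": true, "version": true,
	"commit": true, "site_id": true, "region": true,
}

// filterLabels returns a copy of labels that contains only allow-listed keys.
// When includeSiteLabels is false, site_id and region are dropped as well.
func filterLabels(labels map[string]string, includeSiteLabels bool) map[string]string {
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		if !allowedKeys[k] {
			continue
		}
		if !includeSiteLabels && (k == "site_id" || k == "region") {
			continue
		}
		out[k] = v
	}
	return out
}

func main() {
	in := map[string]string{"site_id": "acme-edge-1", "protocol": "tcp", "user_agent": "curl"}
	out := filterLabels(in, false)

	// Sort the surviving keys so the printed order is stable.
	keys := make([]string, 0, len(out))
	for k := range out {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	fmt.Println(keys)
}
```

With site labels disabled, only `protocol` survives in this example: `user_agent` is not on the allow-list and `site_id` is removed by the toggle.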
@@ -29,10 +29,9 @@ This document captures the current state of Newt metrics, prioritized fixes, and
3) Remaining gaps and deviations
- Some call sites still need initiator label on reconnect outcomes (client vs server). This is planned.
- WebSocket and Proxy metrics (connect latency, messages, active connections, buffer/drops, async backlog) are planned additions.
- Config apply duration and cert rotation counters are planned.
- Registration and config reload failures are not yet emitted; add failure code paths so result labels expose churn.
- Restart counter increments only when build metadata is provided; consider decoupling to count all boots.
- Document using `process_start_time_seconds` (and `time()` in PromQL) to derive uptime; no explicit restart counter is needed.
- Metric helpers often use `context.Background()`. Where lightweight contexts exist (e.g., HTTP handlers), propagate them to ease future correlation.
- Tracing coverage is limited to admin HTTP and WebSocket connect spans; extend to blueprint fetches, proxy accept loops, and WireGuard updates when OTLP is enabled.
@@ -44,8 +43,6 @@ This document captures the current state of Newt metrics, prioritized fixes, and
- Correct label semantics (transport vs protocol); fix sessions transport labelling
- Documentation alignment
- Phase 2 (next)
- WebSocket: newt_websocket_connect_latency_seconds; newt_websocket_messages_total{direction,msg_type}
- Proxy: newt_proxy_active_connections, newt_proxy_buffer_bytes, newt_proxy_drops_total, newt_proxy_async_backlog_bytes
- Reconnect: add initiator label (client/server)
- Config & PKI: newt_config_apply_seconds{phase,result}; newt_cert_rotation_total{result}
- WebSocket disconnect and keepalive failure counters
@@ -66,7 +63,7 @@ This document captures the current state of Newt metrics, prioritized fixes, and
- Sustained connection errors:
- sum by (site_id,transport,error_type) (rate(newt_connection_errors_total[5m]))
- Heartbeat gaps:
- max_over_time(newt_site_last_heartbeat_seconds[15m]) by (site_id)
- max by (site_id) (max_over_time((time() - newt_site_last_heartbeat_timestamp_seconds)[15m:]))
- Proxy drops:
- sum by (site_id,protocol) (increase(newt_proxy_drops_total[5m]))
- WebSocket connect p95 (when added):
@@ -27,6 +27,7 @@ Enable exporters via environment variables (no code changes required)
- OTEL_RESOURCE_ATTRIBUTES=service.instance.id=<id>,site_id=<id>
- OTEL_METRIC_EXPORT_INTERVAL=15s (default)
- NEWT_ADMIN_ADDR=127.0.0.1:2112 (default admin HTTP with /metrics)
- NEWT_METRICS_INCLUDE_SITE_LABELS=true|false (default: true; disable to drop site_id/region as metric labels and rely on resource attributes only)
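
A boolean toggle such as `NEWT_METRICS_INCLUDE_SITE_LABELS` could be read as in the sketch below. The helper names are hypothetical, and the fallback to the default for empty or unparsable values is an assumption: the text only states that the default is `true`.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// parseBoolDefault interprets raw as a boolean and falls back to def when raw
// is empty or cannot be parsed. The fallback behaviour is an assumption.
func parseBoolDefault(raw string, def bool) bool {
	if raw == "" {
		return def
	}
	v, err := strconv.ParseBool(raw)
	if err != nil {
		return def
	}
	return v
}

// includeSiteLabels reads the toggle from the environment with a default of true.
func includeSiteLabels() bool {
	return parseBoolDefault(os.Getenv("NEWT_METRICS_INCLUDE_SITE_LABELS"), true)
}

func main() {
	fmt.Println("site labels enabled:", includeSiteLabels())
}
```

`strconv.ParseBool` also accepts forms such as `1`, `0`, `t`, and `f`; whether Newt accepts anything beyond `true` and `false` is not stated here.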
Runtime behavior
@@ -36,12 +37,14 @@ Runtime behavior
Metric catalog (current)
Unless otherwise noted, `site_id` and `region` are available via resource attributes and, by default, as metric labels. Set `NEWT_METRICS_INCLUDE_SITE_LABELS=false` to drop them from counter/histogram label sets in high-cardinality environments.
| Metric | Instrument | Key attributes | Purpose | Example |
| --- | --- | --- | --- | --- |
| `newt_build_info` | Observable gauge (Int64) | `version`, `commit`, `site_id`, `region` (optional) | Emits build metadata with value `1` for scrape-time verification. | `newt_build_info{version="1.5.0",site_id="acme-edge-1"} 1` |
| `newt_build_info` | Observable gauge (Int64) | `version`, `commit`, `site_id`, `region` (optional when site labels enabled) | Emits build metadata with value `1` for scrape-time verification. | `newt_build_info{version="1.5.0"} 1` |
| `newt_site_registrations_total` | Counter (Int64) | `result` (`success`/`failure`), `site_id`, `region` (optional) | Counts Pangolin registration attempts. | `newt_site_registrations_total{result="success",site_id="acme-edge-1"} 1` |
| `newt_site_online` | Observable gauge (Int64) | `site_id` | Reports whether the site is currently connected (`1`) or offline (`0`). | `newt_site_online{site_id="acme-edge-1"} 1` |
| `newt_site_last_heartbeat_seconds` | Observable gauge (Float64) | `site_id` | Time since the most recent Pangolin heartbeat. | `newt_site_last_heartbeat_seconds{site_id="acme-edge-1"} 2.4` |
| `newt_site_last_heartbeat_timestamp_seconds` | Observable gauge (Float64) | `site_id` | Unix timestamp of the most recent Pangolin heartbeat (derive age via `time() - metric`). | `newt_site_last_heartbeat_timestamp_seconds{site_id="acme-edge-1"} 1.728e+09` |
| `newt_tunnel_sessions` | Observable gauge (Int64) | `site_id`, `tunnel_id` (when enabled) | Counts active tunnel sessions per peer; collapses to per-site when tunnel IDs are disabled. | `newt_tunnel_sessions{site_id="acme-edge-1",tunnel_id="wgpub..."} 3` |
| `newt_tunnel_bytes_total` | Counter (Int64) | `direction` (`ingress`/`egress`), `protocol` (`tcp`/`udp`), `tunnel_id` (optional), `site_id`, `region` (optional) | Measures proxied traffic volume across tunnels. | `newt_tunnel_bytes_total{direction="ingress",protocol="tcp",site_id="acme-edge-1"} 4096` |
| `newt_tunnel_latency_seconds` | Histogram (Float64) | `transport` (e.g., `wireguard`), `tunnel_id` (optional), `site_id`, `region` (optional) | Captures RTT or configuration-driven latency samples. | `newt_tunnel_latency_seconds_bucket{transport="wireguard",le="0.5"} 42` |
@@ -49,15 +52,18 @@ Metric catalog (current)
| `newt_connection_attempts_total` | Counter (Int64) | `transport` (`auth`/`websocket`), `result`, `site_id`, `region` (optional) | Measures control-plane dial attempts and their outcomes. | `newt_connection_attempts_total{transport="websocket",result="success",site_id="acme-edge-1"} 8` |
| `newt_connection_errors_total` | Counter (Int64) | `transport`, `error_type`, `site_id`, `region` (optional) | Buckets connection failures by normalized error class. | `newt_connection_errors_total{transport="websocket",error_type="tls_handshake",site_id="acme-edge-1"} 1` |
| `newt_config_reloads_total` | Counter (Int64) | `result`, `site_id`, `region` (optional) | Counts remote blueprint/config reloads. | `newt_config_reloads_total{result="success",site_id="acme-edge-1"} 3` |
| `newt_restart_count_total` | Counter (Int64) | `site_id`, `region` (optional) | Increments once per process boot to detect restarts. | `newt_restart_count_total{site_id="acme-edge-1"} 1` |
| `process_start_time_seconds` | Observable gauge (Float64) | — | Unix timestamp of the Newt process start time (use `time() - process_start_time_seconds` for uptime). | `process_start_time_seconds 1.728e+09` |
| `newt_config_apply_seconds` | Histogram (Float64) | `phase` (`interface`/`peer`), `result`, `site_id`, `region` (optional) | Measures time spent applying WireGuard configuration phases. | `newt_config_apply_seconds_sum{phase="peer",result="success",site_id="acme-edge-1"} 0.48` |
| `newt_cert_rotation_total` | Counter (Int64) | `result`, `site_id`, `region` (optional) | Tracks client certificate rotation attempts. | `newt_cert_rotation_total{result="success",site_id="acme-edge-1"} 2` |
| `newt_websocket_connect_latency_seconds` | Histogram (Float64) | `transport="websocket"`, `result`, `error_type` (on failure), `site_id`, `region` (optional) | Measures WebSocket dial latency and exposes failure buckets. | `newt_websocket_connect_latency_seconds_bucket{result="success",le="0.5",site_id="acme-edge-1"} 9` |
| `newt_websocket_messages_total` | Counter (Int64) | `direction` (`in`/`out`), `msg_type` (`text`/`ping`/`pong`), `site_id`, `region` (optional) | Accounts for control WebSocket traffic volume by type. | `newt_websocket_messages_total{direction="out",msg_type="ping",site_id="acme-edge-1"} 12` |
| `newt_websocket_connected` | Observable gauge (Int64) | `site_id`, `region` (optional) | Reports current WebSocket connectivity (`1` when connected). | `newt_websocket_connected{site_id="acme-edge-1"} 1` |
| `newt_websocket_reconnects_total` | Counter (Int64) | `reason` (`tls_handshake`, `dial_timeout`, `io_error`, `ping_write`, `timeout`, etc.), `site_id`, `region` (optional) | Counts reconnect attempts with normalized reasons for failure analysis. | `newt_websocket_reconnects_total{reason="timeout",site_id="acme-edge-1"} 3` |
| `newt_proxy_active_connections` | Observable gauge (Int64) | `protocol` (`tcp`/`udp`), `direction` (`ingress`/`egress`), `tunnel_id` (optional), `site_id`, `region` (optional) | Current proxy connections per tunnel and protocol. | `newt_proxy_active_connections{protocol="tcp",direction="egress",site_id="acme-edge-1"} 4` |
| `newt_proxy_buffer_bytes` | Observable gauge (Int64) | `protocol`, `direction`, `tunnel_id` (optional), `site_id`, `region` (optional) | Volume of buffered data awaiting flush in proxy queues. | `newt_proxy_buffer_bytes{protocol="udp",direction="egress",site_id="acme-edge-1"} 2048` |
| `newt_proxy_async_backlog_bytes` | Observable gauge (Int64) | `protocol`, `direction`, `tunnel_id` (optional), `site_id`, `region` (optional) | Tracks async write backlog when deferred flushing is enabled. | `newt_proxy_async_backlog_bytes{protocol="tcp",direction="egress",site_id="acme-edge-1"} 512` |
| `newt_proxy_drops_total` | Counter (Int64) | `protocol`, `tunnel_id` (optional), `site_id`, `region` (optional) | Counts proxy drop events caused by downstream write errors. | `newt_proxy_drops_total{protocol="udp",site_id="acme-edge-1"} 1` |
| `newt_proxy_connections_total` | Counter (Int64) | `event` (`opened`/`closed`), `protocol`, `tunnel_id` (optional), `site_id`, `region` (optional) | Tracks proxy connection lifecycle events for rate/SLO calculations. | `newt_proxy_connections_total{event="opened",protocol="tcp",site_id="acme-edge-1"} 10` |
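
Two gauges in the catalog report Unix timestamps rather than durations (`newt_site_last_heartbeat_timestamp_seconds` and `process_start_time_seconds`), so consumers subtract them from the current time, just as `time() - metric` does in PromQL. The sketch below shows that arithmetic; the function names and the 30-second staleness threshold are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// ageSeconds returns the seconds elapsed between a Unix timestamp sample and
// now. Clock skew could make the result negative, so it is clamped to zero.
func ageSeconds(nowUnix, sampleUnix float64) float64 {
	age := nowUnix - sampleUnix
	if age < 0 {
		return 0
	}
	return age
}

// heartbeatStale reports whether the last heartbeat is older than maxAgeSeconds.
func heartbeatStale(nowUnix, lastHeartbeatUnix, maxAgeSeconds float64) bool {
	return ageSeconds(nowUnix, lastHeartbeatUnix) > maxAgeSeconds
}

func main() {
	now := float64(time.Now().UnixNano()) / 1e9

	// Pretend the process started ten minutes ago and the last heartbeat
	// arrived 45 seconds ago.
	processStart := now - 600
	lastHeartbeat := now - 45

	fmt.Printf("uptime: %.0fs\n", ageSeconds(now, processStart))
	fmt.Println("heartbeat stale:", heartbeatStale(now, lastHeartbeat, 30))
}
```

Exporting the timestamp keeps the gauge value stable between heartbeats, so the age is always computed at query time rather than at the last collection.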
Conventions
@@ -174,7 +180,7 @@ sum(newt_tunnel_sessions)
Compatibility notes
- Gauges do not use the _total suffix (e.g., newt_tunnel_sessions).
- site_id is emitted as both resource attribute and metric label on all newt_* series; region is included as a metric label only when set. tunnel_id is a metric label (WireGuard public key). Never expose secrets in labels.
- site_id/region remain resource attributes. Metric labels for these fields appear on per-site gauges (e.g., `newt_site_online`) and, by default, on counters/histograms; disable them with `NEWT_METRICS_INCLUDE_SITE_LABELS=false` if needed. `tunnel_id` is a metric label (WireGuard public key). Never expose secrets in labels.
- NEWT_METRICS_INCLUDE_TUNNEL_ID (default: true) toggles whether tunnel_id is included as a label on bytes/sessions/proxy/reconnect metrics. Disable in high-cardinality environments.
- Avoid double-scraping: scrape either Newt (/metrics) or the Collector's Prometheus exporter, not both.
- Prometheus does not accept remote_write; use Mimir/Cortex/VM/Thanos-Receive for remote_write.
@@ -16,7 +16,7 @@ A global attribute filter (see `buildMeterProvider`) constrains exposed label ke
exports stay bounded.
- **Site lifecycle**: `newt_site_registrations_total`, `newt_site_online`, and
`newt_site_last_heartbeat_seconds` capture registration attempts and liveness. They
`newt_site_last_heartbeat_timestamp_seconds` capture registration attempts and liveness. They
are fed either manually (`IncSiteRegistration`) or via the `TelemetryView` state
callback that publishes observable gauges for the active site.
- **Tunnel health and usage**: Counters and histograms track bytes, latency, reconnects,
@@ -27,17 +27,20 @@ exports stay bounded.
`newt_connection_errors_total` are emitted throughout the WebSocket client to classify
authentication, dial, and transport failures.
- **Operations/configuration**: `newt_config_reloads_total`,
`newt_restart_count_total`, `newt_config_apply_seconds`, and
`process_start_time_seconds`, `newt_config_apply_seconds`, and
`newt_cert_rotation_total` provide visibility into blueprint reloads, process boots,
configuration timings, and certificate rotation outcomes.
- **Build metadata**: `newt_build_info` records the binary version/commit together
with a monotonic restart counter when build information is supplied at startup.
- **WebSocket control-plane**: `newt_websocket_connect_latency_seconds` and
`newt_websocket_messages_total` report connect latency and ping/pong/text activity.
with optional site metadata when build information is supplied at startup.
- **WebSocket control-plane**: `newt_websocket_connect_latency_seconds`,
`newt_websocket_messages_total`, `newt_websocket_connected`, and
`newt_websocket_reconnects_total` report connect latency, ping/pong/text activity,
connection state, and reconnect reasons.
- **Proxy data-plane**: Observable gauges (`newt_proxy_active_connections`,
`newt_proxy_buffer_bytes`, `newt_proxy_async_backlog_bytes`) and the
`newt_proxy_drops_total` counter are fed from the proxy manager to monitor backlog
and drop behaviour alongside per-protocol byte counters.
`newt_proxy_buffer_bytes`, `newt_proxy_async_backlog_bytes`) plus counters for
drops, accepts, connection lifecycle events (`newt_proxy_connections_total`), and
duration histograms (`newt_proxy_connection_duration_seconds`) surface backlog,
drop behaviour, and churn alongside per-protocol byte counters.
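
The proxy lifecycle metrics named above fit together roughly as in the following sketch, which uses only the standard library. `proxyStats`, its methods, and the bucket bounds are hypothetical; Newt records these values through OpenTelemetry instruments, and its active-connection gauge is fed by the proxy manager rather than derived from the counters as it is here.

```go
package main

import (
	"fmt"
	"sync"
)

// durationBuckets are illustrative upper bounds in seconds, not Newt's real buckets.
var durationBuckets = []float64{0.1, 1, 10, 60}

// proxyStats is a hypothetical stand-in for the counters and histogram named above.
type proxyStats struct {
	mu      sync.Mutex
	opened  int64   // newt_proxy_connections_total{event="opened"}
	closed  int64   // newt_proxy_connections_total{event="closed"}
	buckets []int64 // cumulative counts per bound; the last slot is +Inf
}

func newProxyStats() *proxyStats {
	return &proxyStats{buckets: make([]int64, len(durationBuckets)+1)}
}

// connOpened records an accepted connection.
func (s *proxyStats) connOpened() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.opened++
}

// connClosed records a closed connection and its lifetime in seconds.
// Buckets are cumulative, as in the Prometheus `le` convention.
func (s *proxyStats) connClosed(seconds float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.closed++
	for i, le := range durationBuckets {
		if seconds <= le {
			s.buckets[i]++
		}
	}
	s.buckets[len(durationBuckets)]++ // +Inf always matches
}

// active derives the number of open connections from the two counters.
func (s *proxyStats) active() int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.opened - s.closed
}

func main() {
	s := newProxyStats()
	s.connOpened()
	s.connOpened()
	s.connClosed(0.5)
	fmt.Println("active:", s.active(), "buckets:", s.buckets)
}
```

Keeping separate `opened` and `closed` events lets a dashboard compute churn with `rate()` on each series, while the duration histogram shows whether connections are short-lived or long-running.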
Refer to `docs/observability.md` for a tabular catalogue with instrument types,
attributes, and sample exposition lines.
@@ -61,8 +64,9 @@ The implementation adheres to most OTel Go recommendations:
suffixes for counters and `_seconds`/`_bytes` unit conventions. Histograms are
registered with explicit second-based buckets.
- **Resource attributes** – Service name/version and optional `site_id`/`region`
populate the `resource.Resource` and are also injected as metric attributes for
compatibility with Prometheus queries.
populate the `resource.Resource`. Metric labels mirror these by default (and on
per-site gauges) but can be disabled with `NEWT_METRICS_INCLUDE_SITE_LABELS=false`
to avoid unnecessary cardinality growth.
- **Attribute hygiene** – A single attribute filter (`sdkmetric.WithView`) enforces
the allow-list of label keys to prevent accidental high-cardinality emission.
- **Runtime metrics** – Go runtime instrumentation is enabled automatically through
@@ -83,10 +87,9 @@ The review identified a few actionable adjustments:
2. **Surface config reload failures** – `telemetry.IncConfigReload` is invoked with
`result="success"` only. Callers should record a failure result when blueprint
parsing or application aborts before success counters are incremented.
3. **Harmonise restart count behaviour** – `newt_restart_count_total` increments only
when build metadata is provided. Consider moving the increment out of
`RegisterBuildInfo` so the counter advances even for ad-hoc builds without version
strings.
3. **Expose robust uptime** – Document using `time() - process_start_time_seconds`
to derive uptime now that the restart counter has been replaced with a timestamp
gauge.
4. **Propagate contexts where available** – Many emitters call metric helpers with
`context.Background()`. Passing real contexts (when inexpensive) would allow future
exporters to correlate spans and metrics.
@@ -98,17 +101,17 @@ The review identified a few actionable adjustments:
Prioritised additions that would close visibility gaps:
1. **WebSocket disconnect outcomes** – A counter (e.g., `newt_websocket_disconnects_total`)
partitioned by `reason` would complement the existing connect latency histogram and
explain reconnect storms.
2. **Keepalive/heartbeat failures** – Counting ping timeouts or heartbeat misses would
make `newt_site_last_heartbeat_seconds` actionable by providing discrete events.
3. **Proxy connection lifecycle** – Add counters/histograms for proxy accept events and
connection durations to correlate drops with load and backlog metrics.
1. **Config reload error taxonomy** – Split reload attempts into a dedicated
`newt_config_reload_errors_total{phase}` counter to make blueprint validation failures
visible alongside the existing success counter.
2. **Config source visibility** – Export `newt_config_source_info{source,version}` so
operators can audit the active blueprint origin/commit during incidents.
3. **Certificate expiry** – Emit `newt_cert_expiry_timestamp_seconds` (per cert) to
enable proactive alerts before mTLS credentials lapse.
4. **Blueprint/config pull latency** – Measuring Pangolin blueprint fetch durations and
HTTP status distribution would expose slow control-plane operations.
5. **Certificate rotation attempts** – Complement `newt_cert_rotation_total` with a
duration histogram to observe slow PKI updates and detect stuck rotations.
5. **Tunnel setup latency** – Histograms for DNS resolution and tunnel handshakes would
help correlate connect latency spikes with network dependencies.
These metrics rely on data that is already available in the code paths mentioned
above and would round out operational dashboards.
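
As one example of how little extra data these proposals need, the suggested certificate expiry gauge only requires the `NotAfter` field of an already loaded certificate. The sketch below is an assumption-laden illustration: `newt_cert_expiry_timestamp_seconds` is a proposed metric, not an existing one, and both helper functions are hypothetical.

```go
package main

import (
	"crypto/x509"
	"fmt"
	"time"
)

// expiryTimestampSeconds returns the certificate's NotAfter as a Unix timestamp,
// the value a gauge such as newt_cert_expiry_timestamp_seconds could report.
func expiryTimestampSeconds(cert *x509.Certificate) float64 {
	return float64(cert.NotAfter.Unix())
}

// secondsUntilExpiry mirrors the PromQL expression `metric - time()`.
// A negative result means the certificate has already expired.
func secondsUntilExpiry(nowUnix, notAfterUnix float64) float64 {
	return notAfterUnix - nowUnix
}

func main() {
	// A bare struct with only NotAfter set is enough for illustration;
	// real code would use the parsed client certificate.
	cert := &x509.Certificate{NotAfter: time.Now().Add(72 * time.Hour)}

	expiry := expiryTimestampSeconds(cert)
	remaining := secondsUntilExpiry(float64(time.Now().Unix()), expiry)
	fmt.Printf("expires at %.0f, %.0f seconds left\n", expiry, remaining)
}
```

An alert could then fire when the remaining time drops below a chosen threshold, for instance a few days before the credentials lapse.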