Add WebSocket and proxy lifecycle metrics

This commit is contained in:
Marc Schäfer
2025-10-10 19:15:33 +02:00
parent e04c654292
commit d21f4951e9
14 changed files with 323 additions and 96 deletions

View File

@@ -6,20 +6,20 @@ This document captures the current state of Newt metrics, prioritized fixes, and
- Export: Prometheus exposition (default), optional OTLP (gRPC)
- Existing instruments:
- Sites: newt_site_registrations_total, newt_site_online (0/1), newt_site_last_heartbeat_seconds
- Sites: newt_site_registrations_total, newt_site_online (0/1), newt_site_last_heartbeat_timestamp_seconds
- Tunnel/Traffic: newt_tunnel_sessions, newt_tunnel_bytes_total, newt_tunnel_latency_seconds, newt_tunnel_reconnects_total
- Connection lifecycle: newt_connection_attempts_total, newt_connection_errors_total
- Operations: newt_config_reloads_total, newt_restart_count_total, newt_build_info
- Operations: newt_config_reloads_total, newt_restart_count_total, newt_config_apply_seconds, newt_cert_rotation_total
- Operations: newt_config_reloads_total, process_start_time_seconds, newt_build_info
- Operations: newt_config_reloads_total, process_start_time_seconds, newt_config_apply_seconds, newt_cert_rotation_total
- Build metadata: newt_build_info
- Control plane: newt_websocket_connect_latency_seconds, newt_websocket_messages_total
- Proxy: newt_proxy_active_connections, newt_proxy_buffer_bytes, newt_proxy_async_backlog_bytes, newt_proxy_drops_total
- Control plane: newt_websocket_connect_latency_seconds, newt_websocket_messages_total, newt_websocket_connected, newt_websocket_reconnects_total
- Proxy: newt_proxy_active_connections, newt_proxy_buffer_bytes, newt_proxy_async_backlog_bytes, newt_proxy_drops_total, newt_proxy_accept_total, newt_proxy_connection_duration_seconds, newt_proxy_connections_total
- Go runtime: GC, heap, goroutines via runtime instrumentation
2) Main issues addressed now
- Attribute filter (allow-list) extended to include site_id and region in addition to existing keys (tunnel_id, transport, protocol, direction, result, reason, error_type, version, commit).
- site_id and region propagation: site_id is now attached as a metric label across newt_*; region is added as a metric label when set. Both remain resource attributes for consistency with OTEL.
- site_id and region propagation: site_id/region remain resource attributes. Metric labels mirror them on per-site gauges and counters by default; set `NEWT_METRICS_INCLUDE_SITE_LABELS=false` to drop them for multi-tenant scrapes.
- Label semantics clarified:
- transport: control-plane mechanism (e.g., websocket, wireguard)
- protocol: L4 payload type (tcp, udp)
@@ -29,10 +29,9 @@ This document captures the current state of Newt metrics, prioritized fixes, and
3) Remaining gaps and deviations
- Some call sites still need initiator label on reconnect outcomes (client vs server). This is planned.
- WebSocket and Proxy metrics (connect latency, messages, active connections, buffer/drops, async backlog) are planned additions.
- Config apply duration and cert rotation counters are planned.
- Registration and config reload failures are not yet emitted; add failure code paths so result labels expose churn.
- Restart counter increments only when build metadata is provided; consider decoupling to count all boots.
- Document using `process_start_time_seconds` (and `time()` in PromQL) to derive uptime; no explicit restart counter is needed.
- Metric helpers often use `context.Background()`. Where lightweight contexts exist (e.g., HTTP handlers), propagate them to ease future correlation.
- Tracing coverage is limited to admin HTTP and WebSocket connect spans; extend to blueprint fetches, proxy accept loops, and WireGuard updates when OTLP is enabled.
@@ -44,8 +43,6 @@ This document captures the current state of Newt metrics, prioritized fixes, and
- Correct label semantics (transport vs protocol); fix sessions transport labelling
- Documentation alignment
- Phase 2 (next)
- WebSocket: newt_websocket_connect_latency_seconds; newt_websocket_messages_total{direction,msg_type}
- Proxy: newt_proxy_active_connections, newt_proxy_buffer_bytes, newt_proxy_drops_total, newt_proxy_async_backlog_bytes
- Reconnect: add initiator label (client/server)
- Config & PKI: newt_config_apply_seconds{phase,result}; newt_cert_rotation_total{result}
- WebSocket disconnect and keepalive failure counters
@@ -66,7 +63,7 @@ This document captures the current state of Newt metrics, prioritized fixes, and
- Sustained connection errors:
- rate(newt_connection_errors_total[5m]) by (site_id,transport,error_type)
- Heartbeat gaps:
- max_over_time(newt_site_last_heartbeat_seconds[15m]) by (site_id)
- max_over_time(time() - newt_site_last_heartbeat_timestamp_seconds[15m]) by (site_id)
- Proxy drops:
- increase(newt_proxy_drops_total[5m]) by (site_id,protocol)
- WebSocket connect p95 (when added):