fix(metrics): update metrics recommendations and add OpenTelemetry review documentation

2026-03-26 12:36:45 +00:00 · 2025-10-10 18:18:38 +02:00
parent 1a9f6c4685
commit 46384e6242
3 changed files with 161 additions and 17 deletions
--- a/docs/METRICS_RECOMMENDATIONS.md
+++ b/docs/METRICS_RECOMMENDATIONS.md
@@ -10,6 +10,10 @@ This document captures the current state of Newt metrics, prioritized fixes, and
     - Tunnel/Traffic: newt_tunnel_sessions, newt_tunnel_bytes_total, newt_tunnel_latency_seconds, newt_tunnel_reconnects_total
     - Connection lifecycle: newt_connection_attempts_total, newt_connection_errors_total
     - Operations: newt_config_reloads_total, newt_restart_count_total, newt_build_info
+     - Operations: newt_config_reloads_total, newt_restart_count_total, newt_config_apply_seconds, newt_cert_rotation_total
+     - Build metadata: newt_build_info
+     - Control plane: newt_websocket_connect_latency_seconds, newt_websocket_messages_total
+     - Proxy: newt_proxy_active_connections, newt_proxy_buffer_bytes, newt_proxy_async_backlog_bytes, newt_proxy_drops_total
     - Go runtime: GC, heap, goroutines via runtime instrumentation

 2) Main issues addressed now
@@ -27,6 +31,10 @@ This document captures the current state of Newt metrics, prioritized fixes, and
   - Some call sites still need initiator label on reconnect outcomes (client vs server). This is planned.
   - WebSocket and Proxy metrics (connect latency, messages, active connections, buffer/drops, async backlog) are planned additions.
   - Config apply duration and cert rotation counters are planned.
+   - Registration and config reload failures are not yet emitted; add failure code paths so result labels expose churn.
+   - Restart counter increments only when build metadata is provided; consider decoupling to count all boots.
+   - Metric helpers often use `context.Background()`. Where lightweight contexts exist (e.g., HTTP handlers), propagate them to ease future correlation.
+   - Tracing coverage is limited to admin HTTP and WebSocket connect spans; extend to blueprint fetches, proxy accept loops, and WireGuard updates when OTLP is enabled.

 4) Roadmap (phased)

@@ -40,6 +48,10 @@ This document captures the current state of Newt metrics, prioritized fixes, and
     - Proxy: newt_proxy_active_connections, newt_proxy_buffer_bytes, newt_proxy_drops_total, newt_proxy_async_backlog_bytes
     - Reconnect: add initiator label (client/server)
     - Config & PKI: newt_config_apply_seconds{phase,result}; newt_cert_rotation_total{result}
+     - WebSocket disconnect and keepalive failure counters
+     - Proxy connection lifecycle metrics (accept totals, duration histogram)
+     - Pangolin blueprint/config fetch latency and status metrics
+     - Certificate rotation duration histogram to complement success/failure counter

 5) Operational guidance

@@ -64,9 +76,3 @@ This document captures the current state of Newt metrics, prioritized fixes, and

   - Direct scrape variant requires no attribute promotion since site_id is already a metric label.
   - Transform/promote variant remains optional for environments that rely on resource-to-label promotion.
-
-8) Testing
-
- curl :2112/metrics | grep ^newt_
- Verify presence of site_id across series; region appears when set.
- Ensure disallowed attributes are filtered; allowed (site_id) retained.
--- a/docs/observability.md
+++ b/docs/observability.md
@@ -34,18 +34,30 @@ Runtime behavior
 - When OTLP is enabled, metrics and traces are exported to OTLP gRPC endpoint
 - Go runtime metrics (goroutines, GC, memory) are exported automatically

-Metric catalog (initial)
+Metric catalog (current)

- newt_build_info (gauge) labels: version, commit, site_id[, region]; value is always 1
- newt_site_registrations_total (counter) labels: result, site_id[, region]
- newt_site_online (observable gauge) labels: site_id (0/1)
- newt_site_last_heartbeat_seconds (observable gauge) labels: site_id
- newt_tunnel_sessions (observable gauge) labels: site_id, tunnel_id [transport optional when known]
- newt_tunnel_bytes_total (counter) labels: site_id, tunnel_id, protocol (tcp|udp), direction (ingress|egress)
- newt_tunnel_latency_seconds (histogram) labels: site_id, tunnel_id, transport (e.g., wireguard)
- newt_tunnel_reconnects_total (counter) labels: site_id, tunnel_id, initiator (client|server), reason
- newt_connection_attempts_total (counter) labels: site_id, transport, result
- newt_connection_errors_total (counter) labels: site_id, transport, error_type (dial_timeout|tls_handshake|auth_failed|io_error)
+| Metric | Instrument | Key attributes | Purpose | Example |
+| --- | --- | --- | --- | --- |
+| `newt_build_info` | Observable gauge (Int64) | `version`, `commit`, `site_id`, `region` (optional) | Emits build metadata with value `1` for scrape-time verification. | `newt_build_info{version="1.5.0",site_id="acme-edge-1"} 1` |
+| `newt_site_registrations_total` | Counter (Int64) | `result` (`success`/`failure`), `site_id`, `region` (optional) | Counts Pangolin registration attempts. | `newt_site_registrations_total{result="success",site_id="acme-edge-1"} 1` |
+| `newt_site_online` | Observable gauge (Int64) | `site_id` | Reports whether the site is currently connected (`1`) or offline (`0`). | `newt_site_online{site_id="acme-edge-1"} 1` |
+| `newt_site_last_heartbeat_seconds` | Observable gauge (Float64) | `site_id` | Time since the most recent Pangolin heartbeat. | `newt_site_last_heartbeat_seconds{site_id="acme-edge-1"} 2.4` |
+| `newt_tunnel_sessions` | Observable gauge (Int64) | `site_id`, `tunnel_id` (when enabled) | Counts active tunnel sessions per peer; collapses to per-site when tunnel IDs are disabled. | `newt_tunnel_sessions{site_id="acme-edge-1",tunnel_id="wgpub..."} 3` |
+| `newt_tunnel_bytes_total` | Counter (Int64) | `direction` (`ingress`/`egress`), `protocol` (`tcp`/`udp`), `tunnel_id` (optional), `site_id`, `region` (optional) | Measures proxied traffic volume across tunnels. | `newt_tunnel_bytes_total{direction="ingress",protocol="tcp",site_id="acme-edge-1"} 4096` |
+| `newt_tunnel_latency_seconds` | Histogram (Float64) | `transport` (e.g., `wireguard`), `tunnel_id` (optional), `site_id`, `region` (optional) | Captures RTT or configuration-driven latency samples. | `newt_tunnel_latency_seconds_bucket{transport="wireguard",le="0.5"} 42` |
+| `newt_tunnel_reconnects_total` | Counter (Int64) | `initiator` (`client`/`server`), `reason` (enumerated), `tunnel_id` (optional), `site_id`, `region` (optional) | Tracks reconnect causes for troubleshooting flaps. | `newt_tunnel_reconnects_total{initiator="client",reason="timeout",site_id="acme-edge-1"} 5` |
+| `newt_connection_attempts_total` | Counter (Int64) | `transport` (`auth`/`websocket`), `result`, `site_id`, `region` (optional) | Measures control-plane dial attempts and their outcomes. | `newt_connection_attempts_total{transport="websocket",result="success",site_id="acme-edge-1"} 8` |
+| `newt_connection_errors_total` | Counter (Int64) | `transport`, `error_type`, `site_id`, `region` (optional) | Buckets connection failures by normalized error class. | `newt_connection_errors_total{transport="websocket",error_type="tls_handshake",site_id="acme-edge-1"} 1` |
+| `newt_config_reloads_total` | Counter (Int64) | `result`, `site_id`, `region` (optional) | Counts remote blueprint/config reloads. | `newt_config_reloads_total{result="success",site_id="acme-edge-1"} 3` |
+| `newt_restart_count_total` | Counter (Int64) | `site_id`, `region` (optional) | Increments once per process boot to detect restarts. | `newt_restart_count_total{site_id="acme-edge-1"} 1` |
+| `newt_config_apply_seconds` | Histogram (Float64) | `phase` (`interface`/`peer`), `result`, `site_id`, `region` (optional) | Measures time spent applying WireGuard configuration phases. | `newt_config_apply_seconds_sum{phase="peer",result="success",site_id="acme-edge-1"} 0.48` |
+| `newt_cert_rotation_total` | Counter (Int64) | `result`, `site_id`, `region` (optional) | Tracks client certificate rotation attempts. | `newt_cert_rotation_total{result="success",site_id="acme-edge-1"} 2` |
+| `newt_websocket_connect_latency_seconds` | Histogram (Float64) | `transport="websocket"`, `result`, `error_type` (on failure), `site_id`, `region` (optional) | Measures WebSocket dial latency and exposes failure buckets. | `newt_websocket_connect_latency_seconds_bucket{result="success",le="0.5",site_id="acme-edge-1"} 9` |
+| `newt_websocket_messages_total` | Counter (Int64) | `direction` (`in`/`out`), `msg_type` (`text`/`ping`/`pong`), `site_id`, `region` (optional) | Accounts for control WebSocket traffic volume by type. | `newt_websocket_messages_total{direction="out",msg_type="ping",site_id="acme-edge-1"} 12` |
+| `newt_proxy_active_connections` | Observable gauge (Int64) | `protocol` (`tcp`/`udp`), `direction` (`ingress`/`egress`), `tunnel_id` (optional), `site_id`, `region` (optional) | Current proxy connections per tunnel and protocol. | `newt_proxy_active_connections{protocol="tcp",direction="egress",site_id="acme-edge-1"} 4` |
+| `newt_proxy_buffer_bytes` | Observable gauge (Int64) | `protocol`, `direction`, `tunnel_id` (optional), `site_id`, `region` (optional) | Volume of buffered data awaiting flush in proxy queues. | `newt_proxy_buffer_bytes{protocol="udp",direction="egress",site_id="acme-edge-1"} 2048` |
+| `newt_proxy_async_backlog_bytes` | Observable gauge (Int64) | `protocol`, `direction`, `tunnel_id` (optional), `site_id`, `region` (optional) | Tracks async write backlog when deferred flushing is enabled. | `newt_proxy_async_backlog_bytes{protocol="tcp",direction="egress",site_id="acme-edge-1"} 512` |
+| `newt_proxy_drops_total` | Counter (Int64) | `protocol`, `tunnel_id` (optional), `site_id`, `region` (optional) | Counts proxy drop events caused by downstream write errors. | `newt_proxy_drops_total{protocol="udp",site_id="acme-edge-1"} 1` |

 Conventions

--- a/docs/otel-review.md
+++ b/docs/otel-review.md
@@ -0,0 +1,126 @@
+# Newt OpenTelemetry Review
+
+## Overview
+
+This document summarises the current OpenTelemetry (OTel) instrumentation in Newt, assesses
+compliance with OTel guidelines, and lists concrete improvements to pursue before release.
+It is based on the implementation in `internal/telemetry` and the call-sites that emit
+metrics and traces across the code base.
+
+## Current metric instrumentation
+
+All instruments are registered in `internal/telemetry/metrics.go`. They are grouped
+into site, tunnel, connection, configuration, build, WebSocket, and proxy domains.
+A global attribute filter (see `buildMeterProvider`) constrains exposed label keys to
+`site_id`, `region`, and a curated list of low-cardinality dimensions so that Prometheus
+exports stay bounded.
+
+- **Site lifecycle**: `newt_site_registrations_total`, `newt_site_online`, and
+  `newt_site_last_heartbeat_seconds` capture registration attempts and liveness. They
+  are fed either manually (`IncSiteRegistration`) or via the `TelemetryView` state
+  callback that publishes observable gauges for the active site.
+- **Tunnel health and usage**: Counters and histograms track bytes, latency, reconnects,
+  and active sessions per tunnel (`newt_tunnel_*` family). Attribute helpers respect
+  the `NEWT_METRICS_INCLUDE_TUNNEL_ID` toggle to keep cardinality manageable on larger
+  fleets.
+- **Connection attempts**: `newt_connection_attempts_total` and
+  `newt_connection_errors_total` are emitted throughout the WebSocket client to classify
+  authentication, dial, and transport failures.
+- **Operations/configuration**: `newt_config_reloads_total`,
+  `newt_restart_count_total`, `newt_config_apply_seconds`, and
+  `newt_cert_rotation_total` provide visibility into blueprint reloads, process boots,
+  configuration timings, and certificate rotation outcomes.
+- **Build metadata**: `newt_build_info` records the binary version/commit together
+  with a monotonic restart counter when build information is supplied at startup.
+- **WebSocket control-plane**: `newt_websocket_connect_latency_seconds` and
+  `newt_websocket_messages_total` report connect latency and ping/pong/text activity.
+- **Proxy data-plane**: Observable gauges (`newt_proxy_active_connections`,
+  `newt_proxy_buffer_bytes`, `newt_proxy_async_backlog_bytes`) and the
+  `newt_proxy_drops_total` counter are fed from the proxy manager to monitor backlog
+  and drop behaviour alongside per-protocol byte counters.
+
+Refer to `docs/observability.md` for a tabular catalogue with instrument types,
+attributes, and sample exposition lines.
+
+## Tracing coverage
+
+Tracing is optional and enabled only when OTLP export is configured. When active:
+
+- The admin HTTP mux is wrapped with `otelhttp.NewHandler`, producing spans for
+  `/metrics` and `/healthz` requests.
+- The WebSocket dial path creates a `ws.connect` span around the gRPC-based handshake.
+
+No other subsystems currently create spans, so data-plane operations, blueprint fetches,
+Docker discovery, and WireGuard reconfiguration happen without trace context.
+
+## Guideline & best-practice alignment
+
+The implementation adheres to most OTel Go recommendations:
+
+- **Naming & units** – Every instrument follows the `newt_*` prefix with `_total`
+  suffixes for counters and `_seconds`/`_bytes` unit conventions. Histograms are
+  registered with explicit second-based buckets.
+- **Resource attributes** – Service name/version and optional `site_id`/`region`
+  populate the `resource.Resource` and are also injected as metric attributes for
+  compatibility with Prometheus queries.
+- **Attribute hygiene** – A single attribute filter (`sdkmetric.WithView`) enforces
+  the allow-list of label keys to prevent accidental high-cardinality emission.
+- **Runtime metrics** – Go runtime instrumentation is enabled automatically through
+  `runtime.Start`.
+- **Configuration via environment** – `telemetry.FromEnv` honours `OTEL_*` variables
+  alongside `NEWT_*` overrides so operators can configure exporters without code
+  changes.
+- **Shutdown handling** – `Setup.Shutdown` iterates exporters in reverse order to
+  flush buffers before process exit.
+
+## Adjustments & improvements
+
+The review identified a few actionable adjustments:
+
+1. **Record registration failures** – `newt_site_registrations_total` is currently
+   incremented only on success. Emit `result="failure"` samples whenever Pangolin
+   rejects a registration or credential exchange so operators can alert on churn.
+2. **Surface config reload failures** – `telemetry.IncConfigReload` is invoked with
+   `result="success"` only. Callers should record a failure result when blueprint
+   parsing or application aborts before success counters are incremented.
+3. **Harmonise restart count behaviour** – `newt_restart_count_total` increments only
+   when build metadata is provided. Consider moving the increment out of
+   `RegisterBuildInfo` so the counter advances even for ad-hoc builds without version
+   strings.
+4. **Propagate contexts where available** – Many emitters call metric helpers with
+   `context.Background()`. Passing real contexts (when inexpensive) would allow future
+   exporters to correlate spans and metrics.
+5. **Extend tracing coverage** – Instrument critical flows such as blueprint fetches,
+   WireGuard reconfiguration, proxy accept loops, and Docker discovery to provide end
+   to end visibility when OTLP tracing is enabled.
+
+## Metrics to add before release
+
+Prioritised additions that would close visibility gaps:
+
+1. **WebSocket disconnect outcomes** – A counter (e.g., `newt_websocket_disconnects_total`)
+   partitioned by `reason` would complement the existing connect latency histogram and
+   explain reconnect storms.
+2. **Keepalive/heartbeat failures** – Counting ping timeouts or heartbeat misses would
+   make `newt_site_last_heartbeat_seconds` actionable by providing discrete events.
+3. **Proxy connection lifecycle** – Add counters/histograms for proxy accept events and
+   connection durations to correlate drops with load and backlog metrics.
+4. **Blueprint/config pull latency** – Measuring Pangolin blueprint fetch durations and
+   HTTP status distribution would expose slow control-plane operations.
+5. **Certificate rotation attempts** – Complement `newt_cert_rotation_total` with a
+   duration histogram to observe slow PKI updates and detect stuck rotations.
+
+These metrics rely on data that is already available in the code paths mentioned
+above and would round out operational dashboards.
+
+## Tracing wishlist
+
+To benefit from tracing when OTLP is active, add spans around:
+
+- Pangolin REST calls (wrap the HTTP client with `otelhttp.NewTransport`).
+- Docker discovery cycles and target registration callbacks.
+- WireGuard reconfiguration (interface bring-up, peer updates).
+- Proxy dial/accept loops for both TCP and UDP targets.
+
+Capturing these stages will let operators correlate latency spikes with reconnects
+and proxy drops using distributed traces in addition to the metric signals.