Mirror of https://github.com/fosrl/newt.git, synced 2026-03-26 20:46:41 +00:00

Merge branch 'main' into codex/review-opentelemetry-metrics-and-tracing
.github/workflows/cicd.yml (vendored): 3 lines changed

@@ -1,5 +1,8 @@
 name: CI/CD Pipeline
 
+permissions:
+  contents: read
+
 on:
   push:
     tags:

.github/workflows/test.yml (vendored): 3 lines changed

@@ -1,5 +1,8 @@
 name: Run Tests
 
+permissions:
+  contents: read
+
 on:
   pull_request:
     branches:

@@ -10,6 +10,10 @@ This document captures the current state of Newt metrics, prioritized fixes, and
 - Tunnel/Traffic: newt_tunnel_sessions, newt_tunnel_bytes_total, newt_tunnel_latency_seconds, newt_tunnel_reconnects_total
 - Connection lifecycle: newt_connection_attempts_total, newt_connection_errors_total
 - Operations: newt_config_reloads_total, newt_restart_count_total, newt_build_info
+- Operations: newt_config_reloads_total, newt_restart_count_total, newt_config_apply_seconds, newt_cert_rotation_total
+- Build metadata: newt_build_info
+- Control plane: newt_websocket_connect_latency_seconds, newt_websocket_messages_total
+- Proxy: newt_proxy_active_connections, newt_proxy_buffer_bytes, newt_proxy_async_backlog_bytes, newt_proxy_drops_total
 - Go runtime: GC, heap, goroutines via runtime instrumentation
 
 2) Main issues addressed now

@@ -27,6 +31,10 @@ This document captures the current state of Newt metrics, prioritized fixes, and
 - Some call sites still need initiator label on reconnect outcomes (client vs server). This is planned.
 - WebSocket and Proxy metrics (connect latency, messages, active connections, buffer/drops, async backlog) are planned additions.
 - Config apply duration and cert rotation counters are planned.
+- Registration and config reload failures are not yet emitted; add failure code paths so result labels expose churn.
+- Restart counter increments only when build metadata is provided; consider decoupling to count all boots.
+- Metric helpers often use `context.Background()`. Where lightweight contexts exist (e.g., HTTP handlers), propagate them to ease future correlation.
+- Tracing coverage is limited to admin HTTP and WebSocket connect spans; extend to blueprint fetches, proxy accept loops, and WireGuard updates when OTLP is enabled.
 
 4) Roadmap (phased)

@@ -40,6 +48,10 @@ This document captures the current state of Newt metrics, prioritized fixes, and
 - Proxy: newt_proxy_active_connections, newt_proxy_buffer_bytes, newt_proxy_drops_total, newt_proxy_async_backlog_bytes
 - Reconnect: add initiator label (client/server)
 - Config & PKI: newt_config_apply_seconds{phase,result}; newt_cert_rotation_total{result}
+- WebSocket disconnect and keepalive failure counters
+- Proxy connection lifecycle metrics (accept totals, duration histogram)
+- Pangolin blueprint/config fetch latency and status metrics
+- Certificate rotation duration histogram to complement success/failure counter
 
 5) Operational guidance

@@ -64,9 +76,3 @@ This document captures the current state of Newt metrics, prioritized fixes, and
 
 - Direct scrape variant requires no attribute promotion since site_id is already a metric label.
 - Transform/promote variant remains optional for environments that rely on resource-to-label promotion.
-
-8) Testing
-
-- curl :2112/metrics | grep ^newt_
-- Verify presence of site_id across series; region appears when set.
-- Ensure disallowed attributes are filtered; allowed (site_id) retained.

@@ -34,18 +34,30 @@ Runtime behavior
 - When OTLP is enabled, metrics and traces are exported to OTLP gRPC endpoint
 - Go runtime metrics (goroutines, GC, memory) are exported automatically
 
-Metric catalog (initial)
+Metric catalog (current)
 
-- newt_build_info (gauge) labels: version, commit, site_id[, region]; value is always 1
-- newt_site_registrations_total (counter) labels: result, site_id[, region]
-- newt_site_online (observable gauge) labels: site_id (0/1)
-- newt_site_last_heartbeat_seconds (observable gauge) labels: site_id
-- newt_tunnel_sessions (observable gauge) labels: site_id, tunnel_id [transport optional when known]
-- newt_tunnel_bytes_total (counter) labels: site_id, tunnel_id, protocol (tcp|udp), direction (ingress|egress)
-- newt_tunnel_latency_seconds (histogram) labels: site_id, tunnel_id, transport (e.g., wireguard)
-- newt_tunnel_reconnects_total (counter) labels: site_id, tunnel_id, initiator (client|server), reason
-- newt_connection_attempts_total (counter) labels: site_id, transport, result
-- newt_connection_errors_total (counter) labels: site_id, transport, error_type (dial_timeout|tls_handshake|auth_failed|io_error)
+| Metric | Instrument | Key attributes | Purpose | Example |
+| --- | --- | --- | --- | --- |
+| `newt_build_info` | Observable gauge (Int64) | `version`, `commit`, `site_id`, `region` (optional) | Emits build metadata with value `1` for scrape-time verification. | `newt_build_info{version="1.5.0",site_id="acme-edge-1"} 1` |
+| `newt_site_registrations_total` | Counter (Int64) | `result` (`success`/`failure`), `site_id`, `region` (optional) | Counts Pangolin registration attempts. | `newt_site_registrations_total{result="success",site_id="acme-edge-1"} 1` |
+| `newt_site_online` | Observable gauge (Int64) | `site_id` | Reports whether the site is currently connected (`1`) or offline (`0`). | `newt_site_online{site_id="acme-edge-1"} 1` |
+| `newt_site_last_heartbeat_seconds` | Observable gauge (Float64) | `site_id` | Time since the most recent Pangolin heartbeat. | `newt_site_last_heartbeat_seconds{site_id="acme-edge-1"} 2.4` |
+| `newt_tunnel_sessions` | Observable gauge (Int64) | `site_id`, `tunnel_id` (when enabled) | Counts active tunnel sessions per peer; collapses to per-site when tunnel IDs are disabled. | `newt_tunnel_sessions{site_id="acme-edge-1",tunnel_id="wgpub..."} 3` |
+| `newt_tunnel_bytes_total` | Counter (Int64) | `direction` (`ingress`/`egress`), `protocol` (`tcp`/`udp`), `tunnel_id` (optional), `site_id`, `region` (optional) | Measures proxied traffic volume across tunnels. | `newt_tunnel_bytes_total{direction="ingress",protocol="tcp",site_id="acme-edge-1"} 4096` |
+| `newt_tunnel_latency_seconds` | Histogram (Float64) | `transport` (e.g., `wireguard`), `tunnel_id` (optional), `site_id`, `region` (optional) | Captures RTT or configuration-driven latency samples. | `newt_tunnel_latency_seconds_bucket{transport="wireguard",le="0.5"} 42` |
+| `newt_tunnel_reconnects_total` | Counter (Int64) | `initiator` (`client`/`server`), `reason` (enumerated), `tunnel_id` (optional), `site_id`, `region` (optional) | Tracks reconnect causes for troubleshooting flaps. | `newt_tunnel_reconnects_total{initiator="client",reason="timeout",site_id="acme-edge-1"} 5` |
+| `newt_connection_attempts_total` | Counter (Int64) | `transport` (`auth`/`websocket`), `result`, `site_id`, `region` (optional) | Measures control-plane dial attempts and their outcomes. | `newt_connection_attempts_total{transport="websocket",result="success",site_id="acme-edge-1"} 8` |
+| `newt_connection_errors_total` | Counter (Int64) | `transport`, `error_type`, `site_id`, `region` (optional) | Buckets connection failures by normalized error class. | `newt_connection_errors_total{transport="websocket",error_type="tls_handshake",site_id="acme-edge-1"} 1` |
+| `newt_config_reloads_total` | Counter (Int64) | `result`, `site_id`, `region` (optional) | Counts remote blueprint/config reloads. | `newt_config_reloads_total{result="success",site_id="acme-edge-1"} 3` |
+| `newt_restart_count_total` | Counter (Int64) | `site_id`, `region` (optional) | Increments once per process boot to detect restarts. | `newt_restart_count_total{site_id="acme-edge-1"} 1` |
+| `newt_config_apply_seconds` | Histogram (Float64) | `phase` (`interface`/`peer`), `result`, `site_id`, `region` (optional) | Measures time spent applying WireGuard configuration phases. | `newt_config_apply_seconds_sum{phase="peer",result="success",site_id="acme-edge-1"} 0.48` |
+| `newt_cert_rotation_total` | Counter (Int64) | `result`, `site_id`, `region` (optional) | Tracks client certificate rotation attempts. | `newt_cert_rotation_total{result="success",site_id="acme-edge-1"} 2` |
+| `newt_websocket_connect_latency_seconds` | Histogram (Float64) | `transport="websocket"`, `result`, `error_type` (on failure), `site_id`, `region` (optional) | Measures WebSocket dial latency and exposes failure buckets. | `newt_websocket_connect_latency_seconds_bucket{result="success",le="0.5",site_id="acme-edge-1"} 9` |
+| `newt_websocket_messages_total` | Counter (Int64) | `direction` (`in`/`out`), `msg_type` (`text`/`ping`/`pong`), `site_id`, `region` (optional) | Accounts for control WebSocket traffic volume by type. | `newt_websocket_messages_total{direction="out",msg_type="ping",site_id="acme-edge-1"} 12` |
+| `newt_proxy_active_connections` | Observable gauge (Int64) | `protocol` (`tcp`/`udp`), `direction` (`ingress`/`egress`), `tunnel_id` (optional), `site_id`, `region` (optional) | Current proxy connections per tunnel and protocol. | `newt_proxy_active_connections{protocol="tcp",direction="egress",site_id="acme-edge-1"} 4` |
+| `newt_proxy_buffer_bytes` | Observable gauge (Int64) | `protocol`, `direction`, `tunnel_id` (optional), `site_id`, `region` (optional) | Volume of buffered data awaiting flush in proxy queues. | `newt_proxy_buffer_bytes{protocol="udp",direction="egress",site_id="acme-edge-1"} 2048` |
+| `newt_proxy_async_backlog_bytes` | Observable gauge (Int64) | `protocol`, `direction`, `tunnel_id` (optional), `site_id`, `region` (optional) | Tracks async write backlog when deferred flushing is enabled. | `newt_proxy_async_backlog_bytes{protocol="tcp",direction="egress",site_id="acme-edge-1"} 512` |
+| `newt_proxy_drops_total` | Counter (Int64) | `protocol`, `tunnel_id` (optional), `site_id`, `region` (optional) | Counts proxy drop events caused by downstream write errors. | `newt_proxy_drops_total{protocol="udp",site_id="acme-edge-1"} 1` |
 
 Conventions

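The `_bucket{...,le="0.5"}` example rows in the catalog above use Prometheus cumulative histogram semantics: each bucket counts every observation less than or equal to its `le` bound, so bucket values are monotonically non-decreasing. A stdlib sketch with made-up latency samples:

```go
package main

import "fmt"

// bucketCount returns the cumulative count for one histogram bucket: the
// number of observations less than or equal to the le bound.
func bucketCount(observations []float64, le float64) int {
	n := 0
	for _, o := range observations {
		if o <= le {
			n++
		}
	}
	return n
}

func main() {
	// Hypothetical newt_tunnel_latency_seconds observations, in seconds.
	obs := []float64{0.02, 0.07, 0.4, 1.2}
	for _, le := range []float64{0.05, 0.5, 5} {
		fmt.Printf("newt_tunnel_latency_seconds_bucket{le=\"%v\"} %d\n", le, bucketCount(obs, le))
	}
}
```

With these samples the `le="0.5"` bucket reports 3 because it includes the 0.02, 0.07, and 0.4 observations, not just those falling between 0.05 and 0.5.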
@@ -1,64 +1,126 @@
-# OpenTelemetry Review
+# Newt OpenTelemetry Review
 
-## Metric inventory
-The table below lists every instrument registered by `internal/telemetry/metrics.go`, the helper that emits it, and an example time-series. Attribute sets automatically add `site_id` (and optionally `region`) via `attrsWithSite` unless the observable callback overrides them. 【F:internal/telemetry/metrics.go†L23-L205】【F:internal/telemetry/metrics.go†L289-L403】
-
-| Metric | Instrument & unit | Purpose | Emission path | Example series |
-| --- | --- | --- | --- | --- |
-| `newt_site_registrations_total` | Counter | Counts Pangolin registration attempts keyed by result (`success`, `failure`). | `telemetry.IncSiteRegistration` (called after registration completes). | `newt_site_registrations_total{result="success",site_id="abc"} 1` |
-| `newt_site_online` | Observable gauge | 0/1 heartbeat for the active site, driven by the registered `StateView`. | `telemetry.SetObservableCallback` via `state.TelemetryView`. | `newt_site_online{site_id="self"} 1` |
-| `newt_site_last_heartbeat_seconds` | Observable gauge | Seconds since the last Pangolin heartbeat. | Same callback as above using `state.TelemetryView.TouchHeartbeat`. | `newt_site_last_heartbeat_seconds{site_id="self"} 3.2` |
-| `newt_tunnel_sessions` | Observable gauge | Active sessions per tunnel; collapses to site total when `tunnel_id` emission is disabled. | `state.TelemetryView.SessionsByTunnel` via `RegisterStateView`. | `newt_tunnel_sessions{site_id="self",tunnel_id="wgpub"} 2` |
-| `newt_tunnel_bytes_total` | Counter (`By`) | Traffic accounting per tunnel, direction (`ingress`/`egress`), protocol (`tcp`/`udp`). | Proxy manager counting writers (`AddTunnelBytes`/`AddTunnelBytesSet`). | `newt_tunnel_bytes_total{direction="egress",protocol="tcp",site_id="self",tunnel_id="wgpub"} 8192` |
-| `newt_tunnel_latency_seconds` | Histogram (`s`) | RTT samples from WireGuard stack and health pings per tunnel/transport. | `telemetry.ObserveTunnelLatency` from tunnel health checks. | `newt_tunnel_latency_seconds_bucket{transport="wireguard",le="0.05",tunnel_id="wgpub"} 4` |
-| `newt_tunnel_reconnects_total` | Counter | Reconnect attempts bucketed by initiator (`client`/`server`) and reason enums. | `telemetry.IncReconnect` across websocket, WG, and utility flows. | `newt_tunnel_reconnects_total{initiator="client",reason="timeout",tunnel_id="wgpub"} 3` |
-| `newt_connection_attempts_total` | Counter | Auth and WebSocket attempt counts by transport (`auth`, `websocket`) and result (`success`/`failure`). | `telemetry.IncConnAttempt` in auth/token and dial paths. | `newt_connection_attempts_total{transport="websocket",result="failure",site_id="self"} 2` |
-| `newt_connection_errors_total` | Counter | Connection error tally keyed by transport and canonical error type (`dial_timeout`, `tls_handshake`, `auth_failed`, `io_error`). | `telemetry.IncConnError` in auth/websocket flows. | `newt_connection_errors_total{transport="auth",error_type="auth_failed",site_id="self"} 1` |
-| `newt_config_reloads_total` | Counter | Successful/failed config reload attempts. | `telemetry.IncConfigReload` during WireGuard config reloads. | `newt_config_reloads_total{result="success",site_id="self"} 1` |
-| `newt_restart_count_total` | Counter | Bumps to 1 at process boot for build info scrapers. | `telemetry.RegisterBuildInfo` called from `Init`. | `newt_restart_count_total{site_id="self"} 1` |
-| `newt_config_apply_seconds` | Histogram (`s`) | Measures interface/peer apply duration per phase and result. | `telemetry.ObserveConfigApply` around config updates. | `newt_config_apply_seconds_bucket{phase="peer",result="success",le="0.1"} 5` |
-| `newt_cert_rotation_total` | Counter | Certificate rotation events tagged by result. | `telemetry.IncCertRotation` during PKI updates. | `newt_cert_rotation_total{result="success",site_id="self"} 1` |
-| `newt_build_info` | Observable gauge | Constant 1 with `version`/`commit` attributes to expose build metadata. | Callback registered in `registerBuildWSProxyInstruments`. | `newt_build_info{version="1.2.3",commit="abc123",site_id="self"} 1` |
-| `newt_websocket_connect_latency_seconds` | Histogram (`s`) | Dial latency for Pangolin WebSocket connects annotated with result/error_type. | `telemetry.ObserveWSConnectLatency` inside `Client.establishConnection`. | `newt_websocket_connect_latency_seconds_bucket{result="success",transport="websocket",le="0.5"} 1` |
-| `newt_websocket_messages_total` | Counter | Counts inbound/outbound WebSocket messages by direction and logical message type. | `telemetry.IncWSMessage` for ping/pong/text events. | `newt_websocket_messages_total{direction="out",msg_type="ping",site_id="self"} 4` |
-| `newt_websocket_disconnects_total` | Counter | Tracks WebSocket disconnects grouped by `reason` (`shutdown`, `unexpected_close`, etc.) and `result`. | Emitted from `Client.readPumpWithDisconnectDetection` defer block. | `newt_websocket_disconnects_total{reason="unexpected_close",result="error",site_id="self"} 1` |
-| `newt_websocket_keepalive_failures_total` | Counter | Failed WebSocket ping/pong keepalive attempts by reason. | Incremented in `Client.pingMonitor` when `WriteControl` fails. | `newt_websocket_keepalive_failures_total{reason="ping_write",site_id="self"} 1` |
-| `newt_websocket_session_duration_seconds` | Histogram (`s`) | Duration of WebSocket sessions by outcome (`result`). | Observed when the read pump exits. | `newt_websocket_session_duration_seconds_sum{result="success",site_id="self"} 120` |
-| `newt_proxy_active_connections` | Observable gauge | Active TCP/UDP proxy connections per tunnel and protocol. | Proxy manager callback via `SetProxyObservableCallback`. | `newt_proxy_active_connections{protocol="tcp",tunnel_id="wgpub"} 3` |
-| `newt_proxy_buffer_bytes` | Observable gauge (`By`) | Size of proxy buffer pools (synchronous path) per tunnel/protocol. | Same proxy callback as above. | `newt_proxy_buffer_bytes{protocol="tcp",tunnel_id="wgpub"} 10240` |
-| `newt_proxy_async_backlog_bytes` | Observable gauge (`By`) | Unflushed async byte backlog when deferred accounting is enabled. | Proxy callback when async accounting is turned on. | `newt_proxy_async_backlog_bytes{protocol="udp",tunnel_id="wgpub"} 4096` |
-| `newt_proxy_drops_total` | Counter | Proxy write-drop events per protocol/tunnel. | `telemetry.IncProxyDrops` on UDP drop paths. | `newt_proxy_drops_total{protocol="udp",tunnel_id="wgpub"} 2` |
-| `newt_proxy_accept_total` | Counter | Proxy accept attempts labelled by protocol, result, and reason. | `telemetry.IncProxyAccept` in TCP accept loop and UDP dial paths. | `newt_proxy_accept_total{protocol="tcp",result="failure",reason="timeout",site_id="self"} 1` |
-| `newt_proxy_connection_duration_seconds` | Histogram (`s`) | Lifecycle duration for proxied TCP/UDP connections by result. | `telemetry.ObserveProxyConnectionDuration` when TCP/UDP handlers complete. | `newt_proxy_connection_duration_seconds_sum{protocol="udp",result="success",site_id="self"} 30` |
+## Overview
+
+This document summarises the current OpenTelemetry (OTel) instrumentation in Newt, assesses
+compliance with OTel guidelines, and lists concrete improvements to pursue before release.
+It is based on the implementation in `internal/telemetry` and the call-sites that emit
+metrics and traces across the code base.
 
-In addition, Go runtime metrics are automatically exported when telemetry is initialised. 【F:internal/telemetry/telemetry.go†L147-L155】
-
-## Tracing footprint
-* Tracing is enabled only when OTLP export is turned on; `telemetry.Init` wires a batch `TracerProvider` and sets it globally. 【F:internal/telemetry/telemetry.go†L135-L155】
-* The admin HTTP mux (`/metrics`, `/healthz`) is wrapped with `otelhttp.NewHandler`, so any inbound admin requests produce spans. 【F:main.go†L373-L387】
-* WebSocket dials create a `ws.connect` span around the outbound handshake, but subsequent control-plane HTTP requests (token fetch, blueprint sync) use plain `http.Client` without propagation. 【F:websocket/client.go†L417-L459】
-
-Overall span coverage is limited to the WebSocket connect loop and admin server; tunnel setup, Docker discovery, config application, and health pings currently emit only metrics.
+## Current metric instrumentation
+
+All instruments are registered in `internal/telemetry/metrics.go`. They are grouped
+into site, tunnel, connection, configuration, build, WebSocket, and proxy domains.
+A global attribute filter (see `buildMeterProvider`) constrains exposed label keys to
+`site_id`, `region`, and a curated list of low-cardinality dimensions so that Prometheus
+exports stay bounded.
+
+- **Site lifecycle**: `newt_site_registrations_total`, `newt_site_online`, and
+  `newt_site_last_heartbeat_seconds` capture registration attempts and liveness. They
+  are fed either manually (`IncSiteRegistration`) or via the `TelemetryView` state
+  callback that publishes observable gauges for the active site.
+- **Tunnel health and usage**: Counters and histograms track bytes, latency, reconnects,
+  and active sessions per tunnel (`newt_tunnel_*` family). Attribute helpers respect
+  the `NEWT_METRICS_INCLUDE_TUNNEL_ID` toggle to keep cardinality manageable on larger
+  fleets.
+- **Connection attempts**: `newt_connection_attempts_total` and
+  `newt_connection_errors_total` are emitted throughout the WebSocket client to classify
+  authentication, dial, and transport failures.
+- **Operations/configuration**: `newt_config_reloads_total`,
+  `newt_restart_count_total`, `newt_config_apply_seconds`, and
+  `newt_cert_rotation_total` provide visibility into blueprint reloads, process boots,
+  configuration timings, and certificate rotation outcomes.
+- **Build metadata**: `newt_build_info` records the binary version/commit together
+  with a monotonic restart counter when build information is supplied at startup.
+- **WebSocket control-plane**: `newt_websocket_connect_latency_seconds` and
+  `newt_websocket_messages_total` report connect latency and ping/pong/text activity.
+- **Proxy data-plane**: Observable gauges (`newt_proxy_active_connections`,
+  `newt_proxy_buffer_bytes`, `newt_proxy_async_backlog_bytes`) and the
+  `newt_proxy_drops_total` counter are fed from the proxy manager to monitor backlog
+  and drop behaviour alongside per-protocol byte counters.
 
-## Guideline & best-practice adherence
-* **Resource & exporter configuration:** `telemetry.FromEnv` honours OTEL env-vars, sets service name/version, and promotes `site_id`/`region` resource attributes before building the provider. Exporters default to Prometheus with optional OTLP, aligning with OTel deployment guidance. 【F:internal/telemetry/telemetry.go†L56-L206】
-* **Low-cardinality enforcement:** A view-level attribute allow-list retains only approved keys (`tunnel_id`, `transport`, `protocol`, etc.), protecting Prometheus cardinality while still surfacing `site_id`/`region`. 【F:internal/telemetry/telemetry.go†L209-L231】
-* **Units and naming:** Instrument helpers enforce `_total` suffixes for counters, `_seconds` for durations, and attach `metric.WithUnit("By"|"s")` for size/time metrics, matching OTel semantic conventions. 【F:internal/telemetry/metrics.go†L23-L192】
-* **Runtime metrics & shutdown:** The runtime instrumentation is enabled, and `Setup.Shutdown` drains exporters in reverse order to avoid data loss. 【F:internal/telemetry/telemetry.go†L147-L261】
-* **Site-aware observables:** `state.TelemetryView` provides thread-safe snapshots to feed `newt_site_online`/`_last_heartbeat_seconds`/`_tunnel_sessions`, ensuring gauges report cohesive per-site data even when `tunnel_id` labels are disabled. 【F:internal/state/telemetry_view.go†L11-L79】
-
-## Gaps & recommended improvements
-1. **Tracing coverage:** Instrument the Pangolin REST calls (`getToken`, blueprint downloads) with `otelhttp.NewTransport` or explicit spans, and consider spans for WireGuard handshake/config apply to enable end-to-end traces when OTLP is on. 【F:websocket/client.go†L240-L360】
-2. **Histogram coverage:** Introduce `newt_site_registration_latency_seconds` (bootstrap) and `newt_ping_roundtrip_seconds` (heartbeat) to capture SLO-critical latencies before release. Existing latency buckets (`0.005s` → `30s`) can be reused. 【F:internal/telemetry/telemetry.go†L209-L218】
-3. **Control-plane throughput:** Add `newt_websocket_payload_bytes_total` (direction/msg_type) or reuse the tunnel counter with a `transport="websocket"` label to quantify command traffic volume and detect back-pressure.
-4. **Docker discovery metrics:** If Docker auto-discovery is enabled, expose counters for container add/remove events and failures so operators can trace missing backends to discovery issues.
-
-## Pre-release metric backlog
-Prior to GA, we recommend landing the following high-value instruments:
-* **Bootstrap latency:** `newt_site_registration_latency_seconds` histogram emitted around the initial Pangolin registration HTTP call to detect slow control-plane responses.
-* **Session duration:** `newt_websocket_session_duration_seconds` histogram recorded when a WebSocket closes (result + reason) to quantify stability.
-* **Heartbeat lag:** `newt_ping_roundtrip_seconds` histogram from ping/pong monitors to capture tunnel health, complementing the heartbeat gauge.
-* **Proxy accept errors:** `newt_proxy_accept_errors_total` counter keyed by protocol/reason to surface listener pressure distinct from data-plane drops.
-* **Discovery events:** `newt_discovery_events_total` counter with `action` (`add`, `remove`, `error`) and `source` (`docker`, `file`) to audit service inventory churn.
-
-Implementing the above will round out visibility into control-plane responsiveness, connection stability, and discovery health while preserving the existing low-cardinality discipline.
+Refer to `docs/observability.md` for a tabular catalogue with instrument types,
+attributes, and sample exposition lines.
+
+## Tracing coverage
+
+Tracing is optional and enabled only when OTLP export is configured. When active:
+
+- The admin HTTP mux is wrapped with `otelhttp.NewHandler`, producing spans for
+  `/metrics` and `/healthz` requests.
+- The WebSocket dial path creates a `ws.connect` span around the outbound handshake.
+
+No other subsystems currently create spans, so data-plane operations, blueprint fetches,
+Docker discovery, and WireGuard reconfiguration happen without trace context.
 
+## Guideline & best-practice alignment
+
+The implementation adheres to most OTel Go recommendations:
+
+- **Naming & units** – Every instrument follows the `newt_*` prefix with `_total`
+  suffixes for counters and `_seconds`/`_bytes` unit conventions. Histograms are
+  registered with explicit second-based buckets.
+- **Resource attributes** – Service name/version and optional `site_id`/`region`
+  populate the `resource.Resource` and are also injected as metric attributes for
+  compatibility with Prometheus queries.
+- **Attribute hygiene** – A single attribute filter (`sdkmetric.WithView`) enforces
+  the allow-list of label keys to prevent accidental high-cardinality emission.
+- **Runtime metrics** – Go runtime instrumentation is enabled automatically through
+  `runtime.Start`.
+- **Configuration via environment** – `telemetry.FromEnv` honours `OTEL_*` variables
+  alongside `NEWT_*` overrides so operators can configure exporters without code
+  changes.
+- **Shutdown handling** – `Setup.Shutdown` iterates exporters in reverse order to
+  flush buffers before process exit.
+
+## Adjustments & improvements
+
+The review identified a few actionable adjustments:
+
+1. **Record registration failures** – `newt_site_registrations_total` is currently
+   incremented only on success. Emit `result="failure"` samples whenever Pangolin
+   rejects a registration or credential exchange so operators can alert on churn.
+2. **Surface config reload failures** – `telemetry.IncConfigReload` is invoked with
+   `result="success"` only. Callers should record a failure result when blueprint
+   parsing or application aborts before success counters are incremented.
+3. **Harmonise restart count behaviour** – `newt_restart_count_total` increments only
+   when build metadata is provided. Consider moving the increment out of
+   `RegisterBuildInfo` so the counter advances even for ad-hoc builds without version
+   strings.
+4. **Propagate contexts where available** – Many emitters call metric helpers with
+   `context.Background()`. Passing real contexts (when inexpensive) would allow future
+   exporters to correlate spans and metrics.
+5. **Extend tracing coverage** – Instrument critical flows such as blueprint fetches,
+   WireGuard reconfiguration, proxy accept loops, and Docker discovery to provide
+   end-to-end visibility when OTLP tracing is enabled.
+
+## Metrics to add before release
+
+Prioritised additions that would close visibility gaps:
+
+1. **WebSocket disconnect outcomes** – A counter (e.g., `newt_websocket_disconnects_total`)
+   partitioned by `reason` would complement the existing connect latency histogram and
+   explain reconnect storms.
+2. **Keepalive/heartbeat failures** – Counting ping timeouts or heartbeat misses would
+   make `newt_site_last_heartbeat_seconds` actionable by providing discrete events.
+3. **Proxy connection lifecycle** – Add counters/histograms for proxy accept events and
+   connection durations to correlate drops with load and backlog metrics.
+4. **Blueprint/config pull latency** – Measuring Pangolin blueprint fetch durations and
+   HTTP status distribution would expose slow control-plane operations.
+5. **Certificate rotation attempts** – Complement `newt_cert_rotation_total` with a
+   duration histogram to observe slow PKI updates and detect stuck rotations.
+
+These metrics rely on data that is already available in the code paths mentioned
+above and would round out operational dashboards.
+
+## Tracing wishlist
+
+To benefit from tracing when OTLP is active, add spans around:
+
+- Pangolin REST calls (wrap the HTTP client with `otelhttp.NewTransport`).
+- Docker discovery cycles and target registration callbacks.
+- WireGuard reconfiguration (interface bring-up, peer updates).
+- Proxy dial/accept loops for both TCP and UDP targets.
+
+Capturing these stages will let operators correlate latency spikes with reconnects
+and proxy drops using distributed traces in addition to the metric signals.
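The allow-list enforcement called out under "Attribute hygiene" can be approximated without the SDK: a view-level filter keeps only approved label keys and silently drops the rest. A stdlib sketch; the key set mirrors the low-cardinality labels discussed in this review, while the real code achieves the same effect with an `AttributeFilter` installed via `sdkmetric.WithView` (`peer_ip` below is a hypothetical disallowed label):

```go
package main

import "fmt"

// allowedKeys mirrors the low-cardinality label allow-list discussed above.
var allowedKeys = map[string]bool{
	"site_id": true, "region": true, "tunnel_id": true,
	"transport": true, "protocol": true, "direction": true, "result": true,
}

// filterAttrs drops any label whose key is not on the allow-list — the same
// effect a metric-view attribute filter has on exported series.
func filterAttrs(attrs map[string]string) map[string]string {
	out := map[string]string{}
	for k, v := range attrs {
		if allowedKeys[k] {
			out[k] = v
		}
	}
	return out
}

func main() {
	in := map[string]string{"site_id": "acme-edge-1", "peer_ip": "10.0.0.7"}
	fmt.Println(filterAttrs(in)) // peer_ip is dropped; site_id survives
}
```

Because filtering happens at the view layer, instrumented call sites can attach whatever attributes are convenient; only the curated keys ever reach the Prometheus exposition.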