Newt OpenTelemetry Review
Overview
This document summarises the current OpenTelemetry (OTel) instrumentation in Newt, assesses
compliance with OTel guidelines, and lists concrete improvements to pursue before release.
It is based on the implementation in `internal/telemetry` and the call sites that emit
metrics and traces across the code base.
Current metric instrumentation
All instruments are registered in `internal/telemetry/metrics.go`. They are grouped
into site, tunnel, connection, configuration, build, WebSocket, and proxy domains.
A global attribute filter (see `buildMeterProvider`) constrains exposed label keys to
`site_id`, `region`, and a curated list of low-cardinality dimensions so that Prometheus
exports stay bounded.
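For reference, an allow-list of this shape can be expressed with a single view; the key set below is illustrative of the curated dimensions, not the literal contents of `buildMeterProvider`:

```go
package telemetry

import (
	"go.opentelemetry.io/otel/attribute"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// newMeterProvider applies one attribute allow-list to every newt_*
// instrument. The key set here is a sketch; the real list lives in
// buildMeterProvider.
func newMeterProvider(reader sdkmetric.Reader) *sdkmetric.MeterProvider {
	allowed := attribute.NewAllowKeysFilter("site_id", "region", "result", "reason")
	return sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(reader),
		sdkmetric.WithView(sdkmetric.NewView(
			sdkmetric.Instrument{Name: "newt_*"}, // wildcard view criteria
			sdkmetric.Stream{AttributeFilter: allowed},
		)),
	)
}
```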
- Site lifecycle: `newt_site_registrations_total`, `newt_site_online`, and `newt_site_last_heartbeat_seconds` capture registration attempts and liveness. They are fed either manually (`IncSiteRegistration`) or via the `TelemetryView` state callback that publishes observable gauges for the active site.
- Tunnel health and usage: Counters and histograms track bytes, latency, reconnects, and active sessions per tunnel (`newt_tunnel_*` family). Attribute helpers respect the `NEWT_METRICS_INCLUDE_TUNNEL_ID` toggle to keep cardinality manageable on larger fleets.
- Connection attempts: `newt_connection_attempts_total` and `newt_connection_errors_total` are emitted throughout the WebSocket client to classify authentication, dial, and transport failures.
- Operations/configuration: `newt_config_reloads_total`, `newt_restart_count_total`, `newt_config_apply_seconds`, and `newt_cert_rotation_total` provide visibility into blueprint reloads, process boots, configuration timings, and certificate rotation outcomes.
- Build metadata: `newt_build_info` records the binary version/commit together with a monotonic restart counter when build information is supplied at startup.
- WebSocket control-plane: `newt_websocket_connect_latency_seconds` and `newt_websocket_messages_total` report connect latency and ping/pong/text activity.
- Proxy data-plane: Observable gauges (`newt_proxy_active_connections`, `newt_proxy_buffer_bytes`, `newt_proxy_async_backlog_bytes`) and the `newt_proxy_drops_total` counter are fed from the proxy manager to monitor backlog and drop behaviour alongside per-protocol byte counters.
Refer to `docs/observability.md` for a tabular catalogue with instrument types,
attributes, and sample exposition lines.
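For orientation, registering one counter and one observable gauge from this catalogue looks roughly as follows; the descriptions and the state-callback wiring are assumptions, not the literal `metrics.go` code:

```go
package telemetry

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// registerSiteInstruments sketches two of the instruments above; the
// currentOnlineState accessor is a hypothetical stand-in.
func registerSiteInstruments(siteID string, currentOnlineState func() int64) error {
	meter := otel.Meter("newt")

	// Counter behind the manual IncSiteRegistration helper.
	if _, err := meter.Int64Counter(
		"newt_site_registrations_total",
		metric.WithDescription("Site registration attempts by result."),
	); err != nil {
		return err
	}

	// Observable gauge published from the TelemetryView state callback.
	_, err := meter.Int64ObservableGauge(
		"newt_site_online",
		metric.WithDescription("1 while the site is registered and online."),
		metric.WithInt64Callback(func(_ context.Context, o metric.Int64Observer) error {
			o.Observe(currentOnlineState(),
				metric.WithAttributes(attribute.String("site_id", siteID)))
			return nil
		}),
	)
	return err
}
```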
Tracing coverage
Tracing is optional and enabled only when OTLP export is configured. When active:
- The admin HTTP mux is wrapped with `otelhttp.NewHandler`, producing spans for `/metrics` and `/healthz` requests.
- The WebSocket dial path creates a `ws.connect` span around the gRPC-based handshake.
No other subsystems currently create spans, so data-plane operations, blueprint fetches, Docker discovery, and WireGuard reconfiguration happen without trace context.
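Sketched in code, the two active touch points look roughly like this; the mux contents and the tracer name are assumptions:

```go
package telemetry

import (
	"context"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
)

// wrapAdminMux produces one server span per admin request, covering
// /metrics and /healthz.
func wrapAdminMux(mux *http.ServeMux) http.Handler {
	return otelhttp.NewHandler(mux, "newt.admin")
}

// dialWithSpan wraps the WebSocket handshake in a ws.connect span.
func dialWithSpan(ctx context.Context, dial func(context.Context) error) error {
	ctx, span := otel.Tracer("newt").Start(ctx, "ws.connect")
	defer span.End()
	if err := dial(ctx); err != nil {
		span.RecordError(err) // mark the handshake failure on the span
		return err
	}
	return nil
}
```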
Guideline & best-practice alignment
The implementation adheres to most OTel Go recommendations:
- Naming & units – Every instrument follows the
newt_*prefix with_totalsuffixes for counters and_seconds/_bytesunit conventions. Histograms are registered with explicit second-based buckets. - Resource attributes – Service name/version and optional
site_id/regionpopulate theresource.Resourceand are also injected as metric attributes for compatibility with Prometheus queries. - Attribute hygiene – A single attribute filter (
sdkmetric.WithView) enforces the allow-list of label keys to prevent accidental high-cardinality emission. - Runtime metrics – Go runtime instrumentation is enabled automatically through
runtime.Start. - Configuration via environment –
telemetry.FromEnvhonoursOTEL_*variables alongsideNEWT_*overrides so operators can configure exporters without code changes. - Shutdown handling –
Setup.Shutdowniterates exporters in reverse order to flush buffers before process exit.
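A minimal sketch of that reverse-order teardown, assuming a `shutdowns` slice of hooks; the field name and error handling are illustrative:

```go
package telemetry

import "context"

// Setup holds shutdown hooks appended in start order (exporters,
// providers, ...); the field name is an assumption.
type Setup struct {
	shutdowns []func(context.Context) error
}

// Shutdown tears components down in reverse start order so that
// providers stop producing before their exporters flush and close.
func (s *Setup) Shutdown(ctx context.Context) error {
	var firstErr error
	for i := len(s.shutdowns) - 1; i >= 0; i-- {
		if err := s.shutdowns[i](ctx); err != nil && firstErr == nil {
			firstErr = err
		}
	}
	return firstErr
}
```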
Adjustments & improvements
The review identified a few actionable adjustments:
- Record registration failures – `newt_site_registrations_total` is currently incremented only on success. Emit `result="failure"` samples whenever Pangolin rejects a registration or credential exchange so operators can alert on churn (see the sketch after this list).
- Surface config reload failures – `telemetry.IncConfigReload` is invoked with `result="success"` only. Callers should record a failure result when blueprint parsing or application aborts before the success counter is incremented.
- Harmonise restart count behaviour – `newt_restart_count_total` increments only when build metadata is provided. Consider moving the increment out of `RegisterBuildInfo` so the counter advances even for ad-hoc builds without version strings.
- Propagate contexts where available – Many emitters call metric helpers with `context.Background()`. Passing real contexts (when inexpensive) would allow future exporters to correlate spans and metrics.
- Extend tracing coverage – Instrument critical flows such as blueprint fetches, WireGuard reconfiguration, proxy accept loops, and Docker discovery to provide end-to-end visibility when OTLP tracing is enabled.
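A sketch of the failure-result pattern for the first two items, assuming `IncSiteRegistration` accepts a result value; the actual helper signatures may differ:

```go
package example

import "context"

// Illustrative stand-in: the real helper lives in internal/telemetry
// and its signature may differ.
func incSiteRegistration(ctx context.Context, result string) { /* counter.Add with result attribute */ }

// registerSite records both outcomes so operators can alert on churn.
func registerSite(ctx context.Context, register func(context.Context) error) error {
	if err := register(ctx); err != nil {
		// Pangolin rejected the registration or credential exchange.
		incSiteRegistration(ctx, "failure")
		return err
	}
	incSiteRegistration(ctx, "success")
	return nil
}
```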
Metrics to add before release
Prioritised additions that would close visibility gaps:
- WebSocket disconnect outcomes – A counter (e.g., `newt_websocket_disconnects_total`) partitioned by `reason` would complement the existing connect latency histogram and explain reconnect storms (see the sketch after this list).
- Keepalive/heartbeat failures – Counting ping timeouts or heartbeat misses would make `newt_site_last_heartbeat_seconds` actionable by providing discrete events.
- Proxy connection lifecycle – Add counters/histograms for proxy accept events and connection durations to correlate drops with load and backlog metrics.
- Blueprint/config pull latency – Measuring Pangolin blueprint fetch durations and HTTP status distribution would expose slow control-plane operations.
- Certificate rotation attempts – Complement `newt_cert_rotation_total` with a duration histogram to observe slow PKI updates and detect stuck rotations.
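Two of these additions sketched against the metric API; the bucket boundaries, unit, and descriptions are illustrative choices, not settled values:

```go
package telemetry

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// registerProposedInstruments sketches two of the suggested additions.
func registerProposedInstruments() (metric.Int64Counter, metric.Float64Histogram, error) {
	meter := otel.Meter("newt")

	disconnects, err := meter.Int64Counter(
		"newt_websocket_disconnects_total",
		metric.WithDescription("WebSocket disconnects partitioned by reason."),
	)
	if err != nil {
		return nil, nil, err
	}

	rotation, err := meter.Float64Histogram(
		"newt_cert_rotation_seconds",
		metric.WithDescription("Certificate rotation duration."),
		metric.WithUnit("s"),
		metric.WithExplicitBucketBoundaries(0.1, 0.5, 1, 5, 15, 60),
	)
	return disconnects, rotation, err
}

// recordDisconnect tags each disconnect with a low-cardinality reason
// so it passes the attribute allow-list described earlier.
func recordDisconnect(ctx context.Context, disconnects metric.Int64Counter, reason string) {
	disconnects.Add(ctx, 1, metric.WithAttributes(attribute.String("reason", reason)))
}
```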
These metrics rely on data that is already available in the code paths mentioned above and would round out operational dashboards.
Tracing wishlist
To benefit from tracing when OTLP is active, add spans around:
- Pangolin REST calls (wrap the HTTP client with `otelhttp.NewTransport`; see the sketch after this list).
- Docker discovery cycles and target registration callbacks.
- WireGuard reconfiguration (interface bring-up, peer updates).
- Proxy dial/accept loops for both TCP and UDP targets.
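For the first item, a tracing transport is a small change to client construction; the package layout and timeout are assumptions:

```go
package pangolin

import (
	"net/http"
	"time"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// newHTTPClient gives the Pangolin REST client a tracing transport:
// each outgoing request gets a client span and trace-context headers.
func newHTTPClient() *http.Client {
	return &http.Client{
		Transport: otelhttp.NewTransport(http.DefaultTransport),
		Timeout:   30 * time.Second, // illustrative
	}
}
```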
Capturing these stages will let operators correlate latency spikes with reconnects and proxy drops using distributed traces in addition to the metric signals.