fix(metrics): enhance documentation clarity and structure for metrics recommendations

Marc Schäfer
2025-10-10 14:17:24 +02:00
parent b62e18622e
commit 8d0e6be2c7


@@ -3,6 +3,7 @@
This document captures the current state of Newt metrics, prioritized fixes, and a pragmatic roadmap for near-term improvements.
1) Current setup (summary)
- Export: Prometheus exposition (default), optional OTLP (gRPC)
- Existing instruments:
  - Sites: newt_site_registrations_total, newt_site_online (0/1), newt_site_last_heartbeat_seconds
@@ -12,6 +13,7 @@ This document captures the current state of Newt metrics, prioritized fixes, and
  - Go runtime: GC, heap, goroutines via runtime instrumentation
2) Main issues addressed now
- Attribute filter (allow-list) extended to include site_id and region in addition to existing keys (tunnel_id, transport, protocol, direction, result, reason, error_type, version, commit).
- site_id and region propagation: site_id is now attached as a metric label across newt_*; region is added as a metric label when set. Both remain resource attributes for consistency with OTEL.
- Label semantics clarified:
@@ -21,11 +23,13 @@ This document captures the current state of Newt metrics, prioritized fixes, and
- Robustness improvements: removed duplicate clear logic on reconnect; avoided empty site_id by reading NEWT_SITE_ID/NEWT_ID and OTEL_RESOURCE_ATTRIBUTES.
3) Remaining gaps and deviations
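
The site_id fallback described in the last bullet can be sketched as follows. This is a minimal illustration, not Newt's actual code: the lookup order (NEWT_SITE_ID, then NEWT_ID, then OTEL_RESOURCE_ATTRIBUTES), the resource attribute key `site_id`, and the helper names `resolveSiteID` and `resourceAttr` are assumptions made for the example. Percent-decoding of attribute values is not handled.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// resolveSiteID returns the first non-empty site identifier it finds.
// The lookup order (NEWT_SITE_ID, then NEWT_ID, then the "site_id" key in
// OTEL_RESOURCE_ATTRIBUTES) is an assumption made for this sketch.
func resolveSiteID(getenv func(string) string) string {
	for _, key := range []string{"NEWT_SITE_ID", "NEWT_ID"} {
		if v := strings.TrimSpace(getenv(key)); v != "" {
			return v
		}
	}
	return resourceAttr(getenv("OTEL_RESOURCE_ATTRIBUTES"), "site_id")
}

// resourceAttr extracts one value from a comma-separated key=value list,
// the format OpenTelemetry defines for OTEL_RESOURCE_ATTRIBUTES.
func resourceAttr(raw, want string) string {
	for _, pair := range strings.Split(raw, ",") {
		k, v, ok := strings.Cut(pair, "=")
		if ok && strings.TrimSpace(k) == want {
			return strings.TrimSpace(v)
		}
	}
	return ""
}

func main() {
	id := resolveSiteID(os.Getenv)
	if id == "" {
		fmt.Println("site_id is empty; set NEWT_SITE_ID or NEWT_ID")
		return
	}
	fmt.Println("site_id =", id)
}
```

Passing the environment lookup as a function keeps the resolution logic testable without mutating the process environment.
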
- Some call sites still need the initiator label (client vs server) on reconnect outcomes. This is planned.
- WebSocket and Proxy metrics (connect latency, messages, active connections, buffer/drops, async backlog) are planned additions.
- Config apply duration and cert rotation counters are planned.
4) Roadmap (phased)
- Phase 1 (done in this iteration)
  - Fix attribute filter (site_id, region)
  - Propagate site_id (and optional region) across metrics
@@ -38,11 +42,13 @@ This document captures the current state of Newt metrics, prioritized fixes, and
  - Config & PKI: newt_config_apply_seconds{phase,result}; newt_cert_rotation_total{result}
5) Operational guidance
- Do not double scrape: scrape either Newt (/metrics) or the Collector's Prometheus exporter, not both, to avoid double-counting cumulative counters.
- For high-cardinality tunnel_id, consider relabeling or dropping per-tunnel series in Prometheus.
- OTLP troubleshooting: set the trusted CA certificate for TLS via OTEL_EXPORTER_OTLP_CERTIFICATE, pass auth via OTEL_EXPORTER_OTLP_HEADERS, and verify that the endpoint is reachable.
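
For the reachability check in the OTLP bullet, a plain TCP connect is often enough to separate network problems from TLS or auth problems. The sketch below is one way to do that, under stated assumptions: it reads the standard OTEL_EXPORTER_OTLP_ENDPOINT variable, falls back to `localhost:4317` (the conventional OTLP/gRPC port), and uses a hypothetical helper named `dialTarget`. It does not validate TLS or headers.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"os"
	"strings"
	"time"
)

// dialTarget turns an OTLP endpoint such as "https://collector:4317" or
// "collector:4317" into a host:port string suitable for net.Dial.
func dialTarget(endpoint string) (string, error) {
	if !strings.Contains(endpoint, "://") {
		// A placeholder scheme lets url.Parse recognize the host part.
		endpoint = "grpc://" + endpoint
	}
	u, err := url.Parse(endpoint)
	if err != nil {
		return "", err
	}
	if u.Hostname() == "" {
		return "", fmt.Errorf("no host in endpoint %q", endpoint)
	}
	port := u.Port()
	if port == "" {
		port = "4317" // assumption: default to the conventional OTLP/gRPC port
	}
	return net.JoinHostPort(u.Hostname(), port), nil
}

func main() {
	endpoint := os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT")
	if endpoint == "" {
		endpoint = "localhost:4317"
	}
	target, err := dialTarget(endpoint)
	if err != nil {
		fmt.Println("invalid endpoint:", err)
		return
	}
	// Only TCP reachability is tested here; TLS and auth are not exercised.
	conn, err := net.DialTimeout("tcp", target, 3*time.Second)
	if err != nil {
		fmt.Println("unreachable:", err)
		return
	}
	conn.Close()
	fmt.Println("TCP connect OK:", target)
}
```

If the TCP connect succeeds but export still fails, the certificate and header settings are the next things to check.
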
6) Example alerts/recording rules (suggestions)
- Reconnect spikes:
  - sum by (site_id) (increase(newt_tunnel_reconnects_total[5m]))
- Sustained connection errors:
@@ -55,12 +61,12 @@ This document captures the current state of Newt metrics, prioritized fixes, and
  - histogram_quantile(0.95, sum(rate(newt_websocket_connect_latency_seconds_bucket[5m])) by (le,site_id))
7) Collector configuration
- Direct scrape variant requires no attribute promotion since site_id is already a metric label.
- Transform/promote variant remains optional for environments that rely on resource-to-label promotion.
8) Testing
- curl -s http://localhost:2112/metrics | grep '^newt_'
- Verify presence of site_id across series; region appears when set.
- Ensure disallowed attributes are filtered out and allowed ones (e.g. site_id) are retained.
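
The site_id check above can be automated with a small scan over the scraped text. The following is a rough sketch, not part of Newt: `missingSiteID` and `hasLabel` are hypothetical helpers, the sample input and its label values are made up for illustration, and only the plain Prometheus text format is handled (no OpenMetrics exemplars).

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// hasLabel reports whether a label block such as `a="1",site_id="x"}`
// contains the given key. Quoted values are skipped with backslash escapes
// honored, so a key name inside a value does not count as a match.
func hasLabel(labels, key string) bool {
	for labels != "" && labels[0] != '}' {
		eq := strings.Index(labels, "=\"")
		if eq < 0 {
			return false
		}
		if strings.TrimSpace(labels[:eq]) == key {
			return true
		}
		i := eq + 2
		for i < len(labels) && labels[i] != '"' {
			if labels[i] == '\\' {
				i++ // skip the escaped character
			}
			i++
		}
		if i >= len(labels) {
			return false // unterminated value
		}
		labels = strings.TrimLeft(labels[i+1:], ", ")
	}
	return false
}

// missingSiteID returns the names of newt_* series that lack a site_id label.
func missingSiteID(exposition string) []string {
	var missing []string
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, "newt_") {
			continue // skips # HELP / # TYPE comments and non-Newt series
		}
		name, labels := line, ""
		if i := strings.IndexAny(line, "{ "); i >= 0 {
			name = line[:i]
			if line[i] == '{' {
				labels = line[i+1:]
			}
		}
		if !hasLabel(labels, "site_id") {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Illustrative sample; in practice pass the body scraped from /metrics.
	sample := `# TYPE newt_site_online gauge
newt_site_online{site_id="site-a"} 1
newt_tunnel_reconnects_total{tunnel_id="t1"} 3
go_goroutines 12
`
	fmt.Println(missingSiteID(sample)) // prints [newt_tunnel_reconnects_total]
}
```

An empty result means every newt_* series in the scrape carries site_id; any names returned point at call sites that still drop the label.
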