mirror of
https://github.com/fosrl/newt.git
synced 2026-03-26 20:46:41 +00:00
fix(metrics): enhance documentation clarity and structure for metrics recommendations
@@ -3,64 +3,70 @@
This document captures the current state of Newt metrics, prioritized fixes, and a pragmatic roadmap for near-term improvements.

1) Current setup (summary)

- Export: Prometheus exposition (default), optional OTLP (gRPC)
- Existing instruments:
  - Sites: newt_site_registrations_total, newt_site_online (0/1), newt_site_last_heartbeat_seconds
  - Tunnel/Traffic: newt_tunnel_sessions, newt_tunnel_bytes_total, newt_tunnel_latency_seconds, newt_tunnel_reconnects_total
  - Connection lifecycle: newt_connection_attempts_total, newt_connection_errors_total
  - Operations: newt_config_reloads_total, newt_restart_count_total, newt_build_info
  - Go runtime: GC, heap, goroutines via runtime instrumentation

2) Main issues addressed now

- Attribute filter (allow-list) extended to include site_id and region in addition to existing keys (tunnel_id, transport, protocol, direction, result, reason, error_type, version, commit).
- site_id and region propagation: site_id is now attached as a metric label across newt_*; region is added as a metric label when set. Both remain resource attributes for consistency with OTEL.
- Label semantics clarified:
  - transport: control-plane mechanism (e.g., websocket, wireguard)
  - protocol: L4 payload type (tcp, udp)
  - newt_tunnel_bytes_total uses protocol and direction, not transport.
- Robustness improvements: removed duplicate clear logic on reconnect; avoided empty site_id by reading NEWT_SITE_ID/NEWT_ID and OTEL_RESOURCE_ATTRIBUTES.

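The allow-list filter and the site_id fallback described in section 2 can be sketched as follows. Newt itself is written in Go, so this Python version only illustrates the logic and is not the project's implementation. The function names are hypothetical, and the lookup order (NEWT_SITE_ID, then NEWT_ID, then the site_id key in OTEL_RESOURCE_ATTRIBUTES) is an assumption.

```python
import os

# Allow-list from section 2: the existing keys plus site_id and region.
ALLOWED_ATTRIBUTES = frozenset({
    "tunnel_id", "transport", "protocol", "direction", "result",
    "reason", "error_type", "version", "commit", "site_id", "region",
})


def filter_attributes(attrs):
    """Keep only allow-listed attribute keys; everything else is dropped."""
    return {k: v for k, v in attrs.items() if k in ALLOWED_ATTRIBUTES}


def parse_resource_attributes(raw):
    """Parse an OTEL_RESOURCE_ATTRIBUTES string of the form 'k1=v1,k2=v2'."""
    out = {}
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip():
            out[key.strip()] = value.strip()
    return out


def resolve_site_id(env=None):
    """Return the first non-empty site_id candidate, or '' if none is set.

    The lookup order is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    for name in ("NEWT_SITE_ID", "NEWT_ID"):
        value = env.get(name, "").strip()
        if value:
            return value
    resource = parse_resource_attributes(env.get("OTEL_RESOURCE_ATTRIBUTES", ""))
    return resource.get("site_id", "")
```

Passing a plain dict as `env` keeps the lookup easy to test without touching the real process environment.
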
3) Remaining gaps and deviations

- Some call sites still need initiator label on reconnect outcomes (client vs server). This is planned.
- WebSocket and Proxy metrics (connect latency, messages, active connections, buffer/drops, async backlog) are planned additions.
- Config apply duration and cert rotation counters are planned.

4) Roadmap (phased)

- Phase 1 (done in this iteration)
  - Fix attribute filter (site_id, region)
  - Propagate site_id (and optional region) across metrics
  - Correct label semantics (transport vs protocol); fix sessions transport labelling
  - Documentation alignment
- Phase 2 (next)
  - WebSocket: newt_websocket_connect_latency_seconds; newt_websocket_messages_total{direction,msg_type}
  - Proxy: newt_proxy_active_connections, newt_proxy_buffer_bytes, newt_proxy_drops_total, newt_proxy_async_backlog_bytes
  - Reconnect: add initiator label (client/server)
  - Config & PKI: newt_config_apply_seconds{phase,result}; newt_cert_rotation_total{result}

5) Operational guidance

- Do not double scrape: scrape either Newt (/metrics) or the Collector’s Prometheus exporter (not both) to avoid double-counting cumulative counters.
- For high cardinality tunnel_id, consider relabeling or dropping per-tunnel series in Prometheus to control cardinality.
- OTLP troubleshooting: enable TLS via OTEL_EXPORTER_OTLP_CERTIFICATE, use OTEL_EXPORTER_OTLP_HEADERS for auth; verify endpoint reachability.

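To see what dropping the tunnel_id label does to cardinality, the sketch below sums samples across series that differ only in that label, roughly what `sum without (tunnel_id)` does in PromQL. The sample values and the `drop_label` helper are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def drop_label(samples, label):
    """Sum sample values across series that differ only in `label`.

    `samples` is a list of (labels_dict, value) pairs for a single metric.
    The result maps a sorted tuple of the remaining labels to the summed value.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != label))
        totals[key] += value
    return dict(totals)


# Hypothetical newt_tunnel_bytes_total samples: three series before aggregation.
samples = [
    ({"site_id": "s1", "tunnel_id": "t1", "protocol": "tcp"}, 100.0),
    ({"site_id": "s1", "tunnel_id": "t2", "protocol": "tcp"}, 50.0),
    ({"site_id": "s1", "tunnel_id": "t1", "protocol": "udp"}, 7.0),
]
aggregated = drop_label(samples, "tunnel_id")  # two series remain
```

The series count after aggregation depends only on the remaining labels (here site_id and protocol), not on the number of tunnels.
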
6) Example alerts/recording rules (suggestions)

- Reconnect spikes:
  - sum by (site_id) (increase(newt_tunnel_reconnects_total[5m]))
- Sustained connection errors:
  - sum by (site_id,transport,error_type) (rate(newt_connection_errors_total[5m]))
- Heartbeat gaps:
  - max by (site_id) (max_over_time(newt_site_last_heartbeat_seconds[15m]))
- Proxy drops (when added):
  - sum by (site_id,protocol) (increase(newt_proxy_drops_total[5m]))
- WebSocket connect p95 (when added):
  - histogram_quantile(0.95, sum(rate(newt_websocket_connect_latency_seconds_bucket[5m])) by (le,site_id))

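For readers unfamiliar with how a p95 is derived from histogram buckets, here is a simplified sketch of the linear interpolation that histogram_quantile performs over cumulative `le` buckets. The bucket bounds and counts in the comment are made up for illustration, and the sketch ignores edge cases that Prometheus handles (for example native histograms or malformed buckets).

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound; the last bound must be math.inf (the +Inf bucket).
    """
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break
    if upper == math.inf:
        # Quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    if cumulative == below:
        return upper  # empty bucket (only reachable for q == 0)
    # Linear interpolation inside the bucket that contains the rank.
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


# Hypothetical connect-latency buckets (seconds, cumulative counts):
# histogram_quantile(0.95, [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)])
```

Because the estimate is interpolated within a bucket, its accuracy depends on how finely the bucket bounds cover the latency range of interest.
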
7) Collector configuration

- Direct scrape variant requires no attribute promotion since site_id is already a metric label.
- Transform/promote variant remains optional for environments that rely on resource-to-label promotion.

8) Testing

- curl -s localhost:2112/metrics | grep '^newt_'
- Verify presence of site_id across series; region appears when set.
- Ensure disallowed attributes are filtered; allowed (site_id) retained.

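The site_id check can also be automated. The sketch below parses Prometheus text exposition output and lists the newt_* sample lines that lack a given label. The helper name and the regular expressions are assumptions written for this example; they cover the common `name{labels} value` form only.

```python
import re

# Matches 'name{labels} value' or 'name value' sample lines.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# Matches one label pair such as site_id="abc" (with escaped characters).
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def series_missing_label(exposition_text, label="site_id", prefix="newt_"):
    """Return the sample lines for `prefix` metrics that lack `label`."""
    missing = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match or not match.group(1).startswith(prefix):
            continue
        labels = dict(LABEL_RE.findall(match.group(2) or ""))
        if not labels.get(label):
            missing.append(line)
    return missing
```

Feeding it the body of the /metrics response should return an empty list once site_id is attached to every newt_* series; an empty label value is treated as missing.
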