diff --git a/docs.json b/docs.json
index 6250c70..fb1a4fa 100644
--- a/docs.json
+++ b/docs.json
@@ -113,7 +113,8 @@
           "self-host/advanced/container-cli-tool",
           "self-host/advanced/database-options",
           "self-host/advanced/integration-api",
-          "self-host/advanced/enable-geoblocking"
+          "self-host/advanced/enable-geoblocking",
+          "self-host/advanced/metrics"
         ]
       },
       {
@@ -201,6 +202,23 @@
       "apiHost": "https://pangolin.net/relay-O7yI"
     }
   },
+  "contextual": {
+    "options": [
+      "copy",
+      "view",
+      "chatgpt",
+      "claude",
+      "perplexity",
+      "mcp",
+      "vscode",
+      {
+        "title": "Request a feature",
+        "description": "Open a GitHub discussion to request a new feature",
+        "icon": "plus",
+        "href": "https://github.com/fosrl/pangolin/discussions"
+      }
+    ]
+  },
   "redirects": [
     {
       "source": "/telemetry",
diff --git a/self-host/advanced/metrics.mdx b/self-host/advanced/metrics.mdx
new file mode 100644
index 0000000..3ca9d56
--- /dev/null
+++ b/self-host/advanced/metrics.mdx
@@ -0,0 +1,795 @@
+---
+title: "Metrics"
+description: "Enable and consume OpenTelemetry & vendor-specific metrics"
+---
+
+We provide metrics in the **OpenTelemetry** (OTel) format and additionally support the following vendor backends:
+
+* **Prometheus** (native scrape and via OTel Collector)
+
+## Why Metrics & OTel
+
+Observability enables:
+
+1. **Incident detection** (latency spikes, reconnect storms)
+2. **Capacity planning** (bytes, active sessions)
+3. **User‑experience SLAs** (p95 tunnel latency, auth latency)
+4. **Faster RCA** (dimensions like `error_type`, `result`)
+
+OpenTelemetry provides a **vendor‑neutral** pipeline, so you can change backends without changing the instrumented code.
+
+## Availability
+
+Metrics are available starting with the releases listed below and are disabled by default.
+
+- Newt: metrics implemented since Newt 1.6.0
+
+## OpenTelemetry
+
+Push metrics and traces to an **OTel Collector** or any backend that accepts OTLP.
+
+
+If you only use Prometheus scraping, leave `*_METRICS_OTLP_ENABLED=false` and omit the OTLP variables.
+
+
+
+The OTel Collector commonly uses port 4317 for gRPC and 4318 for HTTP. Set `OTEL_EXPORTER_OTLP_PROTOCOL` to `http/protobuf` for HTTP or `grpc` for gRPC, and point `OTEL_EXPORTER_OTLP_ENDPOINT` at the matching port.
+For further customization, see the [OTel Collector documentation](https://opentelemetry.io/docs/collector/).
+
+
+
+
+
+
+    ```text
+    NEWT_METRICS_OTLP_ENABLED=true # enable OTLP exporter
+    OTEL_EXPORTER_OTLP_ENDPOINT=otel-collector:4317
+    OTEL_EXPORTER_OTLP_INSECURE=true # or false + TLS vars
+    OTEL_METRIC_EXPORT_INTERVAL=15s
+    # Optional auth / TLS
+    OTEL_EXPORTER_OTLP_HEADERS=authorization=Bearer%20XYZ
+    OTEL_EXPORTER_OTLP_CERTIFICATE=/etc/otel/ca.pem
+    ```
+
+
+    ```text
+    # --metrics-otlp-enabled and --otel are aliases; pass either one
+    newt \
+      --metrics-otlp-enabled=true \
+      --otel-exporter-otlp-endpoint=otel-collector:4317 \
+      --otel-exporter-otlp-insecure=true \
+      --otel-metric-export-interval=15s \
+      --otel-exporter-otlp-headers=authorization=Bearer%20XYZ \
+      --otel-exporter-otlp-certificate=/etc/otel/ca.pem
+    ```
+    See the [CLI reference](../../manage/sites/configure-site) for all available flags.
+
+
+
+
+
+
+
+
+    ```bash
+    # Enable OTLP exporters and point to your Collector's gRPC receiver.
+    export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"
+    export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
+
+    newt \
+      --otel=true \
+      --id saz281jfa8z37zg \
+      --secret ssfdfsder33rrerrwe \
+      --endpoint https://pangolin.example.com
+    ```
+
+
+
+
+    ```yaml title="docker-compose.metrics.yaml"
+    services:
+      otel-collector:
+        image: ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:latest # DO NOT use 'latest' in production
+        command: ["--config=/etc/otel/config.yaml"]
+        volumes:
+          - ./otel-config.yaml:/etc/otel/config.yaml:ro
+        ports:
+          - "4317:4317" # gRPC
+          - "4318:4318" # HTTP
+          - "8889:8889" # Prometheus exporter (from the Collector) - Optional
+
+      newt:
+        image: fosrl/newt:latest # DO NOT use 'latest' in production
+        environment:
+          NEWT_METRICS_OTLP_ENABLED: "true"
+          OTEL_EXPORTER_OTLP_ENDPOINT: otel-collector:4317
+          OTEL_EXPORTER_OTLP_INSECURE: "true"
+          PANGOLIN_ENDPOINT: https://example.com
+          NEWT_ID: heresmynewtid
+          NEWT_SECRET: yoursupersecretkeyhere
+    ```
+
+    ```yaml title="otel-config.yaml"
+    receivers:
+      otlp:
+        protocols:
+          grpc:
+            endpoint: 0.0.0.0:4317
+          http:
+            endpoint: 0.0.0.0:4318
+
+    processors: {}
+
+    # Example exporters:
+    exporters:
+      otlp:
+        endpoint: otel-collector:4317 # replace with your upstream OTLP endpoint; unused in the pipeline below
+        tls:
+          insecure: true
+      prometheus:
+        endpoint: "0.0.0.0:8889"
+
+    service:
+      pipelines:
+        metrics:
+          receivers: [otlp]
+          processors: []
+          exporters: [prometheus]
+    ```
+
+    Forward to Remote Write Backend
+
+    ```yaml title="otel-config-remote.yaml"
+    processors:
+      batch: {} # must be defined because the pipeline below references it
+    exporters:
+      prometheusremotewrite:
+        endpoint: https://prom-remote.example.com/api/v1/write
+        headers:
+          X-Scope-OrgID: tenant-a
+        tls:
+          insecure_skip_verify: false
+    service:
+      pipelines:
+        metrics/remote:
+          receivers: [otlp]
+          processors: [batch]
+          exporters: [prometheusremotewrite]
+    ```
+
+    Combine exporters (e.g. local Prometheus + remote write) to keep fast local dashboards while shipping data to an external backend for long‑term retention.
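+
+    As a minimal sketch of that combination, a single metrics pipeline can fan out to both exporters. The file name `otel-config-combined.yaml` is a hypothetical choice for illustration; the endpoints and tenant header are the placeholder values from the fragments above, not verified defaults.
+
+    ```yaml title="otel-config-combined.yaml"
+    # Hypothetical combined config: one OTLP receiver, two exporters.
+    receivers:
+      otlp:
+        protocols:
+          grpc:
+            endpoint: 0.0.0.0:4317
+
+    processors:
+      batch: {} # batches data points before export
+
+    exporters:
+      prometheus: # local scrape endpoint for dashboards
+        endpoint: "0.0.0.0:8889"
+      prometheusremotewrite: # long-term storage (placeholder URL)
+        endpoint: https://prom-remote.example.com/api/v1/write
+        headers:
+          X-Scope-OrgID: tenant-a
+
+    service:
+      pipelines:
+        metrics:
+          receivers: [otlp]
+          processors: [batch]
+          exporters: [prometheus, prometheusremotewrite]
+    ```
+
+    Listing both exporters in one pipeline sends every data point to each of them, so the local and remote views are built from the same stream.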
+
+
+
+
+
+
+
+
+## Prometheus (without OTel Collector)
+
+
+
+
+Each service serves `/metrics` on its admin HTTP address (Newt's default is `:2112`).
+
+
+
+    ```text
+    NEWT_METRICS_PROMETHEUS_ENABLED=true # /metrics endpoint
+    NEWT_ADMIN_ADDR=:2112 # admin HTTP address
+    ```
+
+
+    ```text
+    # --metrics-prometheus-enabled and --metrics are aliases; pass either one
+    newt \
+      --metrics-prometheus-enabled=true \
+      --admin-addr=:2112
+    ```
+    See the [CLI reference](../../manage/sites/configure-site) for all available flags.
+
+
+
+
+
+
+
+    ```bash
+    newt \
+      --metrics-prometheus-enabled=true \
+      --admin-addr=:2112 \
+      --id saz281jfa8z37zg \
+      --secret ssfdfsder33rrerrwe \
+      --endpoint https://pangolin.example.com
+    ```
+
+
+    ```yaml title="docker-compose.metrics.yaml"
+    services:
+      newt:
+        image: fosrl/newt:latest # DO NOT use 'latest' in production
+        environment:
+          NEWT_METRICS_PROMETHEUS_ENABLED: "true"
+          NEWT_ADMIN_ADDR: ":2112"
+          PANGOLIN_ENDPOINT: https://example.com
+          NEWT_ID: saz281jfa8z37zg
+          NEWT_SECRET: ssfdfsder33rrerrwe
+    ```
+
+
+    ```yaml title="prometheus.yml (fragment)"
+    scrape_configs:
+      - job_name: pangolin
+        static_configs: [{ targets: ["newt:2112"] }]
+    ```
+
+
+
+
+
+## Full Metric Reference
+
+**Version 1.0.0 from 2025-10-28**
+
+Below are the currently implemented metrics for **Newt**.
+
+* **Metric**: exact metric name
+* **Instrument & unit**: OTel instrument type and canonical unit
+* **Purpose**: what the metric conveys / recommended use
+* **Emission path**: subsystem responsible (for troubleshooting missing data)
+* **Example series**: representative sample including labels
+
+
+Names/labels can change between major versions. Avoid hard‑coding full label sets in alerts; prefer existence checks and aggregate functions.
+
+
+### Newt metrics
+
+
+
+
+    OpenTelemetry metric instruments exposed by Newt. Expand each section to see individual metrics with labels, units, emission points, and examples.
+
+
+
+    Counts Pangolin registration attempts keyed by result.
+
+    **Unit:** 1
+    **Labels:** `result` (`success`|`failure`), `site_id`
+    **Emission path:** `telemetry.IncSiteRegistration`
+    **Example:** `newt_site_registrations_total{result="success",site_id="abc"} 1`
+
+
+
+
+    0/1 heartbeat for the active site.
+
+    **Unit:** 1
+    **Labels:** `site_id`
+    **Emission path:** `state.TelemetryView` (callback)
+    **Example:** `newt_site_online{site_id="self"} 1`
+
+
+
+
+    Seconds since last Pangolin heartbeat.
+
+    **Unit:** seconds
+    **Labels:** `site_id`
+    **Emission path:** `TouchHeartbeat` (callback)
+    **Example:** `newt_site_last_heartbeat_seconds{site_id="self"} 3.2`
+
+
+
+
+    Constant 1 with build metadata labels.
+
+    **Unit:** 1
+    **Labels:** `version`, `commit`
+    **Emission path:** Build info registration
+    **Example:** `newt_build_info{version="1.2.3",commit="abc123"} 1`
+
+
+
+
+    Process boot indicator (increments once per process start).
+
+    **Unit:** 1
+    **Labels:** none
+    **Emission path:** `RegisterBuildInfo`
+    **Example:** `newt_restart_count_total 1`
+
+
+
+
+    Certificate rotation events keyed by result.
+
+    **Unit:** 1
+    **Labels:** `result`
+    **Emission path:** `IncCertRotation`
+    **Example:** `newt_cert_rotation_total{result="success"} 1`
+
+
+
+
+    Config reload attempts keyed by result.
+
+    **Unit:** 1
+    **Labels:** `result`
+    **Emission path:** `telemetry.IncConfigReload`
+    **Example:** `newt_config_reloads_total{result="success"} 1`
+
+
+
+
+    Duration per config-apply phase keyed by `phase` and `result`.
+
+    **Unit:** seconds
+    **Labels:** `phase`, `result`
+    **Emission path:** `telemetry.ObserveConfigApply`
+    **Example:** `newt_config_apply_seconds_bucket{phase="peer",result="success",le="0.1"} 3`
+
+
+
+
+
+
+    Active sessions per tunnel (or collapsed).
+
+    **Unit:** 1
+    **Labels:** `site_id`, `tunnel_id`
+    **Emission path:** `RegisterStateView`
+    **Example:** `newt_tunnel_sessions{site_id="self",tunnel_id="wgpub"} 2`
+
+
+
+
+    Traffic per tunnel, direction, and protocol.
+
+    **Unit:** bytes
+    **Labels:** `tunnel_id`, `direction` (`ingress`|`egress`), `protocol` (`tcp`|`udp`)
+    **Emission path:** Proxy manager
+    **Example:** `newt_tunnel_bytes_total{direction="egress",protocol="tcp",tunnel_id="wgpub"} 8192`
+
+
+
+
+    RTT samples per tunnel/transport.
+
+    **Unit:** seconds
+    **Labels:** `tunnel_id`, `transport`
+    **Emission path:** Health checks
+    **Example:** `newt_tunnel_latency_seconds_bucket{transport="wireguard",le="0.05",tunnel_id="wgpub"} 4`
+
+
+
+
+    Reconnect attempts keyed by initiator & reason.
+
+    **Unit:** 1
+    **Labels:** `tunnel_id`, `initiator` (`client`|`server`), `reason`
+    **Emission path:** `telemetry.IncReconnect`
+    **Example:** `newt_tunnel_reconnects_total{initiator="client",reason="timeout",tunnel_id="wgpub"} 3`
+
+
+
+
+
+
+    Auth/WebSocket connection attempts keyed by transport & result.
+
+    **Unit:** 1
+    **Labels:** `transport`, `result`
+    **Emission path:** `telemetry.IncConnAttempt`
+    **Example:** `newt_connection_attempts_total{transport="websocket",result="failure"} 2`
+
+
+
+
+    Connection errors keyed by transport and type.
+
+    **Unit:** 1
+    **Labels:** `transport`, `error_type`
+    **Emission path:** `telemetry.IncConnError`
+    **Example:** `newt_connection_errors_total{transport="auth",error_type="auth_failed"} 1`
+
+
+
+
+
+
+    Dial latency for Pangolin WebSocket.
+
+    **Unit:** seconds
+    **Labels:** `result`, `transport`
+    **Emission path:** `ObserveWSConnectLatency`
+    **Example:** `newt_websocket_connect_latency_seconds_bucket{result="success",transport="websocket",le="0.5"} 1`
+
+
+
+
+    WebSocket disconnects keyed by reason.
+
+    **Unit:** 1
+    **Labels:** `reason`, `tunnel_id`
+    **Emission path:** `IncWSDisconnect`
+    **Example:** `newt_websocket_disconnects_total{reason="remote_close",tunnel_id="wgpub"} 2`
+
+
+
+
+    Ping/Pong failures observed by keepalive.
+
+    **Unit:** 1
+    **Labels:** `reason` (e.g., `ping_write`, `pong_timeout`)
+    **Emission path:** `telemetry.IncWSKeepaliveFailure(ctx, "ping_write")`
+    **Example:** `newt_websocket_keepalive_failures_total{reason="ping_write"} 1`
+
+
+
+
+    Duration of established WS sessions keyed by result.
+
+    **Unit:** seconds
+    **Labels:** `result` (`success`|`error`)
+    **Emission path:** `telemetry.ObserveWSSessionDuration(ctx, time.Since(start).Seconds(), "error")`
+    **Example:** `newt_websocket_session_duration_seconds_bucket{result="error",le="60"} 3`
+
+
+
+
+    Current WS connection state (0/1).
+
+    **Unit:** 1
+    **Labels:** none
+    **Emission path:** `telemetry.SetWSConnectionState(true|false)`
+    **Example:** `newt_websocket_connected 1`
+
+
+
+
+    WebSocket reconnect attempts keyed by reason.
+
+    **Unit:** 1
+    **Labels:** `reason`
+    **Emission path:** `telemetry.IncWSReconnect(ctx, "ping_write")`
+    **Example:** `newt_websocket_reconnects_total{reason="ping_write"} 1`
+
+
+
+
+    In/out WS messages keyed by direction & type.
+
+    **Unit:** 1
+    **Labels:** `direction` (`in`|`out`), `msg_type` (`ping`|`pong`|`text`|...)
+    **Emission path:** `IncWSMessage`
+    **Example:** `newt_websocket_messages_total{direction="out",msg_type="ping"} 4`
+
+
+
+
+
+
+    Active TCP/UDP proxy connections per tunnel/protocol.
+
+    **Unit:** 1
+    **Labels:** `protocol`, `tunnel_id`
+    **Emission path:** Proxy callback
+    **Example:** `newt_proxy_active_connections{protocol="tcp",tunnel_id="wgpub"} 3`
+
+
+
+
+    Proxy buffer pool size.
+
+    **Unit:** bytes
+    **Labels:** `protocol`, `tunnel_id`
+    **Emission path:** Proxy callback
+    **Example:** `newt_proxy_buffer_bytes{protocol="tcp",tunnel_id="wgpub"} 10240`
+
+
+
+
+    Unflushed async byte backlog.
+
+    **Unit:** bytes
+    **Labels:** `protocol`, `tunnel_id`
+    **Emission path:** Proxy callback
+    **Example:** `newt_proxy_async_backlog_bytes{protocol="udp",tunnel_id="wgpub"} 4096`
+
+
+
+
+    Proxy write drops keyed by protocol/tunnel.
+
+    **Unit:** 1
+    **Labels:** `protocol`, `tunnel_id`
+    **Emission path:** `IncProxyDrops`
+    **Example:** `newt_proxy_drops_total{protocol="udp",tunnel_id="wgpub"} 2`
+
+
+
+
+    Proxy accept events keyed by result/reason.
+
+    **Unit:** 1
+    **Labels:** `tunnel_id`, `protocol`, `result`, `reason`
+    **Emission path:** `telemetry.IncProxyAccept(ctx, tunnelID, "tcp", "failure", "timeout")`
+    **Example:** `newt_proxy_accept_total{protocol="tcp",result="failure",reason="timeout"} 1`
+
+
+
+
+    Lifecycle events (opened/closed) per connection.
+
+    **Unit:** 1
+    **Labels:** `tunnel_id`, `protocol`, `event` (`opened`|`closed`)
+    **Emission path:** `telemetry.IncProxyConnectionEvent(ctx, tunnelID, "tcp", telemetry.ProxyConnectionOpened)`
+    **Example:** `newt_proxy_connections_total{protocol="tcp",event="opened"} 1`
+
+
+
+
+    Duration of completed proxy connections.
+
+    **Unit:** seconds
+    **Labels:** `tunnel_id`, `protocol`, `result`
+    **Emission path:** `telemetry.ObserveProxyConnectionDuration(ctx, tunnelID, "tcp", "success", seconds)`
+    **Example:** `newt_proxy_connection_duration_seconds_bucket{protocol="tcp",result="success",le="1"} 3`
+
+
+
+
+
+
+
+
+    Prometheus-style series for the same Newt metrics. Names, labels, and examples mirror the OTel tab.
+
+
+
+    Counts Pangolin registration attempts keyed by result.
+
+    **Labels:** `result`, `site_id` • **Unit:** 1 • **Path:** `telemetry.IncSiteRegistration`
+    **Example:** `newt_site_registrations_total{result="success",site_id="abc"} 1`
+
+
+
+
+    0/1 heartbeat for the active site.
+
+    **Labels:** `site_id` • **Unit:** 1 • **Path:** `state.TelemetryView`
+    **Example:** `newt_site_online{site_id="self"} 1`
+
+
+
+
+    Seconds since last Pangolin heartbeat.
+
+    **Labels:** `site_id` • **Unit:** seconds • **Path:** `TouchHeartbeat`
+    **Example:** `newt_site_last_heartbeat_seconds{site_id="self"} 3.2`
+
+
+
+
+    Constant 1 with build metadata labels.
+
+    **Labels:** `version`, `commit` • **Unit:** 1 • **Path:** Build info registration
+    **Example:** `newt_build_info{version="1.2.3",commit="abc123"} 1`
+
+
+
+
+    Process boot indicator (increments once).
+
+    **Labels:** none • **Unit:** 1 • **Path:** `RegisterBuildInfo`
+    **Example:** `newt_restart_count_total 1`
+
+
+
+
+    Certificate rotation events keyed by result.
+
+    **Labels:** `result` • **Unit:** 1 • **Path:** `IncCertRotation`
+    **Example:** `newt_cert_rotation_total{result="success"} 1`
+
+
+
+
+    Config reload attempts keyed by result.
+
+    **Labels:** `result` • **Unit:** 1 • **Path:** `telemetry.IncConfigReload`
+    **Example:** `newt_config_reloads_total{result="success"} 1`
+
+
+
+
+    Duration per config-apply phase & result.
+
+    **Labels:** `phase`, `result` • **Unit:** seconds • **Path:** `telemetry.ObserveConfigApply`
+    **Example:** `newt_config_apply_seconds_bucket{phase="peer",result="success",le="0.1"} 3`
+
+
+
+
+
+
+    Active sessions per tunnel (or collapsed).
+
+    **Labels:** `site_id`, `tunnel_id` • **Unit:** 1 • **Path:** `RegisterStateView`
+    **Example:** `newt_tunnel_sessions{site_id="self",tunnel_id="wgpub"} 2`
+
+
+
+
+    Traffic per tunnel/direction/protocol.
+
+    **Labels:** `tunnel_id`, `direction`, `protocol` • **Unit:** bytes • **Path:** Proxy manager
+    **Example:** `newt_tunnel_bytes_total{direction="egress",protocol="tcp",tunnel_id="wgpub"} 8192`
+
+
+
+
+    RTT samples per tunnel/transport.
+
+    **Labels:** `tunnel_id`, `transport` • **Unit:** seconds • **Path:** Health checks
+    **Example:** `newt_tunnel_latency_seconds_bucket{transport="wireguard",le="0.05",tunnel_id="wgpub"} 4`
+
+
+
+
+    Reconnect attempts by initiator & reason.
+
+    **Labels:** `tunnel_id`, `initiator`, `reason` • **Unit:** 1 • **Path:** `telemetry.IncReconnect`
+    **Example:** `newt_tunnel_reconnects_total{initiator="client",reason="timeout",tunnel_id="wgpub"} 3`
+
+
+
+
+
+
+    Auth/WebSocket attempts by transport & result.
+
+    **Labels:** `transport`, `result` • **Unit:** 1 • **Path:** `telemetry.IncConnAttempt`
+    **Example:** `newt_connection_attempts_total{transport="websocket",result="failure"} 2`
+
+
+
+
+    Connection errors by transport and type.
+
+    **Labels:** `transport`, `error_type` • **Unit:** 1 • **Path:** `telemetry.IncConnError`
+    **Example:** `newt_connection_errors_total{transport="auth",error_type="auth_failed"} 1`
+
+
+
+
+
+
+    Dial latency for Pangolin WebSocket.
+
+    **Labels:** `result`, `transport` • **Unit:** seconds • **Path:** `ObserveWSConnectLatency`
+    **Example:** `newt_websocket_connect_latency_seconds_bucket{result="success",transport="websocket",le="0.5"} 1`
+
+
+
+
+    WS disconnects by reason.
+
+    **Labels:** `reason`, `tunnel_id` • **Unit:** 1 • **Path:** `IncWSDisconnect`
+    **Example:** `newt_websocket_disconnects_total{reason="remote_close",tunnel_id="wgpub"} 2`
+
+
+
+
+    Keepalive Ping/Pong failures.
+
+    **Labels:** `reason` • **Unit:** 1 • **Path:** `telemetry.IncWSKeepaliveFailure(ctx, "ping_write")`
+    **Example:** `newt_websocket_keepalive_failures_total{reason="ping_write"} 1`
+
+
+
+
+    Duration of established WebSocket sessions by result.
+
+    **Labels:** `result` • **Unit:** seconds • **Path:** `telemetry.ObserveWSSessionDuration(...)`
+    **Example:** `newt_websocket_session_duration_seconds_bucket{result="error",le="60"} 3`
+
+
+
+
+    Current WS connection status (0/1).
+
+    **Labels:** none • **Unit:** 1 • **Path:** `telemetry.SetWSConnectionState(true|false)`
+    **Example:** `newt_websocket_connected 1`
+
+
+
+
+    Reconnect attempts by reason.
+
+    **Labels:** `reason` • **Unit:** 1 • **Path:** `telemetry.IncWSReconnect(ctx, "ping_write")`
+    **Example:** `newt_websocket_reconnects_total{reason="ping_write"} 1`
+
+
+
+
+    In/out WS messages by direction & type.
+
+    **Labels:** `direction`, `msg_type` • **Unit:** 1 • **Path:** `IncWSMessage`
+    **Example:** `newt_websocket_messages_total{direction="out",msg_type="ping"} 4`
+
+
+
+
+
+
+    Active TCP/UDP proxy connections per tunnel/protocol.
+
+    **Labels:** `protocol`, `tunnel_id` • **Unit:** 1 • **Path:** Proxy callback
+    **Example:** `newt_proxy_active_connections{protocol="tcp",tunnel_id="wgpub"} 3`
+
+
+
+
+    Proxy buffer pool size.
+
+    **Labels:** `protocol`, `tunnel_id` • **Unit:** bytes • **Path:** Proxy callback
+    **Example:** `newt_proxy_buffer_bytes{protocol="tcp",tunnel_id="wgpub"} 10240`
+
+
+
+
+    Unflushed async byte backlog.
+
+    **Labels:** `protocol`, `tunnel_id` • **Unit:** bytes • **Path:** Proxy callback
+    **Example:** `newt_proxy_async_backlog_bytes{protocol="udp",tunnel_id="wgpub"} 4096`
+
+
+
+
+    Proxy write drops per protocol/tunnel.
+
+    **Labels:** `protocol`, `tunnel_id` • **Unit:** 1 • **Path:** `IncProxyDrops`
+    **Example:** `newt_proxy_drops_total{protocol="udp",tunnel_id="wgpub"} 2`
+
+
+
+
+    Proxy accept events by result/reason.
+
+    **Labels:** `tunnel_id`, `protocol`, `result`, `reason` • **Unit:** 1 • **Path:** `telemetry.IncProxyAccept(...)`
+    **Example:** `newt_proxy_accept_total{protocol="tcp",result="failure",reason="timeout"} 1`
+
+
+
+
+    Connection lifecycle events (opened/closed).
+
+    **Labels:** `tunnel_id`, `protocol`, `event` • **Unit:** 1 • **Path:** `telemetry.IncProxyConnectionEvent(...)`
+    **Example:** `newt_proxy_connections_total{protocol="tcp",event="opened"} 1`
+
+
+
+
+    Duration of completed proxy connections.
+
+    **Labels:** `tunnel_id`, `protocol`, `result` • **Unit:** seconds • **Path:** `telemetry.ObserveProxyConnectionDuration(...)`
+    **Example:** `newt_proxy_connection_duration_seconds_bucket{protocol="tcp",result="success",le="1"} 3`
+
+
+
+
+
+
+
+---
+
+## References
+
+* OpenTelemetry Documentation
+* Prometheus Documentation
+
+
+Have improvements or a missing metric? Open an issue or PR referencing this page.
+
+