Commit Graph

28 Commits

Author SHA1 Message Date
Zoltan Papp
91f0d5cefd [client] Feature/client metrics (#5512)
* Add client metrics

* Add client metrics system with OpenTelemetry and VictoriaMetrics support

Implements a comprehensive client metrics system to track peer connection
stages and performance. The system supports multiple backend implementations
(OpenTelemetry, VictoriaMetrics, and no-op) and tracks detailed connection
stage durations from creation through WireGuard handshake.

Key changes:
- Add metrics package with pluggable backend implementations
- Implement OpenTelemetry metrics backend
- Implement VictoriaMetrics metrics backend
- Add no-op metrics implementation for disabled state
- Track connection stages: creation, semaphore, signaling, connection ready, and WireGuard handshake
- Move WireGuard watcher functionality to conn.go
- Refactor engine to integrate metrics tracking
- Add metrics export endpoint in debug server

* Add signaling metrics tracking for initial and reconnection attempts

* Reset connection stage timestamps during reconnections to exclude unnecessary metrics tracking

* Delete otel lib from client

* Update unit tests

* Invoke callback on handshake success in WireGuard watcher

* Add Netbird version tracking to client metrics

Integrate Netbird version into VictoriaMetrics backend and metrics labels. Update `ClientMetrics` constructor and metric name formatting to include version information.

* Add sync duration tracking to client metrics

Introduce `RecordSyncDuration` for measuring sync message processing time. Update all metrics implementations (VictoriaMetrics, no-op) to support the new method. Refactor `ClientMetrics` to use `AgentInfo` for static agent data.

* Remove no-op metrics implementation and simplify ClientMetrics constructor

Eliminate unused `noopMetrics` and refactor `ClientMetrics` to always use the VictoriaMetrics implementation. Update associated logic to reflect these changes.

* Add total duration tracking for connection attempts

Calculate total duration for both initial connections and reconnections, accounting for different timestamp scenarios. Update `Export` method to include Prometheus HELP comments.

* Add metrics push support to VictoriaMetrics integration

* [client] anchor connection metrics to first signal received

* Remove creation_to_semaphore connection stage metric

The semaphore queuing stage (Created → SemaphoreAcquired) is no longer
tracked. Connection metrics now start from SignalingReceived. Updated
docs and Grafana dashboard accordingly.

* [client] Add remote push config for metrics with version-based eligibility

Introduce remoteconfig.Manager that fetches a remote JSON config to control
metrics push interval and restrict pushing to a specific agent version
range. When NB_METRICS_INTERVAL is set, remote config is bypassed
entirely for local override.

* [client] Add WASM-compatible NewClientMetrics implementation

Replace NewClientMetrics in metrics.go with a WASM-specific stub in metrics_js.go, returning nil for compatibility with JS builds. Simplify method usage for WASM targets.

* Add missing file

* Update default case in DeploymentType.String to return "unknown" instead of "selfhosted"

* [client] Rework metrics to use timestamped samples instead of histograms

Replace cumulative Prometheus histograms with timestamped point-in-time
samples that are pushed once and cleared. This fixes metrics for sparse
events (connections/syncs that happen once at startup) where rate() and
increase() produced incorrect or empty results.

Changes:
- Switch from VictoriaMetrics histogram library to raw Prometheus text
  format with explicit millisecond timestamps
- Reset samples after successful push (no resending stale data)
- Rename connection_to_handshake → connection_to_wg_handshake
- Add netbird_peer_connection_count metric for ICE vs Relay tracking
- Simplify dashboard: point-based scatter plots, donut pie chart
- Add maxStalenessInterval=1m to VictoriaMetrics to prevent forward-fill
- Fix deployment_type Unknown returning "selfhosted" instead of "unknown"
- Fix inverted shouldPush condition in push.go

* [client] Add InfluxDB metrics backend alongside VictoriaMetrics

Add influxdb.go with timestamped line protocol export for sparse
one-shot events. Restore victoria.go to use proper Prometheus
histograms. Update Grafana dashboards, add InfluxDB datasource,
and update docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* [client] Fix metrics issues and update dev docker setup

- Fix StopPush not clearing push state, preventing restart
- Fix race condition reading currentConnPriority without lock in recordConnectionMetrics
- Fix stale comment referencing old metrics server URL
- Update docker-compose for InfluxDB: add scoped tokens, .env config, init scripts
- Rename docker-compose.victoria.yml to docker-compose.yml

* [client] Add anonymised peer tracking to pushed metrics

Introduce peer_id and connection_pair_id tags to InfluxDB metrics.
Public keys are hashed (truncated SHA-256) for anonymisation. The
connection pair ID is deterministic regardless of which side computes
it, enabling deduplication of reconnections in the ICE vs Relay
dashboard. Also pin Grafana to v11.6.0 for file-based provisioning
and fix datasource UID references.

* Remove unused dependencies from go.mod and go.sum

* Refactor InfluxDB ingest pipeline: extract validation logic

- Move line validation logic to `validateLine` and `validateField` helper functions.
- Improve error handling with structured validation and clearer separation of concerns.
- Add stderr redirection for error messages in `create-tokens.sh`.

* Set non-root user in Dockerfile for Ingest service

* Fix Windows CI: command line too long

* Remove Victoria metrics

* Add hashed peer ID as Authorization header in metrics push

* Revert influxdb in docker compose

* Enable gzip compression and authorization validation for metrics push and ingest

* Reducate code of complexity

* Update debug documentation to include metrics.txt description

* Increase `maxBodySize` limit to 50 MB and update gzip reader wrapping logic

* Refactor deployment type detection to use URL parsing for improved accuracy

* Update readme

* Throttle remote config retries on fetch failure

* Preserve first WG handshake timestamp, ignore rekeys

* Skip adding empty metrics.txt to debug bundle in debug mode

* Update default metrics server URL to https://ingest.netbird.io

* Atomic metrics export-and-reset to prevent sample loss between Export and Reset calls

* Fix doc

* Refactor Push configuration to improve clarity and enforce minimum push interval

* Remove `minPushInterval` and update push interval validation logic

* Revert ExportAndReset, it is acceptable data loss

* Fix metrics review issues: rename env var, remove stale infra, add tests

- Rename NB_METRICS_ENABLED to NB_METRICS_PUSH_ENABLED to clarify that
  collection is always active (for debug bundles) and only push is opt-in
- Change default config URL from staging to production (ingest.netbird.io)
- Delete broken Prometheus dashboard (used non-existent metric names)
- Delete unused VictoriaMetrics datasource config
- Replace committed .env with .env.example containing placeholder values
- Wire Grafana admin credentials through env vars in docker-compose
- Make metricsStages a pointer to prevent reset-vs-write race on reconnect
- Fix typed-nil interface in debug bundle path (GetClientMetrics)
- Use deterministic field order in InfluxDB Export (sorted keys)
- Replace Authorization header with X-Peer-ID for metrics push
- Fix ingest server timeout to use time.Second instead of float
- Fix gzip double-close, stale comments, trim log levels
- Add tests for influxdb.go and MetricsStages

* Add login duration metric, ingest tag validation, and duration bounds

- Add netbird_login measurement recording login/auth duration to management
  server, with success/failure result tag
- Validate InfluxDB tags against per-measurement allowlists in ingest server
  to prevent arbitrary tag injection
- Cap all duration fields (*_seconds) at 300s instead of only total_seconds
- Add ingest server tests for tag/field validation, bounds, and auth

* Add arch tag to all metrics

* Fix Grafana dashboard: add arch to drop columns, add login panels

* Validate NB_METRICS_SERVER_URL is an absolute HTTP(S) URL

* Address review comments: fix README wording, update stale comments

* Clarify env var precedence does not bypass remote config eligibility

* Remove accidentally committed pprof files

---------

Co-authored-by: Viktor Liu <viktor@netbird.io>
2026-03-22 12:45:41 +01:00
Zoltan Papp
ee3a67d2d8 [client] Fix/health result in bundle (#5164)
* Add support for optional status refresh callback during debug bundle generation

* Always update wg status

* Remove duplicated wg status call
2026-01-23 17:06:07 +01:00
Viktor Liu
d0221a3e72 [client] Add cpu profile to debug bundle (#4700) 2026-01-22 12:24:12 +01:00
Zoltan Papp
58daa674ef [Management/Client] Trigger debug bundle runs from API/Dashboard (#4592) (#4832)
This PR adds the ability to trigger debug bundle generation remotely from the Management API/Dashboard.
2026-01-19 11:22:16 +01:00
Viktor Liu
520d9c66cf [client] Fix netstack upstream dns and add wasm debug methods (#4648) 2026-01-14 13:56:16 +01:00
Viktor Liu
1d5e871bdf [misc] Move shared components to shared directory (#4286)
Moved the following directories:

```
  - management/client → shared/management/client
  - management/domain → shared/management/domain
  - management/proto → shared/management/proto
  - signal/client → shared/signal/client
  - signal/proto → shared/signal/proto
  - relay/client → shared/relay/client
  - relay/auth → shared/relay/auth
```

and adjusted import paths
2025-08-05 15:22:58 +02:00
Viktor Liu
3d3c4c5844 [client] Add full sync response to debug bundle (#4287) 2025-08-05 14:55:50 +02:00
Viktor Liu
a7ea881900 [client] Add rotated logs flag for debug bundle generation (#4100) 2025-07-10 16:13:53 +02:00
Viktor Liu
01c3719c5d [client] Add debug for duration option to netbird ui (#3772) 2025-05-01 23:25:27 +02:00
Maycon Santos
2f44fe2e23 [client] Feature/upload bundle (#3734)
Add an upload bundle option with the flag --upload-bundle; by default, the upload will use a NetBird address, which can be replaced using the flag --upload-bundle-url.

The upload server is available under the /upload-server path. The release change will push a docker image to netbirdio/upload image repository.

The server supports using s3 with pre-signed URL for direct upload and local file for storing bundles.
2025-04-29 00:43:50 +02:00
Viktor Liu
a675531b5c [client] Set up signal to generate debug bundles (#3683) 2025-04-16 11:06:22 +02:00
Viktor Liu
a354004564 [client] Add remaining debug profiles (#3681) 2025-04-15 13:06:28 +02:00
Viktor Liu
b165f63327 [client] Add heap profile to debug bundle (#3679) 2025-04-15 11:36:41 +02:00
Maycon Santos
bd8f0c1ef3 [client] add profiling dumps to debug package (#3517)
enhances debugging capabilities by adding support for goroutine, mutex, and block profiling while updating state dump tracking and refining test and release settings.

- Adds pprof-based profiling for goroutine, mutex, and block profiles in the debug bundle.
- Updates state dump functionality by incorporating new status and key fields.
- Adjusts test validations and default flag/retention settings.
2025-03-23 13:46:09 +01:00
Viktor Liu
a2faae5d62 [client] Fix anonymized addresses documentation (#3505) 2025-03-14 11:38:16 +01:00
Viktor Liu
05415f72ec [client] Add experimental support for userspace routing (#3134) 2025-02-07 14:11:53 +01:00
Viktor Liu
2e61ce006d [client] Back up corrupted state files and present them in the debug bundle (#3227) 2025-01-23 17:59:44 +01:00
Viktor Liu
3cc485759e [client] Use correct stdout/stderr log paths for debug bundle on macOS (#3231) 2025-01-23 17:59:22 +01:00
Viktor Liu
78795a4a73 [client] Add block lan access flag for routers (#3171) 2025-01-15 17:39:47 +01:00
Viktor Liu
0dbaddc7be [client] Don't fail debug if log file is console (#3103) 2024-12-24 15:05:23 +01:00
Viktor Liu
05930ee6b1 [client] Add firewall rules to the debug bundle (#3089)
Adds the following to the debug bundle:
- iptables: `iptables-save`, `iptables -v -n -L`
- nftables: `nft list ruleset` or if not available formatted output from netlink (WIP)
2024-12-23 15:57:15 +01:00
Viktor Liu
17c20b45ce [client] Add network map to debug bundle (#2966) 2024-12-03 14:50:12 +01:00
Viktor Liu
6285e0d23e [client] Add netbird.err and netbird.out to debug bundle (#2971) 2024-12-03 12:43:17 +01:00
Viktor Liu
c7e7ad5030 [client] Add state file to debug bundle (#2969) 2024-12-02 18:04:02 +01:00
Zoltan Papp
0c039274a4 [relay] Feature/relay integration (#2244)
This update adds new relay integration for NetBird clients. The new relay is based on web sockets and listens on a single port.

- Adds new relay implementation with websocket with single port relaying mechanism
- refactor peer connection logic, allowing upgrade and downgrade from/to P2P connection
- peer connections are faster since it connects first to relay and then upgrades to P2P
- maintains compatibility with old clients by not using the new relay
- updates infrastructure scripts with new relay service
2024-09-08 12:06:14 +02:00
Viktor Liu
5ad4ae769a Extend client debug bundle (#2341)
Adds readme (with --anonymize)
Fixes archive file timestamps
Adds routes info
Adds interfaces
Adds client config
2024-08-02 11:47:12 +02:00
Viktor Liu
d13fb0e379 Restore netbird state and log level after debug (#2047) 2024-05-24 13:27:41 +02:00
Viktor Liu
4424162bce Add client debug features (#1884)
* Add status anonymization
* Add OS/arch to the status command
* Use human-friendly last-update status messages
* Add debug bundle command to collect (anonymized) logs
* Add debug log level command
* And debug for a certain time span command
2024-04-26 17:20:10 +02:00