* Add client metrics
* Add client metrics system with OpenTelemetry and VictoriaMetrics support
Implements a comprehensive client metrics system to track peer connection
stages and performance. The system supports multiple backend implementations
(OpenTelemetry, VictoriaMetrics, and no-op) and tracks detailed connection
stage durations from creation through WireGuard handshake.
Key changes:
- Add metrics package with pluggable backend implementations
- Implement OpenTelemetry metrics backend
- Implement VictoriaMetrics metrics backend
- Add no-op metrics implementation for disabled state
- Track connection stages: creation, semaphore, signaling, connection ready, and WireGuard handshake
- Move WireGuard watcher functionality to conn.go
- Refactor engine to integrate metrics tracking
- Add metrics export endpoint in debug server
* Add signaling metrics tracking for initial and reconnection attempts
* Reset connection stage timestamps during reconnections to exclude unnecessary metrics tracking
* Delete otel lib from client
* Update unit tests
* Invoke callback on handshake success in WireGuard watcher
* Add Netbird version tracking to client metrics
Integrate Netbird version into VictoriaMetrics backend and metrics labels. Update `ClientMetrics` constructor and metric name formatting to include version information.
* Add sync duration tracking to client metrics
Introduce `RecordSyncDuration` for measuring sync message processing time. Update all metrics implementations (VictoriaMetrics, no-op) to support the new method. Refactor `ClientMetrics` to use `AgentInfo` for static agent data.
* Remove no-op metrics implementation and simplify ClientMetrics constructor
Eliminate unused `noopMetrics` and refactor `ClientMetrics` to always use the VictoriaMetrics implementation. Update associated logic to reflect these changes.
* Add total duration tracking for connection attempts
Calculate total duration for both initial connections and reconnections, accounting for different timestamp scenarios. Update `Export` method to include Prometheus HELP comments.
* Add metrics push support to VictoriaMetrics integration
* [client] anchor connection metrics to first signal received
* Remove creation_to_semaphore connection stage metric
The semaphore queuing stage (Created → SemaphoreAcquired) is no longer
tracked. Connection metrics now start from SignalingReceived. Updated
docs and Grafana dashboard accordingly.
* [client] Add remote push config for metrics with version-based eligibility
Introduce remoteconfig.Manager that fetches a remote JSON config to control
metrics push interval and restrict pushing to a specific agent version
range. When NB_METRICS_INTERVAL is set, remote config is bypassed
entirely for local override.
* [client] Add WASM-compatible NewClientMetrics implementation
Replace NewClientMetrics in metrics.go with a WASM-specific stub in metrics_js.go, returning nil for compatibility with JS builds. Simplify method usage for WASM targets.
* Add missing file
* Update default case in DeploymentType.String to return "unknown" instead of "selfhosted"
* [client] Rework metrics to use timestamped samples instead of histograms
Replace cumulative Prometheus histograms with timestamped point-in-time
samples that are pushed once and cleared. This fixes metrics for sparse
events (connections/syncs that happen once at startup) where rate() and
increase() produced incorrect or empty results.
Changes:
- Switch from VictoriaMetrics histogram library to raw Prometheus text
format with explicit millisecond timestamps
- Reset samples after successful push (no resending stale data)
- Rename connection_to_handshake → connection_to_wg_handshake
- Add netbird_peer_connection_count metric for ICE vs Relay tracking
- Simplify dashboard: point-based scatter plots, donut pie chart
- Add maxStalenessInterval=1m to VictoriaMetrics to prevent forward-fill
- Fix deployment_type Unknown returning "selfhosted" instead of "unknown"
- Fix inverted shouldPush condition in push.go
* [client] Add InfluxDB metrics backend alongside VictoriaMetrics
Add influxdb.go with timestamped line protocol export for sparse
one-shot events. Restore victoria.go to use proper Prometheus
histograms. Update Grafana dashboards, add InfluxDB datasource,
and update docs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [client] Fix metrics issues and update dev docker setup
- Fix StopPush not clearing push state, preventing restart
- Fix race condition reading currentConnPriority without lock in recordConnectionMetrics
- Fix stale comment referencing old metrics server URL
- Update docker-compose for InfluxDB: add scoped tokens, .env config, init scripts
- Rename docker-compose.victoria.yml to docker-compose.yml
* [client] Add anonymised peer tracking to pushed metrics
Introduce peer_id and connection_pair_id tags to InfluxDB metrics.
Public keys are hashed (truncated SHA-256) for anonymisation. The
connection pair ID is deterministic regardless of which side computes
it, enabling deduplication of reconnections in the ICE vs Relay
dashboard. Also pin Grafana to v11.6.0 for file-based provisioning
and fix datasource UID references.
* Remove unused dependencies from go.mod and go.sum
* Refactor InfluxDB ingest pipeline: extract validation logic
- Move line validation logic to `validateLine` and `validateField` helper functions.
- Improve error handling with structured validation and clearer separation of concerns.
- Add stderr redirection for error messages in `create-tokens.sh`.
* Set non-root user in Dockerfile for Ingest service
* Fix Windows CI: command line too long
* Remove Victoria metrics
* Add hashed peer ID as Authorization header in metrics push
* Revert influxdb in docker compose
* Enable gzip compression and authorization validation for metrics push and ingest
* Reducate code of complexity
* Update debug documentation to include metrics.txt description
* Increase `maxBodySize` limit to 50 MB and update gzip reader wrapping logic
* Refactor deployment type detection to use URL parsing for improved accuracy
* Update readme
* Throttle remote config retries on fetch failure
* Preserve first WG handshake timestamp, ignore rekeys
* Skip adding empty metrics.txt to debug bundle in debug mode
* Update default metrics server URL to https://ingest.netbird.io
* Atomic metrics export-and-reset to prevent sample loss between Export and Reset calls
* Fix doc
* Refactor Push configuration to improve clarity and enforce minimum push interval
* Remove `minPushInterval` and update push interval validation logic
* Revert ExportAndReset, it is acceptable data loss
* Fix metrics review issues: rename env var, remove stale infra, add tests
- Rename NB_METRICS_ENABLED to NB_METRICS_PUSH_ENABLED to clarify that
collection is always active (for debug bundles) and only push is opt-in
- Change default config URL from staging to production (ingest.netbird.io)
- Delete broken Prometheus dashboard (used non-existent metric names)
- Delete unused VictoriaMetrics datasource config
- Replace committed .env with .env.example containing placeholder values
- Wire Grafana admin credentials through env vars in docker-compose
- Make metricsStages a pointer to prevent reset-vs-write race on reconnect
- Fix typed-nil interface in debug bundle path (GetClientMetrics)
- Use deterministic field order in InfluxDB Export (sorted keys)
- Replace Authorization header with X-Peer-ID for metrics push
- Fix ingest server timeout to use time.Second instead of float
- Fix gzip double-close, stale comments, trim log levels
- Add tests for influxdb.go and MetricsStages
* Add login duration metric, ingest tag validation, and duration bounds
- Add netbird_login measurement recording login/auth duration to management
server, with success/failure result tag
- Validate InfluxDB tags against per-measurement allowlists in ingest server
to prevent arbitrary tag injection
- Cap all duration fields (*_seconds) at 300s instead of only total_seconds
- Add ingest server tests for tag/field validation, bounds, and auth
* Add arch tag to all metrics
* Fix Grafana dashboard: add arch to drop columns, add login panels
* Validate NB_METRICS_SERVER_URL is an absolute HTTP(S) URL
* Address review comments: fix README wording, update stale comments
* Clarify env var precedence does not bypass remote config eligibility
* Remove accidentally committed pprof files
---------
Co-authored-by: Viktor Liu <viktor@netbird.io>
* [Client] Remove connection semaphore
Remove the semaphore and the initial random sleep time (300ms) from the connectivity logic to speed up the initial connection time.
Note: Implement limiter logic that can prioritize router peers and keep the fast connection option for the first few peers.
* Remove unused function
* Refactor WG endpoint setup with role-based proxy activation
For relay connections, the controller (initiator) now activates the
wgProxy before configuring the WG endpoint, while the non-controller
(responder) configures the endpoint first with a delayed update, then
activates the proxy after. This prevents the responder from sending
traffic through the proxy before WireGuard is ready to receive it,
avoiding handshake congestion when both sides try to initiate
simultaneously.
For ICE connections, pass hasRelayBackup as the setEndpointNow flag
so the responder sets the endpoint immediately when a relay fallback
exists (avoiding the delayed update path since relay is already
available as backup).
On ICE disconnect with relay fallback, remove the duplicate
wgProxyRelay.Work() calls — the relay proxy is already active from
initial setup, so re-activating it is unnecessary.
In EndpointUpdater, split ConfigureWGEndpoint into explicit
configureAsInitiator and configureAsResponder paths, and add the
setEndpointNow parameter to let the caller control whether the
responder applies the endpoint immediately or defers it. Add unused
SwitchWGEndpoint and RemoveEndpointAddress methods. Remove the
wgConfigWorkaround sleep from the relay setup path.
* Fix redundant wgProxyRelay.Work() call during relay fallback setup
* Simplify WireGuard endpoint configuration by removing unused parameters and redundant logic
When an ICE connection disconnects and falls back to relay, reset the
WireGuard endpoint and handshake watcher if the remote peer's ICE session
has changed. This ensures the controller re-establishes a fresh WireGuard
handshake rather than waiting on a stale endpoint from the previous session.
Add defensive nil checks in ThreadSafeAgent.Close() to prevent panic
when agent field is nil. This can occur during Windows suspend/resume
when network interfaces are disrupted or the pion/ice library returns
nil without error.
Also capture agent pointer in local variable before goroutine execution
to prevent race conditions.
Fixes service crashes on laptop wake-up.
Start the WireGuard watcher before configuring the WG endpoint to ensure it captures the initial handshake timestamp.
Previously, the watcher was started after endpoint configuration, causing it to miss the handshake that occurred during setup.
* [client] Add WGConfigurer interface
To allow Rosenpass to work both with kernel
WireGuard via wgctrl (default behavior) and
userspace WireGuard via IPC on Android/iOS
using WGUSPConfigurer
* [client] Remove Rosenpass debug logs
* [client] Return simpler peer configuration in outputKey method
ConfigureDevice, the method previously used in
outputKey via wgClient to update the device's
properties, is now defined in the WGConfigurer
interface and implemented both in kernel_unix and
usp configurers.
PresharedKey datatype was also changed from
boolean to [32]byte to compare it
to the original NetBird PSK, so that Rosenpass
may replace it with its own when necessary.
* [client] Remove unused field
* [client] Replace usage of WGConfigurer
Replaced with preshared key setter interface,
which only defines a method to set / update the preshared key.
Logic has been migrated from rosenpass/netbird_handler to client/iface.
* [client] Use same default peer keepalive value when setting preshared keys
* [client] Store PresharedKeySetter iface in rosenpass manager
To avoid no-op if SetInterface is called before generateConfig
* [client] Add mutex usage in rosenpass netbird handler
* [client] change implementation setting Rosenpass preshared key
Instead of providing a method to configure a device (device/interface.go),
it forwards the new parameters to the configurer (either
kernel_unix.go / usp.go).
This removes dependency on reading FullStats, and makes use of a common
method (buildPresharedKeyConfig in configurer/common.go) to build a
minimal WG config that only sets/updates the PSK.
netbird_handler.go now keeps s list of initializedPeers to choose whether
to set the value of "UpdateOnly" when calling iface.SetPresharedKey.
* [client] Address possible race condition
Between outputKey calls and peer removal; it
checks again if the peer still exists in the
peers map before inserting it in the
initializedPeers map.
* [client] Add psk Rosenpass-initialized check
On client/internal/peer/conn.go, the presharedKey
function would always return the current key
set in wgConfig.presharedKey.
This would eventually overwrite a key set
by Rosenpass if the feature is active.
The purpose here is to set a handler that will
check if a given peer has its psk initialized
by Rosenpass to skip updating the psk
via updatePeer (since it calls presharedKey
method in conn.go).
* Add missing updateOnly flag setup for usp peers
* Change common.go buildPresharedKeyConfig signature
PeerKey datatype changed from string to
wgTypes.Key. Callers are responsible for parsing
a peer key with string datatype.
- Remove WaitGroup, make SemaphoreGroup a pure semaphore
- Make Add() return error instead of silently failing on context cancel
- Remove context parameter from Done() to prevent slot leaks
- Fix missing Done() call in conn.go error path
When an ICE agent connection was in progress, new offers were being ignored. This was incorrect logic because the remote agent could be restarted at any time.
In this change, whenever a new session ID is received, the ongoing handshake is closed and a new one is started.
* When a peer disconnects, remove the endpoint address to avoid sending traffic to a non-existent address, but retain the status for the activity recorder.
The Relayed connection setup is optimistic. It does not have any confirmation of an established end-to-end connection. Peers start sending WireGuard handshake packets immediately after the successful offer-answer handshake.
Meanwhile, for successful P2P connection negotiation, we change the WireGuard endpoint address, but this change does not trigger new handshake initiation. Because the peer switched from Relayed connection to P2P, the packets from the Relay server are dropped and must wait for the next WireGuard handshake via P2P.
To avoid this scenario, the relayed WireGuard proxy no longer drops the packets. Instead, it rewrites the source address to the new P2P endpoint and continues forwarding the packets.
We still have one corner case: if the Relayed server negotiation chooses a server that has not been used before. In this case, one side of the peer connection will be slower to reach the Relay server, and the Relay server will drop the handshake packet.
If everything goes well we should see exactly 5 seconds improvements between the WireGuard configuration time and the handshake time.
Deduplicate STUN package sending.
Originally, because every peer shared the same UDP address, the library could not distinguish which STUN message was associated with which candidate. As a result, the Pion library responded from all candidates for every STUN message.
In this PR, speed up the GRPC message processing, force the recreation of the ICE agent when getting a new, remote offer (do not wait for local STUN timeout).
This will allow running netbird commands (including debugging) against the daemon and provide a flow similar to non-container usages.
It will by default both log to file and stderr so it can be handled more uniformly in container-native environments.
Avoid invalid disconnection notifications in case the closed race dials.
In this PR resolve multiple race condition questions. Easier to understand the fix based on commit by commit.
- Remove store dependency from notifier
- Enforce the notification orders
- Fix invalid disconnection notification
- Ensure the order of the events on the consumer side
- Clients now subscribe to peer status changes.
- The server manages and maintains these subscriptions.
- Replaced raw string peer IDs with a custom peer ID type for better type safety and clarity.
This PR introduces a new inactivity package responsible for monitoring peer activity and notifying when peers become inactive.
Introduces a new Signal message type to close the peer connection after the idle timeout is reached.
Periodically checks the last activity of registered peers via a Bind interface.
Notifies via a channel when peers exceed a configurable inactivity threshold.
Default settings
DefaultInactivityThreshold is set to 15 minutes, with a minimum allowed threshold of 1 minute.
Limitations
This inactivity check does not support kernel WireGuard integration. In kernel–user space communication, the user space side will always be responsible for closing the connection.
- Removed separate thread execution of GetStates during notifications.
- Updated notification handler to rely on state data included in the notification payload.
* Refactor peer state change subscription mechanism
Because the code generated new channel for every single event, was easy to miss notification.
Use single channel.
* Fix lint
* Avoid potential deadlock
* Fix test
* Add context
* Fix test
* Fix HA router switch.
- Simplify the notification filter logic.
Always send notification if a state has been changed
- Remove IP changes check because we never modify
* Notify only the proper listeners
* Fix test
* Fix TestGetPeerStateChangeNotifierLogic test
* Before lazy connection, when the peer disconnected, the status switched to disconnected.
After implementing lazy connection, the peer state is connecting, so we did not decrease the reference counters on the routes.
* When switch to idle notify the route mgr
With the lazy connection feature, the peer will connect to target peers on-demand. The trigger can be any IP traffic.
This feature can be enabled with the NB_ENABLE_EXPERIMENTAL_LAZY_CONN environment variable.
When the engine receives a network map, it binds a free UDP port for every remote peer, and the system configures WireGuard endpoints for these ports. When traffic appears on a UDP socket, the system removes this listener and starts the peer connection procedure immediately.
Key changes
Fix slow netbird status -d command
Move from engine.go file to conn_mgr.go the peer connection related code
Refactor the iface interface usage and moved interface file next to the engine code
Add new command line flag and UI option to enable feature
The peer.Conn struct is reusable after it has been closed.
Change connection states
Connection states
Idle: The peer is not attempting to establish a connection. This typically means it's in a lazy state or the remote peer is expired.
Connecting: The peer is actively trying to establish a connection. This occurs when the peer has entered an active state and is continuously attempting to reach the remote peer.
Connected: A successful peer-to-peer connection has been established and communication is active.