netbird

mirror of https://github.com/netbirdio/netbird.git synced 2026-04-16 15:26:40 +00:00

Author	SHA1	Message	Date
Viktor Liu	d33cd4c95b	[client] Add NAT-PMP/UPnP support (#5202 )	2026-04-08 15:29:32 +08:00
Zoltan Papp	91f0d5cefd	[client] Feature/client metrics (#5512 ) * Add client metrics * Add client metrics system with OpenTelemetry and VictoriaMetrics support Implements a comprehensive client metrics system to track peer connection stages and performance. The system supports multiple backend implementations (OpenTelemetry, VictoriaMetrics, and no-op) and tracks detailed connection stage durations from creation through WireGuard handshake. Key changes: - Add metrics package with pluggable backend implementations - Implement OpenTelemetry metrics backend - Implement VictoriaMetrics metrics backend - Add no-op metrics implementation for disabled state - Track connection stages: creation, semaphore, signaling, connection ready, and WireGuard handshake - Move WireGuard watcher functionality to conn.go - Refactor engine to integrate metrics tracking - Add metrics export endpoint in debug server * Add signaling metrics tracking for initial and reconnection attempts * Reset connection stage timestamps during reconnections to exclude unnecessary metrics tracking * Delete otel lib from client * Update unit tests * Invoke callback on handshake success in WireGuard watcher * Add Netbird version tracking to client metrics Integrate Netbird version into VictoriaMetrics backend and metrics labels. Update `ClientMetrics` constructor and metric name formatting to include version information. * Add sync duration tracking to client metrics Introduce `RecordSyncDuration` for measuring sync message processing time. Update all metrics implementations (VictoriaMetrics, no-op) to support the new method. Refactor `ClientMetrics` to use `AgentInfo` for static agent data. * Remove no-op metrics implementation and simplify ClientMetrics constructor Eliminate unused `noopMetrics` and refactor `ClientMetrics` to always use the VictoriaMetrics implementation. Update associated logic to reflect these changes. 
* Add total duration tracking for connection attempts Calculate total duration for both initial connections and reconnections, accounting for different timestamp scenarios. Update `Export` method to include Prometheus HELP comments. * Add metrics push support to VictoriaMetrics integration * [client] anchor connection metrics to first signal received * Remove creation_to_semaphore connection stage metric The semaphore queuing stage (Created → SemaphoreAcquired) is no longer tracked. Connection metrics now start from SignalingReceived. Updated docs and Grafana dashboard accordingly. * [client] Add remote push config for metrics with version-based eligibility Introduce remoteconfig.Manager that fetches a remote JSON config to control metrics push interval and restrict pushing to a specific agent version range. When NB_METRICS_INTERVAL is set, remote config is bypassed entirely for local override. * [client] Add WASM-compatible NewClientMetrics implementation Replace NewClientMetrics in metrics.go with a WASM-specific stub in metrics_js.go, returning nil for compatibility with JS builds. Simplify method usage for WASM targets. * Add missing file * Update default case in DeploymentType.String to return "unknown" instead of "selfhosted" * [client] Rework metrics to use timestamped samples instead of histograms Replace cumulative Prometheus histograms with timestamped point-in-time samples that are pushed once and cleared. This fixes metrics for sparse events (connections/syncs that happen once at startup) where rate() and increase() produced incorrect or empty results. 
Changes: - Switch from VictoriaMetrics histogram library to raw Prometheus text format with explicit millisecond timestamps - Reset samples after successful push (no resending stale data) - Rename connection_to_handshake → connection_to_wg_handshake - Add netbird_peer_connection_count metric for ICE vs Relay tracking - Simplify dashboard: point-based scatter plots, donut pie chart - Add maxStalenessInterval=1m to VictoriaMetrics to prevent forward-fill - Fix deployment_type Unknown returning "selfhosted" instead of "unknown" - Fix inverted shouldPush condition in push.go * [client] Add InfluxDB metrics backend alongside VictoriaMetrics Add influxdb.go with timestamped line protocol export for sparse one-shot events. Restore victoria.go to use proper Prometheus histograms. Update Grafana dashboards, add InfluxDB datasource, and update docs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [client] Fix metrics issues and update dev docker setup - Fix StopPush not clearing push state, preventing restart - Fix race condition reading currentConnPriority without lock in recordConnectionMetrics - Fix stale comment referencing old metrics server URL - Update docker-compose for InfluxDB: add scoped tokens, .env config, init scripts - Rename docker-compose.victoria.yml to docker-compose.yml * [client] Add anonymised peer tracking to pushed metrics Introduce peer_id and connection_pair_id tags to InfluxDB metrics. Public keys are hashed (truncated SHA-256) for anonymisation. The connection pair ID is deterministic regardless of which side computes it, enabling deduplication of reconnections in the ICE vs Relay dashboard. Also pin Grafana to v11.6.0 for file-based provisioning and fix datasource UID references. * Remove unused dependencies from go.mod and go.sum * Refactor InfluxDB ingest pipeline: extract validation logic - Move line validation logic to `validateLine` and `validateField` helper functions. 
- Improve error handling with structured validation and clearer separation of concerns. - Add stderr redirection for error messages in `create-tokens.sh`. * Set non-root user in Dockerfile for Ingest service * Fix Windows CI: command line too long * Remove Victoria metrics * Add hashed peer ID as Authorization header in metrics push * Revert influxdb in docker compose * Enable gzip compression and authorization validation for metrics push and ingest * Reduce code complexity * Update debug documentation to include metrics.txt description * Increase `maxBodySize` limit to 50 MB and update gzip reader wrapping logic * Refactor deployment type detection to use URL parsing for improved accuracy * Update readme * Throttle remote config retries on fetch failure * Preserve first WG handshake timestamp, ignore rekeys * Skip adding empty metrics.txt to debug bundle in debug mode * Update default metrics server URL to https://ingest.netbird.io * Atomic metrics export-and-reset to prevent sample loss between Export and Reset calls * Fix doc * Refactor Push configuration to improve clarity and enforce minimum push interval * Remove `minPushInterval` and update push interval validation logic * Revert ExportAndReset, it is acceptable data loss * Fix metrics review issues: rename env var, remove stale infra, add tests - Rename NB_METRICS_ENABLED to NB_METRICS_PUSH_ENABLED to clarify that collection is always active (for debug bundles) and only push is opt-in - Change default config URL from staging to production (ingest.netbird.io) - Delete broken Prometheus dashboard (used non-existent metric names) - Delete unused VictoriaMetrics datasource config - Replace committed .env with .env.example containing placeholder values - Wire Grafana admin credentials through env vars in docker-compose - Make metricsStages a pointer to prevent reset-vs-write race on reconnect - Fix typed-nil interface in debug bundle path (GetClientMetrics) - Use deterministic field order in InfluxDB Export 
(sorted keys) - Replace Authorization header with X-Peer-ID for metrics push - Fix ingest server timeout to use time.Second instead of float - Fix gzip double-close, stale comments, trim log levels - Add tests for influxdb.go and MetricsStages * Add login duration metric, ingest tag validation, and duration bounds - Add netbird_login measurement recording login/auth duration to management server, with success/failure result tag - Validate InfluxDB tags against per-measurement allowlists in ingest server to prevent arbitrary tag injection - Cap all duration fields (_seconds) at 300s instead of only total_seconds - Add ingest server tests for tag/field validation, bounds, and auth Add arch tag to all metrics * Fix Grafana dashboard: add arch to drop columns, add login panels * Validate NB_METRICS_SERVER_URL is an absolute HTTP(S) URL * Address review comments: fix README wording, update stale comments * Clarify env var precedence does not bypass remote config eligibility * Remove accidentally committed pprof files --------- Co-authored-by: Viktor Liu <viktor@netbird.io>	2026-03-22 12:45:41 +01:00
Zoltan Papp	4a54f0d670	[Client] Remove connection semaphore (#5419 ) * [Client] Remove connection semaphore Remove the semaphore and the initial random sleep time (300ms) from the connectivity logic to speed up the initial connection time. Note: Implement limiter logic that can prioritize router peers and keep the fast connection option for the first few peers. * Remove unused function	2026-02-23 20:58:53 +01:00
Zoltan Papp	2dbdb5c1a7	[client] Refactor WG endpoint setup with role-based proxy activation (#5277 ) * Refactor WG endpoint setup with role-based proxy activation For relay connections, the controller (initiator) now activates the wgProxy before configuring the WG endpoint, while the non-controller (responder) configures the endpoint first with a delayed update, then activates the proxy after. This prevents the responder from sending traffic through the proxy before WireGuard is ready to receive it, avoiding handshake congestion when both sides try to initiate simultaneously. For ICE connections, pass hasRelayBackup as the setEndpointNow flag so the responder sets the endpoint immediately when a relay fallback exists (avoiding the delayed update path since relay is already available as backup). On ICE disconnect with relay fallback, remove the duplicate wgProxyRelay.Work() calls — the relay proxy is already active from initial setup, so re-activating it is unnecessary. In EndpointUpdater, split ConfigureWGEndpoint into explicit configureAsInitiator and configureAsResponder paths, and add the setEndpointNow parameter to let the caller control whether the responder applies the endpoint immediately or defers it. Add unused SwitchWGEndpoint and RemoveEndpointAddress methods. Remove the wgConfigWorkaround sleep from the relay setup path. * Fix redundant wgProxyRelay.Work() call during relay fallback setup * Simplify WireGuard endpoint configuration by removing unused parameters and redundant logic	2026-02-17 19:28:26 +01:00
Zoltan Papp	baed6e46ec	Reset WireGuard endpoint on ICE session change during relay fallback (#5283 ) When an ICE connection disconnects and falls back to relay, reset the WireGuard endpoint and handshake watcher if the remote peer's ICE session has changed. This ensures the controller re-establishes a fresh WireGuard handshake rather than waiting on a stale endpoint from the previous session.	2026-02-16 20:59:29 +01:00
Zoltan Papp	d2f9653cea	Fix nil pointer panic in ICE agent during sleep/wake cycles (#5261 ) Add defensive nil checks in ThreadSafeAgent.Close() to prevent panic when agent field is nil. This can occur during Windows suspend/resume when network interfaces are disrupted or the pion/ice library returns nil without error. Also capture agent pointer in local variable before goroutine execution to prevent race conditions. Fixes service crashes on laptop wake-up.	2026-02-05 12:06:28 +01:00
Zoltan Papp	5333e55a81	Fix WG watcher missing initial handshake (#5213 ) Start the WireGuard watcher before configuring the WG endpoint to ensure it captures the initial handshake timestamp. Previously, the watcher was started after endpoint configuration, causing it to miss the handshake that occurred during setup.	2026-01-29 16:58:10 +01:00
Zoltan Papp	ee3a67d2d8	[client] Fix/health result in bundle (#5164 ) * Add support for optional status refresh callback during debug bundle generation * Always update wg status * Remove duplicated wg status call	2026-01-23 17:06:07 +01:00
Viktor Liu	ee54827f94	[client] Add IPv6 support to userspace bind (#5147 )	2026-01-22 10:20:43 +08:00
Zoltan Papp	e908dea702	[client] Extend WG watcher for ICE connection too (#5133 ) Extend WG watcher for ICE connection too	2026-01-21 10:42:13 +01:00
Diego Romar	b3a2992a10	[client/android] - Fix Rosenpass connectivity for Android peers (#5044 ) * [client] Add WGConfigurer interface To allow Rosenpass to work both with kernel WireGuard via wgctrl (default behavior) and userspace WireGuard via IPC on Android/iOS using WGUSPConfigurer * [client] Remove Rosenpass debug logs * [client] Return simpler peer configuration in outputKey method ConfigureDevice, the method previously used in outputKey via wgClient to update the device's properties, is now defined in the WGConfigurer interface and implemented both in kernel_unix and usp configurers. PresharedKey datatype was also changed from boolean to [32]byte to compare it to the original NetBird PSK, so that Rosenpass may replace it with its own when necessary. * [client] Remove unused field * [client] Replace usage of WGConfigurer Replaced with preshared key setter interface, which only defines a method to set / update the preshared key. Logic has been migrated from rosenpass/netbird_handler to client/iface. * [client] Use same default peer keepalive value when setting preshared keys * [client] Store PresharedKeySetter iface in rosenpass manager To avoid no-op if SetInterface is called before generateConfig * [client] Add mutex usage in rosenpass netbird handler * [client] change implementation setting Rosenpass preshared key Instead of providing a method to configure a device (device/interface.go), it forwards the new parameters to the configurer (either kernel_unix.go / usp.go). This removes dependency on reading FullStats, and makes use of a common method (buildPresharedKeyConfig in configurer/common.go) to build a minimal WG config that only sets/updates the PSK. netbird_handler.go now keeps a list of initializedPeers to choose whether to set the value of "UpdateOnly" when calling iface.SetPresharedKey.
* [client] Address possible race condition Between outputKey calls and peer removal; it checks again if the peer still exists in the peers map before inserting it in the initializedPeers map. * [client] Add psk Rosenpass-initialized check On client/internal/peer/conn.go, the presharedKey function would always return the current key set in wgConfig.presharedKey. This would eventually overwrite a key set by Rosenpass if the feature is active. The purpose here is to set a handler that will check if a given peer has its psk initialized by Rosenpass to skip updating the psk via updatePeer (since it calls presharedKey method in conn.go). * Add missing updateOnly flag setup for usp peers * Change common.go buildPresharedKeyConfig signature PeerKey datatype changed from string to wgTypes.Key. Callers are responsible for parsing a peer key with string datatype.	2026-01-20 13:26:51 -03:00
Viktor Liu	520d9c66cf	[client] Fix netstack upstream dns and add wasm debug methods (#4648 )	2026-01-14 13:56:16 +01:00
Zoltan Papp	d9118eb239	[client] Fix WASM peer connection to lazy peers (#5097 ) WASM peers now properly initiate relay connections instead of waiting for offers that lazy peers won't send.	2026-01-13 13:33:15 +01:00
Zoltan Papp	9ba067391f	[client] Fix semaphore slot leaks (#5018 ) - Remove WaitGroup, make SemaphoreGroup a pure semaphore - Make Add() return error instead of silently failing on context cancel - Remove context parameter from Done() to prevent slot leaks - Fix missing Done() call in conn.go error path	2026-01-03 09:10:02 +01:00
Zoltan Papp	537151e0f3	Remove redundant lock in peer update logic to avoid deadlock with exported functions (#4953 )	2025-12-17 13:55:33 +01:00
Viktor Liu	d71a82769c	[client,management] Rewrite the SSH feature (#4015 )	2025-11-17 17:10:41 +01:00
Viktor Liu	9cc9462cd5	[client] Use stdnet with a context to avoid DNS deadlocks (#4781 )	2025-11-13 20:16:45 +01:00
Viktor Liu	27957036c9	[client] Fix shutdown blocking on stuck ICE agent close (#4780 )	2025-11-13 13:24:51 +01:00
Zoltan Papp	c28275611b	Fix agent reference (#4776 )	2025-11-11 13:59:32 +01:00
Viktor Liu	c92e6c1b5f	[client] Block on all subsystems on shutdown (#4709 )	2025-11-05 12:15:37 +01:00
Zoltan Papp	9021bb512b	[client] Recreate agent when receive new session id (#4564 ) When an ICE agent connection was in progress, new offers were being ignored. This was incorrect logic because the remote agent could be restarted at any time. In this change, whenever a new session ID is received, the ongoing handshake is closed and a new one is started.	2025-10-08 17:14:24 +02:00
Zoltan Papp	4d33567888	[client] Remove endpoint address on peer disconnect, retain status for activity recording (#4228 ) * When a peer disconnects, remove the endpoint address to avoid sending traffic to a non-existent address, but retain the status for the activity recorder.	2025-10-08 03:12:16 +02:00
Zoltan Papp	5e1a40c33f	[client] Order the list of candidates for proper comparison (#4561 ) Order the list of candidates for proper comparison	2025-09-30 23:40:46 +02:00
Zoltan Papp	e8d301fdc9	[client] Fix/pkg loss (#3338 ) The relayed connection setup is optimistic: it has no confirmation of an established end-to-end connection. Peers start sending WireGuard handshake packets immediately after the successful offer-answer handshake. Meanwhile, for a successful P2P connection negotiation, we change the WireGuard endpoint address, but this change does not trigger a new handshake initiation. Because the peer switched from the relayed connection to P2P, the packets from the Relay server are dropped and must wait for the next WireGuard handshake via P2P. To avoid this scenario, the relayed WireGuard proxy no longer drops the packets. Instead, it rewrites the source address to the new P2P endpoint and continues forwarding the packets. One corner case remains: the Relay server negotiation may choose a server that has not been used before. In this case, one side of the peer connection will be slower to reach the Relay server, and the Relay server will drop the handshake packet. If everything goes well, we should see an improvement of exactly 5 seconds between the WireGuard configuration time and the handshake time.	2025-09-30 15:31:18 +02:00
Zoltan Papp	bd23ab925e	[client] Fix ICE latency handling (#4501 ) The GetSelectedCandidatePair() does not carry the latency information.	2025-09-15 15:08:53 +02:00
Zoltan Papp	9e81e782e5	[client] Fix/v4 stun routing (#4430 ) Deduplicate STUN package sending. Originally, because every peer shared the same UDP address, the library could not distinguish which STUN message was associated with which candidate. As a result, the Pion library responded from all candidates for every STUN message.	2025-09-11 10:08:54 +02:00
Zoltan Papp	7aef0f67df	[client] Implement environment variable handling for Android (#4440 ) Some features can only be manipulated via environment variables. With this PR, environment variables can be managed from Android.	2025-09-08 18:42:42 +02:00
Zoltan Papp	69d87343d2	[client] Debug information for connection (#4439 ) Improve logging: - Print the exact time when the first WireGuard handshake occurs - Print the steps for gathering system information	2025-09-08 14:51:34 +02:00
Zoltan Papp	786ca6fc79	Do not block Offer processing from relay worker (#4435 ) - do not miss ICE offers when relay worker busy - close p2p connection before recreate agent	2025-09-05 11:02:29 +02:00
Zoltan Papp	21368b38d9	[client] Update Pion ICE to the latest version (#4388 ) - Update Pion version - Update protobuf version	2025-09-01 10:42:01 +02:00
Zoltan Papp	f425870c8e	[client] Avoid duplicated agent close (#4383 )	2025-08-20 18:50:51 +02:00
Zoltan Papp	12cad854b2	[client] Fix/ice handshake (#4281 ) In this PR, speed up the GRPC message processing, force the recreation of the ICE agent when getting a new, remote offer (do not wait for local STUN timeout).	2025-08-18 20:09:50 +02:00
Viktor Liu	1022a5015c	[client] Eliminate upstream server strings in dns code (#4267 )	2025-08-11 11:57:21 +02:00
Viktor Liu	1d5e871bdf	[misc] Move shared components to shared directory (#4286 ) Moved the following directories: ``` - management/client → shared/management/client - management/domain → shared/management/domain - management/proto → shared/management/proto - signal/client → shared/signal/client - signal/proto → shared/signal/proto - relay/client → shared/relay/client - relay/auth → shared/relay/auth ``` and adjusted import paths	2025-08-05 15:22:58 +02:00
Krzysztof Nazarewski (kdn)	af8687579b	client: container: support CLI with entrypoint addition (#4126 ) This will allow running netbird commands (including debugging) against the daemon and provide a flow similar to non-container usages. It will by default both log to file and stderr so it can be handled more uniformly in container-native environments.	2025-07-25 11:44:30 +02:00
Louis Li	3f82698089	[client] make ICE failed timeout configurable (#4211 )	2025-07-25 10:36:11 +02:00
Zoltan Papp	86c16cf651	[server, relay] Fix/relay race disconnection (#4174 ) Avoid invalid disconnection notifications when a close races with a dial. This PR resolves multiple race conditions. The fix is easier to follow commit by commit. - Remove store dependency from notifier - Enforce the notification orders - Fix invalid disconnection notification - Ensure the order of the events on the consumer side	2025-07-21 19:58:17 +02:00
Viktor Liu	d6ed9c037e	[client] Fix bind exclusion routes (#4154 )	2025-07-21 12:13:21 +02:00
Zoltan Papp	0dab03252c	[client, relay-server] Feature/relay notification (#4083 ) - Clients now subscribe to peer status changes. - The server manages and maintains these subscriptions. - Replaced raw string peer IDs with a custom peer ID type for better type safety and clarity.	2025-07-15 10:43:42 +02:00
Zoltan Papp	fbb1b55beb	[client] refactor lazy detection (#4050 ) This PR introduces a new inactivity package responsible for monitoring peer activity and notifying when peers become inactive. Introduces a new Signal message type to close the peer connection after the idle timeout is reached. Periodically checks the last activity of registered peers via a Bind interface. Notifies via a channel when peers exceed a configurable inactivity threshold. Default settings: DefaultInactivityThreshold is set to 15 minutes, with a minimum allowed threshold of 1 minute. Limitations: This inactivity check does not support kernel WireGuard integration. In kernel–user space communication, the user space side will always be responsible for closing the connection.	2025-07-04 19:52:27 +02:00
Viktor Liu	2a51609436	[client] Handle lazy routing peers that are part of HA groups (#3943 ) * Activate new lazy routing peers if the HA group is active * Prevent lazy peers going to idle if HA group members are active (#3948)	2025-06-20 18:07:19 +02:00
Viktor Liu	d4a800edd5	[client] Fix status recorder panic (#3988 )	2025-06-17 01:20:26 +02:00
Zoltan Papp	9d11257b1a	[client] Carry the peer's actual state with the notification. (#3929 ) - Removed separate thread execution of GetStates during notifications. - Updated notification handler to rely on state data included in the notification payload.	2025-06-11 13:33:38 +02:00
Viktor Liu	3c535cdd2b	[client] Add lazy connections to routed networks (#3908 )	2025-06-08 14:10:34 +02:00
Zoltan Papp	9424b88db2	[client] Add output similar to wg show to the debug package (#3922 )	2025-06-05 11:51:39 +02:00
Zoltan Papp	af27aaf9af	[client] Refactor peer state change subscription mechanism (#3910 ) * Refactor peer state change subscription mechanism Because the code created a new channel for every single event, it was easy to miss a notification. Use a single channel instead. * Fix lint * Avoid potential deadlock * Fix test * Add context * Fix test	2025-06-03 09:20:33 +02:00
Zoltan Papp	f16f0c7831	[client] Fix HA router switch (#3889 ) * Fix HA router switch. - Simplify the notification filter logic. Always send a notification if a state has changed - Remove the IP change check because we never modify it * Notify only the proper listeners * Fix test * Fix TestGetPeerStateChangeNotifierLogic test * Before lazy connection, when the peer disconnected, the status switched to disconnected. After implementing lazy connection, the peer state is connecting, so we did not decrease the reference counters on the routes. * When switching to idle, notify the route manager	2025-06-01 16:08:27 +02:00
Zoltan Papp	aa07b3b87b	Fix deadlock (#3904 )	2025-05-30 23:38:02 +02:00
Zoltan Papp	0492c1724a	[client, android] Fix/notifier threading (#3807 ) - Fix potential deadlocks - When adding a listener, immediately notify with the last known IP and fqdn.	2025-05-27 17:12:04 +02:00
Zoltan Papp	daa8380df9	[client] Feature/lazy connection (#3379 ) With the lazy connection feature, the peer connects to target peers on demand. The trigger can be any IP traffic. This feature can be enabled with the NB_ENABLE_EXPERIMENTAL_LAZY_CONN environment variable. When the engine receives a network map, it binds a free UDP port for every remote peer, and the system configures WireGuard endpoints for these ports. When traffic appears on a UDP socket, the system removes this listener and starts the peer connection procedure immediately. Key changes: - Fix slow netbird status -d command - Move the peer connection related code from engine.go to conn_mgr.go - Refactor the iface interface usage and move the interface file next to the engine code - Add new command line flag and UI option to enable the feature - The peer.Conn struct is reusable after it has been closed - Change connection states Connection states: - Idle: The peer is not attempting to establish a connection. This typically means it's in a lazy state or the remote peer is expired. - Connecting: The peer is actively trying to establish a connection. This occurs when the peer has entered an active state and is continuously attempting to reach the remote peer. - Connected: A successful peer-to-peer connection has been established and communication is active.	2025-05-21 11:12:28 +02:00
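
The lazy connection trigger described in daa8380df9 (bind a free UDP port per remote peer, then start connecting when the first packet arrives) can be illustrated with a small Go sketch. This is not NetBird's code: `lazyListener`, `newLazyListener`, `waitForTraffic` and `simulateFirstPacket` are hypothetical names, and the loopback address, buffer size and timeout are assumptions chosen only to keep the example self-contained.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// lazyListener is a hypothetical stand-in for a per-peer activity listener.
type lazyListener struct {
	conn *net.UDPConn
}

// newLazyListener binds a free UDP port on loopback; port 0 lets the OS pick one.
func newLazyListener() (*lazyListener, error) {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1), Port: 0})
	if err != nil {
		return nil, err
	}
	return &lazyListener{conn: conn}, nil
}

// addr returns the address that would be set as the peer's WireGuard endpoint.
func (l *lazyListener) addr() *net.UDPAddr {
	return l.conn.LocalAddr().(*net.UDPAddr)
}

// waitForTraffic blocks until the first packet arrives, then removes the listener.
// It returns true on activity and false on timeout or read error.
func (l *lazyListener) waitForTraffic(timeout time.Duration) bool {
	defer l.conn.Close()
	if err := l.conn.SetReadDeadline(time.Now().Add(timeout)); err != nil {
		return false
	}
	buf := make([]byte, 1500)
	_, _, err := l.conn.ReadFromUDP(buf)
	return err == nil
}

// simulateFirstPacket sends one packet to the listener and reports whether
// the listener noticed it.
func simulateFirstPacket() bool {
	l, err := newLazyListener()
	if err != nil {
		return false
	}
	activity := make(chan bool, 1)
	go func() { activity <- l.waitForTraffic(2 * time.Second) }()

	sender, err := net.DialUDP("udp", nil, l.addr())
	if err != nil {
		return false
	}
	defer sender.Close()
	if _, err := sender.Write([]byte("hello")); err != nil {
		return false
	}
	return <-activity
}

func main() {
	if simulateFirstPacket() {
		fmt.Println("traffic detected: start peer connection")
	}
}
```

In this sketch the listener is closed as soon as the first packet is seen, which mirrors the "removes this listener and starts the peer connection procedure" step in the commit message. How the real client hands over to the connection logic is not shown here.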


172 Commits