Each cache consumer (IDP cache, token store, PKCE store, secrets manager,
EDR validator) was independently calling NewStore, creating separate Redis
clients with their own connection pools — up to 1400 potential connections
from a single management server process.
Introduce a shared CacheStore() singleton on BaseServer that creates one
store at boot and injects it into all consumers. Consumer constructors now
receive a store.StoreInterface instead of creating their own.
For Redis mode, all consumers share one connection pool (1000 max conns).
For in-memory mode, all consumers share one GoCache instance.
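
A minimal sketch of the shared-store pattern described above. The names StoreInterface, memStore, TokenStore and PKCEStore are illustrative stand-ins rather than the actual management server types, and the store is built lazily with sync.Once here purely for brevity (the change itself creates it at boot).

```go
package main

import (
	"fmt"
	"sync"
)

// StoreInterface is a hypothetical stand-in for store.StoreInterface.
type StoreInterface interface {
	Get(key string) (string, bool)
	Set(key, value string)
}

// memStore is an illustrative in-memory store guarded by a mutex.
type memStore struct {
	mu   sync.Mutex
	data map[string]string
}

func (m *memStore) Get(key string) (string, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	v, ok := m.data[key]
	return v, ok
}

func (m *memStore) Set(key, value string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.data[key] = value
}

// BaseServer builds one store and hands the same instance to every consumer.
type BaseServer struct {
	once  sync.Once
	store StoreInterface
}

// CacheStore returns the single shared store, creating it on first use.
func (s *BaseServer) CacheStore() StoreInterface {
	s.once.Do(func() {
		s.store = &memStore{data: make(map[string]string)}
	})
	return s.store
}

// TokenStore and PKCEStore are hypothetical consumers that receive the
// store through their constructors instead of creating their own.
type TokenStore struct{ store StoreInterface }
type PKCEStore struct{ store StoreInterface }

func NewTokenStore(st StoreInterface) *TokenStore { return &TokenStore{store: st} }
func NewPKCEStore(st StoreInterface) *PKCEStore   { return &PKCEStore{store: st} }

func main() {
	srv := &BaseServer{}
	tokens := NewTokenStore(srv.CacheStore())
	pkce := NewPKCEStore(srv.CacheStore())

	// A write through one consumer is visible through the other,
	// because both hold the same underlying store.
	tokens.store.Set("k", "v")
	v, ok := pkce.store.Get("k")
	fmt.Println(v, ok)
}
```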
extraInitialRoutes() was meant to preserve only the fake IP route
(240.0.0.0/8) across TUN rebuilds, but it re-injected any initial
route missing from the current set. When the management server
advertised exit node routes (0.0.0.0/0) that were later filtered
by the route selector, extraInitialRoutes() re-added them, causing
the Android VPN to capture all traffic with no peer to handle it.
Store the fake IP route explicitly and append only that in notify(),
removing the overly broad initial route diffing.
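
The narrowed behaviour can be sketched roughly as follows. The function name routesToNotify and its signature are assumptions made for illustration; only the fake IP route (240.0.0.0/8) comes from the description above.

```go
package main

import (
	"fmt"
	"net/netip"
)

// fakeIPRoute is the only route preserved across TUN rebuilds in this sketch.
var fakeIPRoute = netip.MustParsePrefix("240.0.0.0/8")

// routesToNotify (hypothetical name) returns the currently selected routes
// plus the fake IP route. It never consults the initial route set, so a
// filtered-out 0.0.0.0/0 cannot be re-added.
func routesToNotify(selected []netip.Prefix) []netip.Prefix {
	out := make([]netip.Prefix, 0, len(selected)+1)
	hasFake := false
	for _, r := range selected {
		if r == fakeIPRoute {
			hasFake = true
		}
		out = append(out, r)
	}
	if !hasFake {
		out = append(out, fakeIPRoute)
	}
	return out
}

func main() {
	selected := []netip.Prefix{netip.MustParsePrefix("10.0.0.0/24")}
	fmt.Println(routesToNotify(selected))
}
```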
- Add GetSelectedClientRoutes() to the route manager that filters through FilterSelectedExitNodes, returning only active routes instead of all management routes
- Use GetSelectedClientRoutes() in the DNS route checker so deselected exit nodes' 0.0.0.0/0 no longer matches upstream DNS IPs; that match had prevented the resolver from switching away from the utun-bound socket after exit node deselection
- Initialize iOS DNS server with host DNS fallback addresses (1.1.1.1:53, 1.0.0.1:53) and a permanent root zone handler, matching Android's behavior; without this, unmatched DNS queries arriving via the 0.0.0.0/0 tunnel route had no handler and were silently dropped
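
To illustrate why filtering matters for the DNS route checker, here is a small sketch. The clientRoute type, the sampleRoutes data and both function names are hypothetical; the real GetSelectedClientRoutes() goes through FilterSelectedExitNodes and is not reproduced here.

```go
package main

import (
	"fmt"
	"net/netip"
)

// clientRoute is an illustrative route record, not the route manager's type.
type clientRoute struct {
	ID     string
	Prefix netip.Prefix
}

// sampleRoutes holds one exit node route and one ordinary network route.
var sampleRoutes = []clientRoute{
	{ID: "exit-node", Prefix: netip.MustParsePrefix("0.0.0.0/0")},
	{ID: "office-lan", Prefix: netip.MustParsePrefix("10.0.0.0/8")},
}

// selectedClientRoutes drops every route whose ID has been deselected.
func selectedClientRoutes(all []clientRoute, deselected map[string]bool) []clientRoute {
	var out []clientRoute
	for _, r := range all {
		if !deselected[r.ID] {
			out = append(out, r)
		}
	}
	return out
}

// isRouted reports whether ip falls inside any of the given routes.
// An unparsable address is treated as not routed.
func isRouted(ip string, routes []clientRoute) bool {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return false
	}
	for _, r := range routes {
		if r.Prefix.Contains(addr) {
			return true
		}
	}
	return false
}

func main() {
	selected := selectedClientRoutes(sampleRoutes, map[string]bool{"exit-node": true})
	// Checked against all routes, the upstream DNS IP matches 0.0.0.0/0.
	fmt.Println(isRouted("1.1.1.1", sampleRoutes))
	// Checked against selected routes only, it no longer matches.
	fmt.Println(isRouted("1.1.1.1", selected))
}
```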
Update the mgmProber interface to use HealthCheck() instead of the
now-unexported GetServerPublicKey(), aligning with the changes in the
management client API.
* Unexport GetServerPublicKey, add HealthCheck method
Internalize server key fetching into Login, Register,
GetDeviceAuthorizationFlow, and GetPKCEAuthorizationFlow methods,
removing the need for callers to fetch and pass the key separately.
Replace the exported GetServerPublicKey with a HealthCheck() error
method for connection validation, keeping IsHealthy() bool for
non-blocking background monitoring.
Fix test encryption to use correct key pairs (client public key as
remotePubKey instead of server private key).
* Refactor `doMgmLogin` to return only error, removing unused response
- DNS resolution broke after deselecting an exit node because the route checker used all client routes (including deselected ones) to decide how to forward upstream DNS queries
- Added GetSelectedClientRoutes() to the route manager that filters out deselected exit nodes, and switched the DNS route checker to use it
- Confirmed fix via device testing: after deselecting exit node, DNS queries now correctly use a regular network socket instead of binding to the utun interface
* [client] Support embed.Client on Android with netstack mode
embed.Client.Start() calls ConnectClient.Run() which passes an empty
MobileDependency{}. On Android, the engine dereferences nil fields
(IFaceDiscover, NetworkChangeListener, DnsReadyListener) causing panics.
Provide complete no-op stubs so the engine's existing Android code
paths work unchanged — zero modifications to engine.go:
- Add androidRunOverride hook in Run() for Android-specific dispatch
- Add runOnAndroidEmbed() with complete MobileDependency (all stubs)
- Wire default stubs via init() in connect_android_default.go:
noopIFaceDiscover, noopNetworkChangeListener, noopDnsReadyListener
- Forward logPath to c.run()
Tested: embed.Client starts on Android arm64, joins mesh via relay,
discovers peers, localhost proxy works for TCP+UDP forwarding.
* [client] Fix TestServiceParamsPath for Windows path separators
Use filepath.Join in test assertions instead of hardcoded POSIX paths
so the test passes on Windows where filepath.Join uses backslashes.
Remove client secret from gRPC auth flow. The secret was originally included to support providers like Google Workspace that don't offer a proper PKCE flow, but this is no longer necessary with the embedded IdP. Deployments using such providers should migrate to the embedded IdP instead.
* [client] Add Expose support to embed library
Add ability to expose local services via the NetBird reverse proxy
from embedded client code.
Introduce ExposeSession with a blocking Wait method that keeps
the session alive until the context is cancelled.
Extract ProtocolType with ParseProtocolType into the expose package
and use it across CLI and embed layers.
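
A rough sketch of what a blocking Wait of this kind might look like. The ExposeSession fields, the keepAlive callback and the demo helper are assumptions for illustration and do not mirror the embed library's actual API.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// ExposeSession is a hypothetical stand-in for the embed library's type.
type ExposeSession struct {
	interval  time.Duration
	keepAlive func(ctx context.Context) error // assumed renewal call
}

// Wait blocks, renewing the session on every tick, until ctx is cancelled
// or a renewal fails.
func (s *ExposeSession) Wait(ctx context.Context) error {
	ticker := time.NewTicker(s.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if err := s.keepAlive(ctx); err != nil {
				return fmt.Errorf("renew expose session: %w", err)
			}
		}
	}
}

// demo keeps a session alive for a short time and returns the number of
// renewals together with the error Wait returned.
func demo() (int, error) {
	renewals := 0
	s := &ExposeSession{
		interval: 10 * time.Millisecond,
		keepAlive: func(context.Context) error {
			renewals++
			return nil
		},
	}
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	err := s.Wait(ctx)
	return renewals, err
}

func main() {
	n, err := demo()
	fmt.Println("renewals:", n, "wait returned:", err)
}
```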
* Fix TestNewRequest assertion to use ProtocolType instead of int
* Add documentation for Request and KeepAlive in expose manager
* Refactor ExposeSession to pass context explicitly in Wait method
* Refactor ExposeSession Wait method to explicitly pass context
* Update client/embed/expose.go
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* Fix build
* Update client/embed/expose.go
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
---------
Co-authored-by: Viktor Liu <viktor@netbird.io>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Viktor Liu <17948409+lixmal@users.noreply.github.com>
* Add client metrics
* Add client metrics system with OpenTelemetry and VictoriaMetrics support
Implements a comprehensive client metrics system to track peer connection
stages and performance. The system supports multiple backend implementations
(OpenTelemetry, VictoriaMetrics, and no-op) and tracks detailed connection
stage durations from creation through WireGuard handshake.
Key changes:
- Add metrics package with pluggable backend implementations
- Implement OpenTelemetry metrics backend
- Implement VictoriaMetrics metrics backend
- Add no-op metrics implementation for disabled state
- Track connection stages: creation, semaphore, signaling, connection ready, and WireGuard handshake
- Move WireGuard watcher functionality to conn.go
- Refactor engine to integrate metrics tracking
- Add metrics export endpoint in debug server
* Add signaling metrics tracking for initial and reconnection attempts
* Reset connection stage timestamps during reconnections to exclude unnecessary metrics tracking
* Delete otel lib from client
* Update unit tests
* Invoke callback on handshake success in WireGuard watcher
* Add NetBird version tracking to client metrics
Integrate NetBird version into VictoriaMetrics backend and metrics labels. Update `ClientMetrics` constructor and metric name formatting to include version information.
* Add sync duration tracking to client metrics
Introduce `RecordSyncDuration` for measuring sync message processing time. Update all metrics implementations (VictoriaMetrics, no-op) to support the new method. Refactor `ClientMetrics` to use `AgentInfo` for static agent data.
* Remove no-op metrics implementation and simplify ClientMetrics constructor
Eliminate unused `noopMetrics` and refactor `ClientMetrics` to always use the VictoriaMetrics implementation. Update associated logic to reflect these changes.
* Add total duration tracking for connection attempts
Calculate total duration for both initial connections and reconnections, accounting for different timestamp scenarios. Update `Export` method to include Prometheus HELP comments.
* Add metrics push support to VictoriaMetrics integration
* [client] anchor connection metrics to first signal received
* Remove creation_to_semaphore connection stage metric
The semaphore queuing stage (Created → SemaphoreAcquired) is no longer
tracked. Connection metrics now start from SignalingReceived. Updated
docs and Grafana dashboard accordingly.
* [client] Add remote push config for metrics with version-based eligibility
Introduce remoteconfig.Manager that fetches a remote JSON config to control
metrics push interval and restrict pushing to a specific agent version
range. When NB_METRICS_INTERVAL is set, remote config is bypassed
entirely for local override.
* [client] Add WASM-compatible NewClientMetrics implementation
Replace NewClientMetrics in metrics.go with a WASM-specific stub in metrics_js.go, returning nil for compatibility with JS builds. Simplify method usage for WASM targets.
* Add missing file
* Update default case in DeploymentType.String to return "unknown" instead of "selfhosted"
* [client] Rework metrics to use timestamped samples instead of histograms
Replace cumulative Prometheus histograms with timestamped point-in-time
samples that are pushed once and cleared. This fixes metrics for sparse
events (connections/syncs that happen once at startup) where rate() and
increase() produced incorrect or empty results.
Changes:
- Switch from VictoriaMetrics histogram library to raw Prometheus text
format with explicit millisecond timestamps
- Reset samples after successful push (no resending stale data)
- Rename connection_to_handshake → connection_to_wg_handshake
- Add netbird_peer_connection_count metric for ICE vs Relay tracking
- Simplify dashboard: point-based scatter plots, donut pie chart
- Add maxStalenessInterval=1m to VictoriaMetrics to prevent forward-fill
- Fix deployment_type Unknown returning "selfhosted" instead of "unknown"
- Fix inverted shouldPush condition in push.go
* [client] Add InfluxDB metrics backend alongside VictoriaMetrics
Add influxdb.go with timestamped line protocol export for sparse
one-shot events. Restore victoria.go to use proper Prometheus
histograms. Update Grafana dashboards, add InfluxDB datasource,
and update docs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [client] Fix metrics issues and update dev docker setup
- Fix StopPush not clearing push state, preventing restart
- Fix race condition reading currentConnPriority without lock in recordConnectionMetrics
- Fix stale comment referencing old metrics server URL
- Update docker-compose for InfluxDB: add scoped tokens, .env config, init scripts
- Rename docker-compose.victoria.yml to docker-compose.yml
* [client] Add anonymised peer tracking to pushed metrics
Introduce peer_id and connection_pair_id tags to InfluxDB metrics.
Public keys are hashed (truncated SHA-256) for anonymisation. The
connection pair ID is deterministic regardless of which side computes
it, enabling deduplication of reconnections in the ICE vs Relay
dashboard. Also pin Grafana to v11.6.0 for file-based provisioning
and fix datasource UID references.
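
One way to obtain an order-independent pair ID is to sort the two keys before hashing, as in the sketch below. The 8-byte truncation length, the "|" separator and the function names are assumptions; the description above only states that public keys are hashed with truncated SHA-256 and that the pair ID is the same on both sides.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashLen is an assumed truncation length in bytes.
const hashLen = 8

// anonymise returns a truncated, hex-encoded SHA-256 of a public key.
func anonymise(pubKey string) string {
	sum := sha256.Sum256([]byte(pubKey))
	return hex.EncodeToString(sum[:hashLen])
}

// connectionPairID orders the two keys before hashing so that both peers
// derive the same ID regardless of which one is "local".
func connectionPairID(localKey, remoteKey string) string {
	a, b := localKey, remoteKey
	if a > b {
		a, b = b, a
	}
	sum := sha256.Sum256([]byte(a + "|" + b))
	return hex.EncodeToString(sum[:hashLen])
}

func main() {
	fmt.Println(anonymise("peerA-public-key"))
	fmt.Println(connectionPairID("peerA-public-key", "peerB-public-key"))
	fmt.Println(connectionPairID("peerB-public-key", "peerA-public-key"))
}
```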
* Remove unused dependencies from go.mod and go.sum
* Refactor InfluxDB ingest pipeline: extract validation logic
- Move line validation logic to `validateLine` and `validateField` helper functions.
- Improve error handling with structured validation and clearer separation of concerns.
- Add stderr redirection for error messages in `create-tokens.sh`.
* Set non-root user in Dockerfile for Ingest service
* Fix Windows CI: command line too long
* Remove Victoria metrics
* Add hashed peer ID as Authorization header in metrics push
* Revert influxdb in docker compose
* Enable gzip compression and authorization validation for metrics push and ingest
* Reduce code complexity
* Update debug documentation to include metrics.txt description
* Increase `maxBodySize` limit to 50 MB and update gzip reader wrapping logic
* Refactor deployment type detection to use URL parsing for improved accuracy
* Update readme
* Throttle remote config retries on fetch failure
* Preserve first WG handshake timestamp, ignore rekeys
* Skip adding empty metrics.txt to debug bundle in debug mode
* Update default metrics server URL to https://ingest.netbird.io
* Atomic metrics export-and-reset to prevent sample loss between Export and Reset calls
* Fix doc
* Refactor Push configuration to improve clarity and enforce minimum push interval
* Remove `minPushInterval` and update push interval validation logic
* Revert ExportAndReset; the resulting data loss is acceptable
* Fix metrics review issues: rename env var, remove stale infra, add tests
- Rename NB_METRICS_ENABLED to NB_METRICS_PUSH_ENABLED to clarify that
collection is always active (for debug bundles) and only push is opt-in
- Change default config URL from staging to production (ingest.netbird.io)
- Delete broken Prometheus dashboard (used non-existent metric names)
- Delete unused VictoriaMetrics datasource config
- Replace committed .env with .env.example containing placeholder values
- Wire Grafana admin credentials through env vars in docker-compose
- Make metricsStages a pointer to prevent reset-vs-write race on reconnect
- Fix typed-nil interface in debug bundle path (GetClientMetrics)
- Use deterministic field order in InfluxDB Export (sorted keys)
- Replace Authorization header with X-Peer-ID for metrics push
- Fix ingest server timeout to use time.Second instead of float
- Fix gzip double-close, stale comments, trim log levels
- Add tests for influxdb.go and MetricsStages
* Add login duration metric, ingest tag validation, and duration bounds
- Add netbird_login measurement recording login/auth duration to management
server, with success/failure result tag
- Validate InfluxDB tags against per-measurement allowlists in ingest server
to prevent arbitrary tag injection
- Cap all duration fields (*_seconds) at 300s instead of only total_seconds
- Add ingest server tests for tag/field validation, bounds, and auth
* Add arch tag to all metrics
* Fix Grafana dashboard: add arch to drop columns, add login panels
* Validate NB_METRICS_SERVER_URL is an absolute HTTP(S) URL
* Address review comments: fix README wording, update stale comments
* Clarify env var precedence does not bypass remote config eligibility
* Remove accidentally committed pprof files
---------
Co-authored-by: Viktor Liu <viktor@netbird.io>
Replace fmt.Sprintf("%s:%d", ip, port) with net.JoinHostPort() to
properly handle IPv6 addresses that need bracket wrapping (e.g.,
[2606:4700:4700::1111]:53 instead of 2606:4700:4700::1111:53).
Without this fix, configuring IPv6 nameservers causes "too many colons
in address" errors because Go's net.Dial cannot parse the malformed
address string.
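
The difference is easy to see in a small example. The helper name nameserverAddr is made up for illustration; net.JoinHostPort and strconv.Itoa are standard library calls.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// nameserverAddr builds a dialable host:port string. net.JoinHostPort adds
// brackets around the host when it contains a colon, as IPv6 literals do.
func nameserverAddr(ip string, port int) string {
	return net.JoinHostPort(ip, strconv.Itoa(port))
}

func main() {
	fmt.Println(nameserverAddr("1.1.1.1", 53))              // 1.1.1.1:53
	fmt.Println(nameserverAddr("2606:4700:4700::1111", 53)) // [2606:4700:4700::1111]:53

	// The old formatting yields an address that cannot be parsed.
	fmt.Println(fmt.Sprintf("%s:%d", "2606:4700:4700::1111", 53)) // 2606:4700:4700::1111:53
}
```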
Fixes #5601
Related to #4074
Co-authored-by: easonysliu <easonysliu@tencent.com>
Auto-update logic moved out of the UI into a dedicated updatemanager.Manager service that runs in the connection layer. The UI no longer polls or checks for updates independently.
The update manager supports three modes driven by the management server's auto-update policy:
- No policy set by mgm: checks GitHub for the latest version and notifies the user (previous behavior, now centralized)
- mgm enforces update: the "About" menu triggers installation directly instead of just downloading the file; the user still initiates the action
- mgm forces update: installation proceeds automatically without user interaction
updateManager lifecycle is now owned by daemon, giving the daemon server direct control via a new TriggerUpdate RPC
Introduces EngineServices struct to group external service dependencies passed to NewEngine, reducing its argument count from 11 to 4
* Fix DNS probe thread safety and avoid blocking engine sync
Refactor ProbeAvailability to prevent blocking the engine's sync mutex
during slow DNS probes. The probe now derives its context from the
server's own context (s.ctx) instead of accepting one from the caller,
and uses a mutex to ensure only one probe runs at a time — new calls
cancel the previous probe before starting. Also fixes a data race in
Stop() when accessing probeCancel without the probe mutex.
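
The cancel-and-replace pattern can be sketched as follows. The prober type, newProber and startProbe are hypothetical names; only probeCancel and the idea of deriving the probe context from the server's own context come from the description above.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// prober is an illustrative stand-in for the DNS server's probe state.
type prober struct {
	ctx         context.Context // server-owned context, standing in for s.ctx
	mu          sync.Mutex      // guards probeCancel
	probeCancel context.CancelFunc
}

func newProber() *prober {
	return &prober{ctx: context.Background()}
}

// startProbe cancels any probe still running, then returns a fresh context
// derived from the server context for the new probe to use.
func (p *prober) startProbe() context.Context {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.probeCancel != nil {
		p.probeCancel()
	}
	ctx, cancel := context.WithCancel(p.ctx)
	p.probeCancel = cancel
	return ctx
}

// Stop cancels the running probe, reading probeCancel under the same mutex.
func (p *prober) Stop() {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.probeCancel != nil {
		p.probeCancel()
		p.probeCancel = nil
	}
}

func main() {
	p := newProber()
	first := p.startProbe()
	second := p.startProbe()
	fmt.Println("first cancelled:", first.Err() != nil)
	fmt.Println("second running:", second.Err() == nil)
	p.Stop()
	fmt.Println("second cancelled after Stop:", second.Err() != nil)
}
```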
* Ensure DNS probe thread safety by locking critical sections
Add proper locking to prevent data races when accessing shared resources during DNS probe execution and Stop(). Update handlers snapshot logic to avoid conflicts with concurrent writers.
* Rename context and remove redundant cancellation
* Cancel first and lock
* Add locking to ensure thread safety when reactivating upstream servers
Wrap peerStateUpdate send in a nested select to prevent goroutine
blocking when the consumer has exited, which could fill the
subscription buffer and deadlock the Status mutex.
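
A sketch of a nested select of this kind, assuming the subscriber signals its exit through a done channel. The notify function and its parameters are illustrative and may differ from the actual peerStateUpdate code.

```go
package main

import "fmt"

// notify tries a non-blocking send first. If the buffer is full, it waits
// for either buffer space or the consumer's exit, so the sender can never
// block forever on a subscriber that has gone away.
func notify(done <-chan struct{}, ch chan<- struct{}) bool {
	select {
	case ch <- struct{}{}:
		return true
	default:
		select {
		case ch <- struct{}{}:
			return true
		case <-done:
			return false
		}
	}
}

func main() {
	ch := make(chan struct{}, 1)
	done := make(chan struct{})

	fmt.Println(notify(done, ch)) // buffer has room, send succeeds

	close(done)                   // consumer exits without draining ch
	fmt.Println(notify(done, ch)) // buffer full, returns instead of blocking
}
```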
- Automatic Unix daemon address discovery: if the default socket is missing, the client can find and use a single available socket.
- Client startup now resolves daemon addresses more robustly while preserving non-Unix behavior.
Consolidate all expose business logic (validation, permission checks, TTL tracking, reaping) into the manager layer, making the gRPC layer a pure transport adapter that only handles proto conversion and authentication.
- Add ExposeServiceRequest/ExposeServiceResponse domain types with validation in the reverseproxy package
- Move expose tracker (TTL tracking, reaping, per-peer limits) from gRPC server into manager/expose_tracker.go
- Internalize tracking in CreateServiceFromPeer, RenewServiceFromPeer, and new StopServiceFromPeer so callers don't manage tracker state
- Untrack ephemeral services in DeleteService/DeleteAllServices to keep tracker in sync when services are deleted via API
- Simplify gRPC expose handlers to parse, auth, convert, delegate
- Remove tracker methods from Manager interface (internal detail)
CLI: new expose command to publish a local port with flags for PIN, password, user groups, custom domain, name prefix and protocol (HTTP default).
Management/API: create/renew/stop expose sessions (streamed status), automatic naming/domain, TTL renewals, background expiration, new management RPCs and client methods.
UI/API: account settings now include peer_expose_enabled and peer_expose_groups; new activity codes for peer expose events.
could interleave with a sleep/wake event causing out-of-order state
transitions. The mutex now covers the full duration of each handler
including the status check, the Up/Down call, and the flag update.
Note: if Up or Down commands are triggered in parallel with sleep/wake
events, the overall ordering of up/down/sleep/wake operations is still
not guaranteed beyond what the mutex provides within the handler itself.
* [Client] Remove connection semaphore
Remove the semaphore and the initial random sleep time (300ms) from the connectivity logic to speed up the initial connection time.
Note: a follow-up should implement limiter logic that prioritizes router peers while keeping the fast connection path for the first few peers.
* Remove unused function
* [client] fix busy-loop in network monitor routing socket on macOS/BSD
After system wakeup, the AF_ROUTE socket created by Go's unix.Socket()
is non-blocking, causing unix.Read to return EAGAIN immediately and spin
at 100% CPU filling the log with thousands of warnings per second.
Replace the tight read loop with a unix.Select call that blocks until
the fd is readable, checking ctx cancellation on each 1-second timeout.
Fatal errors (EBADF, EINVAL) now return an error instead of looping.
* [client] add fd range validation in waitReadable to prevent out-of-bounds errors
* Ensure route settlement on iOS before handling DNS responses to prevent bypassing the tunnel.
* add more logs
* rollback debug changes
* rollback changes
* [client] Improve logging and add comments for iOS route settlement logic
- Switch iOS route settlement log level from Debug to Trace for finer control.
- Add clarifying comments for `waitForRouteSettlement` on non-iOS platforms.
---------
Co-authored-by: mlsmaycon <mlsmaycon@gmail.com>
* [client] Batch macOS DNS domains across multiple scutil keys to avoid truncation
scutil has undocumented limits: 99-element cap on d.add arrays and ~2048
byte value buffer for SupplementalMatchDomains. Users with 60+ domains
hit silent domain loss. This applies the same batching approach used on
Windows (nrptMaxDomainsPerRule=50), splitting domains into indexed
resolver keys (NetBird-Match-0, NetBird-Match-1, etc.) with 50-element
and 1500-byte limits per key.
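
The batching rule can be sketched like this. The constant and function names are hypothetical, and the byte accounting (domain length plus one separator byte) is an assumption about how the 1500-byte limit is measured.

```go
package main

import "fmt"

const (
	maxDomainsPerKey = 50   // element limit per resolver key
	maxBytesPerKey   = 1500 // byte limit per resolver key
)

// batchDomains splits domains into groups that respect both limits.
// A domain is always placed in a batch, even if it alone exceeds the
// byte limit, so no domain is silently dropped.
func batchDomains(domains []string) [][]string {
	var batches [][]string
	var cur []string
	size := 0
	for _, d := range domains {
		need := len(d) + 1 // assumed: one separator byte per domain
		if len(cur) > 0 && (len(cur) >= maxDomainsPerKey || size+need > maxBytesPerKey) {
			batches = append(batches, cur)
			cur, size = nil, 0
		}
		cur = append(cur, d)
		size += need
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

// keyName returns the indexed resolver key name for batch i.
func keyName(i int) string {
	return fmt.Sprintf("NetBird-Match-%d", i)
}

func main() {
	domains := make([]string, 120)
	for i := range domains {
		domains[i] = fmt.Sprintf("svc%d.example.com", i)
	}
	for i, b := range batchDomains(domains) {
		fmt.Println(keyName(i), len(b))
	}
}
```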
* check for all keys on getRemovableKeysWithDefaults
* use multi error
* Refactor WG endpoint setup with role-based proxy activation
For relay connections, the controller (initiator) now activates the
wgProxy before configuring the WG endpoint, while the non-controller
(responder) configures the endpoint first with a delayed update, then
activates the proxy after. This prevents the responder from sending
traffic through the proxy before WireGuard is ready to receive it,
avoiding handshake congestion when both sides try to initiate
simultaneously.
For ICE connections, pass hasRelayBackup as the setEndpointNow flag
so the responder sets the endpoint immediately when a relay fallback
exists (avoiding the delayed update path since relay is already
available as backup).
On ICE disconnect with relay fallback, remove the duplicate
wgProxyRelay.Work() calls — the relay proxy is already active from
initial setup, so re-activating it is unnecessary.
In EndpointUpdater, split ConfigureWGEndpoint into explicit
configureAsInitiator and configureAsResponder paths, and add the
setEndpointNow parameter to let the caller control whether the
responder applies the endpoint immediately or defers it. Add unused
SwitchWGEndpoint and RemoveEndpointAddress methods. Remove the
wgConfigWorkaround sleep from the relay setup path.
* Fix redundant wgProxyRelay.Work() call during relay fallback setup
* Simplify WireGuard endpoint configuration by removing unused parameters and redundant logic
When an ICE connection disconnects and falls back to relay, reset the
WireGuard endpoint and handshake watcher if the remote peer's ICE session
has changed. This ensures the controller re-establishes a fresh WireGuard
handshake rather than waiting on a stale endpoint from the previous session.
* Optimize Windows DNS performance with domain batching and batch mode
Implement two-layer optimization to reduce Windows NRPT registry operations:
1. Domain Batching (host_windows.go):
- Batch multiple domains into each NRPT rule (nrptMaxDomainsPerRule=50)
- Reduces NRPT rules by ~97% (e.g., 184 domains: 184 rules → 4 rules)
- Modified addDNSMatchPolicy() to create batched NRPT entries
- Added comprehensive tests in host_windows_test.go
2. Batch Mode (server.go):
- Added BeginBatch/EndBatch methods to defer DNS updates
- Modified RegisterHandler/DeregisterHandler to skip applyHostConfig in batch mode
- Protected all applyHostConfig() calls with batch mode checks
- Updated route manager to wrap route operations with batch calls
* Update tests
* Fix log line
* Fix NRPT rule index to ensure cleanup covers partially created rules
* Ensure NRPT entry count updates even on errors to improve cleanup reliability
* Switch DNS batch mode logging from Info to Debug level
* Fix batch mode to not suppress critical DNS config updates
Batch mode should only defer applyHostConfig() for RegisterHandler/
DeregisterHandler operations. Management updates and upstream nameserver
failures (deactivate/reactivate callbacks) need immediate DNS config
updates regardless of batch mode to ensure timely failover.
Without this fix, if a nameserver goes down during a route update,
the system DNS config won't be updated until EndBatch(), potentially
delaying failover by several seconds.
* Add DNS batch cancellation to rollback partial changes on errors
Introduces CancelBatch() method to the DNS server interface to handle error
scenarios during batch operations. When route updates fail partway through, the DNS
server can now discard accumulated changes instead of applying partial state. This
prevents leaving the DNS configuration in an inconsistent state when route manager
operations encounter errors.
The changes add error-aware batch handling to prevent partial DNS configuration
updates when route operations fail, which improves system reliability.
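
A simplified sketch of the batch semantics described in the commits above: handler registrations are staged while a batch is open, EndBatch applies them once, CancelBatch discards them, and upstream failures always apply immediately. The dnsServer type, its fields and OnUpstreamFailure are illustrative assumptions, not the actual server.go implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// dnsServer is an illustrative stand-in for the DNS server.
type dnsServer struct {
	mu       sync.Mutex
	batching bool
	handlers map[string]bool
	staged   []string // registrations deferred while a batch is open
	applied  int      // counts host config applications, for illustration
}

func newDNSServer() *dnsServer {
	return &dnsServer{handlers: make(map[string]bool)}
}

// applyHostConfig must be called with mu held.
func (s *dnsServer) applyHostConfig() { s.applied++ }

func (s *dnsServer) BeginBatch() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.batching = true
}

// RegisterHandler applies immediately outside a batch and stages otherwise.
func (s *dnsServer) RegisterHandler(domain string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.batching {
		s.staged = append(s.staged, domain)
		return
	}
	s.handlers[domain] = true
	s.applyHostConfig()
}

// EndBatch commits staged registrations with a single host config update.
func (s *dnsServer) EndBatch() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.batching = false
	if len(s.staged) == 0 {
		return
	}
	for _, d := range s.staged {
		s.handlers[d] = true
	}
	s.staged = nil
	s.applyHostConfig()
}

// CancelBatch discards staged registrations without touching host config.
func (s *dnsServer) CancelBatch() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.batching = false
	s.staged = nil
}

// OnUpstreamFailure is a critical path: it applies even during a batch.
func (s *dnsServer) OnUpstreamFailure() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.applyHostConfig()
}

func main() {
	s := newDNSServer()
	s.BeginBatch()
	s.RegisterHandler("a.example.com")
	s.RegisterHandler("b.example.com")
	s.EndBatch()
	fmt.Println("host config applications:", s.applied)
}
```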