Bump the IsHealthy() context timeout from 1s to 5s for both the
management and signal gRPC clients to reduce false negatives on
slower or congested connections.
* [debug] fix port collision in TestUpload
TestUpload hardcoded :8080, so it failed deterministically when anything
was already on that port and collided across concurrent test runs.
Bind a :0 listener in the test to get a kernel-assigned free port, and
add Server.Serve so tests can hand the listener in without reaching
into unexported state.
* [debug] drop test-only Server.Serve, use SERVER_ADDRESS env
The previous commit added a Server.Serve method on the upload-server,
used only by TestUpload. That left production with an unused function.
Reserve an ephemeral loopback port in the test, release it, and pass
the address through SERVER_ADDRESS (which the server already reads).
A small wait helper ensures the server is accepting connections before
the upload runs, so the close/rebind gap does not cause a false failure.
The Receive goroutine could outlive the test and call t.Logf after
teardown, panicking with "Log in goroutine after ... has completed".
Register a cleanup that waits for the goroutine to exit; ordering is
LIFO so it runs after client.Close, which is what unblocks Receive.
The test writes 500 packets per family and asserted exact-count
delivery within a 5s window, even though its own comment says "Some
packet loss is acceptable for UDP". On FreeBSD/QEMU runners the writer
loops cannot always finish all 500 before the 5s deadline closes the
readers (we have seen 411/500 in CI).
The real assertion of this test is the routing check — IPv4 peer only
gets v4- packets, IPv6 peer only gets v6- packets — which remains
strict. Replace the exact-count assertions with a >=80% delivery
threshold so runner speed variance no longer causes false failures.
* [client] Suppress ICE signaling and periodic offers in force-relay mode
When NB_FORCE_RELAY is enabled, skip WorkerICE creation entirely,
suppress ICE credentials in offer/answer messages, disable the
periodic ICE candidate monitor, and fix isConnectedOnAllWay to
only check relay status so the guard stops sending unnecessary offers.
* [client] Dynamically suppress ICE based on remote peer's offer credentials
Track whether the remote peer includes ICE credentials in its
offers/answers. When remote stops sending ICE credentials, skip
ICE listener dispatch, suppress ICE credentials in responses, and
exclude ICE from the guard connectivity check. When remote resumes
sending ICE credentials, re-enable all ICE behavior.
* [client] Fix nil SessionID panic and force ICE teardown on relay-only transition
Fix nil pointer dereference in signalOfferAnswer when SessionID is nil
(relay-only offers). Close stale ICE agent immediately when remote peer
stops sending ICE credentials to avoid traffic black-hole during the
ICE disconnect timeout.
* [client] Add relay-only fallback check when ICE is unavailable
Ensure the relay connection is supported with the peer when ICE is disabled to prevent connectivity issues.
* [client] Add tri-state connection status to guard for smarter ICE retry (#5828)
* [client] Add tri-state connection status to guard for smarter ICE retry
Refactor isConnectedOnAllWay to return a ConnStatus enum (Connected,
Disconnected, PartiallyConnected) instead of a boolean. When relay is
up but ICE is not (PartiallyConnected), limit ICE offers to 3 retries
with exponential backoff then fall back to hourly attempts, reducing
unnecessary signaling traffic. Fully disconnected peers continue to
retry aggressively. External events (relay/ICE disconnect, signal/relay
reconnect) reset retry state to give ICE a fresh chance.
* [client] Clarify guard ICE retry state and trace log trigger
Split iceRetryState.attempt into shouldRetry (pure predicate) and
enterHourlyMode (explicit state transition) so the caller in
reconnectLoopWithRetry reads top-to-bottom. Restore the original
trace-log behavior in isConnectedOnAllWay so it only logs on full
disconnection, not on the new PartiallyConnected state.
* [client] Extract pure evalConnStatus and add unit tests
Split isConnectedOnAllWay into a thin method that snapshots state and
a pure evalConnStatus helper that takes a connStatusInputs struct, so
the tri-state decision logic can be exercised without constructing
full Worker or Handshaker objects. Add table-driven tests covering
force-relay, ICE-unavailable and fully-available code paths, plus
unit tests for iceRetryState budget/hourly transitions and reset.
* [client] Improve grammar in logs and refactor ICE credential checks
* fix(client): skip MAC address filter for network addresses on iOS
iOS does not expose hardware (MAC) addresses due to Apple's privacy
restrictions (since iOS 14), causing networkAddresses() to return an
empty list because all interfaces are filtered out by the HardwareAddr
check. Move networkAddresses() to platform-specific files so iOS can
skip this filter.
WGIface.Close() took w.mu and held it across w.tun.Close(). The
underlying wireguard-go device waits for its send/receive goroutines to
drain before Close() returns, and some of those goroutines re-enter
WGIface during shutdown. In particular, the userspace packet filter DNS
hook in client/internal/dns.ServiceViaMemory.filterDNSTraffic calls
s.wgInterface.GetDevice() on every packet, which also needs w.mu. With
the Close-side holding the mutex, the read goroutine blocks in
GetDevice and Close waits forever for that goroutine to exit:
goroutine N (TestDNSPermanent_updateUpstream):
WGIface.Close -> holds w.mu -> tun.Close -> sync.WaitGroup.Wait
goroutine M (wireguard read routine):
FilteredDevice.Read -> filterOutbound -> udpHooksDrop ->
filterDNSTraffic.func1 -> WGIface.GetDevice -> sync.Mutex.Lock
This surfaces as a 5 minute test timeout on the macOS Client/Unit
CI job (panic: test timed out after 5m0s, running tests:
TestDNSPermanent_updateUpstream).
Release w.mu before calling w.tun.Close(). The other Close steps
(wgProxyFactory.Free, waitUntilRemoved, Destroy) do not mutate any
fields guarded by w.mu beyond what Free() already does, so the lock
is not needed once the tun has started shutting down. A new unit test
in iface_close_test.go uses a fake WGTunDevice to reproduce the
deadlock deterministically without requiring CAP_NET_ADMIN.
* Add support for legacy IDP cache environment variable
* Centralize cache store creation to reuse a single Redis connection pool
Each cache consumer (IDP cache, token store, PKCE store, secrets manager,
EDR validator) was independently calling NewStore, creating separate Redis
clients with their own connection pools — up to 1400 potential connections
from a single management server process.
Introduce a shared CacheStore() singleton on BaseServer that creates one
store at boot and injects it into all consumers. Consumer constructors now
receive a store.StoreInterface instead of creating their own.
For Redis mode, all consumers share one connection pool (1000 max conns).
For in-memory mode, all consumers share one GoCache instance.
* Update management-integrations module to latest version
* sync go.sum
* Export `GetAddrFromEnv` to allow reuse across packages
* Update management-integrations module version in go.mod and go.sum
* Update management-integrations module version in go.mod and go.sum
The iOS GetInfo() function never populated NetworkAddresses, causing
the peer_network_range_check posture check to fail for all iOS clients.
This adds the same networkAddresses() call that macOS, Linux, Windows,
and FreeBSD already use.
Fixes: #3968Fixes: #4657
* [client] Fix flow client Receive retry loop not stopping after Close
Use backoff.Permanent for canceled gRPC errors so Receive returns
immediately instead of retrying until context deadline when the
connection is already closed. Add TestNewClient_PermanentClose to
verify the behavior.
The connectivity.Shutdown check was meaningless because when the connection is
shut down, c.realClient.Events(ctx, grpc.WaitForReady(true)) on the nex line
already fails with codes.Canceled — which is now handled as a permanent error.
The explicit state check was just duplicating what gRPC already reports
through its normal error path.
* [client] remove WaitForReady from stream open call
grpc.WaitForReady(true) parks the RPC call internally until the
connection reaches READY, only unblocking on ctx cancellation.
This means the external backoff.Retry loop in Receive() never gets
control back during a connection outage — it cannot tick, log, or
apply its retry intervals while WaitForReady is blocking.
Removing it restores fail-fast behaviour: Events() returns immediately
with codes.Unavailable when the connection is not ready, which is
exactly what the backoff loop expects. The backoff becomes the single
authority over retry timing and cadence, as originally intended.
* [client] Add connection recreation and improve flow client error handling
Store gRPC dial options on the client to enable connection recreation
on Internal errors (RST_STREAM/PROTOCOL_ERROR). Treat Unauthenticated,
PermissionDenied, and Unimplemented as permanent failures. Unify mutex
usage and add reconnection logging for better observability.
* [client] Remove Unauthenticated, PermissionDenied, and Unimplemented from permanent error handling
* [client] Fix error handling in Receive to properly re-establish stream and improve reconnection messaging
* Fix test
* [client] Add graceful shutdown handling and test for concurrent Close during Receive
Prevent reconnection attempts after client closure by tracking a `closed` flag. Use `backoff.Permanent` for errors caused by operations on a closed client. Add a test to ensure `Close` does not block when `Receive` is actively running.
* [client] Fix connection swap to properly close old gRPC connection
Close the old `gRPC.ClientConn` after successfully swapping to a new connection during reconnection.
* [client] Reset backoff
* [client] Ensure stream closure on error during initialization
* [client] Add test for handling server-side stream closure and reconnection
Introduce `TestReceive_ServerClosesStream` to verify the client's ability to recover and process acknowledgments after the server closes the stream. Enhance test server with a controlled stream closure mechanism.
* [client] Add protocol error simulation and enhance reconnection test
Introduce `connTrackListener` to simulate HTTP/2 RST_STREAM with PROTOCOL_ERROR for testing. Refactor and rename `TestReceive_ServerClosesStream` to `TestReceive_ProtocolErrorStreamReconnect` to verify client recovery on protocol errors.
* [client] Update Close error message in test for clarity
* [client] Fine-tune the tests
* [client] Adjust connection tracking in reconnection test
* [client] Wait for Events handler to exit in RST_STREAM reconnection test
Ensure the old `Events` handler exits fully before proceeding in the reconnection test to avoid dropped acknowledgments on a broken stream. Add a `handlerDone` channel to synchronize handler exits.
* [client] Prevent panic on nil connection during Close
* [client] Refactor connection handling to use explicit target tracking
Introduce `target` field to store the gRPC connection target directly, simplifying reconnections and ensuring consistent connection reuse logic.
* [client] Rename `isCancellation` to `isContextDone` and extend handling for `DeadlineExceeded`
Refactor error handling to include `DeadlineExceeded` scenarios alongside `Canceled`. Update related condition checks for consistency.
* [client] Add connection generation tracking to prevent stale reconnections
Introduce `connGen` to track connection generations and ensure that stale `recreateConnection` calls do not override newer connections. Update stream establishment and reconnection logic to incorporate generation validation.
* [client] Add backoff reset condition to prevent short-lived retry cycles
Refine backoff reset logic to ensure it only occurs for sufficiently long-lived stream connections, avoiding interference with `MaxElapsedTime`.
* [client] Introduce `minHealthyDuration` to refine backoff reset logic
Add `minHealthyDuration` constant to ensure stream retries only reset the backoff timer if the stream survives beyond a minimum duration. Prevents unhealthy, short-lived streams from interfering with `MaxElapsedTime`.
* [client] IPv6 friendly connection
parsedURL.Hostname() strips IPv6 brackets. For http://[::1]:443, this turns it into ::1:443, which is not a valid host:port target for gRPC. Additionally, fmt.Sprintf("%s:%s", hostname, port) produces a trailing colon when the URL has no explicit port—http://example.com becomes example.com:. Both cases break the initial dial and reconnect paths. Use parsedURL.Host directly instead.
* [client] Add `handlerStarted` channel to synchronize stream establishment in tests
Introduce `handlerStarted` channel in the test server to signal when the server-side handler begins, ensuring robust synchronization between client and server during stream establishment. Update relevant test cases to wait for this signal before proceeding.
* [client] Replace `receivedAcks` map with atomic counter and improve stream establishment sync in tests
Refactor acknowledgment tracking in tests to use an `atomic.Int32` counter instead of a map. Replace fixed sleep with robust synchronization by waiting on `handlerStarted` signal for stream establishment.
* [client] Extract `handleReceiveError` to simplify receive logic
Refactor error handling in `receive` to a dedicated `handleReceiveError` method. Streamlines the main logic and isolates error recovery, including backoff reset and connection recreation.
* [client] recreate gRPC ClientConn on every retry to prevent dual backoff
The flow client had two competing retry loops: our custom exponential
backoff and gRPC's internal subchannel reconnection. When establishStream
failed, the same ClientConn was reused, allowing gRPC's internal backoff
state to accumulate and control dial timing independently.
Changes:
- Consolidate error handling into handleRetryableError, which now
handles context cancellation, permanent errors, backoff reset,
and connection recreation in a single path
- Call recreateConnection on every retryable error so each retry
gets a fresh ClientConn with no internal backoff state
- Remove connGen tracking since Receive is sequential and protected
by a new receiving guard against concurrent calls
- Reduce RandomizationFactor from 1 to 0.5 to avoid near-zero
backoff intervals
extraInitialRoutes() was meant to preserve only the fake IP route
(240.0.0.0/8) across TUN rebuilds, but it re-injected any initial
route missing from the current set. When the management server
advertised exit node routes (0.0.0.0/0) that were later filtered
by the route selector, extraInitialRoutes() re-added them, causing
the Android VPN to capture all traffic with no peer to handle it.
Store the fake IP route explicitly and append only that in notify(),
removing the overly broad initial route diffing.
* Add network map benchmark and correctness test files
* Add tests for network map components correctness and edge cases
* Skip benchmarks in CI and enhance network map test coverage with new helper functions
* Remove legacy network map benchmarks and tests; refactor components-based test coverage for clarity and scalability.