When an exit-node peer's network-map installs a 0.0.0.0/0 default route
on the overlay interface before that peer's WireGuard key material is
active, any UDP socket dialing an off-link address is routed into wt0
and the kernel returns ENOKEY.
Two places needed fixing:
1. The mgmt cache refresh path. It reactively refreshes the
control-plane FQDNs advertised by mgmt (api, signal, stun, turn, and
the Relay pool root) after the daemon has installed its own
resolv.conf pointing at the overlay listener. Previously the
refresh dial went through the chain's upstream handler, which followed
the overlay default route and deadlocked on ENOKEY.
2. Foreign relay FQDN resolution. When a remote peer is homed on a
different relay instance than us, we need to resolve a streamline-*
subdomain that is not in the cache. That lookup went through the
same overlay-routed upstream and failed identically, deadlocking
the exit-node test whenever the relay LB put the two peers on
different instances.
Fix both by giving the mgmt cache a dedicated net.Resolver that dials
the original pre-NetBird system nameservers through nbnet.NewDialer.
The dialer marks the socket as control-plane (SO_MARK on Linux,
IP_BOUND_IF on darwin, IP_UNICAST_IF on Windows); the routemanager's
policy rules keep those sockets on the underlay regardless of the
overlay default.
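A minimal sketch of the bypass resolver idea, not the actual implementation. The helper name newBypassResolver and the dialFunc type are hypothetical; in the real change the dial function would come from nbnet.NewDialer so the socket carries the control-plane mark, whereas the sketch uses a plain net.Dialer to stay self-contained. Port 53 and the example nameserver address are assumptions as well.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// dialFunc has the same shape as (*net.Dialer).DialContext. It is a
// hypothetical stand-in for the marked dialer from nbnet.NewDialer.
type dialFunc func(ctx context.Context, network, address string) (net.Conn, error)

// newBypassResolver (hypothetical name) returns a resolver that ignores the
// server address Go picked from resolv.conf and dials the saved pre-overlay
// nameservers instead. It returns nil when the host had no original
// resolver, so callers keep serving from the cache.
func newBypassResolver(nameservers []string, dial dialFunc) *net.Resolver {
	if len(nameservers) == 0 {
		return nil
	}
	return &net.Resolver{
		PreferGo: true, // use Go's built-in resolver, which honours Dial
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			var lastErr error
			// Try each original nameserver in order until one accepts the dial.
			for _, ns := range nameservers {
				conn, err := dial(ctx, network, net.JoinHostPort(ns, "53"))
				if err == nil {
					return conn, nil
				}
				lastErr = err
			}
			return nil, lastErr
		},
	}
}

func main() {
	// A plain dialer without any socket mark, for illustration only.
	d := &net.Dialer{Timeout: 2 * time.Second}

	r := newBypassResolver([]string{"192.0.2.53"}, d.DialContext)
	fmt.Println("resolver built:", r != nil)
	fmt.Println("nil without nameservers:", newBypassResolver(nil, d.DialContext) == nil)
}
```

Because the resolver's Dial hook discards the address argument, the overlay listener written into resolv.conf never enters the picture for these lookups.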
Pool-root domains (the Relay entries in ServerDomains) now register
through a subdomain-matching wrapper so that instance subdomains like
streamline-de-fra1-0.relay.netbird.io also hit the mgmt cache handler.
On cache miss under a pool root, ServeDNS resolves the FQDN on demand
through the bypass resolver, caches the result, and returns it.
Pool-root membership is derived dynamically from mgmt-advertised
ServerDomains.Relay[] — no hardcoded domain lists, no protocol change.
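The pool-root matching and on-demand fill can be sketched roughly as follows. All names here (normalize, underPoolRoot, cache, lookupFunc) are hypothetical, the lookup function stands in for the bypass resolver, and the real handler works on DNS messages in ServeDNS rather than on plain strings. The sketch is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize lowercases a name and strips the trailing root dot.
func normalize(name string) string {
	return strings.TrimSuffix(strings.ToLower(name), ".")
}

// underPoolRoot reports whether fqdn equals a pool root or is a subdomain of
// one. The match is on label boundaries, so "evilrelay.netbird.io" does not
// match the root "relay.netbird.io".
func underPoolRoot(fqdn string, roots []string) bool {
	name := normalize(fqdn)
	for _, root := range roots {
		r := normalize(root)
		if r == "" {
			continue
		}
		if name == r || strings.HasSuffix(name, "."+r) {
			return true
		}
	}
	return false
}

// lookupFunc stands in for a lookup through the bypass resolver.
type lookupFunc func(fqdn string) ([]string, error)

// cache is a simplified stand-in for the mgmt cache.
type cache struct {
	roots   []string            // pool roots, as advertised by mgmt
	records map[string][]string // normalized FQDN -> addresses
	bypass  lookupFunc          // nil when no original resolver exists
}

// resolve returns cached addresses. On a miss under a pool root it resolves
// on demand through the bypass lookup and stores the result.
func (c *cache) resolve(fqdn string) ([]string, bool) {
	key := normalize(fqdn)
	if addrs, ok := c.records[key]; ok {
		return addrs, true
	}
	if c.bypass == nil || !underPoolRoot(key, c.roots) {
		return nil, false
	}
	addrs, err := c.bypass(key)
	if err != nil || len(addrs) == 0 {
		return nil, false
	}
	c.records[key] = addrs
	return addrs, true
}

func main() {
	c := &cache{
		roots:   []string{"relay.netbird.io"},
		records: map[string][]string{},
		bypass: func(fqdn string) ([]string, error) {
			return []string{"203.0.113.7"}, nil // documentation address
		},
	}
	fmt.Println(c.resolve("streamline-de-fra1-0.relay.netbird.io."))
	fmt.Println(c.resolve("example.com."))
}
```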
No hardcoded fallback nameservers: if the host had no original system
resolver at all, the bypass resolver stays nil and the stale-while-
revalidate cache keeps serving. The general upstream forwarder and
the user DNS path are unchanged.
In some cases iOS and macOS can get stuck while looking up management domains during network changes.
This change introduces an additional timeout on top of the context call.
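One way such a timeout could be layered over a context-aware lookup is sketched below. The mechanism (a goroutine plus select) and the names lookupWithTimeout and demo are assumptions for illustration; the text above only states that an extra timeout is added on top of the context call.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// lookupWithTimeout (hypothetical name) runs lookup in a goroutine and stops
// waiting once the timeout or the parent context expires, even if the lookup
// itself never observes the context.
func lookupWithTimeout(
	ctx context.Context,
	timeout time.Duration,
	lookup func(context.Context) ([]string, error),
) ([]string, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	type result struct {
		addrs []string
		err   error
	}
	// Buffered so a late lookup can deliver its result and exit without
	// blocking after the caller has already given up.
	done := make(chan result, 1)
	go func() {
		addrs, err := lookup(ctx)
		done <- result{addrs, err}
	}()

	select {
	case r := <-done:
		return r.addrs, r.err
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// demo runs a lookup that either returns at once or ignores its context and
// blocks until demo returns, imitating a wedged system resolver call.
func demo(stuck bool) ([]string, error) {
	release := make(chan struct{})
	defer close(release) // unblocks the stuck goroutine when demo returns

	lookup := func(context.Context) ([]string, error) {
		if stuck {
			<-release
		}
		return []string{"203.0.113.7"}, nil
	}
	return lookupWithTimeout(context.Background(), 50*time.Millisecond, lookup)
}

func main() {
	fmt.Println(demo(false))
	fmt.Println(demo(true))
}
```

The caller regains control after the timeout while the abandoned goroutine finishes on its own schedule, which bounds the wait but not the lifetime of the underlying call.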