Commit Graph

5 Commits

Author SHA1 Message Date
Zoltán Papp
77ec25796e client/dns/mgmt: bypass overlay for control-plane FQDN resolution
When an exit-node peer's network-map installs a 0.0.0.0/0 default route
on the overlay interface before that peer's WireGuard key material is
active, any UDP socket dialing an off-link address is routed into wt0
and the kernel returns ENOKEY.

Two places needed fixing:

 1. The mgmt cache refresh path. It reactively refreshes the
    control-plane FQDNs advertised by mgmt (api/signal/stun/turn/
    the Relay pool root) after the daemon has installed its own
    resolv.conf pointing at the overlay listener. Previously the
    refresh dial followed the chain's upstream handler, which followed
    the overlay default route and deadlocked on ENOKEY.

 2. Foreign relay FQDN resolution. When a remote peer is homed on a
    different relay instance than us, we need to resolve a streamline-*
    subdomain that is not in the cache. That lookup went through the
    same overlay-routed upstream and failed identically, deadlocking
    the exit-node test whenever the relay LB put the two peers on
    different instances.

Fix both by giving the mgmt cache a dedicated net.Resolver that dials
the original pre-NetBird system nameservers through nbnet.NewDialer.
The dialer marks the socket as control-plane (SO_MARK on Linux,
IP_BOUND_IF on darwin, IP_UNICAST_IF on Windows); the routemanager's
policy rules keep those sockets on the underlay regardless of the
overlay default.
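
A minimal sketch of such a bypass resolver, assuming a plain net.Dialer
stands in for nbnet.NewDialer (the socket marking itself is not
reproduced here). The name newBypassResolver and its parameters are
hypothetical. It only illustrates how a net.Resolver can be pinned to a
fixed nameserver list, and how it can stay nil when the host had no
original system resolver:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// newBypassResolver is a hypothetical helper. It returns a resolver that
// ignores the nameserver address the Go resolver read from resolv.conf and
// dials the supplied original nameservers instead. It returns nil when no
// original nameservers are known, so callers can keep serving from cache.
func newBypassResolver(nameservers []string, dialer *net.Dialer) *net.Resolver {
	if len(nameservers) == 0 {
		return nil
	}
	if dialer == nil {
		// Assumption: stand-in for a dialer that marks sockets as control-plane.
		dialer = &net.Dialer{Timeout: 5 * time.Second}
	}
	return &net.Resolver{
		// PreferGo makes the pure-Go resolver honor the custom Dial function.
		PreferGo: true,
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			var lastErr error
			for _, ns := range nameservers {
				conn, err := dialer.DialContext(ctx, network, net.JoinHostPort(ns, "53"))
				if err == nil {
					return conn, nil
				}
				lastErr = err
			}
			return nil, lastErr
		},
	}
}

func main() {
	r := newBypassResolver([]string{"192.0.2.53"}, nil)
	fmt.Println("resolver configured:", r != nil)
	fmt.Println("nil without nameservers:", newBypassResolver(nil, nil) == nil)
}
```

Keeping the dialer as a parameter means the platform-specific marking
(SO_MARK, IP_BOUND_IF, IP_UNICAST_IF) can live entirely in the dialer.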

Pool-root domains (the Relay entries in ServerDomains) now register
through a subdomain-matching wrapper so that instance subdomains like
streamline-de-fra1-0.relay.netbird.io also hit the mgmt cache handler.
On cache miss under a pool root, ServeDNS resolves the FQDN on demand
through the bypass resolver, caches the result, and returns it.
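
The pool-root matching and on-demand fill can be sketched roughly as
follows. This is an illustration under assumptions, not the code from
this commit: mgmtCache, underPoolRoot, lookup and the injected resolve
function are hypothetical names, and the real handler works on DNS
messages in ServeDNS rather than on plain address slices:

```go
package main

import (
	"context"
	"fmt"
	"net/netip"
	"strings"
	"sync"
)

// resolveFunc stands in for a lookup through the bypass resolver.
type resolveFunc func(ctx context.Context, fqdn string) ([]netip.Addr, error)

// mgmtCache is a hypothetical, simplified cache keyed by normalized FQDN.
type mgmtCache struct {
	mu        sync.Mutex
	records   map[string][]netip.Addr
	poolRoots []string // e.g. derived from ServerDomains.Relay
	resolve   resolveFunc
}

// normalize lowercases the name and drops the trailing root dot.
func normalize(name string) string {
	return strings.ToLower(strings.TrimSuffix(name, "."))
}

// underPoolRoot reports whether fqdn equals a pool root or is a subdomain of one.
func (c *mgmtCache) underPoolRoot(fqdn string) bool {
	name := normalize(fqdn)
	for _, root := range c.poolRoots {
		r := normalize(root)
		if name == r || strings.HasSuffix(name, "."+r) {
			return true
		}
	}
	return false
}

// lookup serves from cache; on a miss under a pool root it resolves on
// demand, stores the result, and returns it.
func (c *mgmtCache) lookup(ctx context.Context, fqdn string) ([]netip.Addr, bool) {
	name := normalize(fqdn)
	c.mu.Lock()
	addrs, ok := c.records[name]
	c.mu.Unlock()
	if ok {
		return addrs, true
	}
	if c.resolve == nil || !c.underPoolRoot(name) {
		return nil, false
	}
	addrs, err := c.resolve(ctx, name)
	if err != nil || len(addrs) == 0 {
		return nil, false
	}
	c.mu.Lock()
	if c.records == nil {
		c.records = make(map[string][]netip.Addr)
	}
	c.records[name] = addrs
	c.mu.Unlock()
	return addrs, true
}

func main() {
	calls := 0
	c := &mgmtCache{
		poolRoots: []string{"relay.netbird.io"},
		resolve: func(_ context.Context, _ string) ([]netip.Addr, error) {
			calls++
			return []netip.Addr{netip.MustParseAddr("192.0.2.10")}, nil
		},
	}
	name := "streamline-de-fra1-0.relay.netbird.io."
	_, first := c.lookup(context.Background(), name)
	_, second := c.lookup(context.Background(), name)
	_, other := c.lookup(context.Background(), "example.com.")
	fmt.Println("first:", first, "second:", second, "other:", other, "resolver calls:", calls)
}
```

Matching on "." + root rather than on a bare suffix keeps look-alike
names such as evilrelay.netbird.io out of the pool.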

Pool-root membership is derived dynamically from mgmt-advertised
ServerDomains.Relay[] — no hardcoded domain lists, no protocol change.
No hardcoded fallback nameservers: if the host had no original system
resolver at all, the bypass resolver stays nil and the stale-while-
revalidate cache keeps serving. The general upstream forwarder and
the user DNS path are unchanged.
2026-04-24 17:40:33 +02:00
Viktor Liu
801de8c68d [client] Add TTL-based refresh to mgmt DNS cache via handler chain (#5945) 2026-04-22 15:10:14 +02:00
Zoltan Papp
d18747e846 [client] Exclude Flow domain from caching to prevent TLS failures (#5433)
* Exclude Flow domain from caching to prevent TLS failures due to stale records.

* Fix test
2026-02-24 16:48:38 +01:00
Maycon Santos
433bc4ead9 [client] lookup for management domains using an additional timeout (#4983)
In some cases, iOS and macOS may lock up when looking up management domains during network changes.

This change introduces an additional timeout on top of the context call.
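
One way such an extra timeout can be layered over a lookup that may not
honor its context is sketched below. This is an assumption about the
general technique, not the code from this change; lookupWithExtraTimeout
and demoStuckLookup are hypothetical names:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// lookupWithExtraTimeout runs lookup in a goroutine and returns as soon as
// either the lookup finishes or the timeout expires, even if the lookup
// itself ignores context cancellation.
func lookupWithExtraTimeout(
	ctx context.Context,
	timeout time.Duration,
	lookup func(context.Context) ([]string, error),
) ([]string, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	type result struct {
		addrs []string
		err   error
	}
	// Buffered so the goroutine can finish and exit after we stop waiting.
	done := make(chan result, 1)
	go func() {
		addrs, err := lookup(ctx)
		done <- result{addrs, err}
	}()

	select {
	case r := <-done:
		return r.addrs, r.err
	case <-ctx.Done():
		return nil, fmt.Errorf("lookup timed out or was cancelled: %w", ctx.Err())
	}
}

// demoStuckLookup simulates a resolver call that blocks and ignores ctx.
func demoStuckLookup() error {
	stuck := func(context.Context) ([]string, error) {
		time.Sleep(500 * time.Millisecond)
		return []string{"192.0.2.1"}, nil
	}
	_, err := lookupWithExtraTimeout(context.Background(), 50*time.Millisecond, stuck)
	return err
}

func main() {
	fmt.Println("stuck lookup error:", demoStuckLookup())
}
```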
2025-12-22 20:04:52 +01:00
Viktor Liu
d4c067f0af [client] Don't deactivate upstream resolvers on failure (#4128) 2025-08-29 17:40:05 +02:00