SynthesizePrivateServiceZones emits A records keyed on the proxy peer's
Status.Connected flag and tunnel IP, so the synth output changes every
time an embedded `netbird proxy` peer flips state. The trigger was
missing: MarkPeerConnected only called OnPeersUpdated when the peer was
LoginExpired, and MarkPeerDisconnected never called it at all. Result:
when a fresh proxy reconnects, user peers in the account hold their
stale netmap (or no synth at all) until some unrelated change pokes the
controller.
Fire OnPeersUpdated whenever an embedded proxy peer transitions
connected/disconnected. OnPeersUpdated routes through
bufferSendUpdateAccountPeers so consecutive flaps coalesce and don't
storm the controller.
AddPeer already calls OnPeersAdded for the new peer ID but that only
recomputes the proxy peer's own netmap — user peers still need this
new account-wide refresh to pick up the proxy peer's tunnel IP for
their private-service DNS records.
- status_test.go TestStatus_PeerStateByIP: replace
require := assert.New(t) shadowing pattern with req := require.New(t)
so setup assertions are fail-fast and the require package isn't shadowed.
Add TestStatus_PeerStateByIP_MatchesIPv6 for the IPv6-only path.
- status.go PeerStateByIP: match against both State.IP and State.IPv6 so
IPv6-only peers are found by the private-service tunnel lookup. Empty
input short-circuits before the loop and empty State.IP/State.IPv6
fields are treated as non-matches.
- proxy.go ValidateTunnelPeer: call enforceAccountScope(ctx, service.AccountID)
after the service lookup, mirroring ValidateSession. Without it, an
account-scoped (BYOP) proxy token could mint session JWTs for another
account's domain.
- sql_store.go getClusterCapability: thread the caller's context into the
GORM query via WithContext(ctx) so the lookup is cancellable and honours
request deadlines. (Pre-existing on origin/main; included here because
GetClusterSupportsPrivate added by this PR is now a caller.)
Skipped:
- proxyAcceptsMapping SupportsCustomPorts == true: the existing != nil
check is intentional. The accompanying test in this PR
(TestSendServiceUpdateToCluster_FiltersOnCapability) explicitly asserts
"new proxy with SupportsCustomPorts=false should still receive mapping"
— the non-nil check encodes "proxy is new enough to understand the
protocol", not "proxy can bind custom ports". Tightening to *bool==true
would break that design and the test.
- openapi.yml: declare default: false on ServiceTargetOptions.direct_upstream
so generated clients/validators reflect the documented default.
- proto/proxy_service.proto: ValidateTunnelPeer doc + denied_reason list
said "distribution_groups" (bearer-auth field) but the actual gate is
service.access_groups. Replaced both occurrences to match the code path
in checkPeerGroupAccess.
- peers/manager.go (GetPeerWithGroups) + users/manager.go (GetUserWithGroups):
on store error after a successful first lookup, both now return
(nil, nil, err) so callers can't get a valid entity alongside a non-nil
error.
Findings skipped with reasons:
- embedded.go merged CLI/Dashboard redirect URIs: pre-existing on
origin/main, not introduced by this PR.
- account_mock.go MarkPeerDisconnected zero-time UnixNano: same — pre-existing.
- openapi Service schema if/then conditionals: Go-side Validate() already
enforces these invariants (Private + non-empty AccessGroups, mode=http,
mutually-exclusive with bearer), and oapi-codegen on OpenAPI 3.1.x
doesn't honour allOf/if/then anyway.
- *.patch / *.diff / b-n-p.sh: untracked personal artifacts, not part of
any commit.
Cluster targets dial the upstream via the host network stack, so an
empty Host leaves the proxy with nothing to dial and DirectUpstream=false
would route the request through the embedded NetBird client (wrong
network for a cluster address). Validate() and validateTargetReferences
now reject both shapes.
Tests:
- TestValidate_HTTPClusterTarget / _RequiresTargetId /
TestValidate_Private_{AcceptsClusterTargetWithAccessGroups,
RequiresAccessGroups, RejectsBearerAuth} updated to populate Host and
DirectUpstream so they exercise the path past the new gates.
- TestValidate_HTTPClusterTarget_RequiresHost and _RequiresDirectUpstream
pin the two new error paths.
- TestValidateTargetReferences_ClusterTargetSkipsLookup updated to set
DirectUpstream on its fixture; new _ClusterTargetRequiresDirectUpstream
test covers the store-side rejection.
Drive-bys (no behavior change beyond what existing tests cover):
- proxy/proxy.go: shortened the Capabilities.Private / Cluster.Private
doc comments.
- users/manager.go: moved the GetUserWithGroups doc from the interface
to the impl.
- proxy/cmd/proxy/cmd/root.go: removed unused NewRootCmd.
- tunnel_cache.go: bumped tunnelCacheTTL from 30s to 300s (matches the
"5 minutes" target documented on the constant; existing TTL-expiry
test uses the constant directly so the bump is picked up automatically).
Adds a new "private" service mode for the reverse proxy: services
reachable exclusively over the embedded WireGuard tunnel, gated by
per-peer group membership instead of operator auth schemes.
Wire contract
- ProxyMapping.private (field 13): the proxy MUST call
ValidateTunnelPeer and fail closed; operator schemes are bypassed.
- ProxyCapabilities.private (4) + supports_private_service (5):
capability gate. Management never streams private mappings to
proxies that don't claim the capability; the broadcast path applies
the same filter via filterMappingsForProxy.
- ValidateTunnelPeer RPC: resolves an inbound tunnel IP to a peer,
checks the peer's groups against service.AccessGroups, and mints
a session JWT on success. checkPeerGroupAccess fails closed when
a private service has empty AccessGroups.
- ValidateSession/ValidateTunnelPeer responses now carry
peer_group_ids + peer_group_names so the proxy can authorise
policy-aware middlewares without an extra management round-trip.
- ProxyInboundListener + SendStatusUpdate.inbound_listener: per-account
inbound listener state surfaced to dashboards.
- PathTargetOptions.direct_upstream (11): bypass the embedded NetBird
client and dial the target via the proxy host's network stack for
upstreams reachable without WireGuard.
Data model
- Service.Private (bool) + Service.AccessGroups ([]string, JSON-
serialised). Validate() rejects bearer auth on private services.
Copy() deep-copies AccessGroups. pgx getServices loads the columns.
- DomainConfig.Private threaded into the proxy auth middleware.
Request handler routes private services through forwardWithTunnelPeer
and returns 403 on validation failure.
- Account-level SynthesizePrivateServiceZones (synthetic DNS) and
injectPrivateServicePolicies (synthetic ACL) gate on
len(svc.AccessGroups) > 0.
Proxy
- /netbird proxy --private (embedded mode) flag; Config.Private in
proxy/lifecycle.go.
- Per-account inbound listener (proxy/inbound.go) binding HTTP/HTTPS
on the embedded NetBird client's WireGuard tunnel netstack.
- proxy/internal/auth/tunnel_cache: ValidateTunnelPeer response cache
with single-flight de-duplication and per-account eviction.
- Local peerstore short-circuit: when the inbound IP isn't in the
account roster, deny fast without an RPC.
- proxy/server.go reports SupportsPrivateService=true and redacts the
full ProxyMapping JSON from info logs (auth_token + header-auth
hashed values now only at debug level).
Identity forwarding
- ValidateSessionJWT returns user_id, email, method, groups,
group_names. sessionkey.Claims carries Email + Groups + GroupNames
so the proxy can stamp identity onto upstream requests without an
extra management round-trip on every cookie-bearing request.
- CapturedData carries userEmail / userGroups / userGroupNames; the
proxy stamps X-NetBird-User and X-NetBird-Groups on r.Out from the
authenticated identity (strips client-supplied values first to
prevent spoofing).
- AccessLog.UserGroups: access-log enrichment captures the user's
group memberships at write time so the dashboard can render group
context without reverse-resolving stale memberships.
OpenAPI/dashboard surface
- ReverseProxyService gains private + access_groups; ReverseProxyCluster
gains private + supports_private. ReverseProxyTarget target_type
enum gains "cluster". ServiceTargetOptions gains direct_upstream.
ProxyAccessLog gains user_groups.
The cluster listing now answers three questions in one round-trip
instead of forcing the dashboard to cross-reference the domains API:
which clusters can this account see, are they currently up, and what
do they support. The ProxyCluster wire type drops the boolean
self_hosted in favour of a `type` enum (`account` / `shared`) plus
explicit `online`, `supports_custom_ports`, `require_subdomain`, and
`supports_crowdsec` fields.
Store query reworked so offline clusters still appear (no last_seen
WHERE), with online and connected_proxies both derived from the
existing 2-min active window via portable CASE expressions; the
1-hour heartbeat reaper still removes long-stale rows. Service
manager enriches each cluster with the capability flags via the
existing per-cluster lookups (CapabilityProvider now also exposes
ClusterSupportsCrowdSec).
GetActiveClusterAddresses* keep their tight 2-min filter so service
routing and domain enumeration aren't pulled into the wider window.
The hard cut removes self_hosted from the response — the dashboard is
the only consumer and is updated in the matching PR; no transitional
field is shipped.
Adds a cross-engine regression test asserting offline clusters
surface, connected_proxies counts only fresh proxies, and
account-scoped BYOP clusters never leak across accounts.
* [management] Ensure SessionStartedAt has a default value
Avoid null values for the new column
* [management] Add PeerStatus with LastSeen in peer_test
* [management] Add migration for PeerStatusSessionStartedAt default value
* [management] Add PeerStatus with LastSeen in migration tests
* [management] Add metrics for peer status updates and ephemeral cleanup
The session-fenced MarkPeerConnected / MarkPeerDisconnected path and
the ephemeral peer cleanup loop both run silently today: when fencing
rejects a stale stream, when a cleanup tick deletes peers, or when a
batch delete fails, we have no operational signal beyond log lines.
Add OpenTelemetry counters and a histogram so the same SLO-style
dashboards that already exist for the network-map controller can cover
peer connect/disconnect and ephemeral cleanup too.
All new attributes are bounded enums: operation in {connect,disconnect}
and outcome in {applied,stale,error,peer_not_found}. No account, peer,
or user ID is ever written as a metric label — total cardinality is
fixed at compile time (8 counter series, 2 histogram series, 4 unlabeled
ephemeral series).
Metric methods are nil-receiver safe so test composition that doesn't
wire telemetry (the bulk of the existing tests) works unchanged. The
ephemeral manager exposes a SetMetrics setter rather than taking the
collector through its constructor, keeping the constructor signature
stable across all test call sites.
* [management] Add OpenTelemetry metrics for ephemeral peer cleanup
Introduce counters for tracking ephemeral peer cleanup, including peers pending deletion, cleanup runs, successful deletions, and failed batches. Metrics are nil-receiver safe to ensure compatibility with test setups without telemetry.
* [management] Fence peer status updates with a session token
The connect/disconnect path used a best-effort LastSeen-after-streamStart
comparison to decide whether a status update should land. Under contention
— a re-sync arriving while the previous stream's disconnect was still in
flight, or two management replicas seeing the same peer at once — the
check was a read-then-decide-then-write window: any UPDATE in between
caused the wrong row to be written. The Go-side time.Now() that fed the
comparison also drifted under lock contention, since it was captured
seconds before the write actually committed.
Replace it with an integer-nanosecond fencing token stored alongside the
status. Every gRPC sync stream uses its open time (UnixNano) as its token.
Connects only land when the incoming token is strictly greater than the
stored one; disconnects only land when the incoming token equals the
stored one (i.e. we're the stream that owns the current session). Both
are single optimistic-locked UPDATEs — no read-then-write, no transaction
wrapper.
LastSeen is now written by the database itself (CURRENT_TIMESTAMP). The
caller never supplies it, so the value always reflects the real moment
of the UPDATE rather than the moment the caller queued the work — which
was already off by minutes under heavy lock contention.
Side effects (geo lookup, peer-login-expiration scheduling, network-map
fan-out) are explicitly documented as running after the fence UPDATE
commits, never inside it. Geo also skips the update when realIP equals
the stored ConnectionIP, dropping a redundant SavePeerLocation call on
same-IP reconnects.
Tests cover the three semantic cases (matched disconnect lands, stale
disconnect dropped, stale connect dropped) plus a 16-goroutine race test
that asserts the highest token always wins.
* [management] Add SessionStartedAt to peer status updates
Stored `SessionStartedAt` for fencing token propagation across goroutines and updated database queries/functions to handle the new field. Removed outdated geolocation handling logic and adjusted tests for concurrency safety.
* Rename `peer_status_required_approval` to `peer_status_requires_approval` in SQL store fields
This change enables admins to configure posture checks for connecting public IPs of their peers.
It changes the behavior of the check as well and now the evaluation is if the received network is part of the configured network.
* enable pat creation on setup
* remove logic from handler towards setup service
* fix lint issue
* fix rollback on account id returning empty
* fix coderabbit comments
* fix setup PAT rollback behavior
peerShouldReceiveUpdate waited 500ms for the expected update message,
and every outer wrapper across the management/server test suite paired
it with a 1s goroutine-drain timeout. Both were too tight for slower
CI runners (MySQL, FreeBSD, loaded sqlite), producing intermittent
"Timed out waiting for update message" failures in tests like
TestDNSAccountPeersUpdate, TestPeerAccountPeersUpdate, and
TestNameServerAccountPeersUpdate.
Introduce peerUpdateTimeout (5s) next to the helper and use it both in
the helper and in every outer wrapper so the two timeouts stay in sync.
Only runs down on failure; passing tests return as soon as the channel
delivers, so there is no slowdown on green runs.
* Add support for legacy IDP cache environment variable
* Centralize cache store creation to reuse a single Redis connection pool
Each cache consumer (IDP cache, token store, PKCE store, secrets manager,
EDR validator) was independently calling NewStore, creating separate Redis
clients with their own connection pools — up to 1400 potential connections
from a single management server process.
Introduce a shared CacheStore() singleton on BaseServer that creates one
store at boot and injects it into all consumers. Consumer constructors now
receive a store.StoreInterface instead of creating their own.
For Redis mode, all consumers share one connection pool (1000 max conns).
For in-memory mode, all consumers share one GoCache instance.
* Update management-integrations module to latest version
* sync go.sum
* Export `GetAddrFromEnv` to allow reuse across packages
* Update management-integrations module version in go.mod and go.sum
* Update management-integrations module version in go.mod and go.sum
* Add network map benchmark and correctness test files
* Add tests for network map components correctness and edge cases
* Skip benchmarks in CI and enhance network map test coverage with new helper functions
* Remove legacy network map benchmarks and tests; refactor components-based test coverage for clarity and scalability.