Adds a new "private" service mode for the reverse proxy: services
reachable exclusively over the embedded WireGuard tunnel, gated by
per-peer group membership instead of operator auth schemes.
Wire contract
- ProxyMapping.private (field 13): the proxy MUST call
ValidateTunnelPeer and fail closed; operator schemes are bypassed.
- ProxyCapabilities.private (4) + supports_private_service (5):
capability gate. Management never streams private mappings to
proxies that don't claim the capability; the broadcast path applies
the same filter via filterMappingsForProxy.
- ValidateTunnelPeer RPC: resolves an inbound tunnel IP to a peer,
checks the peer's groups against service.AccessGroups, and mints
a session JWT on success. checkPeerGroupAccess fails closed when
a private service has empty AccessGroups.
- ValidateSession/ValidateTunnelPeer responses now carry
peer_group_ids + peer_group_names so the proxy can authorise
policy-aware middlewares without an extra management round-trip.
- ProxyInboundListener + SendStatusUpdate.inbound_listener: per-account
inbound listener state surfaced to dashboards.
- PathTargetOptions.direct_upstream (11): bypass the embedded NetBird
client and dial the target via the proxy host's network stack for
upstreams reachable without WireGuard.
Data model
- Service.Private (bool) + Service.AccessGroups ([]string, JSON-
serialised). Validate() rejects bearer auth on private services.
Copy() deep-copies AccessGroups. pgx getServices loads the columns.
- DomainConfig.Private threaded into the proxy auth middleware.
Request handler routes private services through forwardWithTunnelPeer
and returns 403 on validation failure.
- Account-level SynthesizePrivateServiceZones (synthetic DNS) and
injectPrivateServicePolicies (synthetic ACL) gate on
len(svc.AccessGroups) > 0.
Proxy
- /netbird proxy --private (embedded mode) flag; Config.Private in
proxy/lifecycle.go.
- Per-account inbound listener (proxy/inbound.go) binding HTTP/HTTPS
on the embedded NetBird client's WireGuard tunnel netstack.
- proxy/internal/auth/tunnel_cache: ValidateTunnelPeer response cache
with single-flight de-duplication and per-account eviction.
- Local peerstore short-circuit: when the inbound IP isn't in the
account roster, deny fast without an RPC.
- proxy/server.go reports SupportsPrivateService=true and redacts the
full ProxyMapping JSON from info logs (auth_token + header-auth
hashed values now only at debug level).
Identity forwarding
- ValidateSessionJWT returns user_id, email, method, groups,
group_names. sessionkey.Claims carries Email + Groups + GroupNames
so the proxy can stamp identity onto upstream requests without an
extra management round-trip on every cookie-bearing request.
- CapturedData carries userEmail / userGroups / userGroupNames; the
proxy stamps X-NetBird-User and X-NetBird-Groups on r.Out from the
authenticated identity (strips client-supplied values first to
prevent spoofing).
- AccessLog.UserGroups: access-log enrichment captures the user's
group memberships at write time so the dashboard can render group
context without reverse-resolving stale memberships.
OpenAPI/dashboard surface
- ReverseProxyService gains private + access_groups; ReverseProxyCluster
gains private + supports_private. ReverseProxyTarget target_type
enum gains "cluster". ServiceTargetOptions gains direct_upstream.
ProxyAccessLog gains user_groups.
The cluster listing now answers three questions in one round-trip
instead of forcing the dashboard to cross-reference the domains API:
which clusters can this account see, are they currently up, and what
do they support. The ProxyCluster wire type drops the boolean
self_hosted in favour of a `type` enum (`account` / `shared`) plus
explicit `online`, `supports_custom_ports`, `require_subdomain`, and
`supports_crowdsec` fields.
Store query reworked so offline clusters still appear (no last_seen
WHERE), with online and connected_proxies both derived from the
existing 2-min active window via portable CASE expressions; the
1-hour heartbeat reaper still removes long-stale rows. Service
manager enriches each cluster with the capability flags via the
existing per-cluster lookups (CapabilityProvider now also exposes
ClusterSupportsCrowdSec).
GetActiveClusterAddresses* keep their tight 2-min filter so service
routing and domain enumeration aren't pulled into the wider window.
The hard cut removes self_hosted from the response — the dashboard is
the only consumer and is updated in the matching PR; no transitional
field is shipped.
Adds a cross-engine regression test asserting offline clusters
surface, connected_proxies counts only fresh proxies, and
account-scoped BYOP clusters never leak across accounts.
* [management] Ensure SessionStartedAt has a default value
Avoid null values for the new column
* [management] Add PeerStatus with LastSeen in peer_test
* [management] Add migration for PeerStatusSessionStartedAt default value
* [management] Add PeerStatus with LastSeen in migration tests
* [management] Add metrics for peer status updates and ephemeral cleanup
The session-fenced MarkPeerConnected / MarkPeerDisconnected path and
the ephemeral peer cleanup loop both run silently today: when fencing
rejects a stale stream, when a cleanup tick deletes peers, or when a
batch delete fails, we have no operational signal beyond log lines.
Add OpenTelemetry counters and a histogram so the same SLO-style
dashboards that already exist for the network-map controller can cover
peer connect/disconnect and ephemeral cleanup too.
All new attributes are bounded enums: operation in {connect,disconnect}
and outcome in {applied,stale,error,peer_not_found}. No account, peer,
or user ID is ever written as a metric label — total cardinality is
fixed at compile time (8 counter series, 2 histogram series, 4 unlabeled
ephemeral series).
Metric methods are nil-receiver safe so test composition that doesn't
wire telemetry (the bulk of the existing tests) works unchanged. The
ephemeral manager exposes a SetMetrics setter rather than taking the
collector through its constructor, keeping the constructor signature
stable across all test call sites.
* [management] Add OpenTelemetry metrics for ephemeral peer cleanup
Introduce counters for tracking ephemeral peer cleanup, including peers pending deletion, cleanup runs, successful deletions, and failed batches. Metrics are nil-receiver safe to ensure compatibility with test setups without telemetry.
* [management] Fence peer status updates with a session token
The connect/disconnect path used a best-effort LastSeen-after-streamStart
comparison to decide whether a status update should land. Under contention
— a re-sync arriving while the previous stream's disconnect was still in
flight, or two management replicas seeing the same peer at once — the
check was a read-then-decide-then-write window: any UPDATE in between
caused the wrong row to be written. The Go-side time.Now() that fed the
comparison also drifted under lock contention, since it was captured
seconds before the write actually committed.
Replace it with an integer-nanosecond fencing token stored alongside the
status. Every gRPC sync stream uses its open time (UnixNano) as its token.
Connects only land when the incoming token is strictly greater than the
stored one; disconnects only land when the incoming token equals the
stored one (i.e. we're the stream that owns the current session). Both
are single optimistic-locked UPDATEs — no read-then-write, no transaction
wrapper.
LastSeen is now written by the database itself (CURRENT_TIMESTAMP). The
caller never supplies it, so the value always reflects the real moment
of the UPDATE rather than the moment the caller queued the work — which
was already off by minutes under heavy lock contention.
Side effects (geo lookup, peer-login-expiration scheduling, network-map
fan-out) are explicitly documented as running after the fence UPDATE
commits, never inside it. Geo also skips the update when realIP equals
the stored ConnectionIP, dropping a redundant SavePeerLocation call on
same-IP reconnects.
Tests cover the three semantic cases (matched disconnect lands, stale
disconnect dropped, stale connect dropped) plus a 16-goroutine race test
that asserts the highest token always wins.
* [management] Add SessionStartedAt to peer status updates
Stored `SessionStartedAt` for fencing token propagation across goroutines and updated database queries/functions to handle the new field. Removed outdated geolocation handling logic and adjusted tests for concurrency safety.
* Rename `peer_status_required_approval` to `peer_status_requires_approval` in SQL store fields
This change enables admins to configure posture checks for connecting public IPs of their peers.
It changes the behavior of the check as well and now the evaluation is if the received network is part of the configured network.
* enable pat creation on setup
* remove logic from handler towards setup service
* fix lint issue
* fix rollback on account id returning empty
* fix coderabbit comments
* fix setup PAT rollback behavior
peerShouldReceiveUpdate waited 500ms for the expected update message,
and every outer wrapper across the management/server test suite paired
it with a 1s goroutine-drain timeout. Both were too tight for slower
CI runners (MySQL, FreeBSD, loaded sqlite), producing intermittent
"Timed out waiting for update message" failures in tests like
TestDNSAccountPeersUpdate, TestPeerAccountPeersUpdate, and
TestNameServerAccountPeersUpdate.
Introduce peerUpdateTimeout (5s) next to the helper and use it both in
the helper and in every outer wrapper so the two timeouts stay in sync.
Only runs down on failure; passing tests return as soon as the channel
delivers, so there is no slowdown on green runs.
* Add support for legacy IDP cache environment variable
* Centralize cache store creation to reuse a single Redis connection pool
Each cache consumer (IDP cache, token store, PKCE store, secrets manager,
EDR validator) was independently calling NewStore, creating separate Redis
clients with their own connection pools — up to 1400 potential connections
from a single management server process.
Introduce a shared CacheStore() singleton on BaseServer that creates one
store at boot and injects it into all consumers. Consumer constructors now
receive a store.StoreInterface instead of creating their own.
For Redis mode, all consumers share one connection pool (1000 max conns).
For in-memory mode, all consumers share one GoCache instance.
* Update management-integrations module to latest version
* sync go.sum
* Export `GetAddrFromEnv` to allow reuse across packages
* Update management-integrations module version in go.mod and go.sum
* Update management-integrations module version in go.mod and go.sum
* Add network map benchmark and correctness test files
* Add tests for network map components correctness and edge cases
* Skip benchmarks in CI and enhance network map test coverage with new helper functions
* Remove legacy network map benchmarks and tests; refactor components-based test coverage for clarity and scalability.