SynthesizePrivateServiceZones emits A records keyed on the proxy peer's
Status.Connected flag and tunnel IP, so the synth output changes every
time an embedded `netbird proxy` peer flips state. The trigger was
missing: MarkPeerConnected only called OnPeersUpdated when the peer was
LoginExpired, and MarkPeerDisconnected never called it at all. Result:
when a fresh proxy reconnects, user peers in the account hold their
stale netmap (or no synth at all) until some unrelated change pokes the
controller.
Fire OnPeersUpdated whenever an embedded proxy peer transitions
connected/disconnected. OnPeersUpdated routes through
bufferSendUpdateAccountPeers so consecutive flaps coalesce and don't
storm the controller.
AddPeer already calls OnPeersAdded for the new peer ID but that only
recomputes the proxy peer's own netmap — user peers still need this
new account-wide refresh to pick up the proxy peer's tunnel IP for
their private-service DNS records.
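The coalescing behaviour described above can be pictured with a one-slot buffer. This is only a generic sketch: updateCoalescer, Trigger and Drain are hypothetical names, and the real bufferSendUpdateAccountPeers mechanism is not shown in this message and may work differently.

```go
package main

import "fmt"

// updateCoalescer is a hypothetical stand-in for the buffering that sits
// behind OnPeersUpdated. It is not the real implementation.
type updateCoalescer struct {
	pending chan struct{} // capacity 1: at most one refresh is ever queued
}

func newUpdateCoalescer() *updateCoalescer {
	return &updateCoalescer{pending: make(chan struct{}, 1)}
}

// Trigger requests an account-wide refresh. It reports true when a new
// refresh was queued and false when one was already pending.
func (c *updateCoalescer) Trigger() bool {
	select {
	case c.pending <- struct{}{}:
		return true
	default:
		return false
	}
}

// Drain consumes the pending refresh, if any, and reports whether one existed.
func (c *updateCoalescer) Drain() bool {
	select {
	case <-c.pending:
		return true
	default:
		return false
	}
}

func main() {
	c := newUpdateCoalescer()
	queued := 0
	// A proxy peer flaps three times before the sender gets to run.
	for i := 0; i < 3; i++ {
		if c.Trigger() {
			queued++
		}
	}
	fmt.Println("refreshes queued:", queued) // refreshes queued: 1
	fmt.Println("drained:", c.Drain(), c.Drain())
}
```

Three flaps produce a single queued refresh, which is the property that keeps a flapping proxy from storming the controller.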
When a proxy connects with capabilities, the registerProxyConnection
path constructs a *proxy.Capabilities literal but only copies three of
the four fields (SupportsCustomPorts, RequireSubdomain, SupportsCrowdsec);
Private is silently dropped. Effect: the DB's proxies.private column
stays NULL for every connection, GetClusterSupportsPrivate finds no
proxy that reported the capability and returns nil, and the cluster API
response strips "private" via openapi omitempty. The dashboard then
can't tell a private cluster apart from a centralised one.
Add the missing field to the struct literal. No schema change — the
Capabilities.Private column already exists (gorm:"embedded"); only
the write path was broken.
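A minimal sketch of the corrected literal, assuming the four capability fields are *bool so that "not reported" (nil) stays distinguishable from false. The reported type and toCapabilities helper are hypothetical stand-ins for the incoming proxy message and the registerProxyConnection code.

```go
package main

import "fmt"

// Capabilities mirrors the four fields named above. The *bool types are an
// assumption for this sketch.
type Capabilities struct {
	SupportsCustomPorts *bool
	RequireSubdomain    *bool
	SupportsCrowdsec    *bool
	Private             *bool
}

// reported is a hypothetical stand-in for what the connecting proxy sends.
type reported struct {
	SupportsCustomPorts *bool
	RequireSubdomain    *bool
	SupportsCrowdsec    *bool
	Private             *bool
}

// toCapabilities copies every field, including Private, which the broken
// write path left out.
func toCapabilities(in reported) *Capabilities {
	return &Capabilities{
		SupportsCustomPorts: in.SupportsCustomPorts,
		RequireSubdomain:    in.RequireSubdomain,
		SupportsCrowdsec:    in.SupportsCrowdsec,
		Private:             in.Private,
	}
}

func boolPtr(b bool) *bool { return &b }

func main() {
	caps := toCapabilities(reported{Private: boolPtr(true)})
	fmt.Println(caps.Private != nil && *caps.Private) // true
}
```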
After a redeploy and a proxy reconnect, /api/reverse-proxies/clusters
returns "private": true for clusters where an embedded `netbird proxy`
is connected.
The Cluster struct carried a Private *bool field and the proxy-manager
forwarder ClusterSupportsPrivate already existed, but the service manager's
CapabilityProvider interface didn't declare it and GetClusters never
called it. Result: clusters[i].Private stayed nil and the openapi
omitempty stripped the field from the JSON response, hiding the
private-cluster signal from the dashboard.
- CapabilityProvider gains ClusterSupportsPrivate.
- GetClusters populates clusters[i].Private alongside the other capability
flags so the dashboard's clusters page can render the private indicator.
The concrete CapabilityProvider impl (proxy.Manager) already provides
the forwarder, and proxy.MockManager (used by existing tests) is
regenerated with the new method already present.
- status_test.go TestStatus_PeerStateByIP: replace the
`require := assert.New(t)` shadowing pattern with `req := require.New(t)`
so setup assertions are fail-fast and the require package isn't shadowed.
Add TestStatus_PeerStateByIP_MatchesIPv6 for the IPv6-only path.
- status.go PeerStateByIP: match against both State.IP and State.IPv6 so
IPv6-only peers are found by the private-service tunnel lookup. Empty
input short-circuits before the loop and empty State.IP/State.IPv6
fields are treated as non-matches.
- proxy.go ValidateTunnelPeer: call enforceAccountScope(ctx, service.AccountID)
after the service lookup, mirroring ValidateSession. Without it, an
account-scoped (BYOP) proxy token could mint session JWTs for another
account's domain.
- sql_store.go getClusterCapability: thread the caller's context into the
GORM query via WithContext(ctx) so the lookup is cancellable and honours
request deadlines. (Pre-existing on origin/main; included here because
GetClusterSupportsPrivate added by this PR is now a caller.)
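The PeerStateByIP matching rule from the list above could look roughly like this. The State type here is a trimmed, hypothetical version of the real status struct, and the lowercase function name marks it as an illustration rather than the actual method.

```go
package main

import "fmt"

// State is a hypothetical, trimmed-down peer state with both address families.
type State struct {
	FQDN string
	IP   string
	IPv6 string
}

// peerStateByIP returns the first peer whose IPv4 or IPv6 tunnel address
// equals ip. Empty input returns early, and empty address fields never match.
func peerStateByIP(peers []State, ip string) (State, bool) {
	if ip == "" {
		return State{}, false
	}
	for _, p := range peers {
		if (p.IP != "" && p.IP == ip) || (p.IPv6 != "" && p.IPv6 == ip) {
			return p, true
		}
	}
	return State{}, false
}

func main() {
	peers := []State{
		{FQDN: "v4.example", IP: "100.64.0.1"},
		{FQDN: "v6only.example", IPv6: "fd00::1"},
	}
	p, ok := peerStateByIP(peers, "fd00::1")
	fmt.Println(p.FQDN, ok)
	_, ok = peerStateByIP(peers, "")
	fmt.Println(ok)
}
```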
Skipped:
- proxyAcceptsMapping SupportsCustomPorts == true: the existing != nil
check is intentional. The accompanying test in this PR
(TestSendServiceUpdateToCluster_FiltersOnCapability) explicitly asserts
"new proxy with SupportsCustomPorts=false should still receive mapping"
— the non-nil check encodes "proxy is new enough to understand the
protocol", not "proxy can bind custom ports". Tightening to *bool==true
would break that design and the test.
- openapi.yml: declare default: false on ServiceTargetOptions.direct_upstream
so generated clients/validators reflect the documented default.
- proto/proxy_service.proto: ValidateTunnelPeer doc + denied_reason list
said "distribution_groups" (bearer-auth field) but the actual gate is
service.access_groups. Replaced both occurrences to match the code path
in checkPeerGroupAccess.
- peers/manager.go (GetPeerWithGroups) + users/manager.go (GetUserWithGroups):
on store error after a successful first lookup, both now return
(nil, nil, err) so callers can't get a valid entity alongside a non-nil
error.
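For the skipped proxyAcceptsMapping finding above, the difference between the intentional non-nil check and the suggested tightening can be sketched as follows. Both function names and the single-field Capabilities type are hypothetical; only the nil-versus-true distinction is taken from the text.

```go
package main

import "fmt"

// Capabilities is reduced to the single field under discussion.
type Capabilities struct {
	SupportsCustomPorts *bool
}

// acceptsMapping encodes "the proxy is new enough to report the flag at all".
func acceptsMapping(c *Capabilities) bool {
	return c != nil && c.SupportsCustomPorts != nil
}

// acceptsMappingStrict is the rejected tightening: it would also drop new
// proxies that explicitly report false.
func acceptsMappingStrict(c *Capabilities) bool {
	return c != nil && c.SupportsCustomPorts != nil && *c.SupportsCustomPorts
}

func boolPtr(b bool) *bool { return &b }

func main() {
	newProxy := &Capabilities{SupportsCustomPorts: boolPtr(false)}
	oldProxy := &Capabilities{}
	fmt.Println(acceptsMapping(newProxy), acceptsMappingStrict(newProxy))
	fmt.Println(acceptsMapping(oldProxy), acceptsMappingStrict(oldProxy))
}
```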
Findings skipped with reasons:
- embedded.go merged CLI/Dashboard redirect URIs: pre-existing on
origin/main, not introduced by this PR.
- account_mock.go MarkPeerDisconnected zero-time UnixNano: same — pre-existing.
- openapi Service schema if/then conditionals: Go-side Validate() already
enforces these invariants (Private + non-empty AccessGroups, mode=http,
mutually-exclusive with bearer), and oapi-codegen on OpenAPI 3.1.x
doesn't honour allOf/if/then anyway.
- *.patch / *.diff / b-n-p.sh: untracked personal artifacts, not part of
any commit.
Cluster targets dial the upstream via the host network stack, so an
empty Host leaves the proxy with nothing to dial and DirectUpstream=false
would route the request through the embedded NetBird client (wrong
network for a cluster address). Validate() and validateTargetReferences
now reject both shapes.
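The two new gates can be illustrated with a small validator. This is a sketch under assumptions: the Target type, the "cluster" type string, and the error values are hypothetical and much simpler than the real Validate() and validateTargetReferences.

```go
package main

import (
	"errors"
	"fmt"
)

// Target is a hypothetical, minimal view of a service target.
type Target struct {
	TargetType     string
	Host           string
	DirectUpstream bool
}

var (
	errClusterHostRequired   = errors.New("cluster target requires a host")
	errClusterDirectRequired = errors.New("cluster target requires direct_upstream")
)

// validateClusterTarget rejects the two shapes described above and leaves
// every other target type alone.
func validateClusterTarget(t Target) error {
	if t.TargetType != "cluster" {
		return nil
	}
	if t.Host == "" {
		return errClusterHostRequired
	}
	if !t.DirectUpstream {
		return errClusterDirectRequired
	}
	return nil
}

func main() {
	fmt.Println(validateClusterTarget(Target{TargetType: "cluster", DirectUpstream: true}))
	fmt.Println(validateClusterTarget(Target{TargetType: "cluster", Host: "svc.internal"}))
	fmt.Println(validateClusterTarget(Target{TargetType: "cluster", Host: "svc.internal", DirectUpstream: true}))
}
```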
Tests:
- TestValidate_HTTPClusterTarget / _RequiresTargetId /
TestValidate_Private_{AcceptsClusterTargetWithAccessGroups,
RequiresAccessGroups, RejectsBearerAuth} updated to populate Host and
DirectUpstream so they exercise the path past the new gates.
- TestValidate_HTTPClusterTarget_RequiresHost and _RequiresDirectUpstream
pin the two new error paths.
- TestValidateTargetReferences_ClusterTargetSkipsLookup updated to set
DirectUpstream on its fixture; new _ClusterTargetRequiresDirectUpstream
test covers the store-side rejection.
Drive-bys (no behavior change beyond what existing tests cover):
- proxy/proxy.go: shortened the Capabilities.Private / Cluster.Private
doc comments.
- users/manager.go: moved the GetUserWithGroups doc from the interface
to the impl.
- proxy/cmd/proxy/cmd/root.go: removed unused NewRootCmd.
- tunnel_cache.go: bumped tunnelCacheTTL from 30s to 300s (matches the
"5 minutes" target documented on the constant; existing TTL-expiry
test uses the constant directly so the bump is picked up automatically).
The SyncMappings restore in 036e91cde kept the metric definitions
(RecordSnapshotSyncDuration, RecordSnapshotBatchDuration,
RecordAddPeerDuration) and the corresponding callbacks (OnAddPeer)
but lost their call sites — they shipped as dead code.
- proxy/server.go: introduce snapshotTracker (the type PR #6207 added
to share batch/sync timing between handleMappingStream and
handleSyncMappingsStream); both stream handlers now go through it.
- proxy/internal/roundtrip/netbird.go: add OnAddPeer struct field and
invoke it after createClientEntry with the per-call duration.
- proxy/server.go: wire s.netbird.OnAddPeer = s.meter.RecordAddPeerDuration
alongside the existing NetBird construction.
No new test coverage — PR #6207's bench tests already exercise the
batch/sync paths and continue to pass.
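The OnAddPeer wiring might be pictured as below. The callback signature, the addPeer wrapper, and the choice to report only successful calls are assumptions for this sketch; the real field lives in proxy/internal/roundtrip/netbird.go and may differ.

```go
package main

import (
	"fmt"
	"time"
)

// NetBird is a hypothetical, stripped-down holder for the callback.
type NetBird struct {
	// OnAddPeer is optional; it stays nil when metrics are not wired.
	OnAddPeer func(d time.Duration)
}

// addPeer times the supplied createClientEntry step and reports the
// duration through OnAddPeer when the step succeeds.
func (n *NetBird) addPeer(createClientEntry func() error) error {
	start := time.Now()
	if err := createClientEntry(); err != nil {
		return err
	}
	if n.OnAddPeer != nil {
		n.OnAddPeer(time.Since(start))
	}
	return nil
}

// countCalls runs one addPeer and returns how often the callback fired.
func countCalls(wired bool) int {
	calls := 0
	n := &NetBird{}
	if wired {
		n.OnAddPeer = func(time.Duration) { calls++ }
	}
	_ = n.addPeer(func() error { return nil })
	return calls
}

func main() {
	fmt.Println(countCalls(true), countCalls(false)) // 1 0
}
```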
The MultiTransport's job is per-request dispatch between the embedded
NetBird transport and the stdlib transport based on the direct_upstream
context flag — about 25 lines of code. The header/body debug logging
that was bundled in pulls in:
- io.ReadAll on every request body, even when log level is above debug.
Forces buffering of streaming POSTs (LLM completions, file uploads)
before they reach the upstream transport.
- A header redaction list and a body-snippet cap that duplicate concerns
already covered by netbird.go's per-roundtrip log.
netbird.go already emits method/host/url/account/duration/status/err at
debug level on every roundtrip; nothing in the private-service feature
needs the extra header+body dump.
- Drop logUpstreamRequest, formatHeaders, redactHeaderValue,
snapshotRequestBody, and the upstreamLogBodyMax constant.
- Drop the logger field and the trailing nil arg from NewMultiTransport;
proxy/server.go and the tests updated accordingly.
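With the logging gone, what remains is a small dispatching RoundTripper. The sketch below shows the idea only: the context key, WithDirectUpstream, the field names, and the stub transports are all hypothetical, and the real NewMultiTransport signature is not reproduced here.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

type directUpstreamKey struct{}

// WithDirectUpstream marks a request context as "dial via the host stack".
func WithDirectUpstream(ctx context.Context) context.Context {
	return context.WithValue(ctx, directUpstreamKey{}, true)
}

// MultiTransport picks a transport per request based on the context flag.
type MultiTransport struct {
	embedded http.RoundTripper // embedded NetBird client transport
	direct   http.RoundTripper // stdlib transport on the host network
}

func (m *MultiTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	if direct, _ := req.Context().Value(directUpstreamKey{}).(bool); direct {
		return m.direct.RoundTrip(req)
	}
	return m.embedded.RoundTrip(req)
}

type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// stub returns a transport that answers locally and names itself in X-Via.
func stub(name string) http.RoundTripper {
	return roundTripFunc(func(*http.Request) (*http.Response, error) {
		h := http.Header{}
		h.Set("X-Via", name)
		return &http.Response{StatusCode: http.StatusOK, Header: h, Body: http.NoBody}, nil
	})
}

// routeOf reports which stub transport served a request.
func routeOf(direct bool) string {
	mt := &MultiTransport{embedded: stub("embedded"), direct: stub("direct")}
	req, err := http.NewRequest(http.MethodGet, "http://upstream.invalid/", nil)
	if err != nil {
		return ""
	}
	if direct {
		req = req.WithContext(WithDirectUpstream(req.Context()))
	}
	resp, err := mt.RoundTrip(req)
	if err != nil {
		return ""
	}
	return resp.Header.Get("X-Via")
}

func main() {
	fmt.Println(routeOf(false), routeOf(true)) // embedded direct
}
```

Because RoundTrip never touches req.Body, streaming uploads pass through to the chosen transport unbuffered.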
UserGroups on AccessLogEntry was a server-side enrichment artefact: the
proto AccessLog message never carried it, so the only writer was
manager.enrichUserGroups at save time. Without that writer the field
stays nil forever and the dashboard's user_groups column is always
empty — better to remove the dead surface than ship an unused field.
The dashboard can still reverse-resolve groups from UserId when it
needs them, accepting the tradeoff that memberships are resolved at
display time rather than captured at write time.
- AccessLogEntry.UserGroups field removed (no GORM column either).
- ToAPIResponse stops emitting the user_groups key.
- openapi.yml user_groups field removed; types.gen.go regenerated.
- enrichUserGroups + its test removed.
The previous commit dropped enrichment_test.go and the
manager.enrichUserGroups helper that it covers. This commit re-adds both:
- SaveAccessLog now invokes enrichUserGroups before persisting so the
access log entry carries the user's AutoGroups when UserId is set
(the dashboard's Proxy Events table can then render group context
without reverse-resolving stale memberships).
- enrichUserGroups itself is best-effort: store lookup failures and
missing users are logged at debug and don't block the save.
- Switch header literals to the headerNetBirdUser / headerNetBirdGroups
constants so a future rename can't silently desync tests.
- Add GroupsOnlyWhenEmailEmpty: unattached tunnel peer (machine agent)
case — groups must still be stamped while X-NetBird-User stays unset.
- Add EmailOnlyWhenGroupsEmpty: symmetric case for users without
resolved group memberships.
- Add CapturedDataPresentButEmpty: client-supplied headers are stripped
even when CapturedData carries no identity fields.
- Extend the group-id fallback test to also exercise an explicit
empty-string entry in userGroupNames (not just a shorter slice).
Reinstates the SyncMappings RPC that landed on origin/main and the
client-side fallback to GetMappingUpdate.
- proto: SyncMappings RPC + SyncMappingsRequest{Init|Ack} +
SyncMappingsResponse messages.
- management proxy.go: SyncMappings server handler, recvSyncInit,
sendSnapshotSync (per-batch send-then-wait-for-ack), drainRecv,
waitForAck; proxyConnection.syncStream + sendResponse routes the
same sendChan onto the bidi stream when set.
- proxy/server.go: trySyncMappings + handleSyncMappingsStream that
acks after each batch is processed; outer loop tries SyncMappings
first and falls back to GetMappingUpdate on Unimplemented.
Capabilities lifted into proxyCapabilities() so both code paths
use the same flags.
Adds a new "private" service mode for the reverse proxy: services
reachable exclusively over the embedded WireGuard tunnel, gated by
per-peer group membership instead of operator auth schemes.
Wire contract
- ProxyMapping.private (field 13): the proxy MUST call
ValidateTunnelPeer and fail closed; operator schemes are bypassed.
- ProxyCapabilities.private (4) + supports_private_service (5):
capability gate. Management never streams private mappings to
proxies that don't claim the capability; the broadcast path applies
the same filter via filterMappingsForProxy.
- ValidateTunnelPeer RPC: resolves an inbound tunnel IP to a peer,
checks the peer's groups against service.AccessGroups, and mints
a session JWT on success. checkPeerGroupAccess fails closed when
a private service has empty AccessGroups.
- ValidateSession/ValidateTunnelPeer responses now carry
peer_group_ids + peer_group_names so the proxy can authorise
policy-aware middlewares without an extra management round-trip.
- ProxyInboundListener + SendStatusUpdate.inbound_listener: per-account
inbound listener state surfaced to dashboards.
- PathTargetOptions.direct_upstream (11): bypass the embedded NetBird
client and dial the target via the proxy host's network stack for
upstreams reachable without WireGuard.
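The fail-closed group check in the ValidateTunnelPeer item above can be sketched as a set intersection with an explicit empty-list denial. The function name peerHasAccess and its plain string-slice signature are hypothetical; the real checkPeerGroupAccess takes richer types.

```go
package main

import "fmt"

// peerHasAccess reports whether any of the peer's groups appears in the
// service's access groups. An empty access list denies everyone, so a
// misconfigured private service fails closed.
func peerHasAccess(accessGroups, peerGroups []string) bool {
	if len(accessGroups) == 0 {
		return false
	}
	allowed := make(map[string]struct{}, len(accessGroups))
	for _, g := range accessGroups {
		allowed[g] = struct{}{}
	}
	for _, g := range peerGroups {
		if _, ok := allowed[g]; ok {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(peerHasAccess([]string{"grp-dev"}, []string{"grp-ops", "grp-dev"}))
	fmt.Println(peerHasAccess([]string{"grp-dev"}, []string{"grp-ops"}))
	fmt.Println(peerHasAccess(nil, []string{"grp-dev"}))
}
```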
Data model
- Service.Private (bool) + Service.AccessGroups ([]string, JSON-
serialised). Validate() rejects bearer auth on private services.
Copy() deep-copies AccessGroups. pgx getServices loads the columns.
- DomainConfig.Private threaded into the proxy auth middleware.
Request handler routes private services through forwardWithTunnelPeer
and returns 403 on validation failure.
- Account-level SynthesizePrivateServiceZones (synthetic DNS) and
injectPrivateServicePolicies (synthetic ACL) gate on
len(svc.AccessGroups) > 0.
Proxy
- `netbird proxy --private` flag (embedded mode); Config.Private in
proxy/lifecycle.go.
- Per-account inbound listener (proxy/inbound.go) binding HTTP/HTTPS
on the embedded NetBird client's WireGuard tunnel netstack.
- proxy/internal/auth/tunnel_cache: ValidateTunnelPeer response cache
with single-flight de-duplication and per-account eviction.
- Local peerstore short-circuit: when the inbound IP isn't in the
account roster, deny fast without an RPC.
- proxy/server.go reports SupportsPrivateService=true and redacts the
full ProxyMapping JSON from info logs (auth_token + header-auth
hashed values now only at debug level).
Identity forwarding
- ValidateSessionJWT returns user_id, email, method, groups,
group_names. sessionkey.Claims carries Email + Groups + GroupNames
so the proxy can stamp identity onto upstream requests without an
extra management round-trip on every cookie-bearing request.
- CapturedData carries userEmail / userGroups / userGroupNames; the
proxy stamps X-NetBird-User and X-NetBird-Groups on r.Out from the
authenticated identity (strips client-supplied values first to
prevent spoofing).
- AccessLog.UserGroups: access-log enrichment captures the user's
group memberships at write time so the dashboard can render group
context without reverse-resolving stale memberships.
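The strip-then-stamp order for the identity headers might look like the sketch below. The constant names follow the ones used in the tests, but stampIdentity, the forward helper, and the comma-joined group encoding are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

const (
	headerNetBirdUser   = "X-NetBird-User"
	headerNetBirdGroups = "X-NetBird-Groups"
)

// stampIdentity removes any client-supplied identity headers first, then
// sets each header only when the authenticated identity has a value for it.
func stampIdentity(h http.Header, email string, groups []string) {
	h.Del(headerNetBirdUser)
	h.Del(headerNetBirdGroups)
	if email != "" {
		h.Set(headerNetBirdUser, email)
	}
	if len(groups) > 0 {
		// Assumption: groups are comma-joined into a single header value.
		h.Set(headerNetBirdGroups, strings.Join(groups, ","))
	}
}

// forward simulates a request whose client tried to spoof both headers and
// returns what the upstream would see after stamping.
func forward(email string, groups []string) (user, grp string) {
	h := http.Header{}
	h.Set(headerNetBirdUser, "spoofed@example.com")
	h.Set(headerNetBirdGroups, "admins")
	stampIdentity(h, email, groups)
	return h.Get(headerNetBirdUser), h.Get(headerNetBirdGroups)
}

func main() {
	fmt.Println(forward("alice@example.com", []string{"dev", "ops"}))
	fmt.Println(forward("", []string{"agents"}))
}
```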
OpenAPI/dashboard surface
- ReverseProxyService gains private + access_groups; ReverseProxyCluster
gains private + supports_private. ReverseProxyTarget target_type
enum gains "cluster". ServiceTargetOptions gains direct_upstream.
ProxyAccessLog gains user_groups.
The cluster listing now answers three questions in one round-trip
instead of forcing the dashboard to cross-reference the domains API:
which clusters can this account see, are they currently up, and what
do they support. The ProxyCluster wire type drops the boolean
self_hosted in favour of a `type` enum (`account` / `shared`) plus
explicit `online`, `supports_custom_ports`, `require_subdomain`, and
`supports_crowdsec` fields.
The store query is reworked so offline clusters still appear (no WHERE
filter on last_seen), with online and connected_proxies both derived from the
existing 2-min active window via portable CASE expressions; the
1-hour heartbeat reaper still removes long-stale rows. Service
manager enriches each cluster with the capability flags via the
existing per-cluster lookups (CapabilityProvider now also exposes
ClusterSupportsCrowdSec).
GetActiveClusterAddresses* keep their tight 2-min filter so service
routing and domain enumeration aren't pulled into the wider window.
The hard cut removes self_hosted from the response — the dashboard is
the only consumer and is updated in the matching PR; no transitional
field is shipped.
Adds a cross-engine regression test asserting offline clusters
surface, connected_proxies counts only fresh proxies, and
account-scoped BYOP clusters never leak across accounts.
* [management] Ensure SessionStartedAt has a default value
Avoid null values for the new column
* [management] Add PeerStatus with LastSeen in peer_test
* [management] Add migration for PeerStatusSessionStartedAt default value
* [management] Add PeerStatus with LastSeen in migration tests
* [management] Add metrics for peer status updates and ephemeral cleanup
The session-fenced MarkPeerConnected / MarkPeerDisconnected path and
the ephemeral peer cleanup loop both run silently today: when fencing
rejects a stale stream, when a cleanup tick deletes peers, or when a
batch delete fails, we have no operational signal beyond log lines.
Add OpenTelemetry counters and a histogram so the same SLO-style
dashboards that already exist for the network-map controller can cover
peer connect/disconnect and ephemeral cleanup too.
All new attributes are bounded enums: operation in {connect,disconnect}
and outcome in {applied,stale,error,peer_not_found}. No account, peer,
or user ID is ever written as a metric label — total cardinality is
fixed at compile time (8 counter series, 2 histogram series, 4 unlabeled
ephemeral series).
Metric methods are nil-receiver safe so test composition that doesn't
wire telemetry (the bulk of the existing tests) works unchanged. The
ephemeral manager exposes a SetMetrics setter rather than taking the
collector through its constructor, keeping the constructor signature
stable across all test call sites.
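Nil-receiver safety is a plain Go idiom: each method checks the receiver before touching state. The sketch below uses an in-memory map in place of real OpenTelemetry instruments, and the type and method names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// PeerStatusMetrics is a hypothetical collector. A map stands in for the
// OpenTelemetry counters so the sketch stays dependency-free.
type PeerStatusMetrics struct {
	mu     sync.Mutex
	counts map[string]int
}

func NewPeerStatusMetrics() *PeerStatusMetrics {
	return &PeerStatusMetrics{counts: make(map[string]int)}
}

// Record counts one (operation, outcome) pair. Calling it on a nil
// collector is a no-op, so callers need no "metrics enabled?" checks.
func (m *PeerStatusMetrics) Record(operation, outcome string) {
	if m == nil {
		return
	}
	m.mu.Lock()
	defer m.mu.Unlock()
	m.counts[operation+"/"+outcome]++
}

// Count returns the recorded total, or 0 for a nil collector.
func (m *PeerStatusMetrics) Count(operation, outcome string) int {
	if m == nil {
		return 0
	}
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.counts[operation+"/"+outcome]
}

func main() {
	var unwired *PeerStatusMetrics // what a test without telemetry passes
	unwired.Record("connect", "stale")
	fmt.Println(unwired.Count("connect", "stale"))

	wired := NewPeerStatusMetrics()
	wired.Record("disconnect", "applied")
	fmt.Println(wired.Count("disconnect", "applied"))
}
```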
* [management] Add OpenTelemetry metrics for ephemeral peer cleanup
Introduce counters for tracking ephemeral peer cleanup, including peers pending deletion, cleanup runs, successful deletions, and failed batches. Metrics are nil-receiver safe to ensure compatibility with test setups without telemetry.
* [management] Fence peer status updates with a session token
The connect/disconnect path used a best-effort LastSeen-after-streamStart
comparison to decide whether a status update should land. Under contention
— a re-sync arriving while the previous stream's disconnect was still in
flight, or two management replicas seeing the same peer at once — the
check was a read-then-decide-then-write window: any UPDATE in between
caused the wrong row to be written. The Go-side time.Now() that fed the
comparison also drifted under lock contention, since it was captured
seconds before the write actually committed.
Replace it with an integer-nanosecond fencing token stored alongside the
status. Every gRPC sync stream uses its open time (UnixNano) as its token.
Connects only land when the incoming token is strictly greater than the
stored one; disconnects only land when the incoming token equals the
stored one (i.e. we're the stream that owns the current session). Both
are single optimistic-locked UPDATEs — no read-then-write, no transaction
wrapper.
LastSeen is now written by the database itself (CURRENT_TIMESTAMP). The
caller never supplies it, so the value always reflects the real moment
of the UPDATE rather than the moment the caller queued the work — which
was already off by minutes under heavy lock contention.
Side effects (geo lookup, peer-login-expiration scheduling, network-map
fan-out) are explicitly documented as running after the fence UPDATE
commits, never inside it. Geo also skips the update when realIP equals
the stored ConnectionIP, dropping a redundant SavePeerLocation call on
same-IP reconnects.
Tests cover the three semantic cases (matched disconnect lands, stale
disconnect dropped, stale connect dropped) plus a 16-goroutine race test
that asserts the highest token always wins.
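The fencing rules can be modelled in memory to make the three cases concrete. This is a simplified sketch: a mutex stands in for the atomicity that the single conditional UPDATE provides in the database, and the type and method names are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

type peerStatus struct {
	connected        bool
	sessionStartedAt int64 // fencing token of the owning stream
}

// statusStore is an in-memory stand-in for the peers table.
type statusStore struct {
	mu    sync.Mutex
	peers map[string]*peerStatus
}

func newStatusStore() *statusStore {
	return &statusStore{peers: make(map[string]*peerStatus)}
}

// markConnected lands only when token is strictly greater than the stored one.
func (s *statusStore) markConnected(peerID string, token int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	st, ok := s.peers[peerID]
	if !ok {
		st = &peerStatus{}
		s.peers[peerID] = st
	}
	if token <= st.sessionStartedAt {
		return false // stale connect
	}
	st.connected, st.sessionStartedAt = true, token
	return true
}

// markDisconnected lands only when token equals the stored one, i.e. the
// caller is the stream that owns the current session.
func (s *statusStore) markDisconnected(peerID string, token int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	st, ok := s.peers[peerID]
	if !ok || st.sessionStartedAt != token {
		return false // stale disconnect
	}
	st.connected = false
	return true
}

func (s *statusStore) session(peerID string) int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	if st, ok := s.peers[peerID]; ok {
		return st.sessionStartedAt
	}
	return 0
}

func main() {
	s := newStatusStore()
	var wg sync.WaitGroup
	for token := int64(1); token <= 16; token++ {
		wg.Add(1)
		go func(tok int64) {
			defer wg.Done()
			s.markConnected("peer-1", tok)
		}(token)
	}
	wg.Wait()
	fmt.Println(s.session("peer-1")) // 16, whatever the goroutine order
}
```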
* [management] Add SessionStartedAt to peer status updates
Store `SessionStartedAt` for fencing-token propagation across goroutines and update the database queries/functions to handle the new field. Remove outdated geolocation handling logic and adjust tests for concurrency safety.
* Rename `peer_status_required_approval` to `peer_status_requires_approval` in SQL store fields
When closing goroutines and handling peer disconnect, we should avoid canceling the flow because the parent gRPC context was canceled.
This change runs disconnection handling with a context that is not bound to the parent gRPC cancellation.
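One way to obtain such a context in Go 1.21 or later is context.WithoutCancel, which keeps the parent's values but ignores its cancellation. Whether the actual change uses this call is not stated here, so treat the helper below, including its name and the added timeout, as an assumption.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// disconnectContext derives a context that survives cancellation of parent
// but still has its own deadline, so cleanup cannot hang forever.
func disconnectContext(parent context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
	return context.WithTimeout(context.WithoutCancel(parent), timeout)
}

// survivesParentCancel cancels a parent context and reports whether the
// derived context is still usable afterwards.
func survivesParentCancel() bool {
	parent, cancelParent := context.WithCancel(context.Background())
	ctx, cancel := disconnectContext(parent, 10*time.Second)
	defer cancel()

	cancelParent() // simulates the gRPC stream going away
	return parent.Err() != nil && ctx.Err() == nil
}

func main() {
	fmt.Println(survivesParentCancel()) // true
}
```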
* [client] iOS: structured ResolvedIPs collection for domain routes
Replace comma-joined ResolvedIPs string with a gomobile-friendly
ResolvedIPs collection (Add/Get/Size), mirroring the Android bridge
in client/android/network_domains.go.
This allows the iOS app to match domain-route resolved IPs against
connected peer routes without parsing CSV strings, fixing the route
status indicator for dynamic (DNS) routes.
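A gomobile-friendly collection generally wraps a slice behind index-based accessors, since slices of strings do not cross the binding directly. The sketch below follows the Add/Get/Size shape named above; the exact signatures, in particular Get returning an error, are assumptions and may not match the Android bridge.

```go
package main

import "fmt"

// ResolvedIPs wraps a string slice behind accessors that gomobile can bind.
type ResolvedIPs struct {
	items []string
}

// Add appends one resolved IP address.
func (r *ResolvedIPs) Add(ip string) {
	r.items = append(r.items, ip)
}

// Get returns the address at index i or an error when i is out of range.
func (r *ResolvedIPs) Get(i int) (string, error) {
	if i < 0 || i >= len(r.items) {
		return "", fmt.Errorf("index %d out of range [0, %d)", i, len(r.items))
	}
	return r.items[i], nil
}

// Size returns the number of stored addresses.
func (r *ResolvedIPs) Size() int {
	return len(r.items)
}

func main() {
	ips := &ResolvedIPs{}
	ips.Add("192.0.2.10")
	ips.Add("192.0.2.11")
	for i := 0; i < ips.Size(); i++ {
		ip, _ := ips.Get(i)
		fmt.Println(ip)
	}
}
```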
* [client] iOS: align dynamic route exposure with Android bridge
For dynamic (DNS) routes the Swift side previously received
"invalid Prefix" as the Network value, forcing UI code to special-case
that sentinel. The Android bridge uses Domains.SafeString() instead so
peer.routes entries (which also derive from Domains.SafeString()) match
directly. Mirror that here.
Also fix the resolved IP lookup: resolvedDomains is keyed by the
resolved domain (e.g. api.ipify.org), not the configured pattern
(e.g. *.ipify.org). Group entries by ParentDomain like the daemon does
in client/server/network.go, so wildcard route patterns get their
resolved IPs populated.
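The regrouping step can be sketched as follows. The resolvedDomainInfo type with string fields is a simplified, hypothetical stand-in for the daemon's data, and groupByParent is an illustrative name.

```go
package main

import (
	"fmt"
	"sort"
)

// resolvedDomainInfo is a simplified stand-in: the map key is the resolved
// domain, while ParentDomain holds the configured route pattern.
type resolvedDomainInfo struct {
	ParentDomain string
	Prefixes     []string
}

// groupByParent collects resolved prefixes under the configured pattern so a
// lookup by pattern (for example "*.ipify.org") finds them.
func groupByParent(resolved map[string]resolvedDomainInfo) map[string][]string {
	out := make(map[string][]string)
	for _, info := range resolved {
		out[info.ParentDomain] = append(out[info.ParentDomain], info.Prefixes...)
	}
	for k := range out {
		sort.Strings(out[k]) // stable output despite random map iteration
	}
	return out
}

func main() {
	resolved := map[string]resolvedDomainInfo{
		"api.ipify.org":   {ParentDomain: "*.ipify.org", Prefixes: []string{"192.0.2.2/32"}},
		"api64.ipify.org": {ParentDomain: "*.ipify.org", Prefixes: []string{"192.0.2.1/32"}},
	}
	grouped := groupByParent(resolved)
	fmt.Println(grouped["*.ipify.org"])
	fmt.Println(len(grouped["api.ipify.org"]))
}
```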