Turn on sync response persistence before starting the engine so
DebugBundle can include the network map. On iOS the store is disk-backed
(see syncstore) to keep the map out of the constrained process memory.
* Persist sync response via pluggable store (disk on iOS)
The latest Management sync response (which carries the network map) was
kept in memory for debug bundle generation. On memory-constrained
platforms like iOS the network map can be large enough to matter.
Introduce a syncstore package with a Store interface and two backends:
a memory backend (the previous behavior) and a disk backend that
serializes the response to a file in the state directory. The backend
is selected per-platform at build time: disk on iOS, memory elsewhere.
The disk store clears any leftover file on construction so a fresh
store never reads stale data from an earlier run (e.g. another
profile's network map).
In the engine, drop the separate persistSyncResponse bool: the store is
only instantiated while persistence is enabled, and its presence is
what marks persistence as active. The store is also cleared on engine
close so the file does not linger on disk.
* syncstore: silence nilnil linter on "nothing stored" returns
Get returns (nil, nil) to signal that nothing is stored, which is part
of the Store contract and preserves the original behaviour. Annotate
both backends with //nolint:nilnil so golangci-lint does not flag it.
* syncstore: hold syncRespMux for the whole store Set/Get
Both handleSync and GetLatestSyncResponse snapshotted e.syncStore under
the read lock and then released it before calling Set/Get. That allowed
SetSyncResponsePersistence(false) or engine close to clear the store
mid-call. In particular a concurrent Clear()+nil followed by a late
Set could re-create the file that was just removed, defeating the
leak/lingering protection.
Hold syncRespMux for the duration of the store operation in both spots
so the store cannot be cleared while a Set/Get is in flight.
* syncstore: avoid StateDir "." when state path is empty
On mobile the state path may be empty (the engine tolerates a missing
state file). filepath.Dir("") returns ".", which would make a
disk-backed syncstore write into the working directory instead of
letting NewDiskStore fall back to os.TempDir().
Only set engineConfig.StateDir when path is non-empty.
* Refactor to use a common checker for development version
* Adds commit sha to development version for cobra command only
Leave dashboard unaffected
* Adjust for "v0.31.1-dev" test case
which must be considered pre-release
* Drop synthetic "dev"/"0.50.0-dev" firewall feature-gate fixtures
These test cases encoded the loose strings.Contains(v, "dev")
semantics inherited from peerSupportedFirewallFeatures, but
NetbirdVersion() never produces those values — only the literal
"development" (and now "development-<sha>[-dirty]") ever flows
through the wire. The agent owns the semantics of an ephemeral
development build, so the tests should exercise the strings we
actually emit.
Replaced with development, development-<sha> and
development-<sha>-dirty cases that match the HasPrefix("development")
predicate introduced upstream.
* Remove nonexistent tests on wire format
The sha and dirty flag are added only when the CLI asks for the version.
Account versions are unaffected and can only strictly match "development".
* Adds tests for IsDevelopmentVersion
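A sketch of the predicate these tests exercise, assuming it reduces to the HasPrefix("development") check mentioned above; the exact signature and package of IsDevelopmentVersion are not shown in this log, and the sha value is invented for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// IsDevelopmentVersion is an assumed shape of the shared checker: only the
// literal "development" and its "-<sha>[-dirty]" variants count as
// development builds.
func IsDevelopmentVersion(v string) bool {
	return strings.HasPrefix(v, "development")
}

func main() {
	for _, v := range []string{
		"development",
		"development-abc1234",       // sha is a made-up example
		"development-abc1234-dirty", // same
		"0.50.0-dev",
		"v0.31.1-dev",
	} {
		fmt.Printf("%-28s %v\n", v, IsDevelopmentVersion(v))
	}
}
```

Under this predicate "v0.31.1-dev" is not a development build, consistent with it being handled as a pre-release instead.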
* fix(proxy): gate tunnel-peer fast-path on inbound listener marker
forwardWithTunnelPeer previously accepted any RFC1918 / ULA / CGNAT
source IP, so a public client whose address happened to fall in those
ranges could bypass the configured operator auth scheme by colliding
with a known tunnel IP. The fast-path is now gated on
TunnelLookupFromContext(r.Context()) being present — that context value
is attached only by the per-account inbound (overlay) listener, so the
host-facing listener never enters this branch.
Tests updated to reflect the new requirement: requests that don't
carry the inbound marker now fall through to the regular auth flow.
* fix(proxy): harden inbound listener resource + startup-ctx handling
Correctness fixes on the per-account inbound path, with tests:
- Close the logrus ErrorLog PipeWriter on tearDown. WriterLevel hands
back an *io.PipeWriter backed by a pipe + scanner goroutine that the
caller owns; the two writers per account (https + plain) were never
closed, leaking the pipe and goroutine on every teardown.
- Run the post-Start hooks on context.Background(). runClientStartup
is launched in a goroutine from AddPeer and was inheriting the
caller's request-scoped ctx, so a cancelled request could abort the
inbound bring-up or fail the management status notification. The
tail is split into notifyClientReady so the contract is testable.
Tests cover the PipeWriter close behaviour and assert the readyHandler
+ NotifyStatus calls receive a non-cancelled background context.
* feat(proxy): short-circuit peer-own-target loops with 421
When a peer that hosts the target of a private service dials its own
service URL the request was being looped through the proxy and back
over WireGuard to the same peer — twice the WG round-trip for no
benefit, with no signal to the caller that something was wrong.
Add isSelfTargetLoop to ReverseProxy.ServeHTTP: when the request
arrived on the per-account overlay listener (IsOverlayOrigin) and the
source tunnel IP matches the target host, refuse the request with 421
Misdirected Request and a body pointing the operator at the backend
directly.
The gate is scoped to overlay origin so requests on the public
listener that happen to share a source IP with the target host are
forwarded normally.
* fix(management): private-service validation + tunnel-IP lookup semantics
- Require an explicit port for L4 cluster targets. validateL4Target
exempted TargetTypeCluster from the port check, but buildPathMappings
serializes every L4 target via net.JoinHostPort(host, port) — port=0
shipped a ":0" upstream. Cluster targets use the same Host/Port
fields, so the same requirement applies.
- GetPeerByIP returns NotFound on a tunnel-IP miss instead of mapping
every error to Internal. The proxy's ValidateTunnelPeer probes IPs
that legitimately aren't in the roster; the miss is expected and now
distinguishable from a real store failure.
- Thread ctx into getClusterCapability's gorm query so a cancelled
request doesn't keep the store busy.
Tests updated for the L4-cluster port requirement and the GetPeerByIP
NotFound path.
* fix(client): include offlinePeers in PeerStateByIP lookup
ReplaceOfflinePeers moves peers into d.offlinePeers but PeerStateByIP
only scanned d.peers. Callers (the local DNS filter via
localPeerConnectivity, embed.Client.IdentityForIP used by the
proxy's tunnel-peer validator) were treating known-but-offline peers
as unknown, which:
- causes the DNS filter to keep returning records pointing at peers
that have no live tunnel, AND
- makes the proxy's local-roster check deny a request from such a
peer rather than letting the cached management RPC carry the
authorisation decision.
Search both slices in PeerStateByIP. Adds a unit test for the IPv4
and IPv6 offline-match paths.
* fix(rest): reject empty Delete path params in reverse-proxy clients
ReverseProxyClustersAPI.Delete and ReverseProxyTokensAPI.Delete passed
the path parameter into url.PathEscape without an empty check.
PathEscape("") returns "" which collapses the request onto the
collection endpoint ("/api/reverse-proxies/clusters/" /
"/api/reverse-proxies/proxy-tokens/"), so a buggy caller issuing a
Delete with no id reached a routable URL with surprising semantics
(typically 405).
Short-circuit with a typed error before the request is built. Tests
mount a handler on the collection path that fails the test if hit, so
the regression is impossible to reintroduce silently.
* chore(api,ci,docs,test): private-service schema, proto-check, fixups
Non-functional cleanups and contract/CI hardening around the
private-service work:
API schema (openapi.yml):
- Require a non-empty access_groups and mode=http when private=true,
on both Service and ServiceRequest, mirroring
validatePrivateRequirements. mode stays optional-but-constrained
(empty defaults to http server-side), matching runtime.
CI (proto-version-check.yml):
- Cover renamed .pb.go files (read base via previous_filename).
- Match protoc-gen-go-grpc version headers (optional "- " prefix and
-gen-go-grpc suffix) so grpc-generated files are in scope.
Docs / comments:
- Reword Config field docs to say defaults are applied at Server.Start
(initDefaults), not New.
- Rename the obsolete --private-inbound flag to --private across
comments and the proto doc.
Pre-existing test fixups surfaced by review:
- Repair the integration-tagged validate_session_test.go (SignToken
signature growth + new Manager interface methods).
- Fix the CI-skip boolean precedence so Windows isn't skipped
unconditionally.
- Guard the router.HTTPListener type assertion with comma-ok.
* fix(proxy): background ctx for already-started AddPeer notification
The earlier ctx fix covered the async runClientStartup path but missed
the synchronous branch: when a service is added to an already-started
client, AddPeer called NotifyStatus with the caller's request-scoped
ctx. A cancelled request/stream could drop the connected notification
to management. Use context.Background() here too, matching
notifyClientReady.
Extends TestNetBird_AddPeer_ExistingStartedClient_NotifiesStatus to
pass a pre-cancelled caller ctx and assert the notification still ran
on a non-cancelled context.
* use the cmd context for roundtripper
* Pin actions with SHA, replace unmaintained, add dependabot for actions
* Update FreeBSD to version 15 for tests
* Use shared actions
* Update sign-pipelines version
* add SSO session extend flow (management)
Adds the management-server half of the SSO session-extension feature:
- New ExtendAuthSession gRPC RPC that refreshes a peer's session expiry
using a fresh JWT, validated through the same pipeline as Login but
without tearing down the tunnel or redoing the NetworkMap sync.
- Per-peer SessionExpiresAt timestamp on every LoginResponse and
SyncResponse so connected clients learn the deadline on the existing
long-lived stream, and admin-side changes (toggling expiration,
changing the expiration window) reach every peer within seconds.
- SessionExpiresAt(...) helper on Peer that derives the absolute UTC
deadline from LastLogin + the account-level PeerLoginExpiration
setting, returning zero when the peer is not SSO-tracked or expiration
is disabled.
The matching client-side consumer of these fields lands separately.
* encode SessionExpiresAt as 3-state on the wire
Previously the `sessionExpiresAt` field on LoginResponse, SyncResponse
and ExtendAuthSessionResponse was 2-state: a valid timestamp meant
"new deadline", and nil meant "clear". That conflated two distinct
meanings — "no info in this snapshot" vs "expiry is explicitly off /
peer is not SSO-tracked" — so a Sync push that legitimately couldn't
compute the deadline (settings lookup failed) would silently clear the
client's anchor and lose the warning window.
Three states now, encoded on the same field number (no .proto schema
churn — only comments and the server-side encoder change):
- nil pointer (field absent) → "no info"; client preserves anchor
- &Timestamp{} (seconds=0, nanos=0) → explicit "disabled / not SSO"
sentinel; client clears
- valid timestamp → new absolute UTC deadline
A new encodeSessionExpiresAt helper centralises the zero/non-zero
encoding and is shared by the Sync, Login and ExtendAuthSession
builders. The Sync builder still emits nil when settings are missing.
Login and ExtendAuthSession always carry an authoritative value.
The matching client-side decoder lands on feature/session-extend.
* add UserExtendedPeerSession activity event
ExtendAuthSession previously reused UserLoggedInPeer for its audit
record, which conflated two distinct user actions: a full interactive
SSO login (tunnel re-established, network map resync) versus an
in-place deadline refresh (tunnel untouched). Auditors reading the log
couldn't tell which one happened, and downstream dashboards/alerts on
"login" volume were polluted by routine extends.
Adds a dedicated UserExtendedPeerSession Activity (code 125,
"user.peer.session.extend") and switches ExtendPeerSession over to it.
The peer-extend audit trail is now distinguishable from interactive
logins.
* make ExtendAuthSession JWT-retry backoff cancellable
Skip the retry log and 200ms wait on the final attempt, and replace the
uncancellable time.Sleep with a select on time.After/ctx.Done so an
upstream cancellation aborts the wait instead of running it to
completion.
* Updates rosenpass version
go-rosenpass v0.4.0 → v0.5.42 bump — detailed findings
Change summary
cunicu.li/go-rosenpass v0.4.0 → v0.5.42 (target)
cilium/ebpf v0.15.0 → v0.19.0 (transitive)
gopacket/gopacket v1.1.1 → v1.4.0 (transitive)
wireguard 2023-07 → 2023-12 (transitive)
wireguard/wgctrl 2023-04 → 2024-12 (transitive)
Wire interop
v0.4.0 (in v0.70.5) <-> v0.5.42 OK
v0.5.42 <-> v0.5.42 OK
Quantum resistance: true both ends
---
**Replay error eliminated.**
Before (on v0.4.0):
`ERROR Failed to handle message: failed to load biscuit (ICR1): detected replay`
Recurring every ~50ms for minutes at a time. Gone entirely after both ends upgraded to v0.5.42. Upstream fix in biscuit/replay handling between v0.4.x and v0.5.x series.
* Fixup [::]:port socket trying to send to v4
* Adds more tests on netbird<->rosenpass interactions
* Moves rp handler creation ahead of generateConfig
* [client] Moves deterministic key gen into rosenpass
* go mod tidy
* Adds reminder to reason about rosenpass surface area
* Apply code rabbit suggestions
addStateFile() resolved the state path via ServiceManager.GetStatePath(),
which on iOS points at a hard-coded default that does not exist in the app
sandbox, so the state file was silently skipped.
Add an optional StatePath to GeneratorDependencies and use it when set,
falling back to the ServiceManager default otherwise. The iOS DebugBundle
passes the client's actual state file path (the App Group profile state),
matching the Android bundle which includes the state file.
Thread cacheDir through NewClient -> RunOniOS -> MobileDependency.TempDir
so the iOS client can pass its sandbox-writable cache directory for
debug bundle zip file creation instead of os.TempDir().
Move log collection into platform-dispatched addPlatformLog():
- iOS: adds the file-based Go client log (with rotation, stderr/stdout
companions and anonymization handled by addLogfile) plus the Swift app
log (swift-log.log) written by the iOS app into the same log directory
- Other non-Android platforms: existing file-based log + systemd fallback
Narrow the debug_nonandroid.go build tag to !android && !ios so iOS no
longer attempts the systemd journal fallback.
Add a DebugBundle() entry point to the iOS Go client that generates a
bundle, uploads it and returns the upload key. It works with or without
a running engine: when the engine is up it reuses the live config, sync
response and client metrics; otherwise it loads the config from disk (or
the preloaded tvOS config). Guard the live config/ConnectClient behind a
state mutex since DebugBundle may run on a different thread.
Adds a new "private" service mode for the reverse proxy: services reachable exclusively over the embedded WireGuard tunnel, gated by per-peer group membership instead of operator auth schemes.
Wire contract
- ProxyMapping.private (field 13): the proxy MUST call ValidateTunnelPeer and fail closed; operator schemes are bypassed.
- ProxyCapabilities.private (4) + supports_private_service (5): capability gate. Management never streams private mappings to proxies that don't claim the capability; the broadcast path applies the same filter via filterMappingsForProxy.
- ValidateTunnelPeer RPC: resolves an inbound tunnel IP to a peer, checks the peer's groups against service.AccessGroups, and mints a session JWT on success. checkPeerGroupAccess fails closed when a private service has empty AccessGroups.
- ValidateSession/ValidateTunnelPeer responses now carry peer_group_ids + peer_group_names so the proxy can authorise policy-aware middlewares without an extra management round-trip.
- ProxyInboundListener + SendStatusUpdate.inbound_listener: per-account inbound listener state surfaced to dashboards.
- PathTargetOptions.direct_upstream (11): bypass the embedded NetBird client and dial the target via the proxy host's network stack for upstreams reachable without WireGuard.
Data model
- Service.Private (bool) + Service.AccessGroups ([]string, JSON-serialised). Validate() rejects bearer auth on private services. Copy() deep-copies AccessGroups. pgx getServices loads the columns.
- DomainConfig.Private threaded into the proxy auth middleware. Request handler routes private services through forwardWithTunnelPeer and returns 403 on validation failure.
- Account-level SynthesizePrivateServiceZones (synthetic DNS) and injectPrivateServicePolicies (synthetic ACL) gate on len(svc.AccessGroups) > 0.
Proxy
- /netbird proxy --private (embedded mode) flag; Config.Private in proxy/lifecycle.go.
- Per-account inbound listener (proxy/inbound.go) binding HTTP/HTTPS on the embedded NetBird client's WireGuard tunnel netstack.
- proxy/internal/auth/tunnel_cache: ValidateTunnelPeer response cache with single-flight de-duplication and per-account eviction.
- Local peerstore short-circuit: when the inbound IP isn't in the account roster, deny fast without an RPC.
- proxy/server.go reports SupportsPrivateService=true and redacts the full ProxyMapping JSON from info logs (auth_token + header-auth hashed values now only at debug level).
Identity forwarding
- ValidateSessionJWT returns user_id, email, method, groups, group_names. sessionkey.Claims carries Email + Groups + GroupNames so the proxy can stamp identity onto upstream requests without an extra management round-trip on every cookie-bearing request.
- CapturedData carries userEmail / userGroups / userGroupNames; the proxy stamps X-NetBird-User and X-NetBird-Groups on r.Out from the authenticated identity (strips client-supplied values first to prevent spoofing).
- AccessLog.UserGroups: access-log enrichment captures the user's group memberships at write time so the dashboard can render group context without reverse-resolving stale memberships.
OpenAPI/dashboard surface
- ReverseProxyService gains private + access_groups; ReverseProxyCluster gains private + supports_private. ReverseProxyTarget target_type enum gains "cluster". ServiceTargetOptions gains direct_upstream. ProxyAccessLog gains user_groups.
The cluster listing now answers three questions in one round-trip
instead of forcing the dashboard to cross-reference the domains API:
which clusters can this account see, are they currently up, and what
do they support. The ProxyCluster wire type drops the boolean
self_hosted in favour of a `type` enum (`account` / `shared`) plus
explicit `online`, `supports_custom_ports`, `require_subdomain`, and
`supports_crowdsec` fields.
Store query reworked so offline clusters still appear (no last_seen
WHERE), with online and connected_proxies both derived from the
existing 2-min active window via portable CASE expressions; the
1-hour heartbeat reaper still removes long-stale rows. Service
manager enriches each cluster with the capability flags via the
existing per-cluster lookups (CapabilityProvider now also exposes
ClusterSupportsCrowdSec).
GetActiveClusterAddresses* keep their tight 2-min filter so service
routing and domain enumeration aren't pulled into the wider window.
The hard cut removes self_hosted from the response — the dashboard is
the only consumer and is updated in the matching PR; no transitional
field is shipped.
Adds a cross-engine regression test asserting offline clusters
surface, connected_proxies counts only fresh proxies, and
account-scoped BYOP clusters never leak across accounts.
* [management] Ensure SessionStartedAt has a default value
Avoid null values for the new column
* [management] Add PeerStatus with LastSeen in peer_test
* [management] Add migration for PeerStatusSessionStartedAt default value
* [management] Add PeerStatus with LastSeen in migration tests
* [management] Add metrics for peer status updates and ephemeral cleanup
The session-fenced MarkPeerConnected / MarkPeerDisconnected path and
the ephemeral peer cleanup loop both run silently today: when fencing
rejects a stale stream, when a cleanup tick deletes peers, or when a
batch delete fails, we have no operational signal beyond log lines.
Add OpenTelemetry counters and a histogram so the same SLO-style
dashboards that already exist for the network-map controller can cover
peer connect/disconnect and ephemeral cleanup too.
All new attributes are bounded enums: operation in {connect,disconnect}
and outcome in {applied,stale,error,peer_not_found}. No account, peer,
or user ID is ever written as a metric label — total cardinality is
fixed at compile time (8 counter series, 2 histogram series, 4 unlabeled
ephemeral series).
Metric methods are nil-receiver safe so test composition that doesn't
wire telemetry (the bulk of the existing tests) works unchanged. The
ephemeral manager exposes a SetMetrics setter rather than taking the
collector through its constructor, keeping the constructor signature
stable across all test call sites.
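A sketch of the nil-receiver pattern, using a plain in-memory counter as a stand-in for the OpenTelemetry instruments; PeerStatusMetrics, CountUpdate and Value are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// PeerStatusMetrics is a stand-in collector keyed by bounded enum labels
// only: operation in {connect, disconnect} and outcome in
// {applied, stale, error, peer_not_found}, so at most 2 x 4 = 8 series.
type PeerStatusMetrics struct {
	mu     sync.Mutex
	counts map[string]int
}

func NewPeerStatusMetrics() *PeerStatusMetrics {
	return &PeerStatusMetrics{counts: make(map[string]int)}
}

// CountUpdate is safe to call on a nil receiver, so code paths built
// without telemetry need no special casing.
func (m *PeerStatusMetrics) CountUpdate(operation, outcome string) {
	if m == nil {
		return
	}
	m.mu.Lock()
	defer m.mu.Unlock()
	m.counts[operation+"/"+outcome]++
}

// Value returns the current count, or 0 on a nil receiver.
func (m *PeerStatusMetrics) Value(operation, outcome string) int {
	if m == nil {
		return 0
	}
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.counts[operation+"/"+outcome]
}

func main() {
	var unwired *PeerStatusMetrics // e.g. a test that wires no telemetry
	unwired.CountUpdate("connect", "applied")
	fmt.Println(unwired.Value("connect", "applied")) // 0

	m := NewPeerStatusMetrics()
	m.CountUpdate("disconnect", "stale")
	fmt.Println(m.Value("disconnect", "stale")) // 1
}
```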
* [management] Add OpenTelemetry metrics for ephemeral peer cleanup
Introduce counters for tracking ephemeral peer cleanup, including peers pending deletion, cleanup runs, successful deletions, and failed batches. Metrics are nil-receiver safe to ensure compatibility with test setups without telemetry.
* [management] Fence peer status updates with a session token
The connect/disconnect path used a best-effort LastSeen-after-streamStart
comparison to decide whether a status update should land. Under contention
— a re-sync arriving while the previous stream's disconnect was still in
flight, or two management replicas seeing the same peer at once — the
check was a read-then-decide-then-write window: any UPDATE in between
caused the wrong row to be written. The Go-side time.Now() that fed the
comparison also drifted under lock contention, since it was captured
seconds before the write actually committed.
Replace it with an integer-nanosecond fencing token stored alongside the
status. Every gRPC sync stream uses its open time (UnixNano) as its token.
Connects only land when the incoming token is strictly greater than the
stored one; disconnects only land when the incoming token equals the
stored one (i.e. we're the stream that owns the current session). Both
are single optimistic-locked UPDATEs — no read-then-write, no transaction
wrapper.
LastSeen is now written by the database itself (CURRENT_TIMESTAMP). The
caller never supplies it, so the value always reflects the real moment
of the UPDATE rather than the moment the caller queued the work — which
was already off by minutes under heavy lock contention.
Side effects (geo lookup, peer-login-expiration scheduling, network-map
fan-out) are explicitly documented as running after the fence UPDATE
commits, never inside it. Geo also skips the update when realIP equals
the stored ConnectionIP, dropping a redundant SavePeerLocation call on
same-IP reconnects.
Tests cover the three semantic cases (matched disconnect lands, stale
disconnect dropped, stale connect dropped) plus a 16-goroutine race test
that asserts the highest token always wins.
* [management] Add SessionStartedAt to peer status updates
Store `SessionStartedAt` for fencing token propagation across goroutines and update database queries/functions to handle the new field. Remove outdated geolocation handling logic and adjust tests for concurrency safety.
* Rename `peer_status_required_approval` to `peer_status_requires_approval` in SQL store fields
When closing goroutines and handling peer disconnect, we should avoid canceling the flow because the parent gRPC context was cancelled.
This change runs disconnection handling with a context that is not bound to the parent gRPC cancellation.