* [management] Add metrics for peer status updates and ephemeral cleanup
The session-fenced MarkPeerConnected / MarkPeerDisconnected path and
the ephemeral peer cleanup loop both run silently today: when fencing
rejects a stale stream, when a cleanup tick deletes peers, or when a
batch delete fails, we have no operational signal beyond log lines.
Add OpenTelemetry counters and a histogram so the same SLO-style
dashboards that already exist for the network-map controller can cover
peer connect/disconnect and ephemeral cleanup too.
All new attributes are bounded enums: operation in {connect,disconnect}
and outcome in {applied,stale,error,peer_not_found}. No account, peer,
or user ID is ever written as a metric label — total cardinality is
fixed at compile time (8 counter series, 2 histogram series, 4 unlabeled
ephemeral series).
Metric methods are nil-receiver safe so test composition that doesn't
wire telemetry (the bulk of the existing tests) works unchanged. The
ephemeral manager exposes a SetMetrics setter rather than taking the
collector through its constructor, keeping the constructor signature
stable across all test call sites.
* [management] Add OpenTelemetry metrics for ephemeral peer cleanup
Introduce counters for tracking ephemeral peer cleanup, including peers pending deletion, cleanup runs, successful deletions, and failed batches. Metrics are nil-receiver safe to ensure compatibility with test setups without telemetry.
* [management] Fence peer status updates with a session token
The connect/disconnect path used a best-effort LastSeen-after-streamStart
comparison to decide whether a status update should land. Under contention
— a re-sync arriving while the previous stream's disconnect was still in
flight, or two management replicas seeing the same peer at once — the
check was a read-then-decide-then-write window: any UPDATE in between
caused the wrong row to be written. The Go-side time.Now() that fed the
comparison also drifted under lock contention, since it was captured
seconds before the write actually committed.
Replace it with an integer-nanosecond fencing token stored alongside the
status. Every gRPC sync stream uses its open time (UnixNano) as its token.
Connects only land when the incoming token is strictly greater than the
stored one; disconnects only land when the incoming token equals the
stored one (i.e. we're the stream that owns the current session). Both
are single optimistic-locked UPDATEs — no read-then-write, no transaction
wrapper.
LastSeen is now written by the database itself (CURRENT_TIMESTAMP). The
caller never supplies it, so the value always reflects the real moment
of the UPDATE rather than the moment the caller queued the work — which
was already off by minutes under heavy lock contention.
Side effects (geo lookup, peer-login-expiration scheduling, network-map
fan-out) are explicitly documented as running after the fence UPDATE
commits, never inside it. Geo also skips the update when realIP equals
the stored ConnectionIP, dropping a redundant SavePeerLocation call on
same-IP reconnects.
Tests cover the three semantic cases (matched disconnect lands, stale
disconnect dropped, stale connect dropped) plus a 16-goroutine race test
that asserts the highest token always wins.
* [management] Add SessionStartedAt to peer status updates
Stored `SessionStartedAt` for fencing token propagation across goroutines and updated database queries/functions to handle the new field. Removed outdated geolocation handling logic and adjusted tests for concurrency safety.
* Rename `peer_status_required_approval` to `peer_status_requires_approval` in SQL store fields
- Peers.Get returns Status{Status: DaemonUnavailable} on Unavailable
instead of an error so the React useStatus initial refresh picks up
the same string the live event stream emits — the overlay no longer
depends on receiving the synthetic event during boot.
- ProfileContext.refresh swallows Unavailable so the redundant
"Load Profiles Failed" popup does not overlap the overlay.
- Tray Profiles submenu is disabled while the daemon is unavailable,
matching the existing settings/debug/connect gating.
- gRPC client uses a 5s ConnectParams MaxDelay; the default 120s cap
was keeping the SubChannel in backoff for tens of seconds after the
daemon came back, masking the recovery.
Brings two main-side PRs' UI behavior across the Fyne→Wails rewrite:
- #5631 (IPv6 overlay support): add "Enable IPv6" row to the polished
SettingsNetwork tab; the legacy screens/Settings.tsx already had it,
but modules/settings/SettingsNetwork.tsx (the user-visible Settings
window) was missing the toggle.
- #6150 (mirror v4 exit selection onto v6 pair): replace the literal
"0.0.0.0/0" || "::/0" filter in screens/Networks.tsx with an
isDefaultRoute() helper that handles the daemon's merged-range
display string (e.g. "0.0.0.0/0, ::/0"), so paired v4/v6 exit
nodes are classified correctly.
When closing go routines and handling peer disconnect, we should avoid canceling the flow due to parent gRPC context cancellation.
This change triggers disconnection handling with a context that is not bound to the parent gRPC cancellation.
The bindings under client/ui/frontend/bindings are gitignored (1ebb507),
so the release UI job has to regenerate them before pnpm build — the
@wailsio/runtime Vite plugin refuses to build without them. Add a
wails3 CLI install step (version derived from go.mod via go list -m,
so it stays in sync with the runtime the binary links against), plus a
goreleaser before-hook that runs `wails3 generate bindings -clean=true
-ts` ahead of the existing pnpm install + pnpm build pair.
Bump github.com/wailsapp/wails/v3 to v3.0.0-alpha.91 in the process.
The @wailsio/runtime npm package stays at "latest" since the registry
only goes up to alpha.79 — the binding generator and the runtime stay
compatible across that gap until the binding shape changes.
The auto-update feature was driven by two narrow Wails events
(netbird:update:available and :progress) plus a SystemEvent-metadata
iteration on the React side. Both surfaces had to know the daemon
metadata schema (new_version_available, enforced, progress_window),
and the frontend had no pull endpoint to seed its state on mount.
Extract the state machine into a new client/ui/updater package, mirroring
how i18n and preferences are split between domain logic and a thin
services facade. The package owns the State type, the metadata-key
parsing, the mutex-guarded Holder, and the single netbird:update:state
event. services.Update keeps the daemon RPCs (Trigger, GetInstallerResult,
Quit) and gains GetState as a Wails pull endpoint.
Tray-side update behaviour moves out of tray.go into a dedicated
trayUpdater (tray_update.go): owns its menu item, OS notification,
click handler, and the /update window opener triggered by the
daemon's progress_window:show. tray.go drops three callbacks and four
fields, and reads hasUpdate through the updater.
Frontend ClientVersionContext now seeds from Update.GetState() and
subscribes to netbird:update:state; the status.events iteration and
metadata-key string literals are gone. UpdateAvailableBanner renders
only for the enforced && !installing branch and labels its action
"Install now"; UpdateVersionCard splits the install vs. download
branches by Enforced so the disabled flow routes to GitHub.
Adds a tray + React translation pipeline driven by a single JSON locale
tree (frontend/src/i18n/locales) embedded into the Go binary. The tray
re-renders on language switch via a Localizer that subscribes to the
preferences store.
Layout:
- client/ui/i18n: Bundle, LanguageCode, Language, errors, embedded-FS
loader. Pure domain, no Wails/daemon deps.
- client/ui/preferences: Store + UIPreferences for user-scope UI state,
persisted under os.UserConfigDir()/netbird/ui-preferences.json with
atomic writes and a subscribe/broadcast channel.
- client/ui/services: thin Wails-binding facades (services.I18n,
services.Preferences) so React sees ctx-first signatures.
- client/ui/localizer.go: tray bridge that owns the active language,
exposes T()/StatusLabel() and re-paints the menu on prefs change.
- tray.go: every user-facing const replaced by translation keys via
t.loc.T(...); menu rebuild + state replay on language switch.
- main.go: //go:embed all:frontend/src/i18n/locales, wires Bundle ->
Store -> Localizer -> Wails facades in order.
Frontend API exposed via Wails bindings: I18n.Languages, I18n.Bundle,
Preferences.Get, Preferences.SetLanguage, plus the
netbird:preferences:changed event.
Includes regenerated Wails TS bindings (peers/profileswitcher/etc.
re-emitted as part of the build) and en/hu seed bundles.
The optimistic Connecting paint and the Idle/stale-Connected
suppression lived in the tray's applyStatus, so only the tray got the
smoothed-out transition during a profile switch — the React Status
page (useStatus hook in frontend) subscribes to the same
netbird:status event and was seeing the raw daemon stream, complete
with the Disconnected blink.
Move the policy one layer up into the Peers service, between
SubscribeStatus and the Wails event bus, so every consumer downstream
sees the same filtered stream:
* Peers gains BeginProfileSwitch / CancelProfileSwitch / shouldSuppress.
BeginProfileSwitch sets the in-progress flag and emits a synthetic
Connecting status so both the tray and React paint Connecting
immediately. shouldSuppress swallows the daemon's stale Connected
(peer-count teardown) and transient Idle (Down between flows)
until Connecting / NeedsLogin / LoginFailed / SessionExpired /
DaemonUnavailable indicates the new profile's flow has started,
or a 30s safety timeout fires.
* ProfileSwitcher.SwitchActive calls peers.BeginProfileSwitch when
wasActive (prevStatus was Connected or Connecting) — the only
cases where the daemon emits the blink-inducing sequence. Other
prevStatuses already terminate cleanly on Idle.
* Tray loses its switchInProgress fields, applyOptimisticConnecting
helper, applyStatus suppression switch, and switchProfile's
optimistic-paint call. handleDisconnect now calls
Peers.CancelProfileSwitch alongside cancelling switchCancel, so
the abort path bypasses the suppression filter and the daemon's
Idle paints through immediately.
The full prevStatus -> action / optimistic label / suppressed events
matrix now lives in the ProfileSwitcher struct godoc, with the
suppression-rule-per-incoming-status table on the Peers struct
godoc — together they describe the click-time policy and the
stream-filter behaviour without duplication.
Wails bindings need regenerating to pick up Peers.BeginProfileSwitch
and Peers.CancelProfileSwitch.
Three UX fixes for the tray's profile-switch flow:
* Optimistic Connecting paint when switching from Connected/Connecting.
Previously the daemon's Down step emitted Idle before the new
profile's Up emitted Connecting, leaving the tray flashing
"Disconnected" for the duration of the Down. switchProfile now sets a
flag and synthesizes a Connecting paint at click time; applyStatus
suppresses the transient Idle and the stale Connected updates that
arrive during the old profile's teardown, clearing the flag only when
the new profile's flow surfaces (Connecting, NeedsLogin, LoginFailed,
SessionExpired, DaemonUnavailable, or a 30s safety timeout).
* Disconnect during an in-flight switch now actually disconnects. The
switchCancel context cancels the ProfileSwitcher.SwitchActive
goroutine so its queued Up RPC never fires, and the
switchInProgress flag is cleared so the daemon's Idle push paints
through immediately. Without this, the user's Disconnect click was
followed seconds later by the switcher's Up bringing the new
profile back online.
* The first menu row is informational only. SetEnabled(false) is
re-applied to t.statusItem (initial build, applyStatus, and the
optimistic paint) and the openRoute("/login") OnClick handler is
dropped — every actionable transition flows through the
Connect/Disconnect entries below.
The switchProfile and applyStatus godocs carry the full
prevStatus → suppressed-events / final-state transition tables so
future readers don't have to rebuild the policy from the code.
The trailing close(giveUpChan) at the bottom of the function only ran on
the backoff.Retry path. The DisableAutoConnect path returned early via
the if-block, skipping the close entirely. That branch is hit whenever
the active profile has auto-connect disabled — so every Down for those
profiles waited the full 5s timeout in the Down RPC select (and twice
when two Downs queued up, since both snapped the same never-closing
chan).
Move close(giveUpChan) into the existing defer so it fires on every
exit path: DisableAutoConnect return, backoff.Retry return, or panic.
The close happens after clientRunning=false is committed under the
mutex, so a Down/Up that wakes on the chan-close doesn't observe a
half-state where the chan is closed but clientRunning is still true.
Updates the Down RPC comment to point at the deferred close as the
signal source, and reframes the 5s timeout warning as "the goroutine
is wedged in a slow teardown step" rather than the expected case.
The Fyne UI used to write the active profile to both fronts on every
switch (profile.go:264-273): the daemon SwitchProfile RPC for
/var/lib/netbird/active_profile.json, then profileManager.SwitchProfile
for the user-side ~/Library/Application Support/netbird/active_profile.
The Wails ProfileSwitcher only kept the first.
Without the user-side mirror, a UI tray switch updates the daemon's
state but the CLI ProfileManager.GetActiveProfile() still returns the
stale "default". The next "netbird up" then sends ProfileName="default"
in the Login/Up request, and the daemon silently switches back to
default, reverting whatever the user just picked in the tray.
Mirror the daemon switch with profilemanager.NewProfileManager().
SwitchProfile after the daemon RPC succeeds. The daemon stays the
authority — a user-side write failure is logged as a warning, not a
hard error.
Three tray fixes after the SubscribeStatus stream refactor:
* loadProfiles now serializes via a dedicated profileLoadMu and runs
AFTER the SetHidden/SetEnabled writes inside applyStatus's iconChanged
block. Previously the status-driven refresh fired before the menu-item
writes finished, so processMenu's clearMenu/re-add NSMenu rebuild
raced against SetHidden on darwin — the Disconnect entry could end
up visible-but-disabled even when applyStatus had requested it hidden.
* The status row is no longer a hidden "Login" entry. It now renders
as a plain enabled label (so the text isn't greyed out) but has no
OnClick handler — clicks are no-ops, matching the legacy Fyne UI.
All actionable transitions flow through Connect/Disconnect.
* handleConnect routes NeedsLogin/SessionExpired/LoginFailed to the
frontend's /login route (which already runs Login → WaitSSOLogin →
Up) instead of calling Up directly. The plain Up RPC errors with
"up already in progress: current status NeedsLogin" in those
states; the legacy Fyne UI drove the SSO dance from the Connect
button as well.
Two related daemon-side status-stream fixes that together keep the UI's
status in sync with the daemon's contextState:
* state.Set previously only mutated the in-memory enum — transitions
that weren't accompanied by a Mark{Management,Signal,...} call (e.g.
StatusNeedsLogin after a PermissionDenied login, StatusLoginFailed
after OAuth init failure, StatusIdle in the Login defer) left the
UI stuck on the previous snapshot until an unrelated peer event
happened to fire notifyStateChange. Add a callback on contextState
fired from Set (outside the mutex, to avoid lock-order issues with
the recorder's stateChangeMux), and wire it in Server.Start to the
recorder's new public NotifyStateChange. Every state.Set callsite
now pushes automatically; new ones don't need to opt in.
* WaitSSOLogin's WaitToken error branch lumped every failure into
StatusLoginFailed, including context.Canceled aborts from a parallel
profile switch (actCancel/waitCancel). That spurious LoginFailed
then wedged the new profile's Up RPC with "up already in progress:
current status LoginFailed". Split the branch by error type:
context.Canceled lets the top-level defer pick StatusIdle,
context.DeadlineExceeded sets StatusNeedsLogin (retryable; OAuth
device-code window just expired), other errors keep LoginFailed
(real auth/IO failures). Document the full state-transition table
in the function godoc.