netbird

mirror of https://github.com/netbirdio/netbird.git synced 2026-06-08 17:09:57 +00:00

Author	SHA1	Message	Date
Zoltán Papp	ca6f6d88cb	Merge branch 'main' into ui-refactor	2026-06-04 17:06:11 +02:00
Maycon Santos	fa1e241aea	[management, client, proxy] Follow-up fixes for private reverse-proxy services (#6268 ) * fix(proxy): gate tunnel-peer fast-path on inbound listener marker forwardWithTunnelPeer previously accepted any RFC1918 / ULA / CGNAT source IP, so a public client whose address happened to fall in those ranges could bypass the configured operator auth scheme by colliding with a known tunnel IP. The fast-path is now gated on TunnelLookupFromContext(r.Context()) being present — that context value is attached only by the per-account inbound (overlay) listener, so the host-facing listener never enters this branch. Tests updated to reflect the new requirement: requests that don't carry the inbound marker now fall through to the regular auth flow. * fix(proxy): harden inbound listener resource + startup-ctx handling Three correctness fixes on the per-account inbound path, with tests: - Close the logrus ErrorLog PipeWriter on tearDown. WriterLevel hands back an io.PipeWriter backed by a pipe + scanner goroutine that the caller owns; the two writers per account (https + plain) were never closed, leaking the pipe and goroutine on every teardown. - Run the post-Start hooks on context.Background(). runClientStartup is launched in a goroutine from AddPeer and was inheriting the caller's request-scoped ctx, so a cancelled request could abort the inbound bring-up or fail the management status notification. The tail is split into notifyClientReady so the contract is testable. Tests cover the PipeWriter close behaviour and assert the readyHandler + NotifyStatus calls receive a non-cancelled background context. feat(proxy): short-circuit peer-own-target loops with 421 When a peer that hosts the target of a private service dials its own service URL the request was being looped through the proxy and back over WireGuard to the same peer — twice the WG round-trip for no benefit, with no signal to the caller that something was wrong. Add isSelfTargetLoop to ReverseProxy.ServeHTTP: when the request arrived on the per-account overlay listener (IsOverlayOrigin) and the source tunnel IP matches the target host, refuse the request with 421 Misdirected Request and a body pointing the operator at the backend directly. The gate is scoped to overlay origin so requests on the public listener that happen to share a source IP with the target host are forwarded normally. * fix(management): private-service validation + tunnel-IP lookup semantics - Require an explicit port for L4 cluster targets. validateL4Target exempted TargetTypeCluster from the port check, but buildPathMappings serializes every L4 target via net.JoinHostPort(host, port) — port=0 shipped a ":0" upstream. Cluster targets use the same Host/Port fields, so the same requirement applies. - GetPeerByIP returns NotFound on a tunnel-IP miss instead of mapping every error to Internal. The proxy's ValidateTunnelPeer probes IPs that legitimately aren't in the roster; the miss is expected and now distinguishable from a real store failure. - Thread ctx into getClusterCapability's gorm query so a cancelled request doesn't keep the store busy. Tests updated for the L4-cluster port requirement and the GetPeerByIP NotFound path. * fix(client): include offlinePeers in PeerStateByIP lookup ReplaceOfflinePeers moves peers into d.offlinePeers but PeerStateByIP only scanned d.peers. Callers (the local DNS filter via localPeerConnectivity, embed.Client.IdentityForIP used by the proxy's tunnel-peer validator) were treating known-but-offline peers as unknown, which: - causes the DNS filter to keep returning records pointing at peers that have no live tunnel, AND - makes the proxy's local-roster check deny a request from such a peer rather than letting the cached management RPC carry the authorisation decision. Search both slices in PeerStateByIP. Adds a unit test for the IPv4 and IPv6 offline-match paths. * fix(rest): reject empty Delete path params in reverse-proxy clients ReverseProxyClustersAPI.Delete and ReverseProxyTokensAPI.Delete passed the path parameter into url.PathEscape without an empty check. PathEscape("") returns "" which collapses the request onto the collection endpoint ("/api/reverse-proxies/clusters/" / "/api/reverse-proxies/proxy-tokens/"), so a caller bug delete with no id reached a routable URL with surprising semantics (typically 405). Short-circuit with a typed error before the request is built. Tests mount a handler on the collection path that fails the test if hit, so the regression is impossible to reintroduce silently. * chore(api,ci,docs,test): private-service schema, proto-check, fixups Non-functional cleanups and contract/CI hardening around the private-service work: API schema (openapi.yml): - Require a non-empty access_groups and mode=http when private=true, on both Service and ServiceRequest, mirroring validatePrivateRequirements. mode stays optional-but-constrained (empty defaults to http server-side), matching runtime. CI (proto-version-check.yml): - Cover renamed .pb.go files (read base via previous_filename). - Match protoc-gen-go-grpc version headers (optional "- " prefix and -gen-go-grpc suffix) so grpc-generated files are in scope. Docs / comments: - Reword Config field docs to say defaults are applied at Server.Start (initDefaults), not New. - Rename the obsolete --private-inbound flag to --private across comments and the proto doc. Pre-existing test fixups surfaced by review: - Repair the integration-tagged validate_session_test.go (SignToken signature growth + new Manager interface methods). - Fix the CI-skip boolean precedence so Windows isn't skipped unconditionally. - Guard the router.HTTPListener type assertion with comma-ok. * fix(proxy): background ctx for already-started AddPeer notification The earlier ctx fix covered the async runClientStartup path but missed the synchronous branch: when a service is added to an already-started client, AddPeer called NotifyStatus with the caller's request-scoped ctx. A cancelled request/stream could drop the connected notification to management. Use context.Background() here too, matching notifyClientReady. Extends TestNetBird_AddPeer_ExistingStartedClient_NotifiesStatus to pass a pre-cancelled caller ctx and assert the notification still ran on a non-cancelled context. * use the cmd context for roundtripper	2026-06-02 13:40:09 +02:00
Zoltan Papp	5f7657b95e	Merge branch 'main' into ui-refactor	2026-05-31 12:33:50 +02:00
Pascal Fischer	9189625487	[management] enrich context in permissions manager (#6286 )	2026-05-29 16:36:38 +02:00
Zoltán Papp	3279b705fe	session-extend: drop dead retry loop in mgmClient.ExtendAuthSession The backoff loop only retried on codes.Canceled, but mgmCtx was derived from context.Background() and never cancelled by anything — so every real error path (Unavailable, DeadlineExceeded, etc.) went through backoff.Permanent on the first attempt. The loop was a no-op wrapper that just held the call open for the daemon's lifetime regardless of shutdown. Replace with a single context.WithTimeout(c.ctx, ConnectTimeout) call. Daemon shutdown now interrupts the RPC; behaviour on real errors is unchanged.	2026-05-28 19:28:15 +02:00
Zoltan Papp	174dc24867	[management] Add SSO session extend flow (management) (#6197 ) * add SSO session extend flow (management) Adds the management-server half of the SSO session-extension feature: - New ExtendAuthSession gRPC RPC that refreshes a peer's session expiry using a fresh JWT, validated through the same pipeline as Login but without tearing down the tunnel or redoing the NetworkMap sync. - Per-peer SessionExpiresAt timestamp on every LoginResponse and SyncResponse so connected clients learn the deadline on the existing long-lived stream, and admin-side changes (toggling expiration, changing the expiration window) reach every peer within seconds. - SessionExpiresAt(...) helper on Peer that derives the absolute UTC deadline from LastLogin + the account-level PeerLoginExpiration setting, returning zero when the peer is not SSO-tracked or expiration is disabled. The matching client-side consumer of these fields lands separately. * encode SessionExpiresAt as 3-state on the wire Previously the `sessionExpiresAt` field on LoginResponse, SyncResponse and ExtendAuthSessionResponse was 2-state: a valid timestamp meant "new deadline", and nil meant "clear". That conflated two distinct meanings — "no info in this snapshot" vs "expiry is explicitly off / peer is not SSO-tracked" — so a Sync push that legitimately couldn't compute the deadline (settings lookup failed) would silently clear the client's anchor and lose the warning window. Three states now, encoded on the same field number (no .proto schema churn — only comments and the server-side encoder change): - nil pointer (field absent) → "no info"; client preserves anchor - &Timestamp{} (seconds=0, nanos=0) → explicit "disabled / not SSO" sentinel; client clears - valid timestamp → new absolute UTC deadline A new encodeSessionExpiresAt helper centralises the zero/non-zero encoding and is shared by the Sync, Login and ExtendAuthSession builders. The Sync builder still emits nil when settings are missing. Login and ExtendAuthSession always carry an authoritative value. The matching client-side decoder lands on feature/session-extend. * add UserExtendedPeerSession activity event ExtendAuthSession previously reused UserLoggedInPeer for its audit record, which conflated two distinct user actions: a full interactive SSO login (tunnel re-established, network map resync) versus an in-place deadline refresh (tunnel untouched). Auditors reading the log couldn't tell which one happened, and downstream dashboards/alerts on "login" volume were polluted by routine extends. Adds a dedicated UserExtendedPeerSession Activity (code 125, "user.peer.session.extend") and switches ExtendPeerSession over to it. The peer-extend audit trail is now distinguishable from interactive logins. * make ExtendAuthSession JWT-retry backoff cancellable Skip the retry log and 200ms wait on the final attempt, and replace the uncancellable time.Sleep with a select on time.After/ctx.Done so an upstream cancellation aborts the wait instead of running it to completion.	2026-05-28 19:14:14 +02:00
Zoltan Papp	13179081d2	Merge branch 'main' into ui-refactor	2026-05-26 23:41:18 +02:00
Bethuel Mmbaga	14af179556	[management] Refactor management server bootstrap (#6256 )	2026-05-26 17:44:28 +03:00
Maycon Santos	7aebdd69dd	[management, client, proxy] add expose NetBird-only services over tunnel peers (#6226 ) Adds a new "private" service mode for the reverse proxy: services reachable exclusively over the embedded WireGuard tunnel, gated by per-peer group membership instead of operator auth schemes. Wire contract - ProxyMapping.private (field 13): the proxy MUST call ValidateTunnelPeer and fail closed; operator schemes are bypassed. - ProxyCapabilities.private (4) + supports_private_service (5): capability gate. Management never streams private mappings to proxies that don't claim the capability; the broadcast path applies the same filter via filterMappingsForProxy. - ValidateTunnelPeer RPC: resolves an inbound tunnel IP to a peer, checks the peer's groups against service.AccessGroups, and mints a session JWT on success. checkPeerGroupAccess fails closed when a private service has empty AccessGroups. - ValidateSession/ValidateTunnelPeer responses now carry peer_group_ids + peer_group_names so the proxy can authorise policy-aware middlewares without an extra management round-trip. - ProxyInboundListener + SendStatusUpdate.inbound_listener: per-account inbound listener state surfaced to dashboards. - PathTargetOptions.direct_upstream (11): bypass the embedded NetBird client and dial the target via the proxy host's network stack for upstreams reachable without WireGuard. Data model - Service.Private (bool) + Service.AccessGroups ([]string, JSON- serialised). Validate() rejects bearer auth on private services. Copy() deep-copies AccessGroups. pgx getServices loads the columns. - DomainConfig.Private threaded into the proxy auth middleware. Request handler routes private services through forwardWithTunnelPeer and returns 403 on validation failure. - Account-level SynthesizePrivateServiceZones (synthetic DNS) and injectPrivateServicePolicies (synthetic ACL) gate on len(svc.AccessGroups) > 0. Proxy - /netbird proxy --private (embedded mode) flag; Config.Private in proxy/lifecycle.go. - Per-account inbound listener (proxy/inbound.go) binding HTTP/HTTPS on the embedded NetBird client's WireGuard tunnel netstack. - proxy/internal/auth/tunnel_cache: ValidateTunnelPeer response cache with single-flight de-duplication and per-account eviction. - Local peerstore short-circuit: when the inbound IP isn't in the account roster, deny fast without an RPC. - proxy/server.go reports SupportsPrivateService=true and redacts the full ProxyMapping JSON from info logs (auth_token + header-auth hashed values now only at debug level). Identity forwarding - ValidateSessionJWT returns user_id, email, method, groups, group_names. sessionkey.Claims carries Email + Groups + GroupNames so the proxy can stamp identity onto upstream requests without an extra management round-trip on every cookie-bearing request. - CapturedData carries userEmail / userGroups / userGroupNames; the proxy stamps X-NetBird-User and X-NetBird-Groups on r.Out from the authenticated identity (strips client-supplied values first to prevent spoofing). - AccessLog.UserGroups: access-log enrichment captures the user's group memberships at write time so the dashboard can render group context without reverse-resolving stale memberships. OpenAPI/dashboard surface - ReverseProxyService gains private + access_groups; ReverseProxyCluster gains private + supports_private. ReverseProxyTarget target_type enum gains "cluster". ServiceTargetOptions gains direct_upstream. ProxyAccessLog gains user_groups.	2026-05-25 17:41:50 +02:00
Pascal Fischer	6137a1fcc5	[proxy] concurrent proxy snapshot apply (#6207 )	2026-05-20 18:21:22 +02:00
Zoltán Papp	ef6b4f7538	add SSO session extend flow Adds an end-to-end SSO session-extension feature: the management server publishes per-peer session deadlines on every Login/Sync, a new ExtendAuthSession RPC refreshes the deadline using a fresh JWT without tearing down the tunnel, and the daemon tracks the deadline locally so the UI can fire a T-10min warning toast with an interactive "Extend now" action.	2026-05-20 16:43:14 +02:00
Maycon Santos	d250f92c43	feat(reverse-proxy): clusters API surfaces type, online status, and capability flags (#6148 ) The cluster listing now answers three questions in one round-trip instead of forcing the dashboard to cross-reference the domains API: which clusters can this account see, are they currently up, and what do they support. The ProxyCluster wire type drops the boolean self_hosted in favour of a `type` enum (`account` / `shared`) plus explicit `online`, `supports_custom_ports`, `require_subdomain`, and `supports_crowdsec` fields. Store query reworked so offline clusters still appear (no last_seen WHERE), with online and connected_proxies both derived from the existing 2-min active window via portable CASE expressions; the 1-hour heartbeat reaper still removes long-stale rows. Service manager enriches each cluster with the capability flags via the existing per-cluster lookups (CapabilityProvider now also exposes ClusterSupportsCrowdSec). GetActiveClusterAddresses* keep their tight 2-min filter so service routing and domain enumeration aren't pulled into the wider window. The hard cut removes self_hosted from the response — the dashboard is the only consumer and is updated in the matching PR; no transitional field is shipped. Adds a cross-engine regression test asserting offline clusters surface, connected_proxies counts only fresh proxies, and account-scoped BYOP clusters never leak across accounts.	2026-05-20 10:08:34 +02:00
Vlad	07cbfdbede	[proxy] feature: bring your own proxy (#5627 )	2026-05-11 14:31:38 +02:00
Viktor Liu	6b08e89c7b	[relay] Preserve non-standard port in WS dialer URL prep (#6061 )	2026-05-11 09:59:33 +02:00
Nicolas Frati	e89aad09f5	[management] Enable MFA for local users (#5804 ) * wip: totp for local users * fix providers not getting populated * polished UI and fix post_login_redirect_uri * fix: make sure logout is only prompted from oidc flow Signed-off-by: jnfrati <nicofrati@gmail.com> * update templates Signed-off-by: jnfrati <nicofrati@gmail.com> * deps: update dex dependency Signed-off-by: jnfrati <nicofrati@gmail.com> * fix qube issues Signed-off-by: jnfrati <nicofrati@gmail.com> * replace window with globalThis on home html Signed-off-by: jnfrati <nicofrati@gmail.com> * fixed coderabbit comments Signed-off-by: jnfrati <nicofrati@gmail.com> * debug * remove unused config and rename totp issuer * deps: update dex reference to latest * add dashboard post logout redirect uri to embedded config * implemented api for mfa configuration * update docs and config parsing * catch error on idp manager init mfa * fix tests * Add remember me for MFA * Add cookie encryption and session share between tabs * fixed logout showing non actionable error and session cookie encription key * fixed missing mfa settings on sql query for account * fix code index for mfa activity --------- Signed-off-by: jnfrati <nicofrati@gmail.com> Co-authored-by: braginini <bangvalo@gmail.com>	2026-05-08 16:31:20 +02:00
Viktor Liu	205ebcfda2	[management, client] Add IPv6 overlay support (#5631 )	2026-05-07 11:33:37 +02:00
Zoltan Papp	a547fc74ed	[client] Use ctx.Err() instead of gRPC codes.Canceled to detect shutdown (#6019 ) Detecting shutdown by inspecting the gRPC status code conflates a local context cancellation with a server- or proxy-sent codes.Canceled. When the latter occurs (e.g. an intermediary proxy resets the stream), the retry loop silently terminates and the client never reconnects. Switch to ctx.Err() in the signal Receive loop and management Sync/Job handlers, and stop matching gRPC Canceled/DeadlineExceeded in the flow client's isContextDone helper. With this change, a server-sent Canceled is treated as a transient error and the backoff retry loop continues.	2026-05-04 11:59:25 +02:00
Viktor Liu	50b58a6828	[client, relay] Advertise relay server IP via signal for foreign-relay fallback dial (#6004 )	2026-05-04 11:40:25 +02:00
Misha Bragin	c4b2da4c92	[management] Add public connection ipv4 and ipv6 posture check (#6038 ) This change enables admins to configure posture checks for connecting public IPs of their peers. It changes the behavior of the check as well and now the evaluation is if the received network is part of the configured network.	2026-04-30 18:36:50 +02:00
Nicolas Frati	dcd1db42ef	[management] Enable PAT creation during setup (#6003 ) * enable pat creation on setup * remove logic from handler towards setup service * fix lint issue * fix rollback on account id returning empty * fix coderabbit comments * fix setup PAT rollback behavior	2026-04-30 17:21:35 +02:00
Bethuel Mmbaga	df197d5001	[management] Prevent JWT reuse during peer login (#6002 )	2026-04-29 15:04:27 +03:00
Zoltan Papp	8fc4265995	[relay] evict foreign client cache on disconnect (#6015 ) * [relay] evict foreign client cache on disconnect When a foreign relay's TCP connection drops, the manager's onServerDisconnected handler only triggered reconnect logic for the home server; the disconnected foreign entry stayed in the relayClients cache. Subsequent OpenConn calls reused the closed client until the 60-second cleanup tick evicted it, breaking peer connectivity through that relay for up to a minute. Evict the foreign entry from the cache on disconnect so the next OpenConn dials a fresh client. Also: - Make the reconnect backoff cap configurable via WithMaxBackoffInterval ManagerOption; the previous hard-coded 60s constant forced TestAutoReconnect to sleep ~61s. Test now polls Ready() and finishes in ~2s. - Add NB_HOME_RELAY_SERVERS env var that overrides the relay URL list received from management, so a peer can be pinned to a specific home relay (used by the netbird-conn-lab Edge 4 reproducer). * [client] treat empty NB_HOME_RELAY_SERVERS as unset Returning (urls=[], ok=true) when the env var contained only separators or whitespace caused callers to wipe the mgmt-provided relay list, leaving the peer with no relays. Treat a parsed-empty result the same as an unset env.	2026-04-28 15:04:48 +02:00
Zoltan Papp	9c50819f20	Don't mark management disconnected on transient job stream errors (#6005 ) The JOB stream is a separate channel from the SYNC stream. Server-side EOF or other transient errors on the JOB stream do not indicate that the management connection is unhealthy — the SYNC stream remains the authoritative state signal. Previously, a JOB stream EOF would call notifyDisconnected and the client would emit OnConnecting to the UI. The backoff retry would reconnect the JOB stream, but handleJobStream never calls notifyConnected on success, so the UI was stuck on "Connecting" until the next SYNC event or health check. Keep notifyDisconnected for codes.PermissionDenied since IsLoginRequired relies on managementError to detect expired auth.	2026-04-28 15:04:41 +02:00
Bethuel Mmbaga	6f0eff3ba0	[management] Handle single-string JWT group claim from IdPs (#6014 )	2026-04-28 14:48:28 +03:00
Bethuel Mmbaga	f8745723fc	[management] Add Microsoft AD FS support for embedded Dex identity providers (#6008 )	2026-04-28 12:42:19 +03:00
Zoltan Papp	5da05ecca6	[client] increase gRPC health check timeout to 5s (#5961 ) Bump the IsHealthy() context timeout from 1s to 5s for both the management and signal gRPC clients to reduce false negatives on slower or congested connections.	2026-04-22 20:54:18 +02:00
Maycon Santos	53b04e512a	[management] Reuse a single cache store across all management server consumers (#5889 ) * Add support for legacy IDP cache environment variable * Centralize cache store creation to reuse a single Redis connection pool Each cache consumer (IDP cache, token store, PKCE store, secrets manager, EDR validator) was independently calling NewStore, creating separate Redis clients with their own connection pools — up to 1400 potential connections from a single management server process. Introduce a shared CacheStore() singleton on BaseServer that creates one store at boot and injects it into all consumers. Consumer constructors now receive a store.StoreInterface instead of creating their own. For Redis mode, all consumers share one connection pool (1000 max conns). For in-memory mode, all consumers share one GoCache instance. * Update management-integrations module to latest version * sync go.sum * Export `GetAddrFromEnv` to allow reuse across packages * Update management-integrations module version in go.mod and go.sum * Update management-integrations module version in go.mod and go.sum	2026-04-16 16:04:53 +02:00
Viktor Liu	0a30b9b275	[management, proxy] Add CrowdSec IP reputation integration for reverse proxy (#5722 )	2026-04-14 12:14:58 +02:00
Zoltan Papp	ebd78e0122	[client] Update `RaceDial` to accept context for improved cancellation handling (#5849 )	2026-04-10 20:51:04 +02:00
Pascal Fischer	ee588e1536	Revert "[management] allow local routing peer resource (#5814 )" (#5847 )	2026-04-10 14:53:47 +02:00
Pascal Fischer	2a8aacc5c9	[management] allow local routing peer resource (#5814 )	2026-04-10 13:08:21 +02:00
Zoltan Papp	0efef671d7	[client] Unexport GetServerPublicKey, add HealthCheck method (#5735 ) * Unexport GetServerPublicKey, add HealthCheck method Internalize server key fetching into Login, Register, GetDeviceAuthorizationFlow, and GetPKCEAuthorizationFlow methods, removing the need for callers to fetch and pass the key separately. Replace the exported GetServerPublicKey with a HealthCheck() error method for connection validation, keeping IsHealthy() bool for non-blocking background monitoring. Fix test encryption to use correct key pairs (client public key as remotePubKey instead of server private key). * Refactor `doMgmLogin` to return only error, removing unused response	2026-04-07 12:18:21 +02:00
Bethuel Mmbaga	9d1a37c644	[management,client] Revert gRPC client secret removal (#5781 ) * This reverts commit `e5914e4e8b` Signed-off-by: bcmmbaga <bethuelmbaga12@gmail.com> * Deprecate client secret in proto Signed-off-by: bcmmbaga <bethuelmbaga12@gmail.com> * Fix lint Signed-off-by: bcmmbaga <bethuelmbaga12@gmail.com> --------- Signed-off-by: bcmmbaga <bethuelmbaga12@gmail.com>	2026-04-02 18:21:00 +02:00
Bethuel Mmbaga	c2c6396a04	[management] Allow updating embedded IdP user name and email (#5721 )	2026-04-02 13:02:10 +03:00
Zoltan Papp	d670e7382a	[client] Fix ipv6 address in quic server (#5763 ) * [client] Use `net.JoinHostPort` for consistency in constructing host-port pairs * [client] Fix handling of IPv6 addresses by trimming brackets in `net.JoinHostPort`	2026-04-01 15:11:23 +02:00
shuuri-labs	940f530ac2	[management] Legacy to embedded IdP migration tool (#5586 )	2026-04-01 13:53:19 +02:00
Bethuel Mmbaga	e5914e4e8b	[management,client] Remove client secret from gRPC auth flow (#5751 ) Remove client secret from gRPC auth flow. The secret was originally included to support providers like Google Workspace that don't offer a proper PKCE flow, but this is no longer necessary with the embedded IdP. Deployments using such providers should migrate to the embedded IdP instead.	2026-03-31 18:50:49 +03:00
Maycon Santos	405c3f4003	[management] Feature/fleetdm api spec (#5597 ) add fleetdm api spec	2026-03-31 14:03:34 +02:00
Bethuel Mmbaga	c919ea149e	[misc] Add missing OpenAPI definitions (#5690 )	2026-03-30 11:20:17 +03:00
Pascal Fischer	7e1cce4b9f	[management] add terminated field to service (#5700 )	2026-03-26 16:59:08 +01:00
Bethuel Mmbaga	7be8752a00	[management] Add notification endpoints (#5590 )	2026-03-26 18:26:33 +03:00
Zoltan Papp	91f0d5cefd	[client] Feature/client metrics (#5512 ) * Add client metrics * Add client metrics system with OpenTelemetry and VictoriaMetrics support Implements a comprehensive client metrics system to track peer connection stages and performance. The system supports multiple backend implementations (OpenTelemetry, VictoriaMetrics, and no-op) and tracks detailed connection stage durations from creation through WireGuard handshake. Key changes: - Add metrics package with pluggable backend implementations - Implement OpenTelemetry metrics backend - Implement VictoriaMetrics metrics backend - Add no-op metrics implementation for disabled state - Track connection stages: creation, semaphore, signaling, connection ready, and WireGuard handshake - Move WireGuard watcher functionality to conn.go - Refactor engine to integrate metrics tracking - Add metrics export endpoint in debug server * Add signaling metrics tracking for initial and reconnection attempts * Reset connection stage timestamps during reconnections to exclude unnecessary metrics tracking * Delete otel lib from client * Update unit tests * Invoke callback on handshake success in WireGuard watcher * Add Netbird version tracking to client metrics Integrate Netbird version into VictoriaMetrics backend and metrics labels. Update `ClientMetrics` constructor and metric name formatting to include version information. * Add sync duration tracking to client metrics Introduce `RecordSyncDuration` for measuring sync message processing time. Update all metrics implementations (VictoriaMetrics, no-op) to support the new method. Refactor `ClientMetrics` to use `AgentInfo` for static agent data. * Remove no-op metrics implementation and simplify ClientMetrics constructor Eliminate unused `noopMetrics` and refactor `ClientMetrics` to always use the VictoriaMetrics implementation. Update associated logic to reflect these changes. * Add total duration tracking for connection attempts Calculate total duration for both initial connections and reconnections, accounting for different timestamp scenarios. Update `Export` method to include Prometheus HELP comments. * Add metrics push support to VictoriaMetrics integration * [client] anchor connection metrics to first signal received * Remove creation_to_semaphore connection stage metric The semaphore queuing stage (Created → SemaphoreAcquired) is no longer tracked. Connection metrics now start from SignalingReceived. Updated docs and Grafana dashboard accordingly. * [client] Add remote push config for metrics with version-based eligibility Introduce remoteconfig.Manager that fetches a remote JSON config to control metrics push interval and restrict pushing to a specific agent version range. When NB_METRICS_INTERVAL is set, remote config is bypassed entirely for local override. * [client] Add WASM-compatible NewClientMetrics implementation Replace NewClientMetrics in metrics.go with a WASM-specific stub in metrics_js.go, returning nil for compatibility with JS builds. Simplify method usage for WASM targets. * Add missing file * Update default case in DeploymentType.String to return "unknown" instead of "selfhosted" * [client] Rework metrics to use timestamped samples instead of histograms Replace cumulative Prometheus histograms with timestamped point-in-time samples that are pushed once and cleared. This fixes metrics for sparse events (connections/syncs that happen once at startup) where rate() and increase() produced incorrect or empty results. Changes: - Switch from VictoriaMetrics histogram library to raw Prometheus text format with explicit millisecond timestamps - Reset samples after successful push (no resending stale data) - Rename connection_to_handshake → connection_to_wg_handshake - Add netbird_peer_connection_count metric for ICE vs Relay tracking - Simplify dashboard: point-based scatter plots, donut pie chart - Add maxStalenessInterval=1m to VictoriaMetrics to prevent forward-fill - Fix deployment_type Unknown returning "selfhosted" instead of "unknown" - Fix inverted shouldPush condition in push.go * [client] Add InfluxDB metrics backend alongside VictoriaMetrics Add influxdb.go with timestamped line protocol export for sparse one-shot events. Restore victoria.go to use proper Prometheus histograms. Update Grafana dashboards, add InfluxDB datasource, and update docs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [client] Fix metrics issues and update dev docker setup - Fix StopPush not clearing push state, preventing restart - Fix race condition reading currentConnPriority without lock in recordConnectionMetrics - Fix stale comment referencing old metrics server URL - Update docker-compose for InfluxDB: add scoped tokens, .env config, init scripts - Rename docker-compose.victoria.yml to docker-compose.yml * [client] Add anonymised peer tracking to pushed metrics Introduce peer_id and connection_pair_id tags to InfluxDB metrics. Public keys are hashed (truncated SHA-256) for anonymisation. The connection pair ID is deterministic regardless of which side computes it, enabling deduplication of reconnections in the ICE vs Relay dashboard. Also pin Grafana to v11.6.0 for file-based provisioning and fix datasource UID references. * Remove unused dependencies from go.mod and go.sum * Refactor InfluxDB ingest pipeline: extract validation logic - Move line validation logic to `validateLine` and `validateField` helper functions. - Improve error handling with structured validation and clearer separation of concerns. - Add stderr redirection for error messages in `create-tokens.sh`. * Set non-root user in Dockerfile for Ingest service * Fix Windows CI: command line too long * Remove Victoria metrics * Add hashed peer ID as Authorization header in metrics push * Revert influxdb in docker compose * Enable gzip compression and authorization validation for metrics push and ingest * Reducate code of complexity * Update debug documentation to include metrics.txt description * Increase `maxBodySize` limit to 50 MB and update gzip reader wrapping logic * Refactor deployment type detection to use URL parsing for improved accuracy * Update readme * Throttle remote config retries on fetch failure * Preserve first WG handshake timestamp, ignore rekeys * Skip adding empty metrics.txt to debug bundle in debug mode * Update default metrics server URL to https://ingest.netbird.io * Atomic metrics export-and-reset to prevent sample loss between Export and Reset calls * Fix doc * Refactor Push configuration to improve clarity and enforce minimum push interval * Remove `minPushInterval` and update push interval validation logic * Revert ExportAndReset, it is acceptable data loss * Fix metrics review issues: rename env var, remove stale infra, add tests - Rename NB_METRICS_ENABLED to NB_METRICS_PUSH_ENABLED to clarify that collection is always active (for debug bundles) and only push is opt-in - Change default config URL from staging to production (ingest.netbird.io) - Delete broken Prometheus dashboard (used non-existent metric names) - Delete unused VictoriaMetrics datasource config - Replace committed .env with .env.example containing placeholder values - Wire Grafana admin credentials through env vars in docker-compose - Make metricsStages a pointer to prevent reset-vs-write race on reconnect - Fix typed-nil interface in debug bundle path (GetClientMetrics) - Use deterministic field order in InfluxDB Export (sorted keys) - Replace Authorization header with X-Peer-ID for metrics push - Fix ingest server timeout to use time.Second instead of float - Fix gzip double-close, stale comments, trim log levels - Add tests for influxdb.go and MetricsStages * Add login duration metric, ingest tag validation, and duration bounds - Add netbird_login measurement recording login/auth duration to management server, with success/failure result tag - Validate InfluxDB tags against per-measurement allowlists in ingest server to prevent arbitrary tag injection - Cap all duration fields (_seconds) at 300s instead of only total_seconds - Add ingest server tests for tag/field validation, bounds, and auth Add arch tag to all metrics * Fix Grafana dashboard: add arch to drop columns, add login panels * Validate NB_METRICS_SERVER_URL is an absolute HTTP(S) URL * Address review comments: fix README wording, update stale comments * Clarify env var precedence does not bypass remote config eligibility * Remove accidentally committed pprof files --------- Co-authored-by: Viktor Liu <viktor@netbird.io>	2026-03-22 12:45:41 +01:00
Viktor Liu	b550a2face	[management, proxy] Add require_subdomain capability for proxy clusters (#5628 )	2026-03-20 11:29:50 +01:00
Viktor Liu	ab77508950	[client] Add env var for management gRPC max receive message size (#5622 )	2026-03-19 17:33:50 +01:00
Viktor Liu	387e374e4b	[proxy, management] Add header auth, access restrictions, and session idle timeout (#5587 )	2026-03-16 15:22:00 +01:00
Viktor Liu	3e6baea405	[management,proxy,client] Add L4 capabilities (TLS/TCP/UDP) (#5530 )	2026-03-13 18:36:44 +01:00
Zoltan Papp	fe9b844511	[client] refactor auto update workflow (#5448 ) Auto-update logic moved out of the UI into a dedicated updatemanager.Manager service that runs in the connection layer. The UI no longer polls or checks for updates independently. The update manager supports three modes driven by the management server's auto-update policy: No policy set by mgm: checks GitHub for the latest version and notifies the user (previous behavior, now centralized) mgm enforces update: the "About" menu triggers installation directly instead of just downloading the file — user still initiates the action mgm forces update: installation proceeds automatically without user interaction updateManager lifecycle is now owned by daemon, giving the daemon server direct control via a new TriggerUpdate RPC Introduces EngineServices struct to group external service dependencies passed to NewEngine, reducing its argument count from 11 to 4	2026-03-13 17:01:28 +01:00
Pascal Fischer	f884299823	[proxy] refactor metrics and add usage logs (#5533 ) * New Features * Access logs now include bytes_upload and bytes_download (API and schemas updated, fields required). * Certificate issuance duration is now recorded as a metric. * Refactor * Metrics switched from Prometheus client to OpenTelemetry-backed meters; health endpoint now exposes OpenMetrics via OTLP exporter. * Tests * Metric tests updated to use OpenTelemetry Prometheus exporter and MeterProvider.	2026-03-09 18:45:45 +01:00
Viktor Liu	e601278117	[management,proxy] Add per-target options to reverse proxy (#5501 )	2026-03-05 10:03:26 +01:00
Viktor Liu	0ca59535f1	[management] Add reverse proxy services REST client (#5454 )	2026-02-28 13:04:58 +08:00

1 2 3

111 Commits