mirror of
https://github.com/netbirdio/netbird.git
synced 2026-04-18 16:26:38 +00:00
[client] Feature/client metrics (#5512)
* Add client metrics * Add client metrics system with OpenTelemetry and VictoriaMetrics support Implements a comprehensive client metrics system to track peer connection stages and performance. The system supports multiple backend implementations (OpenTelemetry, VictoriaMetrics, and no-op) and tracks detailed connection stage durations from creation through WireGuard handshake. Key changes: - Add metrics package with pluggable backend implementations - Implement OpenTelemetry metrics backend - Implement VictoriaMetrics metrics backend - Add no-op metrics implementation for disabled state - Track connection stages: creation, semaphore, signaling, connection ready, and WireGuard handshake - Move WireGuard watcher functionality to conn.go - Refactor engine to integrate metrics tracking - Add metrics export endpoint in debug server * Add signaling metrics tracking for initial and reconnection attempts * Reset connection stage timestamps during reconnections to exclude unnecessary metrics tracking * Delete otel lib from client * Update unit tests * Invoke callback on handshake success in WireGuard watcher * Add Netbird version tracking to client metrics Integrate Netbird version into VictoriaMetrics backend and metrics labels. Update `ClientMetrics` constructor and metric name formatting to include version information. * Add sync duration tracking to client metrics Introduce `RecordSyncDuration` for measuring sync message processing time. Update all metrics implementations (VictoriaMetrics, no-op) to support the new method. Refactor `ClientMetrics` to use `AgentInfo` for static agent data. * Remove no-op metrics implementation and simplify ClientMetrics constructor Eliminate unused `noopMetrics` and refactor `ClientMetrics` to always use the VictoriaMetrics implementation. Update associated logic to reflect these changes. * Add total duration tracking for connection attempts Calculate total duration for both initial connections and reconnections, accounting for different timestamp scenarios. Update `Export` method to include Prometheus HELP comments. * Add metrics push support to VictoriaMetrics integration * [client] anchor connection metrics to first signal received * Remove creation_to_semaphore connection stage metric The semaphore queuing stage (Created → SemaphoreAcquired) is no longer tracked. Connection metrics now start from SignalingReceived. Updated docs and Grafana dashboard accordingly. * [client] Add remote push config for metrics with version-based eligibility Introduce remoteconfig.Manager that fetches a remote JSON config to control metrics push interval and restrict pushing to a specific agent version range. When NB_METRICS_INTERVAL is set, remote config is bypassed entirely for local override. * [client] Add WASM-compatible NewClientMetrics implementation Replace NewClientMetrics in metrics.go with a WASM-specific stub in metrics_js.go, returning nil for compatibility with JS builds. Simplify method usage for WASM targets. * Add missing file * Update default case in DeploymentType.String to return "unknown" instead of "selfhosted" * [client] Rework metrics to use timestamped samples instead of histograms Replace cumulative Prometheus histograms with timestamped point-in-time samples that are pushed once and cleared. This fixes metrics for sparse events (connections/syncs that happen once at startup) where rate() and increase() produced incorrect or empty results. Changes: - Switch from VictoriaMetrics histogram library to raw Prometheus text format with explicit millisecond timestamps - Reset samples after successful push (no resending stale data) - Rename connection_to_handshake → connection_to_wg_handshake - Add netbird_peer_connection_count metric for ICE vs Relay tracking - Simplify dashboard: point-based scatter plots, donut pie chart - Add maxStalenessInterval=1m to VictoriaMetrics to prevent forward-fill - Fix deployment_type Unknown returning "selfhosted" instead of "unknown" - Fix inverted shouldPush condition in push.go * [client] Add InfluxDB metrics backend alongside VictoriaMetrics Add influxdb.go with timestamped line protocol export for sparse one-shot events. Restore victoria.go to use proper Prometheus histograms. Update Grafana dashboards, add InfluxDB datasource, and update docs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [client] Fix metrics issues and update dev docker setup - Fix StopPush not clearing push state, preventing restart - Fix race condition reading currentConnPriority without lock in recordConnectionMetrics - Fix stale comment referencing old metrics server URL - Update docker-compose for InfluxDB: add scoped tokens, .env config, init scripts - Rename docker-compose.victoria.yml to docker-compose.yml * [client] Add anonymised peer tracking to pushed metrics Introduce peer_id and connection_pair_id tags to InfluxDB metrics. Public keys are hashed (truncated SHA-256) for anonymisation. The connection pair ID is deterministic regardless of which side computes it, enabling deduplication of reconnections in the ICE vs Relay dashboard. Also pin Grafana to v11.6.0 for file-based provisioning and fix datasource UID references. * Remove unused dependencies from go.mod and go.sum * Refactor InfluxDB ingest pipeline: extract validation logic - Move line validation logic to `validateLine` and `validateField` helper functions. - Improve error handling with structured validation and clearer separation of concerns. - Add stderr redirection for error messages in `create-tokens.sh`. * Set non-root user in Dockerfile for Ingest service * Fix Windows CI: command line too long * Remove Victoria metrics * Add hashed peer ID as Authorization header in metrics push * Revert influxdb in docker compose * Enable gzip compression and authorization validation for metrics push and ingest * Reducate code of complexity * Update debug documentation to include metrics.txt description * Increase `maxBodySize` limit to 50 MB and update gzip reader wrapping logic * Refactor deployment type detection to use URL parsing for improved accuracy * Update readme * Throttle remote config retries on fetch failure * Preserve first WG handshake timestamp, ignore rekeys * Skip adding empty metrics.txt to debug bundle in debug mode * Update default metrics server URL to https://ingest.netbird.io * Atomic metrics export-and-reset to prevent sample loss between Export and Reset calls * Fix doc * Refactor Push configuration to improve clarity and enforce minimum push interval * Remove `minPushInterval` and update push interval validation logic * Revert ExportAndReset, it is acceptable data loss * Fix metrics review issues: rename env var, remove stale infra, add tests - Rename NB_METRICS_ENABLED to NB_METRICS_PUSH_ENABLED to clarify that collection is always active (for debug bundles) and only push is opt-in - Change default config URL from staging to production (ingest.netbird.io) - Delete broken Prometheus dashboard (used non-existent metric names) - Delete unused VictoriaMetrics datasource config - Replace committed .env with .env.example containing placeholder values - Wire Grafana admin credentials through env vars in docker-compose - Make metricsStages a pointer to prevent reset-vs-write race on reconnect - Fix typed-nil interface in debug bundle path (GetClientMetrics) - Use deterministic field order in InfluxDB Export (sorted keys) - Replace Authorization header with X-Peer-ID for metrics push - Fix ingest server timeout to use time.Second instead of float - Fix gzip double-close, stale comments, trim log levels - Add tests for influxdb.go and MetricsStages * Add login duration metric, ingest tag validation, and duration bounds - Add netbird_login measurement recording login/auth duration to management server, with success/failure result tag - Validate InfluxDB tags against per-measurement allowlists in ingest server to prevent arbitrary tag injection - Cap all duration fields (*_seconds) at 300s instead of only total_seconds - Add ingest server tests for tag/field validation, bounds, and auth * Add arch tag to all metrics * Fix Grafana dashboard: add arch to drop columns, add login panels * Validate NB_METRICS_SERVER_URL is an absolute HTTP(S) URL * Address review comments: fix README wording, update stale comments * Clarify env var precedence does not bypass remote config eligibility * Remove accidentally committed pprof files --------- Co-authored-by: Viktor Liu <viktor@netbird.io>
This commit is contained in:
16
client/internal/metrics/infra/.env.example
Normal file
16
client/internal/metrics/infra/.env.example
Normal file
@@ -0,0 +1,16 @@
|
||||
# Copy to .env and adjust values before running docker compose
|
||||
|
||||
# InfluxDB admin (server-side only, never exposed to clients)
|
||||
INFLUXDB_ADMIN_PASSWORD=changeme
|
||||
INFLUXDB_ADMIN_TOKEN=changeme
|
||||
|
||||
# Grafana admin credentials
|
||||
GRAFANA_ADMIN_USER=admin
|
||||
GRAFANA_ADMIN_PASSWORD=changeme
|
||||
|
||||
# Remote config served by ingest at /config
|
||||
# Set CONFIG_METRICS_SERVER_URL to the ingest server's public address to enable
|
||||
CONFIG_METRICS_SERVER_URL=
|
||||
CONFIG_VERSION_SINCE=0.0.0
|
||||
CONFIG_VERSION_UNTIL=99.99.99
|
||||
CONFIG_PERIOD_MINUTES=5
|
||||
1
client/internal/metrics/infra/.gitignore
vendored
Normal file
1
client/internal/metrics/infra/.gitignore
vendored
Normal file
@@ -0,0 +1 @@
|
||||
.env
|
||||
194
client/internal/metrics/infra/README.md
Normal file
194
client/internal/metrics/infra/README.md
Normal file
@@ -0,0 +1,194 @@
|
||||
# Client Metrics
|
||||
|
||||
Internal documentation for the NetBird client metrics system.
|
||||
|
||||
## Overview
|
||||
|
||||
Client metrics track connection performance and sync durations using InfluxDB line protocol (`influxdb.go`). Each event is pushed once then cleared.
|
||||
|
||||
Metrics collection is always active (for debug bundles). Push to backend is:
|
||||
- Disabled by default (opt-in via `NB_METRICS_PUSH_ENABLED=true`)
|
||||
- Managed at daemon layer (survives engine restarts)
|
||||
|
||||
## Architecture
|
||||
|
||||
### Layer Separation
|
||||
|
||||
```text
|
||||
Daemon Layer (connect.go)
|
||||
├─ Creates ClientMetrics instance once
|
||||
├─ Starts/stops push lifecycle
|
||||
└─ Updates AgentInfo on profile switch
|
||||
│
|
||||
▼
|
||||
Engine Layer (engine.go)
|
||||
└─ Records metrics via ClientMetrics methods
|
||||
```
|
||||
|
||||
### Ingest Server
|
||||
|
||||
Clients do not talk to InfluxDB directly. An ingest server sits between clients and InfluxDB:
|
||||
|
||||
```text
|
||||
Client ──POST──▶ Ingest Server (:8087) ──▶ InfluxDB (internal)
|
||||
│
|
||||
├─ Validates line protocol
|
||||
├─ Allowlists measurements, fields, and tags
|
||||
├─ Rejects out-of-bound values
|
||||
└─ Serves remote config at /config
|
||||
```
|
||||
|
||||
- **No secret/token-based client auth** — the ingest server holds the InfluxDB token server-side. Clients must send a hashed peer ID via `X-Peer-ID` header.
|
||||
- **InfluxDB is not exposed** — only accessible within the docker network
|
||||
- Source: `ingest/main.go`
|
||||
|
||||
## Metrics Collected
|
||||
|
||||
### Connection Stage Timing
|
||||
|
||||
Measurement: `netbird_peer_connection`
|
||||
|
||||
| Field | Timestamps | Description |
|
||||
|-------|-----------|-------------|
|
||||
| `signaling_to_connection_seconds` | `SignalingReceived → ConnectionReady` | ICE/relay negotiation time after the first signal is received from the remote peer |
|
||||
| `connection_to_wg_handshake_seconds` | `ConnectionReady → WgHandshakeSuccess` | WireGuard cryptographic handshake latency once the transport layer is ready |
|
||||
| `total_seconds` | `SignalingReceived → WgHandshakeSuccess` | End-to-end connection time anchored at the first received signal |
|
||||
|
||||
Tags:
|
||||
- `deployment_type`: "cloud" | "selfhosted" | "unknown"
|
||||
- `connection_type`: "ice" | "relay"
|
||||
- `attempt_type`: "initial" | "reconnection"
|
||||
- `version`: NetBird version string
|
||||
- `os`: Operating system (linux, darwin, windows, android, ios, etc.)
|
||||
- `arch`: CPU architecture (amd64, arm64, etc.)
|
||||
|
||||
**Note:** `SignalingReceived` is set when the first offer or answer arrives from the remote peer (in both initial and reconnection paths). It excludes the potentially unbounded wait for the remote peer to come online.
|
||||
|
||||
### Sync Duration
|
||||
|
||||
Measurement: `netbird_sync`
|
||||
|
||||
| Field | Description |
|
||||
|-------|-------------|
|
||||
| `duration_seconds` | Time to process a sync message from management server |
|
||||
|
||||
Tags:
|
||||
- `deployment_type`: "cloud" | "selfhosted" | "unknown"
|
||||
- `version`: NetBird version string
|
||||
- `os`: Operating system (linux, darwin, windows, android, ios, etc.)
|
||||
- `arch`: CPU architecture (amd64, arm64, etc.)
|
||||
|
||||
### Login Duration
|
||||
|
||||
Measurement: `netbird_login`
|
||||
|
||||
| Field | Description |
|
||||
|-------|-------------|
|
||||
| `duration_seconds` | Time to complete the login/auth exchange with management server |
|
||||
|
||||
Tags:
|
||||
- `deployment_type`: "cloud" | "selfhosted" | "unknown"
|
||||
- `result`: "success" | "failure"
|
||||
- `version`: NetBird version string
|
||||
- `os`: Operating system (linux, darwin, windows, android, ios, etc.)
|
||||
- `arch`: CPU architecture (amd64, arm64, etc.)
|
||||
|
||||
## Buffer Limits
|
||||
|
||||
The InfluxDB backend limits in-memory sample storage to prevent unbounded growth when pushes fail:
|
||||
- **Max age:** Samples older than 5 days are dropped
|
||||
- **Max size:** Estimated buffer size capped at 5 MB (~20k samples)
|
||||
|
||||
## Configuration
|
||||
|
||||
### Client Environment Variables
|
||||
|
||||
| Variable | Default | Description |
|
||||
|----------|---------|-------------|
|
||||
| `NB_METRICS_PUSH_ENABLED` | `false` | Enable metrics push to backend |
|
||||
| `NB_METRICS_SERVER_URL` | *(from remote config)* | Ingest server URL (e.g., `https://ingest.netbird.io`) |
|
||||
| `NB_METRICS_INTERVAL` | *(from remote config)* | Push interval (e.g., "1m", "30m", "4h") |
|
||||
| `NB_METRICS_FORCE_SENDING` | `false` | Skip remote config, push unconditionally |
|
||||
| `NB_METRICS_CONFIG_URL` | `https://ingest.netbird.io/config` | Remote push config URL |
|
||||
|
||||
`NB_METRICS_SERVER_URL` and `NB_METRICS_INTERVAL` override their respective values but do not bypass remote config eligibility checks (version range). Use `NB_METRICS_FORCE_SENDING=true` to skip all remote config gating.
|
||||
|
||||
### Ingest Server Environment Variables
|
||||
|
||||
| Variable | Default | Description |
|
||||
|----------|---------|-------------|
|
||||
| `INGEST_LISTEN_ADDR` | `:8087` | Listen address |
|
||||
| `INFLUXDB_URL` | `http://influxdb:8086/api/v2/write?org=netbird&bucket=metrics&precision=ns` | InfluxDB write endpoint |
|
||||
| `INFLUXDB_TOKEN` | *(required)* | InfluxDB auth token (server-side only) |
|
||||
| `CONFIG_METRICS_SERVER_URL` | *(empty — disables /config)* | `server_url` in the remote config JSON (the URL clients push metrics to) |
|
||||
| `CONFIG_VERSION_SINCE` | `0.0.0` | Minimum client version to push metrics |
|
||||
| `CONFIG_VERSION_UNTIL` | `99.99.99` | Maximum client version to push metrics |
|
||||
| `CONFIG_PERIOD_MINUTES` | `5` | Push interval in minutes |
|
||||
|
||||
The ingest server serves a remote config JSON at `GET /config` when `CONFIG_METRICS_SERVER_URL` is set. Clients can use `NB_METRICS_CONFIG_URL=http://<ingest>/config` to fetch it.
|
||||
|
||||
### Configuration Precedence
|
||||
|
||||
For URL and Interval, the precedence is:
|
||||
1. **Environment variable** - `NB_METRICS_SERVER_URL` / `NB_METRICS_INTERVAL`
|
||||
2. **Remote config** - fetched from `NB_METRICS_CONFIG_URL`
|
||||
3. **Default** - 5 minute interval, URL from remote config
|
||||
|
||||
## Push Behavior
|
||||
|
||||
1. `StartPush()` spawns background goroutine with timer
|
||||
2. First push happens immediately on startup
|
||||
3. Periodically: `push()` → `Export()` → HTTP POST to ingest server
|
||||
4. On failure: log error, continue (non-blocking)
|
||||
5. On success: `Reset()` clears pushed samples
|
||||
6. `StopPush()` cancels context and waits for goroutine
|
||||
|
||||
Samples are collected with exact timestamps, pushed once, then cleared. No data is resent.
|
||||
|
||||
## Local Development Setup
|
||||
|
||||
### 1. Configure and Start Services
|
||||
|
||||
```bash
|
||||
# From this directory (client/internal/metrics/infra)
|
||||
cp .env.example .env
|
||||
# Edit .env to set INFLUXDB_ADMIN_PASSWORD, INFLUXDB_ADMIN_TOKEN, and GRAFANA_ADMIN_PASSWORD
|
||||
docker compose up -d
|
||||
```
|
||||
|
||||
This starts:
|
||||
- **Ingest server** on http://localhost:8087 — accepts client metrics (requires `X-Peer-ID` header, no secret/token auth)
|
||||
- **InfluxDB** — internal only, not exposed to host
|
||||
- **Grafana** on http://localhost:3001
|
||||
|
||||
### 2. Configure Client
|
||||
|
||||
```bash
|
||||
export NB_METRICS_PUSH_ENABLED=true
|
||||
export NB_METRICS_FORCE_SENDING=true
|
||||
export NB_METRICS_SERVER_URL=http://localhost:8087
|
||||
export NB_METRICS_INTERVAL=1m
|
||||
```
|
||||
|
||||
### 3. Run Client
|
||||
|
||||
```bash
|
||||
cd ../../../..
|
||||
go run ./client/ up
|
||||
```
|
||||
|
||||
### 4. View in Grafana
|
||||
|
||||
- **InfluxDB dashboard:** http://localhost:3001/d/netbird-influxdb-metrics
|
||||
|
||||
### 5. Verify Data
|
||||
|
||||
```bash
|
||||
# Query via InfluxDB (using admin token from .env)
|
||||
docker compose exec influxdb influx query \
|
||||
'from(bucket: "metrics") |> range(start: -1h)' \
|
||||
--org netbird
|
||||
|
||||
# Check ingest server health
|
||||
curl http://localhost:8087/health
|
||||
```
|
||||
69
client/internal/metrics/infra/docker-compose.yml
Normal file
69
client/internal/metrics/infra/docker-compose.yml
Normal file
@@ -0,0 +1,69 @@
|
||||
version: '3.8'
|
||||
|
||||
services:
|
||||
ingest:
|
||||
container_name: ingest
|
||||
build:
|
||||
context: ./ingest
|
||||
ports:
|
||||
- "8087:8087"
|
||||
environment:
|
||||
- INGEST_LISTEN_ADDR=:8087
|
||||
- INFLUXDB_URL=http://influxdb:8086/api/v2/write?org=netbird&bucket=metrics&precision=ns
|
||||
- INFLUXDB_TOKEN=${INFLUXDB_ADMIN_TOKEN:?required}
|
||||
- CONFIG_METRICS_SERVER_URL=${CONFIG_METRICS_SERVER_URL:-}
|
||||
- CONFIG_VERSION_SINCE=${CONFIG_VERSION_SINCE:-0.0.0}
|
||||
- CONFIG_VERSION_UNTIL=${CONFIG_VERSION_UNTIL:-99.99.99}
|
||||
- CONFIG_PERIOD_MINUTES=${CONFIG_PERIOD_MINUTES:-5}
|
||||
depends_on:
|
||||
- influxdb
|
||||
restart: unless-stopped
|
||||
networks:
|
||||
- metrics
|
||||
|
||||
influxdb:
|
||||
container_name: influxdb
|
||||
image: influxdb:2
|
||||
# No ports exposed — only accessible within the metrics network
|
||||
volumes:
|
||||
- influxdb-data:/var/lib/influxdb2
|
||||
- ./influxdb/scripts:/docker-entrypoint-initdb.d
|
||||
environment:
|
||||
- DOCKER_INFLUXDB_INIT_MODE=setup
|
||||
- DOCKER_INFLUXDB_INIT_USERNAME=admin
|
||||
- DOCKER_INFLUXDB_INIT_PASSWORD=${INFLUXDB_ADMIN_PASSWORD:?required}
|
||||
- DOCKER_INFLUXDB_INIT_ORG=netbird
|
||||
- DOCKER_INFLUXDB_INIT_BUCKET=metrics
|
||||
- DOCKER_INFLUXDB_INIT_RETENTION=365d
|
||||
- DOCKER_INFLUXDB_INIT_ADMIN_TOKEN=${INFLUXDB_ADMIN_TOKEN:-}
|
||||
restart: unless-stopped
|
||||
networks:
|
||||
- metrics
|
||||
|
||||
grafana:
|
||||
container_name: grafana
|
||||
image: grafana/grafana:11.6.0
|
||||
ports:
|
||||
- "3001:3000"
|
||||
environment:
|
||||
- GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
|
||||
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:?required}
|
||||
- GF_USERS_ALLOW_SIGN_UP=false
|
||||
- GF_INSTALL_PLUGINS=
|
||||
- INFLUXDB_ADMIN_TOKEN=${INFLUXDB_ADMIN_TOKEN:-}
|
||||
volumes:
|
||||
- grafana-data:/var/lib/grafana
|
||||
- ./grafana/provisioning:/etc/grafana/provisioning
|
||||
depends_on:
|
||||
- influxdb
|
||||
restart: unless-stopped
|
||||
networks:
|
||||
- metrics
|
||||
|
||||
volumes:
|
||||
influxdb-data:
|
||||
grafana-data:
|
||||
|
||||
networks:
|
||||
metrics:
|
||||
driver: bridge
|
||||
@@ -0,0 +1,12 @@
|
||||
apiVersion: 1
|
||||
|
||||
providers:
|
||||
- name: 'NetBird Dashboards'
|
||||
orgId: 1
|
||||
folder: ''
|
||||
type: file
|
||||
disableDeletion: false
|
||||
updateIntervalSeconds: 10
|
||||
allowUiUpdates: true
|
||||
options:
|
||||
path: /etc/grafana/provisioning/dashboards/json
|
||||
@@ -0,0 +1,280 @@
|
||||
{
|
||||
"uid": "netbird-influxdb-metrics",
|
||||
"title": "NetBird Client Metrics (InfluxDB)",
|
||||
"tags": ["netbird", "connections", "influxdb"],
|
||||
"timezone": "browser",
|
||||
"panels": [
|
||||
{
|
||||
"id": 5,
|
||||
"title": "Sync Duration Extremes",
|
||||
"type": "stat",
|
||||
"datasource": {
|
||||
"type": "influxdb",
|
||||
"uid": "influxdb"
|
||||
},
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 0,
|
||||
"y": 0
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"query": "from(bucket: \"metrics\")\n |> range(start: v.timeRangeStart, stop: v.timeRangeStop)\n |> filter(fn: (r) => r._measurement == \"netbird_sync\" and r._field == \"duration_seconds\")\n |> map(fn: (r) => ({r with _value: r._value * 1000.0}))\n |> drop(columns: [\"deployment_type\", \"version\", \"os\", \"arch\", \"peer_id\"])\n |> min()\n |> set(key: \"_field\", value: \"Min\")",
|
||||
"refId": "A"
|
||||
},
|
||||
{
|
||||
"query": "from(bucket: \"metrics\")\n |> range(start: v.timeRangeStart, stop: v.timeRangeStop)\n |> filter(fn: (r) => r._measurement == \"netbird_sync\" and r._field == \"duration_seconds\")\n |> map(fn: (r) => ({r with _value: r._value * 1000.0}))\n |> drop(columns: [\"deployment_type\", \"version\", \"os\", \"arch\", \"peer_id\"])\n |> max()\n |> set(key: \"_field\", value: \"Max\")",
|
||||
"refId": "B"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ms",
|
||||
"min": 0
|
||||
}
|
||||
},
|
||||
"options": {
|
||||
"reduceOptions": {
|
||||
"calcs": ["lastNotNull"]
|
||||
},
|
||||
"colorMode": "value",
|
||||
"graphMode": "none",
|
||||
"textMode": "auto"
|
||||
}
|
||||
},
|
||||
{
|
||||
"id": 6,
|
||||
"title": "Total Connection Time Extremes",
|
||||
"type": "stat",
|
||||
"datasource": {
|
||||
"type": "influxdb",
|
||||
"uid": "influxdb"
|
||||
},
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 12,
|
||||
"y": 0
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"query": "from(bucket: \"metrics\")\n |> range(start: v.timeRangeStart, stop: v.timeRangeStop)\n |> filter(fn: (r) => r._measurement == \"netbird_peer_connection\" and r._field == \"total_seconds\")\n |> map(fn: (r) => ({r with _value: r._value * 1000.0}))\n |> drop(columns: [\"deployment_type\", \"connection_type\", \"attempt_type\", \"version\", \"os\", \"arch\", \"peer_id\", \"connection_pair_id\"])\n |> min()\n |> set(key: \"_field\", value: \"Min\")",
|
||||
"refId": "A"
|
||||
},
|
||||
{
|
||||
"query": "from(bucket: \"metrics\")\n |> range(start: v.timeRangeStart, stop: v.timeRangeStop)\n |> filter(fn: (r) => r._measurement == \"netbird_peer_connection\" and r._field == \"total_seconds\")\n |> map(fn: (r) => ({r with _value: r._value * 1000.0}))\n |> drop(columns: [\"deployment_type\", \"connection_type\", \"attempt_type\", \"version\", \"os\", \"arch\", \"peer_id\", \"connection_pair_id\"])\n |> max()\n |> set(key: \"_field\", value: \"Max\")",
|
||||
"refId": "B"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ms",
|
||||
"min": 0
|
||||
}
|
||||
},
|
||||
"options": {
|
||||
"reduceOptions": {
|
||||
"calcs": ["lastNotNull"]
|
||||
},
|
||||
"colorMode": "value",
|
||||
"graphMode": "none",
|
||||
"textMode": "auto"
|
||||
}
|
||||
},
|
||||
{
|
||||
"id": 1,
|
||||
"title": "Sync Duration",
|
||||
"type": "timeseries",
|
||||
"datasource": {
|
||||
"type": "influxdb",
|
||||
"uid": "influxdb"
|
||||
},
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 0,
|
||||
"y": 8
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"query": "from(bucket: \"metrics\")\n |> range(start: v.timeRangeStart, stop: v.timeRangeStop)\n |> filter(fn: (r) => r._measurement == \"netbird_sync\" and r._field == \"duration_seconds\")\n |> map(fn: (r) => ({r with _value: r._value * 1000.0}))\n |> drop(columns: [\"deployment_type\", \"version\", \"os\", \"arch\", \"peer_id\"])\n |> set(key: \"_field\", value: \"Sync Duration\")",
|
||||
"refId": "A"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ms",
|
||||
"min": 0,
|
||||
"custom": {
|
||||
"drawStyle": "points",
|
||||
"pointSize": 5
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"id": 4,
|
||||
"title": "ICE vs Relay",
|
||||
"type": "piechart",
|
||||
"datasource": {
|
||||
"type": "influxdb",
|
||||
"uid": "influxdb"
|
||||
},
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 12,
|
||||
"y": 8
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"query": "from(bucket: \"metrics\")\n |> range(start: v.timeRangeStart, stop: v.timeRangeStop)\n |> filter(fn: (r) => r._measurement == \"netbird_peer_connection\" and r._field == \"total_seconds\")\n |> drop(columns: [\"deployment_type\", \"attempt_type\", \"version\", \"os\", \"arch\", \"peer_id\"])\n |> group(columns: [\"connection_pair_id\"])\n |> last()\n |> group(columns: [\"connection_type\"])\n |> count()",
|
||||
"refId": "A"
|
||||
}
|
||||
],
|
||||
"options": {
|
||||
"reduceOptions": {
|
||||
"calcs": ["lastNotNull"]
|
||||
},
|
||||
"pieType": "donut",
|
||||
"tooltip": {
|
||||
"mode": "multi"
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"id": 2,
|
||||
"title": "Connection Stage Durations (avg)",
|
||||
"type": "bargauge",
|
||||
"datasource": {
|
||||
"type": "influxdb",
|
||||
"uid": "influxdb"
|
||||
},
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 0,
|
||||
"y": 16
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"query": "from(bucket: \"metrics\")\n |> range(start: v.timeRangeStart, stop: v.timeRangeStop)\n |> filter(fn: (r) => r._measurement == \"netbird_peer_connection\" and r._field == \"signaling_to_connection_seconds\")\n |> map(fn: (r) => ({r with _value: r._value * 1000.0}))\n |> drop(columns: [\"deployment_type\", \"connection_type\", \"attempt_type\", \"version\", \"os\", \"arch\", \"peer_id\", \"connection_pair_id\"])\n |> mean()\n |> drop(columns: [\"_start\", \"_stop\", \"_measurement\", \"_time\", \"_field\"])\n |> rename(columns: {_value: \"Avg Signaling to Connection\"})",
|
||||
"refId": "A"
|
||||
},
|
||||
{
|
||||
"query": "from(bucket: \"metrics\")\n |> range(start: v.timeRangeStart, stop: v.timeRangeStop)\n |> filter(fn: (r) => r._measurement == \"netbird_peer_connection\" and r._field == \"connection_to_wg_handshake_seconds\")\n |> map(fn: (r) => ({r with _value: r._value * 1000.0}))\n |> drop(columns: [\"deployment_type\", \"connection_type\", \"attempt_type\", \"version\", \"os\", \"arch\", \"peer_id\", \"connection_pair_id\"])\n |> mean()\n |> drop(columns: [\"_start\", \"_stop\", \"_measurement\", \"_time\", \"_field\"])\n |> rename(columns: {_value: \"Avg Connection to WG Handshake\"})",
|
||||
"refId": "B"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ms",
|
||||
"min": 0
|
||||
}
|
||||
},
|
||||
"options": {
|
||||
"reduceOptions": {
|
||||
"calcs": ["lastNotNull"]
|
||||
},
|
||||
"orientation": "horizontal",
|
||||
"displayMode": "gradient"
|
||||
}
|
||||
},
|
||||
{
|
||||
"id": 3,
|
||||
"title": "Total Connection Time",
|
||||
"type": "timeseries",
|
||||
"datasource": {
|
||||
"type": "influxdb",
|
||||
"uid": "influxdb"
|
||||
},
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 12,
|
||||
"y": 16
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"query": "from(bucket: \"metrics\")\n |> range(start: v.timeRangeStart, stop: v.timeRangeStop)\n |> filter(fn: (r) => r._measurement == \"netbird_peer_connection\" and r._field == \"total_seconds\")\n |> map(fn: (r) => ({r with _value: r._value * 1000.0}))\n |> drop(columns: [\"deployment_type\", \"connection_type\", \"attempt_type\", \"version\", \"os\", \"arch\", \"peer_id\", \"connection_pair_id\"])\n |> set(key: \"_field\", value: \"Total Connection Time\")",
|
||||
"refId": "A"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ms",
|
||||
"min": 0,
|
||||
"custom": {
|
||||
"drawStyle": "points",
|
||||
"pointSize": 5
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"id": 7,
|
||||
"title": "Login Duration",
|
||||
"type": "timeseries",
|
||||
"datasource": {
|
||||
"type": "influxdb",
|
||||
"uid": "influxdb"
|
||||
},
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 0,
|
||||
"y": 24
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"query": "from(bucket: \"metrics\")\n |> range(start: v.timeRangeStart, stop: v.timeRangeStop)\n |> filter(fn: (r) => r._measurement == \"netbird_login\" and r._field == \"duration_seconds\")\n |> map(fn: (r) => ({r with _value: r._value * 1000.0}))\n |> drop(columns: [\"deployment_type\", \"version\", \"os\", \"arch\", \"peer_id\"])\n |> set(key: \"_field\", value: \"Login Duration\")",
|
||||
"refId": "A"
|
||||
}
|
||||
],
|
||||
"fieldConfig": {
|
||||
"defaults": {
|
||||
"unit": "ms",
|
||||
"min": 0,
|
||||
"custom": {
|
||||
"drawStyle": "points",
|
||||
"pointSize": 5
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
{
|
||||
"id": 8,
|
||||
"title": "Login Success vs Failure",
|
||||
"type": "piechart",
|
||||
"datasource": {
|
||||
"type": "influxdb",
|
||||
"uid": "influxdb"
|
||||
},
|
||||
"gridPos": {
|
||||
"h": 8,
|
||||
"w": 12,
|
||||
"x": 12,
|
||||
"y": 24
|
||||
},
|
||||
"targets": [
|
||||
{
|
||||
"query": "from(bucket: \"metrics\")\n |> range(start: v.timeRangeStart, stop: v.timeRangeStop)\n |> filter(fn: (r) => r._measurement == \"netbird_login\" and r._field == \"duration_seconds\")\n |> drop(columns: [\"deployment_type\", \"version\", \"os\", \"arch\", \"peer_id\"])\n |> group(columns: [\"result\"])\n |> count()",
|
||||
"refId": "A"
|
||||
}
|
||||
],
|
||||
"options": {
|
||||
"reduceOptions": {
|
||||
"calcs": ["lastNotNull"]
|
||||
},
|
||||
"pieType": "donut",
|
||||
"tooltip": {
|
||||
"mode": "multi"
|
||||
}
|
||||
}
|
||||
}
|
||||
],
|
||||
"schemaVersion": 27,
|
||||
"version": 2,
|
||||
"refresh": "30s"
|
||||
}
|
||||
@@ -0,0 +1,15 @@
|
||||
apiVersion: 1
|
||||
|
||||
datasources:
|
||||
- name: InfluxDB
|
||||
uid: influxdb
|
||||
type: influxdb
|
||||
access: proxy
|
||||
url: http://influxdb:8086
|
||||
editable: true
|
||||
jsonData:
|
||||
version: Flux
|
||||
organization: netbird
|
||||
defaultBucket: metrics
|
||||
secureJsonData:
|
||||
token: ${INFLUXDB_ADMIN_TOKEN}
|
||||
25
client/internal/metrics/infra/influxdb/scripts/create-tokens.sh
Executable file
25
client/internal/metrics/infra/influxdb/scripts/create-tokens.sh
Executable file
@@ -0,0 +1,25 @@
|
||||
#!/bin/bash
|
||||
# Creates a scoped InfluxDB read-only token for Grafana.
|
||||
# Clients do not need a token — they push via the ingest server.
|
||||
|
||||
BUCKET_ID=$(influx bucket list --org netbird --name metrics --json | grep -oP '"id"\s*:\s*"\K[^"]+' | head -1)
|
||||
ORG_ID=$(influx org list --name netbird --json | grep -oP '"id"\s*:\s*"\K[^"]+' | head -1)
|
||||
|
||||
if [[ -z "$BUCKET_ID" ]] || [[ -z "$ORG_ID" ]]; then
|
||||
echo "ERROR: Could not determine bucket or org ID" >&2
|
||||
echo "BUCKET_ID=$BUCKET_ID ORG_ID=$ORG_ID" >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Create read-only token for Grafana
|
||||
READ_TOKEN=$(influx auth create \
|
||||
--org netbird \
|
||||
--read-bucket "$BUCKET_ID" \
|
||||
--description "Grafana read-only token" \
|
||||
--json | grep -oP '"token"\s*:\s*"\K[^"]+' | head -1)
|
||||
|
||||
echo ""
|
||||
echo "============================================"
|
||||
echo "GRAFANA READ-ONLY TOKEN:"
|
||||
echo "$READ_TOKEN"
|
||||
echo "============================================"
|
||||
10
client/internal/metrics/infra/ingest/Dockerfile
Normal file
10
client/internal/metrics/infra/ingest/Dockerfile
Normal file
@@ -0,0 +1,10 @@
|
||||
FROM golang:1.25-alpine AS build
|
||||
WORKDIR /app
|
||||
COPY go.mod main.go ./
|
||||
RUN CGO_ENABLED=0 go build -o ingest .
|
||||
|
||||
FROM alpine:3.20
|
||||
RUN adduser -D -H ingest
|
||||
COPY --from=build /app/ingest /usr/local/bin/ingest
|
||||
USER ingest
|
||||
ENTRYPOINT ["ingest"]
|
||||
11
client/internal/metrics/infra/ingest/go.mod
Normal file
11
client/internal/metrics/infra/ingest/go.mod
Normal file
@@ -0,0 +1,11 @@
|
||||
module github.com/netbirdio/netbird/client/internal/metrics/infra/ingest
|
||||
|
||||
go 1.25
|
||||
|
||||
require github.com/stretchr/testify v1.11.1
|
||||
|
||||
require (
|
||||
github.com/davecgh/go-spew v1.1.1 // indirect
|
||||
github.com/pmezard/go-difflib v1.0.0 // indirect
|
||||
gopkg.in/yaml.v3 v3.0.1 // indirect
|
||||
)
|
||||
10
client/internal/metrics/infra/ingest/go.sum
Normal file
10
client/internal/metrics/infra/ingest/go.sum
Normal file
@@ -0,0 +1,10 @@
|
||||
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
|
||||
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
|
||||
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
|
||||
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
|
||||
github.com/stretchr/testify v1.11.1 h1:7s2iGBzp5EwR7/aIZr8ao5+dra3wiQyKjjFuvgVKu7U=
|
||||
github.com/stretchr/testify v1.11.1/go.mod h1:wZwfW3scLgRK+23gO65QZefKpKQRnfz6sD981Nm4B6U=
|
||||
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405 h1:yhCVgyC4o1eVCa2tZl7eS0r+SDo693bJlVdllGtEeKM=
|
||||
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
|
||||
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
|
||||
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
|
||||
355
client/internal/metrics/infra/ingest/main.go
Normal file
355
client/internal/metrics/infra/ingest/main.go
Normal file
@@ -0,0 +1,355 @@
|
||||
package main
|
||||
|
||||
import (
|
||||
"bytes"
|
||||
"compress/gzip"
|
||||
"encoding/hex"
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"io"
|
||||
"log"
|
||||
"net/http"
|
||||
"os"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
)
|
||||
|
||||
const (
|
||||
defaultListenAddr = ":8087"
|
||||
defaultInfluxDBURL = "http://influxdb:8086/api/v2/write?org=netbird&bucket=metrics&precision=ns"
|
||||
maxBodySize = 50 * 1024 * 1024 // 50 MB max request body
|
||||
maxDurationSeconds = 300.0 // reject any duration field > 5 minutes
|
||||
peerIDLength = 16 // truncated SHA-256: 8 bytes = 16 hex chars
|
||||
maxTagValueLength = 64 // reject tag values longer than this
|
||||
)
|
||||
|
||||
type measurementSpec struct {
|
||||
allowedFields map[string]bool
|
||||
allowedTags map[string]bool
|
||||
}
|
||||
|
||||
var allowedMeasurements = map[string]measurementSpec{
|
||||
"netbird_peer_connection": {
|
||||
allowedFields: map[string]bool{
|
||||
"signaling_to_connection_seconds": true,
|
||||
"connection_to_wg_handshake_seconds": true,
|
||||
"total_seconds": true,
|
||||
},
|
||||
allowedTags: map[string]bool{
|
||||
"deployment_type": true,
|
||||
"connection_type": true,
|
||||
"attempt_type": true,
|
||||
"version": true,
|
||||
"os": true,
|
||||
"arch": true,
|
||||
"peer_id": true,
|
||||
"connection_pair_id": true,
|
||||
},
|
||||
},
|
||||
"netbird_sync": {
|
||||
allowedFields: map[string]bool{
|
||||
"duration_seconds": true,
|
||||
},
|
||||
allowedTags: map[string]bool{
|
||||
"deployment_type": true,
|
||||
"version": true,
|
||||
"os": true,
|
||||
"arch": true,
|
||||
"peer_id": true,
|
||||
},
|
||||
},
|
||||
"netbird_login": {
|
||||
allowedFields: map[string]bool{
|
||||
"duration_seconds": true,
|
||||
},
|
||||
allowedTags: map[string]bool{
|
||||
"deployment_type": true,
|
||||
"result": true,
|
||||
"version": true,
|
||||
"os": true,
|
||||
"arch": true,
|
||||
"peer_id": true,
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
func main() {
|
||||
listenAddr := envOr("INGEST_LISTEN_ADDR", defaultListenAddr)
|
||||
influxURL := envOr("INFLUXDB_URL", defaultInfluxDBURL)
|
||||
influxToken := os.Getenv("INFLUXDB_TOKEN")
|
||||
|
||||
if influxToken == "" {
|
||||
log.Fatal("INFLUXDB_TOKEN is required")
|
||||
}
|
||||
|
||||
client := &http.Client{Timeout: 10 * time.Second}
|
||||
|
||||
http.HandleFunc("/", handleIngest(client, influxURL, influxToken))
|
||||
|
||||
// Build config JSON once at startup from env vars
|
||||
configJSON := buildConfigJSON()
|
||||
if configJSON != nil {
|
||||
log.Printf("serving remote config at /config")
|
||||
}
|
||||
|
||||
http.HandleFunc("/config", func(w http.ResponseWriter, r *http.Request) {
|
||||
if r.Method != http.MethodGet {
|
||||
http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
|
||||
return
|
||||
}
|
||||
if configJSON == nil {
|
||||
http.Error(w, "config not configured", http.StatusNotFound)
|
||||
return
|
||||
}
|
||||
w.Header().Set("Content-Type", "application/json")
|
||||
w.Write(configJSON) //nolint:errcheck
|
||||
})
|
||||
|
||||
http.HandleFunc("/health", func(w http.ResponseWriter, _ *http.Request) {
|
||||
w.WriteHeader(http.StatusOK)
|
||||
fmt.Fprint(w, "ok") //nolint:errcheck
|
||||
})
|
||||
|
||||
log.Printf("ingest server listening on %s, forwarding to %s", listenAddr, influxURL)
|
||||
if err := http.ListenAndServe(listenAddr, nil); err != nil { //nolint:gosec
|
||||
log.Fatal(err)
|
||||
}
|
||||
}
|
||||
|
||||
func handleIngest(client *http.Client, influxURL, influxToken string) http.HandlerFunc {
|
||||
return func(w http.ResponseWriter, r *http.Request) {
|
||||
if r.Method != http.MethodPost {
|
||||
http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
|
||||
return
|
||||
}
|
||||
|
||||
if err := validateAuth(r); err != nil {
|
||||
http.Error(w, err.Error(), http.StatusUnauthorized)
|
||||
return
|
||||
}
|
||||
|
||||
body, err := readBody(r)
|
||||
if err != nil {
|
||||
http.Error(w, err.Error(), http.StatusBadRequest)
|
||||
return
|
||||
}
|
||||
if len(body) > maxBodySize {
|
||||
http.Error(w, "body too large", http.StatusRequestEntityTooLarge)
|
||||
return
|
||||
}
|
||||
|
||||
validated, err := validateLineProtocol(body)
|
||||
if err != nil {
|
||||
log.Printf("WARN validation failed from %s: %v", r.RemoteAddr, err)
|
||||
http.Error(w, err.Error(), http.StatusBadRequest)
|
||||
return
|
||||
}
|
||||
|
||||
forwardToInflux(w, r, client, influxURL, influxToken, validated)
|
||||
}
|
||||
}
|
||||
|
||||
func forwardToInflux(w http.ResponseWriter, r *http.Request, client *http.Client, influxURL, influxToken string, body []byte) {
|
||||
req, err := http.NewRequestWithContext(r.Context(), http.MethodPost, influxURL, bytes.NewReader(body))
|
||||
if err != nil {
|
||||
log.Printf("ERROR create request: %v", err)
|
||||
http.Error(w, "internal error", http.StatusInternalServerError)
|
||||
return
|
||||
}
|
||||
req.Header.Set("Content-Type", "text/plain; charset=utf-8")
|
||||
req.Header.Set("Authorization", "Token "+influxToken)
|
||||
|
||||
resp, err := client.Do(req)
|
||||
if err != nil {
|
||||
log.Printf("ERROR forward to influxdb: %v", err)
|
||||
http.Error(w, "upstream error", http.StatusBadGateway)
|
||||
return
|
||||
}
|
||||
defer func(Body io.ReadCloser) {
|
||||
_ = Body.Close()
|
||||
}(resp.Body)
|
||||
|
||||
w.WriteHeader(resp.StatusCode)
|
||||
io.Copy(w, resp.Body) //nolint:errcheck
|
||||
}
|
||||
|
||||
// validateAuth checks that the X-Peer-ID header contains a valid hashed peer ID.
|
||||
func validateAuth(r *http.Request) error {
|
||||
peerID := r.Header.Get("X-Peer-ID")
|
||||
if peerID == "" {
|
||||
return fmt.Errorf("missing X-Peer-ID header")
|
||||
}
|
||||
if len(peerID) != peerIDLength {
|
||||
return fmt.Errorf("invalid X-Peer-ID header length")
|
||||
}
|
||||
if _, err := hex.DecodeString(peerID); err != nil {
|
||||
return fmt.Errorf("invalid X-Peer-ID header format")
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// readBody reads the request body, decompressing gzip if Content-Encoding indicates it.
|
||||
func readBody(r *http.Request) ([]byte, error) {
|
||||
reader := io.LimitReader(r.Body, maxBodySize+1)
|
||||
|
||||
if r.Header.Get("Content-Encoding") == "gzip" {
|
||||
gz, err := gzip.NewReader(reader)
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("invalid gzip: %w", err)
|
||||
}
|
||||
defer gz.Close()
|
||||
reader = io.LimitReader(gz, maxBodySize+1)
|
||||
}
|
||||
|
||||
return io.ReadAll(reader)
|
||||
}
|
||||
|
||||
// validateLineProtocol parses InfluxDB line protocol lines,
|
||||
// whitelists measurements and fields, and checks value bounds.
|
||||
func validateLineProtocol(body []byte) ([]byte, error) {
|
||||
lines := strings.Split(strings.TrimSpace(string(body)), "\n")
|
||||
var valid []string
|
||||
|
||||
for _, line := range lines {
|
||||
line = strings.TrimSpace(line)
|
||||
if line == "" {
|
||||
continue
|
||||
}
|
||||
|
||||
if err := validateLine(line); err != nil {
|
||||
return nil, err
|
||||
}
|
||||
|
||||
valid = append(valid, line)
|
||||
}
|
||||
|
||||
if len(valid) == 0 {
|
||||
return nil, fmt.Errorf("no valid lines")
|
||||
}
|
||||
|
||||
return []byte(strings.Join(valid, "\n") + "\n"), nil
|
||||
}
|
||||
|
||||
func validateLine(line string) error {
|
||||
// line protocol: measurement,tag=val,tag=val field=val,field=val timestamp
|
||||
parts := strings.SplitN(line, " ", 3)
|
||||
if len(parts) < 2 {
|
||||
return fmt.Errorf("invalid line protocol: %q", truncate(line, 100))
|
||||
}
|
||||
|
||||
// parts[0] is "measurement,tag=val,tag=val"
|
||||
measurementAndTags := strings.Split(parts[0], ",")
|
||||
measurement := measurementAndTags[0]
|
||||
|
||||
spec, ok := allowedMeasurements[measurement]
|
||||
if !ok {
|
||||
return fmt.Errorf("unknown measurement: %q", measurement)
|
||||
}
|
||||
|
||||
// Validate tags (everything after measurement name in parts[0])
|
||||
for _, tagPair := range measurementAndTags[1:] {
|
||||
if err := validateTag(tagPair, measurement, spec.allowedTags); err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
|
||||
// Validate fields
|
||||
for _, pair := range strings.Split(parts[1], ",") {
|
||||
if err := validateField(pair, measurement, spec.allowedFields); err != nil {
|
||||
return err
|
||||
}
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
func validateTag(pair, measurement string, allowedTags map[string]bool) error {
|
||||
kv := strings.SplitN(pair, "=", 2)
|
||||
if len(kv) != 2 {
|
||||
return fmt.Errorf("invalid tag: %q", pair)
|
||||
}
|
||||
|
||||
tagName := kv[0]
|
||||
if !allowedTags[tagName] {
|
||||
return fmt.Errorf("unknown tag %q in measurement %q", tagName, measurement)
|
||||
}
|
||||
|
||||
if len(kv[1]) > maxTagValueLength {
|
||||
return fmt.Errorf("tag value too long for %q: %d > %d", tagName, len(kv[1]), maxTagValueLength)
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
func validateField(pair, measurement string, allowedFields map[string]bool) error {
|
||||
kv := strings.SplitN(pair, "=", 2)
|
||||
if len(kv) != 2 {
|
||||
return fmt.Errorf("invalid field: %q", pair)
|
||||
}
|
||||
|
||||
fieldName := kv[0]
|
||||
if !allowedFields[fieldName] {
|
||||
return fmt.Errorf("unknown field %q in measurement %q", fieldName, measurement)
|
||||
}
|
||||
|
||||
val, err := strconv.ParseFloat(kv[1], 64)
|
||||
if err != nil {
|
||||
return fmt.Errorf("invalid field value %q for %q", kv[1], fieldName)
|
||||
}
|
||||
if val < 0 {
|
||||
return fmt.Errorf("negative value for %q: %g", fieldName, val)
|
||||
}
|
||||
if strings.HasSuffix(fieldName, "_seconds") && val > maxDurationSeconds {
|
||||
return fmt.Errorf("%q too large: %g > %g", fieldName, val, maxDurationSeconds)
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
// buildConfigJSON builds the remote config JSON from env vars.
|
||||
// Returns nil if required vars are not set.
|
||||
func buildConfigJSON() []byte {
|
||||
serverURL := os.Getenv("CONFIG_METRICS_SERVER_URL")
|
||||
versionSince := envOr("CONFIG_VERSION_SINCE", "0.0.0")
|
||||
versionUntil := envOr("CONFIG_VERSION_UNTIL", "99.99.99")
|
||||
periodMinutes := envOr("CONFIG_PERIOD_MINUTES", "5")
|
||||
|
||||
if serverURL == "" {
|
||||
return nil
|
||||
}
|
||||
|
||||
period, err := strconv.Atoi(periodMinutes)
|
||||
if err != nil || period <= 0 {
|
||||
log.Printf("WARN invalid CONFIG_PERIOD_MINUTES: %q, using 5", periodMinutes)
|
||||
period = 5
|
||||
}
|
||||
|
||||
cfg := map[string]any{
|
||||
"server_url": serverURL,
|
||||
"version-since": versionSince,
|
||||
"version-until": versionUntil,
|
||||
"period_minutes": period,
|
||||
}
|
||||
|
||||
data, err := json.Marshal(cfg)
|
||||
if err != nil {
|
||||
log.Printf("ERROR failed to marshal config: %v", err)
|
||||
return nil
|
||||
}
|
||||
return data
|
||||
}
|
||||
|
||||
func envOr(key, defaultVal string) string {
|
||||
if v := os.Getenv(key); v != "" {
|
||||
return v
|
||||
}
|
||||
return defaultVal
|
||||
}
|
||||
|
||||
func truncate(s string, n int) string {
|
||||
if len(s) <= n {
|
||||
return s
|
||||
}
|
||||
return s[:n] + "..."
|
||||
}
|
||||
124
client/internal/metrics/infra/ingest/main_test.go
Normal file
124
client/internal/metrics/infra/ingest/main_test.go
Normal file
@@ -0,0 +1,124 @@
|
||||
package main
|
||||
|
||||
import (
|
||||
"net/http"
|
||||
"strings"
|
||||
"testing"
|
||||
|
||||
"github.com/stretchr/testify/assert"
|
||||
"github.com/stretchr/testify/require"
|
||||
)
|
||||
|
||||
func TestValidateLine_ValidPeerConnection(t *testing.T) {
|
||||
line := `netbird_peer_connection,deployment_type=cloud,connection_type=ice,attempt_type=initial,version=1.0.0,os=linux,arch=amd64,peer_id=abcdef0123456789,connection_pair_id=pair1234 signaling_to_connection_seconds=1.5,connection_to_wg_handshake_seconds=0.5,total_seconds=2 1234567890`
|
||||
assert.NoError(t, validateLine(line))
|
||||
}
|
||||
|
||||
func TestValidateLine_ValidSync(t *testing.T) {
|
||||
line := `netbird_sync,deployment_type=selfhosted,version=2.0.0,os=darwin,arch=arm64,peer_id=abcdef0123456789 duration_seconds=1.5 1234567890`
|
||||
assert.NoError(t, validateLine(line))
|
||||
}
|
||||
|
||||
func TestValidateLine_ValidLogin(t *testing.T) {
|
||||
line := `netbird_login,deployment_type=cloud,result=success,version=1.0.0,os=linux,arch=amd64,peer_id=abcdef0123456789 duration_seconds=3.2 1234567890`
|
||||
assert.NoError(t, validateLine(line))
|
||||
}
|
||||
|
||||
func TestValidateLine_UnknownMeasurement(t *testing.T) {
|
||||
line := `unknown_metric,foo=bar value=1 1234567890`
|
||||
err := validateLine(line)
|
||||
require.Error(t, err)
|
||||
assert.Contains(t, err.Error(), "unknown measurement")
|
||||
}
|
||||
|
||||
func TestValidateLine_UnknownTag(t *testing.T) {
|
||||
line := `netbird_sync,deployment_type=cloud,evil_tag=injected,version=1.0.0,os=linux,arch=amd64,peer_id=abc duration_seconds=1.5 1234567890`
|
||||
err := validateLine(line)
|
||||
require.Error(t, err)
|
||||
assert.Contains(t, err.Error(), "unknown tag")
|
||||
}
|
||||
|
||||
func TestValidateLine_UnknownField(t *testing.T) {
|
||||
line := `netbird_sync,deployment_type=cloud,version=1.0.0,os=linux,arch=amd64,peer_id=abc injected_field=1 1234567890`
|
||||
err := validateLine(line)
|
||||
require.Error(t, err)
|
||||
assert.Contains(t, err.Error(), "unknown field")
|
||||
}
|
||||
|
||||
func TestValidateLine_NegativeValue(t *testing.T) {
|
||||
line := `netbird_sync,deployment_type=cloud,version=1.0.0,os=linux,arch=amd64,peer_id=abc duration_seconds=-1.5 1234567890`
|
||||
err := validateLine(line)
|
||||
require.Error(t, err)
|
||||
assert.Contains(t, err.Error(), "negative")
|
||||
}
|
||||
|
||||
func TestValidateLine_DurationTooLarge(t *testing.T) {
|
||||
line := `netbird_sync,deployment_type=cloud,version=1.0.0,os=linux,arch=amd64,peer_id=abc duration_seconds=999 1234567890`
|
||||
err := validateLine(line)
|
||||
require.Error(t, err)
|
||||
assert.Contains(t, err.Error(), "too large")
|
||||
}
|
||||
|
||||
func TestValidateLine_TotalSecondsTooLarge(t *testing.T) {
|
||||
line := `netbird_peer_connection,deployment_type=cloud,connection_type=ice,attempt_type=initial,version=1.0.0,os=linux,arch=amd64,peer_id=abc,connection_pair_id=pair total_seconds=500 1234567890`
|
||||
err := validateLine(line)
|
||||
require.Error(t, err)
|
||||
assert.Contains(t, err.Error(), "too large")
|
||||
}
|
||||
|
||||
func TestValidateLine_TagValueTooLong(t *testing.T) {
|
||||
longTag := strings.Repeat("a", maxTagValueLength+1)
|
||||
line := `netbird_sync,deployment_type=` + longTag + `,version=1.0.0,os=linux,arch=amd64,peer_id=abc duration_seconds=1.5 1234567890`
|
||||
err := validateLine(line)
|
||||
require.Error(t, err)
|
||||
assert.Contains(t, err.Error(), "tag value too long")
|
||||
}
|
||||
|
||||
func TestValidateLineProtocol_MultipleLines(t *testing.T) {
|
||||
body := []byte(
|
||||
"netbird_sync,deployment_type=cloud,version=1.0.0,os=linux,arch=amd64,peer_id=abc duration_seconds=1.5 1234567890\n" +
|
||||
"netbird_login,deployment_type=cloud,result=success,version=1.0.0,os=linux,arch=amd64,peer_id=abc duration_seconds=2.0 1234567890\n",
|
||||
)
|
||||
validated, err := validateLineProtocol(body)
|
||||
require.NoError(t, err)
|
||||
assert.Contains(t, string(validated), "netbird_sync")
|
||||
assert.Contains(t, string(validated), "netbird_login")
|
||||
}
|
||||
|
||||
func TestValidateLineProtocol_RejectsOnBadLine(t *testing.T) {
|
||||
body := []byte(
|
||||
"netbird_sync,deployment_type=cloud,version=1.0.0,os=linux,arch=amd64,peer_id=abc duration_seconds=1.5 1234567890\n" +
|
||||
"evil_metric,foo=bar value=1 1234567890\n",
|
||||
)
|
||||
_, err := validateLineProtocol(body)
|
||||
require.Error(t, err)
|
||||
}
|
||||
|
||||
func TestValidateAuth(t *testing.T) {
|
||||
tests := []struct {
|
||||
name string
|
||||
peerID string
|
||||
wantErr bool
|
||||
}{
|
||||
{"valid hex", "abcdef0123456789", false},
|
||||
{"empty", "", true},
|
||||
{"too short", "abcdef01234567", true},
|
||||
{"too long", "abcdef01234567890", true},
|
||||
{"invalid hex", "ghijklmnopqrstuv", true},
|
||||
}
|
||||
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
r, _ := http.NewRequest(http.MethodPost, "/", nil)
|
||||
if tt.peerID != "" {
|
||||
r.Header.Set("X-Peer-ID", tt.peerID)
|
||||
}
|
||||
err := validateAuth(r)
|
||||
if tt.wantErr {
|
||||
require.Error(t, err)
|
||||
} else {
|
||||
require.NoError(t, err)
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
||||
Reference in New Issue
Block a user