finalized implemntation plan

This commit is contained in:
Ashley Mensah
2026-03-02 13:46:19 +01:00
parent 0b21498b39
commit 721aa41361

335
idp-migration-plan.md Normal file
View File

@@ -0,0 +1,335 @@
# Plan: Standalone IdP Migration Tool (External IdP → Embedded DEX)
## Context
**Target repo:** `/Users/ashleymensah/Documents/netbird-repos/netbird` (main repo, not the fork)
Self-hosted NetBird users migrating from an external IdP (Zitadel, Keycloak, Okta, etc.) to NetBird's embedded DEX-based IdP need a way to re-key all user IDs in the database. A colleague's fork at `/Users/ashleymensah/Documents/netbird-repos/nico-netbird/netbird` has a prototype that runs inside management as an AfterInit hook, but this has a chicken-and-egg problem (enabling EmbeddedIdP causes management to initialize DEX before migration runs → startup failure).
This plan creates a **standalone CLI tool** that runs with management stopped, re-keys all user IDs, then the user manually updates their management config and restarts. The main repo already has DEX/EmbeddedIdP infrastructure but is missing the store methods and migration logic — these need to be created (porting patterns from the fork).
**Note:** Does not need to work with the combined management container setup (that only supports embeddedIdP-enabled setups anyway).
---
## What the migration does
For each user, transforms the old ID (e.g., a Zitadel UUID) into DEX's encoded format:
```
newID = EncodeDexUserID(oldUserID, connectorID)
→ base64(protobuf{field1: userID, field2: connectorID})
```
This encoded ID is what DEX puts in JWT `sub` claims, ensuring continuity after switching IdPs.
---
## Tables requiring user ID updates
### Main store (store.db / PostgreSQL) — 10 columns
| # | Table | Column | Notes |
|---|-------|--------|-------|
| 1 | `users` | `id` (PK) | Primary key update, done last in transaction |
| 2 | `personal_access_tokens` | `user_id` (FK) | |
| 3 | `personal_access_tokens` | `created_by` | |
| 4 | `peers` | `user_id` | |
| 5 | `user_invites` | `created_by` | GORM `TableName()` returns `user_invites` (not `user_invite_records`) |
| 6 | `accounts` | `created_by` | |
| 7 | `proxy_access_tokens` | `created_by` | |
| 8 | `jobs` | `triggered_by` | |
| 9 | `policy_rules` | `authorized_user` | SSH policy user refs — missed by fork's implementation |
| 10 | `access_log_entries` | `user_id` | Reverse proxy access logs — missed by both fork and original plan |
### Activity store (events.db / PostgreSQL) — 3 columns
| # | Table | Column | Notes |
|---|-------|--------|-------|
| 10 | `events` | `initiator_id` | |
| 11 | `events` | `target_id` | |
| 12 | `deleted_users` | `id` (PK) | Raw SQL needed (GORM can't update PK via Model) |
**Total: 13 columns (10 main store + 3 activity store)**
### Verified NOT needing migration
- `policy_rules.authorized_groups` — maps group IDs → local Unix usernames (e.g., "root", "admin"), NOT NetBird user IDs
- `groups` / `group_peers` — store peer IDs, not user IDs
- `routes`, `nameserver_groups`, `setup_keys`, `posture_checks`, `networks`, `dns_settings` — no user ID fields
---
## What exists in main repo vs what needs to be created
| Component | Main repo status | Action |
|-----------|-----------------|--------|
| `EncodeDexUserID` / `DecodeDexUserID` | EXISTS at `idp/dex/provider.go` | No changes |
| EmbeddedIdP config + manager | EXISTS at `management/server/idp/embedded.go` | No changes |
| DEX provider | EXISTS at `idp/dex/provider.go` | No changes |
| Server bootstrapping (modules.go) | EXISTS at `management/internals/server/modules.go` | No changes |
| `Store.ListUsers()` interface method | **MISSING** | Add to `management/server/store/store.go` |
| `SqlStore.ListUsers()` implementation | **MISSING** | Add to `management/server/store/sql_store.go` |
| `Store.UpdateUserID()` interface method | **MISSING** | Add to `management/server/store/store.go` |
| `SqlStore.UpdateUserID()` implementation | **MISSING** | Add to `management/server/store/sql_store.go` |
| `activity.Store.UpdateUserID()` interface | **MISSING** | Add to `management/server/activity/store.go` |
| Activity `Store.UpdateUserID()` implementation | **MISSING** | Add to `management/server/activity/store/sql_store.go` |
| `InMemoryEventStore.UpdateUserID()` no-op | **MISSING** | Add to `management/server/activity/store.go` (compile-blocking) |
| `txDeferFKConstraints` helper | **MISSING** | Port from fork to `management/server/store/sql_store.go` |
| Store mock regeneration | **NEEDED** | Run `go generate ./management/server/store/...` after interface changes |
| Migration package | **MISSING** | Create at `management/server/idp/migration/` |
| Standalone CLI tool | **MISSING** | Create at `management/cmd/migrate-idp/` |
**Source of patterns:** Fork at `/Users/ashleymensah/Documents/netbird-repos/nico-netbird/netbird`
---
## Implementation plan
### Step 1: Add `ListUsers()` to store interface and implementation
**File:** `management/server/store/store.go` — add to Store interface:
```go
ListUsers(ctx context.Context) ([]*types.User, error)
```
**File:** `management/server/store/sql_store.go` — add implementation:
```go
func (s *SqlStore) ListUsers(ctx context.Context) ([]*types.User, error) {
var users []*types.User
if err := s.db.Find(&users).Error; err != nil {
return nil, status.Errorf(status.Internal, "failed to list users")
}
// Decrypt sensitive fields (Email, Name) so logging shows readable values.
// No-op when fieldEncrypt is nil (no encryption key configured).
for _, user := range users {
if err := user.DecryptSensitiveData(s.fieldEncrypt); err != nil {
return nil, status.Errorf(status.Internal, "failed to decrypt user data")
}
}
return users, nil
}
```
### Step 2: Add `UpdateUserID()` to store interface and implementation
**File:** `management/server/store/store.go` — add to Store interface:
```go
UpdateUserID(ctx context.Context, accountID, oldUserID, newUserID string) error
```
**File:** `management/server/store/sql_store.go` — add implementation (ported from fork, with `policy_rules` fix):
```go
func (s *SqlStore) UpdateUserID(ctx context.Context, accountID, oldUserID, newUserID string) error {
updates := []fkUpdate{
{&types.PersonalAccessToken{}, "user_id", "user_id = ?"},
{&types.PersonalAccessToken{}, "created_by", "created_by = ?"},
{&nbpeer.Peer{}, "user_id", "user_id = ?"},
{&types.UserInviteRecord{}, "created_by", "created_by = ?"},
{&types.Account{}, "created_by", "created_by = ?"},
{&types.ProxyAccessToken{}, "created_by", "created_by = ?"},
{&types.Job{}, "triggered_by", "triggered_by = ?"},
{&types.PolicyRule{}, "authorized_user", "authorized_user = ?"}, // missed by fork
{&accesslogs.AccessLogEntry{}, "user_id", "user_id = ?"}, // missed by both fork and original plan
}
// Transaction with deferred FK constraints, update FKs first, then users.id PK
// Note: txDeferFKConstraints helper must be ported from fork (does not exist in main repo)
// - SQLite: PRAGMA defer_foreign_keys = ON
// - PostgreSQL: SET CONSTRAINTS ALL DEFERRED (belt-and-suspenders; FK-first update order
// already handles non-deferrable constraints)
// - MySQL: handled by existing transaction() helper (SET FOREIGN_KEY_CHECKS = 0)
}
```
### Step 2b: Port `txDeferFKConstraints` helper
**File:** `management/server/store/sql_store.go` — add helper (ported from fork lines 842-853):
```go
func (s *SqlStore) txDeferFKConstraints(tx *gorm.DB) error {
// SQLite: defer FK checks until transaction commit
// PostgreSQL: defer constraints (belt-and-suspenders; update order handles non-deferrable)
// MySQL: already handled by transaction() wrapper
}
```
### Step 3: Add `UpdateUserID()` to activity store interface and implementation
**File:** `management/server/activity/store.go` — add to Store interface:
```go
UpdateUserID(ctx context.Context, oldUserID, newUserID string) error
```
**File:** `management/server/activity/store.go` — add no-op to `InMemoryEventStore` (compile-blocking):
```go
func (store *InMemoryEventStore) UpdateUserID(_ context.Context, _, _ string) error {
return nil
}
```
**File:** `management/server/activity/store/sql_store.go` — add implementation (ported from fork):
- Update `events.initiator_id` and `events.target_id` via GORM
- Update `deleted_users.id` via raw SQL (GORM can't update PK via Model)
- All in one transaction
### Step 3b: Regenerate store mocks
Run `go generate ./management/server/store/...` to regenerate `store_mock.go` with the new `ListUsers` and `UpdateUserID` methods. Without this, tests using the mock won't compile.
### Step 4: Create migration package
**New file:** `management/server/idp/migration/migration.go`
- Define narrow interfaces:
```go
type MainStoreUpdater interface {
ListUsers(ctx context.Context) ([]*types.User, error)
UpdateUserID(ctx context.Context, accountID, oldUserID, newUserID string) error
}
type ActivityStoreUpdater interface {
UpdateUserID(ctx context.Context, oldUserID, newUserID string) error
}
```
- `MigrationConfig` struct: `ConnectorID`, `DryRun`, `MainStore`, `ActivityStore`
- `MigrationResult` struct: `Migrated`, `Skipped` counts
- `Migrate(ctx, *MigrationConfig) (*MigrationResult, error)`:
1. List all users from main store
2. Reconciliation pass: for already-migrated users, ensure activity store is also updated
3. For each non-migrated user: encode new ID, update both stores
4. Return counts
- Idempotency: `DecodeDexUserID(user.Id)` succeeds → user already migrated, skip
- Empty-ID guard: skip users with `Id == ""` before the decode check (`DecodeDexUserID("")` succeeds with empty strings — edge case)
- Service users: `IsServiceUser=true` users get re-keyed like all others (they'll be looked up by the new DEX-encoded ID after migration). This is intentional — document in CLI help text.
- Uses `EncodeDexUserID` / `DecodeDexUserID` from `idp/dex/provider.go`
**New file:** `management/server/idp/migration/migration_test.go`
- Mock-based tests for `Migrate()` covering: normal migration, skip already-migrated, dry-run, reconciliation, empty user list, error handling
### Step 5: Build the standalone CLI tool
**New file:** `management/cmd/migrate-idp/main.go` (~200 lines)
CLI flags:
| Flag | Required | Default | Description |
|------|----------|---------|-------------|
| `--config` | Yes | `/etc/netbird/management.json` | Path to management config |
| `--connector-id` | Yes | — | DEX connector ID to encode into user IDs |
| `--dry-run` | No | `false` | Preview changes without writing |
| `--no-backup` | No | `false` | Skip automatic database backup |
| `--log-level` | No | `info` | Verbosity |
Flow:
1. Load management config JSON (reuse `util.ReadJsonWithEnvSub`)
2. Validate: connector-id is non-empty, DB is accessible
3. Open main store via `store.NewStore(ctx, engine, datadir, nil, false)` — nil metrics, run AutoMigrate
- `skipMigration=false` ensures schema is up-to-date (AutoMigrate is idempotent/non-destructive)
- Using `true` risks stale schema if user upgrades management + tool simultaneously
4. Call `store.SetFieldEncrypt(enc)` to enable field decryption (needed for `ListUsers` to return readable Email/Name for logging)
5. Open activity store via `activity_store.NewSqlStore(ctx, datadir, encryptionKey)`
- Gracefully handle missing activity store (e.g., `events.db` doesn't exist) — warn and skip activity migration
6. Backup databases (SQLite: file copy; PostgreSQL: print `pg_dump` instructions)
7. Call `migration.Migrate(ctx, cfg)`
8. Print summary and exit
**New file:** `management/cmd/migrate-idp/backup.go` (~60 lines)
- `backupSQLiteFile(srcPath)` — copies to `{src}.backup-{timestamp}`
### Step 6: Tests
- Unit tests in `migration_test.go` with mock interfaces
- Integration test in `management/cmd/migrate-idp/main_test.go` with real SQLite:
- Seed users, events, policy rules with `authorized_user`, access log entries with `user_id`
- Run migration, verify all 13 columns updated
- Run again, verify idempotent (0 new migrations)
- Test partial failure reconciliation
- Test missing activity store (graceful skip)
---
## User-facing migration procedure
```
1. Stop management: systemctl stop netbird-management
2. Dry-run: netbird-migrate-idp \
--config /etc/netbird/management.json \
--connector-id "oidc" \
--dry-run
3. Run migration: netbird-migrate-idp \
--config /etc/netbird/management.json \
--connector-id "oidc"
4. Update management.json: Add EmbeddedIdP config with a StaticConnector
whose ID matches the --connector-id used above (see below)
5. Start management: systemctl start netbird-management
```
### Why manual config is required (step 4)
The EmbeddedIdP config block isn't just about the connector — it includes deployment-specific
values that depend on your infrastructure: OIDC issuer URL (must match your public domain),
dashboard/CLI redirect URIs (depend on your reverse proxy setup), storage paths, the initial
owner account (email + bcrypt password hash), and whether local password auth is disabled.
Auto-generating these would require the tool to make assumptions about DNS, port config,
and proxy setup that could easily be wrong. The connector ID is the only piece the migration
tool owns (it's baked into user IDs). Everything else is infrastructure config that belongs
in the operator's hands. Getting any of these wrong means management still won't start.
---
## Pitfalls and mitigations
| Risk | Mitigation |
|------|------------|
| Management running during migration | Warn user; SQLite will return SQLITE_BUSY with clear error |
| Wrong connector ID | Dry-run shows exact ID transformations; backup enables rollback |
| Partial failure mid-migration | Idempotent: `DecodeDexUserID` detects already-migrated users; reconciliation pass fixes activity store lag |
| Large user count | Each user migrated in own transaction; progress every 100 users (not per-user to avoid log spam) |
| Missing encryption key for activity store | Read from management config's `DataStoreEncryptionKey` |
| Missing activity store database | Warn and skip activity migration; main store migration proceeds |
| Empty user ID in database | Explicit guard before decode check; `DecodeDexUserID("")` succeeds with empty strings |
| Re-running with different connector-id | Already-migrated users correctly skipped (decode succeeds). To change connector-id, restore from backup first |
| MySQL store engine | Supported — existing `transaction()` helper handles `SET FOREIGN_KEY_CHECKS = 0` |
| PostgreSQL non-deferrable FK constraints | Update order (FKs first, PK last) avoids constraint violations regardless of deferrability |
---
## Verification
1. **Unit tests:** Mock-based tests for migration logic (skip/migrate/dry-run/reconcile/empty-ID guard)
2. **Integration test:** Real SQLite databases seeded with test data, verify all 13 columns
3. **Manual test:** Run `--dry-run` on a copy of a real self-hosted deployment's databases
4. **Idempotency test:** Run migration twice, second run should report 0 migrations
5. **Policy rules test:** Seed `policy_rules.authorized_user` with old user ID, verify it's updated
6. **Access log test:** Seed `access_log_entries.user_id` with old user ID, verify it's updated
7. **Missing activity store test:** Run with missing `events.db`, verify main store migration succeeds with warning
---
## Key files (all paths relative to main repo)
**New files to create:**
- `management/server/idp/migration/migration.go` — migration interfaces + `Migrate()` function
- `management/server/idp/migration/migration_test.go` — unit tests
- `management/cmd/migrate-idp/main.go` — CLI entry point
- `management/cmd/migrate-idp/backup.go` — SQLite backup logic
- `management/cmd/migrate-idp/main_test.go` — integration tests
**Existing files to modify:**
- `management/server/store/store.go` — add `ListUsers()` and `UpdateUserID()` to Store interface
- `management/server/store/sql_store.go` — add `ListUsers()`, `UpdateUserID()`, and `txDeferFKConstraints()` implementations
- `management/server/activity/store.go` — add `UpdateUserID()` to Store interface + `InMemoryEventStore.UpdateUserID()` no-op
- `management/server/activity/store/sql_store.go` — add `UpdateUserID()` implementation
**Generated files to regenerate:**
- `management/server/store/store_mock.go` — run `go generate ./management/server/store/...` after interface changes
**Read-only references (port patterns from fork):**
- Fork's `management/server/store/sql_store.go:855-895` — `UpdateUserID()` pattern
- Fork's `management/server/activity/store/sql_store.go:230-254` — activity `UpdateUserID()` pattern
- Fork's `management/server/idp/migration/migration.go` — orchestration logic pattern
**Existing files used as-is (no changes):**
- `idp/dex/provider.go` — `EncodeDexUserID` / `DecodeDexUserID`
- `management/server/types/policyrule.go:88` — `AuthorizedUser` field
- `management/internals/modules/reverseproxy/accesslogs/accesslogentry.go:25` — `AccessLogEntry.UserId` field
- `management/server/idp/embedded.go` — EmbeddedIdP manager