mirror of
https://github.com/netbirdio/docs.git
synced 2026-04-16 07:26:35 +00:00
Adding new AI use case doc and images (#470)
* Ading new docs and images * Update title from NetBid to NetBird typo fix * Update link for 'Distributed AI on Kubernetes' distributed-multi-cloud-ai-argocd-microk8s-vllm * Rename file url scheme distributed-multi-cloud-ai-argocd-microk8s-vllm
Binary file not shown.
|
After Width: | Height: | Size: 138 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 389 KiB |
BIN
public/docs-static/img/use-cases/distributed-ai-argo/llm-geo.jpg
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 250 KiB |
@@ -273,6 +273,7 @@ export const docsNavigation = [
|
||||
{ title: 'Routing peers and Kubernetes', href: '/how-to/routing-peers-and-kubernetes'},
|
||||
{ title: 'NetBird Client on AWS ECS', href: '/how-to/examples'},
|
||||
{ title: 'NetBird on Mikrotik Router', href: '/how-to/client-on-mikrotik-router' },
|
||||
{ title: 'Distributed AI on Kubernetes', href: '/use-cases/distributed-multi-cloud-ai-argocd-microk8s-vllm' },
|
||||
{ title: 'Self-hosted vs. Cloud-hosted NetBird', href: '/selfhosted/self-hosted-vs-cloud-netbird' },
|
||||
],
|
||||
|
||||
|
||||
@@ -0,0 +1,698 @@
|
||||
import {Note} from "@/components/mdx";
|
||||
|
||||
# Distributed AI with NetBird, ArgoCD, MicroK8s, and vLLM
|
||||
|
||||
We are building a distributed AI infrastructure that connects GPU clusters across multiple cloud providers using Kubernetes orchestration. This setup uses ArgoCD as a GitOps control plane to manage workloads across multiple MicroK8s clusters, with NetBird providing secure zero-trust networking between all components.
|
||||
|
||||
<Note>
|
||||
This document is actively being changed and tested. Please see 'Known Issues and Future Improvements' at the bottom of the page.
|
||||
</Note>
|
||||
|
||||
<p>
|
||||
<img src="/docs-static/img/use-cases/distributed-ai-argo/ai-infra.jpg" alt="distributed AI infrastructure overview" className="imagewrapper"/>
|
||||
</p>
|
||||
|
||||
## Architecture Components
|
||||
|
||||
- **ArgoCD Control Plane**: GitOps tool with UI for managing and automatically syncing Kubernetes applications from Git repositories.
|
||||
- **MicroK8s Edge Clusters**: Lightweight Kubernetes clusters deployed across multiple cloud providers, each with GPU support for AI workloads
|
||||
- **vLLM Inference Service**: AI model serving deployed to each MicroK8s cluster for distributed inference
|
||||
- **GeoLocation DNS**: Geographic routing to direct traffic to the nearest available cluster
|
||||
|
||||
## Part 1: ArgoCD Control Plane Setup
|
||||
|
||||
### Create the GKE Cluster
|
||||
|
||||
First, we'll create a Google Kubernetes Engine (GKE) cluster to host our ArgoCD control plane. This cluster will be the central management point for all our edge MicroK8s clusters. You don't need to use GKE, but you will need to adapt these steps to work with your provider.
|
||||
|
||||
```bash
|
||||
export GCP_PROJECT=netbird-demos
|
||||
export GCP_CLUSTER_NAME=megamesh
|
||||
export GCP_REGION=europe-west4
|
||||
|
||||
gcloud container --project "${GCP_PROJECT}" clusters create "${GCP_CLUSTER_NAME}" \
|
||||
--no-enable-basic-auth --release-channel "regular" \
|
||||
--machine-type "e2-standard-2" --image-type "COS_CONTAINERD" --disk-type "pd-standard" \
|
||||
--disk-size "100" --metadata disable-legacy-endpoints=true \
|
||||
--num-nodes "1" --logging=SYSTEM,WORKLOAD --monitoring=SYSTEM --enable-ip-alias \
|
||||
--enable-autoscaling --min-nodes "1" --max-nodes "2" \
|
||||
--network "projects/${GCP_PROJECT}/global/networks/default" \
|
||||
--subnetwork "projects/${GCP_PROJECT}/regions/${GCP_REGION}/subnetworks/default" \
|
||||
--no-enable-intra-node-visibility --default-max-pods-per-node "110" \
|
||||
--security-posture=standard --workload-vulnerability-scanning=disabled \
|
||||
--no-enable-master-authorized-networks --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver \
|
||||
--enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 \
|
||||
--enable-managed-prometheus --enable-shielded-nodes \
|
||||
--region "${GCP_REGION}" --node-locations "${GCP_REGION}-b,${GCP_REGION}-c"
|
||||
```
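
Before any `kubectl` command can reach the new cluster, the local kubeconfig needs credentials for it. The following is a minimal sketch, assuming the `gcloud` CLI is already authenticated and the variables exported above are still set. The function name `fetch_gke_credentials` and the `GCLOUD` override are illustrative and not part of the original setup.

```bash
# Hypothetical helper: pulls kubeconfig credentials for the regional cluster.
# GCLOUD can be overridden (for example GCLOUD=echo) to preview the command
# without contacting Google Cloud.
fetch_gke_credentials() {
  "${GCLOUD:-gcloud}" container clusters get-credentials "${GCP_CLUSTER_NAME}" \
    --region "${GCP_REGION}" --project "${GCP_PROJECT}"
}
```

Running `fetch_gke_credentials && kubectl get nodes` afterwards is one way to confirm that the cluster is reachable.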
|
||||
|
||||
### Install NetBird Operator
|
||||
|
||||
NetBird provides the secure networking layer that lets users connect to our GKE control plane without exposing any services to the open internet.
|
||||
|
||||
**Prerequisites**
|
||||
|
||||
- Helm version 3+
|
||||
- kubectl version v1.11.3+
|
||||
- Access to a Kubernetes v1.11.3+ cluster
|
||||
- Cert Manager
|
||||
|
||||
Add helm repository:
|
||||
|
||||
```bash
|
||||
helm repo add netbirdio https://netbirdio.github.io/helms
|
||||
```
|
||||
|
||||
Install [**cert-manager**](https://cert-manager.io/docs/installation/#default-static-install) so that the Kubernetes API can communicate with the NetBird operator:
|
||||
|
||||
```bash
|
||||
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.17.0/cert-manager.yaml
|
||||
```
|
||||
|
||||
Add your NetBird API token. You can create a personal access token (PAT) by following the steps [**here**](https://docs.netbird.io/how-to/access-netbird-public-api#creating-a-service-user):
|
||||
|
||||
```bash
|
||||
kubectl create namespace netbird
kubectl -n netbird create secret generic netbird-mgmt-api-key --from-literal=NB_API_KEY="$(cat ~/nb-pat.secret)"
|
||||
```
|
||||
|
||||
Replace `~/nb-pat.secret` with the path to the file that holds your NetBird API key, then create a [**`values.yaml`**](https://github.com/netbirdio/kubernetes-operator/blob/main/examples/ingress/values.yaml) file. Run `helm show values netbirdio/kubernetes-operator` to see all available options.
|
||||
|
||||
```yaml
# managementURL: "https://netbird.example.io:443"
ingress:
  enabled: true
  router:
    enabled: true

netbirdAPI:
  keyFromSecret:
    name: "netbird-mgmt-api-key"
    key: "NB_API_KEY"
```
|
||||
|
||||
Install the operator with Helm:
|
||||
|
||||
```bash
|
||||
helm install --create-namespace -f values.yaml -n netbird netbird-operator netbirdio/kubernetes-operator
|
||||
```
|
||||
|
||||
**Expose Kubernetes Control Plane to your NetBird Network**
|
||||
|
||||
To access your Kubernetes control plane from a NetBird network, you can expose your Kubernetes control plane as a [**NetBird resource**](https://docs.netbird.io/how-to/networks#resources) by enabling the following option in the operator values:
|
||||
|
||||
```yaml
ingress:
  enabled: true
  kubernetesAPI:
    enabled: true
```
|
||||
|
||||
To learn more, check out the official docs on our [K8s Operator](https://docs.netbird.io/how-to/kubernetes-operator#using-helm).
|
||||
|
||||
### Install ArgoCD
|
||||
|
||||
ArgoCD is a declarative GitOps tool for managing and syncing Kubernetes clusters. We will install it on the GKE cluster created in the previous step (or on the cluster from your chosen provider). ArgoCD will monitor our Git repository and automatically sync the application state to our clusters.
|
||||
|
||||
```bash
|
||||
# argo
|
||||
kubectl create namespace argocd
|
||||
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
|
||||
# port-forward runs in the foreground; run the commands below in a second terminal
kubectl port-forward svc/argocd-server -n argocd 8080:443
|
||||
# Argo client
|
||||
brew install argocd
|
||||
argocd admin initial-password -n argocd
|
||||
argocd login localhost:8080
|
||||
# optional (update password)
|
||||
argocd account update-password
|
||||
```
|
||||
|
||||
This deploys all ArgoCD components (API server, application controller, and repo server), sets up the web UI for visual management, creates the initial admin credentials, and configures CLI access for command-line management.
|
||||
|
||||
### Configure NetBird Integration with ArgoCD
|
||||
|
||||
Now we need to integrate NetBird with ArgoCD in two places: the `argocd-server` service and the application controller StatefulSet.

**argocd-server** needs to be accessible over the NetBird network as a resource so that we can reach the user interface and manage our clusters. This is done with the `kubectl annotate` command. Annotations are key-value pairs that store arbitrary, non-identifying metadata about a Kubernetes object.
|
||||
|
||||
```bash
|
||||
# Expose service
|
||||
kubectl -n argocd annotate svc/argocd-server netbird.io/expose="true" netbird.io/groups="argo-management-ui"
|
||||
```
|
||||
|
||||
Next, we will enable sidecars. **Why Sidecars?** The application controller needs to make API calls to remote MicroK8s clusters. The sidecar provides transparent network access to those clusters through the NetBird mesh.

To enable sidecar functionality in your deployments, first generate a setup key, either via the UI (enable the **Ephemeral Peers** option) or by following [**this guide**](https://docs.netbird.io/how-to/register-machines-using-setup-keys), which covers setup keys in more detail. We will inject sidecars into the ArgoCD application controller so it can communicate with the remote MicroK8s clusters.

Note: We recommend reading the section of our [Kubernetes Operator docs on using sidecars](https://docs.netbird.io/how-to/kubernetes-operator#accessing-remote-services-using-sidecars) for more context and detail.

Next, create a secret in Kubernetes and add a new resource called `NBSetupKey`. The `NBSetupKey` name can then be referenced in your deployments or daemon sets to specify which setup key should be used when injecting a sidecar into your application pods. Below is an example of a secret and an `NBSetupKey` resource:
|
||||
|
||||
```yaml
# secret.yaml
apiVersion: v1
stringData:
  setupkey: EEEEEEEE-EEEE-EEEE-EEEE-EEEEEEEEEEEE
kind: Secret
metadata:
  name: app-setup-key
```
|
||||
|
||||
NBSetupKey:
|
||||
|
||||
```yaml
# nbsetup-key.yaml
apiVersion: netbird.io/v1
kind: NBSetupKey
metadata:
  name: app-setup-key
spec:
  # Optional, overrides management URL for this setupkey only
  # defaults to https://api.netbird.io
  # managementURL: https://netbird.example.com
  secretKeyRef:
    name: app-setup-key # Required
    key: setupkey # Required
```
|
||||
|
||||
Apply both manifests, then patch the application controller so that the operator injects the sidecar:
|
||||
|
||||
```bash
|
||||
kubectl -n argocd apply -f secret.yaml
|
||||
kubectl -n argocd apply -f nbsetup-key.yaml
|
||||
# Inject side-cars
|
||||
kubectl -n argocd patch statefulset argocd-application-controller --type merge -p '{"spec":{"template":{"metadata":{"annotations":{"netbird.io/setup-key":"app-setup-key","netbird.io/extra-dns-labels": "argocd"}}}}}'
|
||||
```
|
||||
|
||||
### Create NetBird Access Policies
|
||||
|
||||
In the NetBird dashboard, create two groups and configure policies:
|
||||
|
||||
**Example Policy Configuration:**
|
||||
|
||||
- Policy 1: argocd → microK8s GPUs (port 16443 for Kubernetes API)
|
||||
- Policy 2: IT Administrators → argo-management-ui (port 443)
|
||||
|
||||
Here the `argocd` group is applied to the GKE cluster running ArgoCD, the `microK8s GPUs` group is applied to the GPU-enabled VMs at the other providers, and `argo-management-ui` is the group created by the operator.
|
||||
|
||||
<p>
|
||||
<img src="/docs-static/img/use-cases/distributed-ai-argo/argo-access-policy.png" alt="argo netbird policies" className="imagewrapper"/>
|
||||
</p>
|
||||
|
||||
## Part 2: MicroK8s Edge Cluster Setup
|
||||
|
||||
### Understanding the Architecture
|
||||
|
||||
Each MicroK8s cluster represents a GPU-enabled VM in our distributed AI infrastructure. These nodes are deployed across different cloud providers (AWS, GCP, DigitalOcean, etc.) in various geographic regions. Each node:
|
||||
|
||||
- Runs a lightweight Kubernetes distribution (MicroK8s)
|
||||
- Has GPU hardware for AI inference workloads
|
||||
- Connects to the NetBird mesh for secure communication
|
||||
- Is managed by the central ArgoCD control plane
|
||||
|
||||
### VM Provisioning with User Data
|
||||
|
||||
When creating VMs in different cloud providers, use this user data script for automated provisioning. The script runs on first boot and configures everything needed for the node to join the Mega-Mesh infrastructure. Be sure to replace the `<SETUP-KEY>` placeholder with your own setup key.
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
sudo hostnamectl set-hostname telekom-germany-cluster
|
||||
|
||||
sudo apt update
|
||||
|
||||
sudo DEBIAN_FRONTEND=noninteractive apt install ca-certificates curl gnupg snapd vim ubuntu-drivers-common -y
|
||||
sudo curl -L https://pkgs.netbird.io/debian/public.key | sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/netbird.gpg
|
||||
sudo echo 'deb https://pkgs.netbird.io/debian stable main' | sudo tee /etc/apt/sources.list.d/netbird.list
|
||||
|
||||
sudo apt update && sudo apt install netbird -y
|
||||
exec 1>/root/startup.log 2>&1
|
||||
|
||||
sudo netbird up --setup-key <SETUP-KEY> --management-url https://api.netbird.io --hostname telekom-germany-cluster --allow-server-ssh
|
||||
|
||||
sudo ufw disable || true
|
||||
|
||||
sudo snap install microk8s --classic --channel 1.32/stable
|
||||
sudo microk8s status --wait-ready
|
||||
|
||||
sudo usermod -a -G microk8s ubuntu
|
||||
sudo chown -f -R ubuntu ~/.kube
|
||||
|
||||
microk8s disable ha-cluster --force
|
||||
sudo microk8s status --wait-ready
|
||||
|
||||
for addon in dns helm3 hostpath-storage; do
|
||||
microk8s enable $addon
|
||||
done
|
||||
|
||||
microk8s enable gpu --gpu-operator-version v25.10.0 --gpu-operator-set toolkit.env[3].name=RUNTIME_CONFIG_SOURCE --gpu-operator-set toolkit.env[3].value='file=/var/snap/microk8s/current/args/containerd.toml'
|
||||
|
||||
microk8s status --wait-ready
|
||||
|
||||
echo "End of startup.sh"
|
||||
```
|
||||
|
||||
**What This Script Does:**
|
||||
|
||||
1. **Hostname Configuration**: Sets a descriptive hostname for easy identification
|
||||
2. **System Updates**: Installs required packages and repositories
|
||||
3. **NetBird Installation**:
|
||||
- Adds NetBird's package repository
|
||||
- Installs the NetBird client
|
||||
- Connects to the NetBird mesh using a setup key
|
||||
- Enables SSH access through the mesh
|
||||
4. **Firewall Configuration**: Disables UFW to prevent conflicts with MicroK8s networking
|
||||
5. **MicroK8s Installation**: Installs MicroK8s from the `1.32/stable` channel
|
||||
6. **Permission Setup**: Adds the ubuntu user to the microk8s group
|
||||
7. **Essential Addons**:
|
||||
- **dns**: Internal DNS for service discovery
|
||||
- **helm3**: Package manager for Kubernetes applications
|
||||
- **hostpath-storage**: Local storage provisioner
|
||||
8. **GPU Support**: Installs NVIDIA GPU Operator for GPU workload support
|
||||
|
||||
**Monitoring GPU Setup:**
|
||||
|
||||
Depending on the provider, installing the drivers can take up to 5 minutes. Check the status with:
|
||||
|
||||
```bash
|
||||
microk8s.kubectl logs -n gpu-operator-resources -lapp=nvidia-operator-validator -c nvidia-operator-validator
|
||||
```
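
Once the validator reports success, each node should advertise an allocatable `nvidia.com/gpu` resource. The following is a minimal sketch for checking that, assuming the GPU Operator exposes the resource under that standard name. The function name `gpu_allocatable` and the `KUBECTL` override are illustrative.

```bash
# Hypothetical helper: prints "<node name><TAB><allocatable GPU count>" per node.
# KUBECTL defaults to the MicroK8s wrapper; override it to use another CLI.
gpu_allocatable() {
  "${KUBECTL:-microk8s.kubectl}" get nodes \
    -o 'jsonpath={range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
}
```

An empty second column suggests that the driver installation has not finished yet or that the node has no usable GPU.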
|
||||
|
||||
### Manual Cluster Registration (Current Process)
|
||||
|
||||
After the VM boots and MicroK8s is running, we need to register this cluster with ArgoCD. This involves creating a ServiceAccount, generating credentials, and registering them with ArgoCD.
|
||||
|
||||
**Future Optimization Notes:**
|
||||
|
||||
- Setup keys: Currently using one-shot keys per node; needs automation for scale
|
||||
- Pre-baked images: Creating custom VM images with software pre-installed would speed up provisioning significantly
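
Setup key creation could be scripted against the NetBird management API. The following is a rough sketch only: the `/api/setup-keys` endpoint, the `Authorization: Token` header, and the JSON field names reflect our reading of the public API and should be checked against the current API reference before use. The function names and the `NB_API_TOKEN` and `NB_API_URL` variables are hypothetical.

```bash
# Hypothetical helper: builds the JSON body for a single-use setup key.
# Arguments: key name, lifetime in seconds (defaults to one day).
setup_key_payload() {
  local name="$1" expires_in="${2:-86400}"
  printf '{"name":"%s","type":"one-off","expires_in":%d,"auto_groups":[],"usage_limit":1,"ephemeral":false}' \
    "$name" "$expires_in"
}

# Hypothetical wrapper: sends the payload to the management API.
# NB_API_TOKEN is assumed to hold a personal access token.
create_setup_key() {
  curl -sS -X POST "${NB_API_URL:-https://api.netbird.io}/api/setup-keys" \
    -H "Authorization: Token ${NB_API_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "$(setup_key_payload "$1")"
}
```

A provisioning pipeline could call `create_setup_key telekom-germany-cluster` once per node and template the returned key into the user data script. Note that `auto_groups` is left empty here and would need real group IDs for the access policies to apply.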
|
||||
|
||||
### Create Service Account and RBAC
|
||||
|
||||
This creates a ServiceAccount that ArgoCD will use to authenticate with the MicroK8s API server.
|
||||
|
||||
```bash
# Write the manifest to svc.yml
cat > svc.yml <<'EOF'
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: argocd
  namespace: kube-system
---
apiVersion: v1
kind: Secret
metadata:
  name: argocd
  namespace: kube-system
  annotations:
    kubernetes.io/service-account.name: argocd
type: kubernetes.io/service-account-token
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: argocd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: argocd
  namespace: kube-system
EOF

# Apply with:
microk8s.kubectl apply -f svc.yml
```
|
||||
|
||||
**What This Creates:**
|
||||
|
||||
- **ServiceAccount**: An identity for ArgoCD to use
|
||||
- **Secret**: Automatically generated token for authentication
|
||||
- **ClusterRoleBinding**: Grants cluster-admin permissions to ArgoCD (full control)
|
||||
|
||||
*Note: In production, you'd use more restricted RBAC permissions based on the principle of least privilege.*
|
||||
|
||||
### Extract the Authentication Token
|
||||
|
||||
```bash
|
||||
microk8s.kubectl -n kube-system get secret argocd -o yaml
|
||||
|
||||
# Output sample
|
||||
apiVersion: v1
|
||||
data:
|
||||
ca.crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUREekNDQWZlZ0F3SUJBZ0lVYnpNOW93RXN2QS8zSXpGc3R5Sm9uMGRUeU84d0RRWUpLb1pJaHZjTkFRRUwKQlFBd0Z6RVZNQk1HQTFVRUF3d01NVEF1TVRVeUxqRTRNeTR4TUI0WERUSTFNVEV3T0RBNU16WXhOVm9YRFRNMQpNVEV3TmpBNU16WXhOVm93RnpFVk1CTUdBMVVFQXd3TU1UQXVNVFV5TGpFNE15NHhNSUlCSWpBTkJna3Foa2lHCjl3MEJBUUVGQUFPQ0FROEFNSUlCQ2dLQ0FRRUF6VFN2eFBkV2N6anRHWlhTMC9VUCtnMjJUK1pNdXVZZnJWY2YKeWtDa0xpRXByeTAxSWFMNU43QUV5Q01tTXNuODE1eE1yWGRyNFhBcFJ6dE1DSTkwWnRBUWJWUnZUS3lmNm9lUApjb1pkYTN3cktoSFpoMGl5U3o2S1pQZTVqbTRGYllTNFhibHA5dTBEaFQyTVdiVCtaTXJmbXh1alYweUR2WHRDCnp2NGplM0Z4NTFlcVNEYTNza1p3UTdqZVpMZFlzNUNsUmVTRi8vbFRIam95WGFvR0I1d1BMcnJ6NmdlZnJLQXoKUU9pUTBsMVJNTG8xOE4rNEhRc2ZjZzVSaXRNVVNJMUFxRjY2dS81ZkVlNXd4S3BTZ2ZFb3ZrOHFmTmY5Ymd6ZgpOYkhzTEN5YmhPWHEyQ1RkMUpOQjFKQURTK2t1RnFrcWtxYVJnejQ3MlBZa0ZqT3NSd0lEQVFBQm8xTXdVVEFkCkJnTlZIUTRFRmdRVStJb3lQQW9lUkN4bEpBOXZiRnhmVjJWTHRRc3dId1lEVlIwakJCZ3dGb0FVK0lveVBBb2UKUkN4bEpBOXZiRnhmVjJWTHRRc3dEd1lEVlIwVEFRSC9CQVV3QXdFQi96QU5CZ2txaGtpRzl3MEJBUXNGQUFPQwpBUUVBY2pwemtpL2owODA3Qk1qZ3JJc011TGdOaHZ0R01ud3lpTGErWHh1cHZVQ3ZGMUt1aUZzVTR4MTV3WCtVCkxJRWdLb3B4QkVlZlNrUkozRDlqbUxyTHVJeHhkWDl4VlpJK0VYM0ZWdzBqTzFGSkJhMmJOcHIzL1U0bXgyRGUKclRQVkdjV09BcGxVTzhveGV4VnRDbDNWTytoWVhTWVk1RVQ4cDlUQWRmeloxcGdWdXVsUVZoSEJOWE8rV1dIbwozMXYwd09BYVA3cURnVTB6NFBYZkxXdFZzK3dnMWxhaUQ5UnptWlQ4WFo3eVZ5Q0hMVXNSZ1lFSE1qWFc4UXE5CkJRdGtNdDF4Rlg3TnJRUEovM2U5ZU9DOVRaTE9MVnpncXRFNDArQ3RoaEt5UklrelZZamRPRUp4Q1RzV3pQUE0KNktFbFJOVjNoM0NRY2JMN3pZdWt4N2R3b0E9PQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg==
|
||||
namespace: a3ViZS1zeXN0ZW0=
|
||||
token: ZXlKaGJHY2lPaUpTVXpJMU5pSXNJbXRwWkNJNkltTjRhRUl3UWtSaFEwTlRTbXhKU1dWQldVSTJUMXBFVlhOTVFXcHFPSEJoVDFkVlZVVmlRekZVUWtFaWZRLmV5SnBjM01pT2lKcmRXSmxjbTVsZEdWekwzTmxjblpwWTJWaFkyTnZkVzUwSWl3aWEzVmlaWEp1WlhSbGN5NXBieTl6WlhKMmFXTmxZV05qYjNWdWRDOXVZVzFsYzNCaFkyVWlPaUpyZFdKbExYTjVjM1JsYlNJc0ltdDFZbVZ5Ym1WMFpYTXVhVzh2YzJWeWRtbGpaV0ZqWTI5MWJuUXZjMlZqY21WMExtNWhiV1VpT2lKaGNtZHZZMlFpTENKcmRXSmxjbTVsZEdWekxtbHZMM05sY25acFkyVmhZMk52ZFc1MEwzTmxjblpwWTJVdFlXTmpiM1Z1ZEM1dVlXMWxJam9pWVhKbmIyTmtJaXdpYTNWaVpYSnVaWFJsY3k1cGJ5OXpaWEoyYVdObFlXTmpiM1Z1ZEM5elpYSjJhV05sTFdGalkyOTFiblF1ZFdsa0lqb2laamt3TURrMk1ESXRNekExTVMwME5URTNMVGd4T1dNdFlqVXdZekprWXpobE4yRmpJaXdpYzNWaUlqb2ljM2x6ZEdWdE9uTmxjblpwWTJWaFkyTnZkVzUwT210MVltVXRjM2x6ZEdWdE9tRnlaMjlqWkNKOS5hd0xIZE9ucWRVM1k1RlVkTHJEdUJJWEo0aThmWkFVNXBzeF9xRWRETUprd2RVOGhlcjRDRGx3dWpIWlU3WjdCZnpsSUNtS0Q2dXVJdkItelhNc1pPQkdRN0lVeDJwZGdobVhxNnIwVEhzcl9fb21RNWppT3pXajVnVzZxRXpKeTV6S0dMSlpxN3JaUjMwM3MydDg1a1BWM1Y3ODNxemxiZ0JjYkk2X3p2TG5vYi1RNVBBbzJxUHRyMDF3WjlqeFBpdlk3dXp1VHVPTVNHeTlVNnRCV244MnltY2RfSXUyQXFlNDVabVhrMDVmdFNCQTZKMkhreEJTTy1OOTNDLWtPU2JKbDh1U1N0ZGZ1ZmR5WVRVdlh4LUc4OUdTUmZIZEpLX1RnU0JYR1JiTGlRVmVhTEhMUkstdGk2UTVjSG9UeFhteDJSNGlpdHU0X0xFSVNCVzNia1E=
|
||||
kind: Secret
|
||||
metadata:
|
||||
annotations:
|
||||
kubectl.kubernetes.io/last-applied-configuration: |
|
||||
{"apiVersion":"v1","kind":"Secret","metadata":{"annotations":{"kubernetes.io/service-account.name":"argocd"},"name":"argocd","namespace":"kube-system"},"type":"kubernetes.io/service-account-token"}
|
||||
kubernetes.io/service-account.name: argocd
|
||||
kubernetes.io/service-account.uid: f9009602-3051-4517-819c-b50c2dc8e7ac
|
||||
creationTimestamp: "2025-11-08T10:50:01Z"
|
||||
name: argocd
|
||||
namespace: kube-system
|
||||
resourceVersion: "14797"
|
||||
uid: 059cc196-6d16-437a-8aba-6b45a8c1fd60
|
||||
type: kubernetes.io/service-account-token
|
||||
```
|
||||
|
||||
**Understanding the Secret:**
|
||||
|
||||
- **ca.crt**: The cluster's CA certificate (base64 encoded)
|
||||
- **token**: The JWT bearer token ArgoCD will use for authentication
|
||||
- **namespace**: The namespace where this secret exists
|
||||
|
||||
### Decode the Token
|
||||
|
||||
The token is stored as base64-encoded data. We need to decode it to get the actual JWT token.
|
||||
|
||||
```bash
|
||||
echo '<token base64>' | base64 -d
|
||||
mvlkdsmvlkdf.lnsafkasldnvksjdnvj....kdnvkljsd.dls...knksndkls
|
||||
```
|
||||
|
||||
This JWT token contains:
|
||||
|
||||
- The service account identity
|
||||
- The issuer and namespace claims (a Secret-based token like this one carries no expiry claim, so treat it as long-lived)
|
||||
- Signed proof of authenticity
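
To inspect those claims yourself, you can decode the middle segment of the token. The following is a minimal sketch, assuming a standard three-part JWT with a base64url-encoded payload. The function name `jwt_payload` is illustrative.

```bash
# Hypothetical helper: prints the JSON payload (claims) of a JWT.
# It does NOT verify the signature; it only decodes the second segment.
jwt_payload() {
  local seg
  # Take the payload segment and map base64url characters to plain base64.
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the '=' padding that JWTs strip.
  while [ $(( ${#seg} % 4 )) -ne 0 ]; do
    seg="${seg}="
  done
  printf '%s' "$seg" | base64 -d
}
```

For example, `jwt_payload "$TOKEN"` (with `TOKEN` holding the decoded bearer token) prints the claims as JSON, which can then be piped to `jq`.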
|
||||
|
||||
### Create Cluster Registration Secret
|
||||
|
||||
Now we create a secret in ArgoCD that contains the connection information for this MicroK8s cluster.
|
||||
|
||||
```yaml
# cluster.yaml
---
# telekom germany
apiVersion: v1
kind: Secret
metadata:
  name: telekom-germany-cluster
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
    megamesh: 'true'
    gpu: 'true'
type: Opaque
stringData:
  name: "telekom-germany-cluster"
  server: "https://100.121.1.222:16443"
  config: |
    {
      "bearerToken": "mvlkdsmvlkdf.lnsafkasldnvksjdnvj....kdnvkljsd.dls...knksndkls",
      "tlsClientConfig": {
        "insecure": true
      }
    }
```
|
||||
|
||||
**Key Components:**
|
||||
|
||||
- **name**: Human-readable cluster identifier
|
||||
- **server**: The Kubernetes API server endpoint (this is the NetBird mesh IP)
|
||||
|
||||
- **bearerToken**: The JWT token we extracted earlier
|
||||
- **insecure: true**: Skips TLS certificate verification (acceptable since we're using NetBird's encrypted mesh)
|
||||
- **labels**:
|
||||
- `megamesh: 'true'`: Used by ApplicationSet to target this cluster
|
||||
- `gpu: 'true'`: Indicates this cluster has GPU capabilities
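
Since every cluster secret differs only in its name, mesh IP, and token, the manifest can be generated from a template. The following is a minimal sketch of that idea. The function name `render_cluster_secret` is hypothetical, and the labels and port simply mirror the example in this section.

```bash
# Hypothetical helper: renders an ArgoCD cluster Secret manifest on stdout.
# Arguments: cluster name, NetBird IP of the node, decoded bearer token.
render_cluster_secret() {
  local name="$1" ip="$2" token="$3"
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: ${name}
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
    megamesh: 'true'
    gpu: 'true'
type: Opaque
stringData:
  name: "${name}"
  server: "https://${ip}:16443"
  config: |
    {
      "bearerToken": "${token}",
      "tlsClientConfig": {
        "insecure": true
      }
    }
EOF
}
```

The output can be piped straight into the control plane, for example `render_cluster_secret telekom-germany-cluster 100.121.1.222 "$TOKEN" | kubectl apply -f -`, which keeps the token out of files on disk.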
|
||||
|
||||
### Apply the Cluster Secret
|
||||
|
||||
```bash
|
||||
kubectl apply -f cluster.yaml
|
||||
```
|
||||
|
||||
Once applied, ArgoCD can now:
|
||||
|
||||
- See this cluster in its cluster list
|
||||
- Deploy applications to it
|
||||
- Monitor the health of deployed applications
|
||||
|
||||
**Verification:**
|
||||
|
||||
```bash
|
||||
argocd cluster list
|
||||
```
|
||||
|
||||
You should see your new cluster listed with a "Successful" connection status.
|
||||
|
||||
## Part 3: SSL Certificate and API Credentials
|
||||
|
||||
### Create SSL Certificate
|
||||
|
||||
For production HTTPS access to the vLLM inference endpoints, we need a valid SSL certificate.
|
||||
|
||||
```bash
|
||||
sudo certbot certonly --manual --preferred-challenges dns -d mega-mesh.net
|
||||
```
|
||||
|
||||
**What This Does:**
|
||||
|
||||
- Uses Certbot (Let's Encrypt) to generate a certificate
|
||||
- Manual mode with DNS challenge (you'll need to create a TXT record)
|
||||
- Valid for mega-mesh.net domain
|
||||
|
||||
**Why DNS Challenge?** This allows us to get a certificate without needing HTTP server access, which is perfect for distributed infrastructure.
|
||||
|
||||
### Create Hugging Face Access Token
|
||||
|
||||
vLLM needs to download models from Hugging Face, so we need an access token.
|
||||
|
||||
Create one at: [https://huggingface.co/docs/hub/en/security-tokens](https://huggingface.co/docs/hub/en/security-tokens)
|
||||
|
||||
### Create Kubernetes Secret for SSL and Tokens
|
||||
|
||||
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: mega-mesh-tls
  namespace: default
type: kubernetes.io/tls
stringData:
  tls.crt: |
    -----BEGIN CERTIFICATE-----
    MIIDhjC....
    -----END CERTIFICATE-----
    -----BEGIN CERTIFICATE-----
    MIIEVzCCAj....
    +VUwFj9tmWxyR/M=
    -----END CERTIFICATE-----
  tls.key: |
    -----BEGIN PRIVATE KEY-----
    MIGHAgEA....
    -----END PRIVATE KEY-----
```
|
||||
|
||||
**This Secret Contains:**
|
||||
|
||||
- **tls.crt**: The full certificate chain (your cert + intermediate CA cert)
|
||||
- **tls.key**: The private key for the certificate
|
||||
|
||||
Apply this to each MicroK8s cluster where vLLM will be deployed:
|
||||
|
||||
```bash
|
||||
microk8s.kubectl apply -f tls-secret.yaml
|
||||
```
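
As an alternative to pasting PEM blocks into a manifest, the secret can be created directly from the files Certbot writes. The following is a sketch under two assumptions: Certbot used its default `/etc/letsencrypt/live/<domain>/` layout, and those files are readable on the machine running the command. The function name `create_tls_secret` and the `KUBECTL` override are illustrative.

```bash
# Hypothetical helper: creates the mega-mesh-tls secret from Certbot's files.
# Argument: the certificate's domain name.
create_tls_secret() {
  local live="/etc/letsencrypt/live/$1"
  "${KUBECTL:-microk8s.kubectl}" create secret tls mega-mesh-tls \
    --namespace default \
    --cert="${live}/fullchain.pem" \
    --key="${live}/privkey.pem"
}
```

Calling `KUBECTL=echo create_tls_secret mega-mesh.net` prints the arguments without touching a cluster, which is a quick way to review them first.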
|
||||
|
||||
## Part 4: Configure ArgoCD Repository and ApplicationSet
|
||||
|
||||
### Register Private Git Repository
|
||||
|
||||
The application manifests are stored in a private GitHub repository. We need to give ArgoCD access to this repo.
|
||||
|
||||
**Repository Location:** [https://github.com/netbirdio/megamesh-argocd](https://github.com/netbirdio/megamesh-argocd)
|
||||
|
||||
**⚠️ Security Note:** This repository contains secrets and should NOT be public.
|
||||
|
||||
**Setup Steps:**
|
||||
|
||||
1. Generate an SSH key or GitHub Personal Access Token
|
||||
2. Follow ArgoCD's guide: [https://argo-cd.readthedocs.io/en/stable/user-guide/private-repositories/](https://argo-cd.readthedocs.io/en/stable/user-guide/private-repositories/)
|
||||
3. Register the repository in ArgoCD UI or via CLI
|
||||
|
||||
### Understanding ApplicationSets
|
||||
|
||||
ApplicationSet is an ArgoCD feature that automatically generates multiple Applications based on generators. Think of it as a template that creates ArgoCD Applications for each matching cluster.
|
||||
|
||||
**Why ApplicationSet?**
|
||||
|
||||
- Automatically deploy applications to new clusters as they're added
|
||||
- Single source of truth for application configuration
|
||||
- Reduces manual work when scaling to many clusters
|
||||
|
||||
### Create the ApplicationSet
|
||||
|
||||
```yaml
# application-set.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: megamesh-apps-per-cluster
  namespace: argocd
spec:
  # generators expand configuration with matching rules
  generators:
    - matrix:
        generators:
          - clusters:
              selector:
                matchLabels:
                  megamesh: "true"
          - git:
              # even though the repo is named "megamesh" in Argo,
              # ApplicationSet needs the *repo URL*:
              repoURL: git@github.com:netbirdio/megamesh-argocd.git
              revision: main
              directories:
                - path: "apps/*"
  template:
    metadata:
      name: '{{name}}-{{path.basename}}'
    spec:
      project: default
      source:
        # same repo URL here
        repoURL: git@github.com:netbirdio/megamesh-argocd.git
        targetRevision: main
        path: '{{path}}'
      destination:
        server: '{{server}}'
        namespace: '{{path.basename}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
```
|
||||
|
||||
**How This Works:**
|
||||
|
||||
1. **Matrix Generator**: Combines two generators to create applications
|
||||
- **Cluster Generator**: Finds all clusters with label `megamesh: "true"`
|
||||
- **Git Generator**: Finds all directories under `apps/` in the repo
|
||||
2. **Template Variables**:
|
||||
- `{{name}}`: Cluster name (e.g., "telekom-germany-cluster")
|
||||
- `{{server}}`: Cluster API endpoint
|
||||
- `{{path}}`: Directory path in Git (e.g., "apps/vllm")
|
||||
- `{{path.basename}}`: Just the directory name (e.g., "vllm")
|
||||
3. **Resulting Applications**:
|
||||
- For each cluster with `megamesh: "true"` label
|
||||
- For each app directory in the repo
|
||||
- Creates an Application named like: `telekom-germany-cluster-vllm`
|
||||
4. **Sync Policy**:
|
||||
- **automated**: Automatically syncs changes from Git
|
||||
- **prune: true**: Deletes resources not in Git
|
||||
- **selfHeal: true**: Reverts manual changes back to Git state
|
||||
- **CreateNamespace**: Automatically creates the target namespace
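
The matrix expansion described above is a plain cross product, which the following sketch imitates in shell. It is an illustration of the naming scheme only, not how ArgoCD implements the generator. The second cluster name and the `apps/monitoring` directory are made up for the example.

```bash
# Illustrative only: mimics the cluster x directory product of the matrix generator.
# Arguments: space-separated cluster names, space-separated Git directory paths.
expand_matrix() {
  local clusters="$1" paths="$2" cluster path
  for cluster in $clusters; do
    for path in $paths; do
      # Same shape as the template's '{{name}}-{{path.basename}}'
      printf '%s-%s\n' "$cluster" "${path##*/}"
    done
  done
}

expand_matrix "telekom-germany-cluster aws-us-cluster" "apps/vllm apps/monitoring"
```

With two clusters and two app directories this prints four Application names, so adding a cluster or a directory grows the set without editing the ApplicationSet.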
|
||||
|
||||
### Apply the ApplicationSet
|
||||
|
||||
```bash
|
||||
kubectl apply -f application-set.yaml
|
||||
```
|
||||
|
||||
**What Happens Next:**
|
||||
|
||||
1. ArgoCD scans for matching clusters
|
||||
2. Reads the Git repository structure
|
||||
3. Creates Applications for each combination
|
||||
4. Begins deploying to all matching clusters simultaneously
|
||||
5. Monitors health and sync status
|
||||
|
||||
## Part 5: Configure Geo-Location Based DNS
|
||||
|
||||
Geographic DNS routing ensures users connect to the nearest cluster for optimal latency.
|
||||
|
||||
<p>
|
||||
<img src="/docs-static/img/use-cases/distributed-ai-argo/llm-geo.jpg" alt="geolocation dns" className="imagewrapper"/>
|
||||
</p>
|
||||
|
||||
### GCP Cloud DNS Configuration
|
||||
|
||||
**How It Works:**
|
||||
|
||||
1. Multiple A records for the same domain (mega-mesh.net)
|
||||
2. Each record has a geographic region assigned
|
||||
3. GCP routes requests based on user's location
|
||||
4. Health checks ensure only healthy endpoints receive traffic
|
||||
|
||||
**Configuration Steps:**
|
||||
|
||||
1. Create health checks for each cluster endpoint
|
||||
2. Create DNS policy with geolocation routing
|
||||
3. Add A records for each cluster with region tags
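
The geolocation record itself can be created from the command line. The following is a sketch that assumes the `gcloud dns record-sets create` flags `--routing-policy-type=GEO` and `--routing-policy-data` with `region=ip` pairs separated by semicolons. It does not attach the health checks from step 1. The function names, the zone name, and the IP addresses are placeholders.

```bash
# Hypothetical helper: joins "region=ip" arguments into one
# semicolon-separated routing policy string.
geo_policy_data() {
  local out="" pair
  for pair in "$@"; do
    out="${out:+${out};}${pair}"
  done
  printf '%s' "$out"
}

# Hypothetical wrapper: creates a geo-routed A record for mega-mesh.net.
# Arguments: managed zone name, then one "region=ip" pair per cluster.
create_geo_record() {
  local zone="$1"
  shift
  "${GCLOUD:-gcloud}" dns record-sets create mega-mesh.net. \
    --zone="$zone" --type=A --ttl=300 \
    --routing-policy-type=GEO \
    --routing-policy-data="$(geo_policy_data "$@")"
}
```

A call might look like `create_geo_record my-zone europe-west4=203.0.113.10 us-east1=203.0.113.20`, where each IP is the public endpoint of one cluster.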
|
||||
|
||||
**Reference:** [https://docs.cloud.google.com/dns/docs/routing-policies-overview](https://docs.cloud.google.com/dns/docs/routing-policies-overview)
|
||||
|
||||
### Verification Tools
|
||||
|
||||
**Check DNS propagation:** [https://www.whatsmydns.net/#A/mega-mesh.net](https://www.whatsmydns.net/#A/mega-mesh.net)
|
||||
|
||||
**Global health monitoring:** [https://www.openstatus.dev/play/checker](https://www.openstatus.dev/play/checker)
|
||||
|
||||
### Test the Deployed Services
|
||||
|
||||
**List available models:**
|
||||
|
||||
```bash
|
||||
curl -H "Content-Type: application/json" https://mega-mesh.net/v1/models
|
||||
```
|
||||
|
||||
**Generate completion:**
|
||||
|
||||
```bash
|
||||
curl https://mega-mesh.net/v1/completions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
|
||||
"prompt": "Why NetBird rocks?",
|
||||
"max_tokens": 500
|
||||
}' | jq
|
||||
```
|
||||
|
||||
## Known Issues and Future Improvements
|
||||
|
||||
### Security Concerns
|
||||
|
||||
1. **Secrets in Git Repository**:
|
||||
- **Current**: Secrets committed to the repo
|
||||
- **Future**: Use sealed secrets, external secret operators, or cloud secret managers
|
||||
2. **Certificate Auto-Renewal**:
|
||||
- **Current**: Manual certificate renewal required
|
||||
- **Future**: Integrate with cert-manager for automatic renewal via cloud DNS API
|
||||
|
||||
### Automation Needs
|
||||
|
||||
1. **Manual Cluster Registration**:
|
||||
- **Current**: ServiceAccount and cluster secrets created manually
|
||||
- **Future**: Automated registration script or operator
|
||||
2. **Setup Key Management**:
|
||||
- **Current**: Single setup key in user data script
|
||||
- **Future**: Dynamic setup key generation per node or automated rotation
|
||||
|
||||
## Reference Documentation
|
||||
|
||||
### Official Documentation
|
||||
|
||||
- **MicroK8s GPU Addon**: [https://microk8s.io/docs/addon-gpu](https://microk8s.io/docs/addon-gpu)
|
||||
- **ArgoCD with MicroK8s**: [https://medium.com/@orangecola3/gitops-with-argocd-and-microk8s-4c70a92e43b2](https://medium.com/@orangecola3/gitops-with-argocd-and-microk8s-4c70a92e43b2)
|
||||
- **External Cluster Registration**: [https://medium.com/pickme-engineering-blog/how-to-connect-an-external-kubernetes-cluster-to-argo-cd-using-bearer-token-authentication-d9ab093f081d](https://medium.com/pickme-engineering-blog/how-to-connect-an-external-kubernetes-cluster-to-argo-cd-using-bearer-token-authentication-d9ab093f081d)
|
||||
|
||||
### NetBird Resources
|
||||
|
||||
- **Kubernetes Operator Deployment**: [https://docs.netbird.io/how-to/kubernetes-operator#deployment](https://docs.netbird.io/how-to/kubernetes-operator#deployment)
|
||||
- **Service Mesh Sidecars**: [https://docs.netbird.io/how-to/kubernetes-operator#accessing-remote-services-using-sidecars](https://docs.netbird.io/how-to/kubernetes-operator#accessing-remote-services-using-sidecars)
|
||||
|
||||
### ArgoCD Resources
|
||||
|
||||
- **Private Repository Setup**: [https://argo-cd.readthedocs.io/en/stable/user-guide/private-repositories/](https://argo-cd.readthedocs.io/en/stable/user-guide/private-repositories/)
|
||||
- **ApplicationSet Documentation**: [https://argo-cd.readthedocs.io/en/stable/operator-manual/applicationset/](https://argo-cd.readthedocs.io/en/stable/operator-manual/applicationset/)
|
||||