# CLAUDE.md — Working with VergeOS via vrg

These are the things I (Claude) learned working against VergeOS through the
`vrg` CLI and its raw API. Drop this file into the working directory of any
project that touches a VergeOS cloud, and a future agent will start with the
same playbook rather than rediscovering each gotcha. It covers the vrg/API
fundamentals **and** the full workflow that stood up four Kubernetes clusters
(k3s, RKE2 via Rancher, Talos) on this cloud — CSI/CCM, the Rancher node
driver, golden templates, and Talos.

**Contents**
- *vrg & API fundamentals:* Verify before success · Structured output · Exit
  codes · Raw API when vrg lacks a flag · `vms` vs `machines` key spaces ·
  Webhooks · `doctor` · `-A` · Quiet mode
- *Provisioning VMs:* Recipe deploy gotchas · VM lifecycle · Drive grow ·
  Editing cloud-init via API · Resizing RAM/CPU · Storage tiers
- *Networking:* Firewall apply · Network/VM-delete gotchas · DHCP on External
- *Kubernetes:* CSI + CCM placement · Rancher node-driver golden templates ·
  Talos on VergeOS · Merging cluster kubeconfigs

The two highest-leverage lessons from the cluster work: **on the `External`
vnet you do not own DHCP, and clone-based provisioning lives or dies on the
guest agent.** Both are explained below.

## Verify before reporting success

Exit code 0 from `vrg` means the API call returned without error. It does
**not** always mean the resource is in the state you expect. Recipe deploys
in particular can silently skip individual stages — most often when a
question has no default value and nothing was passed via `--set`. The deploy
"succeeds," but the resource is malformed.

After every `create` or `deploy`, run a `get` and check the resource shape.
**`--query` is a global option — it must come BEFORE the subcommand,
not after.** `vrg vm get <name> --query ...` errors with "No such option";
`vrg --query ... vm get <name>` works.

```bash
vrg -o json --query status vm get <name>          # is it running?
vrg -o json --query 'machine_nics[].vnet' vm get <name>  # right networks?
```

`vrg vm get` does **not** include drives in any output mode (json, wide, or
table) despite `--help` claiming "-o json or -o wide for the full attribute
set". To verify a VM has its OS drive, query `machine_drives` directly
(see "`vms` and `machines` are two tables with two key spaces" below).

If verification fails, treat the deploy as failed regardless of exit code.
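
A minimal sketch of that check as a reusable helper. `vm_field_is` is a hypothetical name, and it assumes `--query <field>` prints a single scalar that may or may not be JSON-quoted; adjust if your vrg build prints something else.

```bash
# Compare one field of a VM against an expected value.
# Global options (-o, --query) go before the "vm get" subcommand.
vm_field_is() {
  local name="$1" field="$2" expected="$3" actual
  actual=$(vrg -o json --query "$field" vm get "$name") || return 1
  actual=$(printf '%s' "$actual" | tr -d '"')   # drop JSON quotes if present
  [ "$actual" = "$expected" ]
}

# Example: a fresh recipe deploy should exist and still be stopped.
# vm_field_is web-01 status stopped || echo "deploy failed verification" >&2
```

A non-zero return covers both "vrg itself failed" and "field has the wrong value", which is the behaviour you want when exit code 0 alone is not trusted.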

## Use structured output, not table parsing

Every `list` and `get` supports `-o json`. Use it. Use `--query` for
single-field extraction — most cases don't need `jq`:

```bash
vrg --query status vm get web-01
vrg -o json --query '[].name' vm list
```

If you find yourself reaching for `grep` or `awk`, switch to `-o json` first.

## Branch on exit codes, not stderr

vrg's exit codes are distinct and stable:

| Code | Meaning                                  |
| ---- | ---------------------------------------- |
| 0    | success                                  |
| 6    | not found                                |
| 7    | conflict / ambiguous (e.g. two matches)  |
| 8    | validation error                         |
| 9    | timeout                                  |
| 10   | connection error                         |

Retry on 9 and 10. Do not retry on 6/7/8 — fix the call.

Code 7 is loud on purpose: vrg refuses to silently pick a match when a name
is ambiguous. Resolve by passing the numeric `$key` or by scoping
(`--tenant`, etc.).
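
One way to encode the retry rule is a small wrapper, sketched here. `vrg_retry`, `RETRY_MAX`, and `RETRY_DELAY` are hypothetical names for illustration, not part of vrg.

```bash
# Retry a command only on vrg's transient exit codes (9 timeout, 10 connection).
# RETRY_MAX and RETRY_DELAY are illustrative knobs, not vrg settings.
vrg_retry() {
  local max="${RETRY_MAX:-3}" delay="${RETRY_DELAY:-5}" attempt=1 rc
  while :; do
    rc=0
    "$@" || rc=$?
    case "$rc" in
      9|10)
        [ "$attempt" -lt "$max" ] || return "$rc"
        attempt=$((attempt + 1))
        sleep "$delay"
        ;;
      *) return "$rc" ;;   # 0 is success; 6/7/8 need a fixed call, not a retry
    esac
  done
}

# Example: vrg_retry vrg -q -o json vm list
```

The wrapper passes the original exit code through, so callers can still branch on 6, 7, and 8 afterwards.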

## Recipe deploy gotchas

- **Questions without default values silently break post-deploy.** Run
  `vrg recipe question list <name>` first (NOT `vrg recipe get`), read every
  question, and pass `--set` for every one whose default is empty. Most
  common offender: `SELECT_OS_TIER` (the recipe tries to use a tier `$key`
  later and finds none → no drive created → unbootable VM with no error).
- **Pass NICs by network name, not network key.** Some code paths accept
  keys; the recipe parser resolves names. Names are the safe choice.
- **But `YB_CLUSTER` needs the cluster `$key`, NOT the name — the opposite
  of NICs.** `--set YB_CLUSTER=homelab` fails the entire deploy with
  `Unable to get the cluster for the question 'Cluster': No such file or
  directory`; `--set YB_CLUSTER=1` works. Get the key from
  `/api/v4/clusters` (the one compute cluster here, `homelab`, is `$key` 1).
- **Recipe *catalog* VMs are not bootable templates.** The catalog entries
  (e.g. `Ubuntu Server 24.04 (Noble Numbat) 1.0-9`) define the recipe; the
  real OS image is imported via `OS_DL_URL`/`CREATE_OS_DRIVE` only on a
  *deploy*. Cloning a catalog entry gives a VM with no working OS and no
  guest agent. Always `vrg recipe deploy`, never clone the catalog VM.
- **CIDR needs a leading slash.** `--set YB_NIC_ETH0_CIDR=24` errors with
  `'Subnet Mask CIDR' is invalid`. Use `/24`. The question is a `list`
  type whose valid values are `/32`, `/31`, ..., `/1` — leading slash
  mandatory.
- **Drive size is raw bytes as a string.** `YB_DRIVE_OS_SIZE=20G` /
  `=20Gi` / `=20480M` all get silently dropped — drive comes out at
  the base cloud-image size. Pass bytes literally: `=21474836480` for
  20 GiB. The question's own default (`53687091200` = 50 GiB) is the
  format hint.
- **RAM is megabytes, NOT bytes — asymmetric with drive size.**
  Easy to get wrong because of the contrast. `YB_RAM=2147483648`
  (2 GiB in bytes) errors with `'RAM' must be less than or equal to
  1048576` (the cap is in MB — that's ≈1 TiB upper bound). Correct:
  `YB_RAM=2048` for 2 GiB. The stored `vms.ram` field is also in MB —
  any existing VM shows `ram: 2048` for a 2 GiB allocation.
- **`SSH_KEY` accepts ONE key only.** Multi-line `--set SSH_KEY=$(cat keys)`
  with newlines silently truncates to the first key. To authorize multiple
  keys, deploy with one, SSH in, append the others manually. Painful for
  any deployment that expects multiple operator pubkeys.
- **`SELECT_OS_TIER` value is the tier *integer* (e.g. `1`), not a name.**
  The `storage_tiers` rows have no `$key` field at all — the `tier`
  integer IS the identifier. Get available tiers with
  `curl /api/v4/storage_tiers?fields=all` → look for `tier: N`.
- **Some recipes ignore `YB_DRIVE_OS_SIZE` entirely.** Confirmed bad on
  `Kubernetes K3S` 1.0-15 — drives always come out at the cloud-image base
  size (~3 GiB) regardless of what you pass. Confirmed honored on
  `Ubuntu Server 24.04` and `Debian 13`. When the recipe ignores it, grow
  via API after deploy (see "Drive operations" below).
- **A "complete" task does not mean a bootable VM.** Always verify drives
  exist (see "Verify before reporting success").
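
The unit rules above (bytes for drive size, MB for RAM, leading slash for CIDR) are easy to get backwards, so here is a sketch of three converters. The function names are hypothetical. The commented deploy line reuses only the question names listed above, with `web-01` as an example VM name.

```bash
# GiB -> raw byte count, the form YB_DRIVE_OS_SIZE expects.
gib_to_bytes() { echo $(( $1 * 1024 * 1024 * 1024 )); }

# GiB -> megabytes, the form YB_RAM expects.
gib_to_mb() { echo $(( $1 * 1024 )); }

# Normalize a prefix length to the leading-slash form (24 or /24 -> /24).
cidr_slash() { echo "/${1#/}"; }

# Example:
# vrg recipe deploy "Ubuntu Server 24.04 (Noble Numbat)" --name web-01 \
#   --set YB_DRIVE_OS_SIZE="$(gib_to_bytes 20)" \
#   --set YB_RAM="$(gib_to_mb 2)" \
#   --set YB_NIC_ETH0_CIDR="$(cidr_slash 24)" \
#   --set SELECT_OS_TIER=1 --set YB_CLUSTER=1
```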

## VM lifecycle: deploy → start → wait → install

`vrg recipe deploy` only creates the VM record + drives. It does NOT power
the VM on. The created VM sits in `status: stopped`. Always follow with:

```bash
vrg recipe deploy <recipe> --name <name> --set ...
vrg vm start <name>
```

No `--auto-start` flag exists on `recipe deploy`. Easy to miss in scripts.

**Wait for the OS drive to settle before starting.** After deploy, the OS
drive sits in `media: import` while the cloud image streams in (~15-30s).
`vrg vm start` during this window fails with
`Cannot power on a VM while drives are importing`. Wait until
`media != "import" AND disksize >= target` before starting.
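
A sketch of that wait condition as a poll loop. It assumes `GET /api/v4/machine_drives/<key>` returns a single JSON object carrying `media` and `disksize` (the same fields named above), and that `HOST` and `KEY` are set as in the other curl examples. `drive_settled` and `wait_for_drive` are hypothetical helper names.

```bash
# True when a drive is no longer importing and has reached the target size.
drive_settled() {
  local media="$1" disksize="$2" target="$3"
  case "$disksize" in ''|*[!0-9]*) return 1 ;; esac   # not a number yet
  [ "$media" != "import" ] && [ "$disksize" -ge "$target" ]
}

# Poll one machine_drives row until it settles or the timeout (seconds) expires.
wait_for_drive() {
  local key="$1" target="$2" timeout="${3:-120}" waited=0 state
  while [ "$waited" -lt "$timeout" ]; do
    state=$(curl -ks -H "Authorization: Bearer $KEY" \
      "$HOST/api/v4/machine_drives/$key?fields=all" \
      | jq -r '"\(.media) \(.disksize)"') || state=""
    if drive_settled "${state% *}" "${state#* }" "$target"; then
      return 0
    fi
    sleep 3
    waited=$((waited + 3))
  done
  return 9   # same number vrg uses for a timeout
}

# Example: wait_for_drive <drive_key> 21474836480 && vrg vm start <name>
```

Keeping the predicate separate from the HTTP call makes the condition easy to check by hand with values copied from a `machine_drives` row.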

**Cloud-init blocks apt for several minutes on first boot.** Fresh Ubuntu
cloud images run `unattended-upgrades` on first boot — apt is locked, and a
kernel update may trigger a mid-cloud-init reboot. Before any
`sudo apt-get install ...` from a script:

```bash
ssh <vm> 'cloud-init status --wait'   # blocks until first-boot fully done
```

`cloud-init status --wait` may itself disconnect (the reboot kills sshd).
Wrap it in a poll loop that retries SSH until `cloud-init status` reports
`status: done`.
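
A sketch of such a poll loop. `wait_cloud_init` and `POLL_DELAY` are hypothetical names. It treats any SSH failure as "not ready yet" and stops early if cloud-init reports `status: error`.

```bash
# Poll until cloud-init reports "status: done", tolerating SSH drops
# such as the reboot in the middle of first boot.
wait_cloud_init() {
  local host="$1" tries="${2:-60}" delay="${POLL_DELAY:-10}" i=0 out
  while [ "$i" -lt "$tries" ]; do
    out=$(ssh -o ConnectTimeout=5 -o BatchMode=yes "$host" \
      'cloud-init status' 2>/dev/null) || true
    case "$out" in
      *"status: done"*)  return 0 ;;
      *"status: error"*) return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  return 9   # gave up waiting
}

# Example: wait_cloud_init ubuntu@<vm-ip> && ssh ubuntu@<vm-ip> 'sudo apt-get update'
```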

## Drive operations: grow without rebooting

To grow a VergeOS drive (e.g. to fill the K3S-recipe drive-size gap, or to
expand a K8s PVC):

```bash
curl -ks -X PUT -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \
  -d '{"disksize": 21474836480}' \
  "$HOST/api/v4/machine_drives/<key>"
```

Returns 200 with `[]` body. **The PUT acknowledges before VergeOS
internally flips the drive into a second `media: import` cycle for the
resize.** If you poll for `media != "import"` immediately after the PUT,
you'll catch the pre-flip moment and return straight away — then any
follow-up action (e.g. `vrg vm start`) fails 2 seconds later with the
importing-drive error. Sleep ~5 seconds after the PUT before polling.
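
To make the settle delay hard to forget, the PUT and the pause can live in one helper. This is a sketch: `grow_drive` and `SETTLE_DELAY` are hypothetical names, and `-f` is added so curl fails on an HTTP error instead of returning 0.

```bash
# Resize a drive, then pause so VergeOS can flip it into the resize
# "import" cycle before any caller starts polling its media state.
grow_drive() {
  local key="$1" bytes="$2"
  curl -ksf -X PUT -H "Authorization: Bearer $KEY" \
    -H "Content-Type: application/json" \
    -d "{\"disksize\": $bytes}" \
    "$HOST/api/v4/machine_drives/$key" >/dev/null || return 1
  sleep "${SETTLE_DELAY:-5}"   # about 5 seconds was enough here
}

# Example: grow_drive <drive_key> 21474836480, then poll media, then vrg vm start <name>
```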

Inside the VM, cloud-init's `growpart` + `resize2fs` will expand the root
FS to fill the new drive size on the next boot — but only if the VM hasn't
already booted with the smaller size. Order matters: grow drive THEN start
VM the first time, not the other way around.

## Firewall apply: the endpoint vrg uses but doesn't document

`vrg network apply-rules <network>` corresponds to:

```bash
curl -ks -X POST -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \
  -d '{}' "$HOST/api/v4/vnets/<id>/apply"
```

Returns 200 with `[]`. Bumps `last_fw_apply` on the vnet. VergeOS also
auto-applies firewall changes within 50-140s of any `vnet_rules` mutation
(observed empirically; varies). For controllers reconciling rules: call
the explicit apply after a successful POST/DELETE batch if you want
seconds-not-minutes propagation; otherwise let auto-apply do it.

## Network create and VM delete gotchas

These three came up together during a tenant rebuild. All silent or
misleading via vrg; the raw API surfaces the real cause.

- **`vrg network create` defaults to VXLAN.** New internal networks come up
  with `layer2_type: "vxlan"` and fail to start on this cluster with
  `vxlan: 'group' requires 'dev' to be specified` — no parent device for
  VXLAN multicast. `vrg network start` will report success without
  actually starting. Fix: switch to VLAN before starting, with an unused
  `layer2_id`:

  ```bash
  curl -sk -X PUT -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \
    -d '{"layer2_type":"vlan","layer2_id":200,"vxlan_multicast":""}' \
    "$HOST/api/v4/vnets/<key>"
  ```

- **`vrg network create` doesn't add a default-gateway rule.** Even with
  `interface_vnet` correctly set to an uplink, new networks have no rule
  routing unmatched traffic to it. VMs get "Destination Net Unreachable"
  from the vnet's own gateway IP. Built-in networks (Core, DMZ) come with
  this rule; new ones don't. Fix: POST it manually, then `apply-rules`:

  ```bash
  curl -sk -X POST -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \
    -d '{"vnet":<new_vnet>,"name":"Default gateway","action":"route",
         "source_ip":"","destination_ip":"default",
         "target_ip":"vnetkey:<uplink_vnet>",
         "protocol":"any","direction":"outgoing","enabled":true}' \
    "$HOST/api/v4/vnet_rules"
  vrg network apply-rules <new_network>
  ```

- **`vrg vm delete` fails with cryptic "too many 500 error responses" if
  firewall rules reference the VM's IP.** The CLI swallows the real cause.
  Raw API returns: `"Error deleting associated IP address: Firewall rule(s)
  are referencing this IP address"`. The blocking rule typically has
  `target_ip: "address:<key>"` — a port-forward rule on an upstream vnet
  (e.g. "SSH to <vm>" on External) pinning the VM's address record. Find
  and delete those rules first, then retry:

  ```bash
  curl -sk -H "Authorization: Bearer $KEY" "$HOST/api/v4/vnet_rules?fields=all" \
    | jq '.[] | select((.target_ip // "") | test("^address:")) | {key:."$key",name,target_ip}'
  ```

## When vrg doesn't have a flag, drop to the raw API

vrg covers most operations but not all. When it doesn't, hit the API
directly using the credentials vrg already has:

```bash
curl -X PUT "$VERGE_HOST/api/v4/<plural_snake>/<key>" \
  -H "Authorization: Bearer $VERGE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{...}'
```

- Resource paths are **plural snake_case**: `machine_nics`, not `nic` or
  `machineNics`.
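- `PATCH` is not implemented; use `PUT`. The partial-body `PUT`s elsewhere in
  this file (for example `{"disksize": ...}` on `machine_drives`) were
  accepted, so sending the full object does not appear to be required.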
- The NIC resource is `machine_nics`.
- The drive resource is `machine_drives`.
- The `?machine=<key>` query param is silently ignored on list endpoints —
  the server returns the full list either way. Filter client-side with `jq`.

## `vms` and `machines` are two tables with two key spaces

`vrg vm` queries the `vms` table. NICs, drives, and stats live under
`machine_*` tables keyed by `machines.$key`, which is **not** the same as
`vms.$key`. The bridge is `vms.machine` (foreign key into `machines`).

So for adminjump: `vms.$key=32`, `machines.$key=38`, and its OS drive is the
`machine_drives` row with `machine=38`. Filtering `machine_drives` on
`machine == 32` (the `vms` key) quietly matches nothing.

To look up drives for a VM by name:

```bash
mkey=$(curl -ks -H "Authorization: Bearer $KEY" "$HOST/api/v4/vms" \
  | jq -r '.[] | select(.name=="<vm_name>") | .machine')
curl -ks -H "Authorization: Bearer $KEY" "$HOST/api/v4/machine_drives" \
  | jq "[.[] | select(.machine==$mkey)]"
```

Note: `vms.drives` itself is `null` in the API — only `machine_drives` is
authoritative.

## Webhook subscriptions span two tables

Wiring a webhook to a system event involves two distinct tables:

- `webhook_urls` — the destination definition. Managed by `vrg webhook`.
- `webhooks` — the event subscription queue (which event fires which
  webhook_url). Currently raw-API only.

Creating a `webhooks` row is `POST /api/v4/webhooks`, not `PUT /<key>`: as
elsewhere in this API, `POST` creates and `PUT` updates an existing key. The
foreign key into `webhook_urls` is named **`webhook_url`** (singular).

Snapshot tables don't emit events directly. To react to snapshot creation,
subscribe to `vms.new` and filter for `is_snapshot=true` in the handler.

## Use `doctor` as a preflight before risky changes

`vrg doctor` runs ~20 health checks (DIMM, SMART, vSAN journal, certs, MTU,
LLDP, more) and exits 1 on any failure. It slots straight into a "should I
proceed?" gate before updates, mass migrations, or snapshot operations:

```bash
vrg -o json doctor | jq '.[] | select(.status == "fail")'
```

For faster targeted checks during routine work:

```bash
vrg doctor --check connectivity,alarms,updates
```
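
A sketch of the gate itself, relying only on `vrg doctor` exiting non-zero when a check fails. `guarded` is a hypothetical wrapper name.

```bash
# Run a command only if the VergeOS health checks pass.
guarded() {
  if ! vrg -q doctor >/dev/null; then
    echo "vrg doctor reported failures; refusing to run: $*" >&2
    return 1
  fi
  "$@"
}

# Example: guarded vrg vm stop <name> --wait --timeout 120
```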

## Cross-cloud triage with `-A`

`-A` runs any list command across every configured profile — useful when
triaging whether a problem is local or fleet-wide:

```bash
vrg -A alarm list --level critical
vrg -A --query '[?status==`error`]' vm list
```

## Quiet mode for agent loops

`-q` suppresses progress spinners and success banners. Combine with
`-o json` for clean, parseable output in scripts:

```bash
vrg -q -o json vm list
```

## DHCP on the `External` vnet: the LAN owns it, not VergeOS

`External` shares a broadcast domain (L2) with the physical LAN, so whatever
DHCP server runs the LAN (a UniFi gateway here) answers requests **before**
VergeOS and wins the race. Consequences that bit repeatedly:

- A VM set to DHCP on External gets a LAN-pool IP, NOT the IP VergeOS
  "reserved" on the NIC. VergeOS records **no lease** in `vnet_addresses`.
- Polling `vnet_addresses` for a VM's IP returns nothing, even though the VM
  is up and on the network. (Talos landed on `.160`, not its reserved IP.)
- Anything that discovers an IP via "VergeOS DHCP lease" (e.g. the Rancher
  node-driver fallback) fails on External.

Two reliable workarounds: **bake a static IP into the guest** (cloud-init or
machine config, not just the VergeOS NIC field), which sidesteps DHCP
entirely, or **rely on the guest agent**, which reports whatever IP the OS
actually holds regardless of who served DHCP. If the guest stays on DHCP,
the guest agent is the only discovery path that survives the race.

Note: pinging an External IP from a Tailscale-routed host can return false
ICMP echoes (the subnet router answers) — not a real reachability test.

## Editing cloud-init via the API (NoCloud datasource)

VergeOS VMs use the `nocloud` datasource, stored as **files**, not a field:

- `vms.<key>.cloudinit_files` lists them: `/user-data`, `/meta-data` (and
  sometimes `/network-config`), each a `cloudinit_files` row with a `$key`.
- **Read:** `GET /api/v4/cloudinit_files/<key>?download=true` (the plain
  record / `?fields=all` shows metadata only — no content).
- **Write:** `PUT /api/v4/cloudinit_files/<key>` with `{"contents": "<text>"}`
  (verified; read back with `?download=true`).
- Recipe-generated user-data is a **Jinja2 template** (`render: jinja2`,
  `{{yb.USER}}` …) filled from recipe answers. `vms.recipe_instance` is
  **readonly** — you cannot detach a recipe by PUTting it null. A recipe
  that auto-detaches does so server-side once its guest agent reports.
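
The read and write calls above, sketched as two helpers. The function names are hypothetical; `HOST` and `KEY` are the same variables as in the other curl examples. `jq -Rsc` is used only to JSON-escape the file into the `contents` field.

```bash
# Download the raw contents of one cloudinit_files row.
cloudinit_read() {
  curl -ksf -H "Authorization: Bearer $KEY" \
    "$HOST/api/v4/cloudinit_files/$1?download=true"
}

# Replace the contents of one cloudinit_files row from a local file.
# jq -Rs reads the whole file as a single string and escapes it as JSON.
cloudinit_write() {
  jq -Rsc '{contents: .}' "$2" \
    | curl -ksf -X PUT -H "Authorization: Bearer $KEY" \
        -H "Content-Type: application/json" \
        --data-binary @- "$HOST/api/v4/cloudinit_files/$1"
}

# Example round trip:
# cloudinit_read <key> > user-data.txt
# cloudinit_write <key> user-data.txt && cloudinit_read <key> | diff - user-data.txt
```

Reading back with `?download=true` after every write is the verification step; the PUT response alone says nothing about the stored text.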

## Resizing VM RAM/CPU

Needs the VM powered off:

```bash
vrg vm stop <name> --wait --timeout 120
vrg vm update <name> --ram 8192      # MB (see RAM-is-MB note above)
vrg vm start <name>
```

The guest sees the new RAM on cold boot. The kubelet's `.status.capacity`
can read stale for a few seconds right after a k8s node flips Ready —
recheck. `vms.status`/`running` often read `null` mid-transition; poll with
`vrg --query status vm get <name>` or check inside the guest instead.

## Storage tiers

This cluster has only **tier 1** (NVMe, ~2.79 TiB usable), so
`SELECT_OS_TIER=1`. New drives via `vrg vm drive create` default to **tier
4** — pass `--tier 1` for the NVMe pool, or migrate later (`vrg vm drive
update` triggers a live tier migration). List with
`curl /api/v4/storage_tiers?fields=all` (rows have no `$key`; the `tier`
integer is the identifier).

## Kubernetes on VergeOS: CSI + CCM placement

Charts live at `https://verge-io.github.io/helm-charts`: `vergeos-csi`,
`vergeos-cloud-controller-manager`, `vergeos-node-driver`.

- **Both CSI and CCM need HTTPS to the VergeOS host API**, which forces the
  cluster nodes onto the **same vnet as that API** (here `External`). Other
  topologies were tried and failed; same-vnet is the one that works.
- **CSI** `block.poolVmId` is a `vms.$key` (NOT `machines.$key`) — a tiny
  "pool" VM that is just a metadata anchor; PVCs actually hot-plug onto the
  compute node as `machine_drives`.
- **CCM** `loadBalancer.networkID` is the vnet `$key`; `loadBalancer.ipPool`
  is a list of IPs. **Two CCMs on one vnet collide** — the CCM matches
  existing LB `ipalias` rows by the `k8s-lb-<ns>-<svc>` hostname, so a
  same-namespace+name Service on a second cluster *adopts* the first
  cluster's rule (its EXTERNAL-IP becomes a lie). Use non-overlapping IP
  pools AND distinct Service names, or separate vnets.
- The chart's default `nodeSelector: {control-plane: "true"}` does NOT match
  nodes labeled `control-plane: ""` (Talos, k3s). Override it (an empty
  `nodeSelector: {}` is ignored by the chart; set the real label/value).
- **The CCM creates only the DNAT (incoming translate) rule** for a
  LoadBalancer. Off-vnet clients also need a companion **SNAT (outgoing
  translate)** rule, or the node replies with its own IP and the client
  drops the reply. The CCM never emits it — this is the gap to close with a
  small reconciler that diffs `kubectl get svc` against `vnet_rules`.

## Rancher node-driver provisioning: the template is everything

Installing Rancher is a quick-start (cert-manager prereq + the `rancher`
chart; Rancher 2.14.2 gates on k8s `< 1.36`, so a 1.36+ cluster can't be
imported). The node driver (a UI extension + `docker-machine-driver-vergeos`,
added as a `ClusterRepo` pointing at the Pages index, NOT the github.com
URL) clones a **template VM** to build RKE2 nodes. **Getting the template
right is the entire job — everything else is easy.**

Model: clone template → generate SSH keypair → inject via cloud-init →
power on → **wait for the VM to report its IP** → SSH in → install RKE2.

Template requirements (verge.io Rancher Integration docs + hard experience):

- **Ubuntu 24.04** (only supported template OS).
- **cloud-init installed + armed** (driver injects SSH key + hostname via a
  multi-part MIME payload; it handles netplan DHCP for `en*` and machine-id
  regen on each clone).
- **`qemu-guest-agent` installed + running** — the ONLY working IP-discovery
  path on External (the DHCP-lease fallback dies to the LAN race). Without
  it, every node hangs forever at `Waiting for VM to report IP address` with
  no error, and CAPI loops create→delete.
- DHCP-enabled network; ≥4 GB RAM/node.

**Do NOT clone the recipe catalog entry** (bare image, no agent, recipe-
Jinja2 cloud-init you can't detach). **Build a golden image:**

1. `vrg recipe deploy "Ubuntu Server 24.04 (Noble Numbat)"` fresh (real OS
   import; auto-installs qemu-guest-agent; auto-detaches the recipe after
   first boot). Static IP + your SSH key for access.
2. Confirm `qemu-guest-agent` is installed + enabled.
3. Generalize: `rm /etc/cloud/cloud.cfg.d/80_disable_network_after_firstboot.cfg`
   (else clones never DHCP); netplan → DHCP matching `e*`;
   `truncate -s0 /etc/machine-id`; `rm /etc/ssh/ssh_host_*`;
   `cloud-init clean --logs --seed`.
4. Power off. The node pool's `templateVm` must match the VM name exactly.
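
Step 3 can be scripted roughly as follows, run as root inside the VM. This is a sketch under assumptions: the netplan file name `01-dhcp.yaml` is my choice, any existing netplan file that pins a static address would still need removing by hand, and `ROOT` is a hypothetical prefix (empty in real use) that lets the steps be rehearsed against a scratch directory.

```bash
# Generalize an Ubuntu 24.04 VM before using it as a clone template.
generalize_template() {
  local root="${ROOT:-}"
  # With this file present, clones never re-run network config and never DHCP.
  rm -f "$root/etc/cloud/cloud.cfg.d/80_disable_network_after_firstboot.cfg"
  # DHCP on any NIC whose name starts with "e" (file name is an assumption).
  mkdir -p "$root/etc/netplan"
  cat > "$root/etc/netplan/01-dhcp.yaml" <<'EOF'
network:
  version: 2
  ethernets:
    all-e:
      match:
        name: "e*"
      dhcp4: true
EOF
  chmod 600 "$root/etc/netplan/01-dhcp.yaml"
  : > "$root/etc/machine-id"          # same effect as truncate -s0
  rm -f "$root"/etc/ssh/ssh_host_*    # each clone regenerates its host keys
  cloud-init clean --logs --seed      # re-arm cloud-init for the next boot
}
```

Power the VM off straight after this runs; booting it again would consume the re-armed first boot.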

Gotchas:

- `templateVm` is a **literal name match**; a catalog clone comes back
  version-suffixed (`ubuntu24 1.0-9`) and won't match `ubuntu24`.
- `machines.<key>.agent_guest_info` reads **null at idle even with the agent
  running** — NOT a health check. The IP only surfaces during the driver's
  provisioning loop; the success signal is the provision-pod log printing
  `Guest agent reported IP: <dhcp-ip>`.
- Expect a mid-cloud-init reboot on first boot (kernel update): SSH drops,
  then returns — poll `cloud-init status`.
- Downstream kubeconfig (secret `<cluster>-kubeconfig`, key `.data.value`)
  points at an internal `10.43.x.x` Rancher service IP. To use it off the
  management cluster, rewrite the server to the external Rancher hostname
  (`https://<rancher>/k8s/clusters/<id>`) and add `insecure-skip-tls-verify:
  true`.

## Talos Linux on VergeOS (no recipe; image-based)

No Talos recipe exists — provision close to bare-metal:

- Upload `metal-amd64.iso` via `vrg file upload`. Read-back: `?fields=all`
  shows `filesize` (the table `size_gb` lags and shows 0 right after upload).
- Build: `vrg vm create` + `vrg vm drive create` (OS disk) +
  `vrg vm drive create --media cdrom --media-source <iso>` + `vrg vm nic
  create`.
- **`boot_order` is a QEMU letter code, not a word.** `disk`/`hd`/`harddisk`
  are rejected; valid: `c` (disk-first), `cd` (CD-then-disk as observed here,
  the default for all VMs), `dc`, `cdn`. Set `boot_order=c` so post-install
  reboots land on disk instead of the installer ISO. (`cd` with no CD attached
  also falls through to disk, which is why recipe VMs all show `cd`.) In stock
  QEMU the letters are `c` = hard disk, `d` = CD-ROM, `n` = network, which
  would read `cd` as disk-then-CD; `c` alone is unambiguous either way.
- **Bake the static IP into `machine.network`** (DHCP loses the External
  race). Use `deviceSelector: { driver: virtio_net }` — Talos names the NIC
  `ens1`, so name-based matching misses.
- **`HostnameConfig` (separate doc) conflicts with `machine.network.hostname`**
  — `talosctl validate` errors "static hostname is already set in v1alpha1
  config." Delete the `HostnameConfig` document.
- `apply-config` to maintenance mode installs to disk without an obvious
  reboot ("Applied configuration without a reboot") — verify with
  `talosctl get discoveredvolumes` (sda gains EFI/BIOS/BOOT/META/STATE/
  EPHEMERAL partitions).
- **Hot-detach of an online CD-ROM is rejected** ("Unable to delete online
  drive") — power off, `vrg vm drive delete`, start.
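
A sketch of the static-IP piece as a config patch, written out by a small helper. `write_talos_net_patch` is a hypothetical name, and the address and gateway are caller-supplied examples. The YAML uses the `machine.network.interfaces` form with the `deviceSelector` described above.

```bash
# Write a Talos machine-config patch that pins a static address on the
# virtio NIC and disables DHCP on it.
write_talos_net_patch() {
  local out="$1" addr_cidr="$2" gateway="$3"
  cat > "$out" <<EOF
machine:
  network:
    interfaces:
      - deviceSelector:
          driver: virtio_net   # match by driver rather than interface name
        dhcp: false
        addresses:
          - ${addr_cidr}
        routes:
          - network: 0.0.0.0/0
            gateway: ${gateway}
EOF
}

# Example:
# write_talos_net_patch net-patch.yaml <ip>/24 <gateway-ip>
# talosctl gen config <cluster> https://<ip>:6443 --config-patch @net-patch.yaml
```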

## Merging cluster kubeconfigs

k3s/RKE2 kubeconfigs all use the generic names `default`/`cluster`/`user`,
which collide on merge. Rename each config's `name:`/`cluster:`/`user:`/
`current-context:` **value lines** to a unique name first (targeted `sed` on
lines ending in the old value — never touch the base64 cert data), then
`KUBECONFIG=a:b:c kubectl config view --flatten`. Talos kubeconfigs already
use distinct names. k3s server URL is `127.0.0.1` — rewrite to the node IP.
Always back up `~/.kube/config` before merging.
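
A sketch of the targeted rename as a stdin-to-stdout filter. `rename_kubeconfig` is a hypothetical helper. It only rewrites lines whose whole value is the old name, plus the `127.0.0.1` server URL, so base64 cert data is never touched.

```bash
# Rename the generic cluster/context/user name and point the server URL
# at the node IP. Usage: rename_kubeconfig <old> <new> <node_ip> < in > out
rename_kubeconfig() {
  local old="$1" new="$2" node_ip="$3"
  sed -E \
    -e "s/^([[:space:]-]*(name|cluster|user|current-context): )${old}\$/\\1${new}/" \
    -e "s#(server: https://)127\\.0\\.0\\.1:#\\1${node_ip}:#"
}

# Example:
# cp ~/.kube/config ~/.kube/config.bak
# rename_kubeconfig default k3s-a <node-ip> < k3s.yaml > k3s-a.yaml
# KUBECONFIG=k3s-a.yaml:rke2-b.yaml kubectl config view --flatten > merged.yaml
```

Anchoring the pattern on the end of the line is what keeps `- cluster:` block headers and cert data out of the match.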
