# CLAUDE.md — Working with VergeOS via vrg

These are the things I (Claude) learned working against VergeOS through the
`vrg` CLI. Drop this file into the working directory of any project that
touches a VergeOS cloud, and a future agent will start with the same playbook
rather than rediscovering each gotcha.

## Verify before reporting success

Exit code 0 from `vrg` means the API call returned without error. It does
**not** always mean the resource is in the state you expect. Recipe deploys
in particular can silently skip individual stages — most often when a
question has no default value and nothing was passed via `--set`. The deploy
"succeeds," but the resource is malformed.

After every `create` or `deploy`, run a `get` and check the resource shape.
**`--query` is a global option — it must come BEFORE the subcommand,
not after.** `vrg vm get <name> --query ...` errors with "No such option";
`vrg --query ... vm get <name>` works.

```bash
vrg -o json --query status vm get <name>          # is it running?
vrg -o json --query 'machine_nics[].vnet' vm get <name>  # right networks?
```

`vrg vm get` does **not** include drives in any output mode (json, wide, or
table) despite `--help` claiming "-o json or -o wide for the full attribute
set". To verify a VM has its OS drive, query `machine_drives` directly —
see "vms vs machines key spaces" below.

If verification fails, treat the deploy as failed regardless of exit code.

## Use structured output, not table parsing

Every `list` and `get` supports `-o json`. Use it. Use `--query` for
single-field extraction — most cases don't need `jq`:

```bash
vrg --query status vm get web-01
vrg -o json vm list --query '[].name'
```

If you find yourself reaching for `grep` or `awk`, switch to `-o json` first.

## Branch on exit codes, not stderr

vrg's exit codes are distinct and stable:

| Code | Meaning                                  |
| ---- | ---------------------------------------- |
| 0    | success                                  |
| 6    | not found                                |
| 7    | conflict / ambiguous (e.g. two matches)  |
| 8    | validation error                         |
| 9    | timeout                                  |
| 10   | connection error                         |

Retry on 9 and 10. Do not retry on 6/7/8 — fix the call.

Code 7 is loud on purpose: vrg refuses to silently pick a match when a name
is ambiguous. Resolve by passing the numeric `$key` or by scoping
(`--tenant`, etc.).

## Recipe deploy gotchas

- **Questions without default values silently break post-deploy.** Run
  `vrg recipe get <name>` first, read the `questions` field, and pass every
  required answer via `--set` — even ones that "should" have defaults.
- **Pass NICs by network name, not network key.** Some code paths accept
  keys; the recipe parser resolves names. Names are the safe choice.
- **A "complete" task does not mean a bootable VM.** Always verify drives
  exist (see "Verify before reporting success").

## When vrg doesn't have a flag, drop to the raw API

vrg covers most operations but not all. When it doesn't, hit the API
directly using the credentials vrg already has:

```bash
curl -X PUT "$VERGE_HOST/api/v4/<plural_snake>/<key>" \
  -H "Authorization: Bearer $VERGE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{...}'
```

- Resource paths are **plural snake_case**: `machine_nics`, not `nic` or
  `machineNics`.
- `PATCH` is not implemented. Use `PUT` with the full object.
- The NIC resource is `machine_nics`.
- The drive resource is `machine_drives`.
- The `?machine=<key>` query param is silently ignored on list endpoints —
  the server returns the full list either way. Filter client-side with `jq`.

## `vms` and `machines` are two tables with two key spaces

`vrg vm` queries the `vms` table. NICs, drives, and stats live under
`machine_*` tables keyed by `machines.$key`, which is **not** the same as
`vms.$key`. The bridge is `vms.machine` (foreign key into `machines`).

So for adminjump: `vms.$key=32`, `machines.$key=38`, and its OS drive is at
`machine_drives` row with `machine=38`. Querying `machine_drives?machine=32`
will quietly return zero matches.

To look up drives for a VM by name:

```bash
mkey=$(curl -ks -H "Authorization: Bearer $KEY" "$HOST/api/v4/vms" \
  | jq -r '.[] | select(.name=="<vm_name>") | .machine')
curl -ks -H "Authorization: Bearer $KEY" "$HOST/api/v4/machine_drives" \
  | jq "[.[] | select(.machine==$mkey)]"
```

Note: `vms.drives` itself is `null` in the API — only `machine_drives` is
authoritative.

## Webhook subscriptions span two tables

Wiring a webhook to a system event involves two distinct tables:

- `webhook_urls` — the destination definition. Managed by `vrg webhook`.
- `webhooks` — the event subscription queue (which event fires which
  webhook_url). Currently raw-API only.

Creating a `webhooks` row is `POST /api/v4/webhooks` (not `PUT /<key>` —
PUT requires an existing key, per the general write-verb split). The
foreign key into `webhook_urls` is named **`webhook_url`** (singular).

Snapshot tables don't emit events directly. To react to snapshot creation,
subscribe to `vms.new` and filter for `is_snapshot=true` in the handler.

## Use `doctor` as a preflight before risky changes

`vrg doctor` runs ~20 health checks (DIMM, SMART, vSAN journal, certs, MTU,
LLDP, more) and exits 1 on any failure. It slots straight into a "should I
proceed?" gate before updates, mass migrations, or snapshot operations:

```bash
vrg -o json doctor | jq '.[] | select(.status == "fail")'
```

For faster targeted checks during routine work:

```bash
vrg doctor --check connectivity,alarms,updates
```

## Cross-cloud triage with `-A`

`-A` runs any list command across every configured profile — useful when
triaging whether a problem is local or fleet-wide:

```bash
vrg -A alarm list --level critical
vrg -A vm list --query '[?status==`error`]'
```

## Quiet mode for agent loops

`-q` suppresses progress spinners and success banners. Combine with
`-o json` for clean, parseable output in scripts:

```bash
vrg -q -o json vm list
```
