vrg meets Claude: agent-first ops, in practice

The silent deploy

The task was unremarkable: spin up a new VM from a recipe, attach it to the management network, hand it back. I asked Claude to do it. It ran vrg recipe deploy ubuntu-base --set name=metrics-02, watched the task complete, and reported success. Exit code zero. Task status complete. By any reasonable measure, the deploy worked.

Except it hadn't. The recipe had a question called SELECT_OS_TIER with no default value. The deploy didn't fail on the missing answer — it just skipped the stage that depended on it. The VM existed. The OS drive didn't.

What I want to focus on isn't the bug. Recipes with undefaulted questions are a known sharp edge, and the fix is boring (pass --set SELECT_OS_TIER=performance and move on). What's interesting is what happened next.

Claude had been told, in its working instructions, not to trust exit codes for resource creation. After every create or deploy, run a get and check the resource looks the way you expect. So it did. The catch — flagged in the same notes — is that vrg vm get doesn't surface drives in any output mode; for that you query the underlying machine_drives table directly. Claude pulled the host and key out of ~/.vrg/config.toml, hit /api/v4/machine_drives, filtered for the new VM. Empty. It paused, re-read the recipe, noticed the question without a default, re-ran the deploy with the missing argument, and verified again. The drive was there. Now it reported success.

A human running the same command would almost certainly have moved on. Exit zero, task complete, ticket closed — and the missing drive surfaces an hour later when the VM won't boot, or a week later when someone notices the metrics agent never came up. The agent caught it in the same minute it happened, because verifying its own work is cheap and it had been told to do so.

That's the part of agent-mode ops that took me a while to internalize. The interesting property isn't that Claude can run vrg commands faster than I can — it can't, really, and the commands aren't the bottleneck anyway. The interesting property is that, given the right instructions and the right tools, it will actually do the tedious verification step that humans skip. And it turns out a surprising amount of "ops is hard" is really "ops is tedious, so we cut corners."

The rest of this post is about what the right tools means in practice, and why vrg happens to have most of them.

What the right tools look like

The verification step from the previous section wasn't a CLI one-liner — vrg vm get doesn't expose drives, so it dropped to the raw API. But verification was still cheap: a curl, a jq filter, done. That cheapness is the point. vrg and the API beneath it both treat structured output as a first-class concern, not a --json flag bolted on for compatibility. Every vrg list and get command takes -o json, and --query plucks fields with dot notation so simple checks don't need jq at all. The contract isn't "the CLI does everything"; it's "every layer speaks structured data, top to bottom." When verification is a one-liner — even a multi-tool one-liner — it gets written. When it's grep -A 5 ... | awk ..., it gets skipped.

Exit codes are the next layer. Most CLIs collapse every non-success to exit 1, which is fine for humans (who read the error) and useless for agents (who need to branch on the failure mode). vrg uses distinct codes: 6 for not found, 7 for conflict or ambiguity, 8 for validation, 9 for timeout, 10 for connection error. That's the difference between an agent that knows to retry on a transient network blip and stop on a malformed argument, versus one that has to grep stderr to make the same decision — and gets it wrong the day stderr changes wording.

The ambiguity case deserves its own beat. Ask vrg for a VM named web-01 when two tenants both have one, and it doesn't pick. It exits 7 and prints both matches. This sounds annoying until you imagine the alternative: an agent silently operating on the wrong VM because the resolver took the first hit. Loud failure beats quiet miscorrection, especially when the caller can't see the resolver's confidence.

Then doctor, vrg's built-in health audit — a sweep across roughly twenty checks (DIMM health, drive SMART, vSAN journal, certificates, MTU consistency, LLDP switch diversity, and more) that reports each as pass, warn, fail, or skip and exits 1 if anything fails. As a preflight before a change window, it drops directly into an agent's "should I proceed?" gate: vrg -o json doctor | jq '.[] | select(.status == "fail")' is the entire check. No bespoke health script to maintain.

And -A, which runs any list command across every configured profile. When the agent is triaging "is this alarm hitting one cloud or all of them?", that's a single flag instead of a loop.

None of these are agent features. They're plain CLI hygiene — structured output, meaningful exit codes, loud failure on ambiguity, a built-in preflight, fan-out across environments. They make agents work because they make any careful caller work: bash scripts, CI jobs, the operator at 2am with one eye open. The framing that helped me most was "agents are just scripts that read documentation." The tools that work for an agent are the tools that worked for the script you should have been writing all along.

When the CLI runs out

A different task: wire a notification to fire when someone takes a snapshot of a critical VM. Conceptually it's two pieces — a webhook destination (where the message goes) and a subscription (which event triggers it). I asked Claude to set both up.

The first piece was a one-liner: vrg webhook create --name snap-alerts --url .... Done. The second is where things got interesting.

Claude tried the obvious thing — looking for a vrg webhook subscribe or similar — and didn't find one. So it read the relevant docs in the working directory and learned three things: webhook subscriptions live in a separate table (webhooks, not webhook_urls) that vrg doesn't currently expose; snapshot tables don't emit events of their own; the way to react to snapshot creation is to subscribe to vms.new and filter for is_snapshot=true on the receiving end.

None of that was in vrg --help. All of it was in the docs sitting next to the project. And it's the kind of detail that a human would have to either remember or rediscover — the kind of detail that belongs in a CLAUDE.md so the next agent doesn't have to learn it the hard way.

What happened next is the part that's easy to miss. Claude pulled the host and key out of ~/.vrg/config.toml — same place vrg reads them — constructed a POST /api/v4/webhooks with curl, and wired the subscription directly. Not a workaround, not a hack — just a different tool from the same toolbox, used for the step the CLI doesn't cover yet.

The framing "Claude uses the CLI" undersells what's actually happening. Claude has the CLI, the raw API, the docs, the system's structured output, and jq — and switches between them based on which one fits the step in front of it. The CLI's job isn't to be sufficient. Its job is to handle the 80% cleanly enough that the remaining 20% — the workaround, the raw call, the doc-divergence — is visible. You can see exactly where the CLI ends and the API begins, because the CLI fails closed and says why.

That's the same property that made the first vignette work. vrg recipe deploy exited 0; the verification caught the silent skip; Claude fixed it. Here, vrg webhook couldn't do part of the job; the CLI didn't pretend it could; Claude reached for the next tool. The CLI's value is partly what it does and partly how it admits what it doesn't.

What this all adds up to

If you're considering letting Claude — or any LLM agent — operate against your infrastructure, three things mattered more than I expected going in.

Tell the agent to verify, every time. Exit 0 is a hypothesis, not a result. A get after every create or deploy is the cheapest reliability win available, and humans skip it for boring reasons — fatigue, time pressure, "it'll be fine" — that don't apply to agents. Make verification part of the instructions, not part of the hope.

Treat structured output as the contract. If you're parsing tables with awk for human-driven ops, your agent will too — and the day a column shifts width, it'll make a confident decision based on garbage. JSON is the contract. Tables are for humans.

Design CLIs as if the caller might be a careful script that reads the docs. Because that's exactly what an LLM agent in agent mode is. The affordances that make bash scripts, CI jobs, and 2am-operators successful — meaningful exit codes, structured output, fail-loud-on-ambiguous, a built-in preflight — are what make agent-mode ops work too. Agents aren't asking for new features. They're asking, very politely, for the features careful scripts have always wanted.

If you're a VergeOS user and want to skip the rediscovery, I've packaged the gotchas Claude learned the hard way as a CLAUDE.md you can drop straight into your working directory. [download link]

CLAUDE

CLAUDE.md

6 KB

vrg meets Claude: agent-first ops, in practice

The silent deploy

What the right tools look like

When the CLI runs out

What this all adds up to

Read more

What Is a... Site-to-Site Sync?

vrg Meets Bubble Tea: I Built a VergeOS TUI

Self-Hosting a Private LLM API on VergeOS with Lemonade Server

What is a... great VergeOS demo box?