The Template Is the Integration: Getting RKE2 Provisioning Right on VergeOS

I wanted Rancher to provision a Kubernetes cluster directly onto VergeOS — clone the VMs, install RKE2, hand me a Ready cluster, no manual node-building. VergeOS ships an integration for exactly this: a Rancher UI extension and a docker-machine node driver, installed from their Helm chart repo.

Almost all of it was boring in the best way. cert-manager and the Rancher chart went on in a few minutes. The node-driver extension was a couple of clicks under Extensions. Adding the chart repo, creating a cloud credential, filling in the cluster form — all easy. None of that is the story.

The story is the template VM. The node driver clones a template to build each node, and getting that template exactly right was the entire job. I got it wrong three times. Everything else was scaffolding around that one hard part.

Why the template is the whole ballgame

Clone-based provisioning has a simple, unforgiving model. For each node, the driver clones your template VM, generates an SSH keypair, injects it via cloud-init, powers the clone on, waits for the VM to report its IP, then SSHes in and lays down RKE2. Clone, configure, bootstrap.

Notice what's load-bearing there. The driver doesn't build a node — it copies one and wakes it up. Every assumption it makes about the node is really an assumption about the template. If the template can't announce its IP, the driver waits forever. If the template's cloud-init won't re-run, the injected SSH key never lands. The orchestration is commodity; the template is the contract. This is true of vSphere, of most VM-based provisioners, and it's true here.

So here is the unforgiving model in practice — three wrong templates before a right one.

Wrong template #1: the name didn't match. I pointed the driver at templateVm: ubuntu24 before any such VM existed. Each node went from create to delete in under ten seconds, over and over. Obvious in hindsight; the field is a literal name match.

Wrong template #2: the clone got renamed. I cloned the Ubuntu 24.04 recipe to make an ubuntu24 — and VergeOS handed back ubuntu24 1.0-9, version-suffixed. Still no exact match. Still looping.

Wrong template #3: the clone had no guest agent. This was the real one, and it cost the afternoon. I finally had a VM named ubuntu24, the driver cloned it cleanly, three node VMs appeared with real disks — and then every node hung here:

(cluster1-pool1-...) Powering on VM...
(cluster1-pool1-...) Waiting for VM to report IP address...

Forever. The VMs were on. The disks were there. But the driver never learned their IPs, so it never reached the SSH step, and CAPI recycled the nodes on a loop.

Getting it right means knowing how the IP is discovered

The driver finds a node's IP through the QEMU guest agent running inside the clone. My ubuntu24 was a clone of the recipe catalog entry — a bare cloud image that had never had qemu-guest-agent installed. No agent, no report, infinite wait.

There's a documented fallback: read the VM's DHCP lease from VergeOS. But my cluster lives on the External vnet, which shares a broadcast domain with my LAN — so my UniFi gateway answers DHCP first and VergeOS never records a lease. Both discovery paths were dead. The clone booted fine and sat there, perfectly healthy and completely mute.
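
The two discovery paths, and why losing both produces a silent hang, can be sketched as a polling loop. This is a minimal illustration in Python, not the driver's actual code: wait_for_ip, guest_agent_ip, and dhcp_lease_ip are hypothetical names, and the timeout is added for illustration only.

```python
import time
from typing import Callable, Optional

# Hypothetical probe type: returns an IP string, or None if nothing is known yet.
Probe = Callable[[], Optional[str]]


def wait_for_ip(
    guest_agent_ip: Probe,
    dhcp_lease_ip: Probe,
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll both discovery paths until one reports an IP or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        # Prefer the guest agent, then fall back to the DHCP lease.
        ip = guest_agent_ip() or dhcp_lease_ip()
        if ip:
            return ip
        if time.monotonic() >= deadline:
            raise TimeoutError("no IP from guest agent or DHCP lease")
        sleep(interval_s)
```

With an agent-less template on a segment where VergeOS never records the lease, both probes return None on every pass. Without a deadline, that is exactly the endless "Waiting for VM to report IP address..." line.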

I'll be honest about the detour, because it's the part worth learning from: my first theory was wrong. I was sure the recipe's "purge cloud-init after first boot" behavior had stripped the template. It hadn't — I'd never booted the thing, so nothing had purged. I'd built a story and gone looking for evidence to fit it. The fix only landed once I stopped theorizing and re-read the actual state: cloud-init was present and fine; qemu-guest-agent simply wasn't installed.

The right template, built deliberately:

  1. Deploy a fresh Ubuntu 24.04 from the recipe — not a clone of the catalog entry. A real deploy imports a bootable image, and as a bonus installs the guest agent and detaches the recipe after first boot. Give it a static IP and your SSH key so you can get in.
  2. Install or confirm qemu-guest-agent: package installed, service enabled, agent channel present on the VM. This is the one thing whose absence looks like a total mystery.
  3. Generalize it for cloning: remove the recipe's "disable network after first boot" cloud-init file (or clones never DHCP), reset netplan to DHCP, blank /etc/machine-id, drop the SSH host keys, then cloud-init clean --logs --seed so each clone runs cloud-init fresh.
  4. Power off. That's the golden image.
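
A quick pre-flight check before that final power-off can catch a missed step. The sketch below, in Python, inspects a filesystem root for the generalization state described in the steps above. Assumptions: template_problems is a hypothetical name; the paths are the usual Ubuntu locations for the guest agent binary, machine-id, SSH host keys, and cloud-init state; and /dev/virtio-ports/org.qemu.guest_agent.0 is the conventional virtio-serial channel for the QEMU guest agent, assumed here to be what the VM exposes. It deliberately does not check the recipe's network-disable file or netplan, since those filenames depend on the recipe.

```python
from pathlib import Path
from typing import List


def template_problems(root: Path = Path("/")) -> List[str]:
    """Return a list of reasons this filesystem is not ready to be cloned."""
    problems = []

    # Guest agent: binary installed and its channel device visible.
    # These are the usual Ubuntu/QEMU paths (an assumption); adjust if yours differ.
    if not (root / "usr/sbin/qemu-ga").exists():
        problems.append("qemu-guest-agent is not installed")
    if not (root / "dev/virtio-ports/org.qemu.guest_agent.0").exists():
        problems.append("guest agent channel is missing")

    # machine-id must be blank so each clone generates its own.
    machine_id = root / "etc/machine-id"
    if machine_id.exists() and machine_id.read_text().strip():
        problems.append("/etc/machine-id is not blank")

    # SSH host keys must be gone so clones do not share them.
    ssh_dir = root / "etc/ssh"
    if ssh_dir.is_dir() and any(ssh_dir.glob("ssh_host_*")):
        problems.append("SSH host keys are still present")

    # After `cloud-init clean`, the per-instance symlink should not exist.
    instance = root / "var/lib/cloud/instance"
    if instance.exists() or instance.is_symlink():
        problems.append("cloud-init state has not been cleaned")

    return problems
```

Run it inside the template VM just before shutdown. An empty list means these particular checks passed; it is not proof that the template is fully generalized.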

On the next provision, the log finally showed the line I'd been chasing:

Waiting for VM to report IP address...
Guest agent reported IP: 192.168.1.195
VM is ready at 192.168.1.195
Provisioning with ubuntu(systemd)...

Three nodes, 192.168.1.194 through .196, RKE2 v1.35.4, all Ready. After the template was right, the cluster came up in minutes — exactly the easy part everyone promises.

The fine print

qemu-guest-agent is non-negotiable on a shared-L2 vnet. It's the only IP-discovery path that survives when something other than VergeOS owns DHCP on the segment. Missing it is the difference between "works in minutes" and "hangs forever with no error."

Never clone the recipe catalog entry as your template. It's a bare image with no agent, and its cloud-init is a recipe Jinja2 template that re-renders on boot and won't cleanly detach. Deploy fresh, then prep.

agent_guest_info reads null even with the agent running fine — don't use it as a health check. I wasted time staring at it. The field stays empty while the VM is idle; the node's IP only surfaces during the driver's provisioning loop. The real signal is the provision log printing Guest agent reported IP — not anything in that field.
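
If you want to check for that signal mechanically, a small log scan is enough. This Python sketch looks for the success line exactly as it appears in the provision log quoted earlier; agent_reported_ip is a hypothetical helper name, and how you obtain the log lines is left to you.

```python
import re
from typing import Iterable, Optional

# Matches the success line from the provision log, capturing the IPv4 address.
_AGENT_IP = re.compile(r"Guest agent reported IP:\s*(\d{1,3}(?:\.\d{1,3}){3})")


def agent_reported_ip(log_lines: Iterable[str]) -> Optional[str]:
    """Return the first IP the guest agent reported, or None if it never did."""
    for line in log_lines:
        match = _AGENT_IP.search(line)
        if match:
            return match.group(1)
    return None
```

Fed the successful log above, it returns 192.168.1.195. Fed the hung log, which stops at "Waiting for VM to report IP address...", it returns None.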

The template name is a literal match, and VergeOS may rename your clone. A catalog clone comes back version-suffixed (ubuntu24 1.0-9). Check the actual VM name against the driver config before you blame anything else.
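
The check itself is trivial once you have the VM names in hand. A minimal Python sketch, assuming the driver compares names by exact, case-sensitive equality (the behavior I observed, not something I read in its source); template_exists and near_misses are hypothetical helpers, and fetching the VM list from VergeOS is left out.

```python
from typing import Iterable, List


def template_exists(vm_names: Iterable[str], template_vm: str) -> bool:
    """True only if some VM name equals the configured template name exactly."""
    return any(name == template_vm for name in vm_names)


def near_misses(vm_names: Iterable[str], template_vm: str) -> List[str]:
    """Names that start with the template name but are not an exact match."""
    return [
        name
        for name in vm_names
        if name != template_vm and name.startswith(template_vm)
    ]
```

For a VM list containing only "ubuntu24 1.0-9" and a configured name of "ubuntu24", template_exists returns False and near_misses returns that one version-suffixed name, which points at the rename rather than a typo.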

Generalize, or your clones will collide. Skipping the machine-id blank and host-key removal gives you nodes that fight over DHCP identifiers and present identical SSH host keys. Five lines of prep; do them.

What this all adds up to

In clone-based provisioning, invest in the template, not the orchestration. The driver is commodity and well-behaved. The reliability of your whole cluster lives in one VM you build once. Treat it like the artifact it is — version it, document its prep, never clone a pet.

When a clone "boots fine but does nothing," suspect the contract between host and guest. The failure here wasn't in Rancher or in VergeOS — it was in the silent handoff where the guest is supposed to announce itself. Being able to read both the provisioning logs and the hypervisor's view of the VM in the same loop is what turned a mystery into a missing package.

Re-read the state before you trust your theory. My cloud-init-purge hypothesis was clean, plausible, and wrong. The evidence was one SSH command away the whole time.

Installing Rancher on VergeOS is a quick-start. Provisioning a real cluster on it comes down to a single VM you have to get exactly right — and once you do, the rest really is as easy as it looked. The template isn't a prerequisite for the integration. The template is the integration.

A parting gift

One last thing, and it's the part I'd actually keep. The point of doing work like this with an agent isn't speed; it's that the lessons don't evaporate when the terminal closes. Every gotcha from this build (the guest-agent requirement, the DHCP race on External, the boot_order letter codes, the exact golden-image prep) got written down into a CLAUDE.md that lives in the project directory, so the next session starts already knowing what cost me an afternoon. I'm leaving it here as a present: drop it next to your own VergeOS project and your first cluster comes up the boring way, not the three-wrong-templates way. The cluster was never really the deliverable. The playbook is.