docs(swarm): add buildx 0.17+ prereq alongside compose v2

Second Docker-side prereq uncovered running a real deploy on a fresh
Debian trixie VM: images pulled fine but 'docker compose up --build'
bailed with 'compose build requires buildx 0.17.0 or later'. Debian's
buildx is stuck at 0.13, so the compose-plugin install must be paired
with a buildx plugin install.

- Prereqs list now requires docker compose version AND docker buildx
  version to be verified before enrolling.
- Install section renamed to 'Installing Compose v2 and Buildx on a
  worker', covers both plugins with arch-aware curl incantations and
  uname -m → compose-arch / buildx-arch mapping (compose uses x86_64,
  buildx uses amd64 — footgun).
- Adds a troubleshooting row for the buildx-too-old case and one for
  wrong-arch binary ('Invalid Plugins: ... exec format error').

Both uncovered on the live VM run; docs now match reality.
2026-04-18 21:00:26 -04:00
parent 60710d8f2f
commit 7c6fe3b576

@@ -71,28 +71,36 @@ On the **master**:
On each **worker**:
- DECNET installed.
- **Docker Engine + Compose v2 plugin** (the agent shells out to
`docker compose`, not the legacy `docker-compose`). This is the single
most common setup trap — verify with `docker compose version` before
enrolling. See [Installing Compose v2 on a worker](#installing-compose-v2-on-a-worker)
below if your distro ships the Docker engine but not the plugin
(Debian trixie's stock repos, for example, only carry v1).
- **Docker Engine + Compose v2 plugin + Buildx ≥ 0.17** (the agent shells
out to `docker compose` with `--build`, which in turn invokes buildx
for image builds). Verify both before enrolling:
```bash
docker compose version # expect v2.x.y
docker buildx version # expect v0.17.0 or newer
```
This is the single most common setup trap. Distros vary wildly in what
they ship — Debian trixie's stock repos have neither the compose v2
plugin nor a recent-enough buildx, for example. See [Installing
Compose v2 and Buildx on a worker](#installing-compose-v2-and-buildx-on-a-worker)
below.
- `sudo` for the user running `decnet agent` (MACVLAN/IPVLAN needs root).
`NOPASSWD` is convenient for unattended daemons.
- Outbound TCP to master:6514 (log forward) and inbound TCP on 8765 from
the master (deploy/teardown/health RPCs).
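
The two version checks in the prerequisites above can be scripted so a worker is verified before enrolling. This is a minimal sketch, not part of DECNET: the function names `version_ge`, `extract_version` and `check_worker_prereqs` are hypothetical, and it assumes GNU `sort -V` is available and that both plugins print an `X.Y.Z` version on stdout.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not shipped with DECNET.

# version_ge A B: succeed when dotted version A is >= B.
# Assumes GNU coreutils `sort -V` (version sort).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# extract_version: print the first X.Y.Z found on stdin (nothing if absent).
extract_version() {
  grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n1
}

# check_worker_prereqs: fail if compose is not v2 or buildx is below 0.17.0.
check_worker_prereqs() {
  local compose buildx
  compose=$(docker compose version 2>/dev/null | extract_version)
  buildx=$(docker buildx version 2>/dev/null | extract_version)
  if ! version_ge "${compose:-0.0.0}" 2.0.0; then
    echo "compose v2 plugin missing or too old (found: ${compose:-none})" >&2
    return 1
  fi
  if ! version_ge "${buildx:-0.0.0}" 0.17.0; then
    echo "buildx missing or older than 0.17.0 (found: ${buildx:-none})" >&2
    return 1
  fi
  echo "ok: compose ${compose}, buildx ${buildx}"
}
```

Source the file on the worker and run `check_worker_prereqs`; a non-zero exit status means at least one plugin still needs installing.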
### Installing Compose v2 on a worker
### Installing Compose v2 and Buildx on a worker
If `docker compose version` prints anything other than `Docker Compose
version v2.x.y`, you need the plugin. Pick the path that matches your
version v2.x.y`, or `docker buildx version` prints older than `v0.17.0`,
install the missing plugin(s). Pick the path that matches your
environment.
**Option A — Docker's official apt repo (recommended when it's available):**
```bash
# Debian/Ubuntu. Adds Docker's own package source, then installs the
# compose plugin alongside whatever docker-ce/docker.io you already have.
# compose + buildx plugins alongside whatever docker-ce/docker.io you
# already have.
sudo apt-get update
sudo apt-get install -y ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
@@ -103,42 +111,72 @@ echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.
https://download.docker.com/linux/debian $(. /etc/os-release && echo $VERSION_CODENAME) stable" \
| sudo tee /etc/apt/sources.list.d/docker.list
sudo apt-get update
sudo apt-get install -y docker-compose-plugin
docker compose version # expect v2.x.y
sudo apt-get install -y docker-compose-plugin docker-buildx-plugin
docker compose version # expect v2.x.y
docker buildx version # expect v0.17.0+
```
For Ubuntu, swap `debian` for `ubuntu` in both the keyring URL and the
sources.list entry.
**Option B — standalone binary (offline or restricted networks):**
**Option B — standalone binaries (offline or restricted networks):**
Both plugins install the same way: download the binary for your
architecture and drop it into Docker's CLI plugin directory.
```bash
# Drop the v2 binary into Docker's CLI plugin directory. Works on any
# distro with the Docker engine already installed.
# Confirm the worker's architecture first — x86_64, aarch64, armv7l.
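ARCH=$(uname -m)
case "$ARCH" in
x86_64) COMPOSE_ARCH=x86_64; BUILDX_ARCH=amd64 ;;
aarch64) COMPOSE_ARCH=aarch64; BUILDX_ARCH=arm64 ;;
armv7l) COMPOSE_ARCH=armv7; BUILDX_ARCH=arm-v7 ;;
# Anything else has no mapping here; the downloads below would fail.
*) echo "unsupported architecture: $ARCH" >&2 ;;
esac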
sudo mkdir -p /usr/local/lib/docker/cli-plugins
# Compose v2
sudo curl -fsSL \
"https://github.com/docker/compose/releases/download/v2.29.7/docker-compose-linux-$(uname -m)" \
"https://github.com/docker/compose/releases/download/v2.29.7/docker-compose-linux-${COMPOSE_ARCH}" \
-o /usr/local/lib/docker/cli-plugins/docker-compose
sudo chmod +x /usr/local/lib/docker/cli-plugins/docker-compose
# Buildx
sudo curl -fsSL \
"https://github.com/docker/buildx/releases/download/v0.18.0/buildx-v0.18.0.linux-${BUILDX_ARCH}" \
-o /usr/local/lib/docker/cli-plugins/docker-buildx
sudo chmod +x /usr/local/lib/docker/cli-plugins/docker-buildx
docker compose version
docker buildx version
```
If the worker can't reach GitHub directly (closed lab network, air-gapped
VM, etc.), download the binary on a box that *can* reach it and `scp` it
to the worker's `/usr/local/lib/docker/cli-plugins/docker-compose`;
that's the entire install.
VM, etc.), download the binaries on a box that *can* reach it, `scp`
them to the worker's `/usr/local/lib/docker/cli-plugins/`, and make sure
they are executable (`chmod +x`). That's the entire install.
**Watch the architecture.** Installing a `linux-x86_64` binary on an
`aarch64` worker (or vice versa) makes the `docker` CLI fail to load the
plugin with `exec format error`, and `docker info` lists it under
"Invalid Plugins". Check `uname -m` before downloading.
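
A downloaded plugin can be checked against the worker's architecture before it is copied into place. The sketch below is an illustration under assumptions: `elf_arch` is a hypothetical helper that reads the low byte of the ELF header's `e_machine` field (offset 18) with `od`, and it only recognises the three little-endian architectures mapped above.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not shipped with DECNET or Docker.

# elf_arch FILE: print the `uname -m` style architecture an ELF binary targets.
# Reads one byte at offset 18 (low byte of e_machine in a little-endian ELF).
elf_arch() {
  local machine
  machine=$(od -An -tu1 -j18 -N1 "$1" | tr -d ' ')
  case "$machine" in
    62)  echo x86_64 ;;    # EM_X86_64
    183) echo aarch64 ;;   # EM_AARCH64
    40)  echo armv7l ;;    # EM_ARM (any 32-bit ARM, reported here as armv7l)
    *)   echo "unknown(${machine})" ;;
  esac
}
```

For example, `[ "$(elf_arch ./docker-buildx)" = "$(uname -m)" ]` succeeds only when the downloaded file matches the host, which catches the wrong-arch case before `docker info` does.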
**Do not** install the legacy `docker-compose` (v1, the Python one) and
call it a day. The DECNET deployer invokes `docker compose ...` as a
subcommand, not `docker-compose ...` as a binary — they are different
programs with different code paths, and v1 is end-of-life.
**Symptom if you get this wrong.** `decnet deploy --mode swarm` returns a
500 from the worker with
`CalledProcessError: Command '['docker', 'compose', ...]' returned
non-zero exit status 125`. The worker's agent log will show the
`docker` CLI's own help text dumped into stderr because `docker` treats
`compose` as an unknown positional when the plugin isn't installed.
**Symptoms if you get this wrong.**
- No compose plugin at all: `CalledProcessError: Command '['docker',
'compose', ...]' returned non-zero exit status 125`, agent log shows
the `docker` CLI's help text (because `compose` is an unknown
subcommand).
- Compose plugin OK but buildx too old: `compose build requires buildx
0.17.0 or later` in the agent log, followed by `up --build` exit
status 1. Images pull fine, the build step is what fails.
- Wrong-arch binary: `Invalid Plugins: compose failed to fetch metadata:
fork/exec ...: exec format error` in `docker info`.
Time sync is a hard requirement — mTLS cert validation fails if worker and
master clocks differ by more than a few minutes. Run `chronyd`/`systemd-timesyncd`.
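
Clock skew between a worker and the master can be measured before enrolling. This is a rough sketch: `clock_skew_ok` is a hypothetical helper, and the 120-second default is an assumed threshold, since the text above only says "a few minutes".

```bash
#!/usr/bin/env bash
# Hypothetical helper, not shipped with DECNET.

# clock_skew_ok LOCAL_EPOCH REMOTE_EPOCH [MAX_SECONDS]
# Succeeds when the two Unix timestamps differ by at most MAX_SECONDS.
# The default of 120 seconds is an assumption, not a DECNET setting.
clock_skew_ok() {
  local a=$1 b=$2 max=${3:-120}
  local diff=$(( a - b ))
  [ "${diff#-}" -le "$max" ]   # ${diff#-} strips a leading minus sign
}
```

On a worker this could be called as `clock_skew_ok "$(date +%s)" "$(ssh master-host date +%s)"`, where `master-host` is a placeholder for however you reach the master; the SSH round trip adds a small amount of error to the measurement.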
@@ -565,7 +603,9 @@ decnet swarm decommission --name <each-worker> --yes
| Lines appear in `master.log` but not the dashboard | Ingester not running, or pointed at the wrong JSON path | `systemctl status decnet-ingester`, confirm `DECNET_INGEST_LOG_FILE` matches `listener --json-path` |
| `deploy --mode swarm` fails with `No enrolled workers` | Exactly what it says | `swarm enroll` at least one worker first |
| Worker returns 500 on `/deploy` with `ip addr show <nic>` error | The worker's agent is re-detecting its own NIC (this is the relocalize step) and can't find a usable interface | Run `ip route show default` on the worker — if empty, the default route is missing; fix the worker's networking before deploying |
| Worker returns 500 on `/deploy` with `docker compose ... exit status 125` and docker help text in the log | Compose v2 plugin is not installed on the worker; the stock `docker` binary is treating `compose` as an unknown subcommand | `docker compose version` on the worker. If it doesn't print v2.x.y, see [Installing Compose v2 on a worker](#installing-compose-v2-on-a-worker) |
| Worker returns 500 on `/deploy` with `docker compose ... exit status 125` and docker help text in the log | Compose v2 plugin is not installed on the worker; the stock `docker` binary is treating `compose` as an unknown subcommand | `docker compose version` on the worker. If it doesn't print v2.x.y, see [Installing Compose v2 and Buildx on a worker](#installing-compose-v2-and-buildx-on-a-worker) |
| Worker returns 500 on `/deploy` with `compose build requires buildx 0.17.0 or later` | Buildx plugin missing or too old on the worker; images pull but the build step fails | `docker buildx version` on the worker. If it's below v0.17.0, see [Installing Compose v2 and Buildx on a worker](#installing-compose-v2-and-buildx-on-a-worker) |
| `docker info` lists a CLI plugin under "Invalid Plugins: ... exec format error" | Wrong-architecture binary installed — e.g. x86_64 binary dropped onto an aarch64 host | Re-download the plugin binary matching `uname -m` and overwrite the file in `/usr/local/lib/docker/cli-plugins/` |
| Agent rejects master with `BAD_CERTIFICATE` | Master's own client cert (`~/.decnet/master/`) isn't in the worker's trust chain | Never happens if both sides were issued from the same CA. Check you didn't re-init the CA between `swarmctl` starts |
If things are really broken and you want a clean slate on the master: