SWARM Mode
anti edited this page 2026-04-19 00:23:25 -04:00

SWARM Mode — Multi-host Deployment

SWARM is DECNET's multi-host deployment posture. One master orchestrates N workers (real hosts), each running a slice of the decky fleet. The control plane speaks HTTP+mTLS; the log plane speaks RFC 5425 syslog over mTLS on TCP 6514. Everything is signed by a single DECNET-managed CA on the master.

If you want a single-box deployment, stop here and read Deployment Modes → UNIHOST. SWARM has more moving parts and is not the right starting point for first runs.

See also: CLI reference, Deployment modes, Logging and syslog, Networking: MACVLAN and IPVLAN, Teardown.


Architecture in one picture

┌──────────────────────── MASTER ────────────────────────┐      ┌───────── WORKER ─────────┐
│                                                        │      │                          │
│  decnet api         :8000   (dashboard / REST)         │      │  decnet agent      :8765 │
│  decnet swarmctl    :8770   (SWARM control plane)      │◀mTLS▶│    (FastAPI/uvicorn)     │
│  decnet listener    :6514   (syslog-over-TLS sink)     │◀mTLS─│  decnet forwarder        │
│  decnet ingester            (parses master.json)       │      │    (tails local log file,│
│                                                        │      │     ships RFC 5425 over  │
│  SQLite/MySQL (shared repo, SwarmHost + DeckyShard)    │      │     TCP 6514 to master)  │
│                                                        │      │                          │
│  ~/.decnet/ca/   (self-signed CA — ca.crt, ca.key)     │      │  ~/.decnet/agent/        │
│  ~/.decnet/master/  (master client cert for swarmctl)  │      │    (CA-issued bundle)    │
│                                                        │      │  docker / compose        │
└────────────────────────────────────────────────────────┘      └──────────────────────────┘

Four long-running processes on the master, two on each worker. Each process is a separate supervised unit — if swarmctl crashes, the main decnet api, the log listener, and the ingester keep running. This mirrors the start_new_session=True subprocess pattern used everywhere else in DECNET (decnet/cli.py::api and friends).
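
That pattern is small enough to sketch. A minimal illustration (not DECNET's actual _daemonize(), which layers logging and supervision on top of this):

```python
import subprocess

def spawn_detached(cmd):
    """Launch cmd in its own session so it survives the parent's exit.

    start_new_session=True runs setsid() in the child, making it a
    session leader detached from the parent's terminal and signal group:
    killing the parent does not take the child down with it.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,
    )
    return proc.pid
```

This is why a swarmctl crash leaves the api, listener, and ingester untouched: each is its own session, with no shared fate.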

Ports cheat sheet

Port  Process          Host    Protocol           mTLS?
8000  decnet api       master  HTTP               no
8770  decnet swarmctl  master  HTTP               no *
6514  decnet listener  master  syslog (RFC 5425)  yes
8765  decnet agent     worker  HTTPS              yes
5140  local collector  worker  syslog             no (loopback)

* swarmctl binds to 127.0.0.1 by default and is called by the local decnet CLI. If you need to drive it from outside the master box, put it behind a reverse proxy with your own auth — it is not hardened for public exposure.


Prerequisites

On the master:

  • DECNET installed (pip install -e .) in the venv you plan to run from.
  • Write access to ~/.decnet/ (CA and master bundle land here).
  • A listen address for port 6514 that workers can reach, plus outbound TCP from the master to each worker's port 8765.
  • Docker is not needed on the master unless the master is also a worker.

On each worker:

  • DECNET installed.
  • Docker Engine + Compose v2 plugin + Buildx ≥ 0.17 (the agent shells out to docker compose with --build, which in turn invokes buildx for image builds). Verify both before enrolling:
    docker compose version    # expect v2.x.y
    docker buildx version     # expect v0.17.0 or newer
    
    This is the single most common setup trap. Distros vary wildly in what they ship — Debian trixie's stock repos have neither the compose v2 plugin nor a recent-enough buildx, for example. See Installing Compose v2 and Buildx on a worker below.
  • sudo for the user running decnet agent (MACVLAN/IPVLAN needs root). NOPASSWD is convenient for unattended daemons.
  • Outbound TCP to master:6514 (log forward) and inbound TCP on 8765 from the master (deploy/teardown/health RPCs).
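
A pre-flight check for both plugins can be scripted before enrollment; a sketch that parses the version banners (the exact buildx banner format is an assumption based on common output):

```python
import re

def parse_semver(banner: str):
    """Pull the first vX.Y.Z out of a plugin's version banner."""
    m = re.search(r"v(\d+)\.(\d+)\.(\d+)", banner)
    return tuple(map(int, m.groups())) if m else None

def plugins_ok(compose_banner: str, buildx_banner: str) -> bool:
    """True only if Compose is v2+ and buildx is >= 0.17.0."""
    compose = parse_semver(compose_banner)
    buildx = parse_semver(buildx_banner)
    return (compose is not None and compose >= (2, 0, 0)
            and buildx is not None and buildx >= (0, 17, 0))
```

Feed it the stdout of `docker compose version` and `docker buildx version`; a None result means the plugin is missing entirely (or is legacy v1, which prints no `v` prefix).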

Installing Compose v2 and Buildx on a worker

If docker compose version prints anything other than Docker Compose version v2.x.y, or docker buildx version prints older than v0.17.0, install the missing plugin(s). Pick the path that matches your environment.

Option A — Docker's official apt repo (recommended when it's available):

# Debian/Ubuntu. Adds Docker's own package source, then installs the
# compose + buildx plugins alongside whatever docker-ce/docker.io you
# already have.
sudo apt-get update
sudo apt-get install -y ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/debian/gpg \
     -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] \
     https://download.docker.com/linux/debian $(. /etc/os-release && echo $VERSION_CODENAME) stable" \
     | sudo tee /etc/apt/sources.list.d/docker.list
sudo apt-get update
sudo apt-get install -y docker-compose-plugin docker-buildx-plugin
docker compose version    # expect v2.x.y
docker buildx version     # expect v0.17.0+

For Ubuntu, swap debian for ubuntu in both the keyring URL and the sources.list entry.

Option B — standalone binaries (offline or restricted networks):

Both plugins install the same way: download the binary for your architecture and drop it into Docker's CLI plugin directory.

# Confirm the worker's architecture first — x86_64, aarch64, armv7l.
ARCH=$(uname -m)
case "$ARCH" in
  x86_64)  COMPOSE_ARCH=x86_64;  BUILDX_ARCH=amd64 ;;
  aarch64) COMPOSE_ARCH=aarch64; BUILDX_ARCH=arm64 ;;
  armv7l)  COMPOSE_ARCH=armv7;   BUILDX_ARCH=arm-v7 ;;
esac

sudo mkdir -p /usr/local/lib/docker/cli-plugins

# Compose v2
sudo curl -fsSL \
     "https://github.com/docker/compose/releases/download/v2.29.7/docker-compose-linux-${COMPOSE_ARCH}" \
     -o /usr/local/lib/docker/cli-plugins/docker-compose
sudo chmod +x /usr/local/lib/docker/cli-plugins/docker-compose

# Buildx
sudo curl -fsSL \
     "https://github.com/docker/buildx/releases/download/v0.18.0/buildx-v0.18.0.linux-${BUILDX_ARCH}" \
     -o /usr/local/lib/docker/cli-plugins/docker-buildx
sudo chmod +x /usr/local/lib/docker/cli-plugins/docker-buildx

docker compose version
docker buildx version

If the worker can't reach GitHub directly (closed lab network, air-gapped VM, etc.), download the binaries on a box that can reach it and scp them to the worker's /usr/local/lib/docker/cli-plugins/ — that's the entire install.

Watch the architecture. Downloading linux-x86_64 onto an aarch64 worker (or vice versa) gets you a failed to fetch metadata: exec format error from the docker CLI, and the plugin is listed under "Invalid Plugins" in docker info. uname -m is your friend.

Do not install the legacy docker-compose (v1, the Python one) and call it a day. The DECNET deployer invokes docker compose ... as a subcommand, not docker-compose ... as a binary — they are different programs with different code paths, and v1 is end-of-life.

Symptoms if you get this wrong.

  • No compose plugin at all: CalledProcessError: Command '['docker', 'compose', ...]' returned non-zero exit status 125, agent log shows the docker CLI's help text (because compose is an unknown subcommand).
  • Compose plugin OK but buildx too old: compose build requires buildx 0.17.0 or later in the agent log, followed by up --build exit status 1. Images pull fine, the build step is what fails.
  • Wrong-arch binary: Invalid Plugins: compose failed to fetch metadata: fork/exec ...: exec format error in docker info.

Time sync is a hard requirement — mTLS cert validation fails if worker and master clocks differ by more than a few minutes. Run chronyd/systemd-timesyncd.


Setup walkthrough

This is a complete, literal walkthrough. Follow it top to bottom the first time. Every command is either run on master or on worker — annotated below each block.

1. Master — start the control plane

# Start the SWARM controller. First run creates ~/.decnet/ca/ automatically
# (self-signed CA, ca.crt/ca.key) and ~/.decnet/master/ (client cert for
# the master process's own identity when talking to worker agents).
decnet swarmctl --daemon --host 127.0.0.1 --port 8770

# Start the log listener. First run creates master.log (RFC 5424 forensic
# sink, every line verbatim) and master.json (one JSON object per event for
# the ingester).
mkdir -p ~/.decnet/master-logs
decnet listener --daemon \
     --host 0.0.0.0 --port 6514 \
     --log-path ~/.decnet/master-logs/master.log \
     --json-path ~/.decnet/master-logs/master.json

# Confirm both are up.
curl -sf http://127.0.0.1:8770/health && echo OK
ss -tlnp | grep -E '8770|6514'

--daemon detaches to a new session (same _daemonize() as decnet api). Without it, the command stays in the foreground.

At this point:

  • ~/.decnet/ca/ca.crt is the CA every worker will trust.
  • ~/.decnet/ca/ca.key must never leave the master. Treat it like an SSH host key: losing it means re-enrolling every worker.
  • ~/.decnet/master/ holds the master's own client certificate that swarmctl uses to authenticate outbound RPCs to worker agents.
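
For reference, the trust relationship swarmctl establishes when dialing an agent looks roughly like this stdlib ssl setup (a sketch, not DECNET's actual client code; the master.crt/master.key filenames are assumptions, since the wiki doesn't name the files in ~/.decnet/master/):

```python
import ssl

def mtls_client_context(ca_file=None, cert_file=None, key_file=None):
    """Client-side mTLS context: trust only the DECNET CA and present
    our own CA-issued client cert so the agent can authenticate us."""
    # PROTOCOL_TLS_CLIENT defaults to CERT_REQUIRED plus hostname
    # checking, which is why worker certs must carry the dialed
    # address in their SANs.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)       # ca.crt
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

Note there is no fallback: if either side presents a cert the CA didn't sign, the handshake fails before any RPC payload moves.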

2. Master — enroll a worker

The enrollment command is a single call that does four things:

  1. Generates a worker keypair + CSR on the master (the private key is written directly to the output bundle; it never touches the wire).
  2. Signs the CSR with the CA, producing worker.crt.
  3. Records a SwarmHost row in the shared repo with status enrolled and the cert fingerprint.
  4. Writes the bundle files to --out-dir for you to ship to the worker.
decnet swarm enroll \
     --name decky-vm \
     --address 192.168.1.13 \
     --sans decky-vm.lan,192.168.1.13 \
     --out-dir /tmp/decky-vm-bundle

Add --updater to also issue a second bundle for the remote-update daemon (see Remote-Updates); the updater bundle lands in /tmp/decky-vm-bundle-updater/.

--name is the worker's DECNET identity — it becomes the cert CN and the source_worker tag on every log line forwarded from that host. Pick names you can grep for. Must be unique; re-enrolling the same name is rejected.

--address is the worker's IP as reachable from the master. This is what the master's control-plane client will connect to for deploy/teardown RPCs.

--sans is a comma-separated list of Subject Alternative Names. Include every DNS name and IP the master might use to reach the worker. At minimum, include the IP you passed to --address.

Output (/tmp/decky-vm-bundle/):

ca.crt        # the DECNET CA certificate
worker.crt    # CA-signed client+server cert for this worker
worker.key    # worker private key (mode 0600)

3. Ship the bundle to the worker

Any secure channel works — this is a plain file copy. scp, rsync, sshpass in a closet lab — pick your poison:

# From the master:
scp -r /tmp/decky-vm-bundle/* anti@192.168.1.13:~/.decnet/agent/

On the worker, the bundle must land at ~/.decnet/agent/ of the user that will run decnet agent. Watch out for sudo: if you run the agent under sudo, $HOME expands to /root, not /home/anti. Either put the bundle under /root/.decnet/agent/, or pass --agent-dir to override.

After copying, chmod 600 ~/.decnet/agent/worker.key and delete the master copy.

4. Worker — start the agent + forwarder

# On the worker, as the user whose $HOME holds the bundle (or with --agent-dir).
sudo decnet agent --daemon \
     --host 0.0.0.0 --port 8765 \
     --agent-dir /home/anti/.decnet/agent

# The forwarder tails the worker's local decky log file and ships each
# line, octet-framed and mTLS-wrapped, to the master listener.
decnet forwarder --daemon \
     --master-host <master-ip> \
     --master-port 6514 \
     --log-file /var/log/decnet/decnet.log \
     --state-db ~/.decnet/agent/forwarder.db \
     --agent-dir /home/anti/.decnet/agent

--state-db holds a single table that records the forwarder's byte offset into the log file. On reconnect after a master outage, the forwarder resumes from the stored offset — no duplicates, no gaps. Truncation (logrotate) is detected (st_size < offset) and resets the offset to 0.
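
The offset bookkeeping is compact enough to sketch (illustrative, not the forwarder's actual code):

```python
import os

def read_new_lines(path: str, offset: int):
    """Return (new_lines, new_offset), handling logrotate truncation.

    If the file shrank below our stored offset, it was truncated or
    rotated: start over from byte 0 rather than silently skipping data.
    """
    size = os.stat(path).st_size
    if size < offset:          # truncation detected
        offset = 0
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read()
    # Only complete lines are committed; a trailing partial line is
    # left for the next poll by not advancing the offset past it.
    lines = chunk.split(b"\n")
    tail = lines.pop()         # b"" if chunk ended in a newline
    new_offset = offset + len(chunk) - len(tail)
    return [l.decode() for l in lines if l], new_offset
```

Persisting new_offset only after the master acknowledges the shipped lines is what gives the no-duplicates, no-gaps property across reconnects.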

--master-host / --master-port can also be set via DECNET_SWARM_MASTER_HOST / DECNET_SWARM_MASTER_PORT so operators can bake them into a systemd unit or .env file.

5. Master — confirm the worker is alive

# List enrolled workers. Fresh enrollments are status=enrolled until the
# first successful health ping flips them to active.
decnet swarm list

# Poll worker agents. On success, flips SwarmHost.status to active and
# stamps SwarmHost.last_heartbeat.
decnet swarm check

decnet swarm list
# name=decky-vm  status=active  last_heartbeat=2026-04-18T...

If check reports reachable: false, the usual suspects are: the agent isn't running, the master cannot reach worker:8765 (firewall / NAT), --address at enrollment doesn't match the worker's actual IP, or clock skew is breaking cert validity.

6. Deploy deckies across the swarm

decnet deploy --mode swarm --deckies 6 --services ssh,smb --dry-run
# Round-robins 6 deckies across all enrolled workers (with status IN
# (enrolled, active)) and prints the compose-shard plan.

decnet deploy --mode swarm --deckies 6 --services ssh,smb
# Live run: POSTs each worker's shard to swarmctl, which fans out to each
# agent's /deploy, which calls the same deployer.py used in UNIHOST.

Sharding is round-robin by enrollment order. If you have workers A and B and ask for 3 deckies, A gets 2 and B gets 1. If you want a different distribution, run two separate deploy calls with filtered host lists (feature request; see Known Limitations).
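
The round-robin rule can be sketched in a few lines (illustrative; assumes workers iterate in enrollment order):

```python
from itertools import cycle
from collections import defaultdict

def shard_round_robin(workers, n_deckies):
    """Assign decky indices to workers in enrollment order."""
    if not workers:
        raise RuntimeError("No enrolled workers")
    plan = defaultdict(list)
    # zip against cycle(workers) deals deckies out like cards.
    for decky, worker in zip(range(n_deckies), cycle(workers)):
        plan[worker].append(decky)
    return dict(plan)
```

With workers ["A", "B"] and 3 deckies, A ends up with two shards and B with one, matching the example above.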

Empty swarm is a hard error: deploy --mode swarm with zero enrolled workers exits non-zero with No enrolled workers.


Command reference

All of these live in decnet/cli.py. Run decnet <cmd> --help for the authoritative option list. What follows are the knobs you will actually care about.

decnet swarmctl

Master-side SWARM control plane. FastAPI app at decnet.web.swarm_api:app, launched as a subprocess of uvicorn. Daemonizable.

--host 127.0.0.1   Bind address. Keep this on loopback unless you know
                   what you're doing; swarmctl is called by the local CLI.
--port 8770        Bind port.
--daemon / -d      Detach to background.

decnet listener

Master-side RFC 5425 syslog-over-TLS sink. Validates every client cert against ~/.decnet/ca/ca.crt, tags each line with the peer's CN as source_worker, and writes two sinks: an RFC 5424 .log file and a parsed .json file (one object per event).

--host 0.0.0.0     Bind address.
--port 6514        IANA-assigned syslog-TLS port.
--log-path <path>  Raw RFC 5424 sink.      default: ./master.log
--json-path <path> Parsed JSON sink.       default: ./master.json
--ca-dir <dir>     CA to verify peers.     default: ~/.decnet/ca
--daemon / -d      Detach to background.

decnet agent

Worker-side control-plane daemon. FastAPI on HTTPS with ssl.CERT_REQUIRED. Endpoints: /deploy, /teardown, /status, /health, /mutate. All require a CA-signed client cert (the master's).

--host 0.0.0.0        Bind address.
--port 8765           Bind port.
--agent-dir <path>    Override bundle location. Required when running under
                      sudo or any user whose $HOME doesn't hold the bundle.
--daemon / -d         Detach to background.

decnet forwarder

Worker-side log shipper. Tails --log-file (default: DECNET_INGEST_LOG_FILE, i.e. the same file the local collector writes), frames each line per RFC 5425 octet-counting, and writes it to master:6514 over mTLS. Offset state is persisted in SQLite so a master crash is recoverable without data loss.
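
Octet-counting framing per RFC 5425 is simply the message's byte length, a space, then the message; a sketch of both directions:

```python
def frame(msg: str) -> bytes:
    """RFC 5425 octet-counting: MSG-LEN SP SYSLOG-MSG."""
    payload = msg.encode("utf-8")
    return str(len(payload)).encode("ascii") + b" " + payload

def unframe(buf: bytes):
    """Split one framed message off the front of a TLS read buffer.

    Returns (message, remainder), or (None, buf) if the buffer does
    not yet hold a complete frame.
    """
    sep = buf.index(b" ")
    n = int(buf[:sep])
    start = sep + 1
    if len(buf) - start < n:
        return None, buf          # incomplete; wait for more bytes
    return buf[start:start + n], buf[start + n:]
```

Length-prefixing (rather than newline-delimiting) is what lets log lines legally contain embedded newlines and keeps message boundaries unambiguous over a TCP stream.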

--master-host <ip>     Master IP. env: DECNET_SWARM_MASTER_HOST
--master-port <int>    Listener port. default: 6514
--log-file <path>      File to tail. default: DECNET_INGEST_LOG_FILE
--state-db <path>      Offset SQLite. default: <agent-dir>/forwarder.db
--agent-dir <path>     Bundle dir. default: ~/.decnet/agent
--poll-interval <sec>  File tail interval. default: 0.5
--daemon / -d          Detach to background.

decnet swarm enroll

Issues a worker bundle and records a SwarmHost row.

--name <str>       Worker identity (CN + source_worker tag). Required.
--address <ip>     IP/hostname the master uses to reach the agent. Required.
--sans a,b,c       Subject Alternative Names. default: [--address]
--out-dir <path>   Where to write the bundle. default: ./<name>-bundle
--agent-port <int> Port to record on the host row. default: 8765
--notes <str>      Free-form annotation, shown in `swarm list`.

decnet swarm list

Prints the SwarmHost rows as a table.

--status <enrolled|active|unreachable|decommissioned>
                   Filter. default: all except decommissioned.
--json             Emit JSON, not a table. Useful for scripting.

decnet swarm check

Synchronously polls every active/enrolled agent's /health. On success, flips status to active and stamps last_heartbeat. On failure, flips to unreachable and records the error.

decnet swarm deckies

Lists every deployed decky across the swarm, joined with its owning worker host's identity. swarm list answers which workers are enrolled; this answers which deckies are running and where.

--host <name|uuid>  Filter to a single worker (name is looked up → uuid).
--state <state>     Filter by shard state: pending | running | failed | torn_down.
--json              Emit JSON, not a table.

Columns: decky, host, address, state, services. State is colored (green=running, red=failed, yellow=pending, dim=torn_down).

decnet swarm decommission

Marks a host decommissioned in the repo, tears down any running deckies on it via the agent (if reachable), and revokes the worker's cert from the master's active-set. The worker's bundle files are not deleted from the worker — you are expected to wipe those out of band.

--name <str>  | --uuid <str>   Identify by either. One is required.
--yes                         Skip confirmation prompt.
--keep-deckies                Leave containers running on the worker.
                              Use this when reassigning hardware.

decnet deploy --mode swarm

Round-robins the requested deckies across enrolled workers and dispatches to swarmctl, which POSTs each shard to the matching agent. Compose generation is shared with UNIHOST; only the distribution differs.

--deckies <n>           Total fleet size across all workers.
--services a,b,c        Fixed service set for every decky.
--randomize-services    Per-decky random subset from the catalog.
--archetype <name>      Pick from Archetypes (see wiki page).
--dry-run               Print the shard plan; no RPC.

Log pipeline — what actually happens to an attack event

  1. Attacker hits a decky. The decky's in-container emit helper writes an RFC 5424 line to stdout and to /var/log/decnet/decnet.log inside the container. (See Logging and syslog.)
  2. Worker's local collector picks the event up over loopback syslog (worker:5140). Plaintext is fine here — it never leaves the host.
  3. The collector appends the parsed RFC 5424 to DECNET_INGEST_LOG_FILE on the worker.
  4. decnet forwarder tails that file, octet-frames each line, connects to master:6514 over mTLS (trust root = DECNET CA), and sends it.
  5. decnet listener on the master validates the peer cert, extracts the CN as source_worker, enriches the structured data, and writes to master.log + master.json.
  6. decnet ingester tails master.json and inserts rows into the shared repo — the same code path UNIHOST uses.
  7. The web dashboard (decnet api) queries the repo; live-logs stream over SSE.

Verify on the wire. A tcpdump -i any 'port 6514' on the master should show only ClientHello / ServerHello / Certificate / encrypted ApplicationData records. No plaintext <13>1 2026-... — if you see those, something is wrong and logs are leaking in the clear. The listener refuses plaintext connections outright.

Verify provenance. Every line in master.log has a source_worker=... SD element populated from the peer cert's CN. The master does not trust a source_worker the worker tries to set itself — it is always overwritten with the authenticated cert identity. A compromised worker cannot impersonate another worker's name.
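
That trust rule is tiny but worth spelling out; a sketch of the idea (the event-dict shape here is an assumption, not the listener's actual schema):

```python
def stamp_provenance(event: dict, peer_cn: str) -> dict:
    """Overwrite any client-supplied source_worker with the CN taken
    from the authenticated peer certificate.

    The client's own claim is never consulted: even a compromised
    worker can only ever appear under its own certified identity.
    """
    tagged = dict(event)
    tagged["source_worker"] = peer_cn   # always the mTLS identity
    return tagged
```

The important property is that the assignment is unconditional; there is no "use the client's value if present" branch to exploit.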


Operational concerns

Pushing code updates without SSH

Once enrolled with --updater, the master can push new code to workers over mTLS — no more scp/sshpass cycles. See Remote-Updates for the decnet swarm update command, auto-rollback semantics, and the --include-self opt-in for upgrading the updater itself.

Master crash / restart

Kill the listener mid-shipment. The forwarder detects the dropped connection, retries with exponential backoff (capped at 30s), buffers writes into the worker's local log file (not RAM), and on reconnect resumes shipping from the last committed offset in forwarder.db.

Guarantee: no duplicates, no loss, across any number of master restarts, as long as the worker's disk is intact. Verified end-to-end in tests/swarm/test_forwarder_resilience.py.
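
The reconnect schedule is an ordinary doubling backoff; a sketch (the 1-second starting delay is an assumption, only the 30s cap is documented):

```python
def backoff_delays(base: float = 1.0, cap: float = 30.0):
    """Yield reconnect delays: base, 2*base, 4*base, ... capped at cap.

    The generator never terminates; the caller sleeps on each value
    until the connection to master:6514 comes back.
    """
    delay = base
    while True:
        yield delay
        delay = min(delay * 2, cap)
```

Because the log file itself is the buffer, the cap matters only for reconnect latency, not for data safety.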

Worker crash / restart

The agent is stateless at the process level — all state lives in the bundle on disk plus whatever Docker has running. systemctl restart decnet-agent (or equivalent) is safe at any time. The forwarder picks up exactly where it left off.

Rotating the CA

Don't. The CA key signs every worker cert. Replacing it means re-enrolling every worker. If the CA key is compromised, treat it as a full rebuild: decommission every worker, delete ~/.decnet/ca/, restart swarmctl (it regenerates a fresh CA), re-enroll every worker with fresh bundles.

Rotating a single worker cert

decnet swarm decommission --name decky-old --yes
decnet swarm enroll --name decky-new --address <same-ip> \
     --out-dir /tmp/decky-new-bundle
# ship the new bundle, restart the agent pointed at it.

There is no in-place rotation — decommission + re-enroll is the path.

Teardown

# Master: tear down all deckies across all workers, then stop control plane.
decnet teardown --all --mode swarm

# On each worker, if you want to remove the bundle:
rm -rf ~/.decnet/agent
systemctl stop decnet-agent decnet-forwarder

# Master, to fully wipe swarm state:
decnet swarm decommission --name <each-worker> --yes
# This leaves ~/.decnet/ca/ intact so you can re-enroll later. To fully
# wipe: rm -rf ~/.decnet/ca ~/.decnet/master

Security posture, briefly

  • Every control-plane connection is mTLS. No token auth, no HTTP fallback, no "just for testing" plaintext knob.
  • Every log-plane connection is mTLS (RFC 5425 on 6514). Plaintext syslog over the wire is refused.
  • The master CA signs both the master's own client cert and every worker cert. Certs carry SANs so hostname verification actually works — the worker will reject a master that presents a cert without the worker's address in the SANs.
  • The listener tags every incoming line with the authenticated peer CN. A worker cannot spoof another worker's identity.
  • swarmctl binds to loopback by default. If you expose it, put real auth in front.

Known limitations

  • No web UI for swarm management yet. CLI only. Dashboard integration is on the roadmap.
  • No automatic discovery. Workers don't broadcast; enrollment is explicit and that's intentional.
  • Single master. No HA. If the master dies, the control plane is gone until it comes back. Workers keep buffering logs and keep serving attackers — they don't need the master to stay up — but you can't issue new deploys or tear anything down while the master is down.
  • Sharding is round-robin. No weights, no affinity, no "run the high-interaction HTTPS decky on the beefy box". Feature request.

Troubleshooting

  • swarm check says reachable: false
    Likely cause: agent not running, firewall, wrong --address at enrollment, or clock skew.
    Fix: curl -k https://<worker>:8765/health from the master, check ntpq/chronyc tracking, re-enroll if the IP was wrong.

  • Forwarder logs ssl.SSLCertVerificationError
    Likely cause: bundle mismatch (ca.crt ≠ master's CA) or clock skew.
    Fix: re-issue the bundle with swarm enroll, check time sync.

  • Forwarder logs ConnectionRefusedError on 6514
    Likely cause: listener not running, or bound to the wrong interface.
    Fix: ss -tlnp | grep 6514 on the master.

  • swarm list shows status=enrolled indefinitely
    Likely cause: swarm check has never been run, or the agent is unreachable.
    Fix: run swarm check; see the first entry above if that fails.

  • Lines appear in master.log but not the dashboard
    Likely cause: ingester not running, or pointed at the wrong JSON path.
    Fix: systemctl status decnet-ingester; confirm DECNET_INGEST_LOG_FILE matches listener --json-path.

  • deploy --mode swarm fails with No enrolled workers
    Likely cause: exactly what it says.
    Fix: swarm enroll at least one worker first.

  • Worker returns 500 on /deploy with an ip addr show <nic> error
    Likely cause: the agent is re-detecting its own NIC (the relocalize step) and can't find a usable interface.
    Fix: run ip route show default on the worker — if empty, the default route is missing; fix the worker's networking before deploying.

  • Worker returns 500 on /deploy with docker compose ... exit status 125 and docker help text in the log
    Likely cause: the Compose v2 plugin is not installed on the worker; the stock docker binary treats compose as an unknown subcommand.
    Fix: docker compose version on the worker. If it doesn't print v2.x.y, see Installing Compose v2 and Buildx on a worker.

  • Worker returns 500 on /deploy with compose build requires buildx 0.17.0 or later
    Likely cause: the buildx plugin is missing or too old on the worker; images pull but the build step fails.
    Fix: docker buildx version on the worker. If it's below v0.17.0, see Installing Compose v2 and Buildx on a worker.

  • docker info lists a CLI plugin under "Invalid Plugins: ... exec format error"
    Likely cause: wrong-architecture binary installed — e.g. an x86_64 binary dropped onto an aarch64 host.
    Fix: re-download the plugin binary matching uname -m and overwrite the file in /usr/local/lib/docker/cli-plugins/.

  • Agent rejects master with BAD_CERTIFICATE
    Likely cause: the master's own client cert (~/.decnet/master/) isn't in the worker's trust chain.
    Fix: this should not happen when both sides were issued by the same CA; check that you didn't re-init the CA between swarmctl starts.

If things are really broken and you want a clean slate on the master:

systemctl stop decnet-swarmctl decnet-listener   # or your supervisor of choice
rm -rf ~/.decnet/ca ~/.decnet/master ~/.decnet/master-logs
# SwarmHost rows live in the shared repo; clear them if you want a clean DB.
sqlite3 ~/.decnet/decnet.db 'DELETE FROM swarmhost; DELETE FROM deckyshard;'

And on every worker:

systemctl stop decnet-agent decnet-forwarder
rm -rf ~/.decnet/agent

Then start from step 1 of Setup walkthrough.