docs(swarm): add SWARM Mode page and cross-link from Deployment Modes
Comprehensive walkthrough for the newly landed SWARM control plane:

- Architecture diagram (master: swarmctl/listener/ingester/api; worker: agent/forwarder) with a ports cheat sheet
- Step-by-step setup (CA bootstrap, enrollment, bundle shipment, agent + forwarder startup, check, first swarm deploy)
- Full command reference for swarmctl, listener, agent, forwarder, and the swarm enroll/list/check/decommission subcommands
- Log-pipeline end-to-end story (RFC 5424 on worker → RFC 5425 mTLS on 6514 → master.json → ingester → dashboard), including a tcpdump-based plaintext-leak check and a source_worker provenance note
- Operational concerns: master crash resume (no dup/loss), worker crash, CA rotation, cert rotation, teardown
- Security posture summary
- Known limitations (get_host_ip master-side bug, no web UI yet, round-robin only, single master)
- Troubleshooting matrix

Deployment-Modes: trimmed the old "swarm is not implemented, drive it from Ansible" section and replaced it with a link to the new page.

_Sidebar: added SWARM-Mode under User docs.
@@ -107,50 +107,42 @@ path calls `load_ini` and `build_deckies_from_ini`.
 (ASCII topology diagram — garbled in extraction; recoverable context below)
     isolated mgmt / SIEM
+        [master] — swarmctl, listener, ingester, CA
 ```

-Each real host runs a UNIHOST-shaped deployment over its own slice of the IP
-space. An external orchestrator (Ansible, sshpass-driven scripts, etc.)
-invokes `decnet deploy --mode swarm ...` on each host in turn. The CLI
-currently accepts `swarm` as a valid mode — the fleet-wide orchestration layer
-lives outside the DECNET binary and is the operator's responsibility. See the
-README's architecture section for the intended shape.
-
-### CLI
-
-Run on each host, coordinating IP ranges so deckies do not collide:
-
-```
-# host-A
-sudo decnet deploy \
-  --mode swarm \
-  --deckies 3 \
-  --interface eth0 \
-  --ip-start 192.168.1.10 \
-  --randomize-services
-
-# host-B
-sudo decnet deploy \
-  --mode swarm \
-  --deckies 3 \
-  --interface eth0 \
-  --ip-start 192.168.1.20 \
-  --randomize-services
-```
-
-`--ip-start` is the operator's primary tool for partitioning the subnet across
-hosts; `allocate_ips` in `decnet/network.py` starts sequentially from that
-address and skips reserved / in-use IPs.
-
-### INI
-
-For reproducible swarm rollouts, give each host its own INI and drive the
-rollout from Ansible (or similar):
-
-```
-decnet deploy --mode swarm --config ./host-A.ini
-decnet deploy --mode swarm --config ./host-B.ini
-```
+SWARM has a dedicated page: **[SWARM Mode](SWARM-Mode)**. That page is the
+authoritative reference for setup, enrollment, the log pipeline, and
+troubleshooting.
+
+In brief: DECNET ships a **master** (`decnet swarmctl` + `decnet listener`)
+that orchestrates **workers** (`decnet agent` + `decnet forwarder`) over
+HTTP+mTLS on ports 8770/8765 and syslog-over-TLS (RFC 5425) on port 6514.
+A self-managed CA at `~/.decnet/ca/` signs every worker cert at enrollment.
+
+Typical first-time flow:
+
+```
+# On the master:
+decnet swarmctl --daemon
+decnet listener --daemon
+decnet swarm enroll --name decky-vm --address 192.168.1.13 \
+  --out-dir /tmp/decky-vm-bundle
+
+# Ship the bundle to the worker, then on the worker:
+sudo decnet agent --daemon --agent-dir ~/.decnet/agent
+decnet forwarder --daemon --master-host <master-ip>
+
+# Back on the master:
+decnet swarm check
+decnet swarm list
+decnet deploy --mode swarm --deckies 6 --services ssh,smb
+```
+
+`deploy --mode swarm` round-robins deckies across all enrolled workers,
+shards the compose config, and dispatches each shard to the matching
+agent. See [SWARM Mode](SWARM-Mode) for the full walkthrough, command
+reference, security posture, and troubleshooting matrix.

 ---
SWARM-Mode.md (new file, 531 lines)
@@ -0,0 +1,531 @@
# SWARM Mode — Multi-host Deployment

SWARM is DECNET's multi-host deployment posture. One **master** orchestrates
N **workers** (real hosts), each running a slice of the decky fleet. The
control plane speaks HTTP+mTLS; the log plane speaks RFC 5425 syslog over
mTLS on TCP 6514. Everything is signed by a single DECNET-managed CA on the
master.

If you want a single-box deployment, stop here and read
[Deployment Modes](Deployment-Modes) → UNIHOST. SWARM has more moving parts
and is not the right starting point for first runs.

See also: [CLI reference](CLI-Reference),
[Deployment modes](Deployment-Modes),
[Logging and syslog](Logging-and-Syslog),
[Networking: MACVLAN and IPVLAN](Networking-MACVLAN-IPVLAN),
[Teardown](Teardown-and-State).

---
## Architecture in one picture

```
┌──────────────────────── MASTER ────────────────────────┐      ┌───────── WORKER ─────────┐
│                                                        │      │                          │
│  decnet api      :8000  (dashboard / REST)             │      │  decnet agent :8765      │
│  decnet swarmctl :8770  (SWARM control plane)          │◀mTLS▶│  (FastAPI/uvicorn)       │
│  decnet listener :6514  (syslog-over-TLS sink)         │◀mTLS─│  decnet forwarder        │
│  decnet ingester        (parses master.json)           │      │  (tails local log file,  │
│                                                        │      │   ships RFC 5425 over    │
│  SQLite/MySQL (shared repo, SwarmHost + DeckyShard)    │      │   TCP 6514 to master)    │
│                                                        │      │                          │
│  ~/.decnet/ca/     (self-signed CA — ca.crt, ca.key)   │      │  ~/.decnet/agent/        │
│  ~/.decnet/master/ (master client cert for swarmctl)   │      │  (CA-issued bundle)      │
│                                                        │      │  docker / compose        │
└────────────────────────────────────────────────────────┘      └──────────────────────────┘
```

Four long-running processes on the master, two on each worker. Each process
is a separate supervised unit — if `swarmctl` crashes, the main `decnet api`,
the log listener, and the ingester keep running. This mirrors the
`start_new_session=True` subprocess pattern used everywhere else in DECNET
(`decnet/cli.py::api` and friends).
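The detachment pattern the page refers to can be sketched in plain Python — `start_new_session=True` puts the child in its own session, so it outlives the launching CLI and can be signalled as a group (a minimal illustration of the idea; DECNET's actual `_daemonize()` is not shown here):

```python
import os
import subprocess

# Launch a child in its own session (setsid), detached from our
# process group -- the same idea as `decnet <cmd> --daemon`.
proc = subprocess.Popen(
    ["python3", "-c", "import os; print(os.getpgid(0))"],
    start_new_session=True,
    stdout=subprocess.PIPE,
)
child_pgid = int(proc.stdout.read())
proc.wait()

# A session leader's process-group id equals its own pid, and it no
# longer belongs to our process group.
assert child_pgid == proc.pid
assert child_pgid != os.getpgid(0)
```

Because each daemon is its own session leader, killing one of them never takes the others down with it.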
### Ports cheat sheet

| Port | Process           | Host   | Protocol          | mTLS?         |
|------|-------------------|--------|-------------------|---------------|
| 8000 | `decnet api`      | master | HTTP              | no            |
| 8770 | `decnet swarmctl` | master | HTTP              | no \*         |
| 6514 | `decnet listener` | master | syslog (RFC 5425) | **yes**       |
| 8765 | `decnet agent`    | worker | HTTPS             | **yes**       |
| 5140 | local collector   | worker | syslog            | no (loopback) |

\* `swarmctl` binds to `127.0.0.1` by default and is called by the local
`decnet` CLI. If you need to drive it from outside the master box, put it
behind a reverse proxy with your own auth — it is not hardened for public
exposure.

---
## Prerequisites

On the **master**:

- DECNET installed (`pip install -e .`) in the venv you plan to run from.
- Write access to `~/.decnet/` (CA and master bundle land here).
- A reachable listen address for port 6514 and 8765→master replies.
- Docker is **not** needed on the master unless the master is also a worker.

On each **worker**:

- DECNET installed.
- Docker Engine + Compose plugin (the agent shells out to `docker compose`
  exactly like UNIHOST).
- `sudo` for the user running `decnet agent` (MACVLAN/IPVLAN needs root).
  `NOPASSWD` is convenient for unattended daemons.
- Outbound TCP to master:6514 (log forward) and inbound TCP on 8765 from
  the master (deploy/teardown/health RPCs).

Time sync is a hard requirement — mTLS cert validation fails if worker and
master clocks differ by more than a few minutes. Run `chronyd`/`systemd-timesyncd`.

---
## Setup walkthrough

This is a complete, literal walkthrough. Follow it top to bottom the first
time. Every command is run either **on master** or **on worker** —
annotated below each block.

### 1. Master — start the control plane

```bash
# Start the SWARM controller. First run creates ~/.decnet/ca/ automatically
# (self-signed CA, ca.crt/ca.key) and ~/.decnet/master/ (client cert for
# the master process's own identity when talking to worker agents).
decnet swarmctl --daemon --host 127.0.0.1 --port 8770

# Start the log listener. First run creates master.log (RFC 5424 forensic
# sink, every line verbatim) and master.json (one JSON object per event for
# the ingester).
mkdir -p ~/.decnet/master-logs
decnet listener --daemon \
  --host 0.0.0.0 --port 6514 \
  --log-path ~/.decnet/master-logs/master.log \
  --json-path ~/.decnet/master-logs/master.json

# Confirm both are up.
curl -sf http://127.0.0.1:8770/health && echo OK
ss -tlnp | grep -E '8770|6514'
```

`--daemon` detaches to a new session (same `_daemonize()` as `decnet api`).
Without it, the command stays in the foreground.
At this point:

- `~/.decnet/ca/ca.crt` is the CA every worker will trust.
- `~/.decnet/ca/ca.key` **must never leave the master**. Treat it like an
  SSH host key: losing it means re-enrolling every worker.
- `~/.decnet/master/` holds the master's own client certificate that
  `swarmctl` uses to authenticate outbound RPCs to worker agents.

### 2. Master — enroll a worker

The enrollment command is a single call that does four things:

1. Generates a worker keypair + CSR on the master (the private key is
   written directly to the output bundle; it never touches the wire).
2. Signs the CSR with the CA, producing `worker.crt`.
3. Records a `SwarmHost` row in the shared repo with status `enrolled` and
   the cert fingerprint.
4. Writes the bundle files to `--out-dir` for you to ship to the worker.
```bash
decnet swarm enroll \
  --name decky-vm \
  --address 192.168.1.13 \
  --sans decky-vm.lan,192.168.1.13 \
  --out-dir /tmp/decky-vm-bundle
```

`--name` is the worker's DECNET identity — it becomes the cert CN and the
`source_worker` tag on every log line forwarded from that host. Pick names
you can grep for. Must be unique; re-enrolling the same name is rejected.

`--address` is the worker's IP as reachable from the master. This is what
the master's control-plane client will connect to for deploy/teardown RPCs.

`--sans` is a comma-separated list of Subject Alternative Names. Include
every DNS name and IP the master might use to reach the worker. At minimum,
include the IP you passed to `--address`.

Output (`/tmp/decky-vm-bundle/`):

```
ca.crt       # the DECNET CA certificate
worker.crt   # CA-signed client+server cert for this worker
worker.key   # worker private key (mode 0600)
```
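For intuition, the same CA mechanics can be reproduced with stock `openssl` (an illustrative sketch only — DECNET performs these steps in-process; the file names mirror the bundle layout, and the scratch directory is hypothetical):

```shell
# Scratch directory standing in for ~/.decnet/ca plus the bundle.
mkdir -p /tmp/swarm-ca-demo && cd /tmp/swarm-ca-demo

# 1. CA keypair + self-signed CA cert (what swarmctl's first run creates).
openssl genrsa -out ca.key 2048
openssl req -x509 -new -key ca.key -subj "/CN=DECNET-CA" -days 365 -out ca.crt

# 2. Worker keypair + CSR, with the enrollment --name as the CN.
openssl genrsa -out worker.key 2048
openssl req -new -key worker.key -subj "/CN=decky-vm" -out worker.csr

# 3. CA-sign the CSR, attaching the SANs from --sans.
printf "subjectAltName=DNS:decky-vm.lan,IP:192.168.1.13\n" > san.ext
openssl x509 -req -in worker.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -extfile san.ext -days 365 -out worker.crt

# 4. The check every mTLS handshake performs at connect time.
openssl verify -CAfile ca.crt worker.crt   # prints: worker.crt: OK
```

Step 4 is also a handy sanity check to run on a shipped bundle before starting the agent.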
### 3. Ship the bundle to the worker

Any secure channel works — this is a plain file copy. `scp`, `rsync`,
`sshpass` in a closet lab — pick your poison:

```bash
# From the master:
scp -r /tmp/decky-vm-bundle/* anti@192.168.1.13:~/.decnet/agent/
```

On the worker, the bundle must land at `~/.decnet/agent/` of the user that
will run `decnet agent`. **Watch out for `sudo`**: if you run the agent
under `sudo`, `$HOME` expands to `/root`, not `/home/anti`. Either put the
bundle under `/root/.decnet/agent/`, or pass `--agent-dir` to override.

After copying, `chmod 600 ~/.decnet/agent/worker.key` and delete the master
copy.
### 4. Worker — start the agent + forwarder

```bash
# On the worker, as the user whose $HOME holds the bundle (or with --agent-dir).
sudo decnet agent --daemon \
  --host 0.0.0.0 --port 8765 \
  --agent-dir /home/anti/.decnet/agent

# The forwarder tails the worker's local decky log file and ships each
# line, octet-framed and mTLS-wrapped, to the master listener.
decnet forwarder --daemon \
  --master-host <master-ip> \
  --master-port 6514 \
  --log-path /var/log/decnet/decnet.log \
  --state-db ~/.decnet/agent/forwarder.db \
  --agent-dir /home/anti/.decnet/agent
```
`--state-db` holds a single table that records the forwarder's byte offset
into the log file. On reconnect after a master outage, the forwarder
**resumes from the stored offset** — no duplicates, no gaps. Truncation
(logrotate) is detected (`st_size < offset`) and resets the offset to 0.

`--master-host` / `--master-port` can also be set via
`DECNET_SWARM_MASTER_HOST` / `DECNET_SWARM_MASTER_PORT` so operators can
bake them into a systemd unit or `.env` file.
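The offset/truncation logic reduces to a few lines — a minimal sketch of the idea (not DECNET's actual forwarder code; function and file names are illustrative):

```python
import os

def read_new_lines(log_path: str, offset: int) -> tuple[list[bytes], int]:
    """Return complete new lines past `offset`, plus the new offset.

    Detects truncation (logrotate): if the file shrank below the stored
    offset, start over from byte 0.
    """
    if os.stat(log_path).st_size < offset:
        offset = 0  # file was truncated/rotated under us
    with open(log_path, "rb") as f:
        f.seek(offset)
        chunk = f.read()
    # Only commit the offset past the last *complete* line; a partial
    # tail line is re-read on the next poll.
    complete, _, _partial = chunk.rpartition(b"\n")
    if not complete and b"\n" not in chunk:
        return [], offset
    lines = complete.split(b"\n") if complete else []
    return lines, offset + len(complete) + 1

# Usage: a write, a read, then a truncation (simulated logrotate).
path = "/tmp/demo-decnet.log"
with open(path, "wb") as f:
    f.write(b"event-1\nevent-2\n")
lines, off = read_new_lines(path, 0)
assert lines == [b"event-1", b"event-2"] and off == 16
with open(path, "wb") as f:   # truncation: new, shorter file
    f.write(b"event-3\n")
lines, off = read_new_lines(path, off)
assert lines == [b"event-3"] and off == 8
```

Persisting `off` in SQLite after each successful send is what makes the resume exactly-once from the forwarder's point of view.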
### 5. Master — confirm the worker is alive

```bash
# List enrolled workers. Fresh enrollments are status=enrolled until the
# first successful health ping flips them to active.
decnet swarm list

# Poll worker agents. On success, flips SwarmHost.status to active and
# stamps SwarmHost.last_heartbeat.
decnet swarm check

decnet swarm list
# name=decky-vm status=active last_heartbeat=2026-04-18T...
```

If `check` reports `reachable: false`, the usual suspects are: the agent
isn't running, the master cannot reach worker:8765 (firewall / NAT),
`--address` at enrollment doesn't match the worker's actual IP, or clock
skew is breaking cert validity.
### 6. Deploy deckies across the swarm

```bash
decnet deploy --mode swarm --deckies 6 --services ssh,smb --dry-run
# Round-robins 6 deckies across all enrolled workers (with status IN
# (enrolled, active)) and prints the compose-shard plan.

decnet deploy --mode swarm --deckies 6 --services ssh,smb
# Live run: POSTs each worker's shard to swarmctl, which fans out to each
# agent's /deploy, which calls the same deployer.py used in UNIHOST.
```

Sharding is **round-robin** by enrollment order. If you have workers A and
B and ask for 3 deckies, A gets 2 and B gets 1. If you want a different
distribution, run two separate `deploy` calls with filtered host lists
(feature request; see Known Limitations).

Empty swarm is a hard error: `deploy --mode swarm` with zero enrolled
workers exits non-zero with `No enrolled workers`.
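The round-robin split is easy to picture as a sketch (illustrative only — DECNET's real sharder distributes compose configs, not bare names):

```python
from collections import defaultdict

def shard_round_robin(deckies: list[str], workers: list[str]) -> dict[str, list[str]]:
    """Assign deckies to workers in enrollment order, round-robin."""
    shards: dict[str, list[str]] = defaultdict(list)
    for i, decky in enumerate(deckies):
        shards[workers[i % len(workers)]].append(decky)
    return dict(shards)

# 3 deckies over workers A and B: A gets 2, B gets 1, as described above.
plan = shard_round_robin(["decky-1", "decky-2", "decky-3"], ["A", "B"])
assert plan == {"A": ["decky-1", "decky-3"], "B": ["decky-2"]}
```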

---

## Command reference

All of these live in `decnet/cli.py`. Run `decnet <cmd> --help` for the
authoritative option list. What follows are the knobs you will actually
care about.

### `decnet swarmctl`

Master-side SWARM control plane. FastAPI app at `decnet.web.swarm_api:app`,
launched as a subprocess of `uvicorn`. Daemonizable.

```
--host 127.0.0.1    Bind address. Keep this on loopback unless you know
                    what you're doing; swarmctl is called by the local CLI.
--port 8770         Bind port.
--daemon / -d       Detach to background.
```
### `decnet listener`

Master-side RFC 5425 syslog-over-TLS sink. Validates every client cert
against `~/.decnet/ca/ca.crt`, tags each line with the peer's CN as
`source_worker`, and writes two sinks: an RFC 5424 `.log` file and a parsed
`.json` file (one object per event).

```
--host 0.0.0.0      Bind address.
--port 6514         IANA-assigned syslog-TLS port.
--log-path <path>   Raw RFC 5424 sink. default: ./master.log
--json-path <path>  Parsed JSON sink. default: ./master.json
--ca-dir <dir>      CA to verify peers. default: ~/.decnet/ca
--daemon / -d       Detach to background.
```
### `decnet agent`

Worker-side control-plane daemon. FastAPI on HTTPS with
`ssl.CERT_REQUIRED`. Endpoints: `/deploy`, `/teardown`, `/status`,
`/health`, `/mutate`. All require a CA-signed client cert (the master's).

```
--host 0.0.0.0      Bind address.
--port 8765         Bind port.
--agent-dir <path>  Override bundle location. Required when running under
                    sudo or any user whose $HOME doesn't hold the bundle.
--daemon / -d       Detach to background.
```
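The server-side TLS posture boils down to a `CERT_REQUIRED` context — a minimal standard-library sketch (the bundle paths are the ones from step 3, commented out so the snippet stands alone):

```python
import ssl

# Server context: present the worker cert, demand a client cert, and only
# accept clients whose cert chains to the DECNET CA.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.minimum_version = ssl.TLSVersion.TLSv1_2
ctx.verify_mode = ssl.CERT_REQUIRED

# With a real bundle in place:
# ctx.load_cert_chain(certfile="worker.crt", keyfile="worker.key")
# ctx.load_verify_locations(cafile="ca.crt")

assert ctx.verify_mode == ssl.CERT_REQUIRED
```

With `CERT_REQUIRED`, a client that presents no cert (or one signed by any other CA) is rejected during the handshake, before any endpoint code runs.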
### `decnet forwarder`

Worker-side log shipper. Tails `--log-path` (default:
`DECNET_INGEST_LOG_FILE`, i.e. the same file the local collector writes),
frames each line per RFC 5425 octet-counting, and writes it to
master:6514 over mTLS. Offset state is persisted in SQLite so a master
crash is recoverable without data loss.

```
--master-host <ip>      Master IP. env: DECNET_SWARM_MASTER_HOST
--master-port <int>     Listener port. default: 6514
--log-path <path>       File to tail. default: DECNET_INGEST_LOG_FILE
--state-db <path>       Offset SQLite. default: ~/.decnet/agent/forwarder.db
--agent-dir <path>      Bundle dir. default: ~/.decnet/agent
--poll-interval <sec>   File tail interval. default: 0.5
--daemon / -d           Detach to background.
```
### `decnet swarm enroll`

Issues a worker bundle and records a `SwarmHost` row.

```
--name <str>        Worker identity (CN + source_worker tag). Required.
--address <ip>      IP/hostname the master uses to reach the agent. Required.
--sans a,b,c        Subject Alternative Names. default: [--address]
--out-dir <path>    Where to write the bundle. default: ./<name>-bundle
--agent-port <int>  Port to record on the host row. default: 8765
--notes <str>       Free-form annotation, shown in `swarm list`.
```
### `decnet swarm list`

Prints the `SwarmHost` rows as a table.

```
--status <enrolled|active|unreachable|decommissioned>
                    Filter. default: all except decommissioned.
--json              Emit JSON, not a table. Useful for scripting.
```
### `decnet swarm check`

Synchronously polls every active/enrolled agent's `/health`. On success,
flips status to `active` and stamps `last_heartbeat`. On failure, flips to
`unreachable` and records the error.
### `decnet swarm decommission`

Marks a host `decommissioned` in the repo, tears down any running deckies
on it via the agent (if reachable), and **revokes** the worker's cert from
the master's active set. The worker's bundle files are not deleted from the
worker — you are expected to wipe those out of band.

```
--name <str> | --uuid <str>   Identify by either. One is required.
--yes                         Skip confirmation prompt.
--keep-deckies                Leave containers running on the worker.
                              Use this when reassigning hardware.
```
### `decnet deploy --mode swarm`

Round-robins the requested deckies across enrolled workers and dispatches
to `swarmctl`, which POSTs each shard to the matching agent. Compose
generation is shared with UNIHOST; only the **distribution** differs.

```
--deckies <n>          Total fleet size across all workers.
--services a,b,c       Fixed service set for every decky.
--randomize-services   Per-decky random subset from the catalog.
--archetype <name>     Pick from Archetypes (see wiki page).
--dry-run              Print the shard plan; no RPC.
```

---
## Log pipeline — what actually happens to an attack event

1. Attacker hits a decky. The decky's in-container emit helper writes an
   RFC 5424 line to `stdout` and to `/var/log/decnet/decnet.log` inside the
   container. (See [Logging and syslog](Logging-and-Syslog).)
2. The worker's local collector picks the event up over loopback syslog
   (worker:5140). Plaintext is fine here — it never leaves the host.
3. The collector appends the parsed RFC 5424 to
   `DECNET_INGEST_LOG_FILE` on the worker.
4. `decnet forwarder` tails that file, octet-frames each line, connects
   to `master:6514` over mTLS (trust root = DECNET CA), and sends it.
5. `decnet listener` on the master validates the peer cert, extracts
   the CN as `source_worker`, enriches the structured data, and writes
   to `master.log` + `master.json`.
6. `decnet ingester` tails `master.json` and inserts rows into the shared
   repo — the same code path UNIHOST uses.
7. The web dashboard (`decnet api`) queries the repo; live logs stream
   over SSE.
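RFC 5425 octet-counting (step 4) is just a decimal length prefix, a space, then the message — a sketch of the framing and its inverse:

```python
def frame(msg: bytes) -> bytes:
    """RFC 5425 octet-counting: MSG-LEN SP SYSLOG-MSG."""
    return str(len(msg)).encode("ascii") + b" " + msg

def unframe(stream: bytes) -> tuple[bytes, bytes]:
    """Split one framed message off the front of a byte stream."""
    length, _, rest = stream.partition(b" ")
    n = int(length)
    return rest[:n], rest[n:]

line = b"<13>1 2026-04-18T12:00:00Z decky-vm decnet - - - login attempt"
wire = frame(line)
msg, remainder = unframe(wire + frame(line))
assert msg == line and unframe(remainder)[0] == line
```

Length-prefix framing (rather than newline delimiters) is what lets multi-line payloads survive the TLS stream intact.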
**Verify on the wire.** A `tcpdump -i any 'port 6514'` on the master
should show only `ClientHello` / `ServerHello` / `Certificate` / encrypted
`ApplicationData` records. No plaintext `<13>1 2026-...` — if you see
those, something is wrong and logs are leaking in the clear. The listener
refuses plaintext connections outright.
**Verify provenance.** Every line in `master.log` has a `source_worker=...`
SD element populated from the peer cert's CN. The master does **not**
trust a `source_worker` the worker tries to set itself — it is always
overwritten with the authenticated cert identity. A compromised worker
cannot impersonate another worker's name.
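The trust rule is "peer CN always wins" — sketched below (illustrative, not the listener's actual code):

```python
def tag_provenance(event: dict, authenticated_cn: str) -> dict:
    """Stamp the event with the mTLS-authenticated identity.

    Any `source_worker` the sender put in the payload is discarded:
    provenance comes from the verified peer certificate, never from
    attacker-writable message content.
    """
    tagged = dict(event)
    tagged["source_worker"] = authenticated_cn  # unconditional overwrite
    return tagged

# A compromised worker claiming to be "decky-other" still gets tagged
# with the CN from its own cert.
evt = {"msg": "login attempt", "source_worker": "decky-other"}
assert tag_provenance(evt, "decky-vm")["source_worker"] == "decky-vm"
```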

---

## Operational concerns

### Master crash / restart

Kill the listener mid-shipment. The forwarder detects the dropped
connection, retries with exponential backoff (capped at 30s), buffers
writes **into the worker's local log file** (not RAM), and on reconnect
resumes shipping from the last committed offset in `forwarder.db`.

Guarantee: **no duplicates, no loss**, across any number of master
restarts, as long as the worker's disk is intact. Verified end-to-end in
`tests/swarm/test_forwarder_resilience.py`.
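Capped exponential backoff is a one-liner worth seeing (a sketch; the 30s cap matches the number quoted above, the rest is illustrative):

```python
def backoff_delays(base: float = 1.0, cap: float = 30.0, attempts: int = 7) -> list[float]:
    """Delay before each reconnect attempt: base * 2^n, capped."""
    return [min(base * (2 ** n), cap) for n in range(attempts)]

# 1s, 2s, 4s, ... then pinned at the 30s ceiling.
assert backoff_delays() == [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

The cap keeps a long master outage from stretching reconnect attempts into hours once the master is back.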
### Worker crash / restart

The agent is stateless at the process level — all state lives in the
bundle on disk plus whatever Docker has running. `systemctl restart
decnet-agent` (or equivalent) is safe at any time. The forwarder picks
up exactly where it left off.
### Rotating the CA

Don't. The CA key signs every worker cert. Replacing it means re-enrolling
every worker. If the CA key is compromised, treat it as a full rebuild:
decommission every worker, delete `~/.decnet/ca/`, restart `swarmctl` (it
regenerates a fresh CA), and re-enroll every worker with fresh bundles.
### Rotating a single worker cert

```
decnet swarm decommission --name decky-old --yes
decnet swarm enroll --name decky-new --address <same-ip> \
  --out-dir /tmp/decky-new-bundle
# Ship the new bundle, restart the agent pointed at it.
```

There is no in-place rotation — decommission + re-enroll is the path.
### Teardown

```bash
# Master: tear down all deckies across all workers, then stop control plane.
decnet teardown --all --mode swarm

# On each worker, if you want to remove the bundle (stop daemons first):
systemctl stop decnet-agent decnet-forwarder
rm -rf ~/.decnet/agent

# Master, to fully wipe swarm state:
decnet swarm decommission --name <each-worker> --yes
# This leaves ~/.decnet/ca/ intact so you can re-enroll later. To fully
# wipe: rm -rf ~/.decnet/ca ~/.decnet/master
```

---
## Security posture, briefly

- **Every control-plane connection** is mTLS. No token auth, no HTTP
  fallback, no "just for testing" plaintext knob.
- **Every log-plane connection** is mTLS (RFC 5425 on 6514). Plaintext
  syslog over the wire is refused.
- The master CA signs both the master's own client cert and every worker
  cert. Certs carry SANs so hostname verification actually works — the
  worker will reject a master that presents a cert without the worker's
  address in the SANs.
- The listener tags every incoming line with the authenticated peer CN.
  A worker cannot spoof another worker's identity.
- `swarmctl` binds to loopback by default. If you expose it, put real
  auth in front.

---
## Known limitations

- **`deploy --mode swarm` runs `get_host_ip(--interface)` on the master**
  before dispatching to workers. This means `--interface` must name a NIC
  that exists on the master. If your workers have different NIC names
  (common in heterogeneous fleets), this fails. Workaround: use per-worker
  INI configs that hardcode the right subnet, and call deploy once per
  worker. A proper fix (defer network detection to the worker agent) is
  tracked in `Roadmap-and-Known-Debt`.
- **No web UI for swarm management yet.** CLI only. Dashboard integration
  is on the roadmap.
- **No automatic discovery.** Workers don't broadcast; enrollment is
  explicit and that's intentional.
- **Single master.** No HA. If the master dies, the control plane is gone
  until it comes back. Workers keep buffering logs and keep serving
  attackers — they don't need the master to stay up — but you can't issue
  new deploys or tear anything down while the master is down.
- **Sharding is round-robin.** No weights, no affinity, no "run the
  high-interaction HTTPS decky on the beefy box". Feature request.

---
## Troubleshooting

| Symptom | Likely cause | Fix |
|---|---|---|
| `swarm check` says `reachable: false` | Agent not running, firewall, wrong `--address` at enrollment, or clock skew | `curl -k https://<worker>:8765/health` from the master, check `ntpq`/`chronyc tracking`, re-enroll if the IP was wrong |
| Forwarder logs `ssl.SSLCertVerificationError` | Bundle mismatch (ca.crt ≠ master's CA) or clock skew | Re-issue the bundle with `swarm enroll`, check time sync |
| Forwarder logs `ConnectionRefusedError` on 6514 | Listener not running, or binding to the wrong interface | `ss -tlnp \| grep 6514` on the master |
| `swarm list` shows `status=enrolled` indefinitely | `swarm check` has never been run, or agent is unreachable | Run `swarm check`; see row 1 if that fails |
| Lines appear in `master.log` but not the dashboard | Ingester not running, or pointed at the wrong JSON path | `systemctl status decnet-ingester`, confirm `DECNET_INGEST_LOG_FILE` matches `listener --json-path` |
| `deploy --mode swarm` fails with `No enrolled workers` | Exactly what it says | `swarm enroll` at least one worker first |
| `deploy --mode swarm` fails on `get_host_ip` | The NIC name you passed doesn't exist on the master | See Known Limitations; use per-host INI files |
| Agent rejects master with `BAD_CERTIFICATE` | Master's own client cert (`~/.decnet/master/`) isn't in the worker's trust chain | Shouldn't happen if both sides were issued from the same CA; check you didn't re-init the CA between `swarmctl` starts |

If things are really broken and you want a clean slate on the master:

```bash
systemctl stop decnet-swarmctl decnet-listener   # or your supervisor of choice
rm -rf ~/.decnet/ca ~/.decnet/master ~/.decnet/master-logs
# SwarmHost rows live in the shared repo; clear them if you want a clean DB.
sqlite3 ~/.decnet/decnet.db 'DELETE FROM swarmhost; DELETE FROM deckyshard;'
```

And on every worker:

```bash
systemctl stop decnet-agent decnet-forwarder
rm -rf ~/.decnet/agent
```

Then start from step 1 of [Setup walkthrough](#setup-walkthrough).
@@ -17,6 +17,7 @@
 - [OS-Fingerprint-Spoofing](OS-Fingerprint-Spoofing)
 - [Networking-MACVLAN-IPVLAN](Networking-MACVLAN-IPVLAN)
 - [Deployment-Modes](Deployment-Modes)
+- [SWARM-Mode](SWARM-Mode)
 - [Environment-Variables](Environment-Variables)
 - [Teardown-and-State](Teardown-and-State)
 - [Database-Drivers](Database-Drivers)