docs: document swarm/agent/updater APIs and add PKI-and-mTLS reference

- Module-Reference-Workers: new sections for decnet/swarm/, decnet/agent/,
  and decnet/updater/ covering exported symbols, FastAPI routes, uvicorn
  launcher flags, and the executor seams the test suite monkeypatches.
- PKI-and-mTLS: standalone developer page covering CA generation, leaf
  issuance, SSLContext wiring (client + server), the VERIFY_X509_STRICT
  workaround for Python 3.13, the RFC 5425 log pipeline's reuse of the
  same CA, and the currently-unenforced CN/role-separation gap.
- Sidebar: link the new PKI page from the Developer docs section.
2026-04-19 00:21:15 -04:00
parent e07128a72a
commit e473159ea1
3 changed files with 332 additions and 0 deletions

@@ -267,3 +267,171 @@ Shared helpers for the `LOG_TARGET` env var used by service plugins.
---
See [Module Reference — Core](Module-Reference-Core) for top-level modules (cli, composer, telemetry, etc.) and [Module Reference — Web](Module-Reference-Web) for the FastAPI surface and DB layer.
---
## Swarm — `decnet/swarm/`
Master-side orchestration of multi-host deployments: HTTP clients for the worker daemons, the PKI that signs their certs, the tar helper that packages the working tree for a remote update, and the syslog-over-TLS forwarder/listener pair. Everything in this package runs either on the master or on every worker that needs to talk back to it — there is no third role.
See [PKI and mTLS](PKI-and-mTLS) for the cert-chain details, cert layout, and why CN is not actually validated at the handler level.
### `decnet/swarm/__init__.py`
Re-exports `AgentClient`, `UpdaterClient`, `MasterIdentity`, `ensure_master_identity`, and the `pki` submodule. Importing `decnet.swarm` is enough for CLI-level callers; nothing else is considered public.
### `decnet/swarm/client.py`
Async HTTP client for the worker-side agent daemon (port 8765). One instance per worker target; `httpx.AsyncClient` is re-used across calls.
- `decnet/swarm/client.py::AgentClient.__init__` — accepts either a `host` dict (from `swarm_hosts` DB rows) or a raw `address` string, resolves the master's own cert bundle via `MasterIdentity`, and builds the mTLS `ssl.SSLContext`. `agent_port` defaults to 8765. `verify_hostname=False` by default — we pin by CA chain, not DNS, because workers enroll with whatever SANs the operator chose.
- `decnet/swarm/client.py::AgentClient.deploy`: `POST /deploy` with a serialized `DecnetConfig` + `dry_run` + `no_cache`. Read timeout is bumped to 600 s because `docker compose build` can be very slow on underpowered workers.
- `decnet/swarm/client.py::AgentClient.teardown`: `POST /teardown` with optional `decky_id`.
- `decnet/swarm/client.py::AgentClient.health`: `GET /health`. The master never reaches this handler without a valid cert (uvicorn rejects the handshake), so this is a real liveness probe, not an auth endpoint.
- `decnet/swarm/client.py::AgentClient.status`: `GET /status`.
- mTLS wiring (in `__init__`): `ctx.load_cert_chain(...)`, `ctx.load_verify_locations(cafile=...)`, `ctx.verify_mode = ssl.CERT_REQUIRED`, `ctx.check_hostname = self._verify_hostname`.
### `decnet/swarm/updater_client.py`
Sibling client for the self-updater daemon (port 8766). Same mTLS pattern as `AgentClient` but targets a different port and uses multipart/form-data for tarball uploads.
- `decnet/swarm/updater_client.py::UpdaterClient.__init__`: `updater_port=8766`, same `MasterIdentity` bundle as `AgentClient`. The master uses one cert for both; the TLS layer doesn't care which daemon answers.
- `decnet/swarm/updater_client.py::UpdaterClient.health`: `GET /health`.
- `decnet/swarm/updater_client.py::UpdaterClient.update`: `POST /update` with `tarball: bytes` + `sha: str` as multipart fields. The 180 s read timeout covers tarball upload + `pip install` + probe-with-retry.
- `decnet/swarm/updater_client.py::UpdaterClient.update_self`: `POST /update-self`; sends `confirm_self=true` to pass the server-side safety check (see [Remote-Updates](Remote-Updates)). Tolerates the mid-response disconnect that `os.execv` causes by catching `RemoteProtocolError` and treating it as "success pending `/health` poll".
- `decnet/swarm/updater_client.py::UpdaterClient.rollback`: `POST /rollback`; returns 404 if there is no `prev/` slot.
### `decnet/swarm/pki.py`
The one place in the codebase that holds a private key. Everything else consumes `IssuedCert` bundles it produces.
- `decnet/swarm/pki.py::DEFAULT_CA_DIR` = `~/.decnet/ca`; `decnet/swarm/pki.py::DEFAULT_AGENT_DIR` = `~/.decnet/agent`.
- `CA_KEY_BITS = 4096`, `WORKER_KEY_BITS = 2048`, `CA_VALIDITY_DAYS = 3650`, `WORKER_VALIDITY_DAYS = 825`.
- `decnet/swarm/pki.py::CABundle`: `(key_pem: bytes, cert_pem: bytes)` dataclass for the CA private key + self-signed cert.
- `decnet/swarm/pki.py::IssuedCert`: `(key_pem, cert_pem, ca_cert_pem, fingerprint_sha256: str)` for a signed leaf bundle. `fingerprint_sha256` is what the DB stores for out-of-band enrollment audit.
- `decnet/swarm/pki.py::generate_ca` — RSA-4096, self-signed, `BasicConstraints(ca=True, path_length=0)`, `KeyUsage(key_cert_sign=True, crl_sign=True)`, signed with SHA-256, 10-year validity.
- `decnet/swarm/pki.py::issue_worker_cert` — RSA-2048 leaf, CN = caller-supplied `worker_name` (`hostname` for agent certs, `updater@hostname` for updater certs), SANs built from the list the caller passes (IPs parsed as `IPAddress`, everything else as `DNSName`), `ExtKeyUsage(serverAuth, clientAuth)` — both flags because the worker is a server to the master and a client when it forwards logs.
- `decnet/swarm/pki.py::write_worker_bundle` — writes `worker.key` (mode 0600), `worker.crt`, `ca.crt` into the bundle dir. Updater bundles write to `~/.decnet/updater/` with `updater.key` / `updater.crt` names instead.
- `decnet/swarm/pki.py::load_worker_bundle` — loads an `IssuedCert` off disk; used by the agent/updater at startup.
- `decnet/swarm/pki.py::fingerprint`: `sha256(cert_pem_der).hexdigest()`. Cheap, deterministic, stable across cert re-encodings.
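The SAN handling described for `issue_worker_cert` (IPs become IP entries, everything else becomes a DNS name) can be illustrated with the standard library alone. This is a minimal sketch, not the project's code: `classify_sans` is a hypothetical helper name, and the real function presumably maps the two kinds onto the `IPAddress` and `DNSName` types mentioned above.

```python
import ipaddress


def classify_sans(sans):
    """Split raw SAN strings into ("ip", address) and ("dns", name) pairs.

    Hypothetical helper: anything ipaddress.ip_address() accepts is treated
    as an IP SAN, everything else falls through to a DNS-name SAN.
    """
    classified = []
    for raw in sans:
        value = raw.strip()
        if not value:
            continue  # ignore empty entries from a trailing comma in a CSV flag
        try:
            classified.append(("ip", ipaddress.ip_address(value)))
        except ValueError:
            classified.append(("dns", value))
    return classified


if __name__ == "__main__":
    for kind, value in classify_sans(["10.0.0.5", "worker-1.lan", "::1"]):
        print(kind, value)
```

Trying the IP parse first and falling back to DNS keeps the rule to a single branch, which matches the "everything else" wording above.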
### `decnet/swarm/tar_tree.py`
Builds the working-tree tarball that `decnet swarm update` ships to the updater.
- `decnet/swarm/tar_tree.py::DEFAULT_EXCLUDES` — filter tuple: `.venv/`, `__pycache__/`, `.git/`, `wiki-checkout/`, `*.pyc`, `*.pyo`, `*.db*`, `*.log`, `.pytest_cache/`, `.mypy_cache/`, `.tox/`, `*.egg-info/`, `decnet-state.json`, `master.log`, `master.json`, `decnet.db*`. These are enforced regardless of `.gitignore` so untracked dev artefacts never leak onto workers.
- `decnet/swarm/tar_tree.py::_is_excluded`: runs `fnmatch` against the relative path *and* every leading subpath, so a pattern like `.git/` excludes everything underneath.
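The leading-subpath matching can be sketched as follows. This is an illustration under assumptions rather than the real `_is_excluded`: the pattern tuple is a small subset of `DEFAULT_EXCLUDES`, the input is assumed to be a POSIX-style relative file path, and also matching each bare component name (so a nested `__pycache__/` is caught) is a choice made for this sketch, not something the source states.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Illustrative subset of DEFAULT_EXCLUDES, not the full tuple.
EXCLUDES = (".git/", "__pycache__/", "*.pyc", "decnet-state.json")


def is_excluded(rel_path, patterns=EXCLUDES):
    """Return True if rel_path, or any directory above it, matches a pattern."""
    parts = PurePosixPath(rel_path).parts
    for i in range(1, len(parts) + 1):
        is_dir = i < len(parts)  # every component except the last is a directory
        suffix = "/" if is_dir else ""
        subpath = "/".join(parts[:i]) + suffix  # e.g. ".git/" for ".git/objects/ab"
        name = parts[i - 1] + suffix            # bare component, e.g. "__pycache__/"
        if any(fnmatchcase(subpath, p) or fnmatchcase(name, p) for p in patterns):
            return True
    return False


if __name__ == "__main__":
    for path in (".git/objects/ab", "decnet/cli.py", "decnet/__pycache__/cli.pyc"):
        print(path, is_excluded(path))
```

`fnmatchcase` is used instead of `fnmatch` so the result does not depend on the platform's case-normalisation rules.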
### `decnet/swarm/log_forwarder.py`
Worker → master half of the RFC 5425 syslog-over-TLS pipeline. Wakes up periodically, reads new lines from the local log file, frames them octet-counted per RFC 5425, and writes them over an mTLS connection to port 6514 on the master.
- `decnet/swarm/log_forwarder.py::ForwarderConfig` — dataclass: `log_path`, `master_host`, `master_port=6514`, `agent_dir=~/.decnet/agent`, optional `state_db` for byte-offset persistence.
- Plaintext syslog across hosts is forbidden by project policy; see the [Syslog over TLS](#) notes. Only loopback traffic may use plaintext.
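RFC 5425 frames each message as `MSG-LEN SP SYSLOG-MSG`, where `MSG-LEN` is the decimal octet count of the message. A minimal sketch of both directions is shown here; `frame` and `deframe` are hypothetical names for illustration, not the forwarder's or listener's actual functions.

```python
def frame(msg: str) -> bytes:
    """Octet-counted framing: ASCII length, one space, then the UTF-8 payload."""
    payload = msg.encode("utf-8")
    return str(len(payload)).encode("ascii") + b" " + payload


def deframe(buf: bytes):
    """Split a receive buffer into complete messages plus the unconsumed tail."""
    messages = []
    while True:
        sp = buf.find(b" ")
        if sp == -1:
            break  # length prefix not fully received yet
        prefix = buf[:sp]
        if not prefix.isdigit():
            raise ValueError(f"bad MSG-LEN prefix: {prefix!r}")
        end = sp + 1 + int(prefix)
        if len(buf) < end:
            break  # message body not fully received yet
        messages.append(buf[sp + 1:end])
        buf = buf[end:]
    return messages, buf


if __name__ == "__main__":
    stream = frame("first line") + frame("second line")
    print(deframe(stream))
```

The length counts octets, not characters, so multi-byte UTF-8 text has to be encoded before it is measured.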
### `decnet/swarm/log_listener.py`
Master-side RFC 5425 receiver. One mTLS-protected TCP socket on 6514; accepts connections from any worker whose cert is signed by the DECNET CA.
- `decnet/swarm/log_listener.py::ListenerConfig`: `log_path`, `json_path`, `bind_host="0.0.0.0"`, `bind_port=6514`, `ca_dir=~/.decnet/ca`.
- `decnet/swarm/log_listener.py::build_listener_ssl_context` — server-side `ssl.SSLContext`: master presents `ca/master/worker.crt`, requires the peer to present a DECNET CA-signed cert. The CN on the peer cert is the authoritative worker identity — the RFC 5424 HOSTNAME field is untrusted input and is never used for authentication.
---
## Agent — `decnet/agent/`
Worker-side daemon. FastAPI app behind uvicorn with mTLS on port 8765. Accepts deploy / teardown / status requests from the master and executes them locally.
See [Remote-Updates](Remote-Updates) for the lifecycle management around this process — the agent is not self-supervising.
### `decnet/agent/__init__.py`
Empty package marker.
### `decnet/agent/app.py`
- `decnet/agent/app.py::DeployRequest` — pydantic body model: `{config: DecnetConfig, dry_run: bool, no_cache: bool}`.
- `decnet/agent/app.py::TeardownRequest`: `{decky_id: str | None}`.
- `decnet/agent/app.py::MutateRequest`: `{decky_id: str, services: list[str]}` (reserved; handler returns 501).
- `GET /health` — returns `{"status": "ok", "marker": "..."}`. mTLS still required — the master's liveness probe carries its cert.
- `GET /status` — awaits `executor.status()`; returns the worker's current deployment snapshot.
- `POST /deploy` — calls `executor.deploy(config, dry_run, no_cache)`. Returns `{"status": "deployed", "deckies": int}` on success, `HTTPException(500)` with the caught exception's message on failure.
- `POST /teardown` — calls `executor.teardown(decky_id)`.
- `POST /mutate` — stub. Returns 501. Per-decky mutation is currently performed as a full `/deploy` with an updated `DecnetConfig` to avoid duplicating mutation logic worker-side.
- FastAPI app itself is built with `docs_url=None`, `redoc_url=None`, `openapi_url=None` — no interactive docs on workers.
### `decnet/agent/server.py`
uvicorn launcher. Not the app process itself — spawns uvicorn as a subprocess so signals land on a predictable PID and the tls-related flags live in one place.
- Requires `~/.decnet/agent/{worker.key, worker.crt, ca.crt}`. Missing bundle → prints an instructional error, exits 2 (operator likely forgot `swarm enroll`).
- Spawns `python -m uvicorn decnet.agent.app:app --host HOST --port PORT --ssl-keyfile <worker.key> --ssl-certfile <worker.crt> --ssl-ca-certs <ca.crt> --ssl-cert-reqs 2`. The `2` is `ssl.CERT_REQUIRED`: a peer that presents no cert fails the TLS handshake before any handler runs.
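The launcher behaviour described above (check the bundle, exit 2 if it is missing, otherwise assemble the uvicorn command line) might look roughly like this. `build_uvicorn_argv` is a hypothetical name, and the error wording is invented for the sketch; only the flags and the exit code come from the description.

```python
import ssl
import sys
from pathlib import Path


def build_uvicorn_argv(bundle_dir, host, port, app="decnet.agent.app:app"):
    """Assemble the uvicorn command line, or exit 2 if the cert bundle is incomplete."""
    key, crt, ca = (Path(bundle_dir) / n for n in ("worker.key", "worker.crt", "ca.crt"))
    missing = [p.name for p in (key, crt, ca) if not p.is_file()]
    if missing:
        # Illustrative message; the real launcher prints its own instructions.
        print(f"missing {', '.join(missing)} in {bundle_dir}; was the host enrolled?",
              file=sys.stderr)
        sys.exit(2)
    return [
        sys.executable, "-m", "uvicorn", app,
        "--host", host, "--port", str(port),
        "--ssl-keyfile", str(key),
        "--ssl-certfile", str(crt),
        "--ssl-ca-certs", str(ca),
        # ssl.CERT_REQUIRED has the integer value 2, which is what uvicorn expects here.
        "--ssl-cert-reqs", str(int(ssl.CERT_REQUIRED)),
    ]
```

The resulting list can be handed to `subprocess.Popen` as-is; deriving the `2` from `ssl.CERT_REQUIRED` keeps the magic number tied to its meaning.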
### `decnet/agent/executor.py`
Thin async shim between the FastAPI handlers and the existing unihost orchestration code.
- `decnet/agent/executor.py::deploy` — async wrapper around `decnet.engine.deployer.deploy`. Runs the blocking work off the event loop. If the worker's local NIC/subnet differs from what the master serialised, the config is relocalised before deploy (see `engine.deployer` for the rewriting rules).
- `decnet/agent/executor.py::teardown` — async wrapper around `decnet.engine.deployer.teardown`.
- `decnet/agent/executor.py::status` — calls `decnet.config.load_state()` and returns the snapshot dict verbatim.
Reads: `decnet-state.json`, Docker daemon. Writes: whatever the engine writes (compose file, docker networks/containers, state file).
---
## Updater — `decnet/updater/`
Worker-side self-update daemon. FastAPI app behind uvicorn with mTLS on port 8766. Runs from `/opt/decnet/venv/` initially, and from `/opt/decnet/updater/venv/` after the first successful `--include-self` push. Never modified by a normal `/update`.
This is the daemon that owns the agent's lifecycle during a push — see [Remote-Updates](Remote-Updates) for the operator-facing view and [PKI and mTLS](PKI-and-mTLS) for the cert story.
### `decnet/updater/__init__.py`
Empty package marker.
### `decnet/updater/app.py`
- `decnet/updater/app.py::_Config` — module-level holder for the three paths the handlers need (`install_dir`, `updater_install_dir`, `agent_dir`). Defaults come from `DECNET_UPDATER_INSTALL_DIR` / `DECNET_UPDATER_UPDATER_DIR` / `DECNET_UPDATER_AGENT_DIR`, which `server.py` sets before spawning uvicorn.
- `decnet/updater/app.py::configure` — injected-paths setter used by the server launcher. Must run before serving.
- `GET /health` — returns `{"status": "ok", "role": "updater", "releases": [...]}`. The `role` field is the only thing that distinguishes this from the agent's `/health` to a caller that doesn't track ports.
- `GET /releases`: `{"releases": [...]}`; each release is `{slot, sha, installed_at}`.
- `POST /update` — multipart: `tarball: UploadFile`, `sha: str`. Delegates to `executor.run_update`. Returns 500 on generic `UpdateError`, 409 if the update was already rolled back (operator should read the response body for stderr + probe transcripts).
- `POST /update-self` — multipart: `tarball`, `sha`, `confirm_self: str`. The `confirm_self.lower() != "true"` guard is non-negotiable; there is no auto-rollback on this path.
- `POST /rollback` — no body. 404 if there's no `prev/` slot (fresh install), 500 on other failure.
- FastAPI app built with `docs_url=None`, `redoc_url=None`, `openapi_url=None`.
### `decnet/updater/server.py`
Same shape as the agent's server launcher — spawns uvicorn with mTLS flags. Reads `~/.decnet/updater/{updater.key, updater.crt, ca.crt}`.
Before spawning uvicorn, exports:
- `DECNET_UPDATER_INSTALL_DIR` — release root (`/opt/decnet` by default).
- `DECNET_UPDATER_UPDATER_DIR` — updater's own install root (`<install_dir>/updater`).
- `DECNET_UPDATER_AGENT_DIR` — agent bundle dir (for the local mTLS health probe after an update).
- `DECNET_UPDATER_BUNDLE_DIR` — the updater's own cert bundle (`~/.decnet/updater/`).
- `DECNET_UPDATER_HOST`, `DECNET_UPDATER_PORT` — needed so `run_update_self` can rebuild the operator-visible `decnet updater ...` command line when it `os.execv`s into the new binary.
### `decnet/updater/executor.py`
The heart of the update pipeline. Every seam is named `_foo` and monkeypatched by tests so the test suite never shells out.
- `decnet/updater/executor.py::DEFAULT_INSTALL_DIR` = `/opt/decnet`.
- `decnet/updater/executor.py::UpdateError(RuntimeError)` — carries `stderr: str` (pip output) and `rolled_back: bool`.
- `decnet/updater/executor.py::Release``(slot: str, sha: str | None, installed_at: datetime | None)` dataclass, what `/releases` returns.
- `decnet/updater/executor.py::list_releases` — scans `install_dir/releases/*/release.json`; returns them oldest-first.
- `decnet/updater/executor.py::run_update` — the big one. Extracts the tarball into `active.new/`, runs `_run_pip`, rotates, `_stop_agent`, `_spawn_agent`, `_probe_agent`. On probe failure: flip symlink back to `prev`, restart agent, re-probe, raise `UpdateError(rolled_back=True)`.
- `decnet/updater/executor.py::run_rollback`: thin wrapper around the symlink swap-and-restart path, for manual use via `POST /rollback`.
- `decnet/updater/executor.py::run_update_self` — separate pipeline targeting `updater_install_dir`. Does not call `_stop_agent`/`_spawn_agent`; ends in `os.execv` so the process image is replaced. Rebuilds the argv from env vars (see `server.py` above) — `sys.argv[1:]` is the uvicorn subprocess invocation and cannot be reused.
- `decnet/updater/executor.py::_run_pip` — on first use, bootstraps `<install_dir>/venv/` with the full dep tree; subsequent calls use `--force-reinstall --no-deps` so the near-no-op case is cheap.
- `decnet/updater/executor.py::_spawn_agent`: `subprocess.Popen([<venv>/bin/decnet, "agent", "--daemon"], start_new_session=True, cwd=install_dir)`. Writes the new PID to `agent.pid`. `cwd=install_dir` is what lets a persistent `<install_dir>/.env.local` take effect.
- `decnet/updater/executor.py::_stop_agent` — SIGTERM the PID in `agent.pid`, wait up to `AGENT_RESTART_GRACE_S`, SIGKILL the survivor. Falls back to `_discover_agent_pids` when no pidfile exists (manually-started agents) so restart is reliable regardless of how the agent was originally launched.
- `decnet/updater/executor.py::_discover_agent_pids` — scans `/proc/*/cmdline` for any process whose argv contains `decnet` + `agent`. Skips its own PID. Returns an int list.
- `decnet/updater/executor.py::_probe_agent` — mTLS `GET https://127.0.0.1:8765/health` up to 10 times with 1 s backoff. Uses a bare `ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)` rather than `ssl.create_default_context()` — on Python 3.13 the default context enables `VERIFY_X509_STRICT`, which rejects CA certs without AKI (which `generate_ca` doesn't emit).
- `decnet/updater/executor.py::_shared_venv` — returns `<install_dir>/venv`. Central so every caller agrees on one path.
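The restart, probe, and rollback sequence in `run_update` can be sketched with the seams passed in as plain callables, which is also roughly how a test would replace them. Everything here is an assumption for illustration: `probe_with_retry` and `restart_and_verify` are hypothetical names, and the `UpdateError` constructor signature is a guess based only on the two attributes listed above.

```python
import time


class UpdateError(RuntimeError):
    """Stand-in for the executor's error type (constructor shape is assumed)."""

    def __init__(self, message, *, stderr="", rolled_back=False):
        super().__init__(message)
        self.stderr = stderr
        self.rolled_back = rolled_back


def probe_with_retry(probe, attempts=10, delay_s=1.0, sleep=time.sleep):
    """Call probe() up to `attempts` times, sleeping between failed tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


def restart_and_verify(stop_agent, spawn_agent, probe, flip_to_prev, sleep=time.sleep):
    """Restart the agent; on probe failure roll back, restart again, then raise."""
    stop_agent()
    spawn_agent()
    if probe_with_retry(probe, sleep=sleep):
        return
    flip_to_prev()  # point the symlink back at the previous release
    stop_agent()
    spawn_agent()
    probe_with_retry(probe, sleep=sleep)  # re-probe the restored release
    raise UpdateError("agent failed its health probe after update", rolled_back=True)
```

Injecting `sleep` alongside the other seams lets a test run the ten-attempt loop without waiting ten seconds.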
Reads: nothing persistent before the first update; afterwards, the release directories under `install_dir/releases/`. Writes: the release directories, `agent.pid`, `agent.spawn.log`, and the `current` symlink.
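For the release scan, a minimal sketch of `list_releases` is given below. The `release.json` schema used here (`sha` plus an ISO 8601 `installed_at`) is an assumption made for the example, as are the slot directory names in the usage; only the glob pattern, the `Release` fields, and the oldest-first ordering come from the description above.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Optional


@dataclass
class Release:
    slot: str
    sha: Optional[str]
    installed_at: Optional[datetime]


def list_releases(install_dir):
    """Read releases/*/release.json under install_dir and return them oldest-first."""
    releases = []
    for meta in (Path(install_dir) / "releases").glob("*/release.json"):
        data = json.loads(meta.read_text())
        stamp = data.get("installed_at")  # assumed to be an ISO 8601 string
        releases.append(Release(
            slot=meta.parent.name,
            sha=data.get("sha"),
            installed_at=datetime.fromisoformat(stamp) if stamp else None,
        ))
    # Releases without a timestamp sort first, then by install time ascending.
    releases.sort(key=lambda r: (r.installed_at is not None, r.installed_at or datetime.min))
    return releases
```

The slot name is taken from the directory rather than the JSON file, so a release keeps its identity even if its metadata is incomplete.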
### `decnet/updater/routes/`
Reserved for handler splits once the app grows. All routes currently live in `app.py`.

PKI-and-mTLS.md Normal file

@@ -0,0 +1,162 @@
# PKI and mTLS
DECNET's cross-host control plane — master ↔ agent (`/deploy`, `/teardown`, …), master ↔ updater (`/update`, `/update-self`, …), and worker → master log forwarding (RFC 5425 syslog-over-TLS) — is gated end-to-end by mutual TLS under a single private CA. This page is the developer-level reference: how the CA is built, how leaf certs are issued, how clients and servers wire the `ssl.SSLContext`, and which trust decisions the code actually enforces versus which ones it still relies on convention for.
For the operator walkthrough of issuing certs, see [SWARM-Mode § Enrollment](SWARM-Mode#decnet-swarm-enroll) and [Remote-Updates § Enrollment](Remote-Updates#enrollment).
## Why mTLS
The decoy network is intentionally internet-exposed. The control plane that builds it — deploy commands, log streams, code pushes — must not be. The threat model the project assumes is:
- An attacker who finds a decky may port-scan the host for a control daemon.
- An attacker who compromises a worker's process space may try to talk to the master.
- A neighbour on the LAN may try to inject forged syslog lines into the master's SIEM.
A firewall alone isn't enough: the decoy network and the control network often share physical infrastructure. So the rule is: **every TCP connection between DECNET components presents a client cert signed by the DECNET CA, and every server rejects peers that don't.** TLS handshakes fail before a single byte reaches a handler. That is the entire security boundary.
## One CA, one root of trust
There is exactly one private key in the system that matters, the CA key. It lives on the master at `~/.decnet/ca/ca.key` (mode 0600). Every other cert in the fleet chains back to it.
- `decnet/swarm/pki.py::CABundle`: `(key_pem, cert_pem)`.
- `decnet/swarm/pki.py::generate_ca` — RSA-4096, self-signed, SHA-256 signature, 10-year validity. Extensions: `BasicConstraints(ca=True, path_length=0)` (may sign leaves, may not sign sub-CAs) and `KeyUsage(key_cert_sign=True, crl_sign=True)`.
- `decnet/swarm/pki.py::CA_KEY_BITS` = 4096, `CA_VALIDITY_DAYS` = 3650.
No intermediate CAs. The project deliberately stays flat because the fleet is expected to be small (dozens of hosts, not thousands) and a flat chain is trivially auditable with a single `openssl verify -CAfile ca.crt worker.crt`.
### What's intentionally *not* on the CA
- **No Authority Key Identifier (AKI) extension.** `generate_ca` does not set one and `issue_worker_cert` does not copy one down either. This is visible in behaviour: a probe using `ssl.create_default_context()` on Python 3.13 will reject the chain with `Missing Authority Key Identifier` because 3.13's default context enables `VERIFY_X509_STRICT`. The project works around this by using a bare `ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)` in internal probes (see `decnet/updater/executor.py::_probe_agent`). Adding AKI/SKI is a low-risk future change; the trade-off is a chain re-issuance across the whole fleet.
- **No CRL or OCSP responder URL.** Revocation is out-of-band: the operator rotates a compromised leaf by removing its fingerprint from the master DB and re-enrolling the host with a fresh bundle.
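The difference between the two context constructors can be checked directly from `verify_flags`. This is a small diagnostic sketch; `is_strict` is a hypothetical helper, and clearing the flag on a default context is shown only as an alternative that the `ssl` module supports, not as what the project does (it uses the bare context).

```python
import ssl


def is_strict(ctx: ssl.SSLContext) -> bool:
    """True if the context has X.509 strict verification enabled."""
    return bool(ctx.verify_flags & ssl.VERIFY_X509_STRICT)


default_ctx = ssl.create_default_context()          # strict on Python 3.13+
bare_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)  # only the flags you set yourself

print("default context strict:", is_strict(default_ctx))
print("bare context strict:   ", is_strict(bare_ctx))

# Alternative: keep the other defaults but clear the strict flag explicitly.
relaxed_ctx = ssl.create_default_context()
relaxed_ctx.verify_flags &= ~ssl.VERIFY_X509_STRICT
print("relaxed context strict:", is_strict(relaxed_ctx))
```

Running this on the interpreter that hosts a probe is a quick way to confirm whether a handshake failure is down to strict mode.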
## Leaf certs: agent vs updater
Every worker gets up to two leaf certs, both signed by the same CA:
| Leaf | CN | Default SANs | Served on | Bundle dir |
|---|---|---|---|---|
| Agent | `<hostname>` | operator-supplied list plus `<address>` | `0.0.0.0:8765` | `~/.decnet/agent/` |
| Updater (optional) | `updater@<hostname>` | agent SANs plus `127.0.0.1` | `0.0.0.0:8766` | `~/.decnet/updater/` |
- `decnet/swarm/pki.py::issue_worker_cert(ca, worker_name, sans, validity_days=825)` — RSA-2048, SHA-256, 825-day validity. CN is the caller-supplied `worker_name` verbatim (the `updater@` prefix is a caller convention, not a PKI-enforced one). `ExtKeyUsage(serverAuth, clientAuth)` is set because every leaf is both a server (it accepts calls from the master) and a client (it originates log-forwarding connections to the master's syslog listener).
- `decnet/swarm/pki.py::IssuedCert`: `(key_pem, cert_pem, ca_cert_pem, fingerprint_sha256)`. The fingerprint is `sha256(cert_pem_der).hexdigest()` and is what the master stores in `swarm_hosts.worker_cert_fingerprint` / `updater_cert_fingerprint` for audit.
- `decnet/swarm/pki.py::write_worker_bundle` — writes `worker.key` (0600), `worker.crt`, `ca.crt` into the bundle dir. Updater bundles use `updater.key` / `updater.crt` filenames in `~/.decnet/updater/` instead; the code reuses the same writer but the caller names the files.
- `decnet/swarm/pki.py::WORKER_KEY_BITS` = 2048, `decnet/swarm/pki.py::WORKER_VALIDITY_DAYS` = 825.
The master itself also holds a CA-signed leaf at `~/.decnet/ca/master/worker.crt` — the same shape as a worker cert, but that's the one the master presents *as a client* to every worker daemon.
## Enrollment flow
There is no pre-authenticated bootstrap endpoint. Enrollment is master-driven and the keys never leave the operator's hands:
1. On the master, `decnet swarm enroll --host <name> --address <ip> --sans <csv> [--updater]` generates the leaf bundle(s) locally via `issue_worker_cert`.
2. The fingerprints and metadata are written to `swarm_hosts` in the master DB.
3. The operator copies the bundle(s) to the worker (one-time out-of-band step — the only `scp` the workflow prescribes).
4. On the worker, `sudo decnet agent --daemon` and optionally `sudo decnet updater --daemon` pick up the bundle from the standard path.
After that, the only auth the master ever uses when talking to the worker is the mTLS handshake.
## Client side: how every SSLContext is built
Every outgoing control-plane call — `AgentClient`, `UpdaterClient`, the updater's local health probe after a push — builds its `ssl.SSLContext` the same way:
```python
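import ssl

# cert_path, key_path and ca_cert_path point into the caller's own bundle dir.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.load_cert_chain(certfile=cert_path, keyfile=key_path)  # always present a client cert
ctx.load_verify_locations(cafile=ca_cert_path)             # trust only the DECNET CA
ctx.verify_mode = ssl.CERT_REQUIRED                        # peer must present a valid cert
ctx.check_hostname = False                                 # pin by CA chain, not DNS name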
```
Key points:
- **Client cert is always presented.** `load_cert_chain` is called unconditionally; this is a client-authenticated handshake, not a plain TLS one.
- **Hostname check is disabled.** Workers enroll with arbitrary SANs (an IP, a LAN hostname, a `.local` name — whichever the operator chose) and the master may dial them by yet another DNS name routed through its own `/etc/hosts`. Pinning by DNS would create constant handshake failures for no security gain. Trust is derived from the CA chain instead: if the peer cert chains to the DECNET CA, it *is* a DECNET peer, regardless of what name DNS currently maps.
- **`CERT_REQUIRED` is set explicitly.** `PROTOCOL_TLS_CLIENT` already defaults to `CERT_REQUIRED` (and to `check_hostname = True`), but the code assigns `verify_mode` anyway so the requirement is visible at the call site and does not depend on a library default.
- **Bare `SSLContext`, not `create_default_context()`.** The default context on Python 3.13 enables `ssl.VERIFY_X509_STRICT`, which rejects chains whose certificates lack AKI. The DECNET certs do not set AKI (see above), so the default context rejects the chain. Using a bare context gives us exactly the knobs we flip and nothing else. If AKI is ever added to `generate_ca`, this workaround can be dropped.
`AgentClient` and `UpdaterClient` both follow this pattern. The code is near-duplicated across `decnet/swarm/client.py` and `decnet/swarm/updater_client.py` (~20 lines each); a shared helper was considered and rejected because it would have to branch on a tiny number of per-client details and the duplication is more legible than the factory.
## Server side: how uvicorn is launched
Both `decnet/agent/server.py` and `decnet/updater/server.py` spawn uvicorn as a subprocess. The TLS flags are identical in shape; only the bundle dir and port differ.
```bash
python -m uvicorn <app> \
    --host 0.0.0.0 --port <port> \
    --ssl-keyfile <worker.key or updater.key> \
    --ssl-certfile <worker.crt or updater.crt> \
    --ssl-ca-certs <ca.crt> \
    --ssl-cert-reqs 2
```
`--ssl-cert-reqs 2` is `ssl.CERT_REQUIRED`. Any TCP connection that doesn't present a cert signed by the CA in `--ssl-ca-certs` is torn down at handshake time, before uvicorn routes the request to an app handler. This is the only enforcement layer for inbound calls — there is no token, no signature, no application-layer check underneath.
The agent and updater apps themselves construct their FastAPI instances with `docs_url=None`, `redoc_url=None`, `openapi_url=None`. There is no Swagger UI on a worker; the attack surface is exactly the routes the module explicitly registers.
## CN and role separation: what's enforced vs. what isn't
The updater's `decnet/updater/app.py` module docstring says:
> Mounted by uvicorn via `decnet.updater.server` with `--ssl-cert-reqs 2`; the CN on the peer cert tells us which endpoints are legal (`updater@*` only — agent certs are rejected).
**As of this writing, the CN check is not enforced in code.** There is no middleware, dependency, or early-handler gate that reads the peer cert and compares CN. The TLS layer admits any CA-signed peer, and the handler runs. In practice this has not been exploited because:
- The master uses one cert for both daemons (it presents the same `~/.decnet/ca/master/worker.crt` to port 8765 and 8766), so the CN-split is operator-only. There is no "agent cert" a misbehaving master would accidentally present to the updater.
- Worker-side, nothing presents a cert to the opposite daemon — the agent does not call the updater and vice versa.
It is still a gap. A compromised worker holding only an agent bundle *can* call the updater's endpoints on its own loopback if it reaches them. The plan item is:
1. Read the peer cert from the TLS session in a FastAPI dependency (`request.scope["transport"].get_extra_info("peercert")`).
2. Extract CN.
3. For the updater app: reject if CN does not match `updater@*`.
4. For the agent app: reject if CN matches `updater@*` (agents should not masquerade as updaters either).
Until then, treat the CN prefix as a deployment convention that the code documents but does not police.
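The planned check could be prototyped independently of FastAPI by working on the dict that `ssl.SSLSocket.getpeercert()` returns. This is a sketch of the plan as listed, not existing code: `peer_cn` and `cn_allowed` are hypothetical names, and how the master's own cert (which currently talks to both daemons) would fit these rules is not specified in the source, so the sketch leaves it out.

```python
from fnmatch import fnmatchcase
from typing import Optional


def peer_cn(peercert: dict) -> Optional[str]:
    """Extract commonName from a getpeercert()-style dict, or None if absent."""
    # "subject" is a tuple of RDNs; each RDN is a tuple of (key, value) pairs.
    for rdn in peercert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None


def cn_allowed(app_role: str, cn: Optional[str]) -> bool:
    """Apply the planned rule: updater app wants updater@*, agent app rejects it."""
    if cn is None:
        return False
    is_updater_cn = fnmatchcase(cn, "updater@*")
    if app_role == "updater":
        return is_updater_cn
    return not is_updater_cn


if __name__ == "__main__":
    cert = {"subject": ((("commonName", "updater@worker-1"),),)}
    print(cn_allowed("updater", peer_cn(cert)), cn_allowed("agent", peer_cn(cert)))
```

Keeping the policy in a pure function means it can be unit-tested without a TLS handshake and then wrapped in whatever dependency ends up reading the peer cert.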
## Worker → master: syslog-over-TLS (RFC 5425)
The same PKI gates the log pipeline.
- `decnet/swarm/log_forwarder.py::ForwarderConfig` — worker-side sender, opens an mTLS connection to `master_host:6514` using its agent bundle.
- `decnet/swarm/log_listener.py::ListenerConfig` — master-side listener, defaults to `0.0.0.0:6514`, trusts the DECNET CA, requires the peer to present a CA-signed cert.
- `decnet/swarm/log_listener.py::build_listener_ssl_context` — server-side `ssl.SSLContext` mirroring the client-side one (`CERT_REQUIRED`, CA chain pinned). The listener extracts CN from the peer cert and uses that as the authoritative worker identity for the line. **The RFC 5424 HOSTNAME field in the syslog message is never trusted for authentication** — a worker can claim any HOSTNAME it wants; only the CN decides which host is credited.
Plaintext syslog across hosts is a project non-goal and is rejected at review. Loopback-only syslog (service container → worker-local file) is allowed.
## Fingerprints and the master DB
- `decnet/swarm/pki.py::fingerprint(cert_pem)`: `sha256(cert_pem_der).hexdigest()`.
- Each enrolled host has `worker_cert_fingerprint` (and optionally `updater_cert_fingerprint`) stored on the `swarm_hosts` row when the bundle is issued.
- These are not used for TLS — the CA chain already authenticates the peer. They exist for operator audit: `decnet swarm list` can print them, and a stored fingerprint mismatched against the one a newly-deployed worker serves is a signal that someone re-issued a cert out-of-band.
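A fingerprint of this shape can be computed with the standard library alone. The sketch below assumes the hash is taken over the DER bytes recovered from the PEM text (one reading of `cert_pem_der`); it illustrates the idea and is not the project's implementation.

```python
import hashlib
import ssl


def fingerprint(cert_pem: bytes) -> str:
    """SHA-256 hex digest of the certificate's DER encoding."""
    # PEM_cert_to_DER_cert strips the header/footer and base64-decodes the body,
    # so PEM line wrapping and trailing whitespace do not affect the result.
    der = ssl.PEM_cert_to_DER_cert(cert_pem.decode("ascii"))
    return hashlib.sha256(der).hexdigest()
```

Hashing the DER bytes rather than the PEM text is what makes the value comparable with `openssl x509 -fingerprint -sha256` output once the colons are removed and case is normalised.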
## Cert rotation
There is no automated rotation. The chosen validities (10 years for the CA; 825 days for leaves, the former CA/B Forum ceiling for publicly trusted certs, since lowered to 398 days) push that problem out far enough that it's tractable manually:
- To rotate a single worker's cert: re-run `decnet swarm enroll --host <name>` with `--reissue`, copy the new bundle, restart the daemon.
- To rotate the CA: issue a new CA, re-sign every leaf, ship new bundles to every host, restart every daemon. This is a fleet-wide event — the plan is to only do it if the CA key is believed compromised.
## Filesystem layout recap
```
master:
  ~/.decnet/ca/
    ca.key                                             private key (0600), never copied
    ca.crt                                             root cert
    master/{worker.key, worker.crt, ca.crt}            master's own client bundle
    workers/<name>/{worker.key, worker.crt, ca.crt}    issued agent bundles
    workers/<name>/{updater.key, updater.crt, ca.crt}  issued updater bundles (if --updater)
worker:
  ~/.decnet/agent/{worker.key, worker.crt, ca.crt}
  ~/.decnet/updater/{updater.key, updater.crt, ca.crt}  (if enrolled with --updater)
```
No private keys ever leave the host that owns them, modulo the one-time operator-driven bundle delivery at enrollment. If a bundle is leaked, rotate the leaf and clear the fingerprint from the master DB — the CA key doesn't need to move.
## Further reading
- [Remote-Updates](Remote-Updates) — how the updater uses its cert to authenticate push operations.
- [SWARM-Mode](SWARM-Mode) — operator-facing enrollment walkthrough.
- [Module Reference — Workers § Swarm](Module-Reference-Workers#swarm--decnetswarm) — module-level index of the `decnet/swarm/` package.

@@ -18,6 +18,7 @@
- [Networking-MACVLAN-IPVLAN](Networking-MACVLAN-IPVLAN)
- [Deployment-Modes](Deployment-Modes)
- [SWARM-Mode](SWARM-Mode)
- [Remote-Updates](Remote-Updates)
- [Environment-Variables](Environment-Variables)
- [Teardown-and-State](Teardown-and-State)
- [Database-Drivers](Database-Drivers)
@@ -39,6 +40,7 @@
- [Module-Reference-Web](Module-Reference-Web)
- [Module-Reference-Services](Module-Reference-Services)
- [Module-Reference-Workers](Module-Reference-Workers)
- [PKI-and-mTLS](PKI-and-mTLS)
- [Testing-and-CI](Testing-and-CI)
- [Performance-Story](Performance-Story)
- [Tracing-and-Profiling](Tracing-and-Profiling)