diff --git a/Module-Reference-Workers.md b/Module-Reference-Workers.md index 23f10b0..ba65f18 100644 --- a/Module-Reference-Workers.md +++ b/Module-Reference-Workers.md @@ -267,3 +267,171 @@ Shared helpers for the `LOG_TARGET` env var used by service plugins. --- See [Module Reference — Core](Module-Reference-Core) for top-level modules (cli, composer, telemetry, etc.) and [Module Reference — Web](Module-Reference-Web) for the FastAPI surface and DB layer. + +--- + +## Swarm — `decnet/swarm/` + +Master-side orchestration of multi-host deployments: HTTP clients for the worker daemons, the PKI that signs their certs, the tar helper that packages the working tree for a remote update, and the syslog-over-TLS forwarder/listener pair. Everything in this package runs either on the master or on every worker that needs to talk back to it — there is no third role. + +See [PKI and mTLS](PKI-and-mTLS) for the cert-chain details, cert layout, and why CN is not actually validated at the handler level. + +### `decnet/swarm/__init__.py` + +Re-exports `AgentClient`, `UpdaterClient`, `MasterIdentity`, `ensure_master_identity`, and the `pki` submodule. Importing `decnet.swarm` is enough for CLI-level callers; nothing else is considered public. + +### `decnet/swarm/client.py` + +Async HTTP client for the worker-side agent daemon (port 8765). One instance per worker target; `httpx.AsyncClient` is re-used across calls. + +- `decnet/swarm/client.py::AgentClient.__init__` — accepts either a `host` dict (from `swarm_hosts` DB rows) or a raw `address` string, resolves the master's own cert bundle via `MasterIdentity`, and builds the mTLS `ssl.SSLContext`. `agent_port` defaults to 8765. `verify_hostname=False` by default — we pin by CA chain, not DNS, because workers enroll with whatever SANs the operator chose. +- `decnet/swarm/client.py::AgentClient.deploy` — `POST /deploy` with a serialized `DecnetConfig` + `dry_run` + `no_cache`. 
Read timeout is bumped to 600 s because `docker compose build` can be very slow on underpowered workers. +- `decnet/swarm/client.py::AgentClient.teardown` — `POST /teardown` with optional `decky_id`. +- `decnet/swarm/client.py::AgentClient.health` — `GET /health`. The master never gets to this handler without a valid cert (uvicorn rejects the handshake) — this is a real liveness probe, not an auth endpoint. +- `decnet/swarm/client.py::AgentClient.status` — `GET /status`. +- mTLS wiring (in `__init__`): `ctx.load_cert_chain(...)`, `ctx.load_verify_locations(cafile=...)`, `ctx.verify_mode = ssl.CERT_REQUIRED`, `ctx.check_hostname = self._verify_hostname`. + +### `decnet/swarm/updater_client.py` + +Sibling client for the self-updater daemon (port 8766). Same mTLS pattern as `AgentClient` but targets a different port and uses multipart/form-data for tarball uploads. + +- `decnet/swarm/updater_client.py::UpdaterClient.__init__` — `updater_port=8766`, same `MasterIdentity` bundle as `AgentClient`. The master uses one cert for both; the TLS layer doesn't care which daemon answers. +- `decnet/swarm/updater_client.py::UpdaterClient.health` — `GET /health`. +- `decnet/swarm/updater_client.py::UpdaterClient.update` — `POST /update` with `tarball: bytes` + `sha: str` as multipart fields. 180 s read timeout covers tarball upload + `pip install` + probe-with-retry. +- `decnet/swarm/updater_client.py::UpdaterClient.update_self` — `POST /update-self`; sends `confirm_self=true` to pass the server-side safety check (see [Remote-Updates](Remote-Updates)). Tolerates the mid-response disconnect that `os.execv` causes by catching `RemoteProtocolError` and treating it as "success pending `/health` poll". +- `decnet/swarm/updater_client.py::UpdaterClient.rollback` — `POST /rollback`, 404 if no `prev/` slot. + +### `decnet/swarm/pki.py` + +The one place in the codebase that holds a private key. Everything else consumes `IssuedCert` bundles it produces. 
+ +- `decnet/swarm/pki.py::DEFAULT_CA_DIR` = `~/.decnet/ca`; `decnet/swarm/pki.py::DEFAULT_AGENT_DIR` = `~/.decnet/agent`. +- `CA_KEY_BITS = 4096`, `WORKER_KEY_BITS = 2048`, `CA_VALIDITY_DAYS = 3650`, `WORKER_VALIDITY_DAYS = 825`. +- `decnet/swarm/pki.py::CABundle` — `(key_pem: bytes, cert_pem: bytes)` dataclass for the CA private key + self-signed cert. +- `decnet/swarm/pki.py::IssuedCert` — `(key_pem, cert_pem, ca_cert_pem, fingerprint_sha256: str)` for a signed leaf bundle. `fingerprint_sha256` is what the DB stores for out-of-band enrollment audit. +- `decnet/swarm/pki.py::generate_ca` — RSA-4096, self-signed, `BasicConstraints(ca=True, path_length=0)`, `KeyUsage(key_cert_sign=True, crl_sign=True)`, signed with SHA-256, 10-year validity. +- `decnet/swarm/pki.py::issue_worker_cert` — RSA-2048 leaf, CN = caller-supplied `worker_name` (`hostname` for agent certs, `updater@hostname` for updater certs), SANs built from the list the caller passes (IPs parsed as `IPAddress`, everything else as `DNSName`), `ExtKeyUsage(serverAuth, clientAuth)` — both flags because the worker is a server to the master and a client when it forwards logs. +- `decnet/swarm/pki.py::write_worker_bundle` — writes `worker.key` (mode 0600), `worker.crt`, `ca.crt` into the bundle dir. Updater bundles write to `~/.decnet/updater/` with `updater.key` / `updater.crt` names instead. +- `decnet/swarm/pki.py::load_worker_bundle` — loads an `IssuedCert` off disk; used by the agent/updater at startup. +- `decnet/swarm/pki.py::fingerprint` — `sha256(cert_pem_der).hexdigest()`. Cheap, deterministic, stable across cert re-encodings. + +### `decnet/swarm/tar_tree.py` + +Builds the working-tree tarball that `decnet swarm update` ships to the updater. 
+ +- `decnet/swarm/tar_tree.py::DEFAULT_EXCLUDES` — filter tuple: `.venv/`, `__pycache__/`, `.git/`, `wiki-checkout/`, `*.pyc`, `*.pyo`, `*.db*`, `*.log`, `.pytest_cache/`, `.mypy_cache/`, `.tox/`, `*.egg-info/`, `decnet-state.json`, `master.log`, `master.json`, `decnet.db*`. These are enforced regardless of `.gitignore` so untracked dev artefacts never leak onto workers. +- `decnet/swarm/tar_tree.py::_is_excluded` — `fnmatch` the relative path *and* every leading subpath so a pattern like `.git/` excludes everything underneath. + +### `decnet/swarm/log_forwarder.py` + +Worker → master half of the RFC 5425 syslog-over-TLS pipeline. Wakes up periodically, reads new lines from the local log file, frames them octet-counted per RFC 5425, and writes them over an mTLS connection to port 6514 on the master. + +- `decnet/swarm/log_forwarder.py::ForwarderConfig` — dataclass: `log_path`, `master_host`, `master_port=6514`, `agent_dir=~/.decnet/agent`, optional `state_db` for byte-offset persistence. +- Plaintext syslog across hosts is forbidden by project policy — see [Syslog over TLS](#) notes. Loopback only may use plaintext. + +### `decnet/swarm/log_listener.py` + +Master-side RFC 5425 receiver. One mTLS-protected TCP socket on 6514; accepts connections from any worker whose cert is signed by the DECNET CA. + +- `decnet/swarm/log_listener.py::ListenerConfig` — `log_path`, `json_path`, `bind_host="0.0.0.0"`, `bind_port=6514`, `ca_dir=~/.decnet/ca`. +- `decnet/swarm/log_listener.py::build_listener_ssl_context` — server-side `ssl.SSLContext`: master presents `ca/master/worker.crt`, requires the peer to present a DECNET CA-signed cert. The CN on the peer cert is the authoritative worker identity — the RFC 5424 HOSTNAME field is untrusted input and is never used for authentication. + +--- + +## Agent — `decnet/agent/` + +Worker-side daemon. FastAPI app behind uvicorn with mTLS on port 8765. Accepts deploy / teardown / status requests from the master and executes them locally. 
+ +See [Remote-Updates](Remote-Updates) for the lifecycle management around this process — the agent is not self-supervising. + +### `decnet/agent/__init__.py` + +Empty package marker. + +### `decnet/agent/app.py` + +- `decnet/agent/app.py::DeployRequest` — pydantic body model: `{config: DecnetConfig, dry_run: bool, no_cache: bool}`. +- `decnet/agent/app.py::TeardownRequest` — `{decky_id: str | None}`. +- `decnet/agent/app.py::MutateRequest` — `{decky_id: str, services: list[str]}` (reserved; handler returns 501). +- `GET /health` — returns `{"status": "ok", "marker": "..."}`. mTLS still required — the master's liveness probe carries its cert. +- `GET /status` — awaits `executor.status()`; returns the worker's current deployment snapshot. +- `POST /deploy` — calls `executor.deploy(config, dry_run, no_cache)`. Returns `{"status": "deployed", "deckies": int}` on success, `HTTPException(500)` with the caught exception's message on failure. +- `POST /teardown` — calls `executor.teardown(decky_id)`. +- `POST /mutate` — stub. Returns 501. Per-decky mutation is currently performed as a full `/deploy` with an updated `DecnetConfig` to avoid duplicating mutation logic worker-side. +- FastAPI app itself is built with `docs_url=None`, `redoc_url=None`, `openapi_url=None` — no interactive docs on workers. + +### `decnet/agent/server.py` + +uvicorn launcher. Not the app process itself — spawns uvicorn as a subprocess so signals land on a predictable PID and the tls-related flags live in one place. + +- Requires `~/.decnet/agent/{worker.key, worker.crt, ca.crt}`. Missing bundle → prints an instructional error, exits 2 (operator likely forgot `swarm enroll`). +- Spawns `python -m uvicorn decnet.agent.app:app --host HOST --port PORT --ssl-keyfile --ssl-certfile --ssl-ca-certs --ssl-cert-reqs 2`. The `2` is `ssl.CERT_REQUIRED` — no cert = TCP reset before any handler runs. 
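
The launcher behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `decnet/agent/server.py`: `build_uvicorn_argv` is a hypothetical helper name, and where the real launcher prints an instructional error and exits 2, the sketch raises `FileNotFoundError` so that it stays easy to test. The `--ssl-*` options are standard uvicorn command-line flags.

```python
import ssl
import sys
from pathlib import Path


def build_uvicorn_argv(bundle_dir: Path, host: str, port: int) -> list[str]:
    """Assemble a uvicorn command line that demands a client certificate.

    Hypothetical helper for illustration; the bundle file names follow the
    agent layout described in the text (worker.key, worker.crt, ca.crt).
    """
    key = bundle_dir / "worker.key"
    cert = bundle_dir / "worker.crt"
    ca = bundle_dir / "ca.crt"

    # Refuse to start with an incomplete bundle (the real launcher exits 2 here).
    missing = [p.name for p in (key, cert, ca) if not p.is_file()]
    if missing:
        raise FileNotFoundError(
            f"incomplete bundle in {bundle_dir}: {', '.join(missing)}"
        )

    return [
        sys.executable, "-m", "uvicorn", "decnet.agent.app:app",
        "--host", host,
        "--port", str(port),
        "--ssl-keyfile", str(key),
        "--ssl-certfile", str(cert),
        "--ssl-ca-certs", str(ca),
        # ssl.CERT_REQUIRED has the integer value 2, which is what uvicorn expects.
        "--ssl-cert-reqs", str(int(ssl.CERT_REQUIRED)),
    ]
```

Deriving the flag value from `ssl.CERT_REQUIRED` rather than hard-coding `2` keeps the intent readable at the call site; the resulting list can be handed directly to `subprocess.Popen`.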
+ +### `decnet/agent/executor.py` + +Thin async shim between the FastAPI handlers and the existing unihost orchestration code. + +- `decnet/agent/executor.py::deploy` — async wrapper around `decnet.engine.deployer.deploy`. Runs the blocking work off the event loop. If the worker's local NIC/subnet differs from what the master serialised, the config is relocalised before deploy (see `engine.deployer` for the rewriting rules). +- `decnet/agent/executor.py::teardown` — async wrapper around `decnet.engine.deployer.teardown`. +- `decnet/agent/executor.py::status` — calls `decnet.config.load_state()` and returns the snapshot dict verbatim. + +Reads: `decnet-state.json`, Docker daemon. Writes: whatever the engine writes (compose file, docker networks/containers, state file). + +--- + +## Updater — `decnet/updater/` + +Worker-side self-update daemon. FastAPI app behind uvicorn with mTLS on port 8766. Runs from `/opt/decnet/venv/` initially, and from `/opt/decnet/updater/venv/` after the first successful `--include-self` push. Never modified by a normal `/update`. + +This is the daemon that owns the agent's lifecycle during a push — see [Remote-Updates](Remote-Updates) for the operator-facing view and [PKI and mTLS](PKI-and-mTLS) for the cert story. + +### `decnet/updater/__init__.py` + +Empty package marker. + +### `decnet/updater/app.py` + +- `decnet/updater/app.py::_Config` — module-level holder for the three paths the handlers need (`install_dir`, `updater_install_dir`, `agent_dir`). Defaults come from `DECNET_UPDATER_INSTALL_DIR` / `DECNET_UPDATER_UPDATER_DIR` / `DECNET_UPDATER_AGENT_DIR`, which `server.py` sets before spawning uvicorn. +- `decnet/updater/app.py::configure` — injected-paths setter used by the server launcher. Must run before serving. +- `GET /health` — returns `{"status": "ok", "role": "updater", "releases": [...]}`. The `role` field is the only thing that distinguishes this from the agent's `/health` to a caller that doesn't track ports. 
+- `GET /releases` — `{"releases": [...]}`; each release is `{slot, sha, installed_at}`. +- `POST /update` — multipart: `tarball: UploadFile`, `sha: str`. Delegates to `executor.run_update`. Returns 500 on generic `UpdateError`, 409 if the update was already rolled back (operator should read the response body for stderr + probe transcripts). +- `POST /update-self` — multipart: `tarball`, `sha`, `confirm_self: str`. The `confirm_self.lower() != "true"` guard is non-negotiable; there is no auto-rollback on this path. +- `POST /rollback` — no body. 404 if there's no `prev/` slot (fresh install), 500 on other failure. +- FastAPI app built with `docs_url=None`, `redoc_url=None`, `openapi_url=None`. + +### `decnet/updater/server.py` + +Same shape as the agent's server launcher — spawns uvicorn with mTLS flags. Reads `~/.decnet/updater/{updater.key, updater.crt, ca.crt}`. + +Before spawning uvicorn, exports: + +- `DECNET_UPDATER_INSTALL_DIR` — release root (`/opt/decnet` by default). +- `DECNET_UPDATER_UPDATER_DIR` — updater's own install root (`/updater`). +- `DECNET_UPDATER_AGENT_DIR` — agent bundle dir (for the local mTLS health probe after an update). +- `DECNET_UPDATER_BUNDLE_DIR` — the updater's own cert bundle (`~/.decnet/updater/`). +- `DECNET_UPDATER_HOST`, `DECNET_UPDATER_PORT` — needed so `run_update_self` can rebuild the operator-visible `decnet updater ...` command line when it `os.execv`s into the new binary. + +### `decnet/updater/executor.py` + +The heart of the update pipeline. Every seam is named `_foo` and monkeypatched by tests so the test suite never shells out. + +- `decnet/updater/executor.py::DEFAULT_INSTALL_DIR` = `/opt/decnet`. +- `decnet/updater/executor.py::UpdateError(RuntimeError)` — carries `stderr: str` (pip output) and `rolled_back: bool`. +- `decnet/updater/executor.py::Release` — `(slot: str, sha: str | None, installed_at: datetime | None)` dataclass, what `/releases` returns. 
+- `decnet/updater/executor.py::list_releases` — scans `install_dir/releases/*/release.json`; returns them oldest-first. +- `decnet/updater/executor.py::run_update` — the big one. Extracts the tarball into `active.new/`, runs `_run_pip`, rotates, `_stop_agent`, `_spawn_agent`, `_probe_agent`. On probe failure: flip symlink back to `prev`, restart agent, re-probe, raise `UpdateError(rolled_back=True)`. +- `decnet/updater/executor.py::run_rollback` — symbolic wrapper around the swap-and-restart path, for manual use via `POST /rollback`. +- `decnet/updater/executor.py::run_update_self` — separate pipeline targeting `updater_install_dir`. Does not call `_stop_agent`/`_spawn_agent`; ends in `os.execv` so the process image is replaced. Rebuilds the argv from env vars (see `server.py` above) — `sys.argv[1:]` is the uvicorn subprocess invocation and cannot be reused. +- `decnet/updater/executor.py::_run_pip` — on first use, bootstraps `/venv/` with the full dep tree; subsequent calls use `--force-reinstall --no-deps` so the near-no-op case is cheap. +- `decnet/updater/executor.py::_spawn_agent` — `subprocess.Popen([/bin/decnet, "agent", "--daemon"], start_new_session=True, cwd=install_dir)`. Writes the new PID to `agent.pid`. `cwd=install_dir` is what lets a persistent `/.env.local` take effect. +- `decnet/updater/executor.py::_stop_agent` — SIGTERM the PID in `agent.pid`, wait up to `AGENT_RESTART_GRACE_S`, SIGKILL the survivor. Falls back to `_discover_agent_pids` when no pidfile exists (manually-started agents) so restart is reliable regardless of how the agent was originally launched. +- `decnet/updater/executor.py::_discover_agent_pids` — scans `/proc/*/cmdline` for any process whose argv contains `decnet` + `agent`. Skips its own PID. Returns an int list. +- `decnet/updater/executor.py::_probe_agent` — mTLS `GET https://127.0.0.1:8765/health` up to 10 times with 1 s backoff. 
Uses a bare `ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)` rather than `ssl.create_default_context()` — on Python 3.13 the default context enables `VERIFY_X509_STRICT`, which rejects CA certs without AKI (which `generate_ca` doesn't emit). +- `decnet/updater/executor.py::_shared_venv` — returns `/venv`. Central so every caller agrees on one path. + +Reads: nothing persistent before the first update; afterwards, the release directories under `install_dir/releases/`. Writes: the release directories, `agent.pid`, `agent.spawn.log`, and the `current` symlink. + +### `decnet/updater/routes/` + +Reserved for handler splits once the app grows. All routes currently live in `app.py`. diff --git a/PKI-and-mTLS.md b/PKI-and-mTLS.md new file mode 100644 index 0000000..ca693c6 --- /dev/null +++ b/PKI-and-mTLS.md @@ -0,0 +1,162 @@ +# PKI and mTLS + +DECNET's cross-host control plane — master ↔ agent (`/deploy`, `/teardown`, …), master ↔ updater (`/update`, `/update-self`, …), and worker → master log forwarding (RFC 5425 syslog-over-TLS) — is gated end-to-end by mutual TLS under a single private CA. This page is the developer-level reference: how the CA is built, how leaf certs are issued, how clients and servers wire the `ssl.SSLContext`, and which trust decisions the code actually enforces versus which ones it still relies on convention for. + +For the operator walkthrough of issuing certs, see [SWARM-Mode § Enrollment](SWARM-Mode#decnet-swarm-enroll) and [Remote-Updates § Enrollment](Remote-Updates#enrollment). + +## Why mTLS + +The decoy network is intentionally internet-exposed. The control plane that builds it — deploy commands, log streams, code pushes — must not be. The threat model the project assumes is: + +- An attacker who finds a decky may port-scan the host for a control daemon. +- An attacker who compromises a worker's process space may try to talk to the master. +- A neighbour on the LAN may try to inject forged syslog lines into the master's SIEM. 
+ +A firewall alone isn't enough: the decoy network and the control network often share physical infrastructure. So the rule is: **every TCP connection between DECNET components presents a client cert signed by the DECNET CA, and every server rejects peers that don't.** TLS handshakes fail before a single byte reaches a handler. That is the entire security boundary. + +## One CA, one root of trust + +There is exactly one private key in the system that matters, the CA key. It lives on the master at `~/.decnet/ca/ca.key` (mode 0600). Every other cert in the fleet chains back to it. + +- `decnet/swarm/pki.py::CABundle` — `(key_pem, cert_pem)`. +- `decnet/swarm/pki.py::generate_ca` — RSA-4096, self-signed, SHA-256 signature, 10-year validity. Extensions: `BasicConstraints(ca=True, path_length=0)` (may sign leaves, may not sign sub-CAs) and `KeyUsage(key_cert_sign=True, crl_sign=True)`. +- `decnet/swarm/pki.py::CA_KEY_BITS` = 4096, `CA_VALIDITY_DAYS` = 3650. + +No intermediate CAs. The project deliberately stays flat because the fleet is expected to be small (dozens of hosts, not thousands) and a flat chain is trivially auditable with a single `openssl verify -CAfile ca.crt worker.crt`. + +### What's intentionally *not* on the CA + +- **No Authority Key Identifier (AKI) extension.** `generate_ca` does not set one and `issue_worker_cert` does not copy one down either. This is visible in behaviour: a probe using `ssl.create_default_context()` on Python 3.13 will reject the chain with `Missing Authority Key Identifier` because 3.13's default context enables `VERIFY_X509_STRICT`. The project works around this by using a bare `ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)` in internal probes (see `decnet/updater/executor.py::_probe_agent`). Adding AKI/SKI is a low-risk future change; the trade-off is a chain re-issuance across the whole fleet. 
+- **No CRL or OCSP responder URL.** Revocation is out-of-band: the operator rotates a compromised leaf by removing its fingerprint from the master DB and re-enrolling the host with a fresh bundle. + +## Leaf certs: agent vs updater + +Every worker gets up to two leaf certs, both signed by the same CA: + +| Leaf | CN | Default SANs | Served on | Bundle dir | +|---|---|---|---|---| +| Agent | `` | operator-supplied list ∪ `
` | `0.0.0.0:8765` | `~/.decnet/agent/` | +| Updater (optional) | `updater@` | agent SANs ∪ `127.0.0.1` | `0.0.0.0:8766` | `~/.decnet/updater/` | + +- `decnet/swarm/pki.py::issue_worker_cert(ca, worker_name, sans, validity_days=825)` — RSA-2048, SHA-256, 825-day validity. CN is the caller-supplied `worker_name` verbatim (the `updater@` prefix is a caller convention, not a PKI-enforced one). `ExtKeyUsage(serverAuth, clientAuth)` is set because every leaf is both a server (it accepts calls from the master) and a client (it originates log-forwarding connections to the master's syslog listener). +- `decnet/swarm/pki.py::IssuedCert` — `(key_pem, cert_pem, ca_cert_pem, fingerprint_sha256)`. The fingerprint is `sha256(cert_pem_der).hexdigest()` and is what the master stores in `swarm_hosts.worker_cert_fingerprint` / `updater_cert_fingerprint` for audit. +- `decnet/swarm/pki.py::write_worker_bundle` — writes `worker.key` (0600), `worker.crt`, `ca.crt` into the bundle dir. Updater bundles use `updater.key` / `updater.crt` filenames in `~/.decnet/updater/` instead; the code reuses the same writer but the caller names the files. +- `decnet/swarm/pki.py::WORKER_KEY_BITS` = 2048, `decnet/swarm/pki.py::WORKER_VALIDITY_DAYS` = 825. + +The master itself also holds a CA-signed leaf at `~/.decnet/ca/master/worker.crt` — the same shape as a worker cert, but that's the one the master presents *as a client* to every worker daemon. + +## Enrollment flow + +No pre-authenticated bootstrap endpoint. Enrollment is master-driven and the keys never leave the operator's hands: + +1. On the master, `decnet swarm enroll --host --address --sans [--updater]` generates the leaf bundle(s) locally via `issue_worker_cert`. +2. The fingerprints and metadata are written to `swarm_hosts` in the master DB. +3. The operator copies the bundle(s) to the worker (one-time out-of-band step — the only `scp` the workflow prescribes). +4. 
On the worker, `sudo decnet agent --daemon` and optionally `sudo decnet updater --daemon` pick up the bundle from the standard path. + +After that, the only auth the master ever uses when talking to the worker is the mTLS handshake. + +## Client side: how every SSLContext is built + +Every outgoing control-plane call — `AgentClient`, `UpdaterClient`, the updater's local health probe after a push — builds its `ssl.SSLContext` the same way: + +```python +ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT) +ctx.load_cert_chain(certfile=cert_path, keyfile=key_path) +ctx.load_verify_locations(cafile=ca_cert_path) +ctx.verify_mode = ssl.CERT_REQUIRED +ctx.check_hostname = False +``` + +Key points: + +- **Client cert is always presented.** `load_cert_chain` is called unconditionally; this is a client-authenticated handshake, not a plain TLS one. +- **Hostname check is disabled.** Workers enroll with arbitrary SANs (an IP, a LAN hostname, a `.local` name — whichever the operator chose) and the master may dial them by yet another DNS name routed through its own `/etc/hosts`. Pinning by DNS would create constant handshake failures for no security gain. Trust is derived from the CA chain instead: if the peer cert chains to the DECNET CA, it *is* a DECNET peer, regardless of what name DNS currently maps. +- **`CERT_REQUIRED` is set explicitly.** `PROTOCOL_TLS_CLIENT` already defaults to `ssl.CERT_REQUIRED`; the code assigns it anyway so the requirement is visible at the call site and does not depend on interpreter defaults. +- **Bare `SSLContext`, not `create_default_context()`.** The default context on Python 3.13 enables `ssl.VERIFY_X509_STRICT`, which requires AKI on intermediates. The DECNET CA does not set AKI (see above), so the default context rejects the chain. Using a bare context gives us exactly the knobs we flip and nothing else. If AKI is ever added to `generate_ca`, this workaround can be dropped. + +`AgentClient` and `UpdaterClient` both follow this pattern. 
The code is near-duplicated across `decnet/swarm/client.py` and `decnet/swarm/updater_client.py` (~20 lines each); a shared helper was considered and rejected because it would have to branch on a tiny number of per-client details and the duplication is more legible than the factory. + +## Server side: how uvicorn is launched + +Both `decnet/agent/server.py` and `decnet/updater/server.py` spawn uvicorn as a subprocess. The TLS flags are identical in shape; only the bundle dir and port differ. + +```bash +python -m uvicorn \ + --host 0.0.0.0 --port \ + --ssl-keyfile \ + --ssl-certfile \ + --ssl-ca-certs \ + --ssl-cert-reqs 2 +``` + +`--ssl-cert-reqs 2` is `ssl.CERT_REQUIRED`. Any TCP connection that doesn't present a cert signed by the CA in `--ssl-ca-certs` is torn down at handshake time, before uvicorn routes the request to an app handler. This is the only enforcement layer for inbound calls — there is no token, no signature, no application-layer check underneath. + +The agent and updater apps themselves construct their FastAPI instances with `docs_url=None`, `redoc_url=None`, `openapi_url=None`. There is no Swagger UI on a worker; the attack surface is exactly the routes the module explicitly registers. + +## CN and role separation: what's enforced vs. what isn't + +The updater's `decnet/updater/app.py` module docstring says: + +> Mounted by uvicorn via `decnet.updater.server` with `--ssl-cert-reqs 2`; the CN on the peer cert tells us which endpoints are legal (`updater@*` only — agent certs are rejected). + +**As of this writing, the CN check is not enforced in code.** There is no middleware, dependency, or early-handler gate that reads the peer cert and compares CN. The TLS layer admits any CA-signed peer, and the handler runs. In practice this has not been exploited because: + +- The master uses one cert for both daemons (it presents the same `~/.decnet/ca/master/worker.crt` to port 8765 and 8766), so the CN-split is operator-only. 
There is no "agent cert" a misbehaving master would accidentally present to the updater. +- Worker-side, nothing presents a cert to the opposite daemon — the agent does not call the updater and vice versa. + +It is still a gap. A compromised worker holding only an agent bundle *can* call the updater's endpoints on its own loopback if it reaches them. The plan item is: + +1. Read the peer cert from the TLS session in a FastAPI dependency (`request.scope["transport"].get_extra_info("peercert")`). +2. Extract CN. +3. For the updater app: reject if CN does not match `updater@*`. +4. For the agent app: reject if CN matches `updater@*` (agents should not masquerade as updaters either). + +Until then, treat the CN prefix as a deployment convention that the code documents but does not police. + +## Worker → master: syslog-over-TLS (RFC 5425) + +The same PKI gates the log pipeline. + +- `decnet/swarm/log_forwarder.py::ForwarderConfig` — worker-side sender, opens an mTLS connection to `master_host:6514` using its agent bundle. +- `decnet/swarm/log_listener.py::ListenerConfig` — master-side listener, defaults to `0.0.0.0:6514`, trusts the DECNET CA, requires the peer to present a CA-signed cert. +- `decnet/swarm/log_listener.py::build_listener_ssl_context` — server-side `ssl.SSLContext` mirroring the client-side one (`CERT_REQUIRED`, CA chain pinned). The listener extracts CN from the peer cert and uses that as the authoritative worker identity for the line. **The RFC 5424 HOSTNAME field in the syslog message is never trusted for authentication** — a worker can claim any HOSTNAME it wants; only the CN decides which host is credited. + +Plaintext syslog across hosts is a project non-goal and is rejected at review. Loopback-only syslog (service container → worker-local file) is allowed. + +## Fingerprints and the master DB + +- `decnet/swarm/pki.py::fingerprint(cert_pem)` — `sha256(cert_pem_der).hexdigest()`. 
+- Each enrolled host has `worker_cert_fingerprint` (and optionally `updater_cert_fingerprint`) stored on the `swarm_hosts` row when the bundle is issued. +- These are not used for TLS — the CA chain already authenticates the peer. They exist for operator audit: `decnet swarm list` can print them, and a stored fingerprint mismatched against the one a newly-deployed worker serves is a signal that someone re-issued a cert out-of-band. + +## Cert rotation + +There is no automated rotation. The chosen validities (10 years for the CA, 825 days for leaves — the CA/B Forum ceiling) push that problem out far enough that it's tractable manually: + +- To rotate a single worker's cert: re-run `decnet swarm enroll --host ` with `--reissue`, copy the new bundle, restart the daemon. +- To rotate the CA: issue a new CA, re-sign every leaf, ship new bundles to every host, restart every daemon. This is a fleet-wide event — the plan is to only do it if the CA key is believed compromised. + +## Filesystem layout recap + +``` +master: + ~/.decnet/ca/ + ca.key private key (0600) — never copied + ca.crt root cert + master/{worker.key, worker.crt, ca.crt} master's own client bundle + workers//{worker.key, worker.crt, ca.crt} issued agent bundles + workers//{updater.key, updater.crt, ca.crt} issued updater bundles (if --updater) + +worker: + ~/.decnet/agent/{worker.key, worker.crt, ca.crt} + ~/.decnet/updater/{updater.key, updater.crt, ca.crt} (if enrolled with --updater) +``` + +No private keys ever leave the host that owns them, modulo the one-time operator-driven bundle delivery at enrollment. If a bundle is leaked, rotate the leaf and clear the fingerprint from the master DB — the CA key doesn't need to move. + +## Further reading + +- [Remote-Updates](Remote-Updates) — how the updater uses its cert to authenticate push operations. +- [SWARM-Mode](SWARM-Mode) — operator-facing enrollment walkthrough. 
+- [Module Reference — Workers § Swarm](Module-Reference-Workers#swarm--decnetswarm) — module-level index of the `decnet/swarm/` package. diff --git a/_Sidebar.md b/_Sidebar.md index d946d0a..eb32020 100644 --- a/_Sidebar.md +++ b/_Sidebar.md @@ -18,6 +18,7 @@ - [Networking-MACVLAN-IPVLAN](Networking-MACVLAN-IPVLAN) - [Deployment-Modes](Deployment-Modes) - [SWARM-Mode](SWARM-Mode) +- [Remote-Updates](Remote-Updates) - [Environment-Variables](Environment-Variables) - [Teardown-and-State](Teardown-and-State) - [Database-Drivers](Database-Drivers) @@ -39,6 +40,7 @@ - [Module-Reference-Web](Module-Reference-Web) - [Module-Reference-Services](Module-Reference-Services) - [Module-Reference-Workers](Module-Reference-Workers) +- [PKI-and-mTLS](PKI-and-mTLS) - [Testing-and-CI](Testing-and-CI) - [Performance-Story](Performance-Story) - [Tracing-and-Profiling](Tracing-and-Profiling)