PKI and mTLS
anti edited this page 2026-04-19 00:21:15 -04:00

DECNET's cross-host control plane — master ↔ agent (/deploy, /teardown, …), master ↔ updater (/update, /update-self, …), and worker → master log forwarding (RFC 5425 syslog-over-TLS) — is gated end-to-end by mutual TLS under a single private CA. This page is the developer-level reference: how the CA is built, how leaf certs are issued, how clients and servers wire the ssl.SSLContext, and which trust decisions the code actually enforces versus which ones it still relies on convention for.

For the operator walkthrough of issuing certs, see SWARM-Mode § Enrollment and Remote-Updates § Enrollment.

Why mTLS

The decoy network is intentionally internet-exposed. The control plane that builds it — deploy commands, log streams, code pushes — must not be. The threat model the project assumes is:

  • An attacker who finds a decky may port-scan the host for a control daemon.
  • An attacker who compromises a worker's process space may try to talk to the master.
  • A neighbour on the LAN may try to inject forged syslog lines into the master's SIEM.

A firewall alone isn't enough: the decoy network and the control network often share physical infrastructure. So the rule is: every TCP connection between DECNET components presents a client cert signed by the DECNET CA, and every server rejects peers that don't. TLS handshakes fail before a single byte reaches a handler. That is the entire security boundary.

One CA, one root of trust

There is exactly one private key in the system that matters, the CA key. It lives on the master at ~/.decnet/ca/ca.key (mode 0600). Every other cert in the fleet chains back to it.

  • decnet/swarm/pki.py::CABundle(key_pem, cert_pem).
  • decnet/swarm/pki.py::generate_ca — RSA-4096, self-signed, SHA-256 signature, 10-year validity. Extensions: BasicConstraints(ca=True, path_length=0) (may sign leaves, may not sign sub-CAs) and KeyUsage(key_cert_sign=True, crl_sign=True).
  • decnet/swarm/pki.py::CA_KEY_BITS = 4096, CA_VALIDITY_DAYS = 3650.

No intermediate CAs. The project deliberately stays flat because the fleet is expected to be small (dozens of hosts, not thousands) and a flat chain is trivially auditable with a single openssl verify -CAfile ca.crt worker.crt.

What's intentionally not on the CA

  • No Authority Key Identifier (AKI) extension. generate_ca does not set one and issue_worker_cert does not copy one down either. This is visible in behaviour: a probe using ssl.create_default_context() on Python 3.13 will reject the chain with Missing Authority Key Identifier because 3.13's default context enables VERIFY_X509_STRICT. The project works around this by using a bare ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT) in internal probes (see decnet/updater/executor.py::_probe_agent). Adding AKI/SKI is a low-risk future change; the trade-off is a chain re-issuance across the whole fleet.
  • No CRL or OCSP responder URL. Revocation is out-of-band: the operator rotates a compromised leaf by removing its fingerprint from the master DB and re-enrolling the host with a fresh bundle.

Leaf certs: agent vs updater

Every worker gets up to two leaf certs, both signed by the same CA:

| Leaf | CN | Default SANs | Served on | Bundle dir |
|---|---|---|---|---|
| Agent | <hostname> | operator-supplied list, <address> | 0.0.0.0:8765 | ~/.decnet/agent/ |
| Updater (optional) | updater@<hostname> | agent SANs, 127.0.0.1 | 0.0.0.0:8766 | ~/.decnet/updater/ |

  • decnet/swarm/pki.py::issue_worker_cert(ca, worker_name, sans, validity_days=825) — RSA-2048, SHA-256, 825-day validity. CN is the caller-supplied worker_name verbatim (the updater@ prefix is a caller convention, not a PKI-enforced one). ExtKeyUsage(serverAuth, clientAuth) is set because every leaf is both a server (it accepts calls from the master) and a client (it originates log-forwarding connections to the master's syslog listener).
  • decnet/swarm/pki.py::IssuedCert(key_pem, cert_pem, ca_cert_pem, fingerprint_sha256). The fingerprint is sha256(cert_pem_der).hexdigest() and is what the master stores in swarm_hosts.worker_cert_fingerprint / updater_cert_fingerprint for audit.
  • decnet/swarm/pki.py::write_worker_bundle — writes worker.key (0600), worker.crt, ca.crt into the bundle dir. Updater bundles use updater.key / updater.crt filenames in ~/.decnet/updater/ instead; the code reuses the same writer but the caller names the files.
  • decnet/swarm/pki.py::WORKER_KEY_BITS = 2048, decnet/swarm/pki.py::WORKER_VALIDITY_DAYS = 825.
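
As a rough illustration of what a bundle writer of this shape has to get right, the sketch below writes the three PEM files and forces mode 0600 on the private key. It is not the project's write_worker_bundle: the name sketch_write_bundle and its parameters are hypothetical, and only the filenames and the key mode are taken from the description above.

```python
import os
import stat
import tempfile
from pathlib import Path


def sketch_write_bundle(bundle_dir, key_pem, cert_pem, ca_cert_pem,
                        key_name="worker.key", cert_name="worker.crt"):
    """Hypothetical helper: write key, cert and CA cert into bundle_dir."""
    bundle = Path(bundle_dir)
    bundle.mkdir(parents=True, exist_ok=True)

    key_path = bundle / key_name
    # Create the key with 0600 from the start so it is never briefly world-readable.
    fd = os.open(key_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as fh:
        fh.write(key_pem)
    # If the file already existed, os.open kept its old mode; tighten it explicitly.
    os.chmod(key_path, 0o600)

    # Certificates are public material, so default permissions are fine.
    (bundle / cert_name).write_bytes(cert_pem)
    (bundle / "ca.crt").write_bytes(ca_cert_pem)
    return key_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = sketch_write_bundle(tmp, b"KEY", b"CERT", b"CA",
                                   key_name="updater.key", cert_name="updater.crt")
        print(path.name, oct(stat.S_IMODE(path.stat().st_mode)))
```

Passing key_name and cert_name from the caller mirrors the point above that one writer serves both bundle types.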

The master itself also holds a CA-signed leaf at ~/.decnet/ca/master/worker.crt. It has the same shape as a worker cert and is the cert the master presents as a client to every worker daemon.

Enrollment flow

There is no pre-authenticated bootstrap endpoint. Enrollment is master-driven and the keys never leave the operator's hands:

  1. On the master, decnet swarm enroll --host <name> --address <ip> --sans <csv> [--updater] generates the leaf bundle(s) locally via issue_worker_cert.
  2. The fingerprints and metadata are written to swarm_hosts in the master DB.
  3. The operator copies the bundle(s) to the worker (one-time out-of-band step — the only scp the workflow prescribes).
  4. On the worker, sudo decnet agent --daemon and optionally sudo decnet updater --daemon pick up the bundle from the standard path.

After that, the only auth the master ever uses when talking to the worker is the mTLS handshake.

Client side: how every SSLContext is built

Every outgoing control-plane call — AgentClient, UpdaterClient, the updater's local health probe after a push — builds its ssl.SSLContext the same way:

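import ssl

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.load_cert_chain(certfile=cert_path, keyfile=key_path)  # always present our own leaf
ctx.load_verify_locations(cafile=ca_cert_path)             # trust the DECNET CA only
ctx.verify_mode = ssl.CERT_REQUIRED                        # peer must present a valid chain
ctx.check_hostname = False                                 # identity comes from the CA chain, not DNS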

Key points:

  • Client cert is always presented. load_cert_chain is called unconditionally; this is a client-authenticated handshake, not a plain TLS one.
  • Hostname check is disabled. Workers enroll with arbitrary SANs (an IP, a LAN hostname, a .local name — whichever the operator chose) and the master may dial them by yet another DNS name routed through its own /etc/hosts. Pinning by DNS would create constant handshake failures for no security gain. Trust is derived from the CA chain instead: if the peer cert chains to the DECNET CA, it is a DECNET peer, regardless of what name DNS currently maps.
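  • CERT_REQUIRED is set explicitly. PROTOCOL_TLS_CLIENT already defaults to CERT_REQUIRED (and to check_hostname=True), so the assignment is redundant at runtime. It is kept so the requirement is stated in the code rather than inherited from a library default.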
  • Bare SSLContext, not create_default_context(). The default context on Python 3.13 enables ssl.VERIFY_X509_STRICT, under which verification fails for certs that lack an Authority Key Identifier. Neither the DECNET CA nor the leaves it issues carry AKI (see above), so the default context rejects the chain. Using a bare context gives us exactly the knobs we flip and nothing else. If AKI is ever added to generate_ca and issue_worker_cert, this workaround can be dropped.
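
The difference between the bare context and the default one can be checked directly from the standard library. The sketch below is illustrative only: sketch_client_context and is_strict are hypothetical names, and the file arguments are optional purely so the snippet runs without a bundle on disk.

```python
import ssl


def sketch_client_context(cert_path=None, key_path=None, ca_cert_path=None):
    """Hypothetical helper following the client-side pattern described above."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if cert_path:
        ctx.load_cert_chain(certfile=cert_path, keyfile=key_path)
    if ca_cert_path:
        ctx.load_verify_locations(cafile=ca_cert_path)
    ctx.verify_mode = ssl.CERT_REQUIRED
    ctx.check_hostname = False
    return ctx


def is_strict(ctx):
    """True if the context has VERIFY_X509_STRICT set."""
    return bool(ctx.verify_flags & ssl.VERIFY_X509_STRICT)


if __name__ == "__main__":
    bare = sketch_client_context()
    default = ssl.create_default_context()
    print("bare context strict:   ", is_strict(bare))
    # The second value depends on the interpreter version.
    print("default context strict:", is_strict(default))
```

Running this on the interpreter a worker actually uses is a quick way to confirm whether a probe built with create_default_context() would trip over the missing AKI.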

AgentClient and UpdaterClient both follow this pattern. The code is near-duplicated across decnet/swarm/client.py and decnet/swarm/updater_client.py (~20 lines each); a shared helper was considered and rejected because it would have to branch on a tiny number of per-client details and the duplication is more legible than the factory.

Server side: how uvicorn is launched

Both decnet/agent/server.py and decnet/updater/server.py spawn uvicorn as a subprocess. The TLS flags are identical in shape; only the bundle dir and port differ.

python -m uvicorn <app> \
    --host 0.0.0.0 --port <port> \
    --ssl-keyfile <worker.key or updater.key> \
    --ssl-certfile <worker.crt or updater.crt> \
    --ssl-ca-certs <ca.crt> \
    --ssl-cert-reqs 2

--ssl-cert-reqs 2 is ssl.CERT_REQUIRED. Any TCP connection that doesn't present a cert signed by the CA in --ssl-ca-certs is torn down at handshake time, before uvicorn routes the request to an app handler. This is the only enforcement layer for inbound calls — there is no token, no signature, no application-layer check underneath.
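
To make the role of that flag concrete, here is an approximation in plain ssl terms of a server context configured the same way. This is a sketch, not uvicorn's code: sketch_server_context is a hypothetical name, and the file arguments are optional only so the snippet runs without a bundle.

```python
import ssl


def sketch_server_context(certfile=None, keyfile=None, ca_certs=None, cert_reqs=2):
    """Hypothetical stdlib equivalent of the four --ssl-* flags above."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # A server context does not ask for a client cert unless told to;
    # cert_reqs=2 maps to ssl.CERT_REQUIRED and turns TLS into mutual TLS.
    ctx.verify_mode = ssl.VerifyMode(cert_reqs)
    if certfile:
        ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)
    if ca_certs:
        ctx.load_verify_locations(cafile=ca_certs)
    return ctx


if __name__ == "__main__":
    print(int(ssl.CERT_REQUIRED))
    print(ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER).verify_mode)
    print(sketch_server_context().verify_mode)
```

The point of the comparison is that dropping --ssl-cert-reqs would silently leave the daemon on one-way TLS, which is why the flag is the whole boundary for inbound calls.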

The agent and updater apps themselves construct their FastAPI instances with docs_url=None, redoc_url=None, openapi_url=None. There is no Swagger UI on a worker; the attack surface is exactly the routes the module explicitly registers.

CN and role separation: what's enforced vs. what isn't

The updater's decnet/updater/app.py module docstring says:

Mounted by uvicorn via decnet.updater.server with --ssl-cert-reqs 2; the CN on the peer cert tells us which endpoints are legal (updater@* only — agent certs are rejected).

As of this writing, the CN check is not enforced in code. There is no middleware, dependency, or early-handler gate that reads the peer cert and compares CN. The TLS layer admits any CA-signed peer, and the handler runs. In practice this has not been exploited because:

  • The master uses one cert for both daemons (it presents the same ~/.decnet/ca/master/worker.crt to port 8765 and 8766), so the CN split is a naming convention for operators rather than something the master's own traffic exercises. There is no "agent cert" a misbehaving master would accidentally present to the updater.
  • Worker-side, nothing presents a cert to the opposite daemon — the agent does not call the updater and vice versa.

It is still a gap. A compromised worker holding only an agent bundle can call the updater's endpoints on its own loopback if it reaches them. The plan item is:

  1. Read the peer cert from the TLS session in a FastAPI dependency (for example request.scope["transport"].get_extra_info("peercert"), assuming the server is set up to expose the transport in the ASGI scope).
  2. Extract CN.
  3. For the updater app: reject if CN does not match updater@*.
  4. For the agent app: reject if CN matches updater@* (agents should not masquerade as updaters either).

Until then, treat the CN prefix as a deployment convention that the code documents but does not police.
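
The decision logic of that plan item is small enough to sketch without any web framework. The snippet below assumes the peer cert arrives in the dict form returned by ssl.SSLSocket.getpeercert(); the function names and the UPDATER_PREFIX constant are hypothetical, and how the dict reaches the handler is left open.

```python
UPDATER_PREFIX = "updater@"  # hypothetical constant for the CN convention


def peer_common_name(peercert):
    """Return the first commonName from a getpeercert()-style dict, or None."""
    for rdn in (peercert or {}).get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None


def updater_allows(cn):
    """Updater app: only CNs following the updater@ convention pass."""
    return cn is not None and cn.startswith(UPDATER_PREFIX)


def agent_allows(cn):
    """Agent app: any CN except updater@ ones passes."""
    return cn is not None and not cn.startswith(UPDATER_PREFIX)


if __name__ == "__main__":
    cert = {"subject": ((("commonName", "updater@host1"),),)}
    cn = peer_common_name(cert)
    print(cn, updater_allows(cn), agent_allows(cn))
```

One design question a gate like this raises: the master currently presents a single cert to both ports, so the rules would also have to say how the master's own CN is treated.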

Worker → master: syslog-over-TLS (RFC 5425)

The same PKI gates the log pipeline.

  • decnet/swarm/log_forwarder.py::ForwarderConfig — worker-side sender, opens an mTLS connection to master_host:6514 using its agent bundle.
  • decnet/swarm/log_listener.py::ListenerConfig — master-side listener, defaults to 0.0.0.0:6514, trusts the DECNET CA, requires the peer to present a CA-signed cert.
  • decnet/swarm/log_listener.py::build_listener_ssl_context — server-side ssl.SSLContext mirroring the client-side one (CERT_REQUIRED, CA chain pinned). The listener extracts CN from the peer cert and uses that as the authoritative worker identity for the line. The RFC 5424 HOSTNAME field in the syslog message is never trusted for authentication — a worker can claim any HOSTNAME it wants; only the CN decides which host is credited.
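
RFC 5425 frames each message on the TLS stream with octet counting: a decimal length, a space, then exactly that many bytes. The sketch below shows that framing in isolation. The function names are hypothetical, and whether the forwarder and listener structure their code this way is an assumption.

```python
def frame_syslog(msg: bytes) -> bytes:
    """Prefix a syslog message with its length, as in 'MSG-LEN SP SYSLOG-MSG'."""
    return str(len(msg)).encode("ascii") + b" " + msg


def split_frames(buffer: bytes):
    """Split a receive buffer into (complete_messages, leftover_bytes)."""
    messages = []
    while True:
        sep = buffer.find(b" ")
        if sep == -1:
            break  # length field not complete yet; wait for more data
        length_field = buffer[:sep]
        if not length_field.isdigit():
            raise ValueError("malformed MSG-LEN")
        end = sep + 1 + int(length_field)
        if len(buffer) < end:
            break  # message body not complete yet
        messages.append(buffer[sep + 1:end])
        buffer = buffer[end:]
    return messages, buffer


if __name__ == "__main__":
    stream = frame_syslog(b"<14>1 a") + frame_syslog(b"<14>1 bb")
    print(split_frames(stream))
```

Because the length prefix delimits messages, a newline inside a log line cannot be used to smuggle a second message into the stream.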

Plaintext syslog across hosts is a project non-goal and is rejected at review. Loopback-only syslog (service container → worker-local file) is allowed.

Fingerprints and the master DB

  • decnet/swarm/pki.py::fingerprint(cert_pem) returns sha256(cert_pem_der).hexdigest(), the same value stored in IssuedCert.fingerprint_sha256.
  • Each enrolled host has worker_cert_fingerprint (and optionally updater_cert_fingerprint) stored on the swarm_hosts row when the bundle is issued.
  • These are not used for TLS, because the CA chain already authenticates the peer. They exist for operator audit: decnet swarm list can print them, and if the fingerprint a worker actually serves differs from the one stored at enrollment, someone has re-issued that cert out-of-band.
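
For an audit comparison, a fingerprint of this kind can be recomputed with the standard library alone. The sketch assumes the fingerprint is the hex SHA-256 of the certificate's DER bytes, as described above; sketch_fingerprint is a hypothetical name, not the project's function.

```python
import hashlib
import ssl


def sketch_fingerprint(cert_pem: str) -> str:
    """Hex SHA-256 over the DER bytes of a single PEM certificate."""
    der = ssl.PEM_cert_to_DER_cert(cert_pem)
    return hashlib.sha256(der).hexdigest()


if __name__ == "__main__":
    # PEM_cert_to_DER_cert only strips the armor and base64-decodes,
    # so placeholder bytes are enough to demonstrate the round trip.
    pem = ssl.DER_cert_to_PEM_cert(b"placeholder, not a real certificate")
    print(sketch_fingerprint(pem))
```

The result is lowercase hex without separators, so it needs normalising before comparing against openssl x509 -fingerprint -sha256 output, which is uppercase and colon-separated.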

Cert rotation

There is no automated rotation. The chosen validities (10 years for the CA; 825 days for leaves, which matches the CA/Browser Forum ceiling that applied to publicly trusted certs before it was cut to 398 days in 2020) push that problem out far enough that it's tractable manually:

  • To rotate a single worker's cert: re-run decnet swarm enroll --host <name> with --reissue, copy the new bundle, restart the daemon.
  • To rotate the CA: issue a new CA, re-sign every leaf, ship new bundles to every host, restart every daemon. This is a fleet-wide event — the plan is to only do it if the CA key is believed compromised.
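
Since rotation is manual, an operator may want a simple check of how much validity a leaf has left. The sketch below works on the notAfter string as it appears in a getpeercert() dict; the function names and the 30-day warning threshold are hypothetical and not part of DECNET.

```python
import ssl
import time

SECONDS_PER_DAY = 86400


def days_remaining(not_after, now=None):
    """Days until a cert's notAfter (format: 'Jan  5 09:34:43 2018 GMT')."""
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / SECONDS_PER_DAY


def needs_rotation(not_after, warn_days=30, now=None):
    """True once the cert is within warn_days of expiry, or already expired."""
    return days_remaining(not_after, now=now) <= warn_days


if __name__ == "__main__":
    expiry = "Jan  5 09:34:43 2018 GMT"
    print(days_remaining(expiry), needs_rotation(expiry))
```

A check like this could run from cron on the master against each stored worker.crt, which keeps rotation manual while removing the need to remember the 825-day clock.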

Filesystem layout recap

master:
  ~/.decnet/ca/
    ca.key                                 private key (0600) — never copied
    ca.crt                                  root cert
    master/{worker.key, worker.crt, ca.crt} master's own client bundle
    workers/<name>/{worker.key, worker.crt, ca.crt}    issued agent bundles
    workers/<name>/{updater.key, updater.crt, ca.crt}  issued updater bundles (if --updater)

worker:
  ~/.decnet/agent/{worker.key, worker.crt, ca.crt}
  ~/.decnet/updater/{updater.key, updater.crt, ca.crt} (if enrolled with --updater)

The CA key never leaves the master. Leaf keys are generated on the master and cross the network exactly once, in the operator-driven bundle delivery at enrollment. If a bundle is leaked, rotate the leaf and clear the fingerprint from the master DB; the CA key doesn't need to move.

Further reading