The wizard API used to hang because /deckies/deploy ran docker compose
build && up -d synchronously, holding the request thread for minutes.
The worker side of that pipeline now returns 202 Accepted immediately
and runs the deploy in an asyncio.create_task.
On task completion (success or failure) the worker pushes a one-off
heartbeat carrying a lifecycle delta per decky:
{decky_name, operation, status: succeeded|failed, error?, completed_at}
Master pivots these onto open DeckyLifecycle rows in the heartbeat
handler (next commit). The scheduled 30s heartbeat tick is the
fallback if the immediate push drops.
- decnet/agent/app.py: /deploy and /mutate return 202; dry_run mutate
still validates synchronously and returns 200.
- decnet/agent/executor.py: deploy_async + mutate_async wrap the work
and push the completion delta.
- decnet/agent/heartbeat.py: push_lifecycle_delta() helper builds a
one-off body and POSTs with the same mTLS context as the loop.
- decnet/swarm/client.py: revert deploy/mutate to control timeout
(master no longer holds the HTTP request open for compose work).
Worker state.json gains no lifecycle field -- master DeckyLifecycle is
the source of truth; the master sweep handles crashed-mid-deploy
recovery.
- Implement /mutate handler: load_state, update services + last_mutated,
save_state, write_compose, compose up -d via asyncio.to_thread. 404
for missing state / unknown decky_id. dry_run short-circuits before
any side effect.
- Add AgentClient.mutate(decky_id, services, *, dry_run=False) using
_TIMEOUT_DEPLOY (compose up can pull/build, exceeds control timeout).
- mutator/engine.py: in swarm mode with decky.host_uuid set, resolve
worker via _resolve_swarm_host and dispatch through AgentClient.mutate
instead of writing a compose file on master. Master-resident deckies
(unihost mode, or swarm with host_uuid=None) keep the local path.
Add validate_public_binding() called from the master API lifespan: when
DECNET_API_HOST is non-loopback, refuse to start if DECNET_CORS_ORIGINS
still contains a loopback origin (catches the "operator flipped to
0.0.0.0 to make it work and forgot to update CORS" footgun) or if
DECNET_CANARY_HTTP_BASE is plaintext http:// to a non-loopback host.
Log CRITICAL when DECNET_LIMITER_ENABLED=false on a public binding.
The validator no-ops under pytest so unrelated suites don't trip on it.
Add DECNET_VERIFY_HOSTNAME env knob; AgentClient and UpdaterClient
consult it when verify_hostname is None, giving production deploys
TLS hostname verification on top of the existing CA + fingerprint pin.
Default off so dev enrollments with mismatched SANs keep working.
AgentClient now verifies the worker's TLS cert fingerprint against
SwarmHost.client_cert_fingerprint at __aenter__ time, on top of CA
validation. Required before fanning master-orchestrated topology
deploys out across multiple swarm hosts: CA pinning alone allows any
cert signed by the master CA, which is too coarse once a single
deploy can target N hosts.
Mismatch raises FingerprintMismatchError so callers can distinguish
"wrong worker on the wire" from a transport hiccup.
Three new RPCs mirroring the existing deploy/teardown/status pattern:
- apply_topology(hydrated, version_hash) — long-timeout (600s) for
image pulls + compose up.
- teardown_topology(topology_id) — 300s timeout; enough for a
stubborn compose-down without hanging a heartbeat.
- get_topology_state() — short control-plane read for reconcile.
The per-call timeout swap uses the same trick as .deploy().
Step 5 of the agent <-> topology integration.
Decommissioning a worker from the dashboard (or swarm controller) now
asks the agent to wipe its own install before the master forgets it.
The agent stops decky containers + every decnet-* systemd unit, then
deletes /opt/decnet*, /etc/systemd/system/decnet-*, /var/lib/decnet/*,
and /usr/local/bin/decnet*. Logs under /var/log are preserved.
The reaper runs as a detached /tmp script (start_new_session=True) so
it survives the agent process being killed. Self-destruct dispatch is
best-effort — a dead worker doesn't block master-side cleanup.
decnet.swarm.client exposes:
- MasterIdentity / ensure_master_identity(): the master's own CA-signed
client bundle, issued once into ~/.decnet/ca/master/.
- AgentClient: async-context httpx wrapper that talks to a worker agent
over mTLS. health/status/deploy/teardown methods mirror the agent API.
SSL context is built from a bare ssl.SSLContext(PROTOCOL_TLS_CLIENT)
instead of httpx.create_ssl_context — the latter layers on default-CA
and purpose logic that broke private-CA mTLS. Server cert is pinned by
CA + chain, not DNS (workers enroll with arbitrary SANs).
tests/swarm/test_client_agent_roundtrip.py spins uvicorn in-process
with real certs on disk and verifies:
- A CA-signed master client passes health + status calls.
- An impostor whose cert comes from a different CA cannot connect.