feat(agent): /deploy and /mutate become 202 fire-and-forget

The wizard API used to hang because /deckies/deploy ran docker compose
build && up -d synchronously, holding the request thread for minutes.
The worker side of that pipeline now returns 202 Accepted immediately
and runs the deploy in an asyncio.create_task.

On task completion (success or failure) the worker pushes a one-off
heartbeat carrying a lifecycle delta per decky:
  {decky_name, operation, status: succeeded|failed, error?, completed_at}

Master pivots these onto open DeckyLifecycle rows in the heartbeat
handler (next commit).  The scheduled 30s heartbeat tick is the
fallback if the immediate push drops.

- decnet/agent/app.py: /deploy and /mutate return 202; dry_run mutate
  still validates synchronously and returns 200.
- decnet/agent/executor.py: deploy_async + mutate_async wrap the work
  and push the completion delta.
- decnet/agent/heartbeat.py: push_lifecycle_delta() helper builds a
  one-off body and POSTs with the same mTLS context as the loop.
- decnet/swarm/client.py: revert deploy/mutate to control timeout
  (master no longer holds the HTTP request open for compose work).

Worker state.json gains no lifecycle field -- master DeckyLifecycle is
the source of truth; the master sweep handles crashed-mid-deploy
recovery.
This commit is contained in:
2026-05-22 16:31:23 -04:00
parent c0ad380020
commit d1ca96b2f4
6 changed files with 292 additions and 122 deletions

View File

@@ -246,13 +246,11 @@ class AgentClient:
"dry_run": dry_run,
"no_cache": no_cache,
}
# Swap in a long-deploy timeout for this call only.
old = self._require_client().timeout
self._require_client().timeout = _TIMEOUT_DEPLOY
try:
resp = await self._require_client().post("/deploy", json=body)
finally:
self._require_client().timeout = old
# Worker /deploy is async (202 fire-and-forget): the response only
# acks acceptance; the real work runs in the agent's event loop
# and reports terminal state via heartbeat lifecycle deltas. No
# need for the long deploy timeout here.
resp = await self._require_client().post("/deploy", json=body)
resp.raise_for_status()
return resp.json()
@@ -268,14 +266,8 @@ class AgentClient:
"services": list(services),
"dry_run": dry_run,
}
# Worker /mutate runs `compose up -d` which can pull/build; same
# long-tail latency as /deploy. Swap the deploy timeout in.
old = self._require_client().timeout
self._require_client().timeout = _TIMEOUT_DEPLOY
try:
resp = await self._require_client().post("/mutate", json=body)
finally:
self._require_client().timeout = old
# Worker /mutate is async (202): control-timeout is right.
resp = await self._require_client().post("/mutate", json=body)
resp.raise_for_status()
return resp.json()