fix(swarm): don't paint healthy deckies as failed when a shard-sibling fails

docker compose up is partial-success-friendly: a build failure on one service doesn't roll back the others. But the master was catching the agent's 500 and tagging every decky in the shard as 'failed' with the same error message. From the UI that looked like all three deckies died even though two were live on the worker.

On dispatch exception, probe the agent's /status to learn which deckies actually have running containers, and upsert per-decky state accordingly. Only fall back to marking the whole shard failed if the status probe itself is unreachable.

Enhance agent.executor.status() to include a 'runtime' map keyed by decky name with per-service container state, so the master has something concrete to consult.
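
The agent-side change is not part of the hunk shown here, so the following is only a minimal sketch of what a 'runtime' map keyed by decky name could look like. The helper name build_runtime, the (decky, service, state) input rows, the "services" key, and the rule that a decky counts as running when any of its containers is running are all assumptions made for illustration; the only key the master-side code is known to read is "running".

```python
from collections import defaultdict
from typing import Any


def build_runtime(containers: list[tuple[str, str, str]]) -> dict[str, Any]:
    """Group (decky, service, state) rows into a per-decky runtime map.

    Hypothetical helper: the real agent.executor.status() may gather and
    shape this data differently.
    """
    services: dict[str, dict[str, str]] = defaultdict(dict)
    for decky, service, state in containers:
        services[decky][service] = state
    return {
        decky: {
            # Assumption: a decky is "running" if any of its containers is up.
            "running": any(s == "running" for s in svc.values()),
            # Assumption: per-service container state lives under "services".
            "services": dict(svc),
        }
        for decky, svc in services.items()
    }


# Example rows: two deckies came up, the third one's container exited.
rows = [
    ("decky1", "ssh", "running"),
    ("decky2", "http", "running"),
    ("decky3", "smb", "exited"),
]
runtime = build_runtime(rows)
```

A decky with no containers at all never appears in the map, which is why the master treats a missing entry the same as "not running".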
@@ -98,14 +98,28 @@ async def dispatch_decnet_config(
             return SwarmHostResult(host_uuid=host_uuid, host_name=host["name"], ok=True, detail=body)
         except Exception as exc:
             log.exception("swarm.deploy dispatch failed host=%s", host["name"])
+            # Compose-up is partial-success-friendly: one decky failing to
+            # build doesn't roll back the ones that already came up. Ask the
+            # agent which containers actually exist before painting the whole
+            # shard red — otherwise decky1 and decky2 look "failed" even
+            # though they're live on the worker.
+            runtime: dict[str, Any] = {}
+            try:
+                async with AgentClient(host=host) as probe:
+                    snap = await probe.status()
+                runtime = snap.get("runtime") or {}
+            except Exception:
+                log.warning("swarm.deploy: runtime probe failed host=%s — marking shard failed", host["name"])
             for d in shard:
+                rstate = runtime.get(d.name) or {}
+                is_up = bool(rstate.get("running"))
                 await repo.upsert_decky_shard(
                     {
                         "decky_name": d.name,
                         "host_uuid": host_uuid,
                         "services": json.dumps(d.services),
-                        "state": "failed",
-                        "last_error": str(exc)[:512],
+                        "state": "running" if is_up else "failed",
+                        "last_error": None if is_up else str(exc)[:512],
                         "updated_at": datetime.now(timezone.utc),
                     }
                 )
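
To make the per-decky decision easier to reason about in isolation, here is a small sketch that pulls the classification out of the loop into a pure function. The function name classify_shard and its return shape are hypothetical and not part of this commit; the rule itself (a truthy "running" entry means state "running" with no error, anything else means "failed" with the dispatch error truncated to 512 characters) mirrors the hunk above.

```python
from typing import Any


def classify_shard(
    decky_names: list[str], runtime: dict[str, Any], error: str
) -> dict[str, dict[str, Any]]:
    """Return the state/last_error pair each decky would be upserted with."""
    result: dict[str, dict[str, Any]] = {}
    for name in decky_names:
        # A missing or empty entry is treated the same as "not running".
        rstate = runtime.get(name) or {}
        is_up = bool(rstate.get("running"))
        result[name] = {
            "state": "running" if is_up else "failed",
            "last_error": None if is_up else error[:512],
        }
    return result


names = ["decky1", "decky2", "decky3"]
err = "build failed for decky3"

# Probe succeeded: only the decky without a running container is failed.
partial = classify_shard(
    names, {"decky1": {"running": True}, "decky2": {"running": True}}, err
)

# Probe unreachable: runtime stays empty, so the whole shard is failed.
fallback = classify_shard(names, {}, err)
```

Because an unreachable probe leaves the runtime map empty, the whole-shard fallback described in the commit message needs no separate code path.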