fix(stats): keep TopologyDecky.state in sync with docker so ACTIVE DECKIES counts right
Dashboard's ACTIVE DECKIES (active_deckies in get_stats_summary) counts TopologyDecky rows where state='running'. No code path was flipping that state away from the default 'pending', so the count read 0/N even when every container was running fine. The dashboard was lying.

Two complementary fixes:

1. deploy_topology: after the post-deploy compose ps verification, reconcile each TopologyDecky.state from the corresponding base container's docker state. running → 'running'; anything else → 'failed'. Reuses the ps_rows already gathered for the ACTIVE-vs-DEGRADED status decision; no extra docker hit.

2. apply_add_decky: _materialise_decky_spawn now returns True/False; on True the row is updated to state='running' before _assert_valid_after. Catches the case where a decky added via the live mutator queue stays at 'pending' indefinitely (the deployer's reconcile only runs on a fresh deploy_topology pass).

Existing deckies in active topologies will still read as 'pending' until the next deploy_topology runs, since this fix is forward-only. The operator-side workaround is to tear down and redeploy, or to run the (forthcoming) reconcile-on-startup pass.
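The diff on this page covers only the deploy_topology half, so the following is a rough sketch of the second fix as the message describes it. It is not the project's code: InMemoryRepo, add_topology_decky, materialise_spawn and apply_add_decky_sketch are hypothetical stand-ins, and only update_topology_decky and the 'pending'/'running' states come from the commit. Leaving the row at 'pending' when the spawn returns False is an assumption.

```python
import asyncio


class InMemoryRepo:
    """Hypothetical stand-in for the real repo; keeps decky rows in a dict."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}

    async def add_topology_decky(self, uuid: str, name: str) -> None:
        # Hypothetical helper: new rows start at the default 'pending' state.
        self.rows[uuid] = {"name": name, "state": "pending"}

    async def update_topology_decky(self, uuid: str, fields: dict) -> None:
        self.rows[uuid].update(fields)


async def materialise_spawn(name: str, *, succeed: bool) -> bool:
    """Hypothetical spawn step: True means the container came up."""
    return succeed


async def apply_add_decky_sketch(repo, uuid: str, name: str, *, succeed: bool) -> str:
    await repo.add_topology_decky(uuid, name)
    if await materialise_spawn(name, succeed=succeed):
        # Flip the row as soon as the spawn reports success, so the decky
        # does not wait for the next full deploy pass to be counted.
        await repo.update_topology_decky(uuid, {"state": "running"})
    # Assumption: on failure the row is left at 'pending'.
    return repo.rows[uuid]["state"]


def active_deckies(repo) -> int:
    """Count rows the way the message describes: state == 'running'."""
    return sum(1 for row in repo.rows.values() if row["state"] == "running")


async def demo() -> tuple[str, str, int]:
    repo = InMemoryRepo()
    ok = await apply_add_decky_sketch(repo, "u1", "decky01", succeed=True)
    stuck = await apply_add_decky_sketch(repo, "u2", "decky02", succeed=False)
    return ok, stuck, active_deckies(repo)
```

Running asyncio.run(demo()) exercises one successful and one failed spawn; only the successful one contributes to the count.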
@@ -1005,8 +1005,18 @@ async def deploy_topology(repo, topology_id: str, *, dry_run: bool = False) -> N
         lambda: _compose_ps(compose_path),
     )
     bad: list[str] = []
+    # Build the per-decky state map. The base container's compose
+    # service name == decky name, which is what we cache on the
+    # TopologyDecky row. Service containers (named ``<decky>-<svc>``)
+    # don't gate the decky's state — service-level failures are visible
+    # in compose ps separately and don't downgrade the decky as a whole.
+    decky_state_by_name: dict[str, str] = {}
     for row in ps_rows:
         state = str(row.get("State", "")).lower()
+        service_name = str(row.get("Service") or "")
+        if service_name and "-" not in service_name:
+            # Plain decky base; cache its docker state.
+            decky_state_by_name[service_name] = state or "unknown"
         if state and state != "running":
             name = str(row.get("Name") or row.get("Service") or "?")
             exit_code = row.get("ExitCode")
@@ -1015,6 +1025,27 @@ async def deploy_topology(repo, topology_id: str, *, dry_run: bool = False) -> N
                 + (f" (exit={exit_code})" if exit_code not in (None, 0, "") else "")
             )
 
+    # Reconcile each TopologyDecky.state from compose's view. Without
+    # this, the row stays at the default 'pending' forever and the
+    # dashboard's ACTIVE DECKIES count reads 0/N even when everything's
+    # actually up.
+    for decky in hydrated["deckies"]:
+        cfg = decky.get("decky_config") or {}
+        decky_name = cfg.get("name") or decky.get("name")
+        if not decky_name:
+            continue
+        ds = decky_state_by_name.get(decky_name, "unknown")
+        new_state = "running" if ds == "running" else "failed"
+        try:
+            await repo.update_topology_decky(
+                decky["uuid"], {"state": new_state},
+            )
+        except Exception as exc:  # noqa: BLE001
+            log.warning(
+                "post-deploy state reconcile failed topology=%s decky=%s: %s",
+                topology_id, decky_name, exc,
+            )
+
     if bad:
         reason = "post-deploy check: " + ", ".join(bad[:8]) + (
             f" and {len(bad) - 8} more" if len(bad) > 8 else ""
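To make the reconcile rule easier to reason about in isolation, here is a minimal sketch that pulls the same mapping out into two pure functions. The names decky_states_from_ps and reconciled_state, and the sample rows, are illustrative and not part of the commit. The row keys ("Service", "State") and the hyphen heuristic mirror the diff above.

```python
def decky_states_from_ps(ps_rows: list[dict]) -> dict[str, str]:
    """Map base-container service names to their lowercased docker state."""
    states: dict[str, str] = {}
    for row in ps_rows:
        state = str(row.get("State", "")).lower()
        service = str(row.get("Service") or "")
        # Same heuristic as the diff: a hyphen marks a <decky>-<svc> row,
        # so only hyphen-free service names are treated as decky bases.
        if service and "-" not in service:
            states[service] = state or "unknown"
    return states


def reconciled_state(decky_name: str, states: dict[str, str]) -> str:
    """'running' only if compose reports the base container as running."""
    return "running" if states.get(decky_name, "unknown") == "running" else "failed"


# Illustrative rows only; real compose ps output carries more fields.
rows = [
    {"Service": "decky01", "State": "running"},
    {"Service": "decky01-ssh", "State": "exited"},  # service container, skipped
    {"Service": "decky02", "State": "Exited"},      # state is lowercased
    {"Service": "web-decoy", "State": "running"},   # hyphenated base name
]
states = decky_states_from_ps(rows)
```

One consequence worth noting: under this heuristic a base container whose name contains a hyphen (web-decoy above) is never cached, so it reconciles to 'failed' even while running. The commit does not say whether decky names may contain hyphens, so this only matters if they can.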