Commit Graph

6 Commits

Author SHA1 Message Date
ee24a7551f fix(types): T7 — eliminate all remaining 38 mypy errors; fix DeckyRow subscript in engine tests 2026-05-01 02:07:53 -04:00
d314470d7f fix(stats): keep TopologyDecky.state in sync with docker so ACTIVE DECKIES counts right
Dashboard's ACTIVE DECKIES (active_deckies in get_stats_summary) counts
TopologyDecky rows where state='running'.  No code path was flipping
that state away from the default 'pending', so the count read 0/N
even when every container was running fine — the dashboard was lying.

Two complementary fixes:

1. deploy_topology — after the post-deploy compose ps verification,
   reconcile each TopologyDecky.state from the corresponding base
   container's docker state.  running → 'running'; anything else →
   'failed'.  Reuses the ps_rows already gathered for the
   ACTIVE-vs-DEGRADED status decision; no extra docker hit.

2. apply_add_decky — _materialise_decky_spawn now returns True/False;
   on True the row is updated to state='running' before
   _assert_valid_after.  Catches the case where a decky added via the
   live mutator queue stays at 'pending' indefinitely (the deployer's
   reconcile only runs on a fresh deploy_topology pass).

Existing topology deckies in active topologies will still read as
'pending' until the next deploy_topology runs, since this is
forward-only.  An operator-side fix is to teardown + redeploy or run
the (forthcoming) reconcile-on-startup pass.
2026-04-29 11:09:32 -04:00
d595240f55 fix(engine): post-deploy verify topology containers, mark DEGRADED on boot crash
deploy_topology was flipping to ACTIVE the moment 'compose up -d'
returned 0, but compose returns 0 as soon as containers are *started*.
A service that crashes on boot (port bind failure, bad image, missing
entrypoint) left the topology row sitting at ACTIVE indefinitely while
half the substrate was dead.

After compose returns, we now run 'compose ps --all --format json',
parse the newline-delimited per-container rows, and downgrade to
DEGRADED with a reason listing the first eight unhealthy containers if
anything isn't in state='running'.  Operators see real state on the
topology page instead of an optimistic flag.

_compose_ps swallows compose-level errors (returns []) so an unrelated
docker hiccup doesn't gate the success path — the existing in-flight
exception path still catches genuine deploy failures with FAILED.
2026-04-28 23:39:50 -04:00
5802de1f86 feat(canary): seed baseline canaries on MazeNET deckies
Topology deploys now plant the configured canary baseline set on every
decky in the topology, mirroring the fleet-deploy hook. Containers are
resolved via resolve_topology_container — <decky>-ssh when the decky
exposes an ssh service, else the topology base container
decnet_t_<id8>_<decky>.

The planter's plant/revoke/seed_baseline grow an optional container=
kwarg; default preserves the fleet <name>-ssh resolution.
2026-04-28 22:30:11 -04:00
862e4dbb31 merge: testing → main (reconcile 2-week divergence) 2026-04-28 18:36:00 -04:00
c384a3103a refactor: separate engine, collector, mutator, and fleet into independent subpackages
- decnet/engine/ — container lifecycle (deploy, teardown, status); _kill_api removed
- decnet/collector/ — Docker log streaming (moved from web/collector.py)
- decnet/mutator/ — mutation engine (no longer imports from cli or duplicates deployer code)
- decnet/fleet.py — shared decky-building logic extracted from cli.py

Cross-contamination eliminated:
- web router no longer imports from decnet.cli
- mutator no longer imports from decnet.cli
- cli no longer imports from decnet.web
- _kill_api() moved to cli (process management, not engine concern)
- _compose_with_retry duplicate removed from mutator
2026-04-12 00:26:22 -04:00