Commit Graph

5 Commits

Author SHA1 Message Date
df18cb44cc fix(swarm): don't paint healthy deckies as failed when a shard-sibling fails
docker compose up is partial-success-friendly — a build failure on one
service doesn't roll back the others. But the master was catching the
agent's 500 and tagging every decky in the shard as 'failed' with the
same error message. From the UI that looked like all three deckies died
even though two were live on the worker.

On dispatch exception, probe the agent's /status to learn which deckies
actually have running containers, and upsert per-decky state accordingly.
Only fall back to marking the whole shard failed if the status probe
itself is unreachable.

Enhance agent.executor.status() to include a 'runtime' map keyed by
decky name with per-service container state, so the master has something
concrete to consult.
2026-04-19 20:11:08 -04:00
7765b36c50 feat(updater): remote self-update daemon with auto-rollback
Adds a separate `decnet updater` daemon on each worker that owns the
agent's release directory and installs tarball pushes from the master
over mTLS. A normal `/update` never touches the updater itself, so the
updater is always a known-good rescuer if a bad agent push breaks
/health — the rotation is reversed and the agent restarted against the
previous release. `POST /update-self` handles updater upgrades
explicitly (no auto-rollback).

- decnet/updater/: executor, FastAPI app, uvicorn launcher
- decnet/swarm/updater_client.py, tar_tree.py: master-side push
- cli: `decnet updater`, `decnet swarm update [--host|--all]
  [--include-self] [--dry-run]`, `--updater` on `swarm enroll`
- enrollment API issues a second cert (CN=updater@<host>) signed by the
  same CA; SwarmHost records updater_cert_fingerprint
- tests: executor, app, CLI, tar tree, enroll-with-updater (37 new)
- wiki: Remote-Updates page + sidebar + SWARM-Mode cross-link
2026-04-18 21:40:21 -04:00
8914c27220 feat(swarm): add decnet swarm deckies to list deployed shards by host
`swarm list` only shows enrolled workers — there was no way to see which
deckies are running and where. Adds GET /swarm/deckies on the controller
(joins DeckyShard with SwarmHost for name/address/status) plus the CLI
wrapper with --host / --state filters and --json.
2026-04-18 21:10:07 -04:00
811136e600 refactor(swarm): one file per endpoint, matching existing router layout
Splits the three grouped router files into eight api_<verb>_<resource>.py
modules under decnet/web/router/swarm/ to match the convention used by
router/fleet/ and router/config/. Shared request/response models live in
_schemas.py. Keeps each endpoint easy to locate and modify without
stepping on siblings.
2026-04-18 19:23:06 -04:00
63b0a58527 feat(swarm): master-side SWARM controller (swarmctl) + agent CLI
Adds decnet/web/swarm_api.py as an independent FastAPI app with routers
for host enrollment, deployment dispatch (sharding DecnetConfig across
enrolled workers via AgentClient), and active health probing. Runs as
its own uvicorn subprocess via 'decnet swarmctl', mirroring the isolation
pattern used by 'decnet api'. Also wires up 'decnet agent' CLI entry for
the worker side.

29 tests added under tests/swarm/test_swarm_api.py cover enrollment
(including bundle generation + duplicate rejection), host CRUD, sharding
correctness, non-swarm-mode rejection, teardown, and health probes with
a stubbed AgentClient.
2026-04-18 19:18:33 -04:00