Commit Graph

10 Commits

Author SHA1 Message Date
d1ca96b2f4 feat(agent): /deploy and /mutate become 202 fire-and-forget
The wizard API used to hang because /deckies/deploy ran docker compose
build && up -d synchronously, holding the request thread for minutes.
The worker side of that pipeline now returns 202 Accepted immediately
and runs the deploy in an asyncio.create_task.

On task completion (success or failure) the worker pushes a one-off
heartbeat carrying a lifecycle delta per decky:
  {decky_name, operation, status: succeeded|failed, error?, completed_at}

Master pivots these onto open DeckyLifecycle rows in the heartbeat
handler (next commit).  The scheduled 30s heartbeat tick is the
fallback if the immediate push drops.

- decnet/agent/app.py: /deploy and /mutate return 202; dry_run mutate
  still validates synchronously and returns 200.
- decnet/agent/executor.py: deploy_async + mutate_async wrap the work
  and push the completion delta.
- decnet/agent/heartbeat.py: push_lifecycle_delta() helper builds a
  one-off body and POSTs with the same mTLS context as the loop.
- decnet/swarm/client.py: revert deploy/mutate to control timeout
  (master no longer holds the HTTP request open for compose work).

Worker state.json gains no lifecycle field -- master DeckyLifecycle is
the source of truth; the master sweep handles crashed-mid-deploy
recovery.
2026-05-22 16:31:23 -04:00
776861a1b7 fix(types): T7 — eliminate all remaining 38 mypy errors; fix DeckyRow subscript in engine tests 2026-05-01 02:19:53 -04:00
430262e01a feat(fleet): systemd unit + bus signal for fleet reconciler
Two pieces, one PR because they share a deployment surface:

1. systemd. decnet-reconciler.service.j2 mirrors the orchestrator unit
   shape (docker group, hardened sandbox, append-logs).  Read-only
   /var/lib/decnet so it can read decnet-state.json without write
   access.  Auto-discovered by `decnet init` via the existing
   decnet-*.service.j2 glob — no init.py change needed.  Added to
   decnet.target so `systemctl start decnet.target` brings it up
   alongside collector / sniffer / mutator / etc.  Also added to the
   agent reaper script so self-destruct cleans it up on workers.

2. Bus signal. reconcile_once now publishes
   `decky.<host_uuid:name>.state` on every insert / delete /
   state-changed transition.  Reuses the existing DECKY_STATE topic
   family (no bus/topics.py change → no wiki update needed per the
   bus-signals doc rule).  Composite host_uuid:name segment keeps
   fleet rows distinguishable from MazeNET TopologyDecky rows whose
   ids are bare UUIDs.  Quiet ticks publish nothing — convergence
   means silence.

Bus is plumbed through the worker, defaults to None for unit-test
callers.  publish_safely keeps the source-of-truth contract: DB write
is authoritative, the publish is best-effort notification.

Captures previous_state into a local before update_fleet_decky_state
runs — a fake repo that mutates rows in-place would otherwise see the
post-update state and report previous == current.  Real repos don't
have this concern but the fix is cheap and makes the function less
order-dependent.
2026-04-26 21:21:36 -04:00
f576564f02 fix(agent): also wipe /etc/decnet during self-destruct 2026-04-19 21:04:31 -04:00
00d5799a79 fix(agent): escape systemd cgroup when spawning self-destruct reaper
The reaper was being SIGTERM'd mid-rm because `start_new_session=True`
only forks a new POSIX session — it does not escape decnet-agent.service's
cgroup. When the reaper ran `systemctl stop decnet-agent`, systemd
tore down the whole cgroup (reaper included) before `rm -rf /opt/decnet*`
finished, leaving the install on disk.

Spawn the reaper via `systemd-run --collect --unit decnet-reaper-<pid>`
so it runs in a fresh transient scope, outside the agent unit. Falls
back to bare Popen for non-systemd hosts.
2026-04-19 21:00:43 -04:00
14250cacad feat(swarm): self-destruct agent on decommission
Decommissioning a worker from the dashboard (or swarm controller) now
asks the agent to wipe its own install before the master forgets it.
The agent stops decky containers + every decnet-* systemd unit, then
deletes /opt/decnet*, /etc/systemd/system/decnet-*, /var/lib/decnet/*,
and /usr/local/bin/decnet*. Logs under /var/log are preserved.

The reaper runs as a detached /tmp script (start_new_session=True) so
it survives the agent process being killed. Self-destruct dispatch is
best-effort — a dead worker doesn't block master-side cleanup.
2026-04-19 20:47:09 -04:00
df18cb44cc fix(swarm): don't paint healthy deckies as failed when a shard-sibling fails
docker compose up is partial-success-friendly — a build failure on one
service doesn't roll back the others. But the master was catching the
agent's 500 and tagging every decky in the shard as 'failed' with the
same error message. From the UI that looked like all three deckies died
even though two were live on the worker.

On dispatch exception, probe the agent's /status to learn which deckies
actually have running containers, and upsert per-decky state accordingly.
Only fall back to marking the whole shard failed if the status probe
itself is unreachable.

Enhance agent.executor.status() to include a 'runtime' map keyed by
decky name with per-service container state, so the master has something
concrete to consult.
2026-04-19 20:11:08 -04:00
4db9c7464c fix(swarm): relocalize master-built config on worker before deploy
deploy --mode swarm was failing on every heterogeneous fleet: the master
populates config.interface from its own box (detect_interface() → its
default NIC), then ships that verbatim. The worker's deployer then calls
get_host_ip(config.interface), hits 'ip addr show wlp6s0' on a VM whose
NIC is enp0s3, and 500s.

Fix: agent.executor._relocalize() runs on every swarm-mode deploy.
Re-detects the worker's interface/subnet/gateway/host_ip locally and
swaps them into the config before calling deployer.deploy(). When the
worker's subnet doesn't match the master's, decky IPs are re-allocated
from the worker's subnet via allocate_ips() so they're reachable.

Unihost-mode configs are left untouched — they're already built against
the local box and second-guessing them would be wrong.

Validated against anti@192.168.1.13: master dispatched interface=wlp6s0,
agent logged 'relocalized interface=enp0s3', deployer ran successfully,
dry-run returned ok=deployed.

4 new tests cover both branches (matching-subnet preserves decky IPs;
mismatch re-allocates), the end-to-end executor.deploy() path, and the
unihost short-circuit.
2026-04-18 20:41:21 -04:00
cd0057c129 feat(swarm): DeckyConfig.host_uuid + fix agent log/status field refs
- decnet.models.DeckyConfig grows an optional 'host_uuid' (the SwarmHost
  that runs this decky). Defaults to None so legacy unihost state files
  deserialize unchanged.
- decnet.agent.executor: replace non-existent config.name references
  with config.mode / config.interface in logs and status payload.
- tests/swarm/test_state_schema.py covers legacy-dict roundtrip, field
  default, and swarm-mode assignments.
2026-04-18 19:10:25 -04:00
8257bcc031 feat(swarm): worker agent + fix pre-existing base_repo coverage test
Worker agent (decnet.agent):
- mTLS FastAPI service exposing /deploy, /teardown, /status, /health,
  /mutate. uvicorn enforces CERT_REQUIRED with the DECNET CA pinned.
- executor.py offloads the blocking deployer onto asyncio.to_thread so
  the event loop stays responsive.
- server.py refuses to start without an enrolled bundle in
  ~/.decnet/agent/ — unauthenticated agents are not a supported mode.
- docs/openapi disabled on the agent — narrow attack surface.

tests/test_base_repo.py: DummyRepo was missing get_attacker_artifacts
(pre-existing abstractmethod) and so could not be instantiated. Added
the stub + coverage for the new swarm CRUD surface on BaseRepository.
2026-04-18 07:15:53 -04:00