DECNET

Author	SHA1	Message	Date
anti	5c0631e12c	feat(agent,forwarder,updater): publish system.<worker>.health heartbeats (DEBT-031 workers 7-9) All three workers now share a run_health_heartbeat helper in decnet.bus.publish. Each publishes system.<worker>.health on a 30s tick with {worker, ts} plus optional per-worker extras. Subscribers can watch system.*.health to see every DECNET worker on a host at once. - agent: heartbeat runs inside the FastAPI lifespan alongside the existing master-facing heartbeat; bus-disabled path is a no-op. - forwarder: heartbeat task spawned at run_forwarder entry, cancelled in the finally block so a crashed master loop never leaks the task. - updater: new FastAPI lifespan hosts the heartbeat. Heartbeat helper swallows extra() failures and is cancellation-safe so lifespan teardown never hangs on it.	2026-04-21 17:02:10 -04:00
anti	0cdcfe2653	feat(agent/collector): topology-label discovery and master-authoritative supersede Legacy fleet deckies live in decnet-state.json; MazeNET topology containers don't. Tag them at compose-time with decnet.topology.service=true and let the collector match on that label. Spin up the agent's log collector on the first successful /topology/apply (not in the lifespan — that would break the no-docker-on-boot invariant) and tear it down with the app. Land log lines in DECNET_AGENT_LOG_FILE, separate from master-side DECNET_INGEST_LOG_FILE, so a dev box running both roles can't forward its own ingest back to itself. When master pushes a topology that differs from whatever is pinned locally, teardown the predecessor and accept the new one. Refusing with 409 left the agent stranded after partial deploys. record_error now persists the hydrated blob so a later teardown can still walk the LAN list — otherwise a half-failed apply strands containers + bridges with no breadcrumb back to them.	2026-04-21 10:23:10 -04:00
anti	12e18b75db	feat(swarm): expose needs_resync on TopologySummary + upsert record_error Two small observability follow-ups to the phase-1 agent/topology wiring: TopologySummary now carries needs_resync so operators can see the heartbeat's resync flag via the topology list/detail API without dropping into the DB. TopologyStore.record_error becomes an upsert — when a docker/compose failure fires during the first materialise (put() never reached), we still land a marker row so GET /topology/state surfaces the error and the next heartbeat carries an empty applied_version_hash. That empty hash is what master's heartbeat check relies on to flag the topology for resync instead of assuming the apply succeeded.	2026-04-21 01:41:30 -04:00
anti	e8f9c955b3	feat(swarm): heartbeat-driven topology resync for agent-pinned deployments Agent heartbeats now carry an applied-topology snapshot. The master heartbeat handler compares the reported version_hash against what canonical_hash yields for the hydrated topology pinned to that host and flags Topology.needs_resync on divergence (or when the agent reports no topology at all while master expects one). The mutator watch loop gains reconcile_agent_resyncs, which re-pushes the current hydrated blob via AgentClient.apply_topology without touching status, then clears the flag on success. Push failures leave the flag set so the next tick retries.	2026-04-21 01:35:12 -04:00
anti	13cb0ff38e	feat(agent): topology apply/teardown/state endpoints New mTLS-protected routes on the agent: - POST /topology/apply — master pushes {hydrated, version_hash}. Validates the hash matches locally (serialisation drift guard), runs the topology through the same validator/composer pipeline used master-side, then creates bridges + compose up + records the apply in topology.db. - POST /topology/teardown — dismantles compose, removes bridges, clears topology.db. Idempotent. - GET /topology/state — returns applied row + live docker observation for the heartbeat. Implementation lives in decnet/agent/topology_ops.py; it reuses the private compose helpers from decnet.engine.deployer so we don't duplicate compose/project-name plumbing. The apply path is sync under the hood (docker SDK + subprocess); we hop to a thread so the event loop keeps servicing other agent traffic. v1 is one-topology-per-agent; cross-topology apply returns 409. Step 4 of the agent <-> topology integration.	2026-04-21 01:25:15 -04:00
anti	aea3e7e05b	feat(agent): sqlite-backed topology_store as applied-state cache Single-row sqlite tracking which topology the agent last applied and its version hash. Sync/stdlib, same pattern as the log-forwarder offset store. v1 is one-topology-per-agent; attempting to apply a different topology over a populated row raises AlreadyApplied so the endpoint can return 409. observed() snapshots live docker state (decnet-topology-* bridges + decnet-* containers) for the heartbeat. The store is a cache, not authority — no auto-restore on boot. Master remains the only source of truth. Step 3 of the agent <-> topology integration.	2026-04-21 01:22:01 -04:00
anti	12b5c25cd7	fix(agent-routes): added undocumented responses	2026-04-20 01:24:05 -04:00
anti	bf01804736	feat(agent): periodic heartbeat loop posting status to swarmctl New decnet.agent.heartbeat asyncio loop wired into the agent FastAPI lifespan. Every 30 s the worker POSTs executor.status() to the master's /swarm/heartbeat with its DECNET_HOST_UUID for self-identity; the existing agent mTLS bundle provides the client cert the master pins against SwarmHost.client_cert_fingerprint. start() is a silent no-op when identity env (HOST_UUID, MASTER_HOST) is unset or the worker bundle is missing, so dev runs and un-enrolled hosts don't crash the agent app. On non-204 responses the loop logs loudly but keeps ticking — an operator may re-enrol mid-session, and fail-closed pinning shouldn't be self-silencing.	2026-04-19 21:49:34 -04:00
anti	f576564f02	fix(agent): also wipe /etc/decnet during self-destruct	2026-04-19 21:04:31 -04:00
anti	00d5799a79	fix(agent): escape systemd cgroup when spawning self-destruct reaper The reaper was being SIGTERM'd mid-rm because `start_new_session=True` only forks a new POSIX session — it does not escape decnet-agent.service's cgroup. When the reaper ran `systemctl stop decnet-agent`, systemd tore down the whole cgroup (reaper included) before `rm -rf /opt/decnet*` finished, leaving the install on disk. Spawn the reaper via `systemd-run --collect --unit decnet-reaper-<pid>` so it runs in a fresh transient scope, outside the agent unit. Falls back to bare Popen for non-systemd hosts.	2026-04-19 21:00:43 -04:00
anti	14250cacad	feat(swarm): self-destruct agent on decommission Decommissioning a worker from the dashboard (or swarm controller) now asks the agent to wipe its own install before the master forgets it. The agent stops decky containers + every decnet-* systemd unit, then deletes /opt/decnet, /etc/systemd/system/decnet-, /var/lib/decnet/, and /usr/local/bin/decnet. Logs under /var/log are preserved. The reaper runs as a detached /tmp script (start_new_session=True) so it survives the agent process being killed. Self-destruct dispatch is best-effort — a dead worker doesn't block master-side cleanup.	2026-04-19 20:47:09 -04:00
anti	df18cb44cc	fix(swarm): don't paint healthy deckies as failed when a shard-sibling fails docker compose up is partial-success-friendly — a build failure on one service doesn't roll back the others. But the master was catching the agent's 500 and tagging every decky in the shard as 'failed' with the same error message. From the UI that looked like all three deckies died even though two were live on the worker. On dispatch exception, probe the agent's /status to learn which deckies actually have running containers, and upsert per-decky state accordingly. Only fall back to marking the whole shard failed if the status probe itself is unreachable. Enhance agent.executor.status() to include a 'runtime' map keyed by decky name with per-service container state, so the master has something concrete to consult.	2026-04-19 20:11:08 -04:00
anti	65fc9ac2b9	fix(tests): clean up two pre-existing failures before config work - decnet/agent/app.py /health: drop leftover 'push-test-2' canary planted during live VM push verification and never cleaned up; test_health_endpoint asserts the exact dict shape. - tests/test_factory.py: switch the lazy-engine check from mysql+aiomysql (not in pyproject) to mysql+asyncmy (the driver the project actually ships). The test does not hit the wire so the dialect swap is safe. Both were red on `pytest tests/` before any config/auto-spawn work began; fixing them here so the upcoming commits land on a green full-suite baseline.	2026-04-19 03:17:17 -04:00
anti	ebeaf08a49	fix(updater): fall back to /proc scan when agent.pid is missing If the agent was started outside the updater (manually, during dev, or from a prior systemd unit), there is no agent.pid for _stop_agent to target, so a successful code install leaves the old in-memory agent process still serving requests. Scan /proc for any decnet agent command and SIGTERM all matches so restart is reliable regardless of how the agent was originally launched.	2026-04-18 23:42:26 -04:00
anti	4db9c7464c	fix(swarm): relocalize master-built config on worker before deploy deploy --mode swarm was failing on every heterogeneous fleet: the master populates config.interface from its own box (detect_interface() → its default NIC), then ships that verbatim. The worker's deployer then calls get_host_ip(config.interface), hits 'ip addr show wlp6s0' on a VM whose NIC is enp0s3, and 500s. Fix: agent.executor._relocalize() runs on every swarm-mode deploy. Re-detects the worker's interface/subnet/gateway/host_ip locally and swaps them into the config before calling deployer.deploy(). When the worker's subnet doesn't match the master's, decky IPs are re-allocated from the worker's subnet via allocate_ips() so they're reachable. Unihost-mode configs are left untouched — they're already built against the local box and second-guessing them would be wrong. Validated against anti@192.168.1.13: master dispatched interface=wlp6s0, agent logged 'relocalized interface=enp0s3', deployer ran successfully, dry-run returned ok=deployed. 4 new tests cover both branches (matching-subnet preserves decky IPs; mismatch re-allocates), the end-to-end executor.deploy() path, and the unihost short-circuit.	2026-04-18 20:41:21 -04:00
anti	cd0057c129	feat(swarm): DeckyConfig.host_uuid + fix agent log/status field refs - decnet.models.DeckyConfig grows an optional 'host_uuid' (the SwarmHost that runs this decky). Defaults to None so legacy unihost state files deserialize unchanged. - decnet.agent.executor: replace non-existent config.name references with config.mode / config.interface in logs and status payload. - tests/swarm/test_state_schema.py covers legacy-dict roundtrip, field default, and swarm-mode assignments.	2026-04-18 19:10:25 -04:00
anti	8257bcc031	feat(swarm): worker agent + fix pre-existing base_repo coverage test Worker agent (decnet.agent): - mTLS FastAPI service exposing /deploy, /teardown, /status, /health, /mutate. uvicorn enforces CERT_REQUIRED with the DECNET CA pinned. - executor.py offloads the blocking deployer onto asyncio.to_thread so the event loop stays responsive. - server.py refuses to start without an enrolled bundle in ~/.decnet/agent/ — unauthenticated agents are not a supported mode. - docs/openapi disabled on the agent — narrow attack surface. tests/test_base_repo.py: DummyRepo was missing get_attacker_artifacts (pre-existing abstractmethod) and so could not be instantiated. Added the stub + coverage for the new swarm CRUD surface on BaseRepository.	2026-04-18 07:15:53 -04:00
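
The shared heartbeat helper described in 5c0631e12c can be pictured roughly as follows. This is a minimal sketch, not the code in decnet.bus.publish: the function name `health_heartbeat`, its parameters, and the `publish` callable are hypothetical stand-ins, and only the topic shape (system.<worker>.health), the {worker, ts} payload, the 30 s tick, and the "swallow extra() failures, stay cancellation-safe" behaviour come from the commit message.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict, Optional

# Hypothetical bus interface: an async callable taking (topic, payload).
Publish = Callable[[str, Dict[str, Any]], Awaitable[None]]


async def health_heartbeat(
    publish: Publish,
    worker: str,
    interval: float = 30.0,
    extra: Optional[Callable[[], Dict[str, Any]]] = None,
) -> None:
    """Publish system.<worker>.health on every tick until cancelled."""
    topic = f"system.{worker}.health"
    while True:
        payload: Dict[str, Any] = {"worker": worker, "ts": time.time()}
        if extra is not None:
            try:
                payload.update(extra())
            except Exception:
                # A failing extra() must not kill the heartbeat; the base
                # {worker, ts} payload is still published.
                pass
        await publish(topic, payload)
        # CancelledError is a BaseException, so the except clause above never
        # swallows it; cancelling the task ends the loop at the next await.
        await asyncio.sleep(interval)
```

A caller would typically start this with `asyncio.create_task(...)` when the service comes up and cancel the task during teardown, which matches the forwarder's "cancelled in the finally block" description.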

17 Commits
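
The resync rule from e8f9c955b3 and 12e18b75db (compare the agent's reported version hash against the hash of the topology master has pinned to that host, and treat a missing or empty hash as divergence) can be sketched as below. The hashing scheme here, SHA-256 over sorted-key JSON, is an assumption chosen for illustration; the commit messages name `canonical_hash` but do not say how it is computed, and the `needs_resync` function signature is likewise hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def canonical_hash(hydrated: Dict[str, Any]) -> str:
    """Hash a topology blob independently of key order (assumed scheme)."""
    blob = json.dumps(hydrated, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def needs_resync(
    expected: Optional[Dict[str, Any]],
    reported_hash: Optional[str],
) -> bool:
    """Decide whether master should flag a host's topology for re-push."""
    if expected is None:
        # Master has nothing pinned to this host, so there is nothing to push.
        return False
    if not reported_hash:
        # Agent reports no topology, or an empty hash after a failed apply.
        return True
    return reported_hash != canonical_hash(expected)
```

With this shape, `needs_resync(pinned, "")` is true, which mirrors the commit note that an empty applied_version_hash is what lets master flag a half-failed apply instead of assuming it succeeded.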