Legacy fleet deckies live in decnet-state.json; MazeNET topology
containers don't. Tag them at compose-time with
decnet.topology.service=true and let the collector match on that label.
Spin up the agent's log collector on the first successful /topology/apply
(not in the lifespan — that would break the no-docker-on-boot invariant)
and tear it down with the app. Land log lines in DECNET_AGENT_LOG_FILE,
separate from master-side DECNET_INGEST_LOG_FILE, so a dev box running
both roles can't forward its own ingest back to itself.
When master pushes a topology that differs from whatever is pinned
locally, teardown the predecessor and accept the new one. Refusing with
409 left the agent stranded after partial deploys. record_error now
persists the hydrated blob so a later teardown can still walk the LAN
list — otherwise a half-failed apply strands containers + bridges with
no breadcrumb back to them.
Two small observability follow-ups to the phase-1 agent/topology wiring:
TopologySummary now carries needs_resync so operators can see the
heartbeat's resync flag via the topology list/detail API without
dropping into the DB.
TopologyStore.record_error becomes an upsert — when a docker/compose
failure fires during the first materialise (put() never reached), we
still land a marker row so GET /topology/state surfaces the error and
the next heartbeat carries an empty applied_version_hash. That empty
hash is what master's heartbeat check relies on to flag the topology
for resync instead of assuming the apply succeeded.
Agent heartbeats now carry an applied-topology snapshot. The master
heartbeat handler compares the reported version_hash against what
canonical_hash yields for the hydrated topology pinned to that host
and flags Topology.needs_resync on divergence (or when the agent
reports no topology at all while master expects one).
The mutator watch loop gains reconcile_agent_resyncs, which re-pushes
the current hydrated blob via AgentClient.apply_topology without
touching status, then clears the flag on success. Push failures leave
the flag set so the next tick retries.
New mTLS-protected routes on the agent:
- POST /topology/apply — master pushes {hydrated, version_hash}.
Validates the hash matches locally (serialisation drift guard),
runs the topology through the same validator/composer pipeline
used master-side, then creates bridges + compose up + records the
apply in topology.db.
- POST /topology/teardown — dismantles compose, removes bridges,
clears topology.db. Idempotent.
- GET /topology/state — returns applied row + live docker
observation for the heartbeat.
Implementation lives in decnet/agent/topology_ops.py; it reuses the
private compose helpers from decnet.engine.deployer so we don't
duplicate compose/project-name plumbing. The apply path is sync
under the hood (docker SDK + subprocess); we hop to a thread so the
event loop keeps servicing other agent traffic.
v1 is one-topology-per-agent; cross-topology apply returns 409.
Step 4 of the agent <-> topology integration.
Single-row sqlite tracking which topology the agent last applied and
its version hash. Sync/stdlib, same pattern as the log-forwarder
offset store. v1 is one-topology-per-agent; attempting to apply a
different topology over a populated row raises AlreadyApplied so the
endpoint can return 409. observed() snapshots live docker state
(decnet-topology-* bridges + decnet-* containers) for the heartbeat.
The store is a cache, not authority — no auto-restore on boot.
Master remains the only source of truth.
Step 3 of the agent <-> topology integration.
New decnet.agent.heartbeat asyncio loop wired into the agent FastAPI
lifespan. Every 30 s the worker POSTs executor.status() to the master's
/swarm/heartbeat with its DECNET_HOST_UUID for self-identity; the
existing agent mTLS bundle provides the client cert the master pins
against SwarmHost.client_cert_fingerprint.
start() is a silent no-op when identity env (HOST_UUID, MASTER_HOST) is
unset or the worker bundle is missing, so dev runs and un-enrolled hosts
don't crash the agent app. On non-204 responses the loop logs loudly but
keeps ticking — an operator may re-enrol mid-session, and fail-closed
pinning shouldn't be self-silencing.
The reaper was being SIGTERM'd mid-rm because `start_new_session=True`
only forks a new POSIX session — it does not escape decnet-agent.service's
cgroup. When the reaper ran `systemctl stop decnet-agent`, systemd
tore down the whole cgroup (reaper included) before `rm -rf /opt/decnet*`
finished, leaving the install on disk.
Spawn the reaper via `systemd-run --collect --unit decnet-reaper-<pid>`
so it runs in a fresh transient scope, outside the agent unit. Falls
back to bare Popen for non-systemd hosts.
Decommissioning a worker from the dashboard (or swarm controller) now
asks the agent to wipe its own install before the master forgets it.
The agent stops decky containers + every decnet-* systemd unit, then
deletes /opt/decnet*, /etc/systemd/system/decnet-*, /var/lib/decnet/*,
and /usr/local/bin/decnet*. Logs under /var/log are preserved.
The reaper runs as a detached /tmp script (start_new_session=True) so
it survives the agent process being killed. Self-destruct dispatch is
best-effort — a dead worker doesn't block master-side cleanup.
docker compose up is partial-success-friendly — a build failure on one
service doesn't roll back the others. But the master was catching the
agent's 500 and tagging every decky in the shard as 'failed' with the
same error message. From the UI that looked like all three deckies died
even though two were live on the worker.
On dispatch exception, probe the agent's /status to learn which deckies
actually have running containers, and upsert per-decky state accordingly.
Only fall back to marking the whole shard failed if the status probe
itself is unreachable.
Enhance agent.executor.status() to include a 'runtime' map keyed by
decky name with per-service container state, so the master has something
concrete to consult.
- decnet/agent/app.py /health: drop leftover 'push-test-2' canary
planted during live VM push verification and never cleaned up;
test_health_endpoint asserts the exact dict shape.
- tests/test_factory.py: switch the lazy-engine check from
mysql+aiomysql (not in pyproject) to mysql+asyncmy (the driver the
project actually ships). The test does not hit the wire so the
dialect swap is safe.
Both were red on `pytest tests/` before any config/auto-spawn work
began; fixing them here so the upcoming commits land on a green
full-suite baseline.
If the agent was started outside the updater (manually, during dev,
or from a prior systemd unit), there is no agent.pid for _stop_agent
to target, so a successful code install leaves the old in-memory
agent process still serving requests. Scan /proc for any decnet agent
command and SIGTERM all matches so restart is reliable regardless of
how the agent was originally launched.
deploy --mode swarm was failing on every heterogeneous fleet: the master
populates config.interface from its own box (detect_interface() → its
default NIC), then ships that verbatim. The worker's deployer then calls
get_host_ip(config.interface), hits 'ip addr show wlp6s0' on a VM whose
NIC is enp0s3, and 500s.
Fix: agent.executor._relocalize() runs on every swarm-mode deploy.
Re-detects the worker's interface/subnet/gateway/host_ip locally and
swaps them into the config before calling deployer.deploy(). When the
worker's subnet doesn't match the master's, decky IPs are re-allocated
from the worker's subnet via allocate_ips() so they're reachable.
Unihost-mode configs are left untouched — they're already built against
the local box and second-guessing them would be wrong.
Validated against anti@192.168.1.13: master dispatched interface=wlp6s0,
agent logged 'relocalized interface=enp0s3', deployer ran successfully,
dry-run returned ok=deployed.
4 new tests cover both branches (matching-subnet preserves decky IPs;
mismatch re-allocates), the end-to-end executor.deploy() path, and the
unihost short-circuit.
- decnet.models.DeckyConfig grows an optional 'host_uuid' (the SwarmHost
that runs this decky). Defaults to None so legacy unihost state files
deserialize unchanged.
- decnet.agent.executor: replace non-existent config.name references
with config.mode / config.interface in logs and status payload.
- tests/swarm/test_state_schema.py covers legacy-dict roundtrip, field
default, and swarm-mode assignments.
Worker agent (decnet.agent):
- mTLS FastAPI service exposing /deploy, /teardown, /status, /health,
/mutate. uvicorn enforces CERT_REQUIRED with the DECNET CA pinned.
- executor.py offloads the blocking deployer onto asyncio.to_thread so
the event loop stays responsive.
- server.py refuses to start without an enrolled bundle in
~/.decnet/agent/ — unauthenticated agents are not a supported mode.
- docs/openapi disabled on the agent — narrow attack surface.
tests/test_base_repo.py: DummyRepo was missing get_attacker_artifacts
(pre-existing abstractmethod) and so could not be instantiated. Added
the stub + coverage for the new swarm CRUD surface on BaseRepository.