DECNET

Author	SHA1	Message	Date
anti	f3eaab5d37	refactor(bus): extract publish_safely + extend topics for DEBT-031 Shared publish_safely helper at decnet/bus/publish.py so the nine workers about to be wired into the bus don't each copy-paste the "never raise back at the caller" contract. Mutator drops its private copy and imports the canonical one. topics.py gains the attacker.* hierarchy (observed, scored, session.started, session.ended) and a system_health(worker) builder for per-worker health heartbeats — both prerequisites for the worker rollout under DEBT-031.	2026-04-21 16:32:30 -04:00
anti	1968f6e741	test(mutator,web): cover bus publishes, bus-wake, and SSE events route - tests/topology/test_mutator.py: reconcile_topologies publishes applying+applied on success, applying+failed+status on failure; and stays safe when bus=None. _wake_on_enqueue sets its asyncio.Event on every matching enqueue event. - tests/api/topology/test_mutations.py: POST /mutations publishes mutation.enqueued after a successful DB write, via a FakeBus injected in place of the app-wide bus singleton. - tests/api/topology/test_events_stream.py: SSE route returns 401 unauthenticated, 404 for unknown topologies, and (driving the async generator directly) emits a snapshot on connect plus forwards a published mutation.applied as an `event: mutation.applied` SSE frame.	2026-04-21 14:39:12 -04:00
anti	fbf289ff63	feat(bus): host-local UNIX-socket pub/sub worker (DEBT-029) Land the `decnet bus` worker and `get_bus()` factory. Transport is a host-local UNIX-domain socket (0660, group=decnet); authz is the file mode. Wire framing is a tiny verb-line + 4-byte-BE length + orjson body. NATS-style wildcard topics (`*`, `>`). At-most-once, fire-and-forget — DB stays the source of truth. `FakeBus` / `NullBus` for tests and the disabled path. Cross-host federation is deferred to a future `--bridge-tcp` mode; DEBT-030 is master-only and unblocked.	2026-04-21 13:49:02 -04:00
anti	d9f3824086	test(topology): cover compose labels and tolerate docker filter kwarg test_compose asserts the new decnet.topology.* labels land on both base deckies (role=base, no service marker) and service fragments (service=true). The stub docker client in test_deploy grew a filters kwarg so it keeps matching the real .networks.list(filters=...) call signature now used by the deployer.	2026-04-21 10:24:15 -04:00
anti	0cdcfe2653	feat(agent/collector): topology-label discovery and master-authoritative supersede Legacy fleet deckies live in decnet-state.json; MazeNET topology containers don't. Tag them at compose-time with decnet.topology.service=true and let the collector match on that label. Spin up the agent's log collector on the first successful /topology/apply (not in the lifespan — that would break the no-docker-on-boot invariant) and tear it down with the app. Land log lines in DECNET_AGENT_LOG_FILE, separate from master-side DECNET_INGEST_LOG_FILE, so a dev box running both roles can't forward its own ingest back to itself. When master pushes a topology that differs from whatever is pinned locally, teardown the predecessor and accept the new one. Refusing with 409 left the agent stranded after partial deploys. record_error now persists the hydrated blob so a later teardown can still walk the LAN list — otherwise a half-failed apply strands containers + bridges with no breadcrumb back to them.	2026-04-21 10:23:10 -04:00
anti	12e18b75db	feat(swarm): expose needs_resync on TopologySummary + upsert record_error Two small observability follow-ups to the phase-1 agent/topology wiring: TopologySummary now carries needs_resync so operators can see the heartbeat's resync flag via the topology list/detail API without dropping into the DB. TopologyStore.record_error becomes an upsert — when a docker/compose failure fires during the first materialise (put() never reached), we still land a marker row so GET /topology/state surfaces the error and the next heartbeat carries an empty applied_version_hash. That empty hash is what master's heartbeat check relies on to flag the topology for resync instead of assuming the apply succeeded.	2026-04-21 01:41:30 -04:00
anti	0a14dbc9f4	test(agent): pin no-auto-restore-on-boot invariant for topology cache Four regression tests guarding Step 8 of the agent/topology wiring: - Lifespan startup must not call docker.from_env even with a populated topology.db — replace docker with a boom-stub and assert zero calls. - GET /topology/state returns the cached row verbatim without re-materialising bridges/containers; live observation is read-only. - Static guard: TopologyStore must not grow a restore/replay/reapply method without someone re-reading the module docstring. - Raw sqlite read + a second TopologyStore instance confirm the store is passive — nothing scrubs stale rows on open, which is the behaviour master's resync flow depends on.	2026-04-21 01:37:05 -04:00
anti	e8f9c955b3	feat(swarm): heartbeat-driven topology resync for agent-pinned deployments Agent heartbeats now carry an applied-topology snapshot. The master heartbeat handler compares the reported version_hash against what canonical_hash yields for the hydrated topology pinned to that host and flags Topology.needs_resync on divergence (or when the agent reports no topology at all while master expects one). The mutator watch loop gains reconcile_agent_resyncs, which re-pushes the current hydrated blob via AgentClient.apply_topology without touching status, then clears the flag on success. Push failures leave the flag set so the next tick retries.	2026-04-21 01:35:12 -04:00
anti	05d1ebbaaa	feat(engine): route agent-pinned topologies via AgentClient deploy_topology and teardown_topology now branch on target_host_uuid. When set: - Hydrate the topology locally (validator runs exactly as before). - Compute canonical_hash; push {hydrated, version_hash} to the pinned agent through AgentClient.apply_topology. - Status machine still moves PENDING -> DEPLOYING -> ACTIVE on 2xx, PENDING -> DEPLOYING -> FAILED on error; master remains the sole owner of the row. Teardown flips to TEARING_DOWN, fires /topology/teardown, then TORN_DOWN — we log a warning on agent error but still settle to TORN_DOWN so operators can delete the row (agent garbage is cleaned on the next re-enroll). Unihost deploys are unchanged — the field defaults to NULL so every existing flow takes the local path. Step 6 of the agent <-> topology integration.	2026-04-21 01:27:59 -04:00
anti	5f8a746d6e	feat(swarm): AgentClient topology apply/teardown/state methods Three new RPCs mirroring the existing deploy/teardown/status pattern: - apply_topology(hydrated, version_hash) — long-timeout (600s) for image pulls + compose up. - teardown_topology(topology_id) — 300s timeout; enough for a stubborn compose-down without hanging a heartbeat. - get_topology_state() — short control-plane read for reconcile. The per-call timeout swap uses the same trick as .deploy(). Step 5 of the agent <-> topology integration.	2026-04-21 01:26:21 -04:00
anti	13cb0ff38e	feat(agent): topology apply/teardown/state endpoints New mTLS-protected routes on the agent: - POST /topology/apply — master pushes {hydrated, version_hash}. Validates the hash matches locally (serialisation drift guard), runs the topology through the same validator/composer pipeline used master-side, then creates bridges + compose up + records the apply in topology.db. - POST /topology/teardown — dismantles compose, removes bridges, clears topology.db. Idempotent. - GET /topology/state — returns applied row + live docker observation for the heartbeat. Implementation lives in decnet/agent/topology_ops.py; it reuses the private compose helpers from decnet.engine.deployer so we don't duplicate compose/project-name plumbing. The apply path is sync under the hood (docker SDK + subprocess); we hop to a thread so the event loop keeps servicing other agent traffic. v1 is one-topology-per-agent; cross-topology apply returns 409. Step 4 of the agent <-> topology integration.	2026-04-21 01:25:15 -04:00
anti	aea3e7e05b	feat(agent): sqlite-backed topology_store as applied-state cache Single-row sqlite tracking which topology the agent last applied and its version hash. Sync/stdlib, same pattern as the log-forwarder offset store. v1 is one-topology-per-agent; attempting to apply a different topology over a populated row raises AlreadyApplied so the endpoint can return 409. observed() snapshots live docker state (decnet-topology-* bridges + decnet-* containers) for the heartbeat. The store is a cache, not authority — no auto-restore on boot. Master remains the only source of truth. Step 3 of the agent <-> topology integration.	2026-04-21 01:22:01 -04:00
anti	98465af226	feat(topology): canonical_hash for applied-state comparison Tiny pure helper both master and agent will use to answer "is the applied state the one we expect?". SHA-256 of canonical JSON with volatile keys (timestamps, status, version, canvas x/y/w/h) stripped so the hash only captures deployment-relevant state. Step 2 of the agent <-> topology integration.	2026-04-21 01:20:42 -04:00
anti	5a0cf5d7c8	feat(topology): add target_host_uuid to pin topologies to swarm agents Adds the `target_host_uuid` FK on `Topology` plus wiring through the two create endpoints (`POST /topologies`, `POST /topologies/blank`). Validates the mode/host pair: `mode='agent'` now requires a known, routable host; `mode='unihost'` must leave the field unset. Surfaced on `TopologySummary` so list/detail responses expose it. Purely additive at the schema level — existing unihost flows unchanged (field defaults to `NULL`). Step 1 of the agent <-> topology integration.	2026-04-21 01:19:45 -04:00
anti	d06b04221f	feat(api/topology): live mutation queue endpoints (POST/GET /mutations)	2026-04-20 19:38:55 -04:00
anti	ff0b2efbb0	feat(api/topology): pending-only child CRUD for LANs, deckies, edges	2026-04-20 19:37:16 -04:00
anti	999113e3c3	feat(api/topology): POST/DELETE/deploy endpoints for MazeNET topologies	2026-04-20 19:34:35 -04:00
anti	f182c98ffa	feat(api): phase 3 step 2 — topology read endpoints (list/get/status/catalog) GET /api/v1/topologies — paginated list with status filter. Extends repo.list_topologies() to accept limit/offset and adds count_topologies() for the total envelope field. GET /api/v1/topologies/{id} — hydrated TopologyDetail; 404 if missing. GET /api/v1/topologies/{id}/status-events — audit trail, limit-capped. Catalog helpers for the phase-4 canvas UI: * GET /topologies/services — full service catalog. * GET /topologies/next-subnet?base=172.20 — wraps SubnetAllocator against reserved_subnets across non-torn-down topologies. * GET /topologies/{id}/lans/{lan_id}/next-ip — IPAllocator pre-seeded with existing decky IPs in that LAN. All read routes are viewer-or-admin. Sub-routers are included in an order that keeps literal catalog paths (/services, /next-subnet) from being shadowed by the /{topology_id} trie branch.	2026-04-20 18:25:33 -04:00
anti	2379b2aeda	feat(api): phase 3 step 1 — topology request/response models + router skeleton Add Pydantic DTOs in decnet/web/db/models.py covering every phase-3 endpoint shape: TopologyGenerateRequest, TopologySummary/Detail, child create/update requests, MutationEnqueueRequest (Literal op guard), MutationRow with JSON-payload decoder, validation/version/not-editable error envelopes, and the three catalog responses. Create decnet/web/router/topology/ as an import-safe package exporting topology_router (prefix /topologies) — sub-routers land step-by-step in subsequent commits. Mount under the main api router alongside swarm_mgmt. tests/api/topology/test_models.py pins repo-dict ↔ DTO parity so future repo-row drift breaks the contract test before the endpoints.	2026-04-20 18:16:30 -04:00
anti	a76b9ecdf9	feat(mazenet): step 7 — topology_mutations queue + mutator reconciler Adds the live-mutation pipeline for active/degraded topologies: * TopologyMutation table with composite index (state, topology_id) so the watch-loop guard query stays O(log n). * claim_next_mutation is a single atomic UPDATE ... WHERE state='pending' so racing reconcilers deterministically pick one winner; losers see rowcount=0 and skip. * reconcile_topologies drains pending rows per live topology, applies via decnet.mutator.ops.dispatch, and on failure marks the mutation failed + transitions topology to degraded. * run_watch_loop gains a gated branch: flat-fleet mutate_all runs every tick unchanged; the reconciler only enters when the cheap has_pending_topology_mutation guard returns True. * apply_* ops re-check hard invariants (names, IP collisions, subnet overlap, known services, service_config shape) after every mutation so the repo never lands in an invalid state. * CLI: 'decnet topology mutate' / 'mutations' subcommands.	2026-04-20 18:02:37 -04:00
anti	91df57d36b	feat(topology): pending-only mutation repo methods with cascade + guards MazeNET phase 2 step 6. Equips the repo layer with the CRUD the web editor needs before deploy. - TopologyNotEditable exception: raised when a pending-only method hits a non-pending topology. The intent is "free-form edits stop at deploy; the mutator (step 7) takes over for live topologies." - _assert_pending helper checks status inside the session. - update_lan / update_topology_decky accept enforce_pending=True for pre-deploy callers (existing internal callers default to False so behavior is unchanged). - delete_lan: cascades edges; refuses if any decky has only one edge (= this LAN is its home) to prevent orphans. - delete_topology_decky: cascades edges. - delete_topology_edge: bare-bones removal. All four mutators accept expected_version for optimistic concurrency. Existing tests continue to pass (no behavior change for persist/deploy).	2026-04-20 17:50:29 -04:00
anti	9afaac7612	feat(topology): nullable layout coords on LAN + TopologyDecky MazeNET phase 2 step 5. Pure storage — the generator emits None for x/y and the web canvas fills them in later. No logic changes; no compose, deploy, or validator impact.	2026-04-20 17:48:29 -04:00
anti	e475c0957e	feat(topology): optimistic concurrency via Topology.version + expected_version MazeNET phase 2 step 4. Readies the repo layer for concurrent editors (web canvas + CLI + mutator) without lost-write races. - Topology.version: monotonically bumped on supervised child-row writes. - VersionConflict exception carries {current, expected} for the UI. - _check_and_bump_version helper reads Topology in the same session, compares against expected_version, raises on mismatch, bumps on match. Commit happens in the caller's existing transaction so check+bump+write are atomic per mutation. - add_lan / update_lan / add_topology_decky / update_topology_decky / add_topology_edge accept expected_version=None by default, preserving every existing caller's behavior. When expected_version is None, no check runs and version stays put — internal callers (persist) that don't care about concurrency keep working unchanged.	2026-04-20 17:47:28 -04:00
anti	2544d0294a	feat(topology): add pre-deploy validator and wire into deploy_topology MazeNET phase 2 step 3. Blocks deploys of hand-authored topologies that would fail mid-bring-up (orphan deckies, duplicate IPs, overlapping subnets, unknown services) with a structured error list instead of a docker error at startup. Rules (one function each, composable by the editor for inline hints): - exactly one DMZ - every LAN has a bridge chain to the DMZ (BFS via multi-homed deckies) - no orphan deckies - unique LAN and decky names per topology - no IP collisions + IPs inside their LAN's subnet - no LAN subnet overlaps - every service in decnet.fleet.all_service_names() - service_config keys match the decky's declared services deploy_topology runs the validator after hydrate, before any status transition or Docker call; errors raise ValidationError and status stays at pending.	2026-04-20 17:45:32 -04:00
anti	d4f4c58277	feat(topology): thread per-service config overrides through compose MazeNET phase 2 step 2. Mirrors the flat-fleet service_config pattern (DeckyConfig.service_config → composer → svc.compose_fragment) into the topology compose pipeline, so a hand-authored decky can carry overrides like {"ssh": {"password": "megapassword"}} and the ssh fragment reads them just like the flat path does. - _PlannedDecky gains service_config: dict[str, dict]. - persist() stores it under decky_config["service_config"]. - topology/compose.py passes cfg.get("service_config", {}).get(svc, {}) to svc.compose_fragment(service_cfg=...). Schema unchanged — service_config lives inside the existing decky_config JSON blob. Zero changes in decnet/services/*.	2026-04-20 17:42:37 -04:00
anti	1bd1846e40	feat(topology): extract IP + subnet allocators as reusable services MazeNET phase 2 step 1. Pulls inline IP/subnet allocation out of the generator into decnet/topology/allocator.py so the editor + reconciler can reuse the same primitives without duplicating logic. - IPAllocator: stateful host-IP handout with reserve/release/is_free. - SubnetAllocator: /24 handout under a base prefix, skips reservations. - reserved_subnets(repo): collects claimed subnets across every non-torn_down topology so concurrent drafts cannot collide. - generate() accepts reserved_subnets= to skip existing claims. Generator output is byte-identical under seed (behavior preserved).	2026-04-20 17:41:17 -04:00
anti	80e3c28234	test(topology): deploy dry-run + failure-path + live docker e2e Covers dry-run compose emission (no status change), FAILED transition with reason logged on daemon errors, teardown from FAILED, and a live-marked end-to-end test that creates/removes bridge networks against a real docker daemon (skipped on CI).	2026-04-20 16:57:43 -04:00
anti	2a030bf3a9	feat(topology): add compose generator and deployer integration Adds per-topology compose generation (one Docker bridge network per LAN, multi-homed bridge deckies, ip_forward sysctl for L3 forwarders) plus async deploy_topology/teardown_topology in the engine. Leaf-first teardown via BFS-named LAN reverse sort; partial-state safe on failure.	2026-04-20 16:54:40 -04:00
anti	33f139ecfa	feat(mazenet): topology package — config, status machine, generator, persistence Adds decnet/topology/ with: - config.TopologyConfig: pydantic model driving generation (depth, branching_factor, deckies_per_lan_min/max, bridge_forward_probability, cross_edge_probability, subnet_base_prefix, service selection, seed). Emits GeneratedTopology dataclass (lans, deckies, edges). - status.TopologyStatus + assert_transition: seven-state machine with an explicit legal-transition table. torn_down is terminal; degraded is schema-reserved for future Healer use. - generator.generate: deterministic DAG generation under config.seed. Builds a tree of LANs (DMZ at root), plants deckies in each LAN, promotes one decky per non-DMZ LAN to a parent bridge, and rolls cross-edges per cross_edge_probability for DAG shape. - persistence: persist() writes a plan to the repo as pending; transition_status() enforces state-machine legality; hydrate() loads topology + children into a single dict. Covered by tests/topology/{test_status,test_generator,test_persistence}.	2026-04-20 16:48:20 -04:00
anti	47cd200e1d	feat(mazenet): repo methods for topology/LAN/decky/edge/status events Adds topology CRUD to BaseRepository (NotImplementedError defaults) and implements them in SQLModelRepository: create/get/list/delete topologies, add/update/list LANs and TopologyDeckies, add/list edges, plus an atomic update_topology_status that appends a TopologyStatusEvent in the same transaction. Cascade delete sweeps children before the topology row. Covered by tests/topology/test_repo.py (roundtrip, per-topology name uniqueness, status event log, cascade delete, status filter) and an extension to tests/test_base_repo.py for the NotImplementedError surface.	2026-04-20 16:43:49 -04:00
anti	4197441c01	fix(ci): skip live service isolation Some checks failed CI / Lint (ruff) (push) Successful in 12s Details CI / SAST (bandit) (push) Successful in 15s Details CI / Dependency audit (pip-audit) (push) Successful in 22s Details CI / Test (Standard) (3.11) (push) Successful in 2m47s Details CI / Test (Live) (3.11) (push) Successful in 1m7s Details CI / Test (Fuzz) (3.11) (push) Failing after 45m40s Details CI / Merge dev → testing (push) Has been skipped Details CI / Prepare Merge to Main (push) Has been skipped Details CI / Finalize Merge to Main (push) Has been skipped Details	2026-04-20 13:14:48 -04:00
anti	1b70d6db87	fix(ci): added skipif on mysql absence Some checks failed CI / Lint (ruff) (push) Successful in 12s Details CI / SAST (bandit) (push) Successful in 15s Details CI / Dependency audit (pip-audit) (push) Successful in 24s Details CI / Test (Standard) (3.11) (push) Successful in 2m51s Details CI / Test (Live) (3.11) (push) Failing after 1m2s Details CI / Test (Fuzz) (3.11) (push) Has been skipped Details CI / Merge dev → testing (push) Has been skipped Details CI / Prepare Merge to Main (push) Has been skipped Details CI / Finalize Merge to Main (push) Has been skipped Details	2026-04-20 13:07:31 -04:00
anti	f064690452	fixed(tests): jwt_lazy Some checks failed CI / Lint (ruff) (push) Successful in 12s Details CI / SAST (bandit) (push) Successful in 15s Details CI / Dependency audit (pip-audit) (push) Successful in 23s Details CI / Test (Standard) (3.11) (push) Successful in 5m6s Details CI / Test (Standard) (3.12) (push) Failing after 3h14m38s Details CI / Test (Live) (3.11) (push) Has been cancelled Details CI / Test (Fuzz) (3.11) (push) Has been cancelled Details CI / Merge dev → testing (push) Has been cancelled Details CI / Prepare Merge to Main (push) Has been cancelled Details CI / Finalize Merge to Main (push) Has been cancelled Details	2026-04-20 02:26:54 -04:00
anti	dd82cd3f39	fixed(tests): mode_gating Some checks failed CI / Lint (ruff) (push) Successful in 12s Details CI / SAST (bandit) (push) Successful in 15s Details CI / Dependency audit (pip-audit) (push) Successful in 23s Details CI / Test (Standard) (3.11) (push) Failing after 5m0s Details CI / Test (Live) (3.11) (push) Has been cancelled Details CI / Test (Fuzz) (3.11) (push) Has been cancelled Details CI / Merge dev → testing (push) Has been cancelled Details CI / Prepare Merge to Main (push) Has been cancelled Details CI / Finalize Merge to Main (push) Has been cancelled Details CI / Test (Standard) (3.12) (push) Has been cancelled Details	2026-04-20 02:18:11 -04:00
anti	47f2ca8d5f	added(tests): schemathesis contract fuzzing at the agent and swarmctl level Some checks failed CI / Lint (ruff) (push) Successful in 17s Details CI / SAST (bandit) (push) Failing after 19s Details CI / Dependency audit (pip-audit) (push) Failing after 38s Details CI / Test (Standard) (3.11) (push) Has been skipped Details CI / Test (Standard) (3.12) (push) Has been skipped Details CI / Test (Live) (3.11) (push) Has been skipped Details CI / Test (Fuzz) (3.11) (push) Has been skipped Details CI / Merge dev → testing (push) Has been skipped Details CI / Prepare Merge to Main (push) Has been skipped Details CI / Finalize Merge to Main (push) Has been skipped Details	2026-04-20 01:27:39 -04:00
anti	da3e675f86	fix(tests): fixed locust fixtures and rampups, since >100 generally isn't very well managed	2026-04-20 01:26:56 -04:00
anti	4abfac1a98	fix: monkeypatch test db URL	2026-04-20 00:03:12 -04:00
anti	9eca33938d	chore: deleted swp file	2026-04-19 23:51:59 -04:00
anti	195580c74d	test: fix templates paths, CLI gating, and stress-suite harness - tests/*: update templates/ → decnet/templates/ paths after module move - tests/mysql_spinup.sh: use root:root and asyncmy driver - tests/test_auto_spawn.py: patch decnet.cli.utils._pid_dir (package split) - tests/test_cli.py: set DECNET_MODE=master in api-command tests - tests/stress/conftest.py: run locust out-of-process via its CLI + CSV stats shim to avoid urllib3 RecursionError from late gevent monkey-patch; raise uvicorn startup timeout to 60s, accept 401 from auth-gated health, strip inherited DECNET_ env, surface stderr on 0-request runs - tests/stress/test_stress.py: loosen baseline thresholds to match hw	2026-04-19 23:50:53 -04:00
anti	262a84ca53	refactor(cli): split decnet/cli.py monolith into decnet/cli/ package The 1,878-line cli.py held every Typer command plus process/HTTP helpers and mode-gating logic. Split into one module per command using a register(app) pattern so submodules never import app at module scope, eliminating circular-import risk. - utils.py: process helpers, _http_request, _kill_all_services, console, log - gating.py: MASTER_ONLY_* sets, _require_master_mode, _gate_commands_by_mode - deploy.py: deploy + _deploy_swarm (tightly coupled) - lifecycle.py: status, teardown, redeploy - workers.py: probe, collect, mutate, correlate - inventory.py, swarm.py, db.py, and one file per remaining command __init__.py calls register(app) on each module then runs the mode gate last, and re-exports the private symbols tests patch against (_db_reset_mysql_async, _kill_all_services, _require_master_mode, etc.). Test patches retargeted to the submodule where each name now resolves. Enroll-bundle tarball test updated to assert decnet/cli/__init__.py. No behavioral change.	2026-04-19 22:42:52 -04:00
anti	d1b7e94325	fix(swarm): inject peer cert into ASGI scope for uvicorn <= 0.44 Uvicorn's h11/httptools HTTP protocols don't populate scope['extensions']['tls'], so /swarm/heartbeat's per-request cert pinning was 403ing every call despite CERT_REQUIRED validating the cert at handshake. Patch RequestResponseCycle.__init__ on both protocol modules to read the peer cert off the asyncio transport and write DER bytes into scope['extensions']['tls']['client_cert_chain']. Importing the module from swarm_api.py auto-installs the patch in the swarmctl uvicorn worker before any request is served.	2026-04-19 22:09:11 -04:00
anti	bf01804736	feat(agent): periodic heartbeat loop posting status to swarmctl New decnet.agent.heartbeat asyncio loop wired into the agent FastAPI lifespan. Every 30 s the worker POSTs executor.status() to the master's /swarm/heartbeat with its DECNET_HOST_UUID for self-identity; the existing agent mTLS bundle provides the client cert the master pins against SwarmHost.client_cert_fingerprint. start() is a silent no-op when identity env (HOST_UUID, MASTER_HOST) is unset or the worker bundle is missing, so dev runs and un-enrolled hosts don't crash the agent app. On non-204 responses the loop logs loudly but keeps ticking — an operator may re-enrol mid-session, and fail-closed pinning shouldn't be self-silencing.	2026-04-19 21:49:34 -04:00
anti	62f7c88b90	feat(swarmctl): --tls with auto-issued or BYOC server cert swarmctl CLI gains --tls/--cert/--key/--client-ca flags. With --tls the controller runs uvicorn under HTTPS + mTLS (CERT_REQUIRED) so worker heartbeats can reach it cross-host. Default is still 127.0.0.1 plaintext for backwards compat with the master-CLI enrollment flow. Auto-issue path (no --cert/--key given): a server cert signed by the existing DECNET CA is issued once and parked under ~/.decnet/swarmctl/. Workers already ship that CA's ca.crt from the enroll bundle, so they verify the endpoint with no extra trust config. BYOC via --cert/--key when the operator wants a publicly-trusted or externally-managed cert. The auto-cert path is idempotent across restarts to keep a stable fingerprint for any long-lived mTLS sessions.	2026-04-19 21:46:32 -04:00
anti	148e51011c	feat(swarm): agent→master heartbeat with per-host cert pinning New POST /swarm/heartbeat on the swarm controller. Workers post every ~30s with the output of executor.status(); the master bumps SwarmHost.last_heartbeat and re-upserts each DeckyShard with a fresh DeckyConfig snapshot and runtime-derived state (running/degraded). Security: CA-signed mTLS alone is not sufficient — a decommissioned worker's still-valid cert could resurrect ghost shards. The endpoint extracts the presented peer cert (primary: scope["extensions"]["tls"], fallback: transport.get_extra_info("ssl_object")) and SHA-256-pins it to the SwarmHost.client_cert_fingerprint stored for the claimed host_uuid. Extraction is factored into _extract_peer_fingerprint so tests can exercise both uvicorn scope shapes and the both-unavailable fail-closed path without mocking uvicorn's TLS pipeline. Adds get_swarm_host_by_fingerprint to the repo interface (SQLModel impl reuses the indexed client_cert_fingerprint column).	2026-04-19 21:37:15 -04:00
anti	f576564f02	fix(agent): also wipe /etc/decnet during self-destruct	2026-04-19 21:04:31 -04:00
anti	00d5799a79	fix(agent): escape systemd cgroup when spawning self-destruct reaper The reaper was being SIGTERM'd mid-rm because `start_new_session=True` only forks a new POSIX session — it does not escape decnet-agent.service's cgroup. When the reaper ran `systemctl stop decnet-agent`, systemd tore down the whole cgroup (reaper included) before `rm -rf /opt/decnet*` finished, leaving the install on disk. Spawn the reaper via `systemd-run --collect --unit decnet-reaper-<pid>` so it runs in a fresh transient scope, outside the agent unit. Falls back to bare Popen for non-systemd hosts.	2026-04-19 21:00:43 -04:00
anti	14250cacad	feat(swarm): self-destruct agent on decommission Decommissioning a worker from the dashboard (or swarm controller) now asks the agent to wipe its own install before the master forgets it. The agent stops decky containers + every decnet-* systemd unit, then deletes /opt/decnet, /etc/systemd/system/decnet-, /var/lib/decnet/, and /usr/local/bin/decnet. Logs under /var/log are preserved. The reaper runs as a detached /tmp script (start_new_session=True) so it survives the agent process being killed. Self-destruct dispatch is best-effort — a dead worker doesn't block master-side cleanup.	2026-04-19 20:47:09 -04:00
anti	9d68bb45c7	feat(web): async teardowns — 202 + background task, UI allows parallel queue Teardowns were synchronous all the way through: POST blocked on the worker's docker-compose-down cycle (seconds to minutes), the frontend locked tearingDown to a single string so only one button could be armed at a time, and operators couldn't queue a second teardown until the first returned. On a flaky worker that meant staring at a spinner for the whole RTT. Backend: POST /swarm/hosts/{uuid}/teardown returns 202 the instant the request is validated. Affected shards flip to state='tearing_down' synchronously before the response so the UI reflects progress immediately, then the actual AgentClient call + DB cleanup run in an asyncio.create_task (tracked in a module-level set to survive GC and to be drainable by tests). On failure the shard flips to 'teardown_failed' with the error recorded — nothing is re-raised, since there's no caller to catch it. Frontend: swap tearingDown / decommissioning from 'string \| null' to 'Set<string>'. Each button tracks its own in-flight state; the poll loop picks up the final shard state from the backend. Multiple teardowns can now be queued without blocking each other.	2026-04-19 20:30:56 -04:00
anti	07ec4bc269	fix(fleet): INI fully replaces prior decky state on redeploy Submitting an INI with a single [decky1] was silently redeploying the deckies from the previous deploy too. POST /deckies/deploy merged the new INI into the stored DecnetConfig by name, so a 1-decky INI on top of a prior 3-decky run still pushed 3 deckies to the worker. Those stale decky2/decky3 kept their old IPs, collided on the parent NIC, and the agent failed with 'Address already in use' — the deploy the operator never asked for. The INI is the source of truth for which deckies exist this deploy. Full replace: config.deckies = list(new_decky_configs). Operators who want to add more deckies should list them all in the INI. Update the deploy-limit test to reflect the new replace semantics, and add a regression test asserting prior state is dropped.	2026-04-19 20:24:29 -04:00
anti	df18cb44cc	fix(swarm): don't paint healthy deckies as failed when a shard-sibling fails docker compose up is partial-success-friendly — a build failure on one service doesn't roll back the others. But the master was catching the agent's 500 and tagging every decky in the shard as 'failed' with the same error message. From the UI that looked like all three deckies died even though two were live on the worker. On dispatch exception, probe the agent's /status to learn which deckies actually have running containers, and upsert per-decky state accordingly. Only fall back to marking the whole shard failed if the status probe itself is unreachable. Enhance agent.executor.status() to include a 'runtime' map keyed by decky name with per-service container state, so the master has something concrete to consult.	2026-04-19 20:11:08 -04:00

... 3 4 5 6 7 ...

434 Commits