DECNET

Author	SHA1	Message	Date
anti	195580c74d	test: fix templates paths, CLI gating, and stress-suite harness - tests/*: update templates/ → decnet/templates/ paths after module move - tests/mysql_spinup.sh: use root:root and asyncmy driver - tests/test_auto_spawn.py: patch decnet.cli.utils._pid_dir (package split) - tests/test_cli.py: set DECNET_MODE=master in api-command tests - tests/stress/conftest.py: run locust out-of-process via its CLI + CSV stats shim to avoid urllib3 RecursionError from late gevent monkey-patch; raise uvicorn startup timeout to 60s, accept 401 from auth-gated health, strip inherited DECNET_ env, surface stderr on 0-request runs - tests/stress/test_stress.py: loosen baseline thresholds to match hw	2026-04-19 23:50:53 -04:00
anti	262a84ca53	refactor(cli): split decnet/cli.py monolith into decnet/cli/ package The 1,878-line cli.py held every Typer command plus process/HTTP helpers and mode-gating logic. Split into one module per command using a register(app) pattern so submodules never import app at module scope, eliminating circular-import risk. - utils.py: process helpers, _http_request, _kill_all_services, console, log - gating.py: MASTER_ONLY_* sets, _require_master_mode, _gate_commands_by_mode - deploy.py: deploy + _deploy_swarm (tightly coupled) - lifecycle.py: status, teardown, redeploy - workers.py: probe, collect, mutate, correlate - inventory.py, swarm.py, db.py, and one file per remaining command __init__.py calls register(app) on each module then runs the mode gate last, and re-exports the private symbols tests patch against (_db_reset_mysql_async, _kill_all_services, _require_master_mode, etc.). Test patches retargeted to the submodule where each name now resolves. Enroll-bundle tarball test updated to assert decnet/cli/__init__.py. No behavioral change.	2026-04-19 22:42:52 -04:00
anti	d1b7e94325	fix(swarm): inject peer cert into ASGI scope for uvicorn <= 0.44 Uvicorn's h11/httptools HTTP protocols don't populate scope['extensions']['tls'], so /swarm/heartbeat's per-request cert pinning was 403ing every call despite CERT_REQUIRED validating the cert at handshake. Patch RequestResponseCycle.__init__ on both protocol modules to read the peer cert off the asyncio transport and write DER bytes into scope['extensions']['tls']['client_cert_chain']. Importing the module from swarm_api.py auto-installs the patch in the swarmctl uvicorn worker before any request is served.	2026-04-19 22:09:11 -04:00
anti	bf01804736	feat(agent): periodic heartbeat loop posting status to swarmctl New decnet.agent.heartbeat asyncio loop wired into the agent FastAPI lifespan. Every 30 s the worker POSTs executor.status() to the master's /swarm/heartbeat with its DECNET_HOST_UUID for self-identity; the existing agent mTLS bundle provides the client cert the master pins against SwarmHost.client_cert_fingerprint. start() is a silent no-op when identity env (HOST_UUID, MASTER_HOST) is unset or the worker bundle is missing, so dev runs and un-enrolled hosts don't crash the agent app. On non-204 responses the loop logs loudly but keeps ticking — an operator may re-enrol mid-session, and fail-closed pinning shouldn't be self-silencing.	2026-04-19 21:49:34 -04:00
anti	62f7c88b90	feat(swarmctl): --tls with auto-issued or BYOC server cert swarmctl CLI gains --tls/--cert/--key/--client-ca flags. With --tls the controller runs uvicorn under HTTPS + mTLS (CERT_REQUIRED) so worker heartbeats can reach it cross-host. Default is still 127.0.0.1 plaintext for backwards compat with the master-CLI enrollment flow. Auto-issue path (no --cert/--key given): a server cert signed by the existing DECNET CA is issued once and parked under ~/.decnet/swarmctl/. Workers already ship that CA's ca.crt from the enroll bundle, so they verify the endpoint with no extra trust config. BYOC via --cert/--key when the operator wants a publicly-trusted or externally-managed cert. The auto-cert path is idempotent across restarts to keep a stable fingerprint for any long-lived mTLS sessions.	2026-04-19 21:46:32 -04:00
anti	148e51011c	feat(swarm): agent→master heartbeat with per-host cert pinning New POST /swarm/heartbeat on the swarm controller. Workers post every ~30s with the output of executor.status(); the master bumps SwarmHost.last_heartbeat and re-upserts each DeckyShard with a fresh DeckyConfig snapshot and runtime-derived state (running/degraded). Security: CA-signed mTLS alone is not sufficient — a decommissioned worker's still-valid cert could resurrect ghost shards. The endpoint extracts the presented peer cert (primary: scope["extensions"]["tls"], fallback: transport.get_extra_info("ssl_object")) and SHA-256-pins it to the SwarmHost.client_cert_fingerprint stored for the claimed host_uuid. Extraction is factored into _extract_peer_fingerprint so tests can exercise both uvicorn scope shapes and the both-unavailable fail-closed path without mocking uvicorn's TLS pipeline. Adds get_swarm_host_by_fingerprint to the repo interface (SQLModel impl reuses the indexed client_cert_fingerprint column).	2026-04-19 21:37:15 -04:00
anti	f576564f02	fix(agent): also wipe /etc/decnet during self-destruct	2026-04-19 21:04:31 -04:00
anti	00d5799a79	fix(agent): escape systemd cgroup when spawning self-destruct reaper The reaper was being SIGTERM'd mid-rm because `start_new_session=True` only forks a new POSIX session — it does not escape decnet-agent.service's cgroup. When the reaper ran `systemctl stop decnet-agent`, systemd tore down the whole cgroup (reaper included) before `rm -rf /opt/decnet*` finished, leaving the install on disk. Spawn the reaper via `systemd-run --collect --unit decnet-reaper-<pid>` so it runs in a fresh transient scope, outside the agent unit. Falls back to bare Popen for non-systemd hosts.	2026-04-19 21:00:43 -04:00
anti	14250cacad	feat(swarm): self-destruct agent on decommission Decommissioning a worker from the dashboard (or swarm controller) now asks the agent to wipe its own install before the master forgets it. The agent stops decky containers + every decnet-* systemd unit, then deletes /opt/decnet, /etc/systemd/system/decnet-, /var/lib/decnet/, and /usr/local/bin/decnet. Logs under /var/log are preserved. The reaper runs as a detached /tmp script (start_new_session=True) so it survives the agent process being killed. Self-destruct dispatch is best-effort — a dead worker doesn't block master-side cleanup.	2026-04-19 20:47:09 -04:00
anti	9d68bb45c7	feat(web): async teardowns — 202 + background task, UI allows parallel queue Teardowns were synchronous all the way through: POST blocked on the worker's docker-compose-down cycle (seconds to minutes), the frontend locked tearingDown to a single string so only one button could be armed at a time, and operators couldn't queue a second teardown until the first returned. On a flaky worker that meant staring at a spinner for the whole RTT. Backend: POST /swarm/hosts/{uuid}/teardown returns 202 the instant the request is validated. Affected shards flip to state='tearing_down' synchronously before the response so the UI reflects progress immediately, then the actual AgentClient call + DB cleanup run in an asyncio.create_task (tracked in a module-level set to survive GC and to be drainable by tests). On failure the shard flips to 'teardown_failed' with the error recorded — nothing is re-raised, since there's no caller to catch it. Frontend: swap tearingDown / decommissioning from 'string | null' to 'Set<string>'. Each button tracks its own in-flight state; the poll loop picks up the final shard state from the backend. Multiple teardowns can now be queued without blocking each other.	2026-04-19 20:30:56 -04:00
anti	07ec4bc269	fix(fleet): INI fully replaces prior decky state on redeploy Submitting an INI with a single [decky1] was silently redeploying the deckies from the previous deploy too. POST /deckies/deploy merged the new INI into the stored DecnetConfig by name, so a 1-decky INI on top of a prior 3-decky run still pushed 3 deckies to the worker. Those stale decky2/decky3 kept their old IPs, collided on the parent NIC, and the agent failed with 'Address already in use' — the deploy the operator never asked for. The INI is the source of truth for which deckies exist this deploy. Full replace: config.deckies = list(new_decky_configs). Operators who want to add more deckies should list them all in the INI. Update the deploy-limit test to reflect the new replace semantics, and add a regression test asserting prior state is dropped.	2026-04-19 20:24:29 -04:00
anti	df18cb44cc	fix(swarm): don't paint healthy deckies as failed when a shard-sibling fails docker compose up is partial-success-friendly — a build failure on one service doesn't roll back the others. But the master was catching the agent's 500 and tagging every decky in the shard as 'failed' with the same error message. From the UI that looked like all three deckies died even though two were live on the worker. On dispatch exception, probe the agent's /status to learn which deckies actually have running containers, and upsert per-decky state accordingly. Only fall back to marking the whole shard failed if the status probe itself is unreachable. Enhance agent.executor.status() to include a 'runtime' map keyed by decky name with per-service container state, so the master has something concrete to consult.	2026-04-19 20:11:08 -04:00
anti	91549e6936	fix(deploy): prevent 'Address already in use' from stale IPAM and half-torn-down containers Two compounding root causes produced the recurring 'Address already in use' error on redeploy: 1. _ensure_network only compared driver+name; if a prior deploy's IPAM pool drifted (different subnet/gateway/range), Docker kept handing out addresses from the old pool and raced the real LAN. Now also compares Subnet/Gateway/IPRange and rebuilds on drift. 2. A prior half-failed 'up' could leave containers still holding the IPs and ports the new run wants. Run 'compose down --remove-orphans' as a best-effort pre-up cleanup so IPAM starts from a clean state. Also surface docker compose stderr to the structured log on failure so the agent's journal captures Docker's actual message (which IP, which port) instead of just the exit code.	2026-04-19 19:59:06 -04:00
anti	585541016f	fix(engine): teardown(decky_id=...) built malformed service names The nested list-comp `[f"{id}-{svc}" for svc in [d.services for d ...]]` iterated over a list of lists, so `svc` was the whole services list and f-string stringified it -> `decky3-['sip']`. docker compose saw "no such service" and the per-decky teardown failed 500. Flatten: find the matching decky once, then iterate its services. Noop early on unknown decky_id and on empty service lists. Regression test asserts the emitted compose args have no '[' or quote characters.	2026-04-19 19:42:42 -04:00
anti	5dad1bb315	feat(swarm): remote teardown API + UI (per-decky and per-host) Agents already exposed POST /teardown; the master was missing the plumbing to reach it. Add: - POST /api/v1/swarm/hosts/{uuid}/teardown — admin-gated. Body {decky_id: str|null}: null tears the whole host, a value tears one decky. On worker failure the master returns 502 and leaves DB shards intact so master and agent stay aligned. - BaseRepository.delete_decky_shard(name) + sqlmodel impl for per-decky cleanup after a single-decky teardown. - SwarmHosts page: "Teardown all" button (keeps host enrolled). - SwarmDeckies page: per-row "Teardown" button. Also exclude setuptools' build/ staging dir from the enrollment tarball — `pip install -e` on the master generates build/lib/decnet_web/node_modules and the bundle walker was leaking it to agents. Align pyproject's bandit exclude with the git-hook invocation so both skip decnet/templates/.	2026-04-19 19:39:28 -04:00
anti	6708f26e6b	fix(packaging): move templates/ into decnet/ package so they ship with pip install The docker build contexts and syslog_bridge.py lived at repo root, which meant setuptools (include = ["decnet"]) never shipped them. Agents installed via `pip install $RELEASE_DIR` got site-packages/decnet/* but no templates/, so every deploy blew up in deployer._sync_logging_helper with FileNotFoundError on templates/syslog_bridge.py. Move templates/ -> decnet/templates/ and declare it as setuptools package-data. Path resolutions in services/*.py and engine/deployer.py drop one .parent since templates now lives beside the code. Test fixtures, bandit exclude path, and coverage omit glob updated to match.	2026-04-19 19:30:04 -04:00
anti	2bef3edb72	feat(swarm): unbundle master-only code from agent tarball + sync systemd units on update Agents now ship with collector/prober/sniffer as systemd services; mutator, profiler, web, and API stay master-only (profiler rebuilds attacker profiles against the master DB — no per-host DB exists). Expand _EXCLUDES to drop the full decnet/web, decnet/mutator, decnet/profiler, and decnet_web trees from the enrollment bundle. Updater now calls _heal_path_symlink + _sync_systemd_units after rotation so fleets pick up new unit files and /usr/local/bin/decnet tracks the shared venv without a manual reinstall. daemon-reload runs once per update when any unit changed. Fix _service_registry matchers to accept systemd-style /usr/local/bin/decnet cmdlines (psutil returns a list — join to string before substring-checking) so agent-mode `decnet status` reports collector/prober/sniffer correctly.	2026-04-19 19:19:17 -04:00
anti	d2cf1e8b3a	feat(updater): sync systemd unit files and daemon-reload on update The bootstrap installer copies etc/systemd/system/*.service into /etc/systemd/system at enrollment time, but the updater was skipping that step — a code push could not ship a new unit (e.g. the four per-host microservices added this session) or change ExecStart on an existing one. systemctl alone doesn't re-read unit files; daemon-reload is required. run_update / run_update_self now call _sync_systemd_units after rotation: diff each .service file against the live copy, atomically replace changed ones, then issue a single `systemctl daemon-reload`. No-op on legacy tarballs that don't ship etc/systemd/system/.	2026-04-19 19:07:24 -04:00
anti	6d7877c679	feat(swarm): per-host microservices as systemd units, mutator off agents Previously `decnet status` on an agent showed every microservice as DOWN because deploy's auto-spawn was unihost-scoped and the agent CLI gate hid the per-host commands. Now: - collect, probe, profiler, sniffer drop out of MASTER_ONLY_COMMANDS (they run per-host; master-side work stays master-gated). - mutate stays master-only (it orchestrates swarm-wide respawns). - decnet/mutator/ excluded from agent tarballs — never invoked there. - decnet/web exclusion tightened: ship db/ + auth.py + dependencies.py (profiler needs the repo singleton), drop api.py, swarm_api.py, ingester.py, router/, templates/. - Four new systemd unit templates (decnet-collector/prober/profiler/sniffer) shipped in every enrollment tarball. - enroll_bootstrap.sh enables + starts all four alongside agent and forwarder at install time. - updater restarts the aux units on code push so they pick up the new release (best-effort — legacy enrollments without the units won't fail the update). - status table hides Mutator + API rows in agent mode.	2026-04-19 18:58:48 -04:00
anti	ee9ade4cd5	feat(enroll): strip master API and frontend from agent tarball Agents never run the FastAPI master app (decnet/web/) or serve the React frontend (decnet_web/) — they run decnet.agent, decnet.updater, and decnet.forwarder, none of which import decnet.web. Shipping the master tree bloats every enrollment payload and needlessly widens the worker's attack surface. Excluded paths are unreachable on the worker (all cli.py imports of decnet.web are inside master-only command bodies that the agent-mode gate strips). Tests assert neither tree leaks into the tarball.	2026-04-19 18:47:03 -04:00
anti	f91ba9a16e	feat(cli): allow `decnet status` in agent mode Agents run deckies locally and need to inspect their own state. Removed `status` from MASTER_ONLY_COMMANDS so it survives the agent-mode gate. Useful for validating remote updater pushes from the master.	2026-04-19 18:29:41 -04:00
anti	43b92c7bd6	fix(updater): restart agent+forwarder+self via systemd on push Three holes in the systemd integration: 1. _spawn_agent_via_systemd only restarted decnet-agent.service, leaving decnet-forwarder.service running the pre-update code (same /opt/decnet tree, stale import cache). 2. run_update_self used os.execv regardless of environment — the re-execed process kept the updater's existing cgroup/capability inheritance but systemd would notice MainPID change and mark the unit degraded. 3. No path to surface a failed forwarder restart (legacy enrollments have no forwarder unit). Now: agent restart first, forwarder restart as best-effort (logged but non-fatal so legacy workers still update), MainPID still read from the agent unit. For update-self under systemd, spawn a detached sleep + systemctl restart so the HTTP response flushes before the unit cycles.	2026-04-19 18:23:10 -04:00
anti	a0a241f65d	feat(enroll): decnet-updater now runs under systemd, not a --daemon fork Bootstrap used to end with `decnet updater --daemon` which forks and detaches — invisible to systemctl, no auto-restart, dies on reboot. Ships a decnet-updater.service template matching the pattern of the other units (Restart=on-failure, log to /var/log/decnet/decnet.updater.log, certs from /etc/decnet/updater, install tree at /opt/decnet), bundles it alongside agent/forwarder/engine units, and the installer now `systemctl enable --now`s it when --with-updater is set.	2026-04-19 18:19:24 -04:00
anti	42b5e4cd06	fix(network): replace decnet_lan when driver differs (macvlan<->ipvlan) The create helpers short-circuited on name alone, so a prior macvlan deploy left Docker's decnet_lan network in place. A subsequent ipvlan deploy would no-op the network create, then container attach would try to add a macvlan port on enp0s3 that already had an ipvlan slave — EBUSY, agent 500, docker ps empty. Now: when the existing network's driver disagrees with the requested one, disconnect any live containers and DROP it before recreating. Parent-NIC can host one driver at a time. Also: setup_host_{macvlan,ipvlan} opportunistically delete the opposite host-side helper so we don't leave cruft across driver swaps.	2026-04-19 18:12:28 -04:00
anti	5df995fda1	feat(enroll): opt-in IPvlan per-agent for Wi-Fi-bridged VMs Wi-Fi APs bind one MAC per associated station, so VirtualBox/VMware guests bridged over Wi-Fi rotate the VM's DHCP lease when Docker's macvlan starts emitting container-MAC frames through the vNIC. Adds a `use_ipvlan` toggle on the Agent Enrollment tab (mirrors the updater daemon checkbox): flips the flag on SwarmHost, bakes `ipvlan=true` into the agent's decnet.ini, and `_worker_config` forces ipvlan=True on the per-host shard at dispatch. Safe no-op on wired/bare-metal agents.	2026-04-19 17:57:45 -04:00
anti	6d7567b6bb	fix(fleet): reset stale host_uuid on carried-over deckies before dispatch Deckies merged in from a prior deployment's saved state kept their original host_uuid — which dispatch_decnet_config then 404'd on if that host had since been decommissioned or re-enrolled at a different uuid. Before round-robin assignment, drop any host_uuid that isn't in the live swarm_hosts set so orphaned entries get reassigned instead of exploding with 'unknown host_uuid'.	2026-04-19 06:27:34 -04:00
anti	b883f24ba2	fix(engine): pin docker compose project name to avoid empty-basename failure systemd daemons run with WorkingDirectory=/ by default; docker compose derives the project name from basename(cwd), which is empty at '/', and aborts with 'project name must not be empty'. Pass -p decnet explicitly so the project name is independent of cwd, and set WorkingDirectory=/opt/decnet on the three DECNET units so compose artifacts (decnet-compose.yml, build contexts) also land in the install dir.	2026-04-19 06:17:30 -04:00
anti	79db999030	feat(fleet): auto-swarm deploy — shard across enrolled workers when master POST /deckies/deploy now branches on DECNET_MODE + enrolled host presence: when the caller is a master with at least one reachable swarm host, round-robin host_uuids are assigned over new deckies and the config is dispatched via AgentClient. Falls back to local docker-compose otherwise. Extracts the dispatch loop from api_deploy_swarm into dispatch_decnet_config so both endpoints share the same shard/dispatch/persist path. Adds GET /system/deployment-mode for the UI to show 'will shard across N hosts' vs 'will deploy locally' before the operator clicks deploy.	2026-04-19 06:09:08 -04:00
anti	899ea559d9	feat(enroll): systemd units for agent/forwarder/engine + log-directory INI key Rename log-file-path -> log-directory (maps to DECNET_LOG_DIRECTORY). Bundle now ships three systemd units rendered with agent_name/master_host and installs them into /etc/systemd/system/. Bootstrap replaces direct 'decnet X --daemon' calls with systemctl enable --now. Each unit pins DECNET_SYSTEM_LOGS so agent, forwarder, and deckies logs land at decnet.{agent,forwarder}.log and decnet.log under /var/log/decnet.	2026-04-19 05:46:08 -04:00
anti	ff4c993617	refactor(swarm-mgmt): backfill host address from agent's .tgz source IP	2026-04-19 05:20:29 -04:00
anti	e32fdf9cbf	feat(swarm-mgmt): agent_host + updater opt-in; prevent duplicate forwarder spawn	2026-04-19 05:12:55 -04:00
anti	95ae175e1b	fix(swarm-mgmt): exclude .env from bundle, chmod +x decnet, mkdir log	2026-04-19 04:58:55 -04:00
anti	b4df9ea0a1	fix(swarm-mgmt): bundle URLs target master_host, not dashboard base_url	2026-04-19 04:52:20 -04:00
anti	c6f7de30d2	feat(swarm-mgmt): agent enrollment bundle flow + admin swarm endpoints	2026-04-19 04:25:57 -04:00
anti	37b22b76a5	feat(cli): auto-spawn listener as detached sibling from decnet swarmctl Mirrors the agent→forwarder pattern: `decnet swarmctl` now fires the syslog-TLS listener as a detached Popen sibling so a single master invocation brings the full receive pipeline online. --no-listener opts out for operators who want to run the listener on a different host (or under their own systemd unit). Listener bind host / port come from DECNET_LISTENER_HOST and DECNET_SWARM_SYSLOG_PORT — both seedable from /etc/decnet/decnet.ini. PID at $(pid_dir)/listener.pid so operators can kill/restart manually. decnet.ini.example ships alongside env.config.example as the documented surface for the new role-scoped config. Mode, forwarder targets, listener bind, and master ports all live there — no more memorizing flag trees. Extends tests/test_auto_spawn.py with two swarmctl cases: listener is spawned with the expected argv + PID file, and --no-listener suppresses.	2026-04-19 03:25:40 -04:00
anti	43f140a87a	feat(cli): auto-spawn forwarder as detached sibling from decnet agent New _spawn_detached(argv, pid_file) helper uses Popen with start_new_session=True + close_fds=True + DEVNULL stdio to launch a DECNET subcommand as a fully independent process. The parent does NOT wait(); if it dies the child survives under init. This is deliberately not a supervisor — if the child dies the operator restarts it manually. _pid_dir() picks /opt/decnet when writable else ~/.decnet, so both root-run production and non-root dev work without ceremony. `decnet agent` now auto-spawns `decnet forwarder --daemon ...` as that detached sibling, pulling master host + syslog port from DECNET_SWARM_MASTER_HOST / DECNET_SWARM_SYSLOG_PORT. --no-forwarder opts out. If DECNET_SWARM_MASTER_HOST is unset the auto-spawn is silently skipped (single-host dev or operator wants to start the forwarder separately). tests/test_auto_spawn.py monkeypatches subprocess.Popen and verifies: the detach kwargs are passed, the PID file exists and contains a valid positive integer (PID-file corruption is a real operational headache — catching bad writes at the test level is free), the --no-forwarder flag suppresses the spawn, and the unset-master-host path silently skips.	2026-04-19 03:23:42 -04:00
anti	3223bec615	feat(cli): gate master-only commands when DECNET_MODE=agent - MASTER_ONLY_COMMANDS / MASTER_ONLY_GROUPS frozensets enumerate every command a worker host must not see. Comment block at the declaration puts the maintenance obligation in front of anyone touching command registration. - _gate_commands_by_mode() filters both app.registered_commands (for @app.command() registrations) and app.registered_groups (for add_typer sub-apps) so the 'swarm' group disappears along with 'api', 'swarmctl', 'deploy', etc. on agent hosts. - _require_master_mode() is the belt-and-braces in-function guard, added to the four highest-risk commands (api, swarmctl, deploy, teardown). Protects against direct function imports that would bypass Typer. - DECNET_DISALLOW_MASTER=false is the escape hatch for hybrid dev hosts that legitimately play both roles. tests/test_mode_gating.py exercises help-text listings via subprocess and the defence-in-depth guard via direct import.	2026-04-19 03:20:48 -04:00
anti	65fc9ac2b9	fix(tests): clean up two pre-existing failures before config work - decnet/agent/app.py /health: drop leftover 'push-test-2' canary planted during live VM push verification and never cleaned up; test_health_endpoint asserts the exact dict shape. - tests/test_factory.py: switch the lazy-engine check from mysql+aiomysql (not in pyproject) to mysql+asyncmy (the driver the project actually ships). The test does not hit the wire so the dialect swap is safe. Both were red on `pytest tests/` before any config/auto-spawn work began; fixing them here so the upcoming commits land on a green full-suite baseline.	2026-04-19 03:17:17 -04:00
anti	1e8b73c361	feat(config): add /etc/decnet/decnet.ini loader New decnet/config_ini.py parses a role-scoped INI file via stdlib configparser and seeds os.environ via setdefault — real env vars still win, keeping full back-compat with .env.local flows. [decnet] holds role-agnostic keys (mode, disallow-master, log-file-path); the role section matching `mode` is loaded, the other is ignored silently so a worker never reads master-only keys (and vice versa). Loader is standalone in this commit — not wired into cli.py yet.	2026-04-19 03:10:51 -04:00
anti	9b1299458d	fix(env): resolve DECNET_JWT_SECRET lazily so agent/updater subcommands don't need it The module-level _require_env('DECNET_JWT_SECRET') call blocked `decnet agent` and `decnet updater` from starting on workers that legitimately have no business knowing the master's JWT signing key. Move the resolution into a module `__getattr__`: only consumers that actually read `decnet.env.DECNET_JWT_SECRET` trigger the validation, which in practice means only decnet.web.auth (master-side). Adds tests/test_env_lazy_jwt.py covering both the in-process lazy path and an out-of-process `decnet agent --help` subprocess check with a fully sanitized environment.	2026-04-19 02:43:25 -04:00
anti	a266d6b17e	feat(web): Remote Updates API — dashboard endpoints for pushing code to workers Adds /api/v1/swarm-updates/{hosts,push,push-self,rollback} behind require_admin. Reuses the existing UpdaterClient + tar_working_tree + the per-host asyncio.gather pattern from api_deploy_swarm.py; tarball is built exactly once per /push request and fanned out to every selected worker. /hosts filters out decommissioned hosts and agent-only enrollments (no updater bundle = not a target). Connection drops during /update-self are treated as success — the updater re-execs itself mid-response, so httpx always raises. Pydantic models live in decnet/web/db/models.py (single source of truth). 24 tests cover happy paths, rollback, transport failures, include_self ordering (skip on rolled-back agents), validation, and RBAC gating.	2026-04-19 01:01:09 -04:00
anti	f5a5fec607	feat(deploy): systemd units w/ capability-based hardening; updater restarts agent via systemctl Add deploy/ unit files for every DECNET daemon (agent, updater, api, web, swarmctl, listener, forwarder). All run as User=decnet with NoNewPrivileges, ProtectSystem, PrivateTmp, LockPersonality; AmbientCapabilities=CAP_NET_ADMIN CAP_NET_RAW only on the agent (MACVLAN/scapy). Existing api/web units migrated to /opt/decnet layout and the same hardening stanza. Make the updater's _spawn_agent systemd-aware: under systemd (detected via INVOCATION_ID + systemctl on PATH), `systemctl restart decnet-agent.service` replaces the Popen path so the new agent inherits the unit's ambient caps instead of the updater's empty set. _stop_agent becomes a no-op in that mode to avoid racing systemctl's own stop phase. Tests cover the dispatcher branch selection, MainPID parsing, and the systemd no-op stop.	2026-04-19 00:44:06 -04:00
anti	ebeaf08a49	fix(updater): fall back to /proc scan when agent.pid is missing If the agent was started outside the updater (manually, during dev, or from a prior systemd unit), there is no agent.pid for _stop_agent to target, so a successful code install leaves the old in-memory agent process still serving requests. Scan /proc for any decnet agent command and SIGTERM all matches so restart is reliable regardless of how the agent was originally launched.	2026-04-18 23:42:26 -04:00
anti	7765b36c50	feat(updater): remote self-update daemon with auto-rollback Adds a separate `decnet updater` daemon on each worker that owns the agent's release directory and installs tarball pushes from the master over mTLS. A normal `/update` never touches the updater itself, so the updater is always a known-good rescuer if a bad agent push breaks /health — the rotation is reversed and the agent restarted against the previous release. `POST /update-self` handles updater upgrades explicitly (no auto-rollback). - decnet/updater/: executor, FastAPI app, uvicorn launcher - decnet/swarm/updater_client.py, tar_tree.py: master-side push - cli: `decnet updater`, `decnet swarm update [--host|--all] [--include-self] [--dry-run]`, `--updater` on `swarm enroll` - enrollment API issues a second cert (CN=updater@<host>) signed by the same CA; SwarmHost records updater_cert_fingerprint - tests: executor, app, CLI, tar tree, enroll-with-updater (37 new) - wiki: Remote-Updates page + sidebar + SWARM-Mode cross-link	2026-04-18 21:40:21 -04:00
anti	8914c27220	feat(swarm): add `decnet swarm deckies` to list deployed shards by host `swarm list` only shows enrolled workers — there was no way to see which deckies are running and where. Adds GET /swarm/deckies on the controller (joins DeckyShard with SwarmHost for name/address/status) plus the CLI wrapper with --host / --state filters and --json.	2026-04-18 21:10:07 -04:00
anti	4db9c7464c	fix(swarm): relocalize master-built config on worker before deploy deploy --mode swarm was failing on every heterogeneous fleet: the master populates config.interface from its own box (detect_interface() → its default NIC), then ships that verbatim. The worker's deployer then calls get_host_ip(config.interface), hits 'ip addr show wlp6s0' on a VM whose NIC is enp0s3, and 500s. Fix: agent.executor._relocalize() runs on every swarm-mode deploy. Re-detects the worker's interface/subnet/gateway/host_ip locally and swaps them into the config before calling deployer.deploy(). When the worker's subnet doesn't match the master's, decky IPs are re-allocated from the worker's subnet via allocate_ips() so they're reachable. Unihost-mode configs are left untouched — they're already built against the local box and second-guessing them would be wrong. Validated against anti@192.168.1.13: master dispatched interface=wlp6s0, agent logged 'relocalized interface=enp0s3', deployer ran successfully, dry-run returned ok=deployed. 4 new tests cover both branches (matching-subnet preserves decky IPs; mismatch re-allocates), the end-to-end executor.deploy() path, and the unihost short-circuit.	2026-04-18 20:41:21 -04:00
anti	411a797120	feat(cli): add decnet swarm check wrapper for POST /swarm/check The swarmctl API already exposes POST /swarm/check — an active mTLS probe that refreshes SwarmHost.status + last_heartbeat for every enrolled worker. The CLI was missing a wrapper, so operators had to curl the endpoint directly (which is how the VM validation run did it, and how the wiki Deployment-Modes / SWARM-Mode pages ended up doc'ing a command that didn't exist yet). Matches the existing list/enroll/decommission pattern: typer subcommand under swarm_app, --url override, Rich table output plus --json for scripting. Three tests: populated table, empty-swarm path, and --json emission.	2026-04-18 20:28:34 -04:00
anti	bfc7af000a	test(swarm): add forwarder/listener resilience scenarios Covers failure modes the happy-path tests miss: - log rotation (copytruncate): st_size shrinks under the forwarder, it resets offset=0 and reships the new contents instead of getting wedged past EOF; - listener restart: forwarder retries, resumes from the persisted offset, and the previously-acked lines are NOT duplicated on the master; - listener tolerates a well-authenticated client that sends a partial octet-count frame and drops — the server must stay up and accept follow-on connections; - peer_cn / fingerprint_from_ssl degrade to 'unknown' / None when no peer cert is available (defensive path that otherwise rarely fires).	2026-04-18 19:56:51 -04:00
anti	1e8ca4cc05	feat(swarm-cli): add `decnet swarm {enroll,list,decommission}` + `deploy --mode swarm` New sub-app talks HTTP to the local swarm controller (127.0.0.1:8770 by default; override with --url or $DECNET_SWARMCTL_URL). - enroll: POSTs /swarm/enroll, prints fingerprint, optionally writes ca.crt/worker.crt/worker.key to --out-dir for scp to the worker. - list: renders enrolled workers as a rich table (with --status filter). - decommission: looks up uuid by --name, confirms, DELETEs. deploy --mode swarm now: 1. fetches enrolled+active workers from the controller, 2. round-robin-assigns host_uuid to each decky, 3. POSTs the sharded DecnetConfig to /swarm/deploy, 4. renders per-worker pass/fail in a results table. Exits non-zero if no workers exist or any worker's dispatch failed.	2026-04-18 19:52:37 -04:00
anti	a6430cac4c	feat(swarm): add `decnet forwarder` CLI to run syslog-over-TLS forwarder The forwarder module existed but had no runner — closes that gap so the worker-side process can actually be launched and runs isolated from the agent (asyncio.run + SIGTERM/SIGINT → stop_event). Guards: refuses to start without a worker cert bundle or a resolvable master host ($DECNET_SWARM_MASTER_HOST or --master-host).	2026-04-18 19:41:37 -04:00
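
Illustration for commit 585541016f (malformed service names in per-decky teardown). The sketch below reproduces the nested list-comprehension bug described in that message and a flattened fix in the spirit of the described one. The `Decky` dataclass and both function names are hypothetical stand-ins, not the project's actual code; only the `decky3-['sip']` symptom and the "find the decky once, then iterate its services" approach come from the commit message.

```python
from dataclasses import dataclass, field


@dataclass
class Decky:
    """Hypothetical stand-in for a decky config entry."""
    name: str
    services: list[str] = field(default_factory=list)


def service_names_buggy(deckies: list[Decky], decky_id: str) -> list[str]:
    # The inner comprehension yields a list of lists, so `svc` is bound to a
    # whole services list and the f-string stringifies it.
    return [
        f"{decky_id}-{svc}"
        for svc in [d.services for d in deckies if d.name == decky_id]
    ]


def service_names_fixed(deckies: list[Decky], decky_id: str) -> list[str]:
    # Find the matching decky once, then iterate its services.
    match = next((d for d in deckies if d.name == decky_id), None)
    if match is None or not match.services:
        return []  # no-op on unknown decky_id or empty service list
    return [f"{decky_id}-{svc}" for svc in match.services]


deckies = [Decky("decky1", ["ssh", "http"]), Decky("decky3", ["sip"])]
print(service_names_buggy(deckies, "decky3"))  # ["decky3-['sip']"]
print(service_names_fixed(deckies, "decky3"))  # ['decky3-sip']
```

A regression check along the lines the commit describes would assert that no emitted name contains `[` or a quote character.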

196 Commits