21 Commits

Author SHA1 Message Date
f2b3393669 chore: relicense to AGPL-3.0-or-later and add SPDX headers
Replaces LICENSE (GPLv3 -> AGPLv3) and prepends
`SPDX-License-Identifier: AGPL-3.0-or-later` to every source file
across decnet/, decnet_web/, tests/, scripts/, and tools/.

Rationale: closes the GPLv3 ASP loophole so any party operating a
modified DECNET as a network service must offer their modified
source. Personal copyright (Samuel Paschuan) + inbound=outbound
contributions make a future unilateral relicense infeasible.

- LICENSE: full AGPL-3.0 text (gnu.org/licenses/agpl-3.0.txt)
- COPYRIGHT: project copyright notice
- tools/add_spdx_headers.py: idempotent header injector
  (shebang- and PEP 263-aware)

Touches 1565 source files (.py, .ts, .tsx, .js, .jsx, .css, .sh).
No behavior change; comments only.
2026-05-22 21:04:16 -04:00
4743c8f733 feat(api): /deckies/deploy and /mutate become 202 fire-and-forget
This is the unblock for the wizard hang. Both endpoints used to run
docker compose synchronously inside the HTTP handler -- on master
(unihost) or via asyncio.gather of worker /deploy POSTs at 600s
timeout each (swarm) -- blocking every other API request.

New flow:
  1. Commit the new config shape to repo state (fast).
  2. Create one DeckyLifecycle row per decky (status=pending).
  3. Spawn asyncio.create_task(run_deploy / run_mutate) -- the
     lifecycle runner drives rows through running -> succeeded|failed
     and emits decky.<name>.lifecycle on the bus.
  4. Return 202 with {lifecycle_ids: [...]}. Wizard polls
     GET /deckies/lifecycle?ids=... (next commit).

mutator/engine.py gains pick_new_services() -- shared between the
async API path and the watch-loop's synchronous mutate_decky().

DeployResponse grows lifecycle_ids[]. The old dispatch_decnet_config
helper still exists for the CLI swarm-deploy command path; it just
isn't called from the API handler anymore.

Test changes: 200 -> 202, drop dispatch_decnet_config mocks (handler
no longer calls it), assert lifecycle_ids in response + committed
state matches expectations.
2026-05-22 16:40:55 -04:00
ade8bbe30a feat(agent): real worker-side /mutate with master swarm dispatch
- Implement /mutate handler: load_state, update services + last_mutated,
  save_state, write_compose, compose up -d via asyncio.to_thread. 404
  for missing state / unknown decky_id. dry_run short-circuits before
  any side effect.
- Add AgentClient.mutate(decky_id, services, *, dry_run=False) using
  _TIMEOUT_DEPLOY (compose up can pull/build, exceeds control timeout).
- mutator/engine.py: in swarm mode with decky.host_uuid set, resolve
  worker via _resolve_swarm_host and dispatch through AgentClient.mutate
  instead of writing a compose file on master. Master-resident deckies
  (unihost mode, or swarm with host_uuid=None) keep the local path.
2026-05-22 16:14:46 -04:00
ee24a7551f fix(types): T7 — eliminate all remaining 38 mypy errors; fix DeckyRow subscript in engine tests 2026-05-01 02:07:53 -04:00
b9684254f0 fix(types): T5 — narrow AsyncClient|None with inline if; rename loop variable t→task to avoid no-redef 2026-05-01 01:53:10 -04:00
fc1f0914b7 refactor(topology): introduce TopologyRepository protocol with DTO return types
Replace repo: BaseRepository with a structural TopologyRepository protocol
in persistence.py and allocator.py. All read methods now return typed DTOs
(TopologySummary, LANRow, DeckyRow, EdgeRow) instead of raw dicts, eliminating
silent field-shape regressions across the topology subsystem.

TopologySummary gains email_personas and language_default so api_personas.py
can continue reading those fields via attribute access. hydrate() converts
DTOs to dicts before passing to _backfill_decky_configs, keeping the mutable
working-state function dict-based at its boundary. All production callers
(router handlers, mutator, CLI, heartbeat) migrated from dict/get access to
attribute access. 134 tests pass.
2026-04-30 23:51:41 -04:00
d314470d7f fix(stats): keep TopologyDecky.state in sync with docker so ACTIVE DECKIES counts right
Dashboard's ACTIVE DECKIES (active_deckies in get_stats_summary) counts
TopologyDecky rows where state='running'.  No code path was flipping
that state away from the default 'pending', so the count read 0/N
even when every container was running fine — the dashboard was lying.

Two complementary fixes:

1. deploy_topology — after the post-deploy compose ps verification,
   reconcile each TopologyDecky.state from the corresponding base
   container's docker state.  running → 'running'; anything else →
   'failed'.  Reuses the ps_rows already gathered for the
   ACTIVE-vs-DEGRADED status decision; no extra docker hit.

2. apply_add_decky — _materialise_decky_spawn now returns True/False;
   on True the row is updated to state='running' before
   _assert_valid_after.  Catches the case where a decky added via the
   live mutator queue stays at 'pending' indefinitely (the deployer's
   reconcile only runs on a fresh deploy_topology pass).

Existing topology deckies in active topologies will still read as
'pending' until the next deploy_topology runs, since this is
forward-only.  An operator-side fix is to teardown + redeploy or run
the (forthcoming) reconcile-on-startup pass.
2026-04-29 11:09:32 -04:00
57e527534c fix(mutator): auto-fall-back to legacy builder when buildx wedges live decky add
apply_add_decky's compose-up was hard-failing whenever the operator's
~/.docker/buildx/activity/ landed on a read-only mount — the wedge
detection in _compose_with_retry correctly refuses to retry (would
just leak more mounts), but for live materialisation we don't want a
wedged buildx state to abort an admin's mutation.  ANTI hit it on
adding decky-a977: 'failed to update builder last activity time: ...
read-only file system → buildx wedge detected → returned non-zero'.

_compose_up_with_buildkit_fallback wraps _compose_with_retry: on a
CalledProcessError whose stderr matches both wedge signatures
(_BUILDX_WEDGE_SIGNATURE + _BUILDX_EROFS_SIGNATURE), it logs a
warning with the manual recovery steps + retries once with
DOCKER_BUILDKIT=0 set.  The legacy non-buildx builder doesn't use
the activity dir and isn't affected.

Wired into the two paths that pass --build:
* _materialise_decky_spawn (apply_add_decky)
* _materialise_decky_services_diff (apply_update_decky service add)

_materialise_decky_recreate_base doesn't build — it just recreates a
container from an existing image — so it's not affected.

Operator-facing log message points at the manual fix
(rm -rf ~/.docker/buildx/activity + docker buildx create) so they
can recover at their leisure; we don't ATTEMPT the recovery because
the activity dir might be RO for a reason (zfs/btrfs snapshot, etc.)
that an automated rm would be wrong to fight.
2026-04-29 10:59:04 -04:00
892219ec87 feat(mutator): refuse forwards_l3 promotion on non-DMZ deckies
apply_update_decky's flip path now refuses to promote a decky to
gateway unless its home LAN is a DMZ.  The compose generator publishes
host ports for forwards_l3=True; a non-DMZ gateway would shadow the
host's port space without anything legitimately able to reach the
service.  Same posture as the existing 'forwards_l3 flip on live
requires force=true' guard — refused before any DB write so a bad
mutation leaves zero side-effects.

The check is intentionally NOT a standing _RULES invariant — the
codebase uses forwards_l3 for two semantics:

  1. Generic L3 forwarding (internal bridge deckies routing between
     their multi-home LANs).  The generator writes this on internal
     bridges via bridge_forward_probability; legitimately non-DMZ.
  2. DMZ gateway (host-port publisher).  Only meaningful on DMZ.

Standing validation can't enforce DMZ-homing without breaking case 1.
The guard fires only on the explicit user-driven flip path where the
operator's intent is unambiguously case 2.  Generator output and
internal-bridge attachments bypass the check.

check_gateway_homed_in_dmz lives in validate.py for callers that want
the explicit form (and for the test surface), but is not a standing
rule — comment in _RULES explains the asymmetry.
2026-04-29 00:38:51 -04:00
a27e3f5e0f fix(tests+mutator): unbreak the docker-shadow test env + let mutator delete from active
Two related fixes that came out of running the W5 tests locally:

1. tests/__init__.py — empty file, makes 'tests/' a package so pytest
   stops inserting it into sys.path.  Without it, 'tests/docker/'
   (the docker-image test category) shadowed the installed docker SDK
   on every engine-touching test in the repo:

     module 'docker' has no attribute 'DockerClient'

   Pytest's default --import-mode=prepend was the culprit; making
   tests/ a package is the cheapest fix and doesn't change
   --import-mode for the whole tree.

2. delete_topology_decky / delete_topology_edge / delete_lan grow an
   'enforce_pending: bool = True' kwarg.  Default preserves the HTTP
   CRUD guard (api_decky_crud / api_edge_crud / api_lan_crud get the
   409 for free).  apply_remove_decky / apply_detach_decky /
   apply_remove_lan now pass enforce_pending=False — the mutator
   queue is the live-editing surface and has its own active-topology
   gating; the repo's pending-only guard was for design-time CRUD
   that mustn't bypass it.  Without this, apply_remove_decky was
   silently broken on active topologies pre-W5; W5's new test
   surfaced it on first run.

10/10 new W5 tests pass; 58/58 across mutator + topology suites.
2026-04-29 00:24:17 -04:00
98c929894c feat(mutator): selective materialisation for apply_update_decky + tests
apply_update_decky now discriminates three sub-cases:

* services list changed → diff old vs new and call
  _materialise_decky_services_diff (compose up -d for added,
  stop + rm -f for removed).  Mirrors services_live's pattern but
  doesn't import it — mutator-routed mutations carry a different bus
  surface (mutation.applied) than the direct API path
  (decky.<name>.service_added).
* forwards_l3 flipped → port publishing changes, which docker can
  only apply at container-create time.  Gated on payload['force'] is
  true; default raises MutationError so a half-thinking operator
  can't stomp a live decky.  When force=true,
  _materialise_decky_recreate_base does compose up -d --no-deps
  --force-recreate.  Pre-checked BEFORE the DB write so a refused
  mutation leaves zero side-effects.
* coord-only (x/y) → DB only, no docker work.

Ships tests/mutator/test_ops_materialisation.py with focused coverage
for every new helper: add_decky/remove_decky/attach_decky/
detach_decky/update_decky/update_lan paths against an active
topology, with compose primitives + docker SDK mocked at the source
modules so the helpers' lazy imports pick up the stubs.  Also covers
the pending-topology skip and the force-flag gating.
2026-04-29 00:18:20 -04:00
e3afec4e70 feat(mutator): live network.disconnect for apply_detach_decky
Symmetric to apply_attach_decky — after deleting the multi-home edge
from the DB, calls the docker SDK to drop the base container's
interface in the now-detached LAN.  Service containers lose
visibility automatically (they share the base's netns).

Idempotency: 'not connected' / 'no such' APIError is logged at info
and treated as success.
2026-04-29 00:15:39 -04:00
f347a3a736 feat(mutator): live network.connect for apply_attach_decky
After the DB writes that record the multi-home edge, calls the docker
SDK directly to add an interface to the base container's netns:

  client.networks.get(<topology bridge>).connect(<base>, ipv4_address=ip)

Non-destructive — the base keeps running, no recreate.  Service
containers automatically see the new interface because they share
the base's netns via network_mode: service:<base>.

Idempotency: docker APIError with 'already' / 'endpoint exists' is
logged at info and treated as success.  Other errors log + leave the
DB row in place; an operator retry will hit the same path.
2026-04-29 00:15:11 -04:00
eed55619cb feat(mutator): live teardown for apply_remove_decky
Captures the decky's name and services list before delete_topology_decky
runs (the helper needs both as compose targets even though the DB row
is gone), then calls _materialise_decky_remove which stops + rm -f's
the base + per-service containers via 'docker compose stop / rm -f'.

Re-renders the per-topology compose AFTER the stop/rm so a future
'compose up -d' on the file doesn't try to bring the decky back.
2026-04-29 00:14:44 -04:00
8c06190e69 feat(mutator): live spawn for apply_add_decky + shared materialisation helpers
Adds _materialise_decky_{spawn,remove,connect,disconnect,services_diff,recreate_base}
helpers alongside the existing _materialise_lan_change.  Each follows
the same skip rules: bail when topology is not active/degraded, when
agent-pinned, or when docker calls fail (logged, not re-raised — DB
remains source of truth).

apply_add_decky now calls _materialise_decky_spawn after the DB writes.
The helper:

* re-renders the per-topology compose so it lists the new decky;
* runs 'compose up -d --no-deps --build <decky_base> <decky>-<svc>...'
  in a worker thread (matches engine/services_live's pattern).

Service container targets are filtered through get_service() so
fleet_singleton services are skipped — they don't have per-decky
compose entries.  Gateway (forwards_l3=True) deckies need no
special-case here; the compose generator already emits the host
'ports:' block for them.

Subsequent commits wire the other apply_* ops to the matching
helpers.  Tests for the full set ship in the workstream's last
commit.
2026-04-29 00:14:18 -04:00
578cdf9e2e fix(mutator): reject hostile apply_update_lan changes on live topologies
subnet and is_dmz are pinned at deploy time — live deckies bind to
the bridge with IPs allocated from the old subnet, and is_dmz flips
the docker network's internal flag which can't be changed while
containers are attached.  Today the op happily wrote the new value
into the DB and left docker on the old one, drifting the two surfaces.

apply_update_lan now raises MutationError when topology status is
active or degraded and the patch touches subnet or is_dmz.  Coord
(x/y) and rename updates still pass through; renames don't currently
have a live caller and the bridge's docker name keys off the lan name
in the renderer, so the next deploy will reconcile.

This matches the posture taken by _materialise_lan_change for live
LAN add/remove (commit 472c84b).
2026-04-29 00:12:44 -04:00
472c84b9c8 fix(mutator): materialise live LAN add/remove on docker, not just the DB
apply_add_lan and apply_remove_lan were DB-only — they wrote/deleted
the topology_lans row but never created or destroyed the docker bridge
network.  Adding a LAN to a deployed topology silently did nothing on
the substrate side; any decky later attached to it had nowhere to bind.

Both ops now call a shared _materialise_lan_change helper after the DB
write.  When the topology is active/degraded and not pinned to a swarm
agent, the helper:

* creates / removes the docker bridge network (internal=True for
  non-DMZ LANs, mirroring engine/deployer.deploy_topology),
* re-renders the per-topology compose file so future redeploys reflect
  the change.

Failures are logged, not re-raised — the DB row stays as source of
truth so an operator can retry without leaking inconsistent state.
Agent-pinned topologies are skipped; the next agent push reconciles.

apply_add_decky / apply_attach_decky have the same gap and are not
fixed here — multi-homing a running container needs careful
recreate-vs-network-connect handling and is its own commit.  Without
those, dropping a decky into a freshly-added LAN still won't spawn a
container; only the LAN itself is now live.
2026-04-29 00:00:02 -04:00
862e4dbb31 merge: testing → main (reconcile 2-week divergence) 2026-04-28 18:36:00 -04:00
b2e4706a14 Refactor: implemented Repository Factory and Async Mutator Engine. Decoupled storage logic and enforced Dependency Injection across CLI and Web API. Updated documentation.
Some checks failed
CI / Lint (ruff) (push) Successful in 12s
CI / SAST (bandit) (push) Successful in 13s
CI / Dependency audit (pip-audit) (push) Successful in 22s
CI / Test (Standard) (3.11) (push) Failing after 54s
CI / Test (Standard) (3.12) (push) Successful in 1m35s
CI / Test (Live) (3.11) (push) Has been skipped
CI / Test (Fuzz) (3.11) (push) Has been skipped
CI / Merge dev → testing (push) Has been skipped
CI / Prepare Merge to Main (push) Has been skipped
CI / Finalize Merge to Main (push) Has been skipped
2026-04-12 07:48:17 -04:00
f78104e1c8 fix: resolve all ruff lint errors and SQLite UNIQUE constraint issue
Ruff fixes (20 errors → 0):
- F401: Remove unused imports (DeckyConfig, random_hostname, IniConfig,
  COMPOSE_FILE, sys, patch) across cli.py, mutator/engine.py,
  templates/ftp, templates/rdp, test_mysql.py, test_postgres.py
- F541: Remove extraneous f-prefixes on strings with no placeholders
  in templates/imap, test_ftp_live, test_http_live
- E741: Rename ambiguous variable 'l' to descriptive names (line, entry,
  part) across conftest.py, test_ftp_live, test_http_live,
  test_mongodb_live, test_pop3, test_ssh

SQLite fix:
- Change _initialize_sync() admin seeding from SELECT-then-INSERT to
  INSERT OR IGNORE, preventing IntegrityError when admin user already
  exists from a previous run
2026-04-12 02:17:50 -04:00
c384a3103a refactor: separate engine, collector, mutator, and fleet into independent subpackages
- decnet/engine/ — container lifecycle (deploy, teardown, status); _kill_api removed
- decnet/collector/ — Docker log streaming (moved from web/collector.py)
- decnet/mutator/ — mutation engine (no longer imports from cli or duplicates deployer code)
- decnet/fleet.py — shared decky-building logic extracted from cli.py

Cross-contamination eliminated:
- web router no longer imports from decnet.cli
- mutator no longer imports from decnet.cli
- cli no longer imports from decnet.web
- _kill_api() moved to cli (process management, not engine concern)
- _compose_with_retry duplicate removed from mutator
2026-04-12 00:26:22 -04:00