Replaces LICENSE (GPLv3 -> AGPLv3) and prepends
`SPDX-License-Identifier: AGPL-3.0-or-later` to every source file
across decnet/, decnet_web/, tests/, scripts/, and tools/.
Rationale: closes the GPLv3 ASP loophole so any party operating a
modified DECNET as a network service must offer their modified
source. Personal copyright (Samuel Paschuan) + inbound=outbound
contributions make a future unilateral relicense infeasible.
- LICENSE: full AGPL-3.0 text (gnu.org/licenses/agpl-3.0.txt)
- COPYRIGHT: project copyright notice
- tools/add_spdx_headers.py: idempotent header injector
(shebang- and PEP 263-aware)
Touches 1565 source files (.py, .ts, .tsx, .js, .jsx, .css, .sh).
No behavior change; comments only.
This is the unblock for the wizard hang. Both endpoints used to run
docker compose synchronously inside the HTTP handler -- on master
(unihost) or via asyncio.gather of worker /deploy POSTs at 600s
timeout each (swarm) -- blocking every other API request.
New flow:
1. Commit the new config shape to repo state (fast).
2. Create one DeckyLifecycle row per decky (status=pending).
3. Spawn asyncio.create_task(run_deploy / run_mutate) -- the
lifecycle runner drives rows through running -> succeeded|failed
and emits decky.<name>.lifecycle on the bus.
4. Return 202 with {lifecycle_ids: [...]}. Wizard polls
GET /deckies/lifecycle?ids=... (next commit).
mutator/engine.py gains pick_new_services() -- shared between the
async API path and the watch-loop's synchronous mutate_decky().
DeployResponse grows lifecycle_ids[]. The old dispatch_decnet_config
helper still exists for the CLI swarm-deploy command path; it just
isn't called from the API handler anymore.
Test changes: 200 -> 202, drop dispatch_decnet_config mocks (handler
no longer calls it), assert lifecycle_ids in response + committed
state matches expectations.
- Implement /mutate handler: load_state, update services + last_mutated,
save_state, write_compose, compose up -d via asyncio.to_thread. 404
for missing state / unknown decky_id. dry_run short-circuits before
any side effect.
- Add AgentClient.mutate(decky_id, services, *, dry_run=False) using
_TIMEOUT_DEPLOY (compose up can pull/build, exceeds control timeout).
- mutator/engine.py: in swarm mode with decky.host_uuid set, resolve
worker via _resolve_swarm_host and dispatch through AgentClient.mutate
instead of writing a compose file on master. Master-resident deckies
(unihost mode, or swarm with host_uuid=None) keep the local path.
Replace repo: BaseRepository with a structural TopologyRepository protocol
in persistence.py and allocator.py. All read methods now return typed DTOs
(TopologySummary, LANRow, DeckyRow, EdgeRow) instead of raw dicts, eliminating
silent field-shape regressions across the topology subsystem.
TopologySummary gains email_personas and language_default so api_personas.py
can continue reading those fields via attribute access. hydrate() converts
DTOs to dicts before passing to _backfill_decky_configs, keeping the mutable
working-state function dict-based at its boundary. All production callers
(router handlers, mutator, CLI, heartbeat) migrated from dict/get access to
attribute access. 134 tests pass.
Dashboard's ACTIVE DECKIES (active_deckies in get_stats_summary) counts
TopologyDecky rows where state='running'. No code path was flipping
that state away from the default 'pending', so the count read 0/N
even when every container was running fine — the dashboard was lying.
Two complementary fixes:
1. deploy_topology — after the post-deploy compose ps verification,
reconcile each TopologyDecky.state from the corresponding base
container's docker state. running → 'running'; anything else →
'failed'. Reuses the ps_rows already gathered for the
ACTIVE-vs-DEGRADED status decision; no extra docker hit.
2. apply_add_decky — _materialise_decky_spawn now returns True/False;
on True the row is updated to state='running' before
_assert_valid_after. Catches the case where a decky added via the
live mutator queue stays at 'pending' indefinitely (the deployer's
reconcile only runs on a fresh deploy_topology pass).
Existing topology deckies in active topologies will still read as
'pending' until the next deploy_topology runs, since this is
forward-only. An operator-side fix is to teardown + redeploy or run
the (forthcoming) reconcile-on-startup pass.
apply_add_decky's compose-up was hard-failing whenever the operator's
~/.docker/buildx/activity/ landed on a read-only mount — the wedge
detection in _compose_with_retry correctly refuses to retry (would
just leak more mounts), but for live materialisation we don't want a
wedged buildx state to abort an admin's mutation. ANTI hit it on
adding decky-a977: 'failed to update builder last activity time: ...
read-only file system → buildx wedge detected → returned non-zero'.
_compose_up_with_buildkit_fallback wraps _compose_with_retry: on a
CalledProcessError whose stderr matches both wedge signatures
(_BUILDX_WEDGE_SIGNATURE + _BUILDX_EROFS_SIGNATURE), it logs a
warning with the manual recovery steps + retries once with
DOCKER_BUILDKIT=0 set. The legacy non-buildx builder doesn't use
the activity dir and isn't affected.
Wired into the two paths that pass --build:
* _materialise_decky_spawn (apply_add_decky)
* _materialise_decky_services_diff (apply_update_decky service add)
_materialise_decky_recreate_base doesn't build — it just recreates a
container from an existing image — so it's not affected.
Operator-facing log message points at the manual fix
(rm -rf ~/.docker/buildx/activity + docker buildx create) so they
can recover at their leisure; we don't ATTEMPT the recovery because
the activity dir might be RO for a reason (zfs/btrfs snapshot, etc.)
that an automated rm would be wrong to fight.
apply_update_decky's flip path now refuses to promote a decky to
gateway unless its home LAN is a DMZ. The compose generator publishes
host ports for forwards_l3=True; a non-DMZ gateway would shadow the
host's port space without anything legitimately able to reach the
service. Same posture as the existing 'forwards_l3 flip on live
requires force=true' guard — refused before any DB write so a bad
mutation leaves zero side-effects.
The check is intentionally NOT a standing _RULES invariant — the
codebase uses forwards_l3 for two semantics:
1. Generic L3 forwarding (internal bridge deckies routing between
their multi-home LANs). The generator writes this on internal
bridges via bridge_forward_probability; legitimately non-DMZ.
2. DMZ gateway (host-port publisher). Only meaningful on DMZ.
Standing validation can't enforce DMZ-homing without breaking case 1.
The guard fires only on the explicit user-driven flip path where the
operator's intent is unambiguously case 2. Generator output and
internal-bridge attachments bypass the check.
check_gateway_homed_in_dmz lives in validate.py for callers that want
the explicit form (and for the test surface), but is not a standing
rule — comment in _RULES explains the asymmetry.
Two related fixes that came out of running the W5 tests locally:
1. tests/__init__.py — empty file, makes 'tests/' a package so pytest
stops inserting it into sys.path. Without it, 'tests/docker/'
(the docker-image test category) shadowed the installed docker SDK
on every engine-touching test in the repo:
module 'docker' has no attribute 'DockerClient'
Pytest's default --import-mode=prepend was the culprit; making
tests/ a package is the cheapest fix and doesn't change
--import-mode for the whole tree.
2. delete_topology_decky / delete_topology_edge / delete_lan grow an
'enforce_pending: bool = True' kwarg. Default preserves the HTTP
CRUD guard (api_decky_crud / api_edge_crud / api_lan_crud get the
409 for free). apply_remove_decky / apply_detach_decky /
apply_remove_lan now pass enforce_pending=False — the mutator
queue is the live-editing surface and has its own active-topology
gating; the repo's pending-only guard was for design-time CRUD
that mustn't bypass it. Without this, apply_remove_decky was
silently broken on active topologies pre-W5; W5's new test
surfaced it on first run.
10/10 new W5 tests pass; 58/58 across mutator + topology suites.
apply_update_decky now discriminates three sub-cases:
* services list changed → diff old vs new and call
_materialise_decky_services_diff (compose up -d for added,
stop + rm -f for removed). Mirrors services_live's pattern but
doesn't import it — mutator-routed mutations carry a different bus
surface (mutation.applied) than the direct API path
(decky.<name>.service_added).
* forwards_l3 flipped → port publishing changes, which docker can
only apply at container-create time. Gated on payload['force'] is
true; default raises MutationError so a half-thinking operator
can't stomp a live decky. When force=true,
_materialise_decky_recreate_base does compose up -d --no-deps
--force-recreate. Pre-checked BEFORE the DB write so a refused
mutation leaves zero side-effects.
* coord-only (x/y) → DB only, no docker work.
Ships tests/mutator/test_ops_materialisation.py with focused coverage
for every new helper: add_decky/remove_decky/attach_decky/
detach_decky/update_decky/update_lan paths against an active
topology, with compose primitives + docker SDK mocked at the source
modules so the helpers' lazy imports pick up the stubs. Also covers
the pending-topology skip and the force-flag gating.
Symmetric to apply_attach_decky — after deleting the multi-home edge
from the DB, calls the docker SDK to drop the base container's
interface in the now-detached LAN. Service containers lose
visibility automatically (they share the base's netns).
Idempotency: 'not connected' / 'no such' APIError is logged at info
and treated as success.
After the DB writes that record the multi-home edge, calls the docker
SDK directly to add an interface to the base container's netns:
client.networks.get(<topology bridge>).connect(<base>, ipv4_address=ip)
Non-destructive — the base keeps running, no recreate. Service
containers automatically see the new interface because they share
the base's netns via network_mode: service:<base>.
Idempotency: docker APIError with 'already' / 'endpoint exists' is
logged at info and treated as success. Other errors log + leave the
DB row in place; an operator retry will hit the same path.
Captures the decky's name and services list before delete_topology_decky
runs (the helper needs both as compose targets even though the DB row
is gone), then calls _materialise_decky_remove which stops + rm -f's
the base + per-service containers via 'docker compose stop / rm -f'.
Re-renders the per-topology compose AFTER the stop/rm so a future
'compose up -d' on the file doesn't try to bring the decky back.
Adds _materialise_decky_{spawn,remove,connect,disconnect,services_diff,recreate_base}
helpers alongside the existing _materialise_lan_change. Each follows
the same skip rules: bail when topology is not active/degraded, when
agent-pinned, or when docker calls fail (logged, not re-raised — DB
remains source of truth).
apply_add_decky now calls _materialise_decky_spawn after the DB writes.
The helper:
* re-renders the per-topology compose so it lists the new decky;
* runs 'compose up -d --no-deps --build <decky_base> <decky>-<svc>...'
in a worker thread (matches engine/services_live's pattern).
Service container targets are filtered through get_service() so
fleet_singleton services are skipped — they don't have per-decky
compose entries. Gateway (forwards_l3=True) deckies need no
special-case here; the compose generator already emits the host
'ports:' block for them.
Subsequent commits wire the other apply_* ops to the matching
helpers. Tests for the full set ship in the workstream's last
commit.
subnet and is_dmz are pinned at deploy time — live deckies bind to
the bridge with IPs allocated from the old subnet, and is_dmz flips
the docker network's internal flag which can't be changed while
containers are attached. Today the op happily wrote the new value
into the DB and left docker on the old one, drifting the two surfaces.
apply_update_lan now raises MutationError when topology status is
active or degraded and the patch touches subnet or is_dmz. Coord
(x/y) and rename updates still pass through; renames don't currently
have a live caller and the bridge's docker name keys off the lan name
in the renderer, so the next deploy will reconcile.
This matches the posture taken by _materialise_lan_change for live
LAN add/remove (commit 472c84b).
apply_add_lan and apply_remove_lan were DB-only — they wrote/deleted
the topology_lans row but never created or destroyed the docker bridge
network. Adding a LAN to a deployed topology silently did nothing on
the substrate side; any decky later attached to it had nowhere to bind.
Both ops now call a shared _materialise_lan_change helper after the DB
write. When the topology is active/degraded and not pinned to a swarm
agent, the helper:
* creates / removes the docker bridge network (internal=True for
non-DMZ LANs, mirroring engine/deployer.deploy_topology),
* re-renders the per-topology compose file so future redeploys reflect
the change.
Failures are logged, not re-raised — the DB row stays as source of
truth so an operator can retry without leaking inconsistent state.
Agent-pinned topologies are skipped; the next agent push reconciles.
apply_add_decky / apply_attach_decky have the same gap and are not
fixed here — multi-homing a running container needs careful
recreate-vs-network-connect handling and is its own commit. Without
those, dropping a decky into a freshly-added LAN still won't spawn a
container; only the LAN itself is now live.