/api/v1/topologies/archetypes returns the archetype registry (slug,
display name, description, preferred services/distros, nmap_os
fingerprint) so the frontend wizard can render a live catalog instead
of hardcoding a copy.
The web bundle proxy handled GET/POST/PUT/DELETE but not PATCH or
preflight OPTIONS, which broke browser calls to PATCH endpoints behind
the static-bundle server. CORS middleware had the same gap.
db reset drops-and-recreates a fixed table set in FK order. Topology
tables weren't in the list, so reset left orphan topology rows behind
and a fresh MazeNET deploy could collide with stale child records.
topology delete cascades children (LANs, deckies, edges, mutations) but
refuses while containers are still running — teardown is prerequisite.
show stopped assuming every decky carried a full decky_config blob;
MazeNET-generated deckies only get hydrated on deploy, so fall back to
top-level name/services when the config isn't there.
Legacy fleet deckies live in decnet-state.json; MazeNET topology
containers don't. Tag them at compose-time with
decnet.topology.service=true and let the collector match on that label.
Spin up the agent's log collector on the first successful /topology/apply
(not in the lifespan — that would break the no-docker-on-boot invariant)
and tear it down with the app. Land log lines in DECNET_AGENT_LOG_FILE,
separate from master-side DECNET_INGEST_LOG_FILE, so a dev box running
both roles can't forward its own ingest back to itself.
When master pushes a topology that differs from whatever is pinned
locally, teardown the predecessor and accept the new one. Refusing with
409 left the agent stranded after partial deploys. record_error now
persists the hydrated blob so a later teardown can still walk the LAN
list — otherwise a half-failed apply strands containers + bridges with
no breadcrumb back to them.
Two small observability follow-ups to the phase-1 agent/topology wiring:
TopologySummary now carries needs_resync so operators can see the
heartbeat's resync flag via the topology list/detail API without
dropping into the DB.
TopologyStore.record_error becomes an upsert — when a docker/compose
failure fires during the first materialise (put() never reached), we
still land a marker row so GET /topology/state surfaces the error and
the next heartbeat carries an empty applied_version_hash. That empty
hash is what master's heartbeat check relies on to flag the topology
for resync instead of assuming the apply succeeded.
Agent heartbeats now carry an applied-topology snapshot. The master
heartbeat handler compares the reported version_hash against what
canonical_hash yields for the hydrated topology pinned to that host
and flags Topology.needs_resync on divergence (or when the agent
reports no topology at all while master expects one).
The mutator watch loop gains reconcile_agent_resyncs, which re-pushes
the current hydrated blob via AgentClient.apply_topology without
touching status, then clears the flag on success. Push failures leave
the flag set so the next tick retries.
deploy_topology and teardown_topology now branch on
target_host_uuid. When set:
- Hydrate the topology locally (validator runs exactly as before).
- Compute canonical_hash; push {hydrated, version_hash} to the
pinned agent through AgentClient.apply_topology.
- Status machine still moves PENDING -> DEPLOYING -> ACTIVE on 2xx,
PENDING -> DEPLOYING -> FAILED on error; master remains the sole
owner of the row.
Teardown flips to TEARING_DOWN, fires /topology/teardown, then
TORN_DOWN — we log a warning on agent error but still settle to
TORN_DOWN so operators can delete the row (agent garbage is cleaned
on the next re-enroll).
Unihost deploys are unchanged — the field defaults to NULL so every
existing flow takes the local path.
Step 6 of the agent <-> topology integration.
Three new RPCs mirroring the existing deploy/teardown/status pattern:
- apply_topology(hydrated, version_hash) — long-timeout (600s) for
image pulls + compose up.
- teardown_topology(topology_id) — 300s timeout; enough for a
stubborn compose-down without hanging a heartbeat.
- get_topology_state() — short control-plane read for reconcile.
The per-call timeout swap uses the same trick as .deploy().
Step 5 of the agent <-> topology integration.
New mTLS-protected routes on the agent:
- POST /topology/apply — master pushes {hydrated, version_hash}.
Validates the hash matches locally (serialisation drift guard),
runs the topology through the same validator/composer pipeline
used master-side, then creates bridges + compose up + records the
apply in topology.db.
- POST /topology/teardown — dismantles compose, removes bridges,
clears topology.db. Idempotent.
- GET /topology/state — returns applied row + live docker
observation for the heartbeat.
Implementation lives in decnet/agent/topology_ops.py; it reuses the
private compose helpers from decnet.engine.deployer so we don't
duplicate compose/project-name plumbing. The apply path is sync
under the hood (docker SDK + subprocess); we hop to a thread so the
event loop keeps servicing other agent traffic.
v1 is one-topology-per-agent; cross-topology apply returns 409.
Step 4 of the agent <-> topology integration.
Single-row sqlite tracking which topology the agent last applied and
its version hash. Sync/stdlib, same pattern as the log-forwarder
offset store. v1 is one-topology-per-agent; attempting to apply a
different topology over a populated row raises AlreadyApplied so the
endpoint can return 409. observed() snapshots live docker state
(decnet-topology-* bridges + decnet-* containers) for the heartbeat.
The store is a cache, not authority — no auto-restore on boot.
Master remains the only source of truth.
Step 3 of the agent <-> topology integration.
Tiny pure helper both master and agent will use to answer "is the
applied state the one we expect?". SHA-256 of canonical JSON with
volatile keys (timestamps, status, version, canvas x/y/w/h) stripped
so the hash only captures deployment-relevant state.
Step 2 of the agent <-> topology integration.
Adds the `target_host_uuid` FK on `Topology` plus wiring through the
two create endpoints (`POST /topologies`, `POST /topologies/blank`).
Validates the mode/host pair: `mode='agent'` now requires a known,
routable host; `mode='unihost'` must leave the field unset.
Surfaced on `TopologySummary` so list/detail responses expose it.
Purely additive at the schema level — existing unihost flows unchanged
(field defaults to `NULL`).
Step 1 of the agent <-> topology integration.
Active/degraded/failed/deploying topologies cannot be deleted
without first transitioning to torn_down, but the UI had no way
to trigger that. Add POST /topologies/{id}/teardown mirroring the
deploy endpoint (background task, 202 Accepted), and a
click-to-arm TEARDOWN button on the topology list card that shows
whenever the row is in a teardown-eligible state.
MazeNET publishes gateway ports on the host via Docker. With the
default userland-proxy enabled, attacker connections appear to
originate from the bridge gateway instead of the real remote IP.
Log a soft warning at deploy time when the topology publishes any
ports and docker info reports UserlandProxy=true, pointing the
operator at the daemon.json toggle. Best-effort: daemon talk
failures silently no-op.
A prior half-torn-down topology can leave a bridge network alive under
a different name that still owns our intended subnet. Docker then
rejects our create with 'Pool overlaps with other one on this address
space', and the topology deploy fails.
Extend create_bridge_network to sweep any unused bridge whose IPAM
subnet matches the one we're about to claim (skipping networks with
running containers — those are live use).
UI-created deckies (api_decky_crud, api_create_blank_topology) write
decky_config as sent by the client — typically just archetype flags,
without the name/ips_by_lan fields compose.py requires. The generator
path populates them at persist() time, so compose worked for generated
topologies but KeyError'd on UI-created ones.
Normalise in hydrate() so every write path feeds the same shape
downstream: mirror decky.name into decky_config.name, and allocate
per-LAN IPs deterministically (reserving the primary decky.ip where it
falls in-subnet, then filling remaining edges with next_free).
Add check_no_host_port_collision: enumerate the ports the topology's
gateways will publish (forwards_l3=True × svc.ports), probe live
listeners via psutil, emit a 'warning'-severity PORT_COLLISION
issue per overlap. Live-only — invoked from deploy_topology just
after dry-run branching, so unit tests that exercise validate()
stay hermetic.
Warning rather than error because docker-compose up will hard-fail
on a real collision anyway; this just gives operators a cleaner log
line ahead of the compose failure.
When a non-DMZ LAN is created via POST /lans, look up the topology's
gateway (decky with forwards_l3=True attached to the DMZ) and insert
an edge binding it to the new LAN. The gateway becomes multi-homed
to every internal LAN automatically, so DMZ_ORPHAN cannot arise
from ordinary editor use.
Also fixes delete_lan: the home-decky guard used scalar_one_or_none,
which blew up when the gateway already had >1 'other' LAN edge.
Switch to scalars().first() — we only need to know *some* other
edge exists, not a unique one.
Gateway deckies (forwards_l3=True) are the DMZ's ingress. Their
service containers share the base namespace via network_mode:service,
so any listener inside the gateway is reachable through the base
container's published ports. Emit 'ports: [<p>:<p>, ...]' on the
gateway base from svc.ports across the decky's service list.
This is the principled replacement for the broken network_mode: host
stub — with docker-proxy publishing, the DMZ works on any single-NIC
VPS (no MACVLAN, no promiscuous mode required).
POST /topologies/blank seeded the gateway decky with
archetype=host-gateway + network_mode=host, but neither was wired:
no compose writer reads network_mode and host-gateway is not a real
archetype. Replace with archetype=deaddeck + forwards_l3=true so the
gateway is a normal multi-homed bridge decky, consistent with how
compose.py interprets forwards_l3 (sysctl + NET_ADMIN).
Edge marked is_bridge=true, forwards_l3=true so downstream readers
(generator, compose, validator) see a real bridge attachment.
JA3/JA3S fingerprints are defined by their specs as MD5 digests of
the ClientHello/ServerHello feature tuples — they are identifiers,
not security primitives. Pass usedforsecurity=False at the two call
sites so bandit stops flagging them as B324 High when scanning
outside the templates/ exclude.
DECNET's app-level RequestValidationError handler remaps structural
422→400, including query/path constraint violations (limit bounds,
the next-subnet base pattern, etc.). Schemathesis fuzzing will drive
those code paths and fail response_schema_conformance unless 400 is
declared in responses={}. Adds the entry to every phase-3 read route.
GET /api/v1/topologies — paginated list with status filter. Extends
repo.list_topologies() to accept limit/offset and adds count_topologies()
for the total envelope field.
GET /api/v1/topologies/{id} — hydrated TopologyDetail; 404 if missing.
GET /api/v1/topologies/{id}/status-events — audit trail, limit-capped.
Catalog helpers for the phase-4 canvas UI:
* GET /topologies/services — full service catalog.
* GET /topologies/next-subnet?base=172.20 — wraps SubnetAllocator against
reserved_subnets across non-torn-down topologies.
* GET /topologies/{id}/lans/{lan_id}/next-ip — IPAllocator pre-seeded
with existing decky IPs in that LAN.
All read routes are viewer-or-admin. Sub-routers are included in an
order that keeps literal catalog paths (/services, /next-subnet) from
being shadowed by the /{topology_id} trie branch.
Add Pydantic DTOs in decnet/web/db/models.py covering every phase-3
endpoint shape: TopologyGenerateRequest, TopologySummary/Detail, child
create/update requests, MutationEnqueueRequest (Literal op guard),
MutationRow with JSON-payload decoder, validation/version/not-editable
error envelopes, and the three catalog responses.
Create decnet/web/router/topology/ as an import-safe package exporting
topology_router (prefix /topologies) — sub-routers land step-by-step in
subsequent commits. Mount under the main api router alongside swarm_mgmt.
tests/api/topology/test_models.py pins repo-dict ↔ DTO parity so future
repo-row drift breaks the contract test before the endpoints.
Adds the live-mutation pipeline for active/degraded topologies:
* TopologyMutation table with composite index (state, topology_id)
so the watch-loop guard query stays O(log n).
* claim_next_mutation is a single atomic UPDATE ... WHERE
state='pending' so racing reconcilers deterministically pick one
winner; losers see rowcount=0 and skip.
* reconcile_topologies drains pending rows per live topology, applies
via decnet.mutator.ops.dispatch, and on failure marks the mutation
failed + transitions topology to degraded.
* run_watch_loop gains a gated branch: flat-fleet mutate_all runs
every tick unchanged; the reconciler only enters when the cheap
has_pending_topology_mutation guard returns True.
* apply_* ops re-check hard invariants (names, IP collisions, subnet
overlap, known services, service_config shape) after every mutation
so the repo never lands in an invalid state.
* CLI: 'decnet topology mutate' / 'mutations' subcommands.
MazeNET phase 2 step 6. Equips the repo layer with the CRUD the web
editor needs before deploy.
- TopologyNotEditable exception: raised when a pending-only method hits
a non-pending topology. The intent is "free-form edits stop at deploy;
the mutator (step 7) takes over for live topologies."
- _assert_pending helper checks status inside the session.
- update_lan / update_topology_decky accept enforce_pending=True for
pre-deploy callers (existing internal callers default to False so
behavior is unchanged).
- delete_lan: cascades edges; refuses if any decky has only one edge
(= this LAN is its home) to prevent orphans.
- delete_topology_decky: cascades edges.
- delete_topology_edge: bare-bones removal.
All four mutators accept expected_version for optimistic concurrency.
Existing tests continue to pass (no behavior change for persist/deploy).
MazeNET phase 2 step 5. Pure storage — the generator emits None for
x/y and the web canvas fills them in later. No logic changes; no
compose, deploy, or validator impact.
MazeNET phase 2 step 4. Readies the repo layer for concurrent editors
(web canvas + CLI + mutator) without lost-write races.
- Topology.version: monotonically bumped on supervised child-row writes.
- VersionConflict exception carries {current, expected} for the UI.
- _check_and_bump_version helper reads Topology in the same session,
compares against expected_version, raises on mismatch, bumps on match.
Commit happens in the caller's existing transaction so check+bump+write
are atomic per mutation.
- add_lan / update_lan / add_topology_decky / update_topology_decky /
add_topology_edge accept expected_version=None by default, preserving
every existing caller's behavior.
When expected_version is None, no check runs and version stays put —
internal callers (persist) that don't care about concurrency keep
working unchanged.
MazeNET phase 2 step 3. Blocks deploys of hand-authored topologies that
would fail mid-bring-up (orphan deckies, duplicate IPs, overlapping
subnets, unknown services) with a structured error list instead of a
docker error at startup.
Rules (one function each, composable by the editor for inline hints):
- exactly one DMZ
- every LAN has a bridge chain to the DMZ (BFS via multi-homed deckies)
- no orphan deckies
- unique LAN and decky names per topology
- no IP collisions + IPs inside their LAN's subnet
- no LAN subnet overlaps
- every service in decnet.fleet.all_service_names()
- service_config keys match the decky's declared services
deploy_topology runs the validator after hydrate, before any status
transition or Docker call; errors raise ValidationError and status
stays at pending.
MazeNET phase 2 step 2. Mirrors the flat-fleet service_config pattern
(DeckyConfig.service_config → composer → svc.compose_fragment) into the
topology compose pipeline, so a hand-authored decky can carry overrides
like {"ssh": {"password": "megapassword"}} and the ssh fragment reads
them just like the flat path does.
- _PlannedDecky gains service_config: dict[str, dict].
- persist() stores it under decky_config["service_config"].
- topology/compose.py passes cfg.get("service_config", {}).get(svc, {})
to svc.compose_fragment(service_cfg=...).
Schema unchanged — service_config lives inside the existing
decky_config JSON blob. Zero changes in decnet/services/*.
MazeNET phase 2 step 1. Pulls inline IP/subnet allocation out of the
generator into decnet/topology/allocator.py so the editor + reconciler
can reuse the same primitives without duplicating logic.
- IPAllocator: stateful host-IP handout with reserve/release/is_free.
- SubnetAllocator: /24 handout under a base prefix, skips reservations.
- reserved_subnets(repo): collects claimed subnets across every
non-torn_down topology so concurrent drafts cannot collide.
- generate() accepts reserved_subnets= to skip existing claims.
Generator output is byte-identical under seed (behavior preserved).
Covers dry-run compose emission (no status change), FAILED transition
with reason logged on daemon errors, teardown from FAILED, and a
live-marked end-to-end test that creates/removes bridge networks
against a real docker daemon (skipped on CI).
decnet topology {generate,list,show,deploy,teardown} wraps the new
persistence and deployer APIs. Structured text output, no ASCII art —
visual DAG rendering belongs in the web dashboard. Group is master-only
via MASTER_ONLY_GROUPS and a _require_master_mode guard on each body.
Adds per-topology compose generation (one Docker bridge network per
LAN, multi-homed bridge deckies, ip_forward sysctl for L3 forwarders)
plus async deploy_topology/teardown_topology in the engine. Leaf-first
teardown via BFS-named LAN reverse sort; partial-state safe on failure.
Adds decnet/topology/ with:
- config.TopologyConfig: pydantic model driving generation (depth,
branching_factor, deckies_per_lan_min/max, bridge_forward_probability,
cross_edge_probability, subnet_base_prefix, service selection, seed).
Emits GeneratedTopology dataclass (lans, deckies, edges).
- status.TopologyStatus + assert_transition: seven-state machine with
an explicit legal-transition table. torn_down is terminal; degraded
is schema-reserved for future Healer use.
- generator.generate: deterministic DAG generation under config.seed.
Builds a tree of LANs (DMZ at root), plants deckies in each LAN,
promotes one decky per non-DMZ LAN to a parent bridge, and rolls
cross-edges per cross_edge_probability for DAG shape.
- persistence: persist() writes a plan to the repo as pending;
transition_status() enforces state-machine legality; hydrate() loads
topology + children into a single dict.
Covered by tests/topology/{test_status,test_generator,test_persistence}.
Adds topology CRUD to BaseRepository (NotImplementedError defaults) and
implements them in SQLModelRepository: create/get/list/delete topologies,
add/update/list LANs and TopologyDeckies, add/list edges, plus an atomic
update_topology_status that appends a TopologyStatusEvent in the same
transaction. Cascade delete sweeps children before the topology row.
Covered by tests/topology/test_repo.py (roundtrip, per-topology name
uniqueness, status event log, cascade delete, status filter) and an
extension to tests/test_base_repo.py for the NotImplementedError surface.
Introduces five new SQLModel tables for MazeNET (nested deception
topologies): Topology, LAN, TopologyDecky, TopologyEdge, and
TopologyStatusEvent. DeckyShard is intentionally not touched —
TopologyDecky is a purpose-built sibling for MazeNET's lifecycle
(topology-scoped UUIDs, per-topology name uniqueness).
Part of MazeNET v1 (nested self-container network-of-networks).
Schemathesis was failing CI on routes that returned status codes not
declared in their OpenAPI responses= dicts. Adds the missing codes
across swarm_updates, swarm_mgmt, swarm, fleet and attackers routers.
Also adds 400 to every POST/PUT/PATCH that accepts a JSON body —
Starlette returns 400 on malformed/non-UTF8 bodies before FastAPI's
422 validation runs, which schemathesis fuzzing trips every time.
No handler logic changed.
The 1,878-line cli.py held every Typer command plus process/HTTP helpers
and mode-gating logic. Split into one module per command using a
register(app) pattern so submodules never import app at module scope,
eliminating circular-import risk.
- utils.py: process helpers, _http_request, _kill_all_services, console, log
- gating.py: MASTER_ONLY_* sets, _require_master_mode, _gate_commands_by_mode
- deploy.py: deploy + _deploy_swarm (tightly coupled)
- lifecycle.py: status, teardown, redeploy
- workers.py: probe, collect, mutate, correlate
- inventory.py, swarm.py, db.py, and one file per remaining command
__init__.py calls register(app) on each module then runs the mode gate
last, and re-exports the private symbols tests patch against
(_db_reset_mysql_async, _kill_all_services, _require_master_mode, etc.).
Test patches retargeted to the submodule where each name now resolves.
Enroll-bundle tarball test updated to assert decnet/cli/__init__.py.
No behavioral change.
Uvicorn's h11/httptools HTTP protocols don't populate scope['extensions']['tls'], so /swarm/heartbeat's per-request cert pinning was 403ing every call despite CERT_REQUIRED validating the cert at handshake. Patch RequestResponseCycle.__init__ on both protocol modules to read the peer cert off the asyncio transport and write DER bytes into scope['extensions']['tls']['client_cert_chain']. Importing the module from swarm_api.py auto-installs the patch in the swarmctl uvicorn worker before any request is served.
New decnet.agent.heartbeat asyncio loop wired into the agent FastAPI
lifespan. Every 30 s the worker POSTs executor.status() to the master's
/swarm/heartbeat with its DECNET_HOST_UUID for self-identity; the
existing agent mTLS bundle provides the client cert the master pins
against SwarmHost.client_cert_fingerprint.
start() is a silent no-op when identity env (HOST_UUID, MASTER_HOST) is
unset or the worker bundle is missing, so dev runs and un-enrolled hosts
don't crash the agent app. On non-204 responses the loop logs loudly but
keeps ticking — an operator may re-enrol mid-session, and fail-closed
pinning shouldn't be self-silencing.
swarmctl CLI gains --tls/--cert/--key/--client-ca flags. With --tls the
controller runs uvicorn under HTTPS + mTLS (CERT_REQUIRED) so worker
heartbeats can reach it cross-host. Default is still 127.0.0.1 plaintext
for backwards compat with the master-CLI enrollment flow.
Auto-issue path (no --cert/--key given): a server cert signed by the
existing DECNET CA is issued once and parked under ~/.decnet/swarmctl/.
Workers already ship that CA's ca.crt from the enroll bundle, so they
verify the endpoint with no extra trust config. BYOC via --cert/--key
when the operator wants a publicly-trusted or externally-managed cert.
The auto-cert path is idempotent across restarts to keep a stable
fingerprint for any long-lived mTLS sessions.