853 Commits

Author SHA1 Message Date
bf5ed7abbb feat(engine): emit creation/retirement mutation events on deploy/teardown
Close the lifecycle loop for the correlation graph: every decky now
enters the substrate with an explicit `trigger=creation` event
(old_services=[] ⇒ new_services=<initial>) and leaves it with
`trigger=retirement` (old=<current> ⇒ new=[]).  With scheduled/operator
mutations already flowing through emit_decky_mutated, the entire decky
lifecycle is now a well-formed sequence of mutation events — the
correlator can fold substrate_state(t) at any T by replaying them.

Lazy-imports mutator.events to dodge the engine↔mutator circular
dependency.  Bus is None at CLI sites; the syslog write is what the
correlator consumes.  Emission is soft-failing so a broken log path
never aborts a deploy.
2026-04-21 19:35:05 -04:00
fa0cdb3ab5 feat(mutator): route mutate_decky through emit_decky_mutated with trigger
Mutator now emits one decky_mutated event (RFC 5424 + bus) per
successful mutation instead of the inline decky.<id>.state bus
publish.  The previous state topic published new_services only;
mutation events carry old/new/trigger, which is what the correlation
engine needs to interleave substrate-change markers into attacker
traversals.

- mutate_decky gains trigger: MutationTrigger = "operator" and
  captures old_services before the shuffle; replaces the inline
  _publish_safely(decky.<id>.state) with emit_decky_mutated(...).
- mutate_all derives trigger internally: operator when force or
  only-filter is set (CLI --all, API mutate-now, UI bus request);
  scheduled on interval ticks.  Passed through to each mutate_decky
  call.
- Tests updated: the old decky.<id>.state assertion is replaced
  with decky.<id>.mutation topic + mutation payload shape; 3 new
  tests cover trigger derivation for scheduled / force / only paths.
  26 tests in test_mutator.py green; 116 across mutator + topology
  + bus.
2026-04-21 19:31:31 -04:00
f875350d75 feat(mutator): emit_decky_mutated helper — RFC 5424 + bus in one call
First step toward making mutation events first-class nodes in the
correlation graph. Today the graph silently reflects post-mutation
state with no marker of the transition; this helper lands the
emitter the mutator and deploy paths will call.

- decnet/mutator/events.py: emit_decky_mutated(bus, *, decky,
  old_services, new_services, trigger, actor=None, log_path=None)
  writes an RFC 5424 line (service=mutator, hostname=<decky>,
  MSGID=decky_mutated, SD params for old/new services + trigger +
  optional actor) to DECNET_INGEST_LOG_FILE, then fire-and-forget
  publishes on decky.<id>.mutation. Either side failing is soft —
  the other path still completes.
- MutationTrigger Literal covers creation, retirement, scheduled,
  operator, behavioral, healer, federation. Reserved values for v2/v3
  (behavioral + federation) stay nullable so the schema is stable.
- decnet/bus/topics.py: DECKY_MUTATION constant + decky_mutation(id)
  builder. Distinct from DECKY_STATE ("current shape") because a
  mutation is a transition event, not a steady-state snapshot.
- Empty-set symmetry: creation emits old_services=[], retirement
  emits new_services=[]. Every decky lifecycle becomes a well-formed
  fold sequence on the correlator side.
- 4 new tests: FakeBus + correlator parser round-trip; creation and
  retirement empty-set cases; bus=None still writes syslog;
  unwritable log path doesn't block bus publish. 95 tests green
  across test_mutator + tests/bus.
2026-04-21 19:29:21 -04:00
e23c6c4ee4 feat(mutator): bus-wake on decky mutate_request; adaptive sleep; heartbeat
The flat-fleet mutator was DB-poll-only and noisy — it logged
"no active deployment found" every 10s on idle hosts and ran
mutate_all at a fixed tick regardless of when the next decky
was due.

- mutate_all returns seconds-until-next-due; watch loop sleeps
  min(next_due, poll_interval_secs) with a 1s floor.
- "No deployment" is now idle, not an error: edge-triggered log
  on present<->absent transition instead of every tick.
- mutate_decky publishes decky.<name>.state on successful compose
  so UIs react in real time.
- New decky.*.mutate_request subscription lets API/CLI/UI force
  an immediate mutation of a specific decky without waiting for
  its interval; target name feeds mutate_all(only={...}).
- system.mutator.health heartbeat via run_health_heartbeat helper,
  bringing the mutator in line with DEBT-031 workers.

Tests: next_due return, only= filter, decky.<name>.state publish
on success, no publish on compose failure. Full mutator+topology-
mutator+bus suite (109) green.
2026-04-21 19:28:01 -04:00
5c0631e12c feat(agent,forwarder,updater): publish system.<worker>.health heartbeats (DEBT-031 workers 7-9)
All three workers now share a run_health_heartbeat helper in
decnet.bus.publish.  Each publishes system.<worker>.health on a 30s tick
with {worker, ts} plus optional per-worker extras.  Subscribers can
watch system.*.health to see every DECNET worker on a host at once.

- agent: heartbeat runs inside the FastAPI lifespan alongside the
  existing master-facing heartbeat; bus-disabled path is a no-op.
- forwarder: heartbeat task spawned at run_forwarder entry, cancelled
  in the finally block so a crashed master loop never leaks the task.
- updater: new FastAPI lifespan hosts the heartbeat.

Heartbeat helper swallows extra() failures and is cancellation-safe so
lifespan teardown never hangs on it.
2026-04-21 17:02:10 -04:00
cbb394a160 feat(ingester): publish system.log per committed batch (DEBT-031 worker 6)
Ingester connects the bus at startup, emits a batch-committed summary
(component/flushed/position) after each successful _flush_batch.  Zero-
row flushes are suppressed so the topic stays meaningful.

Complements the collector's per-line system.log publishes: collector
signals ingress, ingester signals DB-persisted progress.  Federation
forwarder (worker 8) will subscribe to the batch-committed leaf to
trigger its upstream push.

Bus stays optional: publish_safely swallows failures, get_bus() can
return None, DECNET_BUS_ENABLED=false leaves the ingestion loop fully
functional.
2026-04-21 16:58:49 -04:00
a448dbe283 feat(collector): publish system.log per ingested event (DEBT-031 worker 5)
log_collector_worker connects the bus at startup, builds a thread-safe
system.log publisher, and hands it to each container-stream thread
through _stream_container's new publish_fn parameter.  Publishing fires
right after the JSON record is written — same rate-limiter path, no
extra parsing, compact payload (decky/service/event_type/attacker_ip/
timestamp) so subscribers can redraw without re-reading the DB.

Bus stays optional: if get_bus() fails or DECNET_BUS_ENABLED=false the
factory returns a no-op publisher and the stream thread calls it
unconditionally.  Hook failures are logged and never abort the thread.
2026-04-21 16:57:21 -04:00
67c2e30f89 feat(profiler): publish attacker.scored per profile upsert (DEBT-031 worker 4)
The profiler worker threads its bus publisher through _WorkerState so
_update_profiles can emit a compact attacker.scored event for every
upsert.  Payload carries the headline counts (event/service/decky/
bounty/credential) plus is_traversal, so the MazeNET attacker pool can
redraw without a round-trip.

Bus stays optional: publish_attacker=None when DECNET_BUS_ENABLED=false
or get_bus() fails, and hook exceptions are logged without breaking the
upsert path.
2026-04-21 16:54:40 -04:00
e51b65d7c3 feat(correlation,profiler): publish attacker.observed on first sighting (DEBT-031 worker 3)
CorrelationEngine gains an optional publish_fn hook fired once per unique
attacker IP.  The profiler worker — sole caller of the engine today —
carries the bus physically, builds a thread-safe publisher, and wraps it
with the attacker.observed topic before handing it in.

Bus stays optional: if get_bus() fails or DECNET_BUS_ENABLED=false, the
engine runs publish_fn=None and the worker degrades to DB-only.  Hook
failures log a warning and never break ingestion.
2026-04-21 16:53:03 -04:00
34d9e37ab0 feat(prober): publish attacker.fingerprinted on the bus (DEBT-031)
Each successful JARM / HASSH / TCPfp probe fans out an
attacker.fingerprinted event; the probe family goes in event.type so a
single subscription covers all three.  Payload carries the attacker IP,
port, and probe-specific hash — enough for the MazeNET live map to
render fingerprint info on observed attackers.

Lifts the thread-safe publisher helper out of the sniffer worker into
decnet/bus/publish.py so the prober (and every future worker with a
to_thread hot path) can reuse it without copy-pasting the
run_coroutine_threadsafe dance.  Sniffer rewires onto the shared helper
in passing.

Adds ATTACKER_FINGERPRINTED as a new leaf — distinct from
ATTACKER_OBSERVED (correlator's first-sight signal) because an active
probe result is additional evidence about an already-observed attacker.

Note: the plan's decky.{id}.state realism-probe publish path is
deferred — the current prober fingerprints attackers, not decky
realism.  Will revisit when realism probes exist.
2026-04-21 16:47:55 -04:00
7f497ac552 feat(sniffer): publish decky.{id}.traffic on the bus (DEBT-031)
SnifferEngine gains an optional publish_fn hook, invoked after the
dedup + syslog write for traffic-summary events only (tls_session,
tcp_flow_timing, tcp_syn_fingerprint) — intermediate parser artifacts
like tls_client_hello stay off the bus.

The sniffer worker wires get_bus() + a thread-safe shim that marshals
sync calls from the scapy sniff thread back onto the asyncio loop via
run_coroutine_threadsafe.  Bus failure at startup degrades cleanly to
publish-off mode; publish failures at runtime never escape the sniff
thread.
2026-04-21 16:35:50 -04:00
f3eaab5d37 refactor(bus): extract publish_safely + extend topics for DEBT-031
Shared publish_safely helper at decnet/bus/publish.py so the nine
workers about to be wired into the bus don't each copy-paste the
"never raise back at the caller" contract. Mutator drops its private
copy and imports the canonical one.

topics.py gains the attacker.* hierarchy (observed, scored,
session.started, session.ended) and a system_health(worker) builder
for per-worker health heartbeats — both prerequisites for the worker
rollout under DEBT-031.
2026-04-21 16:32:30 -04:00
f611e7363b feat(mutator,web): live topology mutation pipeline backend (DEBT-030)
Wire the mutator and web API into the service bus so live-topology
edits flow sub-second from enqueue to UI:

- Mutator publishes every state transition on the bus (mutation.applying
  /applied/failed + topology.status). Fire-and-forget; DB stays source
  of truth.
- Mutator watch loop subscribes to topology.*.mutation.enqueued and
  wakes early via asyncio.Event — the 10s poll becomes a fallback
  heartbeat, not the primary dispatch trigger.
- POST /topologies/{id}/mutations publishes mutation.enqueued after
  the DB write succeeds.
- New GET /topologies/{id}/events SSE route: snapshot on connect
  (status + in-flight mutations), live forwards topology.{id}.>
  bus events, 15s keepalive. ?token= auth mirrors /stream.
- New decnet/bus/app.py — process-wide lazy bus singleton for the
  API, closed cleanly on lifespan shutdown.
2026-04-21 14:38:25 -04:00
fbf289ff63 feat(bus): host-local UNIX-socket pub/sub worker (DEBT-029)
Land the `decnet bus` worker and `get_bus()` factory. Transport is a
host-local UNIX-domain socket (0660, group=decnet); authz is the file
mode. Wire framing is a tiny verb-line + 4-byte-BE length + orjson body.
NATS-style wildcard topics (`*`, `>`). At-most-once, fire-and-forget —
DB stays the source of truth. `FakeBus` / `NullBus` for tests and the
disabled path. Cross-host federation is deferred to a future
`--bridge-tcp` mode; DEBT-030 is master-only and unblocked.
2026-04-21 13:49:02 -04:00
071312fc0c feat(web/api): expose archetype catalog endpoint
/api/v1/topologies/archetypes returns the archetype registry (slug,
display name, description, preferred services/distros, nmap_os
fingerprint) so the frontend wizard can render a live catalog instead
of hardcoding a copy.
2026-04-21 10:24:01 -04:00
542637c0dc feat(web/api): support PATCH on proxy and CORS
The web bundle proxy handled GET/POST/PUT/DELETE but not PATCH or
preflight OPTIONS, which broke browser calls to PATCH endpoints behind
the static-bundle server. CORS middleware had the same gap.
2026-04-21 10:23:55 -04:00
1b29a7692c feat(cli/db): include topology tables in db reset
db reset drops-and-recreates a fixed table set in FK order. Topology
tables weren't in the list, so reset left orphan topology rows behind
and a fresh MazeNET deploy could collide with stale child records.
2026-04-21 10:23:49 -04:00
e75198cca9 feat(cli/topology): add delete command and null-safe show
topology delete cascades children (LANs, deckies, edges, mutations) but
refuses while containers are still running — teardown is prerequisite.
show stopped assuming every decky carried a full decky_config blob;
MazeNET-generated deckies only get hydrated on deploy, so fall back to
top-level name/services when the config isn't there.
2026-04-21 10:23:37 -04:00
0cdcfe2653 feat(agent/collector): topology-label discovery and master-authoritative supersede
Legacy fleet deckies live in decnet-state.json; MazeNET topology
containers don't. Tag them at compose-time with
decnet.topology.service=true and let the collector match on that label.
Spin up the agent's log collector on the first successful /topology/apply
(not in the lifespan — that would break the no-docker-on-boot invariant)
and tear it down with the app. Land log lines in DECNET_AGENT_LOG_FILE,
separate from master-side DECNET_INGEST_LOG_FILE, so a dev box running
both roles can't forward its own ingest back to itself.

When master pushes a topology that differs from whatever is pinned
locally, teardown the predecessor and accept the new one. Refusing with
409 left the agent stranded after partial deploys. record_error now
persists the hydrated blob so a later teardown can still walk the LAN
list — otherwise a half-failed apply strands containers + bridges with
no breadcrumb back to them.
2026-04-21 10:23:10 -04:00
12e18b75db feat(swarm): expose needs_resync on TopologySummary + upsert record_error
Two small observability follow-ups to the phase-1 agent/topology wiring:

TopologySummary now carries needs_resync so operators can see the
heartbeat's resync flag via the topology list/detail API without
dropping into the DB.

TopologyStore.record_error becomes an upsert — when a docker/compose
failure fires during the first materialise (put() never reached), we
still land a marker row so GET /topology/state surfaces the error and
the next heartbeat carries an empty applied_version_hash. That empty
hash is what master's heartbeat check relies on to flag the topology
for resync instead of assuming the apply succeeded.
2026-04-21 01:41:30 -04:00
e8f9c955b3 feat(swarm): heartbeat-driven topology resync for agent-pinned deployments
Agent heartbeats now carry an applied-topology snapshot. The master
heartbeat handler compares the reported version_hash against what
canonical_hash yields for the hydrated topology pinned to that host
and flags Topology.needs_resync on divergence (or when the agent
reports no topology at all while master expects one).

The mutator watch loop gains reconcile_agent_resyncs, which re-pushes
the current hydrated blob via AgentClient.apply_topology without
touching status, then clears the flag on success. Push failures leave
the flag set so the next tick retries.
2026-04-21 01:35:12 -04:00
05d1ebbaaa feat(engine): route agent-pinned topologies via AgentClient
deploy_topology and teardown_topology now branch on
target_host_uuid.  When set:

- Hydrate the topology locally (validator runs exactly as before).
- Compute canonical_hash; push {hydrated, version_hash} to the
  pinned agent through AgentClient.apply_topology.
- Status machine still moves PENDING -> DEPLOYING -> ACTIVE on 2xx,
  PENDING -> DEPLOYING -> FAILED on error; master remains the sole
  owner of the row.

Teardown flips to TEARING_DOWN, fires /topology/teardown, then
TORN_DOWN — we log a warning on agent error but still settle to
TORN_DOWN so operators can delete the row (agent garbage is cleaned
on the next re-enroll).

Unihost deploys are unchanged — the field defaults to NULL so every
existing flow takes the local path.

Step 6 of the agent <-> topology integration.
2026-04-21 01:27:59 -04:00
5f8a746d6e feat(swarm): AgentClient topology apply/teardown/state methods
Three new RPCs mirroring the existing deploy/teardown/status pattern:

- apply_topology(hydrated, version_hash) — long-timeout (600s) for
  image pulls + compose up.
- teardown_topology(topology_id) — 300s timeout; enough for a
  stubborn compose-down without hanging a heartbeat.
- get_topology_state() — short control-plane read for reconcile.

The per-call timeout swap uses the same trick as .deploy().

Step 5 of the agent <-> topology integration.
2026-04-21 01:26:21 -04:00
13cb0ff38e feat(agent): topology apply/teardown/state endpoints
New mTLS-protected routes on the agent:

- POST /topology/apply — master pushes {hydrated, version_hash}.
  Validates the hash matches locally (serialisation drift guard),
  runs the topology through the same validator/composer pipeline
  used master-side, then creates bridges + compose up + records the
  apply in topology.db.
- POST /topology/teardown — dismantles compose, removes bridges,
  clears topology.db.  Idempotent.
- GET /topology/state — returns applied row + live docker
  observation for the heartbeat.

Implementation lives in decnet/agent/topology_ops.py; it reuses the
private compose helpers from decnet.engine.deployer so we don't
duplicate compose/project-name plumbing.  The apply path is sync
under the hood (docker SDK + subprocess); we hop to a thread so the
event loop keeps servicing other agent traffic.

v1 is one-topology-per-agent; cross-topology apply returns 409.

Step 4 of the agent <-> topology integration.
2026-04-21 01:25:15 -04:00
aea3e7e05b feat(agent): sqlite-backed topology_store as applied-state cache
Single-row sqlite tracking which topology the agent last applied and
its version hash.  Sync/stdlib, same pattern as the log-forwarder
offset store.  v1 is one-topology-per-agent; attempting to apply a
different topology over a populated row raises AlreadyApplied so the
endpoint can return 409.  observed() snapshots live docker state
(decnet-topology-* bridges + decnet-* containers) for the heartbeat.

The store is a cache, not authority — no auto-restore on boot.
Master remains the only source of truth.

Step 3 of the agent <-> topology integration.
2026-04-21 01:22:01 -04:00
98465af226 feat(topology): canonical_hash for applied-state comparison
Tiny pure helper both master and agent will use to answer "is the
applied state the one we expect?".  SHA-256 of canonical JSON with
volatile keys (timestamps, status, version, canvas x/y/w/h) stripped
so the hash only captures deployment-relevant state.

Step 2 of the agent <-> topology integration.
2026-04-21 01:20:42 -04:00
5a0cf5d7c8 feat(topology): add target_host_uuid to pin topologies to swarm agents
Adds the `target_host_uuid` FK on `Topology` plus wiring through the
two create endpoints (`POST /topologies`, `POST /topologies/blank`).
Validates the mode/host pair: `mode='agent'` now requires a known,
routable host; `mode='unihost'` must leave the field unset.
Surfaced on `TopologySummary` so list/detail responses expose it.
Purely additive at the schema level — existing unihost flows unchanged
(field defaults to `NULL`).

Step 1 of the agent <-> topology integration.
2026-04-21 01:19:45 -04:00
b261e8e5fa feat(topology): add teardown endpoint + UI button
Active/degraded/failed/deploying topologies cannot be deleted
without first transitioning to torn_down, but the UI had no way
to trigger that. Add POST /topologies/{id}/teardown mirroring the
deploy endpoint (background task, 202 Accepted), and a
click-to-arm TEARDOWN button on the topology list card that shows
whenever the row is in a teardown-eligible state.
2026-04-20 23:41:37 -04:00
c37d1f09c6 feat(deployer): warn when userland-proxy masks attacker source IPs
MazeNET publishes gateway ports on the host via Docker. With the
default userland-proxy enabled, attacker connections appear to
originate from the bridge gateway instead of the real remote IP.
Log a soft warning at deploy time when the topology publishes any
ports and docker info reports UserlandProxy=true, pointing the
operator at the daemon.json toggle. Best-effort: daemon talk
failures silently no-op.
2026-04-20 23:37:59 -04:00
4d2e38f616 fix(network): sweep orphan Docker bridges that squat on our subnet
A prior half-torn-down topology can leave a bridge network alive under
a different name that still owns our intended subnet.  Docker then
rejects our create with 'Pool overlaps with other one on this address
space', and the topology deploy fails.

Extend create_bridge_network to sweep any unused bridge whose IPAM
subnet matches the one we're about to claim (skipping networks with
running containers — those are live use).
2026-04-20 23:19:42 -04:00
d22922fc72 fix(topology): backfill decky_config name and ips_by_lan in hydrate
UI-created deckies (api_decky_crud, api_create_blank_topology) write
decky_config as sent by the client — typically just archetype flags,
without the name/ips_by_lan fields compose.py requires.  The generator
path populates them at persist() time, so compose worked for generated
topologies but KeyError'd on UI-created ones.

Normalise in hydrate() so every write path feeds the same shape
downstream: mirror decky.name into decky_config.name, and allocate
per-LAN IPs deterministically (reserving the primary decky.ip where it
falls in-subnet, then filling remaining edges with next_free).
2026-04-20 23:19:32 -04:00
2c35d60d45 feat(mazenet): host port-collision warning at deploy time
Add check_no_host_port_collision: enumerate the ports the topology's
gateways will publish (forwards_l3=True × svc.ports), probe live
listeners via psutil, emit a 'warning'-severity PORT_COLLISION
issue per overlap. Live-only — invoked from deploy_topology just
after dry-run branching, so unit tests that exercise validate()
stay hermetic.

Warning rather than error because docker-compose up will hard-fail
on a real collision anyway; this just gives operators a cleaner log
line ahead of the compose failure.
2026-04-20 23:07:31 -04:00
be4e1b1891 feat(mazenet): auto-bridge new LANs to the DMZ gateway
When a non-DMZ LAN is created via POST /lans, look up the topology's
gateway (decky with forwards_l3=True attached to the DMZ) and insert
an edge binding it to the new LAN. The gateway becomes multi-homed
to every internal LAN automatically, so DMZ_ORPHAN cannot arise
from ordinary editor use.

Also fixes delete_lan: the home-decky guard used scalar_one_or_none,
which blew up when the gateway already had >1 'other' LAN edge.
Switch to scalars().first() — we only need to know *some* other
edge exists, not a unique one.
2026-04-20 23:07:19 -04:00
3618c59d08 feat(mazenet): publish gateway service ports via docker
Gateway deckies (forwards_l3=True) are the DMZ's ingress. Their
service containers share the base namespace via network_mode:service,
so any listener inside the gateway is reachable through the base
container's published ports. Emit 'ports: [<p>:<p>, ...]' on the
gateway base from svc.ports across the decky's service list.

This is the principled replacement for the broken network_mode: host
stub — with docker-proxy publishing, the DMZ works on any single-NIC
VPS (no MACVLAN, no promiscuous mode required).
2026-04-20 23:07:07 -04:00
cc9765e54e fix(mazenet): drop fictional host-mode on DMZ gateway stub
POST /topologies/blank seeded the gateway decky with
archetype=host-gateway + network_mode=host, but neither was wired:
no compose writer reads network_mode and host-gateway is not a real
archetype. Replace with archetype=deaddeck + forwards_l3=true so the
gateway is a normal multi-homed bridge decky, consistent with how
compose.py interprets forwards_l3 (sysctl + NET_ADMIN).

Edge marked is_bridge=true, forwards_l3=true so downstream readers
(generator, compose, validator) see a real bridge attachment.
2026-04-20 23:06:54 -04:00
897ce4035f fix(sniffer): mark JA3/JA3S MD5 hashing as non-security
JA3/JA3S fingerprints are defined by their specs as MD5 digests of
the ClientHello/ServerHello feature tuples — they are identifiers,
not security primitives. Pass usedforsecurity=False at the two call
sites so bandit stops flagging them as B324 High when scanning
outside the templates/ exclude.
2026-04-20 23:06:31 -04:00
d06b04221f feat(api/topology): live mutation queue endpoints (POST/GET /mutations) 2026-04-20 19:38:55 -04:00
ff0b2efbb0 feat(api/topology): pending-only child CRUD for LANs, deckies, edges 2026-04-20 19:37:16 -04:00
999113e3c3 feat(api/topology): POST/DELETE/deploy endpoints for MazeNET topologies 2026-04-20 19:34:35 -04:00
38db76dd14 fix(api): document 400 on topology read endpoints for schemathesis contract
DECNET's app-level RequestValidationError handler remaps structural
422→400, including query/path constraint violations (limit bounds,
the next-subnet base pattern, etc.).  Schemathesis fuzzing will drive
those code paths and fail response_schema_conformance unless 400 is
declared in responses={}.  Adds the entry to every phase-3 read route.
2026-04-20 18:30:32 -04:00
f182c98ffa feat(api): phase 3 step 2 — topology read endpoints (list/get/status/catalog)
GET /api/v1/topologies — paginated list with status filter. Extends
repo.list_topologies() to accept limit/offset and adds count_topologies()
for the total envelope field.

GET /api/v1/topologies/{id} — hydrated TopologyDetail; 404 if missing.
GET /api/v1/topologies/{id}/status-events — audit trail, limit-capped.

Catalog helpers for the phase-4 canvas UI:
* GET /topologies/services — full service catalog.
* GET /topologies/next-subnet?base=172.20 — wraps SubnetAllocator against
  reserved_subnets across non-torn-down topologies.
* GET /topologies/{id}/lans/{lan_id}/next-ip — IPAllocator pre-seeded
  with existing decky IPs in that LAN.

All read routes are viewer-or-admin. Sub-routers are included in an
order that keeps literal catalog paths (/services, /next-subnet) from
being shadowed by the /{topology_id} trie branch.
2026-04-20 18:25:33 -04:00
2379b2aeda feat(api): phase 3 step 1 — topology request/response models + router skeleton
Add Pydantic DTOs in decnet/web/db/models.py covering every phase-3
endpoint shape: TopologyGenerateRequest, TopologySummary/Detail, child
create/update requests, MutationEnqueueRequest (Literal op guard),
MutationRow with JSON-payload decoder, validation/version/not-editable
error envelopes, and the three catalog responses.

Create decnet/web/router/topology/ as an import-safe package exporting
topology_router (prefix /topologies) — sub-routers land step-by-step in
subsequent commits. Mount under the main api router alongside swarm_mgmt.

tests/api/topology/test_models.py pins repo-dict ↔ DTO parity so future
repo-row drift breaks the contract test before the endpoints.
2026-04-20 18:16:30 -04:00
a76b9ecdf9 feat(mazenet): step 7 — topology_mutations queue + mutator reconciler
Adds the live-mutation pipeline for active/degraded topologies:

* TopologyMutation table with composite index (state, topology_id)
  so the watch-loop guard query stays O(log n).
* claim_next_mutation is a single atomic UPDATE ... WHERE
  state='pending' so racing reconcilers deterministically pick one
  winner; losers see rowcount=0 and skip.
* reconcile_topologies drains pending rows per live topology, applies
  via decnet.mutator.ops.dispatch, and on failure marks the mutation
  failed + transitions topology to degraded.
* run_watch_loop gains a gated branch: flat-fleet mutate_all runs
  every tick unchanged; the reconciler only enters when the cheap
  has_pending_topology_mutation guard returns True.
* apply_* ops re-check hard invariants (names, IP collisions, subnet
  overlap, known services, service_config shape) after every mutation
  so the repo never lands in an invalid state.
* CLI: 'decnet topology mutate' / 'mutations' subcommands.
2026-04-20 18:02:37 -04:00
91df57d36b feat(topology): pending-only mutation repo methods with cascade + guards
MazeNET phase 2 step 6. Equips the repo layer with the CRUD the web
editor needs before deploy.

- TopologyNotEditable exception: raised when a pending-only method hits
  a non-pending topology. The intent is "free-form edits stop at deploy;
  the mutator (step 7) takes over for live topologies."
- _assert_pending helper checks status inside the session.
- update_lan / update_topology_decky accept enforce_pending=True for
  pre-deploy callers (existing internal callers default to False so
  behavior is unchanged).
- delete_lan: cascades edges; refuses if any decky has only one edge
  (= this LAN is its home) to prevent orphans.
- delete_topology_decky: cascades edges.
- delete_topology_edge: bare-bones removal.

All four mutators accept expected_version for optimistic concurrency.
Existing tests continue to pass (no behavior change for persist/deploy).
2026-04-20 17:50:29 -04:00
9afaac7612 feat(topology): nullable layout coords on LAN + TopologyDecky
MazeNET phase 2 step 5. Pure storage — the generator emits None for
x/y and the web canvas fills them in later. No logic changes; no
compose, deploy, or validator impact.
2026-04-20 17:48:29 -04:00
e475c0957e feat(topology): optimistic concurrency via Topology.version + expected_version
MazeNET phase 2 step 4. Readies the repo layer for concurrent editors
(web canvas + CLI + mutator) without lost-write races.

- Topology.version: monotonically bumped on supervised child-row writes.
- VersionConflict exception carries {current, expected} for the UI.
- _check_and_bump_version helper reads Topology in the same session,
  compares against expected_version, raises on mismatch, bumps on match.
  Commit happens in the caller's existing transaction so check+bump+write
  are atomic per mutation.
- add_lan / update_lan / add_topology_decky / update_topology_decky /
  add_topology_edge accept expected_version=None by default, preserving
  every existing caller's behavior.

When expected_version is None, no check runs and version stays put —
internal callers (persist) that don't care about concurrency keep
working unchanged.
2026-04-20 17:47:28 -04:00
2544d0294a feat(topology): add pre-deploy validator and wire into deploy_topology
MazeNET phase 2 step 3. Blocks deploys of hand-authored topologies that
would fail mid-bring-up (orphan deckies, duplicate IPs, overlapping
subnets, unknown services) with a structured error list instead of a
docker error at startup.

Rules (one function each, composable by the editor for inline hints):
- exactly one DMZ
- every LAN has a bridge chain to the DMZ (BFS via multi-homed deckies)
- no orphan deckies
- unique LAN and decky names per topology
- no IP collisions + IPs inside their LAN's subnet
- no LAN subnet overlaps
- every service in decnet.fleet.all_service_names()
- service_config keys match the decky's declared services

deploy_topology runs the validator after hydrate, before any status
transition or Docker call; errors raise ValidationError and status
stays at pending.
2026-04-20 17:45:32 -04:00
d4f4c58277 feat(topology): thread per-service config overrides through compose
MazeNET phase 2 step 2. Mirrors the flat-fleet service_config pattern
(DeckyConfig.service_config → composer → svc.compose_fragment) into the
topology compose pipeline, so a hand-authored decky can carry overrides
like {"ssh": {"password": "megapassword"}} and the ssh fragment reads
them just like the flat path does.

- _PlannedDecky gains service_config: dict[str, dict].
- persist() stores it under decky_config["service_config"].
- topology/compose.py passes cfg.get("service_config", {}).get(svc, {})
  to svc.compose_fragment(service_cfg=...).

Schema unchanged — service_config lives inside the existing
decky_config JSON blob. Zero changes in decnet/services/*.
2026-04-20 17:42:37 -04:00
1bd1846e40 feat(topology): extract IP + subnet allocators as reusable services
MazeNET phase 2 step 1. Pulls inline IP/subnet allocation out of the
generator into decnet/topology/allocator.py so the editor + reconciler
can reuse the same primitives without duplicating logic.

- IPAllocator: stateful host-IP handout with reserve/release/is_free.
- SubnetAllocator: /24 handout under a base prefix, skips reservations.
- reserved_subnets(repo): collects claimed subnets across every
  non-torn_down topology so concurrent drafts cannot collide.
- generate() accepts reserved_subnets= to skip existing claims.

Generator output is byte-identical under seed (behavior preserved).
2026-04-20 17:41:17 -04:00
80e3c28234 test(topology): deploy dry-run + failure-path + live docker e2e
Covers dry-run compose emission (no status change), FAILED transition
with reason logged on daemon errors, teardown from FAILED, and a
live-marked end-to-end test that creates/removes bridge networks
against a real docker daemon (skipped on CI).
2026-04-20 16:57:43 -04:00