First step toward making mutation events first-class nodes in the
correlation graph. Today the graph silently reflects post-mutation
state with no marker of the transition; this helper lands the
emitter the mutator and deploy paths will call.
- decnet/mutator/events.py: emit_decky_mutated(bus, *, decky,
old_services, new_services, trigger, actor=None, log_path=None)
writes an RFC 5424 line (service=mutator, hostname=<decky>,
MSGID=decky_mutated, SD params for old/new services + trigger +
optional actor) to DECNET_INGEST_LOG_FILE, then fire-and-forget
publishes on decky.<id>.mutation. Either side failing is soft —
the other path still completes.
- MutationTrigger Literal covers creation, retirement, scheduled,
operator, behavioral, healer, federation. Reserved values for v2/v3
(behavioral + federation) stay nullable so the schema is stable.
- decnet/bus/topics.py: DECKY_MUTATION constant + decky_mutation(id)
builder. Distinct from DECKY_STATE ("current shape") because a
mutation is a transition event, not a steady-state snapshot.
- Empty-set symmetry: creation emits old_services=[], retirement
emits new_services=[]. Every decky lifecycle becomes a well-formed
fold sequence on the correlator side.
- 4 new tests: FakeBus + correlator parser round-trip; creation and
retirement empty-set cases; bus=None still writes syslog;
unwritable log path doesn't block bus publish. 95 tests green
across test_mutator + tests/bus.
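As a rough illustration of the syslog half of the emitter, the sketch below assembles an RFC 5424 line with the fields listed above. The SD-ID (decnet@55555), the comma-joined service lists, the PRI value, and the function name are all assumptions made for the example; the real emitter in decnet/mutator/events.py may differ.

```python
from datetime import datetime, timezone
from typing import Optional, Sequence

# Hypothetical SD-ID: the real structured-data ID is not stated in this change.
SD_ID = "decnet@55555"


def _sd_escape(value: str) -> str:
    # RFC 5424 section 6.3.3: escape backslash, double quote and closing bracket.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("]", "\\]")


def format_decky_mutated(
    decky: str,
    old_services: Sequence[str],
    new_services: Sequence[str],
    trigger: str,
    actor: Optional[str] = None,
    now: Optional[datetime] = None,
) -> str:
    """Build one RFC 5424 line for a decky_mutated event (illustrative only)."""
    ts = (now or datetime.now(timezone.utc)).isoformat()
    params = {
        # Assumption: service lists are serialised as comma-joined strings.
        "old_services": ",".join(old_services),
        "new_services": ",".join(new_services),
        "trigger": trigger,
    }
    if actor is not None:
        params["actor"] = actor
    sd = " ".join(f'{key}="{_sd_escape(val)}"' for key, val in params.items())
    # PRI 134 = facility local0 (16) * 8 + severity info (6); PROCID is "-".
    return f"<134>1 {ts} {decky} mutator - decky_mutated [{SD_ID} {sd}]"
```

Creation and retirement fall out of the same call: pass an empty old_services for creation and an empty new_services for retirement.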
The flat-fleet mutator was DB-poll-only and noisy — it logged
"no active deployment found" every 10s on idle hosts and ran
mutate_all at a fixed tick regardless of when the next decky
was due.
- mutate_all returns seconds-until-next-due; watch loop sleeps
min(next_due, poll_interval_secs) with a 1s floor.
- "No deployment" is now idle, not an error: edge-triggered log
on present<->absent transition instead of every tick.
- mutate_decky publishes decky.<name>.state on successful compose
so UIs react in real time.
- New decky.*.mutate_request subscription lets API/CLI/UI force
an immediate mutation of a specific decky without waiting for
its interval; target name feeds mutate_all(only={...}).
- system.mutator.health heartbeat via run_health_heartbeat helper,
bringing the mutator in line with DEBT-031 workers.
Tests: next_due return, only= filter, decky.<name>.state publish
on success, no publish on compose failure. Full mutator+topology-
mutator+bus suite (109) green.
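The sleep calculation and the edge-triggered log can be sketched as below. The function and class names are hypothetical, and treating a None return from mutate_all as "nothing scheduled" is an assumption, not something this change states.

```python
from typing import List, Optional


def next_sleep(next_due: Optional[float], poll_interval_secs: float,
               floor: float = 1.0) -> float:
    """Sleep until the next decky is due, capped by the poll interval."""
    if next_due is None:
        # Assumption: None means nothing is scheduled, so fall back to polling.
        return max(floor, poll_interval_secs)
    return max(floor, min(next_due, poll_interval_secs))


class DeploymentEdgeLogger:
    """Records a message only when deployment presence changes."""

    def __init__(self) -> None:
        self._present: Optional[bool] = None
        self.messages: List[str] = []

    def observe(self, present: bool) -> None:
        if present == self._present:
            return  # same state as the previous tick: stay quiet
        self._present = present
        self.messages.append(
            "deployment present" if present else "no active deployment; idling"
        )
```

The 1s floor keeps an overdue decky (a due time of zero or less) from turning the watch loop into a busy spin.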
All nine service workers now participate in the host-local bus: sniffer,
prober, correlator (via profiler), profiler, collector, ingester, agent,
forwarder, updater. Pre-bus behavior is preserved end-to-end for
DECNET_BUS_ENABLED=false and get_bus() failures.
Three items are intentionally deferred: realism-probe decky.{id}.state
(needs a realism probe path that doesn't exist yet), correlator session
boundaries (needs session state), and bus-wake subscriptions (publishes
have landed, but no subscriber is wired to the wake side yet).
All three workers now share a run_health_heartbeat helper in
decnet.bus.publish. Each publishes system.<worker>.health on a 30s tick
with {worker, ts} plus optional per-worker extras. Subscribers can
watch system.*.health to see every DECNET worker on a host at once.
- agent: heartbeat runs inside the FastAPI lifespan alongside the
existing master-facing heartbeat; bus-disabled path is a no-op.
- forwarder: heartbeat task spawned at run_forwarder entry, cancelled
in the finally block so a crashed master loop never leaks the task.
- updater: new FastAPI lifespan hosts the heartbeat.
Heartbeat helper swallows extra() failures and is cancellation-safe so
lifespan teardown never hangs on it.
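A heartbeat loop with these properties might look roughly like the sketch below. The signature (a publish coroutine, an interval argument, an extra callable) is an assumption for illustration and is not the actual run_health_heartbeat API in decnet.bus.publish.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict, Optional

PublishFn = Callable[[str, Dict[str, Any]], Awaitable[None]]


async def run_health_heartbeat_sketch(
    publish: PublishFn,
    worker: str,
    interval: float = 30.0,
    extra: Optional[Callable[[], Dict[str, Any]]] = None,
) -> None:
    """Publish system.<worker>.health forever, until the task is cancelled."""
    topic = f"system.{worker}.health"
    while True:
        payload: Dict[str, Any] = {"worker": worker, "ts": time.time()}
        if extra is not None:
            try:
                payload.update(extra())
            except Exception:
                pass  # a broken extras hook must not stop the heartbeat
        try:
            await publish(topic, payload)
        except Exception:
            pass  # publish failures are soft; try again on the next tick
        # CancelledError derives from BaseException, so neither except clause
        # above swallows it and cancellation ends the loop promptly.
        await asyncio.sleep(interval)
```

Catching Exception rather than BaseException is what keeps the loop cancellable: task.cancel() during lifespan teardown propagates out of the sleep or the publish await.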
Ingester connects the bus at startup, emits a batch-committed summary
(component/flushed/position) after each successful _flush_batch. Zero-
row flushes are suppressed so the topic stays meaningful.
Complements the collector's per-line system.log publishes: collector
signals ingress, ingester signals DB-persisted progress. Federation
forwarder (worker 8) will subscribe to the batch-committed leaf to
trigger its upstream push.
Bus stays optional: publish_safely swallows failures, get_bus() can
return None, DECNET_BUS_ENABLED=false leaves the ingestion loop fully
functional.
log_collector_worker connects the bus at startup, builds a thread-safe
system.log publisher, and hands it to each container-stream thread
through _stream_container's new publish_fn parameter. Publishing fires
right after the JSON record is written — same rate-limiter path, no
extra parsing, compact payload (decky/service/event_type/attacker_ip/
timestamp) so subscribers can redraw without re-reading the DB.
Bus stays optional: if get_bus() fails or DECNET_BUS_ENABLED=false the
factory returns a no-op publisher and the stream thread calls it
unconditionally. Hook failures are logged and never abort the thread.
The profiler worker threads its bus publisher through _WorkerState so
_update_profiles can emit a compact attacker.scored event for every
upsert. Payload carries the headline counts (event/service/decky/
bounty/credential) plus is_traversal, so the MazeNET attacker pool can
redraw without a round-trip.
Bus stays optional: publish_attacker=None when DECNET_BUS_ENABLED=false
or get_bus() fails, and hook exceptions are logged without breaking the
upsert path.
CorrelationEngine gains an optional publish_fn hook fired once per unique
attacker IP. The profiler worker, the sole caller of the engine today,
owns the bus connection, builds a thread-safe publisher, and wraps it
with the attacker.observed topic before handing it in.
Bus stays optional: if get_bus() fails or DECNET_BUS_ENABLED=false, the
engine runs publish_fn=None and the worker degrades to DB-only. Hook
failures log a warning and never break ingestion.
Each successful JARM / HASSH / TCPfp probe fans out an
attacker.fingerprinted event; the probe family goes in event.type so a
single subscription covers all three. Payload carries the attacker IP,
port, and probe-specific hash — enough for the MazeNET live map to
render fingerprint info on observed attackers.
Lifts the thread-safe publisher helper out of the sniffer worker into
decnet/bus/publish.py so the prober (and every future worker with a
to_thread hot path) can reuse it without copy-pasting the
run_coroutine_threadsafe dance. Sniffer rewires onto the shared helper
in passing.
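For readers unfamiliar with the pattern, a thread-safe publisher of this kind can be sketched as follows. The factory name and its parameters are hypothetical; only asyncio.run_coroutine_threadsafe is taken from the description above.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

PublishFn = Callable[[str, Dict[str, Any]], Awaitable[None]]


def make_threadsafe_publisher(
    loop: asyncio.AbstractEventLoop,
    publish: PublishFn,
    topic: str,
) -> Callable[[Dict[str, Any]], None]:
    """Return a sync callable that worker threads can use to publish."""

    def _publish(payload: Dict[str, Any]) -> None:
        coro = publish(topic, payload)
        try:
            # Schedule the coroutine on the loop thread and return at once;
            # the returned concurrent.futures.Future is deliberately dropped
            # because publishes are fire-and-forget.
            asyncio.run_coroutine_threadsafe(coro, loop)
        except RuntimeError:
            # The loop is already closed (e.g. during shutdown): drop the event.
            coro.close()

    return _publish
```

A thread started with asyncio.to_thread (or a scapy sniff thread) can then call the returned function like any ordinary sync hook.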
Adds ATTACKER_FINGERPRINTED as a new leaf — distinct from
ATTACKER_OBSERVED (correlator's first-sight signal) because an active
probe result is additional evidence about an already-observed attacker.
Note: the plan's decky.{id}.state realism-probe publish path is
deferred — the current prober fingerprints attackers, not decky
realism. Will revisit when realism probes exist.
SnifferEngine gains an optional publish_fn hook, invoked after the
dedup + syslog write for traffic-summary events only (tls_session,
tcp_flow_timing, tcp_syn_fingerprint) — intermediate parser artifacts
like tls_client_hello stay off the bus.
The sniffer worker wires get_bus() + a thread-safe shim that marshals
sync calls from the scapy sniff thread back onto the asyncio loop via
run_coroutine_threadsafe. Bus failure at startup degrades cleanly to
publish-off mode; publish failures at runtime never escape the sniff
thread.
Shared publish_safely helper at decnet/bus/publish.py so the nine
workers about to be wired into the bus don't each copy-paste the
"never raise back at the caller" contract. Mutator drops its private
copy and imports the canonical one.
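The "never raise back at the caller" contract could be expressed along these lines. This is a sketch only: the bus.publish(topic, payload) signature and the boolean return value are assumptions, not the confirmed shape of the canonical helper.

```python
import logging
from typing import Any, Dict, Optional

log = logging.getLogger("decnet.bus.sketch")


async def publish_safely_sketch(
    bus: Optional[Any], topic: str, payload: Dict[str, Any]
) -> bool:
    """Publish if a bus is available; report success without ever raising."""
    if bus is None:
        return False  # bus disabled or unavailable: silently skip
    try:
        # Assumption: the bus exposes an async publish(topic, payload) method.
        await bus.publish(topic, payload)
        return True
    except Exception as exc:
        log.warning("bus publish to %s failed: %s", topic, exc)
        return False
```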
topics.py gains the attacker.* hierarchy (observed, scored,
session.started, session.ended) and a system_health(worker) builder
for per-worker health heartbeats — both prerequisites for the worker
rollout under DEBT-031.
Per-worker integration of the service bus shipped in DEBT-029. Publishes
are fire-and-forget; subscribes wake polling loops. Bus stays optional —
if get_bus() fails or DECNET_BUS_ENABLED=false, workers log once and
continue in poll-only mode (mirrors decnet/mutator/engine.py:run_watch_loop).
- scripts/bus/smoke-mutator.sh: boots decnet bus, subscribes to
topology.>, publishes one event per mutation-lifecycle state plus
a topology.status transition, asserts all four land on the
subscriber. Cheap E2E for the topic hierarchy the mutator + SSE
route rely on.
- development/DEBT.md: mark DEBT-030 ✅ resolved (Phase A) with a
summary of what shipped; flag the optimistic staged-buffer editor
as Phase B follow-up, not debt.
- tests/topology/test_mutator.py: reconcile_topologies publishes
applying+applied on success, applying+failed+status on failure; and
stays safe when bus=None. _wake_on_enqueue sets its asyncio.Event
on every matching enqueue event.
- tests/api/topology/test_mutations.py: POST /mutations publishes
mutation.enqueued after a successful DB write, via a FakeBus
injected in place of the app-wide bus singleton.
- tests/api/topology/test_events_stream.py: SSE route returns 401
unauthenticated, 404 for unknown topologies, and (driving the
async generator directly) emits a snapshot on connect plus
forwards a published mutation.applied as an `event: mutation.applied`
SSE frame.
Wire the MazeNET editor to the new /topologies/{id}/events SSE route
so live (active|degraded) topologies reflect mutator state transitions
without reload:
- useTopologyStream hook opens an EventSource against
/topologies/{id}/events?token=<jwt>, with 3s reconnect matching the
dashboard's /stream consumer. Callback refs avoid tearing down the
connection on consumer rerenders.
- useMazeApi gains enqueueMutation(topologyId, op, payload,
expectedVersion?) — thin wrapper over POST /mutations.
- MazeNET.tsx opens the stream only when topoStatus is active|degraded
(pending editors have nothing to stream) and refetches on
mutation.applied|failed|status events. Header shows a LIVE /
CONNECTING… indicator.
Phase A slice — Apply (N changes) with an optimistic staged buffer
lands in a follow-up; the hooks + API method it'll need are already
here.
Wire the mutator and web API into the service bus so live-topology
edits flow sub-second from enqueue to UI:
- Mutator publishes every state transition on the bus (mutation.applying
/applied/failed + topology.status). Fire-and-forget; DB stays source
of truth.
- Mutator watch loop subscribes to topology.*.mutation.enqueued and
wakes early via asyncio.Event — the 10s poll becomes a fallback
heartbeat, not the primary dispatch trigger.
- POST /topologies/{id}/mutations publishes mutation.enqueued after
the DB write succeeds.
- New GET /topologies/{id}/events SSE route: snapshot on connect
(status + in-flight mutations), live forwards topology.{id}.>
bus events, 15s keepalive. ?token= auth mirrors /stream.
- New decnet/bus/app.py — process-wide lazy bus singleton for the
API, closed cleanly on lifespan shutdown.
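The early-wake behaviour of the watch loop can be sketched with asyncio.wait_for over an asyncio.Event. The helper name is hypothetical; the real loop lives in the mutator engine and may be structured differently.

```python
import asyncio


async def wait_for_wake(wake: asyncio.Event, poll_interval_secs: float) -> bool:
    """Block until the bus sets `wake` or the poll interval elapses.

    Returns True when woken by an enqueue event, False on poll timeout.
    """
    try:
        await asyncio.wait_for(wake.wait(), timeout=poll_interval_secs)
    except asyncio.TimeoutError:
        return False  # fallback heartbeat: nothing arrived on the bus
    wake.clear()  # re-arm so the next enqueue wakes the loop again
    return True
```

Either outcome leads to the same reconcile pass, so a lost bus message costs at most one poll interval of latency.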
start.sh boots a local bus on /tmp (no root, no decnet group).
sub.py / pub.py are thin CLIs over UnixSocketBus for manual poking.
smoke.sh is a self-contained end-to-end check — spawns a worker,
subscribes, publishes, asserts delivery, cleans up.
Land the `decnet bus` worker and `get_bus()` factory. Transport is a
host-local UNIX-domain socket (0660, group=decnet); authz is the file
mode. Wire framing is a tiny verb-line + 4-byte-BE length + orjson body.
NATS-style wildcard topics (`*`, `>`). At-most-once, fire-and-forget —
DB stays the source of truth. `FakeBus` / `NullBus` for tests and the
disabled path. Cross-host federation is deferred to a future
`--bridge-tcp` mode; DEBT-030 is master-only and unblocked.
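To make the framing and wildcard rules concrete, here is a minimal sketch. The exact verb-line layout ("VERB topic" followed by a newline) is an assumption, and the standard-library json module stands in for orjson so the example has no third-party dependency.

```python
import json
import struct
from typing import Any, Dict, Tuple


def encode_frame(verb: str, topic: str, payload: Dict[str, Any]) -> bytes:
    """Verb line, then a 4-byte big-endian body length, then the JSON body."""
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    header = f"{verb} {topic}\n".encode("utf-8")  # assumed verb-line layout
    return header + struct.pack(">I", len(body)) + body


def decode_frame(data: bytes) -> Tuple[str, str, Dict[str, Any]]:
    header, _, rest = data.partition(b"\n")
    verb, topic = header.decode("utf-8").split(" ", 1)
    (length,) = struct.unpack(">I", rest[:4])
    body = rest[4:4 + length]
    if len(body) != length:
        raise ValueError("truncated frame")
    return verb, topic, json.loads(body)


def topic_matches(pattern: str, topic: str) -> bool:
    """NATS-style match: '*' is exactly one token, '>' is one or more trailing."""
    p_tokens = pattern.split(".")
    t_tokens = topic.split(".")
    for i, tok in enumerate(p_tokens):
        if tok == ">":
            return len(t_tokens) > i  # needs at least one remaining token
        if i >= len(t_tokens):
            return False
        if tok != "*" and tok != t_tokens[i]:
            return False
    return len(p_tokens) == len(t_tokens)
```

The length prefix lets a reader pull exactly one body off the socket without scanning for a delimiter inside the JSON.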
Port the design-handoff layout into a scoped DeckyFleet.css (no more
piggybacking on Dashboard.css). Add an archetype-first creation wizard
that consumes /api/v1/topologies/archetypes, falling back to the
MazeNET ARCHETYPES constant when the endpoint is unavailable.
Canvas grew a deployed prop so nodes can visually distinguish "live in
docker" from "planned". ContextMenu learned nested submenus with
ChevronRight affordance; NetBox renders a ShieldAlert for DMZ LANs;
Palette got additional lucide icons. Dead PendingChange union pulled
out of types.ts — Phase-3 mutation ops are driven by the API layer now,
not a frontend type.
New /topologies page lists topologies; a bare /mazenet now redirects
there since the editor has no meaning without ?topology=<id>. Wizard
picks up a note style + tweaked copy.
test_compose asserts the new decnet.topology.* labels land on both base
deckies (role=base, no service marker) and service fragments
(service=true). The stub docker client in test_deploy grew a filters
kwarg so it keeps matching the real .networks.list(filters=...) call
signature now used by the deployer.
/api/v1/topologies/archetypes returns the archetype registry (slug,
display name, description, preferred services/distros, nmap_os
fingerprint) so the frontend wizard can render a live catalog instead
of hardcoding a copy.
The web bundle proxy handled GET/POST/PUT/DELETE but not PATCH or
preflight OPTIONS, which broke browser calls to PATCH endpoints behind
the static-bundle server. CORS middleware had the same gap.
db reset drops-and-recreates a fixed table set in FK order. Topology
tables weren't in the list, so reset left orphan topology rows behind
and a fresh MazeNET deploy could collide with stale child records.
topology delete cascades children (LANs, deckies, edges, mutations) but
refuses while containers are still running — teardown is prerequisite.
show stopped assuming every decky carried a full decky_config blob;
MazeNET-generated deckies only get hydrated on deploy, so fall back to
top-level name/services when the config isn't there.
Legacy fleet deckies live in decnet-state.json; MazeNET topology
containers don't. Tag them at compose-time with
decnet.topology.service=true and let the collector match on that label.
Spin up the agent's log collector on the first successful /topology/apply
(not in the lifespan — that would break the no-docker-on-boot invariant)
and tear it down with the app. Land log lines in DECNET_AGENT_LOG_FILE,
separate from master-side DECNET_INGEST_LOG_FILE, so a dev box running
both roles can't forward its own ingest back to itself.
When master pushes a topology that differs from whatever is pinned
locally, tear down the predecessor and accept the new one. Refusing with
409 left the agent stranded after partial deploys. record_error now
persists the hydrated blob so a later teardown can still walk the LAN
list — otherwise a half-failed apply strands containers + bridges with
no breadcrumb back to them.
Replaces the single-line name input with a modal that mirrors the
design-handoff DeployWizard shape (backdrop + violet-bordered panel,
wizard-step tabs, card-picker body):
- Step 1 — TARGET: a RUN LOCALLY card plus one card per enrolled
swarm host. Non-routable hosts render disabled with their status as
the tooltip. Selecting an agent pins the topology via
target_host_uuid; local stays unihost.
- Step 2 — TYPE: BLANK (POST /topologies/blank) or SEED-BASED
(POST /topologies/ with depth, branching, deckies-per-LAN, optional
seed). Name is required on both.
Existing navigate-to-editor-on-create behavior is preserved.
Two small observability follow-ups to the phase-1 agent/topology wiring:
TopologySummary now carries needs_resync so operators can see the
heartbeat's resync flag via the topology list/detail API without
dropping into the DB.
TopologyStore.record_error becomes an upsert — when a docker/compose
failure fires during the first materialise (put() never reached), we
still land a marker row so GET /topology/state surfaces the error and
the next heartbeat carries an empty applied_version_hash. That empty
hash is what master's heartbeat check relies on to flag the topology
for resync instead of assuming the apply succeeded.
Four regression tests guarding Step 8 of the agent/topology wiring:
- Lifespan startup must not call docker.from_env even with a populated
topology.db — replace docker with a boom-stub and assert zero calls.
- GET /topology/state returns the cached row verbatim without
re-materialising bridges/containers; live observation is read-only.
- Static guard: TopologyStore must not grow a restore/replay/reapply
method without someone re-reading the module docstring.
- Raw sqlite read + a second TopologyStore instance confirm the store
is passive — nothing scrubs stale rows on open, which is the
behaviour master's resync flow depends on.
Agent heartbeats now carry an applied-topology snapshot. The master
heartbeat handler compares the reported version_hash against what
canonical_hash yields for the hydrated topology pinned to that host
and flags Topology.needs_resync on divergence (or when the agent
reports no topology at all while master expects one).
The mutator watch loop gains reconcile_agent_resyncs, which re-pushes
the current hydrated blob via AgentClient.apply_topology without
touching status, then clears the flag on success. Push failures leave
the flag set so the next tick retries.
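The divergence check and the retry-on-failure behaviour can be sketched as below. The dict-based topology records and both function names are hypothetical stand-ins for the real models and for AgentClient.apply_topology.

```python
from typing import Any, Callable, Dict, List, Optional


def needs_resync(expected_hash: Optional[str], reported_hash: Optional[str]) -> bool:
    """True when the agent's reported hash diverges from what master expects."""
    if expected_hash is None:
        return False  # master pins no topology to this host
    # A missing or empty report also counts as divergence.
    return reported_hash != expected_hash


def reconcile_resyncs(
    topologies: List[Dict[str, Any]],
    push: Callable[[Dict[str, Any]], None],
) -> List[str]:
    """Re-push flagged topologies; clear the flag only when the push succeeds."""
    cleared: List[str] = []
    for topo in topologies:
        if not topo.get("needs_resync"):
            continue
        try:
            push(topo)
        except Exception:
            continue  # leave the flag set so the next tick retries
        topo["needs_resync"] = False
        cleared.append(topo["id"])
    return cleared
```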
deploy_topology and teardown_topology now branch on
target_host_uuid. When set:
- Hydrate the topology locally (validator runs exactly as before).
- Compute canonical_hash; push {hydrated, version_hash} to the
pinned agent through AgentClient.apply_topology.
- Status machine still moves PENDING -> DEPLOYING -> ACTIVE on 2xx,
PENDING -> DEPLOYING -> FAILED on error; master remains the sole
owner of the row.
Teardown flips to TEARING_DOWN, fires /topology/teardown, then settles to
TORN_DOWN. An agent error is logged as a warning, but the row still
settles to TORN_DOWN so operators can delete it (agent garbage is cleaned
on the next re-enroll).
Unihost deploys are unchanged — the field defaults to NULL so every
existing flow takes the local path.
Step 6 of the agent <-> topology integration.
Three new RPCs mirroring the existing deploy/teardown/status pattern:
- apply_topology(hydrated, version_hash) — long-timeout (600s) for
image pulls + compose up.
- teardown_topology(topology_id) — 300s timeout; enough for a
stubborn compose-down without hanging a heartbeat.
- get_topology_state() — short control-plane read for reconcile.
The per-call timeout swap uses the same trick as .deploy().
Step 5 of the agent <-> topology integration.
New mTLS-protected routes on the agent:
- POST /topology/apply — master pushes {hydrated, version_hash}.
Validates the hash matches locally (serialisation drift guard),
runs the topology through the same validator/composer pipeline
used master-side, then creates bridges + compose up + records the
apply in topology.db.
- POST /topology/teardown — dismantles compose, removes bridges,
clears topology.db. Idempotent.
- GET /topology/state — returns applied row + live docker
observation for the heartbeat.
Implementation lives in decnet/agent/topology_ops.py; it reuses the
private compose helpers from decnet.engine.deployer so we don't
duplicate compose/project-name plumbing. The apply path is sync
under the hood (docker SDK + subprocess); we hop to a thread so the
event loop keeps servicing other agent traffic.
v1 is one-topology-per-agent; cross-topology apply returns 409.
Step 4 of the agent <-> topology integration.
Single-row sqlite tracking which topology the agent last applied and
its version hash. Sync/stdlib, same pattern as the log-forwarder
offset store. v1 is one-topology-per-agent; attempting to apply a
different topology over a populated row raises AlreadyApplied so the
endpoint can return 409. observed() snapshots live docker state
(decnet-topology-* bridges + decnet-* containers) for the heartbeat.
The store is a cache, not authority — no auto-restore on boot.
Master remains the only source of truth.
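A single-row store with this behaviour might be sketched as follows. The table layout, the method names other than the AlreadyApplied exception, and the in-memory default path are assumptions for illustration.

```python
import sqlite3
from typing import Optional, Tuple


class AlreadyApplied(Exception):
    """Raised when a different topology is already recorded."""


class TopologyStoreSketch:
    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        # CHECK (id = 1) enforces the one-topology-per-agent, single-row shape.
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS applied ("
            " id INTEGER PRIMARY KEY CHECK (id = 1),"
            " topology_id TEXT NOT NULL,"
            " version_hash TEXT NOT NULL)"
        )
        self._db.commit()

    def get(self) -> Optional[Tuple[str, str]]:
        return self._db.execute(
            "SELECT topology_id, version_hash FROM applied WHERE id = 1"
        ).fetchone()

    def put(self, topology_id: str, version_hash: str) -> None:
        current = self.get()
        if current is not None and current[0] != topology_id:
            raise AlreadyApplied(current[0])
        self._db.execute(
            "INSERT OR REPLACE INTO applied (id, topology_id, version_hash)"
            " VALUES (1, ?, ?)",
            (topology_id, version_hash),
        )
        self._db.commit()

    def clear(self) -> None:
        self._db.execute("DELETE FROM applied")
        self._db.commit()
```

Re-applying the same topology with a new hash simply overwrites the row; only a different topology_id trips AlreadyApplied.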
Step 3 of the agent <-> topology integration.
Tiny pure helper both master and agent will use to answer "is the
applied state the one we expect?". SHA-256 of canonical JSON with
volatile keys (timestamps, status, version, canvas x/y/w/h) stripped
so the hash only captures deployment-relevant state.
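A hash of this kind can be sketched in a few lines. The concrete volatile key names below (created_at, updated_at, and so on) are guesses for illustration; the real helper's key list and function name may differ.

```python
import hashlib
import json
from typing import Any

# Assumed key names standing in for "timestamps, status, version, canvas x/y/w/h".
VOLATILE_KEYS = frozenset(
    {"created_at", "updated_at", "status", "version", "x", "y", "w", "h"}
)


def _strip_volatile(value: Any) -> Any:
    """Recursively drop volatile keys from nested dicts and lists."""
    if isinstance(value, dict):
        return {
            key: _strip_volatile(val)
            for key, val in value.items()
            if key not in VOLATILE_KEYS
        }
    if isinstance(value, list):
        return [_strip_volatile(item) for item in value]
    return value


def canonical_hash_sketch(topology: dict) -> str:
    """SHA-256 over sorted-key, compact JSON of the stripped topology."""
    canonical = json.dumps(
        _strip_volatile(topology), sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting keys and fixing the separators removes serialisation differences between master and agent, so equal deployment state yields equal hashes regardless of dict ordering.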
Step 2 of the agent <-> topology integration.
Adds the `target_host_uuid` FK on `Topology` plus wiring through the
two create endpoints (`POST /topologies`, `POST /topologies/blank`).
Validates the mode/host pair: `mode='agent'` now requires a known,
routable host; `mode='unihost'` must leave the field unset.
Surfaced on `TopologySummary` so list/detail responses expose it.
Purely additive at the schema level — existing unihost flows unchanged
(field defaults to `NULL`).
Step 1 of the agent <-> topology integration.
Dragging a LAN or decky, or resizing a NetBox, updates React state
but previously vanished on reload because the grid-layout adapter
rewrote everything from the graph. Add a per-topology localStorage
snapshot (key: mazenet.layout.<topologyId>) that captures net
x/y/w/h and decky x/y; useLayoutPersistor writes it debounced, and
getTopology merges it over adaptTopology's grid so entities without
a stored entry still fall back to a clean auto-layout. Deleting a
topology calls clearLayout to drop its snapshot.
Dropping more than one LAN near the same spot stacked the NetBox
rectangles on top of each other, and multiple deckies in a LAN
landed on identical per-LAN coordinates. Since canvas position
persistence is deferred (localStorage pass), the stored x/y are
not load-bearing — compute layout from the topology graph instead.
adaptTopology now lays LANs out in a 3-col grid with the DMZ first
and stacks deckies 2-wide inside their home LAN. New LAN palette
drops append to the same grid, ignoring the raw drop point.
Active/degraded/failed/deploying topologies cannot be deleted
without first transitioning to torn_down, but the UI had no way
to trigger that. Add POST /topologies/{id}/teardown mirroring the
deploy endpoint (background task, 202 Accepted), and a
click-to-arm TEARDOWN button on the topology list card that shows
whenever the row is in a teardown-eligible state.
MazeNET publishes gateway ports on the host via Docker. With the
default userland-proxy enabled, attacker connections appear to
originate from the bridge gateway instead of the real remote IP.
Log a soft warning at deploy time when the topology publishes any
ports and docker info reports UserlandProxy=true, pointing the
operator at the daemon.json toggle. Best-effort: failures to reach
the Docker daemon are silently ignored.
Rebuild the inspector panel to match the handoff mock: crosshair-titled
header with dim type label and close X, status-dot + archetype-chip
head rows, connection list with directional arrows, member list with
click-to-select, and a pending-diff block at the foot. Carry the
gateway/observed disable titles over from the ctx menu so the 'remove'
action stays honest.
Also prefix the subtitle with 'NETWORK OF NETWORKS' so the purpose of
this editor reads at a glance.
A prior half-torn-down topology can leave a bridge network alive under
a different name that still owns our intended subnet. Docker then
rejects our create with 'Pool overlaps with other one on this address
space', and the topology deploy fails.
Extend create_bridge_network to sweep any unused bridge whose IPAM
subnet matches the one we're about to claim (skipping networks with
running containers — those are live use).
UI-created deckies (api_decky_crud, api_create_blank_topology) write
decky_config as sent by the client — typically just archetype flags,
without the name/ips_by_lan fields compose.py requires. The generator
path populates them at persist() time, so compose worked for generated
topologies but KeyError'd on UI-created ones.
Normalise in hydrate() so every write path feeds the same shape
downstream: mirror decky.name into decky_config.name, and allocate
per-LAN IPs deterministically (reserving the primary decky.ip where it
falls in-subnet, then filling remaining edges with next_free).
Gateway detection in the editor previously matched
archetype === 'host-gateway' (a fictional archetype that never
existed in decnet/archetypes.py). Switch to
decky_config.forwards_l3 — the real runtime marker the composer
already reads — so deletion guards, drag-pinning, context menu
locking, and NodeCard DMZ-gateway styling all line up with what
actually ships at deploy time.
On DMZ palette drop, create the gateway with archetype=deaddeck,
services=['ssh'], forwards_l3=true, and mark the edge
is_bridge=true, forwards_l3=true. attachEdge now accepts those
flags so callers can seed a real bridge attachment.
Add check_no_host_port_collision: enumerate the ports the topology's
gateways will publish (forwards_l3=True × svc.ports), probe live
listeners via psutil, emit a 'warning'-severity PORT_COLLISION
issue per overlap. Live-only — invoked from deploy_topology just
after dry-run branching, so unit tests that exercise validate()
stay hermetic.
Warning rather than error because docker-compose up will hard-fail
on a real collision anyway; this just gives operators a cleaner log
line ahead of the compose failure.
When a non-DMZ LAN is created via POST /lans, look up the topology's
gateway (decky with forwards_l3=True attached to the DMZ) and insert
an edge binding it to the new LAN. The gateway becomes multi-homed
to every internal LAN automatically, so DMZ_ORPHAN cannot arise
from ordinary editor use.
Also fixes delete_lan: the home-decky guard used scalar_one_or_none,
which blew up when the gateway already had >1 'other' LAN edge.
Switch to scalars().first() — we only need to know *some* other
edge exists, not a unique one.
Gateway deckies (forwards_l3=True) are the DMZ's ingress. Their
service containers share the base namespace via network_mode:service,
so any listener inside the gateway is reachable through the base
container's published ports. Emit 'ports: [<p>:<p>, ...]' on the
gateway base from svc.ports across the decky's service list.
This is the principled replacement for the broken network_mode: host
stub — with docker-proxy publishing, the DMZ works on any single-NIC
VPS (no MACVLAN, no promiscuous mode required).
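The port list for the gateway base could be derived roughly as below. The service_ports mapping is a hypothetical stand-in for svc.ports, and de-duplicating and sorting the ports is an assumption about the desired output rather than confirmed composer behaviour.

```python
from typing import Dict, Iterable, List


def gateway_port_mappings(
    services: Iterable[str], service_ports: Dict[str, List[int]]
) -> List[str]:
    """Build compose 'ports' entries (host:container) for a gateway decky."""
    ports = sorted(
        {port for name in services for port in service_ports.get(name, [])}
    )
    # Publish each port on the same host port so attackers hit the real number.
    return [f"{port}:{port}" for port in ports]
```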
POST /topologies/blank seeded the gateway decky with
archetype=host-gateway + network_mode=host, but neither was wired:
no compose writer reads network_mode and host-gateway is not a real
archetype. Replace with archetype=deaddeck + forwards_l3=true so the
gateway is a normal multi-homed bridge decky, consistent with how
compose.py interprets forwards_l3 (sysctl + NET_ADMIN).
Edge marked is_bridge=true, forwards_l3=true so downstream readers
(generator, compose, validator) see a real bridge attachment.