Build login-session into both images as the swapped root shell, add a
quarantine bind mount for telnet (symmetric to SSH), seed transcripts/
dir and service discriminant at entrypoint. Deployer syncs sessrec.c +
Makefile into each build context alongside the existing syslog_bridge
helper. sessrec falls back to /etc/sessrec.service when env is stripped
(busybox /bin/login).
Topology rows deleted without a proper teardown leave Docker containers
and bridge networks behind, holding IPAM pools that cause 403 "Pool
overlaps" on the next deploy at the same subnet.
- engine/reaper.py walks the local Docker daemon, extracts the 8-char
topology prefix from every decnet_t_* resource, and force-removes
containers + networks whose prefix is not in the repo.
- POST /api/v1/topologies/reap-orphans (admin-only) returns a report
of live/orphan prefixes and what was removed.
- Resources belonging to live topologies are never touched; per-resource
errors are captured without aborting the sweep.
When create_bridge_network or compose-up raised mid-deploy, the
deployer marked the topology FAILED and re-raised — but left every
network it had already created alive. The next deploy attempt tripped
over the orphans with 'Pool overlaps with other one on this address
space' (IPAM conflict).
Track networks created in the current attempt; on exception, tear down
the started compose stack (if any), remove the networks in reverse
order, and delete the compose file before marking FAILED. Rollback
errors are logged but never mask the original failure.
Covered by a new regression test that drives a docker client which
succeeds once then raises, and asserts every created network is also
removed.
Close the lifecycle loop for the correlation graph: every decky now
enters the substrate with an explicit `trigger=creation` event
(old_services=[] ⇒ new_services=<initial>) and leaves it with
`trigger=retirement` (old=<current> ⇒ new=[]). With scheduled/operator
mutations already flowing through emit_decky_mutated, the entire decky
lifecycle is now a well-formed sequence of mutation events — the
correlator can fold substrate_state(t) at any T by replaying them.
Lazy-imports mutator.events to dodge the engine↔mutator circular
dependency. Bus is None at CLI sites; the syslog write is what the
correlator consumes. Emission is soft-failing so a broken log path
never aborts a deploy.
Agent heartbeats now carry an applied-topology snapshot. The master
heartbeat handler compares the reported version_hash against what
canonical_hash yields for the hydrated topology pinned to that host
and flags Topology.needs_resync on divergence (or when the agent
reports no topology at all while master expects one).
The mutator watch loop gains reconcile_agent_resyncs, which re-pushes
the current hydrated blob via AgentClient.apply_topology without
touching status, then clears the flag on success. Push failures leave
the flag set so the next tick retries.
deploy_topology and teardown_topology now branch on
target_host_uuid. When set:
- Hydrate the topology locally (validator runs exactly as before).
- Compute canonical_hash; push {hydrated, version_hash} to the
pinned agent through AgentClient.apply_topology.
- Status machine still moves PENDING -> DEPLOYING -> ACTIVE on 2xx,
PENDING -> DEPLOYING -> FAILED on error; master remains the sole
owner of the row.
Teardown flips to TEARING_DOWN, fires /topology/teardown, then
TORN_DOWN — we log a warning on agent error but still settle to
TORN_DOWN so operators can delete the row (agent garbage is cleaned
on the next re-enroll).
Unihost deploys are unchanged — the field defaults to NULL so every
existing flow takes the local path.
Step 6 of the agent <-> topology integration.
MazeNET publishes gateway ports on the host via Docker. With the
default userland-proxy enabled, attacker connections appear to
originate from the bridge gateway instead of the real remote IP.
Log a soft warning at deploy time when the topology publishes any
ports and docker info reports UserlandProxy=true, pointing the
operator at the daemon.json toggle. Best-effort: daemon talk
failures silently no-op.
Add check_no_host_port_collision: enumerate the ports the topology's
gateways will publish (forwards_l3=True × svc.ports), probe live
listeners via psutil, emit a 'warning'-severity PORT_COLLISION
issue per overlap. Live-only — invoked from deploy_topology just
after dry-run branching, so unit tests that exercise validate()
stay hermetic.
Warning rather than error because docker-compose up will hard-fail
on a real collision anyway; this just gives operators a cleaner log
line ahead of the compose failure.
MazeNET phase 2 step 3. Blocks deploys of hand-authored topologies that
would fail mid-bring-up (orphan deckies, duplicate IPs, overlapping
subnets, unknown services) with a structured error list instead of a
docker error at startup.
Rules (one function each, composable by the editor for inline hints):
- exactly one DMZ
- every LAN has a bridge chain to the DMZ (BFS via multi-homed deckies)
- no orphan deckies
- unique LAN and decky names per topology
- no IP collisions + IPs inside their LAN's subnet
- no LAN subnet overlaps
- every service in decnet.fleet.all_service_names()
- service_config keys match the decky's declared services
deploy_topology runs the validator after hydrate, before any status
transition or Docker call; errors raise ValidationError and status
stays at pending.
Covers dry-run compose emission (no status change), FAILED transition
with reason logged on daemon errors, teardown from FAILED, and a
live-marked end-to-end test that creates/removes bridge networks
against a real docker daemon (skipped on CI).
Adds per-topology compose generation (one Docker bridge network per
LAN, multi-homed bridge deckies, ip_forward sysctl for L3 forwarders)
plus async deploy_topology/teardown_topology in the engine. Leaf-first
teardown via BFS-named LAN reverse sort; partial-state safe on failure.
Two compounding root causes produced the recurring 'Address already in use'
error on redeploy:
1. _ensure_network only compared driver+name; if a prior deploy's IPAM
pool drifted (different subnet/gateway/range), Docker kept handing out
addresses from the old pool and raced the real LAN. Now also compares
Subnet/Gateway/IPRange and rebuilds on drift.
2. A prior half-failed 'up' could leave containers still holding the IPs
and ports the new run wants. Run 'compose down --remove-orphans' as a
best-effort pre-up cleanup so IPAM starts from a clean state.
Also surface docker compose stderr to the structured log on failure so
the agent's journal captures Docker's actual message (which IP, which
port) instead of just the exit code.
The nested list-comp `[f"{id}-{svc}" for svc in [d.services for d ...]]`
iterated over a list of lists, so `svc` was the whole services list and
f-string stringified it -> `decky3-['sip']`. docker compose saw "no such
service" and the per-decky teardown failed 500.
Flatten: find the matching decky once, then iterate its services. Noop
early on unknown decky_id and on empty service lists. Regression test
asserts the emitted compose args have no '[' or quote characters.
The docker build contexts and syslog_bridge.py lived at repo root, which
meant setuptools (include = ["decnet*"]) never shipped them. Agents
installed via `pip install $RELEASE_DIR` got site-packages/decnet/** but no
templates/, so every deploy blew up in deployer._sync_logging_helper with
FileNotFoundError on templates/syslog_bridge.py.
Move templates/ -> decnet/templates/ and declare it as setuptools
package-data. Path resolutions in services/*.py and engine/deployer.py drop
one .parent since templates now lives beside the code. Test fixtures,
bandit exclude path, and coverage omit glob updated to match.
systemd daemons run with WorkingDirectory=/ by default; docker compose
derives the project name from basename(cwd), which is empty at '/', and
aborts with 'project name must not be empty'. Pass -p decnet explicitly
so the project name is independent of cwd, and set WorkingDirectory=/opt/decnet
on the three DECNET units so compose artifacts (decnet-compose.yml,
build contexts) also land in the install dir.
Rename the container-side logging module decnet_logging → syslog_bridge
(canonical at templates/syslog_bridge.py, synced into each template by
the deployer). Drop the stale per-template copies; setuptools find was
picking them up anyway. Swap useradd/USER/chown "decnet" for "logrelay"
so no obvious token appears in the rendered container image.
Apply the same cloaking pattern to the telnet template that SSH got:
syslog pipe moves to /run/systemd/journal/syslog-relay and the relay
is cat'd via exec -a "systemd-journal-fwd". rsyslog.d conf rename
99-decnet.conf → 50-journal-forward.conf. SSH capture script:
/var/decnet/captured → /var/lib/systemd/coredump (real systemd path),
logger tag decnet-capture → systemd-journal. Compose volume updated
to match the new in-container quarantine path.
SD element ID shifts decnet@55555 → relay@55555; synced across
collector, parser, sniffer, prober, formatter, tests, and docs so the
host-side pipeline still matches what containers emit.
Extends tracing to every remaining module: all 23 API route handlers,
correlation engine, sniffer (fingerprint/p0f/syslog), prober (jarm/hassh/tcpfp),
profiler behavioral analysis, logging subsystem, engine, and mutator.
Bridges the ingester→SSE trace gap by persisting trace_id/span_id columns on
the logs table and creating OTEL span links in the SSE endpoint. Adds log-trace
correlation via _TraceContextFilter injecting otel_trace_id into Python LogRecords.
Includes development/docs/TRACING.md with full span reference (76 spans),
pipeline propagation architecture, quick start guide, and troubleshooting.
Gated by DECNET_DEVELOPER_TRACING env var (default off, zero overhead).
When enabled, traces flow through FastAPI routes, background workers
(collector, ingester, profiler, sniffer, prober), engine/mutator
operations, and all DB calls via TracedRepository proxy.
Includes Jaeger docker-compose for local dev and 18 unit tests.
- Modify Rfc5424Formatter to read decnet_component from LogRecord
and use it as RFC 5424 APP-NAME field (falls back to 'decnet')
- Add get_logger(component) factory in decnet/logging/__init__.py
with _ComponentFilter that injects decnet_component on each record
- Wire all five layers to their component tag:
cli -> 'cli', engine -> 'engine', api -> 'api' (api.py, ingester,
routers), mutator -> 'mutator', collector -> 'collector'
- Add structured INFO/DEBUG/WARNING/ERROR log calls throughout each
layer per the defined vocabulary; DEBUG calls are suppressed unless
DECNET_DEVELOPER=true
- Add tests/test_logging.py covering factory, filter, formatter
component-awareness, fallback behaviour, and level gating