DECNET

Author	SHA1	Message	Date
anti	6905c88083	feat(creds): DEBT-040 Phase 1 — SMB NTLMSSP framer Replace impacket's SimpleSMBServer with a hand-rolled asyncio SMB2 framer that walks Negotiate -> SessionSetup(Type1) -> SessionSetup(Type3) just deep enough to extract the inner NTLMSSP Type 3 via the shared parse_type3() parser. Always returns STATUS_LOGON_FAILURE; the attacker's hash lands in the Credential table, the attacker doesn't land on the host. - decnet/engine/deployer.py: _sync_ntlmssp_sources() mirrors the auth-helper / sessrec sync pattern, copies _shared/ntlmssp.py into smb/ and rdp/ build contexts before docker compose up. - Dockerfile: drop impacket dep, copy ntlmssp.py. - 7 unit tests drive the asyncio handler in-process via StreamReader.feed_data; assert dialect, MORE_PROCESSING_REQUIRED on first SessionSetup, NTLMSSP Type 2 carriage in SPNEGO, credential capture with universal SD shape, STATUS_LOGON_FAILURE on Type 3, oversized-NBSS / SMB1 / short-PDU drops.	2026-04-25 07:31:41 -04:00
anti	f1026b4427	feat(telnet): same PAM cred-capture, /etc/pam.d/login Promotes auth-helper.c to decnet/templates/_shared/auth-helper/ and adds _sync_auth_helper_sources() — mirrors the existing sessrec sync pattern that keeps shared sources in step with per-template build contexts. Telnet's image grows the same multi-stage musl build, COPY of the static helper into /usr/sbin/auth-helper, and prepended pam_exec line in /etc/pam.d/login. Pulls in the `login` package (real Debian PAM-aware /bin/login, replacing busybox's PAM-less applet) and libpam-modules transitively for pam_exec.so. Verified inside the rebuilt telnet image: - /bin/login is the real 53KB Debian binary (PAM-aware) - /etc/pam.d/login top line is the auth-helper hook - pam_exec.so present at /usr/lib/x86_64-linux-gnu/security/pam_exec.so - helper smoke-run emits correct RFC 5424 line for `telnetpw` → password_b64="dGVsbmV0cHc=" SSH Dockerfile updated to read auth-helper.c from auth-helper/ subdirectory so both templates use the synced layout. The canonical source lives in _shared/; per-template copies are tracked in git AND synced at deploy time so a drift on either side rebases on the next deploy. Closes the telnet half of DEBT-038's #5 follow-up.	2026-04-25 04:52:35 -04:00
anti	99bc9a8b6d	fix(engine): offload blocking compose to a worker thread deploy_topology and teardown_topology are async, but every _compose_with_retry / _compose call inside them was running in the main event loop via subprocess.run — which means a multi-minute docker compose --build froze the entire API: other endpoints, mutator events, SSE streams, status polls. The user noticed when a 2-decky deploy blocked everything else for the duration of the build. Wrap both calls in anyio.to_thread.run_sync. Same pattern the mutator engine has been using at engine.py:104 since forever. Per-LAN bridge create/remove docker SDK calls are still synchronous in the loop — they're individually fast (~50-200ms per LAN) and the loops are bounded by topology size, so they don't dominate. Worth revisiting if a 200-LAN deploy turns out to stall noticeably.	2026-04-24 22:14:08 -04:00
anti	f8ef0a5cf1	fix(deploy): redirect DOCKER_CONFIG out of $HOME so ProtectHome doesn't kill builds The api unit's ProtectHome=read-only made the user's HOME read-only inside the unit's namespace. docker compose --build then tried to write ~/.docker/buildx/activity/* and got EROFS — which we'd been misdiagnosing as a buildx wedge for the last few iterations. Real fix: set DOCKER_CONFIG and BUILDX_CONFIG in the unit's Environment= to a path inside ReadWritePaths. Hardening stays on, docker CLI writes to install_dir/.docker instead of /home/<user>/.docker. The wedge classifier now detects this case (count==0 + /home/ in the stderr path) and emits a recipe pointing at the env-var fix instead of the driver-rebuild path. Test added. Wiki gets the new branch first since it's the most common cause on systemd-managed installs.	2026-04-24 22:07:13 -04:00
anti	257624e6a7	fix(engine/buildx): recipe used reserved 'default' builder name 'docker buildx create --name default' errors with 'default is a reserved name and cannot be used to identify builder instance'. The bundled builder always exists under that name; the recipe should switch to it (buildx use default), not try to recreate it. For the count==0 driver-rebuild branch, the new builder needs a non-reserved name — using 'decnet-builder' as the example.	2026-04-24 22:02:20 -04:00
anti	40a31d8bc7	fix(engine/buildx): branch recovery recipe on leaked-mount count The hint was one-size-fits-all and pointed at prune+restart even when zero mounts were leaked — a false positive caused by matching any stderr containing the activity-dir path. Two changes: 1. Tighten the wedge classifier. Both the buildx-specific phrase ('failed to update builder last activity time') AND the EROFS marker ('read-only file system') must appear in stderr. Either alone is now treated as a normal transient error and retried. 2. Branch the recipe on _count_leaked_buildkit_mounts(): * count > 0 → unmount loop + daemon stop + umount -l (prune+restart alone doesn't evict held mounts) * count == 0 → rebuild the buildx driver (rm builder state, buildx create --use, inspect --bootstrap) Original compose stderr is now preserved in the hint as 'Original error: ...' so the user sees both the recipe and what compose actually said. Tests cover both branches plus a negative case (unrelated EROFS).	2026-04-24 21:58:09 -04:00
anti	05d225ae38	fix(engine): surface CalledProcessError.stderr in deploy-failure log + status reason str(CalledProcessError) is just 'Command ... returned non-zero exit status N' — the stderr (where the buildx recovery hint lives) was being silently dropped from both the deploy log line and the persisted 'failed' status reason. New _format_subprocess_error helper appends .stderr when the exception is a CalledProcessError. Applied to transition_status reason and the background-deploy log message so operators and the UI see the real failure, not just the exit code. This is what makes the buildx preflight hint from `86b9dec` actually reach the user.	2026-04-24 19:31:37 -04:00
anti	86b9decf80	fix(engine): detect wedged buildx + surface recovery hint on deploy When Docker's buildx leaks bind-mounts from a failed build it starts reporting 'read-only file system' on its own activity file, even though nothing is actually read-only. The user's host had 20+ leaked mounts before we noticed — each retry compounds the leak. _compose_with_retry now: * Pre-flight counts /var/lib/docker/tmp/buildkit-mount* entries in /proc/self/mounts; if >= 10 and the command is a build, refuses to start and returns a clean recovery recipe instead of retrying. * On mid-build failures that match the wedge signature ('failed to update builder last activity time' or the activity-dir path in stderr), short-circuits the retry loop with the same recipe. The first occurrence no longer needs a pre-flight; the pre-flight catches repeat attempts. Recipe points at 'docker buildx prune -af && sudo systemctl restart docker', which is what actually clears the leaked mounts. Tests cover all three paths: wedge preflight blocks builds, non-build commands (down/stop) ignore the preflight, mid-build signature detection kills the retry loop. A new autouse fixture stubs the wedge-detector to 0 so dev-host state doesn't poison the mocked subprocess tests. Wiki companion commit adds Troubleshooting → 'Buildx leaked mounts'.	2026-04-24 19:25:45 -04:00
anti	51e9e263ca	feat(templates): add instance_seed stealth helper and wire into template builds Each decky now gets a deterministic-per-instance seeded RNG derived from NODE_NAME, so cluster UUIDs, version strings, uptime, and credentials diverge across the fleet while staying stable within one container. The canonical helper lives at decnet/templates/instance_seed.py; the deployer copies it into every active template build context alongside syslog_bridge.py. Dockerfiles COPY it to /opt/ so server.py can import it. Connection-time jitter intentionally stays unseeded — two hits to the same decky must not replay the same latency curve.	2026-04-22 09:24:04 -04:00
anti	a58d42e492	feat(templates): wire SSH+Telnet to sessrec transcript recorder Build login-session into both images as the swapped root shell, add a quarantine bind mount for telnet (symmetric to SSH), seed transcripts/ dir and service discriminant at entrypoint. Deployer syncs sessrec.c + Makefile into each build context alongside the existing syslog_bridge helper. sessrec falls back to /etc/sessrec.service when env is stripped (busybox /bin/login).	2026-04-21 23:03:42 -04:00
anti	85bb0e2f65	fix(engine): roll back partial Docker state on deploy failure When create_bridge_network or compose-up raised mid-deploy, the deployer marked the topology FAILED and re-raised — but left every network it had already created alive. The next deploy attempt tripped over the orphans with 'Pool overlaps with other one on this address space' (IPAM conflict). Track networks created in the current attempt; on exception, tear down the started compose stack (if any), remove the networks in reverse order, and delete the compose file before marking FAILED. Rollback errors are logged but never mask the original failure. Covered by a new regression test that drives a docker client which succeeds once then raises, and asserts every created network is also removed.	2026-04-21 20:23:03 -04:00
anti	bf5ed7abbb	feat(engine): emit creation/retirement mutation events on deploy/teardown Close the lifecycle loop for the correlation graph: every decky now enters the substrate with an explicit `trigger=creation` event (old_services=[] ⇒ new_services=<initial>) and leaves it with `trigger=retirement` (old=<current> ⇒ new=[]). With scheduled/operator mutations already flowing through emit_decky_mutated, the entire decky lifecycle is now a well-formed sequence of mutation events — the correlator can fold substrate_state(t) at any T by replaying them. Lazy-imports mutator.events to dodge the engine↔mutator circular dependency. Bus is None at CLI sites; the syslog write is what the correlator consumes. Emission is soft-failing so a broken log path never aborts a deploy.	2026-04-21 19:35:05 -04:00
anti	e8f9c955b3	feat(swarm): heartbeat-driven topology resync for agent-pinned deployments Agent heartbeats now carry an applied-topology snapshot. The master heartbeat handler compares the reported version_hash against what canonical_hash yields for the hydrated topology pinned to that host and flags Topology.needs_resync on divergence (or when the agent reports no topology at all while master expects one). The mutator watch loop gains reconcile_agent_resyncs, which re-pushes the current hydrated blob via AgentClient.apply_topology without touching status, then clears the flag on success. Push failures leave the flag set so the next tick retries.	2026-04-21 01:35:12 -04:00
anti	05d1ebbaaa	feat(engine): route agent-pinned topologies via AgentClient deploy_topology and teardown_topology now branch on target_host_uuid. When set: - Hydrate the topology locally (validator runs exactly as before). - Compute canonical_hash; push {hydrated, version_hash} to the pinned agent through AgentClient.apply_topology. - Status machine still moves PENDING -> DEPLOYING -> ACTIVE on 2xx, PENDING -> DEPLOYING -> FAILED on error; master remains the sole owner of the row. Teardown flips to TEARING_DOWN, fires /topology/teardown, then TORN_DOWN — we log a warning on agent error but still settle to TORN_DOWN so operators can delete the row (agent garbage is cleaned on the next re-enroll). Unihost deploys are unchanged — the field defaults to NULL so every existing flow takes the local path. Step 6 of the agent <-> topology integration.	2026-04-21 01:27:59 -04:00
anti	c37d1f09c6	feat(deployer): warn when userland-proxy masks attacker source IPs MazeNET publishes gateway ports on the host via Docker. With the default userland-proxy enabled, attacker connections appear to originate from the bridge gateway instead of the real remote IP. Log a soft warning at deploy time when the topology publishes any ports and docker info reports UserlandProxy=true, pointing the operator at the daemon.json toggle. Best-effort: daemon talk failures silently no-op.	2026-04-20 23:37:59 -04:00
anti	2c35d60d45	feat(mazenet): host port-collision warning at deploy time Add check_no_host_port_collision: enumerate the ports the topology's gateways will publish (forwards_l3=True × svc.ports), probe live listeners via psutil, emit a 'warning'-severity PORT_COLLISION issue per overlap. Live-only — invoked from deploy_topology just after dry-run branching, so unit tests that exercise validate() stay hermetic. Warning rather than error because docker-compose up will hard-fail on a real collision anyway; this just gives operators a cleaner log line ahead of the compose failure.	2026-04-20 23:07:31 -04:00
anti	2544d0294a	feat(topology): add pre-deploy validator and wire into deploy_topology MazeNET phase 2 step 3. Blocks deploys of hand-authored topologies that would fail mid-bring-up (orphan deckies, duplicate IPs, overlapping subnets, unknown services) with a structured error list instead of a docker error at startup. Rules (one function each, composable by the editor for inline hints): - exactly one DMZ - every LAN has a bridge chain to the DMZ (BFS via multi-homed deckies) - no orphan deckies - unique LAN and decky names per topology - no IP collisions + IPs inside their LAN's subnet - no LAN subnet overlaps - every service in decnet.fleet.all_service_names() - service_config keys match the decky's declared services deploy_topology runs the validator after hydrate, before any status transition or Docker call; errors raise ValidationError and status stays at pending.	2026-04-20 17:45:32 -04:00
anti	80e3c28234	test(topology): deploy dry-run + failure-path + live docker e2e Covers dry-run compose emission (no status change), FAILED transition with reason logged on daemon errors, teardown from FAILED, and a live-marked end-to-end test that creates/removes bridge networks against a real docker daemon (skipped on CI).	2026-04-20 16:57:43 -04:00
anti	2a030bf3a9	feat(topology): add compose generator and deployer integration Adds per-topology compose generation (one Docker bridge network per LAN, multi-homed bridge deckies, ip_forward sysctl for L3 forwarders) plus async deploy_topology/teardown_topology in the engine. Leaf-first teardown via BFS-named LAN reverse sort; partial-state safe on failure.	2026-04-20 16:54:40 -04:00
anti	91549e6936	fix(deploy): prevent 'Address already in use' from stale IPAM and half-torn-down containers Two compounding root causes produced the recurring 'Address already in use' error on redeploy: 1. _ensure_network only compared driver+name; if a prior deploy's IPAM pool drifted (different subnet/gateway/range), Docker kept handing out addresses from the old pool and raced the real LAN. Now also compares Subnet/Gateway/IPRange and rebuilds on drift. 2. A prior half-failed 'up' could leave containers still holding the IPs and ports the new run wants. Run 'compose down --remove-orphans' as a best-effort pre-up cleanup so IPAM starts from a clean state. Also surface docker compose stderr to the structured log on failure so the agent's journal captures Docker's actual message (which IP, which port) instead of just the exit code.	2026-04-19 19:59:06 -04:00
anti	585541016f	fix(engine): teardown(decky_id=...) built malformed service names The nested list-comp `[f"{id}-{svc}" for svc in [d.services for d ...]]` iterated over a list of lists, so `svc` was the whole services list and f-string stringified it -> `decky3-['sip']`. docker compose saw "no such service" and the per-decky teardown failed 500. Flatten: find the matching decky once, then iterate its services. Noop early on unknown decky_id and on empty service lists. Regression test asserts the emitted compose args have no '[' or quote characters.	2026-04-19 19:42:42 -04:00
anti	6708f26e6b	fix(packaging): move templates/ into decnet/ package so they ship with pip install The docker build contexts and syslog_bridge.py lived at repo root, which meant setuptools (include = ["decnet"]) never shipped them. Agents installed via `pip install $RELEASE_DIR` got site-packages/decnet/* but no templates/, so every deploy blew up in deployer._sync_logging_helper with FileNotFoundError on templates/syslog_bridge.py. Move templates/ -> decnet/templates/ and declare it as setuptools package-data. Path resolutions in services/*.py and engine/deployer.py drop one .parent since templates now lives beside the code. Test fixtures, bandit exclude path, and coverage omit glob updated to match.	2026-04-19 19:30:04 -04:00
anti	b883f24ba2	fix(engine): pin docker compose project name to avoid empty-basename failure systemd daemons run with WorkingDirectory=/ by default; docker compose derives the project name from basename(cwd), which is empty at '/', and aborts with 'project name must not be empty'. Pass -p decnet explicitly so the project name is independent of cwd, and set WorkingDirectory=/opt/decnet on the three DECNET units so compose artifacts (decnet-compose.yml, build contexts) also land in the install dir.	2026-04-19 06:17:30 -04:00
anti	8dd4c78b33	refactor: strip DECNET tokens from container-visible surface Rename the container-side logging module decnet_logging → syslog_bridge (canonical at templates/syslog_bridge.py, synced into each template by the deployer). Drop the stale per-template copies; setuptools find was picking them up anyway. Swap useradd/USER/chown "decnet" for "logrelay" so no obvious token appears in the rendered container image. Apply the same cloaking pattern to the telnet template that SSH got: syslog pipe moves to /run/systemd/journal/syslog-relay and the relay is cat'd via exec -a "systemd-journal-fwd". rsyslog.d conf rename 99-decnet.conf → 50-journal-forward.conf. SSH capture script: /var/decnet/captured → /var/lib/systemd/coredump (real systemd path), logger tag decnet-capture → systemd-journal. Compose volume updated to match the new in-container quarantine path. SD element ID shifts decnet@55555 → relay@55555; synced across collector, parser, sniffer, prober, formatter, tests, and docs so the host-side pipeline still matches what containers emit.	2026-04-17 22:57:53 -04:00
anti	70d8ffc607	feat: complete OTEL tracing across all services with pipeline bridge and docs Extends tracing to every remaining module: all 23 API route handlers, correlation engine, sniffer (fingerprint/p0f/syslog), prober (jarm/hassh/tcpfp), profiler behavioral analysis, logging subsystem, engine, and mutator. Bridges the ingester→SSE trace gap by persisting trace_id/span_id columns on the logs table and creating OTEL span links in the SSE endpoint. Adds log-trace correlation via _TraceContextFilter injecting otel_trace_id into Python LogRecords. Includes development/docs/TRACING.md with full span reference (76 spans), pipeline propagation architecture, quick start guide, and troubleshooting.	2026-04-16 00:58:08 -04:00
anti	65ddb0b359	feat: add OpenTelemetry distributed tracing across all DECNET services Gated by DECNET_DEVELOPER_TRACING env var (default off, zero overhead). When enabled, traces flow through FastAPI routes, background workers (collector, ingester, profiler, sniffer, prober), engine/mutator operations, and all DB calls via TracedRepository proxy. Includes Jaeger docker-compose for local dev and 18 unit tests.	2026-04-15 23:23:13 -04:00
anti	035499f255	feat: add component-aware RFC 5424 application logging system - Modify Rfc5424Formatter to read decnet_component from LogRecord and use it as RFC 5424 APP-NAME field (falls back to 'decnet') - Add get_logger(component) factory in decnet/logging/__init__.py with _ComponentFilter that injects decnet_component on each record - Wire all five layers to their component tag: cli -> 'cli', engine -> 'engine', api -> 'api' (api.py, ingester, routers), mutator -> 'mutator', collector -> 'collector' - Add structured INFO/DEBUG/WARNING/ERROR log calls throughout each layer per the defined vocabulary; DEBUG calls are suppressed unless DECNET_DEVELOPER=true - Add tests/test_logging.py covering factory, filter, formatter component-awareness, fallback behaviour, and level gating	2026-04-13 07:39:01 -04:00
anti	c384a3103a	refactor: separate engine, collector, mutator, and fleet into independent subpackages - decnet/engine/ — container lifecycle (deploy, teardown, status); _kill_api removed - decnet/collector/ — Docker log streaming (moved from web/collector.py) - decnet/mutator/ — mutation engine (no longer imports from cli or duplicates deployer code) - decnet/fleet.py — shared decky-building logic extracted from cli.py Cross-contamination eliminated: - web router no longer imports from decnet.cli - mutator no longer imports from decnet.cli - cli no longer imports from decnet.web - _kill_api() moved to cli (process management, not engine concern) - _compose_with_retry duplicate removed from mutator	2026-04-12 00:26:22 -04:00
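
Commit 585541016f describes a nested list comprehension that iterated over a list of lists and produced service names like `decky3-['sip']`. The sketch below reproduces that failure mode and the flattened fix it describes. The `Decky` dataclass, the `d.name == decky_id` filter, and both function names are hypothetical stand-ins for illustration, not the project's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class Decky:
    # Hypothetical stand-in for the project's decky model.
    name: str
    services: list[str] = field(default_factory=list)


def buggy_service_names(deckies: list[Decky], decky_id: str) -> list[str]:
    # The inner comprehension yields a list of lists, so `svc` is bound to a
    # whole services list and the f-string stringifies it.
    return [
        f"{decky_id}-{svc}"
        for svc in [d.services for d in deckies if d.name == decky_id]
    ]


def service_names(deckies: list[Decky], decky_id: str) -> list[str]:
    # Find the matching decky once, then iterate its services.
    decky = next((d for d in deckies if d.name == decky_id), None)
    if decky is None or not decky.services:
        return []  # no-op on unknown decky_id or empty service list
    return [f"{decky_id}-{svc}" for svc in decky.services]


fleet = [Decky("decky3", ["sip"]), Decky("decky4", ["ssh", "telnet"])]
print(buggy_service_names(fleet, "decky3"))  # ["decky3-['sip']"]
print(service_names(fleet, "decky3"))        # ['decky3-sip']
```

A regression check in the spirit of the commit message would assert that no emitted name contains `[` or a quote character.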

28 Commits
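
Commits 40a31d8bc7 and f8ef0a5cf1 describe a buildx wedge classifier that requires both the buildx phrase and the EROFS marker in stderr, then branches the recovery recipe on the leaked-mount count. A minimal sketch of that decision logic follows. The function names, the returned branch labels, and the case-insensitive matching are assumptions made for illustration; only the matched strings and the branching rules come from the commit messages.

```python
BUILDX_PHRASE = "failed to update builder last activity time"
EROFS_MARKER = "read-only file system"
MOUNT_PREFIX = "/var/lib/docker/tmp/buildkit-mount"


def is_buildx_wedge(stderr: str) -> bool:
    # Both markers must be present; either one alone is treated as a normal
    # transient error. Lowercasing is an assumption of this sketch.
    text = stderr.lower()
    return BUILDX_PHRASE in text and EROFS_MARKER in text


def count_leaked_buildkit_mounts(mounts_text: str) -> int:
    # Takes the contents of /proc/self/mounts; the second whitespace-separated
    # field of each line is the mount point.
    count = 0
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1].startswith(MOUNT_PREFIX):
            count += 1
    return count


def recovery_branch(stderr: str, leaked: int) -> str:
    # Branch labels are hypothetical names for the recipes in the log.
    if not is_buildx_wedge(stderr):
        return "retry"
    if leaked > 0:
        return "unmount-leaked"          # unmount loop + daemon stop + umount -l
    if "/home/" in stderr:
        return "redirect-docker-config"  # set DOCKER_CONFIG / BUILDX_CONFIG
    return "rebuild-driver"              # recreate the buildx builder


wedge = (
    "failed to update builder last activity time: "
    "open /home/u/.docker/buildx/activity/x: read-only file system"
)
mounts = (
    "overlay /var/lib/docker/tmp/buildkit-mount123 overlay rw 0 0\n"
    "proc /proc proc rw 0 0\n"
)
print(recovery_branch(wedge, count_leaked_buildkit_mounts(mounts)))
```

Passing the mounts text in as a string, rather than reading `/proc/self/mounts` inside the function, keeps the sketch testable without depending on host state, which is the same concern the autouse fixture in 86b9decf80 addresses.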