# DECNET β€” Technical Debt Register > Last updated: 2026-04-24 β€” DEBT-036 opened (session-profile ingester). > Severity: πŸ”΄ Critical Β· 🟠 High Β· 🟑 Medium Β· 🟒 Low --- ## πŸ”΄ Critical ### ~~DEBT-001 β€” Hardcoded JWT fallback secret~~ βœ… RESOLVED ~~**File:** `decnet/env.py:15`~~ Fixed in commit `b6b046c`. `DECNET_JWT_SECRET` is now required; startup raises `ValueError` if unset or set to a known-bad value. ### ~~DEBT-002 β€” Default admin credentials in code~~ βœ… CLOSED (by design) `DECNET_ADMIN_PASSWORD` defaults to `"admin"` intentionally β€” the web dashboard enforces a password change on first login (`must_change_password=1`). Startup enforcement removed as it broke tooling without adding meaningful security. ### ~~DEBT-003 β€” Hardcoded LDAP password placeholder~~ βœ… CLOSED (false positive) `templates/ldap/server.py:73` β€” `""` is a log label for SASL auth attempts, not an operational credential. The LDAP template is a honeypot; it has no bind password of its own. ### ~~DEBT-004 β€” Wildcard CORS with no origin restriction~~ βœ… RESOLVED ~~**File:** `decnet/web/api.py:48-54`~~ Fixed in commit `b6b046c`. `allow_origins` now uses `DECNET_CORS_ORIGINS` (env var, defaults to `http://localhost:8080`). `allow_methods` and `allow_headers` tightened to explicit allowlists. --- ## 🟠 High ### ~~DEBT-005 β€” Auth module has zero test coverage~~ βœ… RESOLVED ~~**File:** `decnet/web/auth.py`~~ Comprehensive test suite added in `tests/api/` covering login, password change, token validation, and JWT edge cases. ### ~~DEBT-006 β€” Database layer has zero test coverage~~ βœ… RESOLVED ~~**File:** `decnet/web/sqlite_repository.py`~~ `tests/api/test_repository.py` added β€” covers log insertion, bounty CRUD, histogram queries, stats summary, and fuzz testing of all query paths. In-memory SQLite with `StaticPool` ensures full isolation. ### ~~DEBT-007 β€” Web API routes mostly untested~~ βœ… RESOLVED ~~**Files:** `decnet/web/router/` (all sub-modules)~~ Full coverage added across `tests/api/` β€” fleet, logs, bounty, stream, auth all have dedicated test modules with both functional and fuzz test cases. ### ~~DEBT-008 β€” Auth token accepted via query string~~ βœ… RESOLVED ~~**File:** `decnet/web/dependencies.py:33-34`~~ Query-string token fallback removed. `get_current_user` now accepts only `Authorization: Bearer ` header. Tokens no longer appear in access logs or browser history. ### ~~DEBT-009 β€” Inconsistent and unstructured logging across templates~~ βœ… CLOSED (false positive) All service templates already import from `decnet_logging` and use `syslog_line()` for structured output. The `print(line, flush=True)` present in some templates is the intentional Docker stdout channel for container log forwarding β€” not unstructured debug output. ### ~~DEBT-010 β€” `decnet_logging.py` duplicated across all 19 service templates~~ βœ… RESOLVED ~~**Files:** `templates/*/decnet_logging.py`~~ All 22 per-directory copies deleted. Canonical source lives at `templates/decnet_logging.py`. `deployer.py` now calls `_sync_logging_helper()` before `docker compose up` β€” it copies the canonical file into each active template build context automatically. --- ## 🟑 Medium ### DEBT-011 β€” No database migration system **File:** `decnet/web/db/sqlite/repository.py` Schema is created during startup via `SQLModel.metadata.create_all`. There is no Alembic or equivalent migration layer. Schema changes across deployments require manual intervention or silently break existing databases. **Status:** Architectural. Deferred β€” requires Alembic integration and migration history bootstrapping. ### ~~DEBT-012 β€” No environment variable validation schema~~ βœ… RESOLVED ~~**File:** `decnet/env.py`~~ `DECNET_API_PORT` and `DECNET_WEB_PORT` now validated via `_port()` β€” enforces integer type and 1–65535 range, raises `ValueError` with a clear message on bad input. ### ~~DEBT-013 β€” Unvalidated input on `decky_name` route parameter~~ βœ… RESOLVED ~~**File:** `decnet/web/router/fleet/api_mutate_decky.py:10`~~ `decky_name` now declared as `Path(..., pattern=r"^[a-z0-9\-]{1,64}$")` β€” FastAPI rejects non-matching values with 422 before any downstream processing. ### ~~DEBT-014 β€” Streaming endpoint has no error handling~~ βœ… RESOLVED ~~**File:** `decnet/web/router/stream/api_stream_events.py`~~ `event_generator()` now wrapped in `try/except`. `asyncio.CancelledError` is handled silently (clean disconnect). All other exceptions log server-side via `log.exception()` and yield an `event: error` SSE frame to the client. ### ~~DEBT-015 β€” Broad exception detail leaked to API clients~~ βœ… RESOLVED ~~**File:** `decnet/web/router/fleet/api_deploy_deckies.py:78`~~ Raw exception message no longer returned to client. Full exception now logged server-side via `log.exception()`. Client receives generic `"Deployment failed. Check server logs for details."`. ### ~~DEBT-016 β€” Unvalidated log query parameters~~ βœ… RESOLVED ~~**File:** `decnet/web/router/logs/api_get_logs.py:12-19`~~ `search` capped at `max_length=512`. `start_time` and `end_time` validated against `^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}$` regex pattern. FastAPI rejects invalid input with 422. ### ~~DEBT-017 β€” Silent DB lock retry during startup~~ βœ… RESOLVED ~~**File:** `decnet/web/api.py:20-26`~~ Each retry attempt now emits `log.warning("DB init attempt %d/5 failed: %s", attempt, exc)`. After all retries exhausted, `log.error()` is emitted so degraded startup is always visible in logs. ### ~~DEBT-018 β€” No Docker HEALTHCHECK in any template~~ βœ… RESOLVED ~~**Files:** All 20 `templates/*/Dockerfile`~~ All 24 Dockerfiles updated with: ```dockerfile HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \ CMD kill -0 1 || exit 1 ``` ### ~~DEBT-019 β€” Most template containers run as root~~ βœ… RESOLVED ~~**Files:** All `templates/*/Dockerfile` except Cowrie~~ All 24 Dockerfiles now create a `decnet` system user, use `setcap cap_net_bind_service+eip` on the Python binary (allows binding ports < 1024 without root), and drop to `USER decnet` before `ENTRYPOINT`. ### ~~DEBT-020 β€” Swagger/OpenAPI disabled in production~~ βœ… RESOLVED ~~**File:** `decnet/web/api.py:43-45`~~ All route decorators now declare `responses={401: {"description": "Not authenticated"}, 422: {"description": "Validation error"}}`. OpenAPI schema is complete for all endpoints. ### ~~DEBT-021 β€” `sqlite_repository.py` is a god module~~ βœ… RESOLVED ~~**File:** `decnet/web/sqlite_repository.py` (~400 lines)~~ Fully refactored to `decnet/web/db/` modular layout: `models.py` (SQLModel schema), `repository.py` (abstract base), `sqlite/repository.py` (SQLite implementation), `sqlite/database.py` (engine/session factory). Commit `de84cc6`. ### DEBT-026 β€” IMAP/POP3 bait emails not configurable via service config **Files:** `templates/imap/server.py`, `templates/pop3/server.py`, `decnet/services/imap.py`, `decnet/services/pop3.py` Bait emails are hardcoded. A stub env var `IMAP_EMAIL_SEED` is read but currently ignored. Full implementation requires: 1. `IMAP_EMAIL_SEED` points to a JSON file with a list of `{from_, to, subject, date, body}` dicts. 2. `templates/imap/server.py` loads and merges/replaces `_BAIT_EMAILS` from that file at startup. 3. `decnet/services/imap.py` `compose_fragment()` reads `service_cfg["email_seed"]` and injects `IMAP_EMAIL_SEED` + a bind-mount for the seed file into the compose fragment. 4. Same pattern for POP3 (`POP3_EMAIL_SEED`). **Status:** Stub in place β€” full wiring deferred to next session. --- ### DEBT-027 β€” Dynamic Bait Store **Files:** `templates/redis/server.py`, `templates/ftp/server.py` The bait store and honeypot files are hardcoded. A dynamic injection framework should be created to populate this payload across different honeypots. **Status:** Deferred β€” out of current scope. ### DEBT-028 β€” Test coverage for `api_deploy_deckies.py` **File:** `decnet/web/router/fleet/api_deploy_deckies.py` (24% coverage) The deploy endpoint exercises Docker Compose orchestration via `decnet.engine.deploy`, which creates MACVLAN/IPvlan networks and runs `docker compose up`. Meaningful tests require mocking the entire Docker SDK + subprocess layer, coupling tightly to implementation details. **Status:** Deferred β€” test after Docker-in-Docker CI is available. ### DEBT-029 β€” Service-wide pub/sub bus worker (`decnet bus`) βœ… RESOLVED **Files:** `decnet/bus/` (`worker.py`, `factory.py`, `unix_client.py`, `unix_server.py`, `protocol.py`, `fake.py`, `base.py`, `topics.py`), `decnet/cli/bus.py`, `deploy/decnet-bus.service`, `tests/bus/` (62 tests green). `CLAUDE.md` promises a `ServiceBus` worker and a `get_bus()` factory, but neither exists. Today there is no event plumbing between workers: mutator, correlator, profiler, sniffer, and prober cannot publish state transitions to interested consumers. The web SSE endpoint (`/stream`) polls the DB every ~1s inside its generator loop as a result. Downstream features that need this infrastructure: live topology mutations (DEBT-030), pulsating/live topology visualization, automatic mutations, network traffic simulation, attacker-pool push updates. MVP scope (**host-local**): 1. `decnet bus` long-running worker, systemd-supervised like every other worker. Runs on every host β€” master and each swarm agent β€” independently. 2. Transport: **UNIX-domain socket** (default `/run/decnet/bus.sock`, fallback `~/.decnet/bus.sock` in dev). Kernel-authenticated peer delivery; authorization is socket file permissions (0660, group=`decnet`). No TCP, no mTLS, no external broker. 3. Wire protocol: tiny hand-rolled framing β€” 1 ASCII verb line (`PUB `, `SUB `, `EVT `, `HELLO`, `BYE`) + 4-byte big-endian body length + orjson body. Shared `matches(pattern, topic)` helper implements NATS-style wildcards (`*` = one token, `>` = one-or-more trailing tokens). 4. Factory `get_bus()` returns a client with `publish(topic, payload)` / `subscribe(pattern) -> Subscription` (async ctx + async iterator). In-process `FakeBus` for unit tests; `NullBus` when `DECNET_BUS_ENABLED=false`. 5. Topic hierarchy locked early: `topology.{id}.mutation.{state}`, `topology.{id}.status`, `decky.{id}.state`, `decky.{id}.traffic`, `attacker.observed`, `system.log`, `system.bus.health`. 6. Delivery semantics: **at-most-once, fire-and-forget**. Per-subscriber bounded queue with drop-oldest on overflow. No replay, no persistence, no queue groups, no ordering guarantees. DB remains the source of truth; the bus is the notification layer only. 7. First consumer proving end-to-end: SSE route for topology events (DEBT-030). 8. Later: migrate `/stream` off its internal poll loop onto the bus for global events. **Cross-host federation is out of MVP scope.** Each host runs its own bus β€” swarm agents and the master do not share a bus substrate. If a use case emerges that requires cross-host pub/sub, it will land as a `decnet bus --bridge-tcp` mode that proxies the UNIX socket over the existing swarm mTLS infra. DEBT-030 is master-only and therefore unblocked by this deferral. **Status:** βœ… Resolved β€” MVP shipped. Host-local UNIX-socket bus, `get_bus()` factory, `decnet bus` worker with heartbeats, systemd unit, 62 unit/integration tests green. DEBT-030 is now unblocked. ### DEBT-030 β€” Live (hot) topology mutations via web UI βœ… RESOLVED (Phase A) **Files:** `decnet/web/router/topology/api_mutations.py` (enqueue endpoint already exists), `decnet/mutator/engine.py` + `ops.py` (reconciler already applies all 7 ops), `web/src/hooks/useMazeApi.ts` (missing enqueue methods), `web/src/components/MazeNET.tsx` (editor treats every topology as pending). **Backend is already there:** - `TopologyMutation` table (`decnet/web/db/models.py:322-358`) supports `add_lan`, `remove_lan`, `attach_decky`, `detach_decky`, `remove_decky`, `update_decky`, `update_lan`. - `POST /topologies/{id}/mutations` enqueues, gated to `active|degraded`. - Mutator watch loop (`decnet/mutator/engine.py:136-190`) claims atomically, dispatches to `ops.py`, does Docker best-effort, flips topology to `degraded` on failure. **Gap is entirely in the frontend + event delivery:** 1. `useMazeApi.ts` has no `enqueueMutation()` peer to `deployTopology()`; editor edits on `active` topologies currently no-op / 4xx. 2. No mutation-status UI (pending / applying / applied / failed badges, audit log). 3. No serverβ†’client push channel for mutation state transitions β€” depends on DEBT-029. **Design (agreed):** - **Staged buffer** (client-side, Zustand, not persisted): every editor action pushes a `TopologyMutation` onto `pendingOps[]`. Undo = pop. Reset = clear. - **Apply (N changes)** button opens a diff modal rendering ops in plain English, then POSTs the batch. Batch carries the `topology.version` observed when staging began; server returns 409 on drift. - **Batch atomicity = honest partial.** Server enqueues N rows in order; mutator applies one-by-one. If op 3 fails, 1-2 stay applied, topology flips to `degraded`, user decides to fix-forward or enqueue a manual revert. (Docker ops aren't transactional; pretending otherwise causes worse bugs than honesty.) - **Visual states compose** per existing rule: `pending-mutation`, `applying`, `failed` layer on top of `running / inactive / selected`, never replace them. - **Push via SSE over the bus** (not polling): new route `GET /api/v1/topologies/{id}/events` subscribes to `topology.{id}.*` on the service bus and forwards as SSE. Envelope: `{v, type, ts, payload}`. Day-one event types: `mutation.enqueued|applying|applied|failed`, `topology.status_changed`, `topology.version_bumped`. Room to grow: `decky.state_changed`, `decky.traffic`, `attacker.observed`. - **Separate from `/stream`** deliberately: different auth scopes, different fan-out shape (per-topology vs global), different failure isolation. Two routes, one bus. **Status:** βœ… Resolved (Phase A) β€” end-to-end busβ†’UI plumbing shipped. - Mutator publishes every state transition on the bus (`mutation.applying|applied|failed`, `status`); fire-and-forget, DB remains source of truth. - Mutator watch loop is bus-woken via `topology.*.mutation.enqueued`; 10s poll stays as fallback heartbeat so a dropped wake event costs latency, not correctness. - New route `GET /api/v1/topologies/{id}/events` streams per-topology SSE β€” snapshot on connect + live forwarding of bus events, 15s keepalive, `?token=` query-param auth matching `/stream`. - Web editor opens the SSE when topology is `active|degraded`, refetches on `mutation.applied|failed|status`, surfaces a `LIVE` / `CONNECTING…` header indicator. - Smoke: `scripts/bus/smoke-mutator.sh` verifies the full mutator-family topic hierarchy round-trips through a live bus worker. **Phase B follow-up (deferred):** staged-buffer editor (Apply (N changes) + optimistic visual states using `NodeBase.status='mutating'`). Today's Phase A refetches the whole topology on each applied event β€” correct but not yet optimistic. The hooks + API method + SSE consumer that Phase B needs are already in place (`useTopologyStream.ts`, `useMazeApi.enqueueMutation`). ### ~~DEBT-031 β€” Service workers don't use the bus~~ βœ… RESOLVED **Files:** `decnet/collector/`, `decnet/correlation/`, `decnet/profiler/`, `decnet/sniffer/`, `decnet/prober/`, `decnet/ingester/`, `decnet/agent/`, `decnet/forwarder/`, `decnet/updater/`. DEBT-029 shipped the bus; DEBT-030 proved the pattern end-to-end through the mutator and the web editor. Every other worker still ignores the bus entirely β€” they neither publish the state transitions their consumers would want nor subscribe to events that could replace polling / cut latency. The plumbing is ready; the workers aren't wired in. **Guiding principle: bus is optional.** Workers must not take a hard dependency on the bus. If `get_bus()` fails or `DECNET_BUS_ENABLED=false`, the worker logs one warning at startup and continues in pre-bus mode (poll loops, DB-only state). This mirrors `decnet/mutator/engine.py:run_watch_loop` β€” try to connect, catch broadly, log, degrade to poll-only. Copy that pattern; don't invent a new one. **Publish (per worker, what should land on the bus):** - `collector` β€” `system.log` batches / high-severity lines as they ingest (fan-out to dashboards / live views). - `correlator` β€” `attacker.observed` on first sighting, `attacker.session.{started|ended}` on session boundaries. - `profiler` β€” `attacker.scored` when a profile score crosses a threshold. - `sniffer` β€” `decky.{id}.traffic` summaries (bounded rate; drop-oldest is fine per bus semantics). - `prober` β€” `decky.{id}.state` transitions when a realism probe flips health. - `ingester` β€” `system.log` for structured forwarder-originated batches. - `agent` / `forwarder` / `updater` β€” `system.{worker}.health` heartbeats + lifecycle events (start, stop, self-update applied). **Subscribe (per worker, what they could react to instead of polling):** - `correlator` / `profiler` β€” wake on `system.log` instead of polling the logs table; poll stays as fallback. - `prober` β€” wake on `decky.*.state` to re-probe immediately after a mutation-applied event. - Any worker that currently polls the DB on a fixed interval β€” add a bus-wake `asyncio.Event` exactly like the mutator's. **Constraints (non-negotiable):** 1. DB stays the source of truth. A dropped bus event costs latency, never correctness β€” every subscriber must still have a poll fallback. 2. Publishes are fire-and-forget, wrapped in `try/except log.warning`. A bus publish failure must never break the worker's primary loop. 3. No new topics outside the hierarchy documented in `CLAUDE.md` / `wiki-checkout/Service-Bus.md`. Extend `decnet/bus/topics.py` with helpers + constants; don't hand-roll topic strings at the callsite. 4. Test with `FakeBus` (see `tests/bus/conftest.py::fake_bus`). Every new publish path gets a unit test asserting the event lands on a fake subscriber; every new wake path gets a test asserting the worker re-enters its loop faster than the poll interval. 5. `DECNET_BUS_ENABLED=false` must leave every worker functional β€” add a CI matrix row or at minimum an explicit test per worker proving it. **Suggested rollout order** (ship one worker at a time, one commit each): sniffer β†’ prober β†’ correlator β†’ profiler β†’ collector β†’ ingester β†’ agent/forwarder/updater. Sniffer and prober are the highest-value publishers for the live-topology visualization story; correlator/profiler unlock the attacker-pool push updates that MazeNET's observed-entities view currently polls for. **Status:** Resolved. Nine-commit rollout landed on `dev`: 1. Prep β€” extracted `publish_safely` + `make_thread_safe_publisher` to `decnet/bus/publish.py`; added `attacker.*`, `system..health` topic builders. 2. Sniffer β€” `decky.{id}.traffic` per flow-summary / fingerprint event (bounded by the bus's drop-oldest queue). 3. Prober β€” `attacker.fingerprinted` with probe family (jarm/hassh/tcpfp) in `event.type`. 4. Correlator β€” `attacker.observed` on first sighting, hooked via an optional `publish_fn` on `CorrelationEngine`; the profiler worker carries the bus. 5. Profiler β€” `attacker.scored` per DB-committed profile upsert. 6. Collector β€” `system.log` per ingested parsed event (compact payload: decky/service/event_type/attacker_ip/timestamp). 7. Ingester β€” `system.log` per DB-committed batch (`event.type = "batch_committed"`, payload includes offset). 8. Agent / Forwarder / Updater β€” shared `run_health_heartbeat` helper emits `system..health` every 30s. **Deferred (out of DEBT-031 scope, tracked for follow-ups):** - **Realism-probe `decky.{id}.state`** β€” the prober as it exists today fingerprints attackers, not deckies. Publishing `decky.{id}.state` on realism-flip needs a separate realism probe path we don't have yet. - **Correlator `session.started` / `session.ended`** β€” `CorrelationEngine` is a batch class with no session state. A session-boundary signal would need session tracking introduced first; constants are reserved in `decnet/bus/topics.py`. - **Standalone `decnet correlate` worker** β€” the rollout plan presumed one; today the engine runs inside the profiler worker, which is the right shape for the current data flow. - **Bus-wake subscriptions** β€” publishes landed; subscribe-side (e.g. prober re-probe on `decky.*.state`) was not wired to avoid coupling the wake pattern to a subscriber we don't yet have. ### DEBT-033 β€” Transcript day-shard rotation **Files:** `decnet/templates/_shared/sessrec/sessrec.c`, `decnet/web/router/transcripts/`. Session recording v1 (SSH/Telnet interactive-session capture) stores asciinema events in **one JSONL shard per (decky, UTC day)**: `sessions-YYYY-MM-DD.jsonl`. This bounds inode count (O(days) not O(sessions)) and blunts the obvious "`while true; do login; exit; done`" DoS, but a determined attacker can still keep a single day's shard growing until the 200 MB disk-free precheck trips. When that happens the recorder silently skips new recordings (`session_skipped reason=disk_pressure`) until midnight or until operator cleanup β€” which is *safe*, but it also means an attacker can blind the recorder for the rest of the day by filling disk once. Proper fix is size-based rotation on the day shard: 1. Recorder (or a sidecar job) rotates `sessions-YYYY-MM-DD.jsonl` β†’ `sessions-YYYY-MM-DD.1.jsonl` when size crosses e.g. 500 MB; keep last N rotations (default 4 β†’ hard ceiling β‰ˆ 2 GB/day/decky). 2. Oldest rotations drop on write pressure (FIFO), not on read. 3. API router shard-index cache (see `transcripts/` router, built from session-recording plan) gains an mtime-keyed scan across all rotations for the requested day when resolving a `sid`, not just the live shard. Cache invalidation already keys on `(path, st_mtime_ns)` so rotation drops stale entries automatically. 4. Same trigger (disk pressure or a new config knob `DECNET_TRANSCRIPT_DAY_MAX_MB`) decides when to fire; no background timer needed if the recorder itself checks size before each append. **Why deferred from v1:** the per-session 10 MB cap + disk-free precheck together give bounded worst-case behavior ("recorder quietly stops; disk stays healthy") that is acceptable for a first release. Rotation is a correctness-under-load improvement, not a correctness baseline, and it couples recorder write-path + API read-path changes that are cleaner to land as one commit after v1 ships. **Status:** Open β€” implement after v1 session recording lands and we have real-world session sizes to calibrate the rotation threshold. ### βœ… DEBT-034 β€” Worker supervisor (START buttons in Config β†’ Workers) > **Shipped 2026-04-22.** systemd units for the five missing workers > (`collector` / `profiler` / `sniffer` / `prober` / `mutator`) + > `decnet.target`, polkit rule scoping `manage-units` to `decnet-*.service` > for the `decnet` group, `systemd_control` helper, single-worker + > `start-all` endpoints, `installed` flag on `WorkerStatus`, and UI > wiring. Deferred items (SWARM-host start/stop via mTLS API; > DECNET-side crash-quarantine policy) remain as named follow-ups. **Files:** `packaging/systemd/*.service` + `decnet.target` (**new**), `packaging/polkit/50-decnet-workers.rules` (**new**), `decnet/web/services/systemd_control.py` (**new**), `decnet/web/router/workers/api_start_worker.py` + `api_start_all_workers.py` (**new**), `decnet_web/src/components/Config.tsx` (enable START buttons). The Workers panel (Config β†’ Workers) landed with bus-based STOP but every START button is a disabled placeholder. STOP works because a running worker can subscribe to its own `system..control` topic and SIGTERM-self-signal when it sees `{"action": "stop"}`. START has the inverse problem β€” a *stopped* worker has no subscriber, so the same bus pattern cannot bring it back up. Something outside the worker must own the process lifecycle. **Decision: lean on systemd.** DECNET workers are already systemd-supervised in production (`deploy/decnet-bus.service` shipped with DEBT-029; the rest follow the same pattern). Building a DECNET-native supervisor (`decnetd`) would duplicate `Restart=on-failure`, crash backoff, log routing into journald, and boot ordering β€” all of which systemd already does correctly. The only non-systemd host we care about is the dev box, where operators can start workers by hand. **v1 scope:** 1. **Unit files** for every worker in `packaging/systemd/`: `decnet-bus`, `decnet-api`, `decnet-collector`, `decnet-profiler`, `decnet-sniffer`, `decnet-prober`, `decnet-mutator`. Each declares `Restart=on-failure`, `RestartSec=5s`, `User=decnet`, `Group=decnet`. A `decnet.target` groups them for `systemctl start decnet.target`. Bus is startable too β€” chicken-and-egg is fine: systemd brings it up, the API's cached `get_app_bus()` result won't self-heal without an API restart, but that's the existing singleton limitation (documented in `decnet/bus/app.py`), not a supervisor problem. 2. **Polkit rule** (`packaging/polkit/50-decnet-workers.rules`) allowing the `decnet` group to `start` / `stop` / `restart` units matching `decnet-*.service` and `decnet.target` without a password. The API runs as `decnet`, so `systemctl --no-ask-password start decnet-` just works. 3. **`decnet/web/services/systemd_control.py`** β€” small helper wrapping `systemctl start|stop|status ` via `asyncio.create_subprocess_exec`. Hardcoded unit name mapping from `KNOWN_WORKERS` (prevents command injection; name validation already enforced at the router). Exposes `start(name)`, `stop(name)`, `is_active(name)`, `list_installed()` returning `set[str]`. 4. **New admin endpoints:** - `POST /api/v1/workers/{name}/start` β€” validates against `KNOWN_WORKERS`, shells out via the helper, returns 202 on success with `{"accepted": true, "worker": , "action": "start"}`. 503 if systemd is unreachable (e.g. dev box without systemd). - `POST /api/v1/workers/start-all` β€” **best-effort**. Iterates `KNOWN_WORKERS`, calls `start()` on each, aggregates results. Returns `{"started": [...], "failed": [{name, reason}], "already_running": [...]}` with 200 even on partial success. UI surfaces a summary toast. - Keep existing bus-based STOP endpoint β€” it already works without root; no need to route STOP through systemd. 5. **UI (`Config.tsx`):** - Detect unit availability once per panel mount via `GET /api/v1/workers` enriched with `installed: bool` per row (added to `WorkerStatus`). START button enabled only when `installed == true`. - Existing "START ALL WORKERS" placeholder in the header wires to `POST /workers/start-all`; toast renders `"STARTED Β· 5 Β· ALREADY RUNNING Β· 2 Β· FAILED Β· 1 (bus: permission denied)"`. - Add a `starting` row state (spinner) between intent and observed heartbeat; transitions to `ok` on next refresh, `unknown`/`stale` if the unit never comes up (systemd log link in tooltip). 6. **Tests:** - `tests/web/services/test_systemd_control.py` β€” monkeypatch `asyncio.create_subprocess_exec`, assert the right argv. - `tests/api/workers/test_start_workers.py` β€” 202 admin / 403 viewer / 404 unknown name for single-worker start; aggregation shape for start-all with a mix of success/failure. - Frontend test coverage stays out of scope (no harness in repo). **Deferred (documented on this ticket, not separate DEBTs):** - **a. SWARM-host worker start/stop.** Today `agent` / `forwarder` / `updater` publish heartbeats to their *own* host's bus, not the master's β€” the Workers panel rows stay `unknown` by design. A master-originated start/stop for those daemons needs to ride the existing swarm mTLS API (`/api/v1/swarm/hosts/{id}/workers/{name}/{action}`), not the bus. Out of v2 as well β€” not a near-term need. When we ship it, reuse the same systemd helper on the agent side behind the mTLS boundary. - **b. DECNET-side crash-quarantine policy.** v1 uses systemd's defaults (`Restart=on-failure`, `RestartSec=5s`, no upper bound on restart count) because they're battle-tested and cover 95% of operational reality. A sophisticated circuit-breaker β€” "quarantine worker X after N crashes in M minutes, expose the quarantine state in the panel" β€” is valuable for a honeypot-in-production context where a compromised template could cause tight crash loops that mask themselves as flapping. Design sketch: extend the worker_registry to track `(name, last_crash_ts)` triples sourced from `systemctl is-failed`, flip a row to `quarantined` state after threshold, require an admin "CLEAR QUARANTINE" click (β†’ `systemctl reset-failed`) before further auto-restart. Not blocking v1. **Status:** Open. Depends on the Workers panel (shipped) and `deploy/decnet-bus.service` pattern being extended to the other workers. ### DEBT-036 β€” Session-profile ingester (keystroke-dynamics extraction from transcript shards) **Files:** `decnet/web/ingester.py` (or new sibling under `decnet/session_profiler/`), `decnet/web/db/models/attackers.py:SessionProfile` (table already exists, ships empty), `decnet/templates/_shared/sessrec/sessrec.c` (emitter side β€” already done), `decnet/web/router/attackers/api_get_attacker_detail.py` (consumer β€” already joins SessionProfile when present). The `SessionProfile` SQLModel table has been committed to storage since session recording v1 landed (see `decnet/web/db/models/attackers.py:97-143`). Every column β€” `kd_iki_mean`, `kd_iki_stdev`, `kd_iki_p50`, `kd_iki_p95`, `kd_enter_latency_p50/p95`, `kd_burst_ratio`, `kd_think_ratio`, `kd_ctrl_backspace/wkill/ukill/abort/eof`, `kd_arrow_rate`, `kd_tab_rate`, `kd_digraph_simhash`, `total_keystrokes`, `session_duration_s` β€” is nullable by design because the **ingester that populates them does not exist yet** (documented as gap #2 in `SIGNAL_CAPTURE_AUDIT.md`). Every session that gets recorded lands an empty row (or, today, no row at all) while the `[t, "i", d]` event stream in the shard carries every signal those columns exist to capture. **Motivating case.** Given the last 14 keystrokes of one real session (the `wget scanme.nmap.orgh` sequence from shard `2026-04-24`), a manual pass over the "i" events trivially recovers: - Coefficient of variation β‰ˆ **0.74** β€” solidly in the human band (scripts <0.1, jittered tools 0.3-0.6, humans 0.7-1.5+). - A **467 ms pause** before the URL argument β€” classic semantic-boundary "thinking pause" between the command verb and its argument. Bots don't emit these; they fire the whole pre-composed line at uniform cadence. - Tight **intra-word bigrams** β€” `ge` 79 ms, `t` 83 ms β€” muscle-memory transitions. - Slow **start-of-action latency** β€” `w` β†’ `g` at 225 ms, characteristic of "initiating a command" vs "executing" a remembered one. All four signals fall out of the schema for free. CoV from `kd_iki_mean` + `kd_iki_stdev`. Semantic pauses from `kd_think_ratio`. Bigram timing from `kd_digraph_simhash`. The fourth (start-of-action latency) doesn't have a column yet β€” see "Schema extensions" below. **Design:** 1. **Trigger.** Subscribe on the bus to `attacker.session.ended` *or* (pragmatic fallback until DEBT-031's deferred session-boundary topic lands) poll `Log` rows with `event_type = "session_recorded"` that lack a `SessionProfile(sid=sid)` companion row. The poll path is what ships first; wire the bus later without changing the ingester body. 2. **Read side.** For each (decky, service, sid), resolve the shard via the fallback-scan path already shipped in `323077b` (`api_get_transcript._find_shard_with_sid`). Extract only `[t, "i", d]` events β€” the per-session index built by `_get_index` already buckets events by sid, so this is O(keystrokes-in-sid), not O(shard). 3. **Feature extraction.** One bounded pass over the input events: - IATs: pairwise `events[i].t - events[i-1].t`, clipped at e.g. 10 s so genuine "went to get coffee" gaps don't destroy the stdev. - Control-key rates: count backspace / ^U / ^W / ^C / ^D / arrow / tab against `total_keystrokes`, ratios not raw counts. - Enter latencies: IAT of each `\r` relative to the previous non-`\r` input. - Burst / think ratios: fraction of IATs below 200 ms / above 1 s. - SimHash: 8-byte Hamming-comparable digest over the top-N digraphs, weighted by occurrence. 4. **Write side.** One `session_profile` upsert per sid. Idempotent on re-run (same sid β†’ same row). 5. **Schema extensions** (motivated by the manual analysis above β€” not blocking v1 but worth adding in the same commit if the ingester gets scheduled): - `kd_start_of_action_latency_ms` β€” IAT of the first keystroke after each prompt redraw (or approximated by "first keystroke after an idle gap >1 s"). User's point 5. - `kd_pause_hist_burst / _think / _distracted` β€” three-bucket pause-length histogram (<200 ms / 200-1500 ms / >1500 ms), more discriminating than a flat burst-vs-think ratio. User's middle suggestion. - `kd_top_bigrams` JSON blob β€” top-N (bigram, count, mean_iat_ms) tuples. Complement to `kd_digraph_simhash` that answers "same typist in same mental state", not just "same typist". User's first suggestion. **Non-negotiables:** - Bounded by the existing 10 MB per-session shard cap; no new disk-free precheck needed. - No PII beyond what the shard already stores. Raw keystroke `d` values (which include the attacker's passwords in the input stream) MUST NOT land in `SessionProfile` columns β€” only timing and frequency aggregates. Bigram SimHash uses *characters*, not *content* β€” but document this explicitly in the column docstring so a future contributor doesn't "improve" it into something that leaks. - Idempotent: re-running the ingester on a sid that already has a `SessionProfile` row overwrites deterministically (same shard, same `[t,"i",d]` events β†’ same features). - `FakeBus` / poll-only must keep this functional when `DECNET_BUS_ENABLED=false` β€” mirrors the DEBT-031 rollout pattern. **Acceptance:** - Shipping a decky, running a real SSH session, disconnecting β†’ within one ingester tick a `SessionProfile` row exists with non-null `kd_iki_mean`, `kd_iki_stdev`, `kd_burst_ratio`, `kd_think_ratio`, `total_keystrokes`, `session_duration_s`. - The motivating-case wget session produces CoV β‰ˆ 0.74 Β± 0.05 when the ingester processes it β€” sanity check against the manual analysis. - The AttackerDetail page surfaces at least `kd_iki_mean` + `kd_burst_ratio` somewhere in the keystroke-dynamics section, unblocking the "is this the same typist" hover story. **Status:** Open. Depends on the shard-scan fallback (shipped in `323077b`) and `SessionProfile` schema (shipped with session recording v1). The bus-trigger path depends on DEBT-031's deferred `attacker.session.started/ended` topics, but poll-driven ingestion works today and can ship first. ### DEBT-035 β€” Artifacts written as the container uid, not the API's **Files:** `decnet/services/ssh.py`, `decnet/services/telnet.py`, `decnet/templates/{ssh,telnet}/{Dockerfile,entrypoint.sh}`, `decnet/composer.py` (wherever bind mounts for `/var/lib/decnet/artifacts/**` are generated), `decnet/web/router/transcripts/api_get_transcript.py` (consumer). Every decoy container that produces artifacts (session recordings, captured uploads, credential dumps) writes into a host bind-mount under `/var/lib/decnet/artifacts/{decky}/{service}/...`. The writer is whatever uid is running inside the container β€” typically `root` (uid 0 inside the container, which maps to the host's `root` or the container's own unprivileged `decnet` uid depending on the template's `USER` directive). The API, on the other hand, runs under whatever `--user` was passed to `decnet init` β€” `anti` on dev boxes, `decnet` in production. On mismatch, the API process hits `PermissionError` the moment it tries to `stat()` the artifacts dir. The transcripts endpoint now soft-fails this into a 404 (shipped in `323077b`), which keeps the API up but still leaves the operator unable to view any session that was recorded before the mismatch was fixed by hand. **Evidence (dev box, 2026-04-24):** ``` PermissionError: [Errno 13] Permission denied: '/var/lib/decnet/artifacts/omega-decky/ssh/transcripts' ``` Workaround: `sudo chown -R anti:anti /var/lib/decnet/artifacts`. Every new decky re-creates the dir as whatever uid the container uses, so the workaround has to be re-run β€” which doesn't scale. **Design options (pick one, not all):** 1. **Container runs as the host API's uid.** `compose_fragment()` for every artifact-producing service injects `user: "{host_uid}:{host_gid}"` into the compose snippet, sourcing the uid/gid from whatever `DECNET_API_UID` / `DECNET_API_GID` the master detected at init time (or `id -u` / `id -g` of the current process at compose time). This is the cleanest but has the most blast radius β€” bind mounts need to be pre-chowned to that uid before the container starts, and some templates have `entrypoint.sh` steps that assume root (e.g. `setcap`, `chmod` of system files during service setup). 2. **Setgid bit on the artifacts tree + shared group.** `mkdir -p /var/lib/decnet/artifacts && chmod 2775 /var/lib/decnet/artifacts && chgrp decnet /var/lib/decnet/artifacts`. Every new file inherits the `decnet` group; the API (member of `decnet`) can read regardless of which uid wrote. Still requires each container to `chmod g+r` its output β€” sessrec/emitter code would need a small change to `umask(0002)` or explicit `fchmod` calls. Less invasive but fragile: any writer that forgets the umask silently regresses. 3. **Sidecar post-processor.** A long-running daemon under the API's uid `inotify`-watches `/var/lib/decnet/artifacts/**`, re-chowns new files on creation. Works without touching any template, but adds a new process and a race window between "file created" and "file readable by API". Not a great shape for an already-worker-heavy architecture. **Recommendation:** option 1, with the init command handling the setup (mkdir the artifacts tree with mode 0775, group = `--group`, then propagate the uid/gid into the compose generator). Option 2 as a fallback where option 1 can't land (e.g. templates that genuinely need root inside the container, like the conpot ICS template). **Acceptance:** - A fresh `decnet init --user anti --group anti` β†’ deploy a decky β†’ exercise a recorded session β†’ the API (running as `anti`) can read `/var/lib/decnet/artifacts/.../transcripts/sessions-*.jsonl` **without any manual chown**. - The soft-fail path shipped in `323077b` stays as defence-in-depth β€” the API must never 500 on a permission mismatch, but it also shouldn't *need* to soft-fail on a healthy install. **Status:** Open. Current workaround is `sudo chown -R : /var/lib/decnet/artifacts` after every new deploy; soft-fail in the transcripts endpoint keeps the API alive in the interim. ### DEBT-037 β€” Webhook delivery guarantees beyond MVP **Files:** `decnet/webhook/` (**new**), `decnet/web/db/models/webhooks.py` (**new**), `decnet/web/router/webhooks/` (**new**). The webhook worker (Wazuh / Shuffle / TheHive / n8n integration path) ships MVP-first: subscription CRUD + a `decnet webhook` worker that subscribes to the internal bus, forwards matching events as HTTP POSTs with HMAC-SHA256 signatures (`X-DECNET-Signature: sha256=`), and retries 3Γ— with exponential backoff. Simple-mode UI exposes an enum of event families (`AttackerDetail` / `DeckyStatus` / `SystemStatus`); Advanced mode exposes raw bus-topic patterns. Payload bodies are the existing Pydantic response models β€” no new schema. What MVP deliberately defers: 1. ~~**Circuit breaker.**~~ βœ… **Shipped 2026-04-24.** After `DECNET_WEBHOOK_CIRCUIT_THRESHOLD` (default 5) consecutive failures the worker calls `trip_webhook_circuit(uuid, ts)` β€” flips `enabled=False`, stamps `auto_disabled_at`, fires a reload. Operator clears the trip by re-enabling via PATCH, which zeros the counter and clears the stamp. UI surfaces `TRIPPED Β· ` chip on the row; page header shows a `N TRIPPED` count. 2. **Dead-letter table.** Events that exhaust retries are dropped with a log line, not persisted. Operators can't replay a missed event after they fix their Shuffle flow. Minimum viable: `webhook_dead_letters(subscription_id, topic, payload_json, final_error, dropped_at)` with a TTL sweep, and `POST /webhooks/{id}/replay?since=...` to re-queue. 3. **Delivery audit log.** No persisted record of "what went where and when." Useful for compliance and for debugging "why didn't TheHive see that alert." Same table shape as dead-letter but success-path entries with retention knob. 4. **Batch delivery / coalescing.** Every event fires one HTTP POST. High-volume topics (`system.log` on a busy master) will happily saturate the egress. Post-MVP, add a bounded batch window (e.g. up to 50 events or 500 ms) and POST an envelope `{events: [...]}`. 5. **Per-subscription rate limiting.** An admin who subscribes to `>` gets every event DECNET ever emits. A token-bucket cap (requests/sec to a given destination) protects both the webhook worker and the destination from operator self-inflicted DoS. 6. **Template overrides.** Shuffle accepts the DECNET shape; TheHive wants an observable-style envelope; Wazuh wants a flat `decoder + field` shape. MVP ships one shape. Post-MVP: per-subscription Jinja-ish payload template, or a small set of named adapters (`"shape": "thehive" | "wazuh" | "raw"`). 7. **Secret rotation.** HMAC secret is stored plaintext in the DB and rotated by UPDATE. Post-MVP: encrypt at rest (using the existing JWT secret as KEK), dual-secret window during rotation so in-flight verifications don't fail. **Non-negotiable even at MVP:** - HMAC signing (already scoped in MVP β€” listed here only to clarify it's NOT on the deferred list). - `DECNET_BUS_ENABLED=false` must leave the webhook worker functional in a degraded "disabled" mode that surfaces its state via the Workers panel, matching DEBT-031's pattern. - Retry backoff MUST jitter; synchronized retries across a fleet of DECNET masters would be its own DoS. **Status:** Not yet started. Opens alongside the webhook MVP commit β€” the MVP PR will reference this entry and the follow-up work will close items 1–7 in priority order (circuit breaker first, batch delivery last). ### DEBT-032 β€” Prober can't detect fingerprint rotation without mutation **Files:** `decnet/prober/worker.py` (~lines 235, 286, 334, 392), `decnet/web/db/models.py` (new `decky_service_fingerprints` table). Substrate identity is `(service_name, implementation_fingerprint)`, not service name alone. A base-image rebuild that rotates OpenSSH 8.4 β†’ 9.2 β€” or any recompose that changes JARM / HASSH / TCP fingerprint without changing the service list β€” is a substrate transition from the attacker's recon POV, and today the correlation graph sees none of it. The prober already computes JARM (`worker.py:286`), HASSH (`worker.py:334`), and TCP fingerprint (`worker.py:392`), and emits each as RFC 5424 syslog + optional bus publish. What's missing is **per-(decky, service, probe_type) persistence** to diff against: the current dedup set `probed: dict[IP β†’ {probe_type β†’ set(ports)}]` (`worker.py:235`) is in-memory and scoped to one run, so any restart loses history and any same-IP probe on a changed substrate can't be detected as a change. **Design:** 1. New SQLModel table `decky_service_fingerprints` keyed by `(decky_name, service, probe_type)` with `last_hash, last_seen_at, sample_count`. One upsert per probe; bounded by fleet Γ— probe families. 2. Prober reads `last_hash` before emitting; on diff, emits a new `substrate_fingerprint_changed` event (RFC 5424 syslog + `decky.{id}.fingerprint` bus topic) with `{decky, service, probe_type, old_hash, new_hash}`. On match, upsert the timestamp and skip the event. 3. Correlator consumes the new event kind into a parallel per-decky index (mirroring the mutation index landed in this session) and interleaves `πŸ” decky-03 hassh drift` markers in `AttackerTraversal.fingerprints_during`. 4. Divergence detector: compare `substrate_state(t)` fold (mutations) vs `observed_identity(t)` fold (fingerprints) per decky. A fingerprint change without a preceding mutation β‡’ `substrate_divergence` finding β€” container drift, compromised base image, rootkit banner rewrite, or prober lag. Falls out of the data model for free once both streams exist. **Prerequisite satisfied:** mutation event stream + correlator mutation-kind parser landed alongside this DEBT entry (commits `f875350`, `fa0cdb3`, `bf5ed7a`, `d4d8a2a` on `dev`). The fingerprint stream plugs into the same substrate: same RFC 5424 emission pattern, sibling per-decky engine index, same timeline interleaving. **Status:** Open β€” deferred to its own commit sequence. The dedup state in `worker.py:235` is the only thing standing between "JARM hash computed" and "substrate rotation detected." --- ## 🟒 Low ### ~~DEBT-022 β€” Debug `print()` in correlation engine~~ βœ… CLOSED (false positive) `decnet/correlation/engine.py:20` β€” The `print()` call is inside the module docstring as a usage example, not in executable code. No production code path affected. ### DEBT-023 β€” Unpinned base Docker images **Files:** All `templates/*/Dockerfile` `debian:bookworm-slim` and similar tags are used without digest pinning. Image contents can silently change on `docker pull`, breaking reproducibility and supply-chain integrity. **Status:** Deferred β€” requires `docker pull` access to resolve current digests for each base image. ### ~~DEBT-024 β€” Stale service version hardcoded in Redis template~~ βœ… RESOLVED ~~**File:** `templates/redis/server.py:15`~~ `REDIS_VERSION` updated from `"7.0.12"` to `"7.2.7"` (current stable). ### ~~DEBT-025 β€” No lock file for Python dependencies~~ βœ… RESOLVED ~~**Files:** Project root~~ `requirements.lock` generated via `pip freeze`. Reproducible installs now available via `pip install -r requirements.lock`. --- ## Summary | ID | Severity | Area | Status | |----|----------|------|--------| | ~~DEBT-001~~ | βœ… | Security / Auth | resolved `b6b046c` | | ~~DEBT-002~~ | βœ… | Security / Auth | closed (by design) | | ~~DEBT-003~~ | βœ… | Security / Infra | closed (false positive) | | ~~DEBT-004~~ | βœ… | Security / API | resolved `b6b046c` | | ~~DEBT-005~~ | βœ… | Testing | resolved | | ~~DEBT-006~~ | βœ… | Testing | resolved | | ~~DEBT-007~~ | βœ… | Testing | resolved | | ~~DEBT-008~~ | βœ… | Security / Auth | resolved | | ~~DEBT-009~~ | βœ… | Observability | closed (false positive) | | ~~DEBT-010~~ | βœ… | Code Duplication | resolved | | DEBT-011 | 🟑 Medium | DB / Migrations | deferred (Alembic scope) | | ~~DEBT-012~~ | βœ… | Config | resolved | | ~~DEBT-013~~ | βœ… | Security / Input | resolved | | ~~DEBT-014~~ | βœ… | Reliability | resolved | | ~~DEBT-015~~ | βœ… | Security / API | resolved | | ~~DEBT-016~~ | βœ… | Security / API | resolved | | ~~DEBT-017~~ | βœ… | Reliability | resolved | | ~~DEBT-018~~ | βœ… | Infra | resolved | | ~~DEBT-019~~ | βœ… | Security / Infra | resolved | | ~~DEBT-020~~ | βœ… | Docs | resolved | | ~~DEBT-021~~ | βœ… | Architecture | resolved `de84cc6` | | ~~DEBT-022~~ | βœ… | Code Quality | closed (false positive) | | DEBT-023 | 🟒 Low | Infra | deferred (needs docker pull) | | ~~DEBT-024~~ | βœ… | Infra | resolved | | ~~DEBT-025~~ | βœ… | Build | resolved | | DEBT-026 | 🟑 Medium | Features | deferred (out of scope) | | DEBT-027 | 🟑 Medium | Features | deferred (out of scope) | | DEBT-028 | 🟑 Medium | Testing | deferred (needs DinD CI) | | DEBT-029 | 🟑 Medium | Architecture / Bus | βœ… resolved | | DEBT-030 | 🟑 Medium | Web / Live mutations | βœ… resolved (Phase A) | | ~~DEBT-031~~ | βœ… | Workers / Bus integration | resolved | | DEBT-032 | 🟑 Medium | Correlation / Prober | open | | DEBT-033 | 🟑 Medium | Storage / Session recording | open | | DEBT-035 | 🟑 Medium | Artifacts / Filesystem perms | open | | DEBT-036 | 🟑 Medium | Correlation / Keystroke dynamics | open | | DEBT-037 | 🟑 Medium | Integration / Webhooks | open (tracks MVP follow-ups) | **Remaining open:** DEBT-011 (Alembic), DEBT-023 (image pinning), DEBT-026 (modular mailboxes), DEBT-027 (Dynamic bait store), DEBT-028 (deploy endpoint tests), DEBT-032 (fingerprint rotation detection), DEBT-033 (transcript shard rotation), DEBT-035 (artifacts uid/gid alignment), DEBT-036 (session-profile ingester), DEBT-037 (webhook delivery hardening). **Estimated remaining effort:** ~24 hours. DEBT-030 Phase B (optimistic staged-buffer editor) is a follow-up, not debt.