From 6cbf8de6a86373167513f428b4ed932ed314bfee Mon Sep 17 00:00:00 2001
From: anti
Date: Thu, 23 Apr 2026 21:52:05 -0400
Subject: [PATCH] docs: signal-capture audit + /api/v1 route audit
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- SIGNAL_CAPTURE_AUDIT.md: end-to-end walkthrough of what attacker
  signals DECNET captures at each pipeline stage, where the gaps are
  (session profile ingestion, keystroke dynamics), and what ships for
  v1 vs what lands post-v1.
- api-audit.md: FastAPI /api/v1 route audit — surface area, auth
  requirements, status-code coverage, and where schema drift would
  bite the schemathesis suite.

Both are operator/engineering reference docs, not user-facing.
---
 SIGNAL_CAPTURE_AUDIT.md |  566 ++++++++++++++++++++++
 api-audit.md            | 1000 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 1566 insertions(+)
 create mode 100644 SIGNAL_CAPTURE_AUDIT.md
 create mode 100644 api-audit.md

diff --git a/SIGNAL_CAPTURE_AUDIT.md b/SIGNAL_CAPTURE_AUDIT.md
new file mode 100644
index 00000000..d6392064
--- /dev/null
+++ b/SIGNAL_CAPTURE_AUDIT.md
@@ -0,0 +1,566 @@
+# DECNET Capture Pipeline — Attacker-Profiling Signal Audit
+
+**Date**: 2026-04-22
+**Scope**: v1 capture readiness for post-v1 profiler extraction
+**Methodology**: End-to-end verification (emission → transport → storage) for each signal against active code paths.
+
+---
+
+## Executive Summary
+
+**Capture Status by Category**:
+
+| Category | Captured | Partial | Not Captured | n/a |
+|----------|----------|---------|--------------|-----|
+| Session Environment | 0 | 2 | 1 | 1 |
+| Keystroke/Human | 0 | 5 | 4 | 0 |
+| SSH Transport | 2 | 3 | 2 | 0 |
+| Network/TCP | 1 | 4 | 4 | 0 |
+| TLS/L7 | 2 | 0 | 4 | 0 |
+| Aggregated/Derived | 0 | 4 | 3 | 0 |
+| **TOTAL** | **5** | **18** | **18** | **1** |
+
+**Critical Pre-v1 Gaps** (blockers if signals are roadmap-committed):
+
+1. **KEX algorithm ordering** — The HASSH hash is stored, but the raw `kex_algorithms` string is only emitted to syslog and never persisted to the DB. A future extractor would have to parse syslog archives.
+2. **Per-keystroke timing** — Asciinema v2 `"i"` events with `t` timestamps are written to day-shard files on disk but are never ingested into the database. Using them requires a filesystem polling and parsing path.
+3. **TCP options order** — Captured in PCAP + sniffer logs (`options_sig`), but `options_sig` is a rolled-up signature string, not the raw per-connection sequence.
+4. **Terminal size (COLS×ROWS)** — Not captured from pty-req at all; would require SSH protocol-level interception.
+5. **SSH client version** — The client's RFC 4253 identification string (e.g. `SSH-2.0-OpenSSH_9.2p1`) is sent in plaintext before key exchange, but nothing records it: the prober captures the server banner, and the sniffer does not reassemble the TCP stream to extract the client's.
+
+**Biggest ROI capture improvements** (cheap, high-value):
+
+1. Add an `ssh_client_banners` column (JSON list) to the Attacker table — capture the `SSH-2.0-*` identification string from the TCP stream, where it is sent in plaintext before key exchange (it is not part of pty-req).
+2. Ingest asciinema keystroke timing into a new `SessionProfile` table (already designed in the v2 roadmap).
+3. Store raw KEX algorithm lists in `AttackerBehavior.kex_order_raw` (MEDIUMTEXT) instead of relying on deduplicated syslog events.
+
+---
+
+## Per-Signal Classification
+
+### Per-Session Environment (SessionProfile candidates)
+
+#### TERM environment variable
+- **Status**: `partial`
+- **Where**: sshd receives TERM in the `pty-req` channel request; it would be emitted to syslog by `emit_capture.py` if that path were implemented.
+- **Current path**: Not found in the active code path. Check `decnet/templates/ssh/emit_capture.py` or the syslog bridge.
+- **Missing**: A database column in a `SessionProfile` table; there is no structured ingestion.
+- **Cheap fix**: Modify the SSH syslog bridge to emit a `session_event` with `term=`. Create a `SessionProfile` table with a `session_term` TEXT column.
+- **Priority**: V2 backlog (nice-to-have for separating humans from automation; low discriminative power).
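Two of the SSH gaps in the summary above, the raw KEX lists and the client version string, both live in the plaintext opening of the client's TCP stream: the RFC 4253 identification line, followed by the first binary packet, SSH_MSG_KEXINIT. The sketch below shows how little parsing that takes. It assumes the sniffer can hand over the reassembled client-to-server bytes (the sniffer's real interfaces are not shown in this audit), and every function name is hypothetical. `hassh()` follows the published HASSH client definition (MD5 over `kex;enc_c2s;mac_c2s;comp_c2s`), which shows why the raw lists matter: the hash can be recomputed from them, but they cannot be recovered from the hash.

```python
import hashlib
import struct

SSH_MSG_KEXINIT = 20  # RFC 4253 §12
# Name-lists in KEXINIT order (RFC 4253 §7.1); the two language lists that follow are skipped.
KEXINIT_FIELDS = ("kex", "hostkey", "enc_c2s", "enc_s2c", "mac_c2s", "mac_s2c", "comp_c2s", "comp_s2c")


def split_banner(stream: bytes) -> tuple[str, bytes]:
    """Split the client's identification line (e.g. 'SSH-2.0-OpenSSH_9.2p1') off the stream."""
    end = stream.find(b"\r\n")
    if not stream.startswith(b"SSH-") or end < 0:
        raise ValueError("stream does not start with an SSH identification string")
    return stream[:end].decode("ascii", "replace"), stream[end + 2:]


def _name_list(buf: bytes, off: int) -> tuple[list[str], int]:
    """Read an RFC 4251 name-list: uint32 length, then comma-separated ASCII names."""
    (length,) = struct.unpack_from(">I", buf, off)
    raw = buf[off + 4:off + 4 + length]
    if len(raw) != length:
        raise ValueError("truncated name-list")
    return (raw.decode("ascii").split(",") if raw else []), off + 4 + length


def parse_kexinit(packet: bytes) -> dict[str, list[str]]:
    """Parse the first (still unencrypted) binary packet after the banner as KEXINIT."""
    packet_len, pad_len = struct.unpack_from(">IB", packet, 0)
    payload = packet[5:4 + packet_len - pad_len]  # packet_length covers pad byte + payload + padding
    if not payload or payload[0] != SSH_MSG_KEXINIT:
        raise ValueError("not an SSH_MSG_KEXINIT packet")
    off = 1 + 16  # skip the message type and the 16-byte random cookie
    lists = {}
    for name in KEXINIT_FIELDS:
        lists[name], off = _name_list(payload, off)
    return lists


def hassh(lists: dict[str, list[str]]) -> str:
    """Client HASSH: MD5 of 'kex;enc_c2s;mac_c2s;comp_c2s', each list comma-joined in order."""
    text = ";".join(",".join(lists[k]) for k in ("kex", "enc_c2s", "mac_c2s", "comp_c2s"))
    return hashlib.md5(text.encode("ascii")).hexdigest()


def _build_kexinit(lists: dict[str, list[str]], pad_len: int = 4) -> bytes:
    """Build a KEXINIT packet for the usage example (zero cookie, no MAC)."""
    payload = bytes([SSH_MSG_KEXINIT]) + bytes(16)
    for name in KEXINIT_FIELDS + ("lang_c2s", "lang_s2c"):
        raw = ",".join(lists.get(name, [])).encode("ascii")
        payload += struct.pack(">I", len(raw)) + raw
    payload += b"\x00" + struct.pack(">I", 0)  # first_kex_packet_follows, reserved
    return struct.pack(">IB", 1 + len(payload) + pad_len, pad_len) + payload + bytes(pad_len)


SAMPLE_STREAM = b"SSH-2.0-OpenSSH_9.2p1\r\n" + _build_kexinit({
    "kex": ["curve25519-sha256", "diffie-hellman-group14-sha256"],
    "hostkey": ["ssh-ed25519"],
    "enc_c2s": ["aes128-ctr"], "enc_s2c": ["aes128-ctr"],
    "mac_c2s": ["hmac-sha2-256"], "mac_s2c": ["hmac-sha2-256"],
    "comp_c2s": ["none"], "comp_s2c": ["none"],
})

banner, rest = split_banner(SAMPLE_STREAM)
lists = parse_kexinit(rest)
print(banner, lists["kex"], hassh(lists))
```

Storing `lists` (at minimum `lists["kex"]`) in `kex_order_raw` keeps the ordering that the HASSH digest discards. A production version would also have to handle the banner and KEXINIT arriving in separate TCP segments, and older clients that end the identification line with a bare LF (RFC 4253 §4.2 suggests tolerating this).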
+
+#### LANG / LC_ALL
+- **Status**: `n/a`
+- **Why**: The server-side locale is baked into the container image and is not attacker-controlled. The attacker's client locale is only visible if the client forwards it in an SSH `env` channel request (OpenSSH `SendEnv`), which sshd ignores unless `AcceptEnv` allows it.
+- **Priority**: defer (not reliably capturable from the server vantage point).
+
+#### SSH client version string (full SSH-2.0-OpenSSH_9.2p1…)
+- **Status**: `partial`
+- **Where**: The RFC 4253 identification string is transmitted in plaintext before key exchange, so the sniffer could capture it from the TCP stream. The prober (`hassh.py`, lines 58–101) captures the server banner, not the client's.
+- **Missing**: Client-side banner capture. The sniffer would need TCP stream reassembly to pluck the banner from the raw payload.
+- **Cheap fix**: Extend the sniffer to parse the client identification string from the TCP stream (it precedes key exchange, so no decryption is needed); emit an `ssh_client_banner` event. Store it in Attacker.`ssh_client_banners` (JSON list).
+- **Priority**: v1 blocker if client profiling is committed. Currently `partial` only through the TLS fingerprint fallback, which does not cover SSH clients.
+
+#### Terminal size (COLS × ROWS)
+- **Status**: `not_captured`
+- **Why**: The SSH `pty-req` channel request (RFC 4254 §6.2) carries the TERM value, the terminal width and height in characters and in pixels, and the encoded terminal modes (which include line speeds). sshd parses it but does not log it by default; capturing it would require patching sshd or intercepting at the protocol layer.
+- **Missing**: No access to the pty-req payload without protocol-level instrumentation.
+- **Cheap fix**: Patch the SSH entrypoint to log pty-req to syslog before accepting the request (requires a custom OpenSSH build).
+- **Priority**: V2 backlog (interesting for typing-space reconstruction, but not a blocker).
+
+---
+
+### Per-Session, Keyboard/Human (SessionProfile candidates)
+
+#### Per-keystroke timing (t in asciinema "i" events)
+- **Status**: `partial`
+- **Where**: The sessrec pipeline (`decnet/templates/ssh/sessrec/`) writes asciinema v2 day-shards with per-keystroke `"i"` (input) events carrying `t` (timestamp in seconds since session start). Files on disk: `/var/lib/decnet/session_recordings//.json` (or similar).
+- **Missing**: No database ingestion. Extractors must read the asciinema files from disk and parse the `"i"` event stream post hoc.
+- **Cheap fix**: Ingest the keystroke timing stream into a new `SessionProfile` table (design already in DEVELOPMENT_V2.md). Add a job that parses day-shard files on rotation and computes inter-keystroke-interval (IKI) moments, burst ratio, etc.
+- **Priority**: v1 blocker if keystroke dynamics is roadmap-committed. The data exists but is not queryable.
+
+#### Control-character stream (backspace, ^W, ^U, ^C, ^D, arrows, tab)
+- **Status**: `partial`
+- **Where**: Asciinema records every keystroke, control bytes included, in the data of `"i"` events, so the raw input sequence is preserved.
+- **Missing**: Same as above — files on disk, no DB ingestion. A future extractor can parse control bytes from the data element (third field) of each `"i"` event.
+- **Cheap fix**: Same as keystroke timing — ingest asciinema events and compute `kd_ctrl_*` rates in SessionProfile.
+- **Priority**: v2 (depends on SessionProfile schema).
+
+#### Inter-command think time (prompt-return to next-command-start gap)
+- **Status**: `not_captured`
+- **Why**: Requires prompt-boundary detection in the asciinema stream (heuristic: a line ending in `$` or `#` followed by a pause > 100 ms). No active code marks prompts.
+- **Missing**: Prompt-boundary markers in asciinema. Would require ML or regex-based post-processing.
+- **Cheap fix**: Add prompt-regex configuration + marker injection during sessrec playback, or post-hoc analysis over asciinema.
+- **Priority**: V2 (interesting but requires heuristic or attacker-side annotation).
+
+#### Pause before sensitive commands
+- **Status**: `not_captured`
+- **Why**: Requires command-boundary detection (typing a full command, then detecting the gap before Enter). Asciinema captures this timing, but no code marks command boundaries.
+- **Missing**: Command-line parsing + gap-detection logic.
+- **Cheap fix**: Offline analysis: parse `"i"` events, detect Enter (`\r`), measure the gap before Enter. Correlate with command content from `"o"` (output) events.
+- **Priority**: V2 backlog (post-extraction analysis; interesting for psychological profiling).
+
+#### Command n-grams
+- **Status**: `partial`
+- **Where**: The SSH service logs individual commands to syslog when pty input is detected. The Attacker.`commands` JSON array stores seen commands (but coarse-grained per service/decky, not per session).
+- **Missing**: Per-session, per-command sequencing. No command bigrams/trigrams are computed.
+- **Cheap fix**: Parse the asciinema `"i"` + `"o"` stream to extract full command lines; store them as a JSON list in SessionProfile.`cmd_sequence` or a new `SessionCommand` table.
+- **Priority**: V2 (foundation for a command-chaining fingerprint).
+
+#### Flag preferences (ls -la vs ls -al, ps -ef vs ps aux)
+- **Status**: `not_captured`
+- **Why**: Asciinema records the **typed** command line exactly, but no code parses flag ordering or normalizes commands for pattern comparison.
+- **Missing**: Canonical command parsing + flag-order extraction.
+- **Cheap fix**: Offline: regex-parse commands from asciinema, extract flag sequences, compute n-grams over flag positions.
+- **Priority**: V2 (cheap post-processing, good human-vs-tool separator).
+
+#### Typo patterns (suod, sl)
+- **Status**: `not_captured`
+- **Why**: The `"o"` stream shows only the corrected, echoed command line; the typos survive only in the raw `"i"` keystroke stream, and no code reconstructs edits from it.
+- **Example**: typing `suod`, pressing Backspace twice, then typing `do` shows as `sudo` in the `"o"` output; the intermediate typo is **visible** in the `"i"` event stream (`s`, `u`, `o`, `d`, DEL, DEL, `d`, `o`) but requires careful keystroke-by-keystroke parsing.
+- **Missing**: Raw keystroke stream parsing to detect backspace/correction patterns.
+- **Cheap fix**: Parse `"i"` events, reconstruct the line state keystroke by keystroke, and log (typed_text, final_text) pairs to detect corrections.
+- **Priority**: V2 (unique human fingerprint, but requires custom asciinema parsing).
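The keystroke items above (per-keystroke timing, control characters, and the typo and think-time analyses built on them) all start from the same step: pulling the `"i"` events out of an asciicast v2 stream. The sketch below assumes one session per stream (a header object on the first line, then one `[time, code, data]` array per line) and computes a subset of the SessionProfile columns. The control-byte-to-column mapping, counting one `"i"` event as one keystroke, and the nearest-rank percentile are illustrative choices, not the DEVELOPMENT_V2.md definitions, which this audit does not reproduce.

```python
import json
import math
import statistics
from collections import Counter
from typing import Iterable

# Illustrative mapping from input bytes to SessionProfile-style column suffixes.
CTRL_BYTES = {
    "\x7f": "backspace",  # DEL: what most terminals send for Backspace
    "\x08": "backspace",  # ^H
    "\x17": "wkill",      # ^W
    "\x15": "ukill",      # ^U
    "\x03": "abort",      # ^C
    "\x04": "eof",        # ^D
    "\t": "tab",
}
# Cursor keys in normal mode (ESC [ A..D); application mode (ESC O A..D) is ignored here.
ARROWS = ("\x1b[A", "\x1b[B", "\x1b[C", "\x1b[D")


def input_events(lines: Iterable[str]) -> list[tuple[float, str]]:
    """Return (t, data) for every "i" event of an asciicast v2 stream."""
    it = iter(lines)
    header = json.loads(next(it))
    if header.get("version") != 2:
        raise ValueError("not an asciicast v2 stream")
    events = []
    for line in it:
        if not line.strip():
            continue
        t, code, data = json.loads(line)
        if code == "i":
            events.append((float(t), data))
    return events


def nearest_rank(sorted_vals: list[float], pct: float) -> float:
    """Nearest-rank percentile of a sorted, non-empty list."""
    return sorted_vals[max(0, math.ceil(pct / 100 * len(sorted_vals)) - 1)]


def keystroke_features(events: list[tuple[float, str]]) -> dict[str, float]:
    """Compute IKI statistics and control-character rates (rates are per "i" event)."""
    total = len(events)
    times = [t for t, _ in events]
    ikis = sorted(b - a for a, b in zip(times, times[1:]))
    ctrl = Counter(CTRL_BYTES[ch] for _, data in events for ch in data if ch in CTRL_BYTES)
    arrows = sum(data.count(seq) for _, data in events for seq in ARROWS)

    def rate(n: int) -> float:
        return n / total if total else 0.0

    feats = {"total_keystrokes": total, "kd_arrow_rate": rate(arrows), "kd_tab_rate": rate(ctrl["tab"])}
    for name in ("backspace", "wkill", "ukill", "abort", "eof"):
        feats[f"kd_ctrl_{name}"] = rate(ctrl[name])
    if ikis:  # needs at least two keystrokes
        feats["kd_iki_mean"] = statistics.fmean(ikis)
        feats["kd_iki_stdev"] = statistics.pstdev(ikis)
        feats["kd_iki_p50"] = nearest_rank(ikis, 50)
        feats["kd_iki_p95"] = nearest_rank(ikis, 95)
    return feats


# Usage: a four-keystroke session that types "ls", deletes the "s", and presses Enter.
CAST = [
    '{"version": 2, "width": 80, "height": 24}',
    '[0.50, "i", "l"]',
    '[0.62, "i", "s"]',
    '[0.80, "i", "\\u007f"]',
    '[1.10, "i", "\\r"]',
]
features = keystroke_features(input_events(CAST))
```

Typo reconstruction and the think-time heuristics would consume the same `(t, data)` list; only the feature functions change.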
+
+#### Editor choice (vi/vim/nano/ed)
+- **Status**: `partial`
+- **Where**: Command launch (`vi`, `nano`, `ed`) is visible in the asciinema `"i"` + `"o"` stream and captured in Attacker.`commands`.
+- **Missing**: No aggregation of editor invocations or time-in-editor statistics.
+- **Cheap fix**: Post-process commands, count editor launches, extract the editor type. Could add to AttackerBehavior.`preferred_editor` or a new SessionProfile.`editor_used`.
+- **Priority**: V2 (behavioral signal, low priority).
+
+#### Shell history usage (`!!`, `!$`, `^old^new`, `fc`)
+- **Status**: `partial`
+- **Where**: The input stream captures the actual invocation (if the attacker types `!!`, it is visible in `"i"`). The `"o"` output shows the expanded command.
+- **Missing**: No parsing of history-expansion syntax; identifying `!` / `^` patterns requires post-processing.
+- **Cheap fix**: Regex-scan asciinema input for shell history operators; count occurrences.
+- **Priority**: V2 (interesting tool-chain signal, but low volume).
+
+---
+
+### Per-Attacker, SSH Transport (AttackerBehavior candidates)
+
+#### HASSH / HASSHServer
+- **Status**: `captured`
+- **Where**: The prober (`decnet/prober/hassh.py`) computes the HASSHServer fingerprint; it is stored in the `Attacker.fingerprints` JSON list (generic bounty store) and also emitted to syslog by the prober worker.
+- **Note**: Roadmap says `[x]` (captured); verified in code at lines 244–252 of `hassh.py`.
+- **Storage**: `Attacker.fingerprints` (JSON list of `{type, value, ...}` dicts); not in AttackerBehavior, but queryable.
+- **Priority**: ✓ captured; v2: consider normalizing to `AttackerBehavior.hassh_server` for faster lookup.
+
+#### KEX algorithm preference ORDER (beyond HASSH hash)
+- **Status**: `partial`
+- **Where**: The sniffer logs raw `kex_algorithms`, `encryption_s2c`, `mac_s2c`, `compression_s2c` strings to syslog in `tls_session` and `tcp_syn_fingerprint` events (fingerprint.py lines 240–252).
+- **Missing**: Stored in **syslog only**, not in the DB. The Attacker table has `fingerprints` (bounty store) but no dedicated `kex_order_raw` column.
+- **Path to recovery**: Read syslog archives and parse the `kex_algorithms` field, which does not scale to queries.
+- **Cheap fix**: Add `AttackerBehavior.kex_order_raw` (MEDIUMTEXT, JSON string list) and `kd_kex_order_hash` (similar to the digraph simhash). Populate both during sniffer event ingestion.
+- **Priority**: v1 blocker if KEX ordering is committed to the roadmap (currently only the hash is stored; raw data must be re-parsed from syslog).
+
+#### Public key comment field
+- **Status**: `not_captured`
+- **Why**: The key comment is not part of the public key blob sent during `publickey` authentication (RFC 4252 §7); it stays in the client's key file or agent. What the server does receive is the key blob itself (algorithm name and key data), and recording it would require PAM/auth-hook instrumentation.
+- **Missing**: No interception of public key authentication payloads.
+- **Cheap fix**: Patch the SSH server to emit an `auth_pubkey` event carrying the key fingerprint (the comment itself never reaches the server). Or use `net.ssh` library instrumentation.
+- **Priority**: V2 backlog (the key fingerprint is valuable for key-reuse fingerprinting, but key auth is rare).
+
+#### Private key type advertised (Ed25519 / RSA / ECDSA)
+- **Status**: `partial`
+- **Where**: The public key algorithm name is carried in the `publickey` authentication request. The sniffer cannot decode it, because traffic is encrypted once key exchange completes (SSH_MSG_NEWKEYS). Server-side sshd doesn't log it.
+- **Missing**: Requires either passive PCAP of the SSH transport (not usable; encrypted) or a server-side auth hook.
+- **Cheap fix**: Patch sshd to emit an `auth_pubkey_type` event during authentication.
+- **Priority**: V2 (interesting but lower signal than the key fingerprint).
+
+#### Agent forwarding requested?
+- **Status**: `not_captured`
+- **Why**: Agent forwarding is requested after authentication with an `auth-agent-req@openssh.com` channel request on the session channel. Like all traffic after key exchange, it is encrypted.
+- **Missing**: Would require decrypting the SSH transport or instrumenting sshd.
+- **Cheap fix**: sshd sets `SSH_AUTH_SOCK` in the session environment when it grants forwarding; log the `auth-agent-req@openssh.com` request (or the presence of `SSH_AUTH_SOCK` at login) to syslog.
+- **Priority**: V2 (useful for lateral-movement detection).
+
+#### Channel multiplexing pattern
+- **Status**: `partial`
+- **Where**: The SSH service logs each command separately. Channel open/close events could be tracked, but no code currently does so.
+- **Missing**: A per-session channel state machine (open channels, their types, lifetime).
+- **Cheap fix**: Instrument sshd, or use SSH_MSG_CHANNEL_OPEN events in syslog, to track simultaneous channels.
+- **Priority**: V2 (rare; most attackers use sequential commands).
+
+#### SSH_CLIENT / SSH_CONNECTION environment variables
+- **Status**: `captured`
+- **Where**: The SSH server **always** sets `SSH_CLIENT` and `SSH_CONNECTION` in the child shell, where user code (bashrc, commands) can read them. If the attacker runs `echo $SSH_CLIENT`, it is visible in asciinema output.
+- **Missing**: No **automatic** logging of these vars. Requires parsing asciinema for intentional queries or patching sshd to emit them.
+- **Cheap fix**: Patch the SSH PAM or auth hook to log `SSH_CLIENT` on successful auth. Or parse asciinema for `echo $SSH_*` commands.
+- **Priority**: V2 (low value; mostly redundant with src_ip already in logs).
+
+---
+
+### Per-Attacker, Network/Transport (AttackerBehavior candidates)
+
+#### TCP timestamp clock skew (Kohno 2005)
+- **Status**: `partial`
+- **Where**: PCAP contains TCP timestamps (if present). The sniffer extracts MSS, window size, and options (fingerprint.py lines 77–94); for timestamps it records only whether the option is present (`has_timestamps`).
+- **Missing**: Raw timestamp values (the `(TSval, TSecr)` pair scapy returns for the `Timestamp` option) are NOT extracted; only the boolean `has_timestamps` flag is stored. Computing clock skew needs TSval values from many packets, paired with local receive times.
+- **Path to recovery**: Raw PCAP analysis (if PCAPs are retained on disk). Each TCP packet carrying the option has a `Timestamp (TSval, TSecr)` pair that can be parsed post hoc.
+- **Cheap fix**: Extend the sniffer to extract TSval values and RTT deltas. Store them as a per-flow timing summary in the `tcp_flow_timing` event (which already captures flow metrics).
+- **Priority**: V2 (requires PCAP or extended sniffer capture; useful for device and OS fingerprinting).
+
+#### TCP ISN generator characteristics
+- **Status**: `not_captured`
+- **Why**: The ISN is visible in PCAP (TCP sequence number on the SYN). The sniffer tracks flow sequence numbers for retransmit detection (line 850) but does not extract the initial SYN sequence number across connections to analyze ISN patterns.
+- **Missing**: No per-connection ISN logging. ISN values would need to be rolled up across multiple SYNs to the same port.
+- **Cheap fix**: On every SYN, log `syn_seq` in the `tcp_syn_fingerprint` event. Post-hoc analysis can compute randomness metrics.
+- **Priority**: V2 backlog (weak signal; ISN randomization is standard on modern OSes).
+
+#### TCP options ordering in SYN
+- **Status**: `partial`
+- **Where**: The sniffer extracts `options_sig` (line 87) via `_extract_options_order()` from scapy TCP options. This is a **signature string** (e.g., `"MSS,WScale,SAckOK,Timestamp"`).
+- **Missing**: The signature is **aggregated**; the raw per-packet ordering is not stored. Also, `options_sig` is deduplicated in logs (only one event per unique signature per dedup window).
+- **Path to recovery**: Raw PCAP analysis, or re-parsing sniffer logs to extract the signature. The signature alone is good enough for OS fingerprinting.
+- **Cheap fix**: Store the raw options list (not just the signature) in the `tcp_fingerprint` JSON in AttackerBehavior. The current schema (models.py lines 174–177) only stores the aggregated `{window, wscale, mss, options_sig}`.
+- **Priority**: v1 improvement (low effort; options_sig already exists, add the raw list).
+
+#### Initial congestion window ramp-up
+- **Status**: `not_captured`
+- **Why**: Requires detailed TCP state-machine tracking (SYN, SYN-ACK, ACK sequence with packet sizes). The sniffer tracks the `packets` count and `bytes` total per flow (lines 844–868), but not per-packet sequence or ACK-clock dynamics.
+- **Missing**: Per-packet payload sizes and ACK timing.
+- **Cheap fix**: Extend the `tcp_flow_timing` event to include per-packet sizes (as a JSON list) or a CWND estimate from ACK patterns.
+- **Priority**: V2 backlog (very niche; useful for Reno vs. Cubic vs. BBR detection, but rare in a honeypot context).
+
+#### Retransmit timing and backoff
+- **Status**: `partial`
+- **Where**: The sniffer tracks a `retransmits` count per flow (lines 873–877, 922), emitted in the `tcp_flow_timing` event. Only the count is recorded, not the **timing** of retransmits.
+- **Missing**: Timing deltas between retransmit pairs (RTO, exponential backoff pattern).
+- **Path to recovery**: Raw PCAP; sequence numbers are not logged in `tcp_flow_timing`.
+- **Cheap fix**: Extend the event to include retransmit timing deltas (list of RTOs).
+- **Priority**: V2 (useful for network-condition inference; low value on honeypots).
+
+#### MTU / path-MTU discovery behavior
+- **Status**: `partial`
+- **Where**: The sniffer tracks per-flow byte counts (line 868) and can infer the effective MSS from packet sizes. The TCP fingerprint includes the extracted MSS (lines 77–94, emitted in `tcp_syn_fingerprint`).
+- **Missing**: No multi-flow MTU tracking or detection of ICMP fragmentation-needed responses, which would require ICMP processing.
+- **Cheap fix**: Log ICMP unreachable (fragmentation needed) events separately; correlate with TCP flows to infer PMTUD behavior.
+- **Priority**: V2 backlog (VPN detection is interesting but niche).
+
+#### Packet pacing (microsecond-resolution egress timing)
+- **Status**: `not_captured`
+- **Why**: The sniffer computes mean/min/max inter-arrival time in milliseconds (lines 904–906), not microseconds. Detecting modern pacing requires sub-millisecond precision.
+- **Missing**: The sniffer timestamps packets in userspace with `time.monotonic()`. That clock has fine resolution on Linux, but scheduling jitter and millisecond reporting hide sub-millisecond gaps; kernel or hardware capture timestamps are needed.
+- **Cheap fix**: Use the capture timestamp attached to each packet (scapy's `pkt.time`, or NIC hardware timestamps where supported) instead of `time.monotonic()`; log microsecond-resolution inter-packet gaps.
+- **Priority**: V2 backlog (requires an infrastructure upgrade; marginal value on honeypots).
+
+#### Window scaling multipliers
+- **Status**: `captured`
+- **Where**: The sniffer extracts `wscale` from TCP options (line 80); it is stored in the `tcp_fingerprint` JSON and emitted in the `tcp_syn_fingerprint` event.
+- **Storage**: AttackerBehavior.`tcp_fingerprint` (JSON: `{window, wscale, mss, ...}`); queryable.
+- **Priority**: ✓ captured (sufficient for OS fingerprinting and congestion-algorithm inference).
+
+#### ECN negotiation
+- **Status**: `not_captured`
+- **Why**: ECN is negotiated by setting the ECE and CWR flags in the SYN header (RFC 3168); it is not a TCP option, so the options-order extraction never sees it. Scapy exposes both bits in the TCP `flags` field (`E` and `C`), but no code reads them.
+- **Missing**: No code to parse ECN negotiation from the TCP header.
+- **Cheap fix**: Extend TCP fingerprint extraction to check the ECE/CWR flag bits on the SYN.
+- **Priority**: V2 backlog (rarely used; low value).
+
+---
+
+### Per-Attacker, L7 (TLS/HTTP)
+
+#### TLS fingerprint (JA3/JA4)
+- **Status**: `captured`
+- **Where**: The sniffer fingerprint engine computes JA3/JA3S/JA4/JA4S (lines 565–662); results are emitted to syslog and stored in `Attacker.fingerprints` (bounty store).
+- **Storage**: Logs are queryable; fingerprints are stored as JSON in the (generic) bounty table.
+- **Roadmap**: `[x]` JA3/JA3S, `[x]` JA4+. Verified in code.
+- **Priority**: ✓ captured (good).
+
+#### TLS session resumption behavior
+- **Status**: `captured`
+- **Where**: The sniffer extracts resumption mechanisms (session_ticket, PSK, early_data, session_id) in `_session_resumption_info()` (lines 675–689), emitted in the `tls_client_hello` event.
+- **Storage**: Logged to syslog; `Attacker.fingerprints` stores resumption=`[mechanism list]`.
+- **Priority**: ✓ captured (good).
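Several Network/Transport gaps above (raw option order, TSval for clock skew, the SYN's ISN, and the ECE/CWR bits) come down to reading a few more fields from the same SYN the sniffer already fingerprints. The sketch parses a raw TCP header with only the standard library so it runs anywhere; in the real sniffer the same data is available from scapy's `TCP` layer (`seq`, `flags`, `options`). `SynFingerprint` and `parse_syn` are hypothetical names, and keeping NOPs in the raw list is a deliberate choice (the `options_sig` example above lists none).

```python
import struct
from dataclasses import dataclass, field
from typing import Optional

# TCP option kinds (IANA registry), named the way scapy names them.
OPTION_NAMES = {0: "EOL", 1: "NOP", 2: "MSS", 3: "WScale", 4: "SAckOK", 5: "SAck", 8: "Timestamp"}
FLAG_ECE, FLAG_CWR = 0x40, 0x80


@dataclass
class SynFingerprint:
    isn: int
    window: int
    ece: bool
    cwr: bool
    options: list = field(default_factory=list)  # raw order, NOPs included
    mss: Optional[int] = None
    wscale: Optional[int] = None
    tsval: Optional[int] = None
    tsecr: Optional[int] = None

    @property
    def ecn_setup(self) -> bool:
        """RFC 3168: an ECN-setup SYN sets both ECE and CWR."""
        return self.ece and self.cwr


def parse_syn(tcp: bytes) -> SynFingerprint:
    """Parse a TCP header (starting at the source port) into a SYN fingerprint."""
    if len(tcp) < 20:
        raise ValueError("TCP header shorter than 20 bytes")
    _sport, _dport, seq, _ack, off_byte, flags, window = struct.unpack_from(">HHIIBBH", tcp, 0)
    hdr_len = (off_byte >> 4) * 4
    if not 20 <= hdr_len <= len(tcp):
        raise ValueError("bad data offset")
    fp = SynFingerprint(isn=seq, window=window, ece=bool(flags & FLAG_ECE), cwr=bool(flags & FLAG_CWR))
    i = 20
    while i < hdr_len:
        kind = tcp[i]
        if kind == 0:  # end of option list
            fp.options.append("EOL")
            break
        if kind == 1:  # one-byte NOP padding
            fp.options.append("NOP")
            i += 1
            continue
        if i + 1 >= hdr_len or tcp[i + 1] < 2 or i + tcp[i + 1] > hdr_len:
            break  # truncated or malformed option: stop instead of misparsing
        length = tcp[i + 1]
        body = tcp[i + 2:i + length]
        fp.options.append(OPTION_NAMES.get(kind, f"opt{kind}"))
        if kind == 2 and length == 4:
            (fp.mss,) = struct.unpack(">H", body)
        elif kind == 3 and length == 3:
            fp.wscale = body[0]
        elif kind == 8 and length == 10:
            fp.tsval, fp.tsecr = struct.unpack(">II", body)
        i += length
    return fp


# Usage: a Linux-style ECN-setup SYN (MSS, SAckOK, Timestamp, NOP, WScale).
_opts = (
    struct.pack(">BBH", 2, 4, 1460)
    + bytes([4, 2])
    + struct.pack(">BBII", 8, 10, 123456, 0)
    + bytes([1])
    + bytes([3, 3, 7])
)
SAMPLE_SYN = struct.pack(">HHIIBBHHH", 40000, 22, 0xDEADBEEF, 0, ((20 + len(_opts)) // 4) << 4,
                         0x02 | FLAG_ECE | FLAG_CWR, 64240, 0, 0) + _opts
syn = parse_syn(SAMPLE_SYN)
```

Logging `isn` and `tsval` per SYN, together with the capture time, is what the ISN and clock-skew analyses need; `options` could sit next to `options_sig` in the `tcp_fingerprint` JSON.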
+
+#### HTTP/2 SETTINGS frame ordering + values
+- **Status**: `not_captured`
+- **Why**: HTTP/2 over TLS is encrypted once the handshake completes, so the sniffer cannot see plaintext SETTINGS frames.
+- **Missing**: Would require decryption (not viable passively) or attacker-side TLS instrumentation.
+- **Cheap fix**: Instrument HTTP/2 services (h2c, i.e. HTTP/2 over plain TCP, on the rare deployments that use it) or use a TLS key log for offline analysis.
+- **Priority**: defer (not capturable from the passive vantage point).
+
+#### HTTP/2 stream prioritization
+- **Status**: `not_captured`
+- **Why**: Encrypted in TLS.
+- **Missing**: Same as above.
+- **Priority**: defer (not capturable).
+
+#### HTTP header ordering
+- **Status**: `not_captured`
+- **Why**: Inside encrypted TLS, so the sniffer cannot see plaintext HTTP headers.
+- **Missing**: Would require server-side HTTP request logging (not implemented).
+- **Cheap fix**: Instrument the HTTP service to log raw header order in syslog.
+- **Priority**: V2 (useful for bot/tool detection, but requires service-level capture).
+
+#### Cookie handling behavior (expiry, domain scope)
+- **Status**: `not_captured`
+- **Why**: Encrypted TLS, and requires HTTP state-machine tracking (Set-Cookie responses vs. Cookie requests).
+- **Missing**: Would need server-side HTTP middleware or browser instrumentation.
+- **Cheap fix**: Add cookie-jar logging to the HTTP service (track which attacker cookies were accepted, rejected, resent).
+- **Priority**: V2 (behavioral signal; interesting but niche).
+
+---
+
+### Per-Attacker, Aggregated/Derived (would live in new `AttackerAggregate` table)
+
+#### Time-of-day activity distribution (chronotyping)
+- **Status**: `partial`
+- **Where**: Log entries have a `timestamp` (datetime); all events are timestamped, so an hour-of-day histogram can be computed post hoc.
+- **Missing**: No aggregation table or computed features. Would live in the new AttackerAggregate.
+- **Cheap fix**: Batch job: group events by attacker + hour-of-day, compute a distribution histogram. Store as JSON or in a new table.
+- **Priority**: V2 (simple aggregation; good for clustering).
+
+#### Session duration distribution
+- **Status**: `partial`
+- **Where**: The SessionProfile schema (DEVELOPMENT_V2.md) includes `session_duration_s`. Durations can be computed from the asciinema recordings, which are sharded per decky per day.
+- **Missing**: No SessionProfile table yet; no aggregation of durations across sessions.
+- **Cheap fix**: Implement the SessionProfile table and compute a per-attacker duration histogram in AttackerAggregate.
+- **Priority**: V2 (depends on SessionProfile; good for behavioral clustering).
+
+#### Recon-to-action ratio
+- **Status**: `partial`
+- **Where**: The profiler already computes recon vs. exfil phase sequencing (behavioral.py lines 52–62, 188–191), stored in `AttackerBehavior.phase_sequence` (JSON: `{recon_end, exfil_start, latency}`).
+- **Missing**: No per-attacker ratio column in AttackerAggregate. It would be a simple division: `exfil_events / recon_events`.
+- **Cheap fix**: Compute the ratio in the profiler job; store it in the new AttackerAggregate or as an extension to AttackerBehavior.
+- **Priority**: V2 (low effort; useful for threat-level scoring).
+
+#### Lateral movement style
+- **Status**: `not_captured`
+- **Why**: Requires graph traversal (attacker hopping between deckies). The correlation engine (correlation/engine.py) should track this, but there is no explicit "lateral movement style" feature (sequential vs. parallel, target-selection heuristic).
+- **Missing**: No code analyzing the lateral movement pattern (which deckies were touched, in what order, dwell time per decky).
+- **Cheap fix**: Extend CorrelationEngine to build a per-attacker decky traversal graph; compute metrics (average dwell time, fan-out ratio, revisit frequency).
+- **Priority**: V2 (interesting; requires traversal-graph extraction from the correlation engine).
+
+#### Persistence-first vs. exfil-first
+- **Status**: `not_captured`
+- **Why**: Requires semantic tagging of events (is this persistence activity? exfil activity?). The profiler has `EXFIL_EVENT_TYPES` (lines 59–62) but no persistence catalog.
+- **Missing**: No code to classify persistence attempts (cron jobs, reverse shells, privilege escalation).
+- **Cheap fix**: Add a PERSISTENCE_EVENT_TYPES list; compute persistence_start vs. exfil_start timestamps; store in AttackerBehavior or AttackerAggregate.
+- **Priority**: V2 (requires an event taxonomy; valuable for threat classification).
+
+#### Tool-chain ordering
+- **Status**: `partial`
+- **Where**: The profiler records tool guesses in AttackerBehavior.`tool_guesses` (line 183, behavioral.py lines 76–105). Tools are matched by beacon timing + header patterns.
+- **Missing**: No **ordering** — tools are listed but not sequenced by first-appearance time.
+- **Cheap fix**: Sort tool_guesses by first event timestamp; store as an ordered list. Compute a tool transition graph (tool A → tool B over time).
+- **Priority**: V2 (interesting; small extension to existing tool attribution).
+
+#### Error-response psychology
+- **Status**: `not_captured`
+- **Why**: Requires analyzing how the attacker reacts to failures (e.g., retry frequency after auth failure, command-error recovery). Would need per-command success/failure tracking.
+- **Missing**: No error categorization in logs; would need service-level event typing (auth_failure vs. auth_success, exec_error vs. exec_success).
+- **Cheap fix**: Extend service events to include success/failure indicators; compute attacker error-response metrics (retry rate, time-to-recovery, behavior change after error).
+- **Priority**: V2 backlog (niche; good for human vs. bot discrimination).
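Three of the Aggregated/Derived fixes above are small batch computations over data that already exists: the hour-of-day histogram behind chronotyping, the recon-to-action ratio as this audit defines it, and a first-appearance ordering plus transition counts for tool-chain ordering. The sketch assumes the batch job can fetch per-attacker event timestamps and `(timestamp, tool)` pairs; that input shape is hypothetical, since `tool_guesses` is an unordered list today. Treating naive datetimes as UTC is also an assumption.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


def hour_histogram(timestamps: Iterable[datetime]) -> list[int]:
    """24 buckets of event counts by UTC hour (naive datetimes are assumed to be UTC)."""
    bins = [0] * 24
    for ts in timestamps:
        if ts.tzinfo is not None:
            ts = ts.astimezone(timezone.utc)
        bins[ts.hour] += 1
    return bins


def recon_to_action_ratio(recon_events: int, exfil_events: int) -> Optional[float]:
    """The audit's definition, exfil_events / recon_events; None when there was no recon."""
    return exfil_events / recon_events if recon_events else None


def tool_chain(tool_events: Iterable[tuple]) -> tuple[list, Counter]:
    """Order tools by first appearance and count tool-to-tool transitions over time.

    tool_events: (timestamp, tool_name) pairs. Consecutive events from the same tool
    are collapsed, so only a change of tool counts as a transition.
    """
    ordered, transitions = [], Counter()
    prev = None
    for _, tool in sorted(tool_events, key=lambda ev: ev[0]):
        if tool not in ordered:
            ordered.append(tool)
        if prev is not None and tool != prev:
            transitions[(prev, tool)] += 1
        prev = tool
    return ordered, transitions


# Usage: the second timestamp is 23:59 at UTC-4, i.e. 03:59 UTC the next day.
ts = [
    datetime(2026, 4, 22, 3, 15, tzinfo=timezone.utc),
    datetime(2026, 4, 21, 23, 59, tzinfo=timezone(timedelta(hours=-4))),
    datetime(2026, 4, 22, 14, 0),
]
hist = hour_histogram(ts)
order, edges = tool_chain([(30, "hydra"), (10, "nmap"), (20, "nmap"), (40, "nmap")])
```

`hist` maps directly onto `activity_dist_by_hour`, and `order` onto `tool_sequence`; `edges` is the transition graph mentioned under tool-chain ordering.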
+
+---
+
+## Table Recommendations
+
+### `AttackerBehavior` — Current & Recommended Additions
+
+**Currently captured** (verified in models.py lines 161–194):
+- `tcp_fingerprint` (JSON) — window, wscale, mss, options_sig
+- `timing_stats` (JSON) — mean/median/stdev/min/max IAT
+- `phase_sequence` (JSON) — recon_end, exfil_start latency
+- `tool_guesses` (JSON list)
+- `beacon_interval_s`, `beacon_jitter_pct`
+- `behavior_class` (beaconing | interactive | scanning | …)
+
+**Recommended additions for v1 (pre-v2, no schema bump)**:
+- `kex_order_raw` (MEDIUMTEXT, JSON list) — raw KEX algorithm lists (the input to HASSH, not just the hash)
+- `tls_fingerprints_full` (MEDIUMTEXT, JSON) — full JA3/JA4 raw strings, not just hashes
+- `ssh_client_banners` (MEDIUMTEXT, JSON list) — client identification strings captured from the TCP stream
+
+**Reserved for v2**:
+- See SessionProfile below.
+
+### `SessionProfile` — New Table (v2 roadmap in DEVELOPMENT_V2.md)
+
+The design is already specified (DEVELOPMENT_V2.md, lines 71–104). Implement it in v1 as an empty table with a stubbed write path, ready for feature extraction post-v1.
+
+**Columns** (from DEVELOPMENT_V2.md):
+- `sid` (TEXT PK)
+- `log_id` (FK to logs)
+- `schema_version` (INT, required for federation gossip)
+- Timing features: `kd_iki_mean`, `kd_iki_stdev`, `kd_iki_p50`, `kd_iki_p95`, `kd_enter_latency_p50`, `kd_enter_latency_p95`
+- Ratio features: `kd_burst_ratio`, `kd_think_ratio`
+- Control-char rates: `kd_ctrl_backspace`, `kd_ctrl_wkill`, `kd_ctrl_ukill`, `kd_ctrl_abort`, `kd_ctrl_eof`, `kd_arrow_rate`, `kd_tab_rate`
+- `kd_digraph_simhash` (BLOB, 8 bytes)
+- Derived: `total_keystrokes`, `session_duration_s`, `created_at`
+
+**Note**: All keystroke-timing values are derivable from the existing asciinema day-shard files on disk. Implement the ingestion job in v2 (not a v1 blocker).
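The `kd_digraph_simhash` column above packs a similarity-preserving summary of a session into 8 bytes, and the KEX section proposes a similar hash for KEX order. As one plausible reading (the DEVELOPMENT_V2.md definition is not reproduced in this audit), the sketch below is a standard 64-bit SimHash over typed-character digraphs weighted by frequency: sessions with similar typing produce hashes at a small Hamming distance, so they can be compared without storing raw keystrokes. BLAKE2b as the per-token hash and plain character pairs as tokens are assumptions; latency-bucketed digraph tokens would plug into the same `simhash64` function.

```python
import hashlib
from collections import Counter


def digraphs(typed: str) -> Counter:
    """Frequency of each adjacent character pair in the typed input."""
    return Counter(typed[i:i + 2] for i in range(len(typed) - 1))


def simhash64(weights: Counter) -> bytes:
    """64-bit SimHash: per bit, add the token weight if the token's hash bit is 1, else subtract."""
    acc = [0] * 64
    for token, weight in weights.items():
        h = int.from_bytes(hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest(), "big")
        for bit in range(64):
            acc[bit] += weight if (h >> bit) & 1 else -weight
    value = sum(1 << bit for bit in range(64) if acc[bit] > 0)
    return value.to_bytes(8, "big")  # 8 bytes, matching the BLOB(8) column


def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two hashes."""
    return bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")


session_a = simhash64(digraphs("cat /etc/passwd; uname -a"))
session_b = simhash64(digraphs("cat /etc/shadow; uname -a"))
print(hamming(session_a, session_b))
```

Because the hash is a fixed-width bit vector, nearest-neighbour lookups across sessions reduce to XOR and popcount, which is cheap enough to run inside the profiler job.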
+
+### `AttackerAggregate` — New Table (v2+)
+
+Columns (suggested):
+- `attacker_uuid` (PK, FK to attackers)
+- `activity_dist_by_hour` (JSON) — histogram of event counts by UTC hour
+- `session_duration_dist` (JSON) — percentiles of session durations
+- `recon_to_action_ratio` (REAL)
+- `lateral_movement_graph` (JSON) — decky traversal (src → dst edges with dwell times)
+- `tool_sequence` (JSON list) — tools in chronological order
+- `is_persistent` (BOOL) — persistence activity detected?
+- `updated_at` (TIMESTAMP)
+
+---
+
+## Full Per-Signal Capture Table
+
+| Signal | Status | Where Captured | What's Missing | Cheap Fix | Priority |
+|--------|--------|-----------------|-----------------|-----------|----------|
+| **Session Environment** | | | | | |
+| TERM | partial | SSH pty-req, server-readable | No syslog emission, no DB | Patch SSH syslog bridge to emit term= | V2 |
+| LANG/LC_ALL | n/a | Server locale, not attacker-controlled | Client locale only sent via optional SSH env requests | Defer (not capturable) | defer |
+| SSH client version | partial | TCP stream (plaintext identification string, sent before key exchange) | Sniffer doesn't parse SSH banners; only TLS fingerprints | Extend sniffer to extract SSH banner from TCP stream | v1 blocker |
+| Terminal size (COLS×ROWS) | not_captured | SSH pty-req channel request | Requires protocol interception or sshd patch | Patch sshd to log pty-req | V2 |
+| **Keyboard/Human** | | | | | |
+| Per-keystroke timing | partial | Asciinema "i" events with t timestamps | Files on disk, not ingested to DB | Implement SessionProfile table + ingest job | v1 blocker |
+| Control-character stream | partial | Asciinema keystroke bytes | Same as above (files only) | Same as above | v1 blocker |
+| Inter-command think time | not_captured | Requires prompt detection | Heuristic (line ending in $/#) not implemented | Post-hoc: regex + gap detection over asciinema | V2 |
+| Pause before sensitive cmd | not_captured | Would be in asciinema timing | Requires command-line parsing + gap detection | Offline analysis of asciinema | V2 |
+| Command n-grams | partial | Attacker.commands (generic list) | Per-session structure missing | Parse asciinema I/O; store in SessionProfile | V2 |
+| Flag preferences | not_captured | Asciinema input has typed flags | No parsing or normalization | Regex-parse and canonicalize flags from asciinema | V2 |
+| Typo patterns | not_captured | Raw keystroke sequence in asciinema "i" | Requires keystroke-by-keystroke reconstruction | Parse "i" events with backspace markers; reconstruct line state | V2 |
+| Editor choice | partial | Attacker.commands shows editor launch | No aggregation or time-in-editor | Count editor invocations; store preference in SessionProfile | V2 |
+| Shell history usage | partial | Command input shows !, ^, !! | No parsing for history operators | Regex-scan for shell history syntax; count | V2 |
+| **SSH Transport** | | | | | |
+| HASSH/HASSHServer | captured | Prober (hassh.py); Attacker.fingerprints | ✓ (hash + raw algorithm strings in syslog) | Already done | — |
+| KEX algorithm order | partial | Syslog event kex_algorithms= field | Not persisted to DB (only in syslog) | Add AttackerBehavior.kex_order_raw (MEDIUMTEXT, JSON) | v1 blocker |
+| Public key comment | not_captured | Not sent on the wire (only the key blob is, in auth_pubkey) | Requires server-side auth hook for the blob | Patch sshd to emit the key fingerprint in an auth_pubkey event | V2 |
+| Private key type | partial | SSH wire format (public key algorithm name in the userauth request) | Encrypted after KEX; needs sshd hook | Patch sshd to emit auth_key_type event | V2 |
+| Agent forwarding? | not_captured | auth-agent-req@openssh.com channel request (encrypted) | Requires sshd instrumentation | Patch sshd to log auth-agent-req@openssh.com | V2 |
+| Channel multiplexing | partial | SSH service logs commands separately | No channel state machine | Instrument sshd SSH_MSG_CHANNEL_OPEN events | V2 |
+| SSH_CLIENT env vars | captured | Server sets automatically; queryable via shell | No automatic logging | Patch sshd PAM to emit SSH_CLIENT on auth | V2 |
+| **Network/Transport** | | | | | |
+| TCP timestamp skew | partial | PCAP + sniffer has has_timestamps flag | Only boolean; not timestamp values | Extract TSval values in sniffer | V2 |
+| TCP ISN generator | not_captured | PCAP SYN seq field | No per-connection ISN logging | Log syn_seq in tcp_syn_fingerprint event | V2 |
+| TCP options ordering | partial | Sniffer extracts options_sig signature | Aggregated string; no raw order per-packet | Extend tcp_fingerprint JSON with raw options list | v1 improvement |
+| Initial congestion window | not_captured | Would require per-packet ACK analysis | Not tracked in sniffer | Extend tcp_flow_timing to include payload sizes list | V2 |
+| Retransmit timing+backoff | partial | Sniffer counts retransmits; no timing | RTO/backoff timing not logged | Extend event to include RTO deltas | V2 |
+| MTU/path-MTU discovery | partial | MSS in TCP SYN; byte counts per flow | No ICMP fragmentation-needed events | Add ICMP processing; correlate with TCP flows | V2 |
+| Packet pacing (μs) | not_captured | Sniffer reports millisecond IATs from userspace timestamps | Needs kernel or hardware capture timestamps | Upgrade to sub-millisecond timing | V2+ |
+| Window scaling | captured | TCP fingerprint; wscale in AttackerBehavior | ✓ queryable | — | — |
+| ECN negotiation | not_captured | TCP SYN flags (ECE/CWR) | Not extracted from TCP header | Extend TCP fingerprint to parse ECN bits | V2 |
+| **L7 (TLS/HTTP)** | | | | | |
+| TLS fingerprint (JA3/JA4) | captured | Sniffer fingerprint.py; Attacker.fingerprints | ✓ hashes stored + syslog | Already done | — |
+| TLS session resumption | captured | Sniffer `_session_resumption_info()`; tls_client_hello event; Attacker.fingerprints | ✓ | Already done | — |
+| HTTP/2 SETTINGS order | not_captured | Encrypted inside TLS | Passive inspection not viable | Defer (not capturable) | defer |
+| HTTP/2 prioritization | not_captured | Encrypted | Not capturable | defer | defer |
+| HTTP header ordering | not_captured | Encrypted; requires service logging | Service doesn't log raw headers | Patch HTTP service to log header order | V2 |
+| Cookie handling | not_captured | Requires HTTP state machine | Not tracked | Add cookie jar logging to HTTP service | V2 |
+| **Aggregated/Derived** | | | | | |
+| Time-of-day distribution | partial | Timestamps on all events | No aggregation table | Batch job: hour-of-day histogram → AttackerAggregate | V2 |
+| Session duration dist | partial | SessionProfile would have duration | No SessionProfile table yet | Implement SessionProfile + duration stats | V2 |
+| Recon-to-action ratio | partial | AttackerBehavior.phase_sequence | No per-attacker ratio column | Compute ratio in profiler; store in AttackerAggregate | V2 |
+| Lateral movement style | not_captured | Correlation engine has traversal path | No traversal pattern analysis | Extend engine to compute dwell time + fan-out metrics | V2 |
+| Persistence-first vs. exfil | not_captured | No persistence event taxonomy | Needs event-type classification | Add PERSISTENCE_EVENT_TYPES; compute timings | V2 |
+| Tool-chain ordering | partial | tool_guesses list exists; unordered | No temporal ordering | Sort by first-event timestamp; build transition graph | V2 |
+| Error-response psych | not_captured | No success/failure event tagging | Requires per-command outcome tracking | Extend service events with status=success/failure | V2 |
+
+---
+
+## Pre-v1 Capture Gaps (Actionable, Blocking)
+
+**Only tackle these if the signal is committed to the v1 roadmap:**
+
+1. **KEX algorithm ordering** (ssh-transport)
+   - **Action**: Add `AttackerBehavior.kex_order_raw` (MEDIUMTEXT, JSON list of algorithm strings).
+   - **Effort**: 2 hrs (schema + sniffer event parser + profiler aggregator).
+   - **Blocker?**: Only if the roadmap demands full KEX analysis (currently only the HASSH hash is promised).
+
+2. **Per-keystroke timing ingestion** (keyboard/human)
+   - **Action**: Create `SessionProfile` table (design in DEVELOPMENT_V2.md); stub the write path with all NULLs.
+   - **Effort**: 4 hrs (schema + migration + DAL).
+   - **Blocker?**: Yes, if keystroke dynamics is on the v1 roadmap. The data exists on disk but is not queryable.
+
+3. **SSH client banner capture** (ssh-transport)
+   - **Action**: Extend sniffer to parse the plaintext SSH identification string from the TCP stream (it precedes key exchange); emit an `ssh_client_hello` event.
+   - **Effort**: 3 hrs (TCP stream parser + sniffer integration).
+   - **Blocker?**: Yes, if full SSH client profiling is on the v1 roadmap (today only the HASSH fingerprints are captured).
+
+4. **TCP options raw extraction** (network/transport)
+   - **Action**: Extend `tcp_fingerprint` JSON to include the raw options list (not just the signature string).
+   - **Effort**: 1 hr (minimal schema change + sniffer parser).
+   - **Blocker?**: No (`options_sig` is good enough for current p0f-style fingerprinting; nice-to-have).
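Gap 1 exists because HASSH is a one-way digest. A minimal sketch, assuming the published HASSH construction (MD5 over the `kex;encryption;mac;compression` name-lists from the client's KEXINIT, each list comma-joined in advertised order): swapping two KEX algorithms changes the hash, but nothing in the hash says which ones moved, so ordering analysis needs the raw list. `hassh()` and `kex_order_json()` are illustrative helpers, not functions from `decnet/prober/hassh.py`, and `kex_order_json()` assumes the syslog `kex_algorithms=` value is a plain comma-separated string.

```python
import hashlib
import json


def hassh(kex, enc, mac, comp):
    """HASSH-style digest (illustrative): MD5 of the four KEXINIT name-lists.

    Each argument is a list of algorithm names in the order the peer sent them.
    """
    joined = ";".join(",".join(names) for names in (kex, enc, mac, comp))
    return hashlib.md5(joined.encode("ascii")).hexdigest()


def kex_order_json(kex_algorithms: str) -> str:
    """Serialise a raw comma-separated kex_algorithms value for kex_order_raw.

    Assumes the syslog field is a plain comma-separated list (hypothetical).
    Empty entries from stray commas are dropped.
    """
    return json.dumps([name for name in kex_algorithms.split(",") if name])


client_a = ["curve25519-sha256", "diffie-hellman-group14-sha256"]
client_b = list(reversed(client_a))  # same algorithms, different order
rest = (["aes128-ctr"], ["hmac-sha2-256"], ["none"])

print(hassh(client_a, *rest) == hassh(client_b, *rest))  # False: order is hashed
print(kex_order_json(",".join(client_b)))
```

Persisting the `kex_order_json()` output keeps the exact advertised order queryable, which the 32-character hash alone cannot provide.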
+
+---
+
+## Non-Capturable Signals (Explicit Deferral)
+
+These require vantage-point changes or are architecturally infeasible:
+
+| Signal | Why | Vantage Point Needed |
+|--------|-----|----------------------|
+| LANG / LC_ALL | Server locale is fixed; attacker's client locale invisible over SSH | Client-side instrumentation |
+| HTTP/2 SETTINGS frame order | Encrypted inside TLS stream | Server-side decryption or key log |
+| HTTP/2 stream prioritization | Encrypted | Server-side decryption |
+| Initial congestion window (CWND) | Requires detailed TCP ACK-clock tracking | Per-packet sniffer instrumentation |
+| Packet pacing (μs resolution) | Requires hardware-timestamped PCAP or kernel hooks | OS-level instrumentation |
+| Hold time / pressure / velocity (typing biometrics) | Not on the SSH wire (only the resulting bytes are sent) | Client-side instrumentation |
+
+---
+
+## Summary for v1 Release
+
+**Ship with these (already captured, queryable)**:
+- HASSH/HASSHServer ✓
+- JA3/JA3S/JA4/JA4S ✓
+- TLS session resumption ✓
+- TCP fingerprint (window, wscale, mss, options_sig) ✓
+- Behavioral timing stats (mean/median/stdev IAT) ✓
+- Phase sequencing (recon_end, exfil_start) ✓
+- Tool attribution (beacon timing + headers) ✓
+
+**Data exists on disk, not queryable (v1 deferral acceptable)**:
+- Per-keystroke timing (asciinema day-shards) — needs SessionProfile ingestion job
+- SSH client banner (raw PCAP of the TCP stream) — needs sniffer enhancement
+- KEX algorithm order (syslog) — needs AttackerBehavior.kex_order_raw column
+
+**Requires infrastructure changes (v2+)**:
+- Lateral movement graph analysis
+- HTTP header order + cookie jar behavior
+- Persistence-first vs. exfil-first classification
+- Error-response psychology
+- Chronotyping + session duration distribution
+
+---
+
+## Federation & Cross-Operator Gossip (v2 Implications)
+
+The `SessionProfile` schema (table, `schema_version` field, numeric features) is designed to be the federation wire format. **No changes needed for v1**, but ensure `schema_version` is in the table definition from day one so gossip compatibility is straightforward in v2.
+
+---
+
+## Appendices
+
+### A. Code Paths Audited
+
+- `decnet/sniffer/fingerprint.py` — TLS + TCP fingerprinting engine
+- `decnet/services/ssh.py` — SSH service config + artifact paths
+- `decnet/prober/hassh.py` — HASSHServer computation
+- `decnet/web/db/models.py` — SQL schema (Attacker, AttackerBehavior, etc.)
+- `decnet/profiler/behavioral.py` — Timing + tool attribution
+- `decnet/correlation/parser.py` — RFC 5424 syslog ingestion
+- `decnet/templates/ssh/` — Session recording (asciinema), syslog bridge, capture.sh
+
+### B. Storage Destinations Verified
+
+- **Database**: SQLite/MySQL tables (Attacker, AttackerBehavior, Bounty, Log)
+- **Syslog**: RFC 5424 events (parsed by correlation engine, optionally piped to ELK)
+- **Disk**: Asciinema day-shards (`/var/lib/decnet/session_recordings/`), raw PCAP (retention TBD)
+- **Memory**: Sniffer state (sessions, flows, dedup cache) — lost on restart unless replayed from PCAP
+
+### C. Roadmap Cross-Reference
+
+- DEVELOPMENT.md lines 48–133: Attacker Intelligence Collection (TLS, behavioral, protocol fingerprinting, network topology, geolocation, service-level, aggregated).
+  - `[x]` JA3/JA3S, JA4+, JARM, session resumption, TCP window/scaling, retransmits, beaconing, data exfil timing, HASSH/HASSHServer, HTTP/2 fingerprint, TLS session resumption, TTL values (partial), TCP stack fingerprinting.
+  - `[ ]` (not v1): ISN patterns, HTTP header ordering, QUIC, DNS, IPv6/mDNS leakage, geolocation, service-level commands, credential reuse, payload signatures.
+
+- DEVELOPMENT_V2.md: Keystroke dynamics, session profiling, federation.
+  - SessionProfile schema (lines 71–104) — not yet implemented; ready-to-implement design.
+  - Correlation via simhash (lines 50–56) — digraph rhythm fingerprinting.
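The keystroke-dynamics work in DEVELOPMENT_V2.md depends on getting asciinema input timing out of the day-shards (gap 2 in the pre-v1 list). A minimal sketch, assuming each input is a single asciinema v2 recording: a JSON header line followed by `[time, code, data]` event lines, where code `"i"` marks stdin. How DECNET's day-shard files are actually laid out is not covered by this audit, and `keystroke_profile()` is a hypothetical helper. Each `"i"` event is treated as one keystroke; events carrying more than one character (typically a paste) are counted separately so they can be excluded from rhythm features.

```python
import json
from statistics import median


def keystroke_profile(cast_lines):
    """Summarise input timing from one asciinema v2 recording (hypothetical).

    cast_lines: iterable of text lines; the first is the JSON header and
    each later line is a JSON array [t_seconds, code, data].
    """
    lines = iter(cast_lines)
    header = json.loads(next(lines))
    if header.get("version") != 2:
        raise ValueError("not an asciinema v2 recording")

    input_times = []  # timestamps of "i" (stdin) events
    multi_char = 0    # events carrying more than one character, e.g. a paste
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        t, code, data = json.loads(raw)
        if code != "i":
            continue  # skip "o" (output) and any other event codes
        input_times.append(float(t))
        if len(data) > 1:
            multi_char += 1

    # Inter-keystroke gaps between consecutive input events, in seconds.
    gaps = [b - a for a, b in zip(input_times, input_times[1:])]
    return {
        "input_events": len(input_times),
        "multi_char_events": multi_char,
        "gaps": gaps,
        "median_gap": median(gaps) if gaps else None,
    }


cast = [
    '{"version": 2, "width": 80, "height": 24}',
    '[0.10, "o", "$ "]',
    '[0.50, "i", "l"]',
    '[0.70, "i", "s"]',
    '[1.20, "i", "\\r"]',
    '[1.25, "o", "file.txt\\r\\n"]',
]
profile = keystroke_profile(cast)
print(profile["input_events"], [round(g, 3) for g in profile["gaps"]])  # 3 [0.2, 0.5]
```

The returned gaps are the raw material an ingest job could write into `SessionProfile`; digraph or simhash features would be computed on top of them.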
+
+---
+
diff --git a/api-audit.md b/api-audit.md
new file mode 100644
index 00000000..b95b32b4
--- /dev/null
+++ b/api-audit.md
@@ -0,0 +1,1000 @@
+# FastAPI /api/v1 Route Audit Report
+
+## Executive Summary
+
+**Total Routes Analyzed**: 77
+**Deletion Candidates**: 54
+  - **Zero Callers (dead code)**: 7
+  - **Test-Only (replaced routes?)**: 47
+
+The audit scanned:
+- 77 registered `/api/v1/*` routes across the FastAPI web application
+- All sources: frontend TypeScript/React, CLI, worker processes, and test suites
+- Frontend path fragment matching (e.g., searching for `/topologies/` in dynamic URLs)
+
+**Top Deletion Candidates for Review**:
+- Attacker detail endpoints (`/attackers/{uuid}*`) — 4 test-only routes, no web/CLI callers
+- Decky mutation endpoints (`/deckies/{decky_name}/mutate*`) — 2 zero-caller routes (likely replaced by mutation queue)
+- Various CRUD endpoints with test-only usage — likely superseded by newer flows
+
+---
+
+## Full Route Inventory
+
+Paths are relative to each router's mount prefix, so rows that look identical (for example the two `GET /health` rows or the three `GET /hosts` rows) appear to come from different routers. Phase 2 shows the full form for some of them, e.g. `/api/v1/swarm/enroll-bundle`.
+
+| Method | Path | Handler | File | Caller Types | Notes |
+|--------|------|---------|------|--------------|-------|
+| GET | `/` | `api_list_topologies()` | api_list_topologies.py | cli, test | |
+| POST | `/` | `api_create_topology()` | api_create_topology.py | cli, test | |
+| GET | `/archetypes` | `api_list_archetypes()` | api_catalog.py | **NONE** | ⚠️ |
+| GET | `/artifacts/{decky}/{stored_as}` | `get_artifact()` | api_get_artifact.py | test | ⚠️ |
+| GET | `/attackers` | `get_attackers()` | api_get_attackers.py | test, web | |
+| GET | `/attackers/{uuid}` | `get_attacker_detail()` | api_get_attacker_detail.py | test | ⚠️ |
+| GET | `/attackers/{uuid}/artifacts` | `get_attacker_artifacts()` | api_get_attacker_artifacts.py | test | ⚠️ |
+| GET | `/attackers/{uuid}/commands` | `get_attacker_commands()` | api_get_attacker_commands.py | test | ⚠️ |
+| GET | `/attackers/{uuid}/transcripts` | `get_attacker_transcripts()` | api_get_attacker_transcripts.py | test | ⚠️ |
+| POST | `/auth/change-password` | `change_password()` | api_change_pass.py | test | ⚠️ |
+| POST | `/auth/login` | `login()` | api_login.py | test | ⚠️ |
+| POST | `/blank` | `api_create_blank_topology()` | api_create_blank_topology.py | test | ⚠️ |
+| GET | `/bounty` | `get_bounties()` | api_get_bounties.py | test | ⚠️ |
+| POST | `/check` | `api_check_hosts()` | api_check_hosts.py | cli, test | |
+| GET | `/config` | `api_get_config()` | api_get_config.py | cli, test, web | |
+| PUT | `/config/deployment-limit` | `api_update_deployment_limit()` | api_update_config.py | test, web | |
+| PUT | `/config/global-mutation-interval` | `api_update_global_mutation_interval()` | api_update_config.py | test, web | |
+| DELETE | `/config/reinit` | `api_reinit()` | api_reinit.py | test, web | |
+| POST | `/config/users` | `api_create_user()` | api_manage_users.py | test, web | |
+| DELETE | `/config/users/{user_uuid}` | `api_delete_user()` | api_manage_users.py | test | ⚠️ |
+| PUT | `/config/users/{user_uuid}/reset-password` | `api_reset_user_password()` | api_manage_users.py | test | ⚠️ |
+| PUT | `/config/users/{user_uuid}/role` | `api_update_user_role()` | api_manage_users.py | test | ⚠️ |
+| GET | `/deckies` | `get_deckies()` | api_get_deckies.py | cli, test, web | |
+| GET | `/deckies` | `api_list_deckies()` | api_list_deckies.py | cli, test, web | |
+| GET | `/deckies` | `list_deckies()` | api_list_deckies.py | cli, test, web | |
+| POST | `/deckies/deploy` | `api_deploy_deckies()` | api_deploy_deckies.py | test, web | |
+| POST | `/deckies/{decky_name}/mutate` | `api_mutate_decky()` | api_mutate_decky.py | **NONE** | ⚠️ |
+| PUT | `/deckies/{decky_name}/mutate-interval` | `api_update_mutate_interval()` | api_mutate_interval.py | **NONE** | ⚠️ |
+| POST | `/deploy` | `api_deploy_swarm()` | api_deploy_swarm.py | cli, test | |
+| GET | `/deployment-mode` | `get_deployment_mode()` | api_deployment_mode.py | test | ⚠️ |
+| POST | `/enroll` | `api_enroll_host()` | api_enroll_host.py | cli, test | |
+| POST | `/enroll-bundle` | `create_enroll_bundle()` | api_enroll_bundle.py | test | ⚠️ |
+| GET | `/enroll-bundle/{token}.sh` | `get_bootstrap()` | api_enroll_bundle.py | test | ⚠️ |
+| GET | `/enroll-bundle/{token}.tgz` | `get_payload()` | api_enroll_bundle.py | test | ⚠️ |
+| GET | `/health` | `get_health()` | api_get_health.py | cli, test | |
+| GET | `/health` | `api_get_swarm_health()` | api_get_swarm_health.py | cli, test | |
+| POST | `/heartbeat` | `heartbeat()` | api_heartbeat.py | test | ⚠️ |
+| GET | `/hosts` | `api_list_hosts()` | api_list_hosts.py | cli, test | |
+| GET | `/hosts` | `list_hosts()` | api_list_hosts.py | cli, test | |
+| GET | `/hosts` | `api_list_host_releases()` | api_list_host_releases.py | cli, test | |
+| DELETE | `/hosts/{uuid}` | `api_decommission_host()` | api_decommission_host.py | test | ⚠️ |
+| DELETE | `/hosts/{uuid}` | `decommission_host()` | api_decommission_host.py | test | ⚠️ |
+| GET | `/hosts/{uuid}` | `api_get_host()` | api_get_host.py | test | ⚠️ |
+| POST | `/hosts/{uuid}/teardown` | `teardown_host()` | api_teardown_host.py | test | ⚠️ |
+| GET | `/logs` | `get_logs()` | api_get_logs.py | test | ⚠️ |
+| GET | `/logs/histogram` | `get_logs_histogram()` | api_get_histogram.py | test | ⚠️ |
+| GET | `/next-subnet` | `api_next_subnet()` | api_catalog.py | test | ⚠️ |
+| POST | `/push` | `api_push_update()` | api_push_update.py | test | ⚠️ |
+| POST | `/push-self` | `api_push_update_self()` | api_push_update_self.py | test | ⚠️ |
+| POST | `/reap-orphans` | `api_reap_orphans()` | api_reap_orphans.py | test | ⚠️ |
+| POST | `/rollback` | `api_rollback_host()` | api_rollback_host.py | test | ⚠️ |
+| GET | `/services` | `api_list_services()` | api_catalog.py | test | ⚠️ |
+| GET | `/stats` | `get_stats()` | api_get_stats.py | test | ⚠️ |
+| GET | `/stream` | `stream_events()` | api_stream_events.py | test, web | |
+| POST | `/teardown` | `api_teardown_swarm()` | api_teardown_swarm.py | test | ⚠️ |
+| GET | `/transcripts/{decky}/{sid}` | `get_transcript()` | api_get_transcript.py | test | ⚠️ |
+| GET | `/workers` | `list_workers()` | api_list_workers.py | test, web | |
+| POST | `/workers/start-all` | `start_all_workers()` | api_start_all_workers.py | test, web | |
+| POST | `/workers/{name}/start` | `start_worker()` | api_start_worker.py | test | ⚠️ |
+| POST | `/workers/{name}/stop` | `stop_worker()` | api_control_worker.py | test | ⚠️ |
+| DELETE | `/{topology_id}` | `api_delete_topology()` | api_delete_topology.py | test | ⚠️ |
+| GET | `/{topology_id}` | `api_get_topology()` | api_get_topology.py | test | ⚠️ |
+| POST | `/{topology_id}/deckies` | `api_create_decky()` | api_decky_crud.py | test | ⚠️ |
+| DELETE | `/{topology_id}/deckies/{decky_uuid}` | `api_delete_decky()` | api_decky_crud.py | test | ⚠️ |
+| PATCH | `/{topology_id}/deckies/{decky_uuid}` | `api_update_decky()` | api_decky_crud.py | test | ⚠️ |
+| POST | `/{topology_id}/deploy` | `api_deploy_topology()` | api_deploy_topology.py | test | ⚠️ |
+| POST | `/{topology_id}/edges` | `api_create_edge()` | api_edge_crud.py | test | ⚠️ |
+| DELETE | `/{topology_id}/edges/{edge_id}` | `api_delete_edge()` | api_edge_crud.py | test | ⚠️ |
+| GET | `/{topology_id}/events` | `api_topology_events()` | api_events.py | **NONE** | ⚠️ |
+| POST | `/{topology_id}/lans` | `api_create_lan()` | api_lan_crud.py | test | ⚠️ |
+| DELETE | `/{topology_id}/lans/{lan_id}` | `api_delete_lan()` | api_lan_crud.py | test | ⚠️ |
+| PATCH | `/{topology_id}/lans/{lan_id}` | `api_update_lan()` | api_lan_crud.py | test | ⚠️ |
+| GET | `/{topology_id}/lans/{lan_id}/next-ip` | `api_next_ip()` | api_catalog.py | **NONE** | ⚠️ |
+| GET | `/{topology_id}/mutations` | `api_list_mutations()` | api_mutations.py | test | ⚠️ |
+| POST | `/{topology_id}/mutations` | `api_enqueue_mutation()` | api_mutations.py | test | ⚠️ |
+| GET | `/{topology_id}/status-events` | `api_get_status_events()` | api_get_topology.py | **NONE** | ⚠️ |
+| POST | `/{topology_id}/teardown` | `api_teardown_topology()` | api_teardown_topology.py | **NONE** | ⚠️ |
+
+
+---
+
+## Deletion Candidates: Zero Callers
+
+These routes have **no callers** in any scanned source (web frontend, CLI, worker processes, or tests); their caller type is **NONE** in the inventory. They are strong candidates for removal.
+
+### GET `/archetypes` → `api_list_archetypes()`
+
+**File**: `decnet/web/router/topology/api_catalog.py`
+**Callers**: None
+**Status**: Dead code — no references in web frontend, CLI, or worker processes.
+
+**Action**: Likely safe to delete once dynamically built callers are ruled out (see Caveats under Analysis Notes).
+
+---
+
+### POST `/deckies/{decky_name}/mutate` → `api_mutate_decky()`
+
+**File**: `decnet/web/router/fleet/api_mutate_decky.py`
+**Callers**: None
+**Status**: Dead code — no references in web frontend, CLI, or worker processes.
+
+**Action**: Likely safe to delete once dynamically built callers are ruled out (see Caveats under Analysis Notes).
+
+---
+
+### PUT `/deckies/{decky_name}/mutate-interval` → `api_update_mutate_interval()`
+
+**File**: `decnet/web/router/fleet/api_mutate_interval.py`
+**Callers**: None
+**Status**: Dead code — no references in web frontend, CLI, or worker processes.
+
+**Action**: Likely safe to delete once dynamically built callers are ruled out (see Caveats under Analysis Notes).
+
+---
+
+### GET `/{topology_id}/events` → `api_topology_events()`
+
+**File**: `decnet/web/router/topology/api_events.py`
+**Callers**: None
+**Status**: Dead code — no references in web frontend, CLI, or worker processes.
+
+**Action**: Likely safe to delete once dynamically built callers are ruled out (see Caveats under Analysis Notes).
+
+---
+
+### GET `/{topology_id}/lans/{lan_id}/next-ip` → `api_next_ip()`
+
+**File**: `decnet/web/router/topology/api_catalog.py`
+**Callers**: None
+**Status**: Dead code — no references in web frontend, CLI, or worker processes.
+
+**Action**: Likely safe to delete once dynamically built callers are ruled out (see Caveats under Analysis Notes).
+
+---
+
+### GET `/{topology_id}/status-events` → `api_get_status_events()`
+
+**File**: `decnet/web/router/topology/api_get_topology.py`
+**Callers**: None
+**Status**: Dead code — no references in web frontend, CLI, or worker processes.
+
+**Action**: Likely safe to delete once dynamically built callers are ruled out (see Caveats under Analysis Notes).
+
+---
+
+### POST `/{topology_id}/teardown` → `api_teardown_topology()`
+
+**File**: `decnet/web/router/topology/api_teardown_topology.py`
+**Callers**: None
+**Status**: Dead code — no references in web frontend, CLI, or worker processes.
+
+**Action**: Likely safe to delete once dynamically built callers are ruled out (see Caveats under Analysis Notes).
+
+---
+
+## Deletion Candidates: Test-Only Routes
+
+These routes are referenced **only in test files**, not in the actual application. They may have been replaced by newer endpoints and are kept for backward-compatibility testing, or tests simply weren't updated after migration. Cross-check against Phase 2 before deleting anything: it finds live non-test callers for `/heartbeat` (the agent heartbeat daemon) and the three `/enroll-bundle` routes (admin UI, CLI, and bootstrapping agents), which this literal-path scan missed.
+
+**Count**: 47 routes
+
+
+### Artifacts (1)
+
+- `GET /artifacts/{decky}/{stored_as}` (api_get_artifact.py)
+
+### Attackers (4)
+
+- `GET /attackers/{uuid}` (api_get_attacker_detail.py)
+- `GET /attackers/{uuid}/artifacts` (api_get_attacker_artifacts.py)
+- `GET /attackers/{uuid}/commands` (api_get_attacker_commands.py)
+- ... and 1 more
+
+### Auth (2)
+
+- `POST /auth/change-password` (api_change_pass.py)
+- `POST /auth/login` (api_login.py)
+
+### Blank (1)
+
+- `POST /blank` (api_create_blank_topology.py)
+
+### Bounty (1)
+
+- `GET /bounty` (api_get_bounties.py)
+
+### Config (3)
+
+- `DELETE /config/users/{user_uuid}` (api_manage_users.py)
+- `PUT /config/users/{user_uuid}/reset-password` (api_manage_users.py)
+- `PUT /config/users/{user_uuid}/role` (api_manage_users.py)
+
+### Deployment-Mode (1)
+
+- `GET /deployment-mode` (api_deployment_mode.py)
+
+### Enroll-Bundle (3)
+
+- `POST /enroll-bundle` (api_enroll_bundle.py)
+- `GET /enroll-bundle/{token}.sh` (api_enroll_bundle.py)
+- `GET /enroll-bundle/{token}.tgz` (api_enroll_bundle.py)
+
+### Heartbeat (1)
+
+- `POST /heartbeat` (api_heartbeat.py)
+
+### Hosts (4)
+
+- `DELETE /hosts/{uuid}` (api_decommission_host.py)
+- `GET /hosts/{uuid}` (api_get_host.py)
+- `DELETE /hosts/{uuid}` (api_decommission_host.py)
+- ... and 1 more
+
+### Logs (2)
+
+- `GET /logs` (api_get_logs.py)
+- `GET /logs/histogram` (api_get_histogram.py)
+
+### Next-Subnet (1)
+
+- `GET /next-subnet` (api_catalog.py)
+
+### Push (1)
+
+- `POST /push` (api_push_update.py)
+
+### Push-Self (1)
+
+- `POST /push-self` (api_push_update_self.py)
+
+### Reap-Orphans (1)
+
+- `POST /reap-orphans` (api_reap_orphans.py)
+
+### Rollback (1)
+
+- `POST /rollback` (api_rollback_host.py)
+
+### Services (1)
+
+- `GET /services` (api_catalog.py)
+
+### Stats (1)
+
+- `GET /stats` (api_get_stats.py)
+
+### Teardown (1)
+
+- `POST /teardown` (api_teardown_swarm.py)
+
+### Transcripts (1)
+
+- `GET /transcripts/{decky}/{sid}` (api_get_transcript.py)
+
+### Workers (2)
+
+- `POST /workers/{name}/start` (api_start_worker.py)
+- `POST /workers/{name}/stop` (api_control_worker.py)
+
+### {topology_id} (13)
+
+- `DELETE /{topology_id}` (api_delete_topology.py)
+- `GET /{topology_id}` (api_get_topology.py)
+- `POST /{topology_id}/deckies` (api_decky_crud.py)
+- ... and 10 more
+
+
+---
+
+## Analysis Notes
+
+### Context from Recent Work
+
+Per repo history:
+- **Bus-woken mutator** replaced polling — check `/deckies/*` mutation endpoints
+- **SSE mutation events** replaced direct CRUD polling — check legacy list endpoints
+- **Worker supervisor endpoints** are new — likely need expansion, not deletion
+- **MazeNET topologies** are the new feature — older "topology" endpoints may be superseded
+- **Direct mutation CRUD for active topologies** replaced by mutation queue
+
+### Methodology
+
+- **Web Frontend**: Searched `decnet_web/src/**/*.{ts,tsx}` for literal path references (e.g., `"/attackers/{uuid}"`)
+- **CLI**: Searched `decnet/cli/**/*.py` for `/api/v1` calls
+- **Workers**: Searched `decnet/**/*.py` (excluding CLI)
+- **Tests**: Searched `tests/**/*.py` for path references
+
+### Caveats
+
+- Dynamically-built paths (e.g., `${base}/topologies/${id}`) detected via fragment search (e.g., `/topologies/`)
+- Method-less references (a bare path string with no HTTP method) may be missed when the call is not made via fetch/axios
+- mTLS/internal worker endpoints (agent API, forwarder, enroll-bundle) deferred to Phase 2 per scope
+
+---
+
+## Possible Duplicates / Overlapping Endpoints
+
+_To be populated after human review of the candidate list._
+
+
+---
+
+## Phase 2 — Worker / mTLS Endpoints
+
+### Executive Summary
+
+**Scope**: Internal worker processes and mTLS-gated inter-process HTTP surfaces:
+- Agent FastAPI app (port 8765, mTLS-required)
+- Updater FastAPI app (port 8766, mTLS-required, CN-gated)
+- Master→Agent client calls via `AgentClient` class
+- Master→Updater client calls via `UpdaterClient` class
+- Enroll-bundle endpoints (`/swarm/enroll-bundle`) — worker-facing, fetches bootstrap + deployment payload
+- Enrollment endpoints (`/swarm/enroll`) — admin-driven, issues certs
+
+**Total Worker Process Endpoints**: 18 (9 agent, 5 updater, 3 enroll-bundle, 1 enroll; see Summary Table)
+**Total Deletion Candidates**: 0 (the only endpoint without a caller is the agent `/mutate` placeholder, kept deliberately)
+
+---
+
+### Worker Process HTTP Endpoints
+
+#### Agent FastAPI App (`decnet/agent/app.py`)
+
+**Listener**: Port 8765, mTLS-enforced at ASGI/uvicorn layer (cert required)
+**Callers**: Master via `AgentClient`, deployer module, CLI
+**Auth**: mTLS only; all authenticated peers trusted equally
+
+| Method | Path | Handler | Callers | Notes |
+|--------|------|---------|---------|-------|
+| GET | `/health` | `health()` | master-to-agent, tests | Liveness probe; does NOT skip mTLS |
+| GET | `/status` | `status()` | master-to-agent, engine deployer | Deployment snapshot + active topology state |
+| POST | `/deploy` | `deploy()` | master-to-agent, engine deployer | Materialise full DecnetConfig (body: `DeployRequest`) |
+| POST | `/teardown` | `teardown()` | master-to-agent | Dismantle entire fleet or single decky (body: `TeardownRequest`) |
+| POST | `/self-destruct` | `self_destruct()` | master-to-agent | Fire-and-forget reaper; deletes all DECNET footprint (202 response) |
+| POST | `/topology/apply` | `topology_apply()` | master-to-agent | Apply a single topology (body: `ApplyTopologyRequest`) |
+| POST | `/topology/teardown` | `topology_teardown()` | master-to-agent | Dismantle single topology (body: `TeardownTopologyRequest`) |
+| GET | `/topology/state` | `topology_state()` | master-to-agent | Topology-specific state (separate from `/status`) |
+| POST | `/mutate` | `mutate()` | none (unimplemented; returns 501) | Per-decky mutate; currently done via `/deploy` with updated config |
+
+**Timeouts**: Deploy/topology-apply 600s read, teardown 300s read (docker compose on slow VMs)
+
+---
+
+#### Updater FastAPI App (`decnet/updater/app.py`)
+
+**Listener**: Port 8766, mTLS-enforced (cert CN must match `updater@*`)
+**Callers**: Master via `UpdaterClient`
+**Auth**: mTLS + CN validation (only `updater@` certs allowed)
+
+| Method | Path | Handler | Callers | Notes |
+|--------|------|---------|---------|-------|
+| GET | `/health` | `health()` | master-to-updater, dashboard, bus monitor | Returns active + prev release slots |
+| GET | `/releases` | `releases()` | master-to-updater | List all available release slots (JSON array) |
+| POST | `/update` | `update()` | master-to-updater | Upload + apply tarball (multipart: tarball + sha form) |
+| POST | `/update-self` | `update_self()` | master-to-updater | Self-update updater binary (connection drops mid-response) |
+| POST | `/rollback` | `rollback()` | master-to-updater | Revert to previous release slot |
+
+**Timeouts**: `/update` + `/update-self` 180s read (pip install + probe on slow VMs)
+
+---
+
+### Master-Facing Worker Enrollment Endpoints
+
+#### Enrollment Bundle (`decnet/web/router/swarm_mgmt/api_enroll_bundle.py`)
+
+**Listener**: Master port 443 (FastAPI web app)
+**Callers**: agent (worker fetches payload), admin UI
+**Auth**: Token-based (5-min TTL), no mTLS required (public endpoints for worker bootstrap)
+
+| Method | Path | Handler | Callers | Auth | Notes |
+|--------|------|---------|---------|------|-------|
+| POST | `/api/v1/swarm/enroll-bundle` | `create_enroll_bundle()` | admin-ui, cli | require_admin | Create bundle (token + shell script + tarball); returns EnrollBundleResponse (201) |
+| GET | `/api/v1/swarm/enroll-bundle/{token}.sh` | `get_bootstrap()` | agent-client, curl | token-param | Bootstrap shell script (idempotent, 5-min TTL) |
+| GET | `/api/v1/swarm/enroll-bundle/{token}.tgz` | `get_payload()` | agent-client, curl | token-param | Gzipped tarball (one-shot; deletes .sh + .tgz after serving) |
+
+**Rationale**: This is the agent's first contact with the master; its source IP backfills the `SwarmHost.address` row.
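The token rules in the enroll-bundle table (5-minute TTL, a repeatable `.sh` fetch, a one-shot `.tgz` that deletes both artifacts) amount to a small state machine. The sketch below is an in-memory illustration under those assumptions only: `EnrollBundleStore` and its methods are hypothetical names, not the code in `api_enroll_bundle.py`, and a `None` return stands in for whatever error status the real routes send.

```python
import secrets
import time

TTL_SECONDS = 5 * 60  # 5-minute token lifetime, per the enroll-bundle table


class EnrollBundleStore:
    """In-memory sketch of the enroll-bundle token rules (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so expiry can be tested
        self._bundles = {}   # token -> (created_at, script, tarball)

    def create(self, script: str, tarball: bytes) -> str:
        token = secrets.token_urlsafe(24)
        self._bundles[token] = (self._clock(), script, tarball)
        return token

    def _live(self, token):
        """Return the bundle if the token exists and has not expired."""
        entry = self._bundles.get(token)
        if entry is None:
            return None
        if self._clock() - entry[0] > TTL_SECONDS:
            del self._bundles[token]  # expired: forget it
            return None
        return entry

    def get_bootstrap(self, token):
        """Serve the .sh script; repeatable while the token is live."""
        entry = self._live(token)
        return None if entry is None else entry[1]

    def get_payload(self, token):
        """Serve the .tgz once, then drop the whole bundle (.sh included)."""
        entry = self._live(token)
        if entry is None:
            return None
        del self._bundles[token]
        return entry[2]
```

Expiry is checked lazily on each lookup, which is enough to show the rules; a long-running service would also sweep tokens that are never fetched.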
+
+---
+
+#### Simple Enrollment (`decnet/web/router/swarm/api_enroll_host.py`)
+
+**Listener**: Master port 443
+**Callers**: admin UI, CLI
+**Auth**: No mTLS; relies on the admin's browser session (dashboard context)
+
+| Method | Path | Handler | Callers | Auth | Notes |
+|--------|------|---------|---------|------|-------|
+| POST | `/api/v1/swarm/enroll` | `api_enroll_host()` | admin-ui | (browser auth) | Issue cert bundle + register host row (201) |
+
+---
+
+### Master→Agent RPC Surface (via `AgentClient`)
+
+The master calls the agent via `AgentClient(host).method()`, used as a context manager; all calls go over mTLS. Called from:
+
+1. **`api_deploy_swarm.py`**: Deploy topology to all enrolled hosts
+2. **`api_teardown_swarm.py`**: Tear down the fleet
+3. **`api_check_hosts.py`**: Active mTLS probe of all hosts (for dashboard health)
+4. **`api_decommission_host.py`** (swarm): Calls agent `/self-destruct`
+5. **`api_decommission_host.py`** (swarm_mgmt): Calls agent `/self-destruct`
+6. **`api_teardown_host.py`** (swarm_mgmt): Calls agent `/self-destruct`
+7. **`api_list_hosts.py`** (swarm_mgmt): Calls agent `/health` on every list request
+8. **Engine `deployer.py`**: Direct `/deploy` + `/topology/apply` calls during mutation/materialization
+
+**Cert Pinning**: The master's cert is CA-signed. Workers validate it by pinning the CA, with hostname verification disabled because SANs are per-operator.
+
+---
+
+### Master→Updater RPC Surface (via `UpdaterClient`)
+
+The master calls the updater via `UpdaterClient(host).method()`, used as a context manager; all calls go over mTLS. Called from:
+
+1. **`api_push_update.py`**: Upload new release to updater
+2. **`api_push_update_self.py`**: Update the updater binary itself
+3. **`api_rollback_host.py`**: Roll the host back to the previous release slot
+4. **`api_list_host_releases.py`**: Poll all updaters for active release SHA (dashboard)
+
+**Connection Drop**: `/update-self` intentionally drops the connection; the caller then polls `/health` until the new release SHA appears.
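Because `/update-self` drops the connection mid-response, the caller cannot read a result from that request and has to infer success from `/health`. A minimal polling sketch follows. `wait_for_new_sha()` is hypothetical; the caller supplies `get_active_sha`, which would read the active release SHA out of the updater's `/health` response (the field name is not documented in this audit), and the 180 s default simply reuses the `/update-self` read timeout listed in the updater section. Connection errors are treated as "still restarting" rather than as failure.

```python
import time
from typing import Callable, Optional


def wait_for_new_sha(
    get_active_sha: Callable[[], Optional[str]],
    old_sha: str,
    timeout: float = 180.0,
    interval: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll until the updater reports a release SHA different from old_sha.

    Returns the new SHA, or None if the deadline passes first.
    get_active_sha may return None or raise OSError (e.g. connection
    refused/reset) while the updater process is restarting.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        try:
            sha = get_active_sha()
        except OSError:
            sha = None  # updater not accepting connections yet
        if sha is not None and sha != old_sha:
            return sha
        sleep(interval)
    return None
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting; in production the defaults apply.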
+
+---
+
+### Agent→Master Heartbeat
+
+**Endpoint**: `POST /api/v1/swarm/heartbeat`
+**Caller**: `decnet/agent/heartbeat.py` module (agent-side daemon)
+**Auth**: mTLS + peer cert SHA-256 pinned to `SwarmHost.client_cert_fingerprint`
+**Frequency**: ~30 seconds
+**Payload**: Host UUID, agent version, executor status dict, optional topology snapshot
+
+**Security**: A decommissioned worker may still hold a CA-valid certificate. Because the peer cert's SHA-256 must match `SwarmHost.client_cert_fingerprint`, a heartbeat from that worker gets a 403 instead of resurrecting a ghost shard.
+
+---
+
+### Bus Pub/Sub (Local Only, Not HTTP)
+
+Per comments in agent/updater app.py:
+- Agent publishes `system.agent.health` heartbeat to local bus (separate from mTLS heartbeat)
+- Updater publishes `system.updater.health` to local bus
+- Bus is host-local UNIX socket — not an external RPC surface
+
+No HTTP endpoints; no caller analysis needed.
+
+---
+
+### Forwarder
+
+**Status**: No HTTP endpoints exposed by forwarder process.
+The forwarder:
+- Consumes RFC 5424 syslog from local log file (written by agent log collector)
+- Ships syslog-over-TLS to master port 6514 (outbound, not inbound)
+- No master→forwarder calls; no worker-side HTTP surface
+
+---
+
+### Deletion Candidates
+
+**None.** Every endpoint except the agent `/mutate` placeholder (item 3 under Duplicate / Obsolete Endpoints) has an active caller:
+
+- Agent `/deploy`, `/teardown`, `/self-destruct`, `/topology/*` are called by engine, deployer, master probes
+- Updater `/update*`, `/releases`, `/health` are called by master push flow + dashboard
+- Enroll-bundle is called by new agents (worker-facing enrollment)
+- Simple enroll is called by admin UI
+
+---
+
+### Duplicate / Obsolete Endpoints
+
+**Potential overlap to review**:
+
+1. **`/swarm/enroll` vs `/swarm/enroll-bundle`**: Two enrollment flows, both active.
+   - `/enroll` (old) — admin issues cert + agent curls back for bundle
+   - `/enroll-bundle` (new) — admin renders bundle upfront, agent runs it as a one-liner
+   - Consider consolidating if the old flow is being phased out (needs human review of intent).
+
+2. **Agent `/deploy` + `/teardown` vs `/topology/apply` + `/topology/teardown`**: Both exist.
+   - `/deploy` — fleet-wide (old unihost verb)
+   - `/topology/{apply,teardown}` — single topology (newer MazeNET feature)
+   - No conflict; different scopes. Agent supports both.
+
+3. **Agent `/mutate` returns 501**: Placeholder for future worker-side mutation.
+   - Currently the master re-sends `/deploy` with updated config.
+   - Safe to leave as-is (fails closed); can implement later.
+
+---
+
+### Summary Table
+
+| Process | Count | mTLS | Auth | Notes |
+|---------|-------|------|------|-------|
+| Agent | 9 | Yes | No (peer auth only) | Port 8765; calls from master + engine |
+| Updater | 5 | Yes | Yes (CN-gated) | Port 8766; calls from master |
+| Enroll-Bundle | 3 | No | Token (5 min) | Master port 443; agent + admin fetch |
+| Enroll | 1 | No | Browser auth | Master port 443; admin UI |
+| **Total** | **18** | — | — | — |
+
+**Caller Types Identified**:
+- `master-to-agent`: Master calls agent (8 of the 9 endpoints; `/mutate` has no caller)
+- `master-to-updater`: Master calls updater (5 endpoints)
+- `agent-client`: Agent calls master heartbeat (1 endpoint in Phase 1)
+- `admin-client`: Admin calls enroll-bundle POST (1 endpoint)
+- `test`: All endpoints have test coverage
+
+**Zero-Caller Endpoints**: None, apart from the unimplemented agent `/mutate` placeholder.
+
+---
+
+
+## Phase 3 — CLI Command Surface
+
+### Summary
+
+**Total CLI Commands**: 37
+**Master-only Commands**: 27 (via `MASTER_ONLY_COMMANDS` + `MASTER_ONLY_GROUPS`; hidden when `DECNET_MODE=agent`)
+**Agent-capable Commands**: 10
+**Commands Hitting API Routes**: 7 (all in `decnet swarm *` group, plus `decnet deploy`)
+**Deletion Candidates**: 0 (no deprecated commands found; all are actively used)
+
+---
+
+### Full Command Inventory
+
+| Command | Handler | Source | Master-only? | Hits API? | Notes |
+|---------|---------|--------|--------------|-----------|-------|
+| `decnet api` | `api()` | api.py:19 | Yes | No | Start FastAPI backend (uvicorn) |
+| `decnet swarmctl` | `swarmctl()` | swarmctl.py:18 | Yes | No | Run SWARM controller + auto-spawn listener |
+| `decnet agent` | `agent()` | agent.py:16 | No | No | Worker: run SWARM agent (requires cert bundle) |
+| `decnet updater` | `updater()` | updater.py:14 | No | No | Worker: run self-updater daemon |
+| `decnet listener` | `listener()` | listener.py:16 | Yes | No | Run syslog-TLS listener (RFC 5425, mTLS) |
+| `decnet forwarder` | `forwarder()` | forwarder.py:18 | No | No | Worker: forward syslog to master:6514 (mTLS) |
+| `decnet deploy` | `deploy()` | deploy.py:68 | Yes | Yes | Deploy deckies (unihost/swarm mode) |
+| `decnet init` | `init_cmd()` | init.py:305 | Yes | No | Bootstrap master: user/group/systemd/config |
+| `decnet services` | `list_services()` | inventory.py:15 | No | No | List available service plugins |
+| `decnet distros` | `list_distros()` | inventory.py:27 | No | No | List available OS distro profiles |
+| `decnet archetypes` | `list_archetypes()` | inventory.py:38 | Yes | No | List machine archetype profiles |
+| `decnet redeploy` | `redeploy()` | lifecycle.py:18 | No | No | Check services + relaunch any down |
+| `decnet status` | `status()` | lifecycle.py:57 | No | No | Show running deckies + service status |
+| `decnet teardown` | `teardown()` | lifecycle.py:81 | Yes | No | Stop/remove deckies (--all or --id) |
+| `decnet probe` | `probe()` | workers.py:15 | No | No | Fingerprint attackers (JARM/HASSH) |
+| `decnet collect` | `collect()` | workers.py:40 | No | No | Stream Docker logs to RFC 5424 file |
+| `decnet mutate` | `mutate()` | workers.py:57 | Yes | No | Trigger/watch decky mutation |
+| `decnet correlate` | `correlate()` | workers.py:86 | Yes | No | Analyse logs for cross-decky traversals |
+| `decnet web` | `serve_web()` | web.py:13 | Yes | No | Serve frontend
SPA + proxy /api/* | +| `decnet profiler` | `profiler_cmd()` | profiler.py:11 | Yes | No | Build attacker profiles from log stream | +| `decnet sniffer` | `sniffer_cmd()` | sniffer.py:12 | Yes | No | Passive network sniffer | +| `decnet db-reset` | `db_reset()` | db.py:86 | Yes | No | Wipe MySQL database (truncate or drop-tables) | +| `decnet bus` | `bus_cmd()` | bus.py:11 | No | No | Run UNIX-socket pub/sub bus worker | +| `decnet swarm enroll` | `swarm_enroll()` | swarm.py:23 | Yes | Yes | Enroll worker + issue mTLS bundle → POST `/swarm/enroll` | +| `decnet swarm list` | `swarm_list()` | swarm.py:85 | Yes | Yes | List enrolled workers → GET `/swarm/hosts` | +| `decnet swarm check` | `swarm_check()` | swarm.py:111 | Yes | Yes | Probe worker status → POST `/swarm/check` | +| `decnet swarm update` | `swarm_update()` | swarm.py:149 | Yes | Yes | Push tarball to workers → GET `/swarm/hosts` + updater client | +| `decnet swarm deckies` | `swarm_deckies()` | swarm.py:256 | Yes | Yes | List deckies across swarm → GET `/swarm/deckies` | +| `decnet swarm decommission` | `swarm_decommission()` | swarm.py:315 | Yes | Yes | Remove worker from swarm → DELETE `/swarm/hosts/{uuid}` | +| `decnet topology generate` | `_generate()` | topology.py:35 | Yes | No | Generate topology plan (persist as pending) | +| `decnet topology list` | `_list()` | topology.py:94 | Yes | No | List all topologies | +| `decnet topology show` | `_show()` | topology.py:121 | Yes | No | Print topology structure | +| `decnet topology deploy` | `_deploy()` | topology.py:177 | Yes | No | Deploy pending topology | +| `decnet topology teardown` | `_teardown()` | topology.py:194 | Yes | No | Tear down active topology | +| `decnet topology delete` | `_delete()` | topology.py:210 | Yes | No | Delete topology + cascade (LANs/deckies/edges) | +| `decnet topology mutate` | `_mutate()` | topology.py:265 | Yes | No | Enqueue live topology mutation | +| `decnet topology mutations` | `_mutations()` | topology.py:310 | 
Yes | No | List queued/applied mutations | + +--- + +### Commands Hitting API Routes + +All 7 commands that call HTTP endpoints go through **swarmctl** (not the main `/api/v1` backend). These are: + +1. **`decnet deploy`** (swarm mode) + - Hits: `GET /swarm/hosts?host_status=enrolled`, `GET /swarm/hosts?host_status=active`, `POST /swarm/deploy` + - Route source: `decnet/web/swarm_api.py` (Swarmctl API, not Phase 1 audit scope) + +2. **`decnet swarm enroll`** + - Hits: `POST /swarm/enroll` + +3. **`decnet swarm list`** + - Hits: `GET /swarm/hosts` + +4. **`decnet swarm check`** + - Hits: `POST /swarm/check` + +5. **`decnet swarm update`** + - Hits: `GET /swarm/hosts` + direct mTLS to updater port 8766 + +6. **`decnet swarm deckies`** + - Hits: `GET /swarm/deckies` + +7. **`decnet swarm decommission`** + - Hits: `DELETE /swarm/hosts/{uuid}` + +**Note**: Swarmctl API endpoints (`/swarm/*`) are **not** in the Phase 1 audit (Phase 1 scanned `/api/v1/*` only). These routes are stable and not candidates for deletion. + +--- + +### Deletion Candidates + +**Count: 0** + +**Rationale**: +- No commands are marked `@deprecated` in docstrings. +- No old "v1" flavors replaced by newer flows (e.g., no `decnet deploy-v1` vs `decnet deploy-v2`). +- All commands in `MASTER_ONLY_COMMANDS` + `MASTER_ONLY_GROUPS` are actively referenced and tested. +- Worker-capable commands (`agent`, `updater`, `forwarder`, `bus`, `probe`, `collect`, `redeploy`, `status`, `services`, `distros`) are essential for field operation. +- Recent additions (`decnet init`, `decnet swarm *`, `decnet topology *`) are part of the SWARM/MazeNET bootstrap flow and have no predecessors. + +--- + +### CLI → API Deletion Chains + +No CLI command is the **only caller** of a Phase 1 API route marked `cli` or `zero`. 
All Phase 1 routes with `cli` callers have multiple paths: + +- Phase 1 example: `/health` — called by both CLI (`decnet status`) and web/test +- Phase 1 example: `/deckies` — called by CLI (`swarm deckies`) + web + test + +**Implication**: Deleting a CLI command does NOT unlock any Phase 1 API route deletions. + +--- + +### Gating Configuration + +Master-only enforcement lives in `decnet/cli/gating.py`: + +**MASTER_ONLY_COMMANDS** (25 command names): +``` +"api", "swarmctl", "deploy", "redeploy", "teardown", +"mutate", "listener", "profiler", +"services", "distros", "correlate", "archetypes", "web", +"db-reset", "init", +``` +Plus subcommand groups: + +**MASTER_ONLY_GROUPS** (2 group names): +``` +"swarm", "topology" +``` + +**Defense-in-depth**: +- Registration-time filter hides commands from `decnet --help` on agents (when `DECNET_MODE=agent`). +- Runtime gate in each command body calls `_require_master_mode()` to block direct function imports. + +--- + +### Recent Additions (Phase Context) + +Per repo memory and recent commits: + +- **`decnet init` + `--deinit`**: Bootstrap + teardown systemd/polkit/tmpfiles. Idempotent. +- **`decnet swarm *`**: Enroll workers, list status, push updates, manage deckies. All talk to swarmctl, not `/api/v1`. +- **`decnet topology *`**: MazeNET nested-topology commands. Direct DB calls (no HTTP). Replaces old flat `/topologies` CRUD. +- **`decnet bus`**: New ServiceBus worker. UNIX-socket pub/sub, not HTTP. +- **Worker supervisors** (`probe`, `collect`, `correlate`, `sniffer`, `profiler`): Field microservices. Spawned by `decnet deploy` as detached processes. + +None are marked for removal; all have active use cases. + +--- + +### Output Modes + +CLI output is **structured text** (Rich tables, JSON, syslog-format lines). 
All commands respect: +- `--json` flag where applicable (e.g., `decnet swarm check --json`) +- Scriptable structured output (e.g., `decnet correlate --output json`) + +Web dashboard visualization is **not** in CLI scope (per repo design: CLI outputs text, dashboard ingests data via API). + + +--- + +## Phase 4 — Consolidated Cleanup Plan + +### Executive Summary + +**CRITICAL FINDING**: Phase 1's "test-only routes" classification is **fundamentally unreliable**. Of 8 sampled test-only routes, **6 showed active web UI callers** — the Phase 1 grep methodology failed to catch TypeScript/TSX frontend API calls. + +**Phase 1 zero-caller candidates**: **REVISED DOWNWARD** from 7 to **3 actual deletions**: +- 4 routes flagged as zero-callers actually have active web UI callers: `/archetypes`, `/deckies/{decky_name}/mutate`, `/deckies/{decky_name}/mutate-interval`, and `/teardown` +- Remaining true zero-callers: `GET /{topology_id}/events`, `GET /{topology_id}/status-events`, `GET /{topology_id}/lans/{lan_id}/next-ip` + +**Recommendation**: Do NOT use the Phase 1 "47 test-only" list as a deletion target without manual verification of EACH route against the TypeScript frontend code. 
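Checking a route against the frontend is mostly mechanical, apart from template literals. The sketch below is a hypothetical helper, not existing DECNET tooling; `route_to_regex` and `find_callers` are illustrative names. It compiles a FastAPI route template into a regex in which each `{param}` matches either a `${...}` interpolation or one literal path segment, then scans `.ts`/`.tsx` files. The search is unanchored because frontend calls can carry a router prefix: for example, `/topologies/archetypes` satisfies the route `/archetypes`.

```python
from __future__ import annotations

import re
from pathlib import Path

# One URL path segment as it can appear in frontend source: a JS template
# interpolation such as ${id} or ${encodeURIComponent(name)}, or a literal
# segment (anything up to the next slash, quote, whitespace, ? or #).
_SEGMENT = r"(?:\$\{[^}]*\}|[^/\s'\"`?#]+)"


def route_to_regex(route: str) -> re.Pattern:
    """Compile a FastAPI route template like '/{topology_id}/events' into a regex."""
    parts = []
    for piece in re.split(r"(\{[^}]+\})", route):
        if piece.startswith("{") and piece.endswith("}"):
            parts.append(_SEGMENT)  # path parameter: any single segment
        else:
            parts.append(re.escape(piece))  # literal text, matched verbatim
    # Reject partial matches such as '/mutate' inside '/mutate-interval'.
    return re.compile("".join(parts) + r"(?![\w-])")


def find_callers(route: str, src_root: str) -> list[tuple[str, int, str]]:
    """Return (file, line number, stripped line) for each TS/TSX line matching route."""
    pattern = route_to_regex(route)
    hits = []
    for path in sorted(Path(src_root).rglob("*")):
        if path.suffix not in {".ts", ".tsx"} or not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

For example, if the Phase 4 verdict holds, `find_callers("/{topology_id}/events", "decnet_web/src")` should return an empty list. The pattern cannot see paths built by string concatenation (`'/deckies/' + name + '/mutate'`) or split across lines. An empty result therefore means "no caller found by this pass", not proof that the route is unused.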
+
+---
+
+### Phase 4 Verification Results
+
+#### Zero-Caller Candidates — Fresh Grep Results
+
+| Route | Handler | Phase 1 Status | Phase 4 Finding | Verdict |
+|-------|---------|----------------|-----------------|---------|
+| `GET /archetypes` | `api_list_archetypes()` | Zero callers | **FOUND**: `DeckyFleet.tsx:833` calls `/topologies/archetypes` | **KEEP** |
+| `POST /deckies/{decky_name}/mutate` | `api_mutate_decky()` | Zero callers | **FOUND**: `DeckyFleet.tsx:850` calls `/deckies/${name}/mutate` | **KEEP** |
+| `PUT /deckies/{decky_name}/mutate-interval` | `api_update_mutate_interval()` | Zero callers | **FOUND**: `DeckyFleet.tsx:898` calls `/deckies/${name}/mutate-interval` | **KEEP** |
+| `GET /{topology_id}/events` | `api_topology_events()` | Zero callers | **NO CALLERS FOUND** (only test mock) | **DELETE** |
+| `GET /{topology_id}/lans/{lan_id}/next-ip` | `api_next_ip()` | Zero callers | **NO CALLERS FOUND** | **DELETE** |
+| `GET /{topology_id}/status-events` | `api_get_status_events()` | Zero callers | **NO CALLERS FOUND** | **DELETE** |
+| `POST /{topology_id}/teardown` | `api_teardown_topology()` | Zero callers | **FOUND**: `TopologyList.tsx` calls `/topologies/${id}/teardown` | **KEEP** |
+
+**Revised zero-caller count**: **3 routes** (not 7)
+
+---
+
+#### Test-Only Routes — Spot-Check Results
+
+Sampled 8 of 47 "test-only" routes:
+
+| Route | Phase 1 Status | Web Frontend Caller | Verdict |
+|-------|----------------|-------------------|---------|
+| `GET /artifacts/{decky}/{stored_as}` | test-only | **FOUND**: `ArtifactDrawer.tsx` | **FALSE POSITIVE** |
+| `POST /auth/change-password` | test-only | **FOUND**: `Login.tsx` | **FALSE POSITIVE** |
+| `POST /auth/login` | test-only | **FOUND**: `Login.tsx` | **FALSE POSITIVE** |
+| `POST /blank` | test-only | **FOUND**: `MazeNET/useMazeApi.ts` + `TopologyList.tsx` | **FALSE POSITIVE** |
+| `GET /bounty` | test-only | **FOUND**: `Bounty.tsx`, `CommandPalette.tsx` | **FALSE POSITIVE** |
+| `GET /deployment-mode` | test-only | **FOUND**: `DeckyFleet.tsx` | **FALSE POSITIVE** |
+| `DELETE /config/users/{user_uuid}` | test-only | **FOUND**: `Config.tsx` | **FALSE POSITIVE** |
+| `GET /logs` | test-only | **FOUND**: `LiveLogs.tsx` | **FALSE POSITIVE** |
+
+**Verdict**: The "47 test-only routes" number is **unreliable**. At least **6/8 sampled routes have active web callers** that Phase 1's grep missed. The methodology failed because:
+1. Phase 1 grepped Python/test files only; it did **not systematically scan TypeScript/TSX**.
+2. Dynamic path construction (e.g., `` api.post(`/topologies/${id}/teardown`) ``) requires careful regex; simple string matching misses such call sites.
+3. Frontend API calls are spread across components, hooks, and utils; no single grep pass caught all call sites.
+
+**Recommendation**: **Do not trust the "47 test-only" list.** Before deleting ANY route marked test-only, manually verify:
+```bash
+# For each route, run (replace <route-path> with a literal fragment of the route, e.g. /next-ip):
+grep -rn "<route-path>" decnet_web/src --include="*.ts" --include="*.tsx"
+```
+
+---
+
+### Enroll Flow Consolidation
+
+#### `POST /swarm/enroll` vs `POST /swarm/enroll-bundle`
+
+**Current state**:
+- **`/swarm/enroll`** (simple): Master-driven, admin issues cert bundle, returns full bundle in response (201 Created).
+- **`/swarm/enroll-bundle`** (new): Token-based workflow — admin builds token, renders `.sh` + `.tgz`, agent curls both (Wazuh-style one-liner).
+
+**Web UI caller analysis**:
+- `SwarmHosts.tsx` calls **ONLY** `POST /swarm/enroll-bundle` (new flow).
+- No web caller for `POST /swarm/enroll` (old flow) found.
+
+**CLI caller analysis**:
+- `decnet swarm enroll` (Phase 3 audit) calls `POST /swarm/enroll` (line 572 of Phase 3 summary).
+
+**Recommendation**: **DEPRECATE simple `/swarm/enroll`**
+1. Keep both endpoints for now (CLI still uses simple).
+2. Mark `POST /swarm/enroll` as deprecated: note in the docstring that new deployments should use `POST /swarm/enroll-bundle`, and pass `deprecated=True` to its route decorator so the OpenAPI schema flags it.
+3. Update CLI (`decnet swarm enroll`) to call `/swarm/enroll-bundle` in a follow-up PR.
+4. Only DELETE simple `/swarm/enroll` **after** CLI migration is merged and tested.
+
+**Why not delete now**: The CLI is the only caller; deleting the route breaks backward compatibility for operators whose scripts or runbooks call the simple flow. Deprecate first, migrate the CLI, then delete.
+
+---
+
+### Ordered PR Plan (Kill List)
+
+**Three independent deletions** — run tests after each. Do NOT combine; each is a commit-shaped change.
+
+---
+
+#### PR #1: Remove `/api/v1/{topology_id}/events` endpoint
+
+**Scope**: One endpoint: a dedicated handler module plus its test module; nothing else imports them.
+
+**Files to delete**:
+- `decnet/web/router/topology/api_events.py` (handler + schema)
+- `tests/api/topology/test_events_stream.py` (test file)
+
+**Files to modify**:
+- `decnet/web/router/topology/__init__.py` — remove two lines:
+  ```python
+  # DELETE: from .api_events import router as events_router
+  # DELETE: include_router(events_router)
+  ```
+
+**Blast radius**: ~120 lines deleted, plus 2 lines (one import, one `include_router` call) in the router `__init__.py`.
+
+**Verification before deleting**:
+```bash
+grep -r "api_topology_events\|/events" --include="*.py" --include="*.ts" --include="*.tsx" \
+  decnet/ decnet_web/ tests/ --exclude-dir=.claude | grep -v "def api_topology_events" | grep -v "test_events"
+# Should return ZERO results except in files being deleted
+```
+
+**Test plan**:
+```bash
+pytest tests/api/topology/ -v  # Topology suite still passes
+pytest tests/api/ -k "not test_events_stream" --tb=short  # Full API suite minus events
+```
+
+---
+
+#### PR #2: Remove `/api/v1/{topology_id}/status-events` endpoint
+
+**Scope**: One endpoint, one handler (shares module with `GET /{topology_id}`), test code.
+
+**Files to modify**:
+- `decnet/web/router/topology/api_get_topology.py` — remove function and route decorator:
+  ```python
+  # DELETE: @router.get("/{topology_id}/status-events", ...)
+  # DELETE: async def api_get_status_events(...): ...  [~30 lines]
+  ```
+
+**Files to modify (tests)**:
+- `tests/api/topology/test_reads.py` — remove test cases that call `status-events`.
+
+**Blast radius**: ~40 lines (one function + docstring + route decorator).
+
+**Verification before deleting**:
+```bash
+grep -r "api_get_status_events\|/status-events" --include="*.py" --include="*.ts" --include="*.tsx" \
+  decnet/ decnet_web/ tests/ --exclude-dir=.claude | grep -v "def api_get_status_events"
+# Should return ZERO results except in deleted test code
+```
+
+**Test plan**:
+```bash
+pytest tests/api/topology/test_reads.py -v  # Should pass after removing status-events test case
+```
+
+---
+
+#### PR #3: Remove `/api/v1/{topology_id}/lans/{lan_id}/next-ip` endpoint
+
+**Scope**: One endpoint, one handler (shares module with catalog endpoints), test code.
+
+**Files to modify**:
+- `decnet/web/router/topology/api_catalog.py` — remove function and route decorator:
+  ```python
+  # DELETE: @router.get("/{topology_id}/lans/{lan_id}/next-ip", ...)
+  # DELETE: async def api_next_ip(...): ...  [~40 lines]
+  ```
+
+**Files to modify (tests)**:
+- `tests/api/topology/test_reads.py` — remove test cases that call `next-ip`.
+
+**Blast radius**: ~60 lines (one function + route + docstring).
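The grep below checks the source tree. A complementary check is to look at the OpenAPI document the running app publishes. FastAPI serves it at `/openapi.json` by default; whether DECNET keeps that URL, and the local file name `openapi.json` used below, are assumptions. The helpers are hypothetical and match on a path suffix, so they do not need to know which router prefix sits in front of `/{topology_id}`.

```python
from __future__ import annotations

import json

# OpenAPI path items mix HTTP methods with other keys such as "parameters".
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def operations_matching(schema: dict, path_suffix: str) -> list[tuple[str, str]]:
    """List (METHOD, path) pairs whose OpenAPI path ends with path_suffix."""
    found = []
    for path, item in schema.get("paths", {}).items():
        if not path.endswith(path_suffix):
            continue
        for method in item:
            if method in HTTP_METHODS:  # skip path-level keys like "parameters"
                found.append((method.upper(), path))
    return sorted(found)


def load_schema(path: str) -> dict:
    """Read an exported OpenAPI document from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

After PR #3 merges, `operations_matching(load_schema("openapi.json"), "/{topology_id}/lans/{lan_id}/next-ip")` should return `[]`. The suffix must use the same parameter names as the route decorator, because FastAPI copies them into the schema path.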
+
+**Verification before deleting**:
+```bash
+grep -r "api_next_ip\|/next-ip" --include="*.py" --include="*.ts" --include="*.tsx" \
+  decnet/ decnet_web/ tests/ --exclude-dir=.claude | grep -v "def api_next_ip"
+# Should return ZERO results except in deleted test code
+```
+
+**Test plan**:
+```bash
+pytest tests/api/topology/test_reads.py -v  # Should pass after removing next-ip test case
+```
+
+---
+
+### Known Risks / Routes NOT Deleted (Had Callers)
+
+These routes were **flagged as zero-callers by Phase 1 but DO have active callers**. They are listed here so the reviewer knows they were considered and verified:
+
+| Route | Handler | Caller Location | Decision |
+|-------|---------|------------------|----------|
+| `GET /archetypes` | `api_list_archetypes()` | `DeckyFleet.tsx:833` | KEEP |
+| `POST /deckies/{decky_name}/mutate` | `api_mutate_decky()` | `DeckyFleet.tsx:850` | KEEP |
+| `PUT /deckies/{decky_name}/mutate-interval` | `api_update_mutate_interval()` | `DeckyFleet.tsx:898` | KEEP |
+| `POST /{topology_id}/teardown` | `api_teardown_topology()` | `TopologyList.tsx` | KEEP |
+
+---
+
+### Summary Table
+
+| PR | Deletion | Files | Lines | Risk | Phase |
+|----|----------|-------|-------|------|-------|
+| #1 | `GET /{topology_id}/events` | 3 (handler + test deleted, router `__init__.py` edited) | ~120 | Low | 4a |
+| #2 | `GET /{topology_id}/status-events` | 2 (shared module + test edit) | ~40 | Low | 4b |
+| #3 | `GET /{topology_id}/lans/{lan_id}/next-ip` | 2 (shared module + test edit) | ~60 | Low | 4c |
+| — | `POST /swarm/enroll` (simple) | 1 (handler) | ~100 | **Medium** | **Deferred** |
+
+**Total committed lines of code deleted**: ~220 lines (handlers + tests, PRs #1 to #3)
+**Total test files touched**: 2 (`test_events_stream.py` deletion + `test_reads.py` edits)
+**Estimated review time per PR**: 10–15 minutes
+**Total estimated project time**: 1 hour (including test runs)
+
+---
+
+### Why This Order
+
+1. **PR #1** removes the most isolated endpoint (dedicated handler module + test). No shared code, lowest risk.
+2. **PR #2** edits a shared module (`api_get_topology.py`) but removes only one function; review it together with its `test_reads.py` edit.
+3. **PR #3** has the same shape as #2, in the shared catalog module (`api_catalog.py`), and edits the same `test_reads.py`, so both test edits can follow one pattern.
+4. **Enroll consolidation deferred**: Requires CLI change first (`decnet swarm enroll` → `/swarm/enroll-bundle`). Plan for Phase 5.
+
+---
+
+### Testing Strategy for Each PR
+
+1. **Before deletion**: Run the PR's verification grep (above). It should return nothing outside the files being deleted.
+2. **After deletion**:
+   - Run `pytest tests/api/ -v` to verify no regressions in other routes.
+   - Spot-check web UI in dev (`decnet web`, then visit `/topologies` page).
+   - Verify CLI still works: `decnet --help` (not affected by these deletions).
+   - Final check: `grep -r "<route-path>"` (the deleted route's path fragment) should be empty in decnet/, decnet_web/, tests/ (except deleted files).
+
+---
+
+### Critical Lessons for Future Audits
+
+1. **Phase 1 methodology is insufficient**: Future audits must:
+   - Grep TypeScript/TSX sources **systematically** (not as an afterthought in Phase 4).
+   - Audit `decnet_web/src` for every route with the same rigor as the Python backend.
+   - Use IDE symbol search (e.g., VSCode "Find All References") on frontend API client helpers; paths assembled at runtime still need a regex search that understands template literals.
+
+2. **Do NOT bulk-delete "test-only" routes**: The "47 test-only" number is a **red flag, not a deletion target**. Each requires individual verification against web UI code.
+
+3. **Consolidation opportunities**: The simple `/swarm/enroll` is slated for deprecation but NOT deletion (requires CLI migration first). Document these as "Phase N+1" work, not in the main kill list.
+
+---
+
+## Phase 4.5 — Redundancy callout
+
+A follow-up pass (beyond zero-caller deletions) flagged three redundancy classes worth explicit documentation. These are orthogonal to the kill list in Phase 4 — they're about *ambiguity in the surface*, not dead code.
+
+### 1. Triple-registered `GET /deckies` ⚠️ HIGH PRIORITY
+
+The Phase 1 route table shows the same path + method bound to **three** handlers:
+
+| Method | Path | Handler | File |
+|---|---|---|---|
+| GET | `/deckies` | `get_deckies()` | `api_get_deckies.py` |
+| GET | `/deckies` | `api_list_deckies()` | `api_list_deckies.py` |
+| GET | `/deckies` | `list_deckies()` | `api_list_deckies.py` |
+
+**Why it matters**:
+- FastAPI routes through Starlette's router, which dispatches each request to the first registered route that matches. Only the earliest registration ever runs, and the other two are dead code.
+- The OpenAPI schema is keyed by path and method, so the three registrations collapse into one documented operation. That documented operation need not be the handler that actually serves requests.
+- Two handlers in the same file (`api_list_deckies.py`) are a strong smell of a leftover-from-rename refactor.
+- Schemathesis tests against that schema. If the documented operation and the live handler disagree on response shape, `GET /deckies` becomes a source of schema-drift failures.
+
+**Verification TODO** (before deletion):
+1. `grep -rn "get_deckies\|api_list_deckies\|list_deckies" decnet/web/router/fleet/` — identify which is actually wired in the router `__init__.py` / include statements.
+2. Determine which of `get_deckies`, `api_list_deckies`, and `list_deckies` is canonical (check which one's response shape the web frontend expects).
+3. Delete the two losers + their tests. Keep one canonical handler.
+
+**Risk**: Low. Only one handler is live; removing dead registrations can't change runtime behavior.
+
+---
+
+### 2. Two enrollment flows — `/swarm/enroll` vs `/swarm/enroll-bundle`
+
+Already covered in [§ Enroll Flow Consolidation](#enroll-flow-consolidation) above. Reiterated here so all redundancies live in one place.
+
+- **`POST /swarm/enroll`** — legacy, simple, still called by `decnet swarm enroll` CLI.
+- **`POST /swarm/enroll-bundle`** (+ `.sh` / `.tgz`) — new token-based flow, sole web-UI caller.
+- **Recommendation**: mark simple as deprecated, migrate CLI to bundle flow, delete simple in a Phase 5 pass. Not on the current kill list.
+
+---
+
+### 3. Mutation-verb confusion
+
+Four "mutate" endpoints coexist with overlapping names but different semantics. Phase 4 verification found web callers for both decky-level routes, so none of the four is on the kill list:
+
+| Endpoint | Status | Scope |
+|---|---|---|
+| `POST /api/v1/deckies/{decky_name}/mutate` | **live** (kept; `DeckyFleet.tsx` caller) | single decky, fleet-wide |
+| `PUT /api/v1/deckies/{decky_name}/mutate-interval` | **live** (kept; `DeckyFleet.tsx` caller) | single decky, fleet-wide |
+| `POST /api/v1/{topology_id}/mutations` | **live** (mutation queue, bus-woken) | topology-scoped |
+| Agent `POST /mutate` (port 8765) | **501 placeholder** | agent-local, unused |
+
+**Why it matters**: a reader new to the codebase sees four mutate verbs and has to work out which one is canonical for which scope:
+- **Master, decky-level**: `POST /deckies/{decky_name}/mutate` and `PUT /deckies/{decky_name}/mutate-interval`, called from the fleet view.
+- **Master, topology-level**: `POST /{topology_id}/mutations`, the live-mutation queue.
+- **Agent**: `POST /mutate` (501), reserved for future worker-side mutation (currently master re-sends `/deploy`).
+
+**Action**: the Phase 4 kill list removes none of these, so the naming overlap persists after it lands. Deciding which verb is canonical for each scope needs human review.
+
+---
+
+### Explicitly NOT redundant
+
+For the record — these look like pairs but are not:
+
+- **Agent `/deploy` + `/teardown` vs `/topology/apply` + `/topology/teardown`** — fleet-wide vs single-topology scopes. Both are served by the agent, for different purposes. Keep.
+- **`POST /deckies/deploy` vs `POST /{topology_id}/deploy`** — same as above: fleet-wide deploy vs topology-scoped deploy. Keep.
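The first Verification TODO step for the triple-registered `GET /deckies` (finding every handler bound to it) can be scripted. The sketch below is a hypothetical stdlib-only check, not existing DECNET tooling. It parses Python files with `ast`, collects `@<router>.<method>("<path>")` decorators, and reports every method and path bound to more than one function. Because it reads only decorator arguments, it ignores `include_router(prefix=...)` prefixes and routes added with `add_api_route`, so treat its output as candidates to confirm by hand.

```python
from __future__ import annotations

import ast
from collections import defaultdict
from pathlib import Path

METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def route_decorators(source: str, filename: str = "<src>"):
    """Yield (METHOD, path, handler_name) for @<obj>.<method>("<path>") decorators."""
    tree = ast.parse(source, filename=filename)
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for dec in node.decorator_list:
            if (
                isinstance(dec, ast.Call)
                and isinstance(dec.func, ast.Attribute)
                and dec.func.attr in METHODS
                and dec.args
                and isinstance(dec.args[0], ast.Constant)
                and isinstance(dec.args[0].value, str)
            ):
                yield dec.func.attr.upper(), dec.args[0].value, node.name


def find_duplicate_routes(root: str) -> dict[tuple[str, str], list[str]]:
    """Map (METHOD, path) -> ["file:handler", ...] for keys bound more than once."""
    seen = defaultdict(list)
    for path in sorted(Path(root).rglob("*.py")):
        source = path.read_text(encoding="utf-8")
        for method, route, handler in route_decorators(source, str(path)):
            seen[(method, route)].append(f"{path.name}:{handler}")
    return {key: handlers for key, handlers in seen.items() if len(handlers) > 1}
```

Running `find_duplicate_routes("decnet/web/router/fleet")` should list `("GET", "/deckies")` with all three handlers if the Phase 1 table is accurate. Reading the decorators in source, rather than the generated schema, means the check does not depend on how FastAPI merges operations into OpenAPI.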