DummyRepo could not be instantiated: the TLS-cert fingerprint rollup
added a new abstract method without a stub here. Add the override and
a call site so the abstract pass body is hit.
Three independent issues conspired to make stress tests record 0 requests:
1. Every virtual user did /auth/login in on_start. With 1000 users in a
spike window, bcrypt-bound logins never finished and on_start failed
for all users — aggregated requests stayed at 0. Pre-fetch a single
admin token in the fixture (cached per-host) and pass it via
DECNET_STRESS_TOKEN so locust users skip the login storm.
2. Locust exits non-zero on any request failure by default, causing
run_locust to throw away an otherwise valid stats CSV. Pass
--exit-code-on-error 0 so per-test assertions are the only fail gate.
3. test_stress_sustained ran two locust subprocesses against the same
uvicorn. Phase 1's keep-alive connections wedged phase 2 into 0
recorded requests ~2/3 of the time. Refactored stress_server into a
start_stress_server() context manager and gave each phase its own
uvicorn.
Stable 3/3 on full suite, 3/3 on test_stress_sustained alone.
Brings the federation-gossip columns on AttackerIdentity to life —
ja3_hashes, hassh_hashes, and the new tls_cert_sha256 — by projecting
the union of every member observation's fingerprints JSON onto the
identity at clusterer create / link / merge time.
- decnet/profiler/identity_rollup.py: pure extract_fp_summaries()
reads the production bounty shape (payload.fingerprint_type +
payload.{ja3,hash,cert_sha256}) and returns deduped+sorted JSON
list[str] per family, or None when a family has no signal so the
column stays NULL instead of '[]'.
- BaseRepository.update_identity_fingerprints + SQLModel impl: one
idempotent write that overwrites the three summary columns and
bumps updated_at.
- ConnectedComponentsClusterer: after every per-component
reconciliation (fresh-create OR existing-merge+link), recomputes
and writes the rollup for the target identity. Wrapped in a
best-effort helper so a write failure logs but never breaks the
tick.
- Tests: extract_fp_summaries unit (dedup, sort determinism,
unknown types ignored, malformed JSON, nested-stringified
payloads, non-string values); end-to-end clusterer ticks
populate the columns on create + on later observation links;
no-fingerprint clusters keep the columns NULL.
- FpCertificate renders the new cert_sha256 field (truncated, with
full hash on hover) and a FROM line carrying the prober-side
target_ip/port so the source is visible.
- tls_certificate payloads split on target_ip presence: prober certs
land under ACTIVE PROBES, sniffer certs under PASSIVE FINGERPRINTS.
Two synthetic fpType keys (tls_certificate_active /
tls_certificate_passive) drive the bucketing without disturbing
the on-the-wire fingerprint_type.
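A minimal sketch of the rollup shape described for extract_fp_summaries()
above, assuming the fingerprints have already been decoded into payload
dicts. The function name, the "ja3" and "hassh" fingerprint_type values
and the family-to-key mapping are illustrative guesses, not the
production code:

```python
import json
from typing import Iterable, Optional

# Assumed mapping: fingerprint_type -> (summary column, payload key).
_FAMILIES = {
    "ja3": ("ja3_hashes", "ja3"),
    "hassh": ("hassh_hashes", "hash"),
    "tls_certificate": ("tls_cert_sha256", "cert_sha256"),
}


def summarize_fingerprints(payloads: Iterable[object]) -> dict[str, Optional[str]]:
    """Return one JSON list[str] per family, or None when a family has no signal."""
    seen: dict[str, set[str]] = {col: set() for col, _ in _FAMILIES.values()}
    for payload in payloads:
        if not isinstance(payload, dict):
            continue  # malformed entries are skipped, never raised on
        ftype = payload.get("fingerprint_type")
        if not isinstance(ftype, str) or ftype not in _FAMILIES:
            continue  # unknown types are ignored
        column, key = _FAMILIES[ftype]
        value = payload.get(key)
        if isinstance(value, str) and value:
            seen[column].add(value)  # the set gives dedup for free
    # sorted() makes the JSON deterministic; None keeps the column NULL.
    return {col: json.dumps(sorted(vals)) if vals else None
            for col, vals in seen.items()}
```

Returning None rather than '[]' for an empty family is what lets the
column stay NULL, as the commit describes.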
JARM probes are crafted ClientHellos with weird ciphers — they never
complete a real handshake, so the peer cert isn't reachable from
those sockets. After a non-empty JARM hash proves the port speaks
TLS, do a separate ssl.wrap_socket() against the same (ip, port) to
fetch and parse the leaf cert.
- decnet/prober/tlscert.py: fetch + parse via cryptography lib;
swallows all connect/handshake/parse failures (returns None).
- decnet/prober/worker.py::_capture_tls_cert: emits a tls_certificate
event with subject_cn / issuer / SANs / validity / SHA-256 +
publishes on the bus. Wired from _jarm_phase only when JARM
succeeds, so non-TLS ports never trigger a second connect.
- Tests cover happy path, cert-fetch failure, defense-in-depth crash,
empty-JARM skip, publish_fn, and parser edge cases (garbage DER,
empty bytes, missing SAN extension, non-self-signed).
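The follow-up connect can be sketched with the standard library alone.
This is an illustration under assumptions, not the tlscert.py code: the
function name is hypothetical, it only computes the SHA-256 of the leaf
cert (no cryptography-based parsing), and it uses
SSLContext.wrap_socket because the module-level ssl.wrap_socket is no
longer available in recent Python releases:

```python
import hashlib
import socket
import ssl
from typing import Optional


def fetch_leaf_cert_sha256(ip: str, port: int, timeout: float = 3.0) -> Optional[str]:
    """Return the hex SHA-256 of the peer's leaf cert (DER), or None on any failure."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False       # must be disabled before CERT_NONE
    ctx.verify_mode = ssl.CERT_NONE  # we want the cert even if it is invalid
    try:
        with socket.create_connection((ip, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw) as tls:
                der = tls.getpeercert(binary_form=True)
    except (OSError, ValueError):    # ssl.SSLError is a subclass of OSError
        return None
    return hashlib.sha256(der).hexdigest() if der else None
```

Swallowing connect and handshake errors mirrors the "returns None"
contract described above, so a dead or non-TLS port never breaks the
probe cycle.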
Adds storage for TLS certificate details collected from attacker-run
servers by the active prober (sibling to the existing JARM probe).
- AttackerIdentity.tls_cert_sha256 / Campaign.tls_cert_sha256:
JSON list[str] columns mirroring ja3_hashes / hassh_hashes for
federation gossip.
- ingester clause 9b: emits a 'tls_certificate' fingerprint bounty
when a prober event carries subject_cn (disjoint from the existing
sniffer-gated clause).
- Prober-side capture (ssl.wrap_socket follow-up after JARM) and
profiler rollup land in sibling commits.
The check expects 405 for any HTTP method not declared on a path.
DECNET's topology router has a static `/topologies/services` (GET only)
sibling to a parameterized `/topologies/{topology_id}` (DELETE), so a
DELETE on the static path falls through to the parameterized route and
hits auth, which returns 401 — by design. Leaking 405-vs-401 would let
unauthenticated callers enumerate valid topology UUIDs.
The same shape applies to other static/dynamic sibling pairs across
the API. The check is fundamentally incompatible with that routing
strategy; document the omission inline.
Schemathesis fires up to 3000 examples per endpoint. POST /auth/login
caps at 10/5min per IP, so examples past the cap return 429 and
the positive_data_acceptance check flags it as RejectedPositiveData
(its allowed-status list is hardcoded in schemathesis to
2xx/401/403/404/409/5xx, so OpenAPI tweaks can't fix it).
DECNET_LIMITER_ENABLED=false exists for exactly this case (see
limiter.py docstring on stress/load testing).
Reverts the custom_openapi shim from 5d88346 / 9b1168c — the endpoint
already declares 429 in its responses= map (api_login.py:38), and the
shim turned out to address a problem that wasn't there. Drop the
companion test along with it.
Previous commit advertised 429 on every operation. Only routes
decorated with @limiter.limit can actually return slowapi's 429 —
currently just POST /api/v1/auth/login. Documenting it elsewhere is
dishonest and would mislead clients into expecting a response the
server cannot produce.
Walk slowapi's _route_limits / _dynamic_route_limits registries to
identify decorated endpoints, match them to FastAPI routes by
{module}.{name}, and only inject 429 on those.
Existing per-route 429 declarations (e.g. SSE connection-cap on
events streams via sse_limits) are untouched.
SlowAPI middleware can short-circuit any request with 429 once a
per-route or per-IP rate limit fires (e.g. POST /api/v1/auth/login is
capped at 10/5min). The OpenAPI spec did not declare 429 on any
operation, so schemathesis flagged legitimate rate-limit responses as
RejectedPositiveData / status-code-nonconformance failures.
Override app.openapi to inject a generic 429 response object on every
HTTP operation in the generated schema. Add a contract test that fails
if any operation drops the 429 advertisement.
- swarm/test_swarm_api, swarm/test_heartbeat: replace deprecated
asyncio.get_event_loop().run_until_complete() with asyncio.run();
the former raises on Python 3.11 once another test has set and closed a loop on
the main thread.
- prober/test_prober_bus, prober/test_prober_worker: extend tcp_fingerprint
mocks with tos/dscp/ecn/server_isn so the worker doesn't KeyError into
the prober_error branch.
- services/test_service_isolation: collector now retries on event-stream
errors instead of exiting; assert it stays running and cancel cleanly.
- live/test_imap_live, live/test_pop3_live: log format emits
outcome="failure", not "failed".
- live/test_service_isolation_live: is_service_container accepts label
OR state-name; rewrite the empty-state test against a synthetic
unlabeled container instead of the host's real fleet.
OpenSSH's native syslog ("Failed password", "Connection from",
"Connection closed by …") and the pam_unix lines emitted from sshd's
PAM stack add no signal beyond what auth-helper already captures as
structured login_attempt events. They cluttered the dashboard and
arrived without an SD wrapper, forcing prose-IP heuristics in the
collector.
Add a `:programname, isequal, "sshd" stop` rule above the forwarding
actions in /etc/rsyslog.d/50-journal-forward.conf. pam_unix lines from
sshd inherit programname=sshd so the same rule covers both. sudo /
login / su pam_unix lines keep flowing (different programname), so
post-login privilege escalation telemetry is preserved.
Native sshd and pam_unix lines route through rsyslog without the
relay@55555 SD wrapper and without key=value pairs, so attacker_ip
fell through to "Unknown". Add a prose-IP fallback to both parsers:
anchored patterns (from/rhost/client/src) win first so we never pick
the local listener in "Connection from X port Y on Z port 22", with
a bare-IPv4 scan as the last resort.
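The ordering described above (anchored patterns first, bare scan last)
might look like the following. The regexes and the function name are
assumptions for illustration; the real parsers may differ:

```python
import re
from typing import Optional

_IPV4 = r"(?:\d{1,3}\.){3}\d{1,3}"
# Anchored: the address must follow a keyword that marks the remote side.
_ANCHORED = re.compile(rf"\b(?:from|rhost|client|src)[=\s]+({_IPV4})\b", re.IGNORECASE)
# Last resort: any IPv4-looking token in the line.
_BARE = re.compile(rf"\b({_IPV4})\b")


def prose_ip(line: str) -> Optional[str]:
    """Best-effort attacker IP from an unstructured syslog line."""
    match = _ANCHORED.search(line) or _BARE.search(line)
    return match.group(1) if match else None
```

Because the anchored pattern is tried first, "Connection from X port Y
on Z port 22" resolves to X even though Z also looks like an address.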
The prober writes events with hostname=decnet-prober and target_ip=
<the attacker being fingerprinted>. The parser pulls target_ip into
attacker_ip (it's one of _IP_FIELDS), which is correct for indexing
fingerprints under the attacker — but it had a side effect: every
fingerprinted attacker had two distinct deckies on file (the real
decoy they touched + decnet-prober) and the correlation engine's
traversals() classified that as lateral movement. Live dashboard
showed bogus "dmz-gateway -> decnet-prober" paths and TRAVERSAL
badges on attackers who'd done nothing but knock on the front door.
The prober is internal infrastructure, not a hop. Filter the
"decnet-" namespace out of distinct-decky counts and hop paths in
the engine. Fingerprints stay attached to the attacker profile via
the existing per-IP event index — just no longer as traversal.
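A small sketch of the filter, assuming the engine sees an ordered list
of decky names per attacker and treats two or more distinct deckies as
a traversal. Both function names are hypothetical:

```python
from typing import Iterable

INTERNAL_PREFIX = "decnet-"  # namespace named in the text above


def hop_path(deckies_in_order: Iterable[str]) -> list[str]:
    """Ordered, de-duplicated decky path with internal infrastructure removed."""
    path: list[str] = []
    for name in deckies_in_order:
        if name.startswith(INTERNAL_PREFIX) or name in path:
            continue
        path.append(name)
    return path


def is_traversal(deckies_in_order: Iterable[str]) -> bool:
    # Assumption: lateral movement means at least two distinct real deckies.
    return len(hop_path(deckies_in_order)) >= 2
```

With this shape, ["dmz-gateway", "decnet-prober"] collapses to a
single-hop path and no longer earns a TRAVERSAL badge.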
Hit live on first VPS deploy: a window between the initial
client.containers.list() snapshot and the client.events() start-event
stream let topology service containers slip through, requiring an
operator restart for them to be picked up.
Two fixes:
* `_watch_events` now wraps the events() call in a retry loop with
exponential backoff (1s -> 30s cap). A docker.errors.APIError, daemon
reload, or SDK stream-decode hiccup used to make the executor task
return cleanly, leaving the collector "running" with no event
subscription. Future container starts were silently dropped until
the unit was restarted.
* New `_reconcile_loop` async task ticks every
DECNET_COLLECTOR_RECONCILE_S (default 30s), re-scans
client.containers.list(), and calls _spawn for any service container
not already in `active`. Belt to the event watcher's suspenders:
even if a start event is dropped during a reconnect window, the
reconciler picks it up within one cycle. Also prunes finished
futures from `active` so the dict is bounded by the current container
count rather than by churn over the agent's lifetime.
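The retry shape of the first fix can be sketched as below. The names
(backoff_delays, watch_with_retry, open_stream, max_restarts) are
illustrative only, and the doubling factor and reset-on-event behaviour
are assumptions; the commit only states the 1s start and 30s cap:

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def backoff_delays(initial: float = 1.0, cap: float = 30.0) -> Iterator[float]:
    """Yield 1, 2, 4, ... seconds, clamped at the cap from then on."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * 2, cap)


def watch_with_retry(
    open_stream: Callable[[], Iterable[dict]],
    handle: Callable[[dict], None],
    sleep: Callable[[float], None] = time.sleep,
    max_restarts: Optional[int] = None,  # None means retry forever
) -> None:
    delays = backoff_delays()
    restarts = 0
    while max_restarts is None or restarts <= max_restarts:
        try:
            for event in open_stream():
                handle(event)
                delays = backoff_delays()  # healthy stream: reset the schedule
        except Exception:
            pass  # a real collector would log the error here
        # Reached on error AND on a clean stream end, so the watcher
        # never returns quietly with no subscription.
        sleep(next(delays))
        restarts += 1
```

The key point is that a cleanly ended stream is treated like an error:
the loop resubscribes instead of returning.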
The module docstring teaches inline comments: `mode = master # or
"agent"` is the canonical example for the [decnet] section. Python's
configparser does not strip inline comments unless
inline_comment_prefixes is set explicitly, so the comment became part
of the value and downstream validators rejected it ("mode must be
'agent' or 'master', got 'master # or \"agent\"'").
Hit live on first VPS deploy: every CLI invocation crashed at import
time with a stack trace that didn't make it obvious the docstring's
example was the trigger. Now the parser does what the docs promise.
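The difference is easy to reproduce with the standard library. The
prefix tuple ("#", ";") is an assumption about what the fix passes:

```python
import configparser

SAMPLE = '[decnet]\nmode = master # or "agent"\n'

# Default parser: the inline comment stays in the value.
default = configparser.ConfigParser()
default.read_string(SAMPLE)

# With inline_comment_prefixes set, everything after " #" is dropped.
fixed = configparser.ConfigParser(inline_comment_prefixes=("#", ";"))
fixed.read_string(SAMPLE)

print(repr(default["decnet"]["mode"]))  # 'master # or "agent"'
print(repr(fixed["decnet"]["mode"]))    # 'master'
```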
Decky service containers join their base via `network_mode:
container:<base>` and Docker binds that share at service start time. If
`docker compose up` recreates a base (e.g. ports: changes after a
forwards_l3 toggle) but decides services are unchanged, services keep
a stale FD into the destroyed namespace and end up with only `lo` — so
external traffic hits a closed port on the live base and gets RST.
Hit live on the first VPS deploy: external SSH to the dmz-gateway was
refused while sshd was listening, because base and service netns
inodes had drifted apart. `--always-recreate-deps` makes compose
rebuild every dependent whenever its base is recreated, removing the
race entirely.
The dashboard's /api/* proxy hardcoded 127.0.0.1 as the target host.
That works when the API binds to a wildcard or to loopback, but
breaks the moment an operator binds the API to a specific address —
e.g. a Tailscale IP for tailnet-only deploys: the API stops listening
on loopback entirely and the proxy gets ECONNREFUSED on every request.
The web command now reads DECNET_API_HOST and falls back to loopback
only when the API is on a wildcard (0.0.0.0 / :: / unset). A new
--api-host flag overrides at the CLI level.
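The resolution order can be sketched as follows. The function name and
the exact precedence are assumptions drawn from the description above,
not the web command's actual code:

```python
import os
from typing import Mapping, Optional

_WILDCARDS = {"", "0.0.0.0", "::"}  # unset is treated like a wildcard


def proxy_target_host(
    cli_api_host: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Pick the host the dashboard proxy should dial for /api/*."""
    if cli_api_host:
        return cli_api_host  # the CLI flag wins
    bound = env.get("DECNET_API_HOST", "").strip()
    # A wildcard bind also listens on loopback, so loopback is safe there.
    return "127.0.0.1" if bound in _WILDCARDS else bound
```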
Without rotation, the syslog listener and per-host collector grow
/var/log/decnet/ without bound — a noisy attacker (or an active
probe storm) fills the disk in hours on a small VPS. New
deploy/logrotate.d/decnet caps at 7 daily rotations or 100 MiB,
whichever comes first, and uses copytruncate because the ingester
and forwarder hold the files open via Python and won't reopen on
a rename rotation.
Wire install / remove into `decnet init` and `decnet init --deinit`
alongside the existing tmpfiles.d / polkit handling.
Refuse to start decnet.web.api when DECNET_MODE=agent (unless the
operator explicitly opts into dual-role with DECNET_DISALLOW_MASTER=
false). The Typer CLI already hides master-only commands on agents,
but a misconfigured systemd unit or a direct uvicorn invocation
would bypass that — now the lifespan itself refuses, before any
worker, DB or bus comes up.
Resolve DECNET_JWT_SECRET eagerly at startup so a missing or known-
bad value fails at boot rather than on the first auth-gated request.
The lazy-load shape stays useful for non-master CLIs.
Add validate_public_binding() called from the master API lifespan: when
DECNET_API_HOST is non-loopback, refuse to start if DECNET_CORS_ORIGINS
still contains a loopback origin (catches the "operator flipped to
0.0.0.0 to make it work and forgot to update CORS" footgun) or if
DECNET_CANARY_HTTP_BASE is plaintext http:// to a non-loopback host.
Log CRITICAL when DECNET_LIMITER_ENABLED=false on a public binding.
The validator no-ops under pytest so unrelated suites don't trip on it.
Add DECNET_VERIFY_HOSTNAME env knob; AgentClient and UpdaterClient
consult it when verify_hostname is None, giving production deploys
TLS hostname verification on top of the existing CA + fingerprint pin.
Default off so dev enrollments with mismatched SANs keep working.
Reject symlinks, hardlinks, device nodes and FIFOs in update tarballs;
validate each member's resolved path stays under dest after symlink
resolution; cap uncompressed size at 256 MiB to bound gzip-bomb damage;
strip setuid/setgid bits from extracted modes.
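A sketch of the member checks using only tarfile from the standard
library. The function name and the choice to raise ValueError are
assumptions; the executor's real validation may be structured
differently:

```python
import io
import os
import tarfile

MAX_UNCOMPRESSED = 256 * 1024 * 1024  # 256 MiB cap from the text above


def safe_members(tar: tarfile.TarFile, dest: str) -> list[tarfile.TarInfo]:
    """Validate every member; raise ValueError on anything suspicious."""
    root = os.path.realpath(dest)
    total = 0
    checked = []
    for member in tar.getmembers():
        if not (member.isfile() or member.isdir()):
            # symlinks, hardlinks, device nodes and FIFOs all land here
            raise ValueError(f"refusing non-regular member: {member.name}")
        target = os.path.realpath(os.path.join(root, member.name))
        if os.path.commonpath([root, target]) != root:
            raise ValueError(f"path escapes destination: {member.name}")
        total += member.size
        if total > MAX_UNCOMPRESSED:
            raise ValueError("archive exceeds uncompressed size cap")
        member.mode &= ~0o6000  # strip setuid / setgid bits
        checked.append(member)
    return checked
```

A caller would then pass the returned list as the members argument of
TarFile.extractall so that only vetted entries reach the disk.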
Add an optional sha256 form field to /update and /update-self; the
master client computes and sends it on every push, the executor
refuses to extract on mismatch. mTLS already authenticates the
master, so this is defence-in-depth against in-transit corruption
and gives operators a way to pin "exactly these bytes" for vetted
releases.
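The digest check itself is small. A sketch, with a hypothetical
function name and the assumption that an absent field skips the check
(the field is described as optional):

```python
import hashlib
import hmac
from typing import Optional


def digest_matches(tarball: bytes, expected_hex: Optional[str]) -> bool:
    """True when no digest was supplied or when it matches the tarball bytes."""
    if expected_hex is None:
        return True  # field is optional; nothing to compare against
    actual = hashlib.sha256(tarball).hexdigest()
    supplied = expected_hex.strip().lower()
    # Constant-time comparison; encode so non-ASCII input cannot raise.
    return hmac.compare_digest(actual.encode(), supplied.encode())
```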
Both pages now layer on DeckyFleet.css + PersonaGeneration.css and use
the project's house vocabulary — fleet-root shell, page-header with
title-group + actions, btn / btn.violet / btn.ghost, info-banner with
the violet left rule, and the dim/matrix/alert text accents.
RealismConfig: inputs are flush-styled weight-input fields with a
violet focus ring; section heads carry a TOTAL badge; canary rows get
the project's amber accent; canary probability lives in a panel-bordered
slider row.
SyntheticFiles: the inline-styled table is now a styled .files-table
with the standard hover affordance, the filter-row uses tweak-group
label+select pairs, the drawer carries .drawer-eyebrow / .drawer-title
/ .meta-grid in the same style as the canary token drawer, and pager
buttons share the .btn.ghost.small treatment.
No behavioural change.
Single source of truth in decnet_web/src/realism/labels.ts: maps each
ContentClass enum value to a friendly display name ("Note",
"Cron Log", "Canary · AWS Credentials", …). Used by RealismConfig
(weight tables + class filter dropdown) and SyntheticFiles (table row
+ drawer detail).
Canary classes get a subtle amber accent so the dashboard's read of
"this row is callback-bearing" doesn't depend on prefix-spotting in
mono text. Raw enum value still appears in dim mono next to the label
so an operator copy/pasting from logs or grepping the codebase still
finds it.
No backend change: the wire shape is still the snake_case enum; the
beautification is render-time only.
New /realism-config page sits next to Persona Generation and
Synthetic Files under the Automation nav. Editable weight tables for
user / system / canary content classes (with live percent share),
plus a slider for canary_probability.
Wires GET/PUT /api/v1/realism/config — viewer can read; admin
required to save. Validation errors from the API are surfaced inline
rather than swallowed; the SAVE button refreshes from the server's
canonical snapshot so the operator sees exactly what landed (matters
because cross-list entries are silently dropped server-side).
New realism_config table (uuid PK + unique key) + two repo methods
(get/set) backs an admin-only GET/PUT /api/v1/realism/config surface.
The planner now exposes apply_payload(payload) / current_payload() /
reset_to_defaults() and reads its weights through mutable module
globals; pick() resolves the live values each call. Validation
catches negative weights, zero totals, out-of-range canary_probability,
unknown content_class names, and silently drops cross-list entries
(canary class on the user list, etc).
The orchestrator worker calls _refresh_realism_config(repo) on
startup and every 5 ticks (~5min at 60s interval). Operator changes
land within one refresh window with no bus signal — the simpler path
for a knob whose latency tolerance is minutes.
The (decky_uuid VARCHAR(64), path VARCHAR(1024)) UNIQUE constraint
generated a 4352-byte composite key under utf8mb4 (4 bytes/char),
busting MySQL's 3072-byte cap and crashing decnet api on init with:
Specified key was too long; max key length is 3072 bytes
Tighten path to VARCHAR(512) — (64+512)*4 = 2304 bytes, well under
the cap. Real realism + canary placement paths are short
(/home/<persona>/Documents/<file>, ~70 chars); 512 keeps headroom
without the index hassle. Pre-v1, no migration helper.
Adds a regression test pinning the (decky_uuid + path) byte budget so
a future widening fails loudly in CI rather than at MySQL deploy
time.
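The byte budget above reduces to a one-line calculation. The helper
name is illustrative and not taken from the regression test:

```python
UTF8MB4_BYTES_PER_CHAR = 4
MYSQL_MAX_KEY_BYTES = 3072


def key_bytes(*varchar_lengths: int) -> int:
    """Worst-case index size for a composite key of utf8mb4 VARCHAR columns."""
    return sum(varchar_lengths) * UTF8MB4_BYTES_PER_CHAR


old = key_bytes(64, 1024)  # 4352, over the cap
new = key_bytes(64, 512)   # 2304, under the cap
```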
Surfaces realism subsystem state on the existing worker heartbeat
extra hook (system.orchestrator.health) — no new bus topic. Payload
carries {llm_enabled, llm_backend, llm_model, llm_breaker_state}, so
the dashboard's worker panel renders a live LLM badge with a colored
breaker-state dot:
closed (green) — LLM healthy
half_open (amber) — cooldown elapsed; next call is a probe
open (red) — short-circuiting to deterministic templates
Heartbeat is the canonical worker self-report channel; piggybacking on
extra(...) avoids a new topic family while keeping the snapshot
recomputed each beat (30s).
New /synthetic-files page sits next to Persona Generation and Canary
Tokens under the Automation nav group. Operators get a paginated
inventory of files realism has grown across the fleet (decky, path,
persona, content_class, last_modified, edit_count, hash) with filters
on decky / persona / content_class.
Decky filter is a dropdown sourced from /deckies — never free text.
Row click opens a drawer with the body preview; the drawer surfaces a
TRUNCATED chip when the stored body is at the 64KB cap.
Adds GET /api/v1/realism/synthetic-files (paginated list, filters by
decky_uuid, persona, content_class) and
GET /api/v1/realism/synthetic-files/{uuid} (single row with last_body
and a truncated:bool flag set when the stored body is at the 64KB cap).
Repo gains count_synthetic_files() and get_synthetic_file(uuid). The
list view drops last_body to keep the wire payload bounded; the detail
endpoint is the only path that returns it. Read-only — orchestrator
remains the sole writer.
FileAction and EditAction both write kind="file" — the discriminator
is action="file:create" vs "file:edit". The dashboard timeline used
to render both identically; now an EDIT sub-chip surfaces edits without
widening the kind enum (which doubles as the bus topic family).
No schema or API change. Polish only.
decnet/canary/cultivator wrote kind="http" for every cultivated
token, even DNS-trip ones (ssh_key, mysql_dump) and passive bait
(aws_creds). The canary worker uses kind to route attacker callbacks
to the right token; a misaligned kind means a real DNS resolution of
ssh_key or mysql_dump never attributes to the planted slug.
Add _GENERATOR_TO_KIND aligned with CanaryKind in models/canary.py
and look it up at create_canary_token time.
decnet/realism/naming._home and decnet/canary/cultivator._persona_login
both normalised "John Smith"→"johnsmith" with identical logic. Lift
to decnet.realism.personas.login_for(persona) and have both consumers
import it. Drift between the two would have left canary placement and
realism path naming using different login derivations.
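For illustration, one normalisation that yields "John Smith" ->
"johnsmith" is shown below. The rule (lowercase, keep only a-z and 0-9)
and the function name are assumptions; the shared helper's actual
logic is not reproduced here:

```python
import re


def login_from_name(persona_name: str) -> str:
    """Derive a login-style token from a display name (assumed rule)."""
    return re.sub(r"[^a-z0-9]", "", persona_name.lower())
```

Whatever the real rule is, keeping it in one function is what
guarantees that /home/<login> paths and canary placements agree.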
The orchestrator worker clipped last_body at write time, but the repo
did not enforce the cap. A future caller that forgot the clip would
write the full body. Move the clip to record_synthetic_file and
update_synthetic_file via SYNTHETIC_FILE_BODY_LIMIT in
decnet/web/db/models/realism.py. Worker now passes the full body and
trusts the repo. Tests retargeted to assert repo enforcement.
Four gaps from the realism migration plan, plus one flaky test
fixed.
Added:
- tests/deploy/test_orchestrator_unit.py — replaces the dead
test_emailgen_unit.py. Asserts:
* decnet-orchestrator.service.j2 carries the DECNET_REALISM_*
env block (LLM, MODEL, TIMEOUT, PERSONAS) so per-host tuning
works without editing the .j2.
* Legacy DECNET_EMAILGEN_* vars are NOT referenced — clean break
contract from stage 5.
* decnet.target wants orchestrator + canary, does NOT want
decnet-emailgen.service. Anti-regression for service-collapse.
* deploy/decnet-emailgen.service.j2 stays deleted.
- tests/orchestrator/test_worker_integration.py — new
test_one_tick_email_branch_records_orchestrator_email. Pins the
action-roll to email, seeds a topology with an IMAP mail decky +
two personas, stubs LLM + docker-exec write paths, verifies an
orchestrator_emails row + bus event land. Restores end-to-end
email coverage that was lost when the pre-collapse
test_worker_integration.py was deleted.
- tests/realism/test_synthetic_files_truncation.py — pins the 64KB
last_body cap on create + edit, and documents the consequence:
edit candidates carry a truncated snapshot of files that exceeded
the cap. If a future change lifts the cap, _LIMIT in the test
must lift with it.
Fixed flaky:
- tests/orchestrator/test_scheduler.py — two pick_file tests
pinned to random.Random(1). Without a seed, the 3% canary gate
(stage 7) and 10% leave-alone roll occasionally flaked the
assertions because the _FakeRepo doesn't carry a
create_canary_token method.
Note: the existing
test_realism_subprocess_import_personas_rejects_in_agent_mode
already covers agent-mode rejection of decnet realism
import-personas; no new gating test needed.
Stage 7 — final stage of the realism migration. Canary plants are
now scheduled by the same realism planner that handles inert content,
keeping the orchestrator as the single decision point and avoiding
duplicate diurnal / persona / rate-limit logic in the canary
subsystem.
New surface:
- decnet/canary/cultivator.py: cultivate(plan, repo) builds a
CanaryContext, calls the right generator (canary_aws_creds ->
aws_creds, canary_mysql_dump -> mysql_dump, …), persists the
canary_tokens row before plant so the canary worker can attribute
callbacks even on plant-time previews. Resolves canary placements
to credible operator paths (~/.aws/credentials, ~/.ssh/id_rsa,
/var/backups/db_backup.sql).
- realism/planner.py adds 8 canary content_classes uniformly weighted
inside a 3% probability gate. Hard-capped: each tick at most one
canary; create branch falls through to inert otherwise.
- scheduler.pick_file dispatches canary content_class to the
cultivator; FileAction grows an optional content_bytes field so
binary canary artifacts (DOCX/PDF/honeydoc) survive the wire
intact instead of being utf-8 round-tripped.
- SSHDriver._run_file uses content_bytes when set, falls back to
encoding the str content otherwise.
Stealth (per feedback_stealth.md): cultivator does not introduce
any DECNET literal; the underlying generators are already
stealth-clean and the test suite asserts the contract holds.
Tests cover round-tripping every canary class through the cultivator,
verifying placement-path conventions, persona-login normalisation
("John Smith" -> /home/johnsmith/.aws/credentials), and the
no-DECNET-leak invariant.
Stage 6 of the realism migration. User-class file bodies (note,
todo, draft, script) optionally get LLM-authored content; system
classes (cron / daemon logs, /tmp caches) stay template-only because
formulaic *is* the right look for them.
New surface:
- realism.llm.circuit.LLMCircuitBreaker — process-local sliding-window
breaker. 3 consecutive failures trip open; 60s cooldown to half-open;
half-open success closes, failure re-opens. Protects the orchestrator
tick from sustained Ollama wedges (per-call timeout already covers
one-shot hangs).
- realism.prompts._style — em-dash suppression lifted from the
email prompt. Persona.uses_llms_heavily opts out per the
feedback_em_dash_llm_tell.md memory. Includes strip_em_dashes
belt-and-braces sub for output that slipped past the prompt rule.
- realism.prompts.filebody — class-conditioned prompts (note / todo
/ draft / script) with persona context, language pinning, output
shape rule.
- realism.bodies.make_body_with_llm — async wrapper around make_body
that calls the LLM when one is provided AND the breaker allows.
Falls back to template on timeout / error / empty / system-class.
Wiring:
- scheduler.pick_file accepts optional llm + llm_breaker + llm_timeout.
When the planner picks a create action and the content_class is a
user-class, the body_hint is replaced with the LLM-authored body
(or falls back to the deterministic body_hint).
- orchestrator.worker constructs get_llm() at startup gated by
DECNET_REALISM_LLM env var (any non-empty value enables; empty /
"off" / "none" / "0" disables). Passes llm + breaker through every
tick.
- decnet orchestrate gains --llm/--no-llm flag overriding the env var.
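The breaker transitions listed above (3 consecutive failures open it,
60s cooldown to half-open, probe success closes, probe failure
re-opens) can be sketched as a small state machine. This is a
simplified consecutive-failure counter with hypothetical names, not the
LLMCircuitBreaker implementation:

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(self, threshold: int = 3, cooldown: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown:
            return "half_open"  # cooldown elapsed; next call is a probe
        return "open"

    def allow(self) -> bool:
        return self.state != "open"

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None  # closes from any state

    def record_failure(self) -> None:
        if self.state == "half_open":
            self._opened_at = self._clock()  # probe failed: re-open
            return
        self._failures += 1
        if self._failures >= self.threshold:
            self._opened_at = self._clock()
```

A caller checks allow() before the LLM call and falls back to the
template body when it returns False.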
Stage 3b of the realism migration. A TODO.md planted on Monday gets a
checkbox flipped on Tuesday; a notes file grows a follow-up line; a
cron log gets a fresh entry tacked on. The synthetic_files row's
edit_count, last_modified, and content_hash advance.
New surface:
- EditAction dataclass (peer of FileAction in scheduler.py): carries
decky, path, persona, content_class, previous_body, mtime, and
synthetic_file_uuid for the worker's update path.
- realism.bodies.next_iteration(cls, persona, prev, rng): per-class
deterministic mutators. TODO flips an unchecked box and/or appends;
notes/drafts/scripts append; logs are append-only (mirroring real
log behaviour). Canary, cache_tmp, email raise KeyError —
unsupported.
- realism.planner.pick gains an edit branch: 60% create, 30% edit
(when an edit_candidate is supplied), 10% leave-alone. Returns
None on leave-alone — quiet ticks are realism too.
- scheduler.pick_file pre-fetches a single edit candidate via
repo.pick_random_synthetic_file_for_edit ~50% of ticks; the
planner decides whether to use it.
- SSHDriver._run_edit: turns next_iteration output into a
plant_file call (mtime-bumped, mode 0o644). Stashes new_body in
result.payload so the worker can hash it for synthetic_files.
- worker._bump_synthetic_file_after_edit: patches edit_count + 1,
last_modified=now, content_hash, last_body for the row UUID.
No-op when the row was pruned mid-flight.
- events.to_row / topic_for / event_type_for now recognise
EditAction (kind="file", action="file:edit").
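The 60/30/10 split in the planner can be sketched with a single roll.
The function name is hypothetical, and what happens to the edit slice
when no candidate was supplied is an assumption (here it falls back to
create):

```python
import random
from typing import Optional


def roll_action(rng: random.Random, has_edit_candidate: bool) -> Optional[str]:
    """Return "create", "edit", or None for a leave-alone tick."""
    r = rng.random()
    if r < 0.60:
        return "create"
    if r < 0.90:
        # Assumption: without a candidate the edit slice becomes a create.
        return "edit" if has_edit_candidate else "create"
    return None  # quiet tick
```

Passing the rng in explicitly is what makes the split testable with a
seeded random.Random.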
Stage 3 of the realism migration. Replaces orchestrator/scheduler.py's
hardcoded _FILE_TEMPLATES/_USERS (3 templates emitting epoch-suffixed
filenames like notes-1777315854.txt with identical bodies per
template) with a persona-driven realism engine.
New surface:
- SyntheticFile SQLModel (synthetic_files table, UNIQUE on
decky_uuid+path) — per-(decky, path) state for the future
edit-in-place flow. Pre-v1, no _migrate_* helper.
- BaseRepository methods: record_synthetic_file,
update_synthetic_file, list_synthetic_files,
pick_random_synthetic_file_for_edit (used by stage 3b).
- realism/naming.py: per-content-class filename templates,
persona-conditioned. /var/log/cron.log + logrotate skeleton for
system-class; /home/<persona>/TODO.md, scratch.md, etc. for
user-class. Anti-regression test pins "no 8+ digit decimals in
basenames" (the realism failure today).
- realism/bodies.py: deterministic body templates per content_class.
TODO body uses checkbox markdown, script body has a shebang, cron
body matches syslog cron shape ("CRON[PID]: (user) CMD (...)").
- realism/planner.py: pick(deckies, now, rng) returns a Plan.
Diurnal-gated, weighted user/system content split (70/30 user
bias). Create-only in stage 3; edit branch lands in stage 3b.
Scheduler split:
- scheduler.pick is now traffic-only (sync).
- scheduler.pick_file is async, takes a repo, resolves personas
(Topology.email_personas for topology-source deckies; global
realism.personas_pool otherwise), and maps Plan -> FileAction.
- FileAction gains persona/content_class/mtime fields.
Worker:
- _one_tick rolls 50/50 between traffic and file each tick. After a
successful FileAction plant, _record_synthetic_file persists or
patches the synthetic_files row (catching the unique-constraint
collision on re-plant of the same path).
- SSHDriver._run_file passes action.mtime through to plant_file so
files don't all stamp at wall-clock-now.
Stage 4 of the realism migration. Lifts the driver Protocol into a
proper ABC with default plant_file/read_file methods (raise
NotImplementedError), and adds get_driver_for(action) so the
orchestrator worker can dispatch by action shape without isinstance
chains.
SSHDriver now inherits ActivityDriver and implements:
- plant_file: streams base64 via stdin (ARG_MAX-safe, mirrors
decnet.canary.planter; commit c17b9e0). Honours mtime via touch -d
so realism-planned files don't all stamp at wall-clock-now.
- read_file: docker exec cat with FileNotFoundError on rc=1, used by
the upcoming EditAction (stage 3b).
EmailDriver inherits ActivityDriver. Driver alias kept for back-compat
during the migration; removed once realism stages 5-7 land.
Empty subpackage skeleton for the realism migration: ContentClass enum
(file/email/canary content categories), Plan dataclass (frozen, with
edit-action invariant), in_work_hours window check (wrap-around
supported, fail-open on parse error), and sample_mtime for backdated
file timestamps that snap into a persona's active hours.
Stage 1 of the orchestrator+canary realism unification — no
production caller wired yet; planner.pick is a stub returning None
until stage 3.
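A sketch of a wrap-around window check that fails open, assuming the
window is written as "HH:MM-HH:MM". That format and the signature are
illustrative guesses; only the wrap-around and fail-open behaviour come
from the description above:

```python
from datetime import datetime, time


def in_work_hours(now: datetime, window: str) -> bool:
    """True when now falls inside the window; unparseable windows fail open."""
    try:
        start_s, end_s = window.split("-")
        start = time.fromisoformat(start_s.strip())
        end = time.fromisoformat(end_s.strip())
    except (ValueError, AttributeError):
        return True  # fail open: a bad config must not silence the persona
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end  # window wraps past midnight
```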
Mirrors the Canarytokens.org trick: a base64-wrapped CHANGE REPLICATION
SOURCE TO + START REPLICA block in the dump trailer. Importing the
file into MySQL resolves <slug>.<dns_zone> (DNS trip) and opens a 3306
replica handshake whose SOURCE_USER smuggles @@hostname and
@@lc_time_names of the victim DB.
DNS lookup alone is sufficient for detection via the existing canary
dns_server; capturing the smuggled metadata via a 3306 handshake
responder is a follow-up.
honeydoc previously emitted HTML only — operators picking 'Document'
out of the dropdown got a .html file dropped at /Documents/
quarterly_report.docx, which any attacker would clock the moment they
ran 'file' on it.
Two new generators that emit the real artifact format:
- honeydoc_docx: stdlib zipfile only. Builds a minimal but valid
Office Open XML zip with the same Q3 review body as the HTML
flavor and an external-image relationship pointing at the
callback URL — same trick the operator-upload DOCX instrumenter
uses, fetched on document open by Word and LibreOffice. Reuses
_drawing() and _next_rid() from instrumenters/docx.py to keep
the body/relationships shape identical between synthesised and
instrumented files.
- honeydoc_pdf: pikepdf-backed. One-page PDF in the 14 base fonts
(Helvetica, no font embedding), realistic body, /OpenAction /URI
on the catalog so most viewers fire the callback on document
open. Falls back to a clear error if pikepdf is missing so the
operator can switch to honeydoc / honeydoc_docx.
Default placement paths now reflect each generator's true extension
(.html / .docx / .pdf) so the UI suggests something sensible. Both
generators surfaced in the New Token modal's generator dropdown.
Real-world plant() crashed with OSError [Errno 7] Argument list too
long when an artifact (honeydoc HTML / DOCX / PDF) base64-encoded
into the sh -c script body exceeded the kernel's argv limit (typically
128KB-2MB depending on the host).
Fix: keep the script trivial ('mkdir -p ... && base64 -d > path && ...')
and stream the encoded bytes through 'docker exec -i ... sh -c'
stdin instead. _run() grew an optional stdin_bytes parameter that's
piped into proc.communicate(input=...). The stdin path covers
arbitrarily large artifacts.
Tests updated:
- test_plant_argv_and_base64_round_trip now asserts the docker -i
flag is present and the base64 payload reaches stdin (and notably
is NOT in the script body).
- _FakeProc.communicate accepts input=None across the board so the
patched fast path no longer trips on the new kwarg.
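A sketch of the stdin-streaming shape described in this fix. build_plant_command is a hypothetical helper, not the real planter API; it only shows that the script body stays small while the payload travels separately as stdin bytes.

```python
import base64
import posixpath
import shlex


def build_plant_command(container: str, path: str, payload: bytes,
                        mode: str = "0644") -> tuple[list[str], bytes]:
    """Return (argv, stdin_bytes) for a stdin-streamed plant.

    Hypothetical helper: the real planter may compose its script differently.
    """
    quoted_path = shlex.quote(path)
    quoted_dir = shlex.quote(posixpath.dirname(path) or ".")
    # The script never contains the payload, so argv stays tiny.
    script = (
        f"mkdir -p {quoted_dir} && base64 -d > {quoted_path} "
        f"&& chmod {shlex.quote(mode)} {quoted_path}"
    )
    argv = ["docker", "exec", "-i", container, "sh", "-c", script]
    return argv, base64.b64encode(payload)
```

The returned pair can be handed to subprocess.run(argv, input=stdin_bytes) or to Popen.communicate(input=stdin_bytes); the encoded artifact never touches the argument list, so its size is no longer bounded by the kernel's argv limit.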
Fetches GET /deckies on page load and feeds the running fleet into
the create modal as a <select>. Falls back to an empty-state hint
('No deckies running. Deploy a fleet first.') when the list is
empty so the operator isn't staring at an unusable form. Default
selection is the first decky returned.
Switches the page header to the standard .fleet-root .page-header /
.page-title-group / h1 / .page-sub / .actions pattern used by every
other top-level page. Drops the redundant AUTOMATION supertitle (the
sidebar group already labels that) and the inline Target icon next
to the title. Action buttons use the project's btn / btn violet
classes for visual parity with ADD PERSONA / BULK UPLOAD.
Worker unit mirrors decnet-webhook.service shape: simple type, runs
as the decnet user/group, append-style log file, full security
hardening (NoNewPrivileges/ProtectSystem/ProtectHome/PrivateTmp/
LockPersonality + the rest). Added /var/lib/decnet to ReadWritePaths
because the API process persists operator-uploaded canary blobs there.
CAP_NET_BIND_SERVICE granted (ambient + bounded) so an operator who
overrides DECNET_CANARY_DNS_PORT to 53 or HTTP_PORT to 80/443 in
.env.local doesn't need to fight systemd. The defaults stay
unprivileged (5353 / 8088).
Added decnet-canary.service to decnet.target so 'systemctl start
decnet.target' brings it up alongside the rest of the workers.
decnet init auto-discovers deploy/decnet-*.service.j2 files (per
decnet/cli/init.py:_install_units) so no further wiring needed —
running 'decnet init' on a fresh host installs the new unit.
Static tests confirm the unit references decnet canary, depends on
the bus, carries the standard security directives, and is listed
in the master target.
Hooks decnet.canary.planter.seed_baseline into the deploy() flow's
fleet-mirror step. After upserting a FleetDecky as 'running' we seed
the configured baseline canary set on the freshly-deployed decky.
Persona detection: read d.nmap_os (Windows -> windows path-mapping,
otherwise linux). Failures are logged and surface as state=failed
rows in the UI; the deploy itself MUST NOT abort (resilience
principle in CLAUDE.md).
Tests confirm:
- seed_baseline produces one row per configured generator per decky;
- the deployer source wires seed_baseline inside a try/except so a
failure can't abort the deploy.
Two sub-routers under /api/v1/canary:
blobs (operator-uploaded artifacts, deduped by sha256):
- POST /blobs (multipart upload; admin)
- GET /blobs (list with token_count; admin)
- DELETE /blobs/{uuid} (refcount-aware; 409 when referenced; admin)
tokens (per-decky planted artifacts):
- POST /tokens (generate or instrument + plant; admin)
- GET /tokens?decky_name=&kind=&state= (filter; viewer)
- GET /tokens/{uuid} (detail; viewer)
- GET /tokens/{uuid}/preview (instrumented bytes; admin)
- GET /tokens/{uuid}/triggers (paged callback log; viewer)
- DELETE /tokens/{uuid} (revoke + bus event; admin)
XOR validation: exactly one of blob_uuid / generator must be set.
Path validation rejects relative paths, NUL bytes, newlines, and '..'
segments. Every body-bearing route documents 400 plus 401/403/404 as
applicable.
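The two validation rules can be sketched as plain functions. Both names are hypothetical, and the path check assumes POSIX-style paths only; the real validator also has to cope with Windows-shaped persona paths.

```python
from typing import Optional


def exactly_one_source(blob_uuid: Optional[str], generator: Optional[str]) -> bool:
    """XOR check: True only when exactly one of the two fields is set."""
    return (blob_uuid is None) != (generator is None)


def placement_path_problems(path: str) -> list[str]:
    """Return the reasons a placement path is rejected (empty list = valid)."""
    problems = []
    if not path.startswith("/"):
        problems.append("relative path")
    if "\x00" in path:
        problems.append("NUL byte")
    if "\n" in path or "\r" in path:
        problems.append("newline")
    # Compare whole segments so a name like "..hidden" is still allowed.
    if ".." in path.split("/"):
        problems.append("'..' segment")
    return problems
```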
Stdlib MIME sniffer (no python-magic dep) covers PNG/JPEG/GIF/PDF/
HTML/XML/DOCX/XLSX/JSON/YAML/TOML/text/plain; everything else falls
through to passthrough.
Tests run end-to-end through the live FastAPI app (planter docker
exec is patched); 17 cases covering dedup, refcount, lifecycle,
XOR validation, path validation, and 404 paths.
decnet canary launches the HTTP + DNS callback receiver via
decnet.canary.worker.run. Mirrors the shape of decnet webhook
(typer command with --daemon flag, asyncio.run in the foreground).
Deliberately NOT added to MASTER_ONLY_COMMANDS — every host that
hosts deckies runs its own canary worker, and the bus events stay
local to that host (per-host webhook fanout handles SIEM egress).
decnet canary worker hosts both callback surfaces in one process:
- HTTP: a tiny FastAPI app on its own port (default 8088). The only
meaningful route is GET /c/{slug} which looks up the slug, persists
a CanaryTrigger, publishes canary.<id>.triggered, and returns a 1x1
transparent GIF. Unknown slugs return the same response (stealth);
no decnet strings leak in headers/banners; docs/openapi/redoc are
disabled. X-Forwarded-For is honored.
- DNS: an authoritative UDP server for *.<canary_zone> using
asyncio.DatagramProtocol with stdlib-only DNS wire-format parsing
(no dnslib dep). Same lookup -> persist -> publish flow, plus a
sinkhole A record (192.0.2.1) so the attacker's resolver doesn't
loop on NXDOMAIN. Single-label slugs only; multi-label probes
return NXDOMAIN. Pointer loops in malformed queries are caught
(10-hop cap) so an adversarial packet can't wedge the parser.
Tests cover both surfaces without privileged sockets:
- HTTP via Starlette TestClient: known/unknown slug, headers, XFF,
stealth-string assertions.
- DNS via direct DatagramProtocol drive: known slug -> ANSWER,
unknown -> NXDOMAIN, pointer-loop -> ValueError, malformed
packet -> silent drop.
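A stdlib-only sketch of the name parsing with a pointer-hop cap, in the spirit of the DNS surface above. read_qname and qname_or_none are illustrative names, not the worker's actual functions; the 10-hop limit is taken from the description.

```python
from typing import Optional

MAX_POINTER_HOPS = 10


def read_qname(packet: bytes, offset: int = 12) -> str:
    """Decode the DNS name at `offset`, following compression pointers."""
    labels: list[str] = []
    hops = 0
    while True:
        if offset >= len(packet):
            raise ValueError("name runs past end of packet")
        length = packet[offset]
        if length == 0:
            break
        if (length & 0xC0) == 0xC0:
            if offset + 1 >= len(packet):
                raise ValueError("truncated compression pointer")
            hops += 1
            if hops > MAX_POINTER_HOPS:
                raise ValueError("compression pointer loop")
            offset = ((length & 0x3F) << 8) | packet[offset + 1]
            continue
        if length & 0xC0:
            raise ValueError("reserved label type")
        end = offset + 1 + length
        if end > len(packet):
            raise ValueError("label runs past end of packet")
        labels.append(packet[offset + 1:end].decode("ascii", errors="replace"))
        offset = end
    return ".".join(labels)


def qname_or_none(packet: bytes) -> Optional[str]:
    """Silent-drop wrapper: malformed packets yield None instead of raising."""
    try:
        return read_qname(packet)
    except ValueError:
        return None
```

Counting hops rather than tracking visited offsets keeps the parser allocation-free on the hot path while still bounding the work an adversarial packet can cause.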
Plant / revoke / seed_baseline using the same docker-exec-with-sh-c
pattern proven by decnet/orchestrator/drivers/ssh.py:_run_file.
Each plant call composes a single sh script:
mkdir -p <dirname> && printf %s <base64> | base64 -d > <path> &&
chmod <mode> <path> && touch -d @<mtime> <path>
Base64-on-the-host / decode-in-the-container keeps binary artifacts
(DOCX/PDF/PNG) safe across the argv boundary; the placement_path,
mode, and mtime are shlex-quoted.
State transitions hit the repo: planted -> failed on docker error
with stderr captured into last_error. Bus events fire on success
(canary.<id>.placed) and on revoke (canary.<id>.revoked) — wrapped
in try/except so a downed bus never blocks a placement.
seed_baseline(decky_name, repo) is the deploy-hook entry point —
reads DECNET_CANARY_BASELINE (default git_config,env_file,honeydoc,
aws_creds), persists one row per generator, plants each. Failed
placements are logged but do NOT abort; the deployer hook treats
the return list as informational.
Seven instrumenters that mutate operator-supplied artifacts to
embed the callback URL:
- passthrough — bytes unchanged; only DNS-callback tokens trip
detection, with the slug embedded in the placement path
- plain — substitutes {{CANARY_URL}}/{{CANARY_HOST}} placeholders;
falls back to appending a comment line whose prefix adapts to the
apparent file syntax (#, //, ;)
- html — injects a 1x1 tracking pixel before </body>, appends
if the close tag is missing
- docx — direct zipfile manipulation (no python-docx dep):
inserts an external-image Relationship into word/_rels/document.xml.rels
and a matching <w:drawing> element before </w:body>
- xlsx — sibling of docx; injects an external-image relationship
into xl/_rels/workbook.xml.rels (orphan rels are still fetched on
open by most viewers)
- pdf — uses pikepdf to install /OpenAction /URI on the catalog;
rejects with a clear message when pikepdf isn't installed
- image — uses Pillow to embed slug + URL in PNG tEXt / JPEG
comment; rejects with a clear message when Pillow isn't installed
DOCX and XLSX share the rId allocator + relationship injector via
the docx module; both work on stdlib zipfile only.
Tests synthesise minimal real DOCX/XLSX fixtures inline, round-trip
each instrumenter, and assert the callback URL ends up in the
mutated bytes while the file still parses.
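As an illustration of the html instrumenter's rule (inject before </body>, append when the close tag is missing), a minimal sketch follows. inject_pixel and the exact <img> attributes are assumptions, not the shipped implementation.

```python
import html
import re

_BODY_CLOSE = re.compile(r"</body\s*>", re.IGNORECASE)


def inject_pixel(document: str, callback_url: str) -> str:
    """Insert a 1x1 tracking pixel before </body>, or append it."""
    src = html.escape(callback_url, quote=True)
    pixel = f'<img src="{src}" width="1" height="1" alt="">'
    match = _BODY_CLOSE.search(document)
    if match is None:
        # No close tag: appending still lets lenient renderers fetch it.
        return document + pixel
    return document[:match.start()] + pixel + document[match.start():]
```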
Five built-in generators that produce deterministic fake artifacts
keyed by the token slug:
- aws_creds — passive [default]/[prod] credentials block, no
callback wiring (AWS-key tokens require an external
trap, which is post-v1)
- git_config — .git/config with origin url = http_base/c/<slug>/repo.git
- env_file — .env with API_BASE_URL + WEBHOOK_NOTIFY_URL embedding
the callback URL plus inert realism filler
- ssh_key — PEM-shaped fake private key whose host comment carries
<slug>.<dns_zone> when DNS is deployed, else the
http_base host
- honeydoc — minimal HTML report with a 1x1 tracking-pixel <img>
whose src is the callback URL; fallback for the
deploy-time baseline before the operator uploads a
real DOCX/PDF
Tests assert byte-stability (same ctx -> same bytes), slug presence
in the embedded fields, that aws_creds is intentionally URL-free,
and that every artifact carries operator-facing notes for the
preview endpoint.
Mirrors the decnet.intel layout (base + factory + lazy concrete
imports). Defines:
- CanaryArtifact / CanaryContext dataclasses + the generator and
instrumenter ABCs they share
- factory dispatch for generators (git_config/env_file/ssh_key/
aws_creds/honeydoc) and instrumenters (docx/xlsx/pdf/html/image/
plain/passthrough), plus pick_instrumenter_for_mime() for MIME-driven
dispatch on operator uploads
- persona-aware default placement paths (Linux vs. Windows-shaped)
and absolute-path validation that the API will use to validate
operator-supplied placement_path values
- on-disk blob store: sha256-keyed two-level fan-out, idempotent
writes, refcount-aware unlink (the DB row is the source of truth)
Also covers prior commits' tests (bus topics, models, repo CRUD)
under tests/canary/. 79 tests, all pass.
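One way to picture the sha256-keyed blob store is sketched below. The <root>/<first 2 hex>/<next 2 hex>/<sha256> layout is an assumed reading of "two-level fan-out", and the function names are hypothetical; the refcount is passed in because the DB row, not the filesystem, is the source of truth.

```python
import hashlib
from pathlib import Path


def blob_path(root: Path, sha256_hex: str) -> Path:
    """Assumed layout: <root>/<first 2 hex>/<next 2 hex>/<sha256>."""
    return root / sha256_hex[:2] / sha256_hex[2:4] / sha256_hex


def store_blob(root: Path, data: bytes) -> tuple[str, Path]:
    """Idempotent write: storing the same bytes twice is a no-op."""
    digest = hashlib.sha256(data).hexdigest()
    path = blob_path(root, digest)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        tmp = path.with_suffix(".tmp")
        tmp.write_bytes(data)
        tmp.replace(path)  # rename into place so readers never see a partial file
    return digest, path


def unlink_blob(root: Path, sha256_hex: str, refcount: int) -> bool:
    """Remove the file only when the DB reports no remaining references."""
    if refcount > 0:
        return False
    blob_path(root, sha256_hex).unlink(missing_ok=True)
    return True
```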
Adds the abstract surface on BaseRepository and the SQLModel-backed
implementation (shared by SQLite and MySQL) for:
- canary blobs (upsert-by-sha256, list-with-refcount, refcount-aware delete)
- canary tokens (create, slug lookup, list with filters, state update)
- canary triggers (record+bump-counters atomically, list, attribute)
The triggers path is a single session that inserts the row and bumps the
parent token's counters together, so a subscriber that reads the token
right after the bus event sees the updated count. Blob delete refuses
while any token (including revoked) still references the blob; pre-v1
revoked tokens stick around for forensic value.
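The insert-and-bump behaviour can be illustrated with stdlib sqlite3 standing in for the SQLModel session. Column names here are assumptions; only the table names come from the canary schema.

```python
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE canary_tokens (
    uuid TEXT PRIMARY KEY,
    trigger_count INTEGER NOT NULL DEFAULT 0,
    last_triggered_at REAL
);
CREATE TABLE canary_triggers (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    token_uuid TEXT NOT NULL,
    src_ip TEXT,
    ts REAL NOT NULL
);
"""


def record_trigger(conn: sqlite3.Connection, token_uuid: str,
                   src_ip: str, ts: float) -> Optional[int]:
    """Insert the trigger row and bump the token counters in one transaction.

    Returns the new trigger_count, or None (with nothing written) when the
    token does not exist.
    """
    try:
        with conn:  # commits both writes together, or rolls both back
            conn.execute(
                "INSERT INTO canary_triggers (token_uuid, src_ip, ts) VALUES (?, ?, ?)",
                (token_uuid, src_ip, ts),
            )
            cur = conn.execute(
                "UPDATE canary_tokens SET trigger_count = trigger_count + 1, "
                "last_triggered_at = ? WHERE uuid = ?",
                (ts, token_uuid),
            )
            if cur.rowcount == 0:
                raise LookupError(token_uuid)
            row = conn.execute(
                "SELECT trigger_count FROM canary_tokens WHERE uuid = ?",
                (token_uuid,),
            ).fetchone()
    except LookupError:
        return None
    return row[0]
```

Publishing the bus event only after this function returns is what lets a subscriber read the already-updated count.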
Three new tables for the canary tokens feature:
- canary_blobs — operator-uploaded source artifacts, deduped by sha256
- canary_tokens — one planted artifact in one decky; carries the
callback slug, generator/instrumenter, and lifecycle
- canary_triggers — append-only log of every callback hit; attacker_id
back-filled by the correlator
Pydantic request/response shapes live in the same file per the
single-source-of-truth convention. No migrations file — pre-v1
SQLModel.metadata.create_all() covers it.
Reserved topic family for the upcoming canary-tokens feature so the
correlator and webhook fanout can subscribe to canary.> from day one.
No producers yet; planter, decnet canary worker, and API will publish
in subsequent commits.
* Dashboard / Layout / index CSS — flexbox cleanup so the sidebar
scrolls independently and dashboard panels fill available height
without overflowing the viewport (min-height: 0 on the flex
ancestors that were collapsing).
* pyproject.toml — add sqlite_vec runtime dep (groundwork for an
embeddings-backed feature ANTI is wiring up separately).
* decnet/templates/{rdp,smb}/ntlmssp.py — minimal Type 3 (Authenticate)
parser shared between the SMB and RDP-NLA templates. Lands NTLM
creds in the universal Credential table with secret_kind=ntlmssp_v1
/ ntlmssp_v2 and secret_b64 = base64 of the NtChallengeResponse so
the bounty pipeline can feed the right hashcat mode.
* scripts/decnet-init.sh — convenience wrapper around `sudo decnet init
--force` that targets the current working directory; saves operators
retyping the install paths during dev iterations.
New dashboard surface for editing the global emailgen persona pool —
the JSON file fleet (MACVLAN/IPVLAN) and SWARM-shard mail deckies pull
from. MazeNET topology personas are out of scope here; they're
configured per-topology in the topology editor.
Backend:
* GET/PUT /api/v1/emailgen/personas — admin-write, viewer-read. PUT
validates with the same Pydantic schema the worker uses
(parse_personas), drops invalid entries with a warning, returns 400
only when the entire payload fails. Path is operator-discoverable
on every response so a CLI-driven backup workflow stays visible.
Frontend:
* PersonaGeneration.tsx + .css — table + add/edit modal with the full
EmailPersona schema (name, email, role, tone, mannerisms list,
language, signature, active hours, reply latency, uses_llms_heavily).
Local edits are batched; explicit "SAVE CHANGES" writes back, with a
dirty-indicator pill and a "DISCARD" reset. Email uniqueness is
enforced client-side so the scheduler never picks the same persona
as both sender + recipient.
* Sidebar AUTOMATION group gains a "Persona Generation" entry next to
Orchestrator; route registered at /persona-generation.
The worker reads the same on-disk file the API writes — see
decnet.orchestrator.emailgen.global_pool. The API resets the
in-process cache on every read/write so the worker picks up dashboard
edits within its next tick rather than waiting on mtime.
The SSE pipe at /orchestrator/events/stream was already streaming
'orchestrator.email.{decky_uuid}' events (the subscription is for the
'orchestrator.>' wildcard), but the consumer side dropped them on the
floor. Three fixes to close the loop:
* useOrchestratorStream.ts now registers an 'email' SSE listener — the
EventSource silently ignores frames whose event name has no listener,
so missing this entry meant every email frame was dropped before
reaching the page's onEvent handler.
* /api/v1/orchestrator/events accepts kind=email and dispatches to
list_orchestrator_emails, adapting rows to the existing wire shape:
subject -> action, sender_email -> src_decky_uuid, recipient_email
-> dst_decky_uuid, plus email-specific extras (thread_id, language,
mail_decky_uuid, message_id, in_reply_to) ride along as top-level
keys.
* Orchestrator.tsx gains an 'email' tab in the kind filter and a
branch in the row renderer / inspector that:
- shows full sender / recipient (no UUID truncation),
- chips the language code next to the subject,
- relabels ACTION as SUBJECT in the inspector and surfaces
thread / in-reply-to / mail-decky details.
The 'all' tab continues to show traffic+file only (today's behavior);
operators see emails by switching to the email tab. A union view at
the API layer is the obvious follow-up but not necessary for now.
Plug emailgen into the systemd-supervised fleet:
- New deploy/decnet-emailgen.service.j2 mirroring decnet-orchestrator's
shape: simple service, restart-on-failure, docker supplementary group
(driver shells `docker exec` to drop EMLs into the spool), the same
hardening directives as the rest of the fleet.
- decnet.target now Wants both decnet-emailgen.service and
decnet-orchestrator.service. Orchestrator's absence from the target
was a historical oversight — fixing it here while the file is open.
`decnet init` already globs deploy/decnet-*.service.j2 so the new unit
ships automatically; no init-side change needed. Emailgen-specific env
knobs (DECNET_EMAILGEN_LLM, _MODEL, _PERSONAS, _TIMEOUT) are documented
in the unit and operator-tunable via /opt/decnet/.env.local.
Two-layer gating per CLAUDE.md:
- registration-time: emailgen added to MASTER_ONLY_GROUPS so agents
don't see the sub-app in 'decnet --help' at all.
- body-guard: _require_master_mode('emailgen ...') at the top of every
sub-command body so a direct callable import (third-party tooling)
still bails on agent hosts.
Matches the convention used for 'swarm', 'topology', 'geoip'. SWARM
agents push their generated mail through the master's emailgen worker
(or none at all); cross-agent emailgen federation stays out of scope.
Lift the Ollama subprocess shell-out out of EmailDriver and into a
proper provider subpackage shape:
decnet/orchestrator/emailgen/llm/
base.py — LLMBackend Protocol + LLMResult + LLMTimeout
factory.py — get_llm() reads DECNET_EMAILGEN_LLM
impl/ollama.py — current 'ollama run' subprocess path
impl/fake.py — canned-output backend used by tests
Driver now takes an LLMBackend on construction (or inherits the
factory default). Tests inject FakeBackend instead of monkeypatching
the subprocess layer, which is cleaner and ~10x faster. Swapping
Ollama for the Anthropic API / vLLM / llama.cpp is now a third branch
in factory.py; no driver rewrite needed.
Mirrors the convention used by decnet.web.db.factory + decnet.bus.factory
per the provider-subpackages-from-day-one rule in memory.
Two changes that unwind earlier MazeNET-only assumptions and fix a
realism tell:
1. Persona resolution is now per-decky-source, not topology-only. The
scheduler walks the union view (list_running_deckies, including
fleet MACVLAN/IPVLAN + SWARM shards) and picks the right persona
list for each source:
* topology decky -> Topology.email_personas (per-topology richness
preserved)
* fleet / shard -> a single host-wide pool loaded from disk
(DECNET_EMAILGEN_PERSONAS, /etc/decnet/email_personas.json, or
~/.decnet/email_personas.json)
Operators install the global pool via 'decnet emailgen
import-personas <file>' which validates with the same Pydantic
schema the worker uses.
2. The driver now runs 'touch -d <Date>' inside the docker exec right
after the EML write so file mtime matches the email's RFC 2822
Date: header. Without this an attacker 'ls -lt'ing the spool sees
every email clustered inside the worker's tick window — the
cluster itself was a stylometric tell.
CLI now exposes 'decnet emailgen' as a sub-app with 'run' (default,
backwards-compatible with bare 'decnet emailgen') and 'import-personas'.
list_running_deckies carries topology_id through so consumers can resolve
the parent topology without a second round-trip.
When IMAP_EMAIL_SEED / POP3_EMAIL_SEED points at a directory of .eml
files (the orchestrator emailgen worker's drop path,
/var/spool/decnet-emails/ by convention), the bait mailbox is replaced
with those LLM-generated, persona-driven, threaded messages. Empty /
missing dir keeps the hardcoded fallback so a fresh deployment is never
silent. Cached with mtime invalidation + a short TTL so a hot mailbox
doesn't pay the parse cost on every IMAP/POP3 command.
Replaces the DEBT-026 stub on both templates that named the env var but
never wired it through.
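One plausible reading of "mtime invalidation + a short TTL" is sketched below: the TTL limits how often the directory is stat'ed, and the mtime decides whether a re-parse is needed. SpoolCache and its loader callback are hypothetical names; the templates' real cache may be structured differently.

```python
import os
import time
from typing import Callable


class SpoolCache:
    """Cache a parsed spool directory; re-stat at most once per TTL."""

    def __init__(self, directory: str, loader: Callable[[str], list],
                 ttl: float = 5.0, clock: Callable[[], float] = time.monotonic):
        self._directory = directory
        self._loader = loader
        self._ttl = ttl
        self._clock = clock
        self._loaded = False
        self._value: list = []
        self._mtime = None
        self._checked_at = 0.0

    def get(self) -> list:
        now = self._clock()
        if self._loaded and now - self._checked_at < self._ttl:
            return self._value  # hot path: no stat, no parse
        try:
            mtime = os.stat(self._directory).st_mtime_ns
        except OSError:
            mtime = None  # missing dir: caller falls back to built-in bait
        if not self._loaded or mtime != self._mtime:
            self._value = self._loader(self._directory) if mtime is not None else []
            self._mtime = mtime
            self._loaded = True
        self._checked_at = now
        return self._value
```

A directory's mtime only changes when its direct entries change, so a nested per-thread layout would need the loader or the stat to look one level deeper.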
Second orchestrator worker (decnet emailgen) that drips persona-driven,
threaded, multi-language fake emails into running mail deckies. Personas
live on Topology.email_personas; topology-wide language_default falls
through to any persona that doesn't pin its own. Em-dashes are
suppressed at the prompt layer by default and only lifted for personas
explicitly marked uses_llms_heavily — em-dashes are an LLM tell and a
flat corpus of em-dashed mail is a giveaway.
EML delivery writes into /var/spool/decnet-emails/<thread>/<msg>.eml on
the mail decky via docker exec; wiring the IMAP/POP3 templates to read
from that spool (replacing the hardcoded _BAIT_EMAILS) is the next step.
Mirrors the CredentialsInspector pattern: clicking a row opens a
right-edge drawer with the full event payload pretty-printed and
copyable. The table view truncates the src/dst id to 8 chars; the
drawer shows the full identifier plus a SOURCE chip
(TOPOLOGY / FLEET / SHARD) so operators can tell at a glance whether
the orchestrator hit a MazeNET decky, a unihost fleet decky, or a
SWARM shard.
Source detection is purely client-side based on id shape — bare UUID
→ topology, "local:*" → fleet, "<host>:*" → shard. The server
already returns a normalized id from list_running_deckies; this
inspector just labels it.
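The id-shape rule is small enough to state as code. The inspector itself lives in the TypeScript frontend; this Python sketch only mirrors the rule, and the "unknown" fallback is an addition for illustration.

```python
import uuid


def classify_source(decky_id: str) -> str:
    """Label a normalized decky id by shape alone."""
    if decky_id.startswith("local:"):
        return "fleet"
    if ":" in decky_id:
        return "shard"  # "<host>:<name>" from a SWARM worker
    try:
        uuid.UUID(decky_id)
    except ValueError:
        return "unknown"  # hypothetical fallback, not in the original rule
    return "topology"
```

The "local:" check has to run before the generic colon check, otherwise every fleet decky would be labelled as a shard.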
Backdrop click closes via target===currentTarget guard (per the
React stop-propagation memory: never use stopPropagation on drawer
panels — it breaks native event delegation).
Live (in-flight stream) events use synthetic uuids prefixed "live-";
the drawer hides the EVENT UUID row and shows "LIVE EVENT" in the
header for those, since the server-side id won't exist until the
backend persists the row.
Once the orchestrator started seeing fleet + SWARM shard sources via
list_running_deckies (a844148), every event row landing on a fleet decky
broke the FK to topology_deckies — the column now carries opaque ids
("local:omega-decky" for fleet, "host_uuid:decky_name" for shards) that
will never match topology_deckies.uuid.
Symptom on the operator's mothership:
IntegrityError 1452 — orchestrator_events_ibfk_2 FK violated on every
tick once the reconciler populated fleet_deckies.
Index on dst_decky_uuid is preserved (the dashboard reads
"events for this decky" frequently); only the FK is removed. Keeps
data integrity loose by design — events are append-only history that
should outlive the deckies they reference.
Existing MySQL deployments need both FKs dropped manually:
ALTER TABLE orchestrator_events
DROP FOREIGN KEY orchestrator_events_ibfk_2,
DROP FOREIGN KEY orchestrator_events_ibfk_1;
SQLite users are unaffected — SQLite doesn't enforce FKs by default.
The Workers panel (Config → Workers tab) hardcodes its row list in
KNOWN_WORKERS — by design, so a rogue publisher can't inject UI rows.
Three heartbeat-emitting workers were missing:
* clusterer — behavioral clustering (decnet/clustering/)
* campaign-clusterer — campaign assembly (decnet/clustering/campaign/)
* reconciler — host-local fleet convergence (added in 430262e)
Each already publishes on system.<name>.health via run_health_heartbeat,
so they show up live the moment they're added to the registry — no
frontend or subscriber wiring needed (Config.tsx renders whatever
/workers returns).
Also added to _PREFERRED_ORDER in start-all so START ALL WORKERS brings
them up in dependency-friendly order: data-plane → reconciler → intel
→ clustering → output → orchestrator.
Three deployable units (listener, web, swarmctl) intentionally remain
absent from KNOWN_WORKERS — they don't emit heartbeats (CLI / static
server / one-shot tooling), so they'd permanently render as UNKNOWN
and confuse operators. Adding them is a separate decision that needs
a "synthesize installed-but-silent rows" pass on the registry.
Two pieces, one PR because they share a deployment surface:
1. systemd. decnet-reconciler.service.j2 mirrors the orchestrator unit
shape (docker group, hardened sandbox, append-logs). Read-only
/var/lib/decnet so it can read decnet-state.json without write
access. Auto-discovered by `decnet init` via the existing
decnet-*.service.j2 glob — no init.py change needed. Added to
decnet.target so `systemctl start decnet.target` brings it up
alongside collector / sniffer / mutator / etc. Also added to the
agent reaper script so self-destruct cleans it up on workers.
2. Bus signal. reconcile_once now publishes
`decky.<host_uuid:name>.state` on every insert / delete /
state-changed transition. Reuses the existing DECKY_STATE topic
family (no bus/topics.py change → no wiki update needed per the
bus-signals doc rule). Composite host_uuid:name segment keeps
fleet rows distinguishable from MazeNET TopologyDecky rows whose
ids are bare UUIDs. Quiet ticks publish nothing — convergence
means silence.
Bus is plumbed through the worker, defaults to None for unit-test
callers. publish_safely keeps the source-of-truth contract: DB write
is authoritative, the publish is best-effort notification.
Captures previous_state into a local before update_fleet_decky_state
runs — a fake repo that mutates rows in-place would otherwise see the
post-update state and report previous == current. Real repos don't
have this concern but the fix is cheap and makes the function less
order-dependent.
Switches _one_tick from list_running_topology_deckies to
list_running_deckies (the union view added in 095500a). Resolves the
permanent "no actionable deckies (running+ssh count=0)" log on hosts
running only unihost MACVLAN / IPVLAN decoys — the orchestrator now
sees fleet_deckies rows alongside MazeNET topology rows and SWARM
DeckyShard rows.
Also fixes the misleading log message: the old "running+ssh count=N"
reported the *pre-filter* total (count of all running deckies, not
the SSH-eligible subset that scheduler.pick actually evaluates). New
line breaks down running, ssh_eligible, and per-source counts so
debugging "why isn't it picking?" no longer requires reading
scheduler internals.
Regression test: orchestrator integration suite now seeds fleet_deckies
rows (not just topology_deckies) and verifies a tick picks them and
records an event with dst="local:fleet-*" — proving the original bug
on the operator's mothership is fixed.
Adds decnet.fleet.reconciler — a pure async function plus a long-lived
worker — that periodically reconciles the three sources of truth on a
DECNET host:
1. decnet-state.json (CLI-canonical fleet record)
2. fleet_deckies table (DB mirror, written by engine.deployer)
3. docker inspect (actual per-container runtime state)
Drift handling:
* JSON has X, DB doesn't → INSERT (deploy ran with DB offline)
* DB has X (this host), JSON doesn't → DELETE (teardown ran with DB offline)
* Both have X, docker disagrees → flip state to running/failed/degraded
* Docker socket unreachable → leave existing state alone (don't
torch every row to torn_down)
Cross-host safety: deletions are scoped to host_uuid for the local host;
a master that runs both a local fleet and swarm workers will never
clobber a peer's slice.
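The drift rules and the host_uuid scoping can be sketched as a pure planning function. The data shapes (a set of names, a dict keyed by (host_uuid, name), and an optional docker-state map) are assumptions chosen for the example, not the reconciler's real types.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Plan:
    inserts: list[str] = field(default_factory=list)
    deletes: list[str] = field(default_factory=list)
    state_changes: dict[str, str] = field(default_factory=dict)


def plan_reconcile(json_names: set[str],
                   db_rows: dict[tuple[str, str], str],
                   docker_states: Optional[dict[str, str]],
                   host_uuid: str = "local") -> Plan:
    """Compute drift actions; docker_states=None means the socket is unreachable."""
    # Only rows owned by this host are ever candidates for deletion.
    local = {name: state for (host, name), state in db_rows.items()
             if host == host_uuid}
    plan = Plan(
        inserts=sorted(json_names - set(local)),
        deletes=sorted(set(local) - json_names),
    )
    if docker_states is None:
        return plan  # leave existing states alone
    for name in sorted(json_names & set(local)):
        observed = docker_states.get(name)
        if observed is not None and observed != local[name]:
            plan.state_changes[name] = observed
    return plan
```

Keeping the planning step free of I/O makes each drift direction testable without a database or a docker socket.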
CLI:
decnet reconcile --once # one-shot, prints counts
decnet reconcile [--interval N] # long-lived worker, mirrors
# orchestrator's lifecycle (control
# listener + heartbeat + tick loop)
Promotes decnet/fleet.py → decnet/fleet/ package so the reconciler can
live alongside it without name collision (build_deckies_from_ini and
all_service_names re-exported unchanged via __init__.py).
14 new tests cover state aggregation rules, all four drift directions,
host_uuid scoping, docker-unreachable safety, and worker shutdown via
the bus control event.
The unihost API path delegates to engine.deployer.deploy(), which now
writes both decnet-state.json (existing) and the fleet_deckies DB
table (added in 646aeec). Comment makes the single-sink design
explicit so future maintainers don't add a parallel save_state /
upsert_fleet_decky call here.
No behavioral change — every fleet-creation path on every host (CLI
deploy, this unihost API path, and per-worker SWARM agent deploys)
already routes through the engine.deployer single sink.
CLI deploy now writes both surfaces: decnet-state.json (existing,
canonical for offline / no-API hosts) and the new fleet_deckies DB
table (visible to orchestrator, web dashboard, REST API).
Best-effort: a DB outage logs a warning and returns. The JSON file
remains the source of truth for `decnet status`, `decnet teardown`,
sniffer, and collector — operators on a CLI-only host keep working.
_run_async helper bridges sync deploy() into the async repository.
Always uses a fresh thread because the API handler at
web.router.fleet.api_deploy_deckies invokes deploy() from inside a
FastAPI event loop, which would otherwise break asyncio.run.
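The fresh-thread bridge can be sketched as follows. run_async is an illustrative stand-in for _run_async, whose exact body is not shown here; the point is that asyncio.run always executes on a thread with no running loop.

```python
import asyncio
import threading
from typing import Any, Coroutine


def run_async(coro: Coroutine[Any, Any, Any]) -> Any:
    """Run `coro` to completion on a fresh thread with its own event loop."""
    outcome: dict[str, Any] = {}

    def runner() -> None:
        try:
            outcome["value"] = asyncio.run(coro)
        except BaseException as exc:  # re-raised on the calling thread
            outcome["error"] = exc

    thread = threading.Thread(target=runner, daemon=True)
    thread.start()
    thread.join()
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

The calling thread blocks until the coroutine finishes, so a caller inside an event loop stalls that loop for the duration; for a best-effort DB mirror that trade-off may be acceptable.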
Verified end-to-end against MySQL: deploy mirror inserts rows, union
view (list_running_deckies) returns them with source="fleet",
teardown mirror removes them. Works from both sync (CLI) and async
(API handler) call sites.
Adds a fleet_deckies table so DB-only consumers (orchestrator, web
dashboard, REST API) can see unihost / MACVLAN / IPVLAN deckies
without reading the JSON state file. Mirrors DeckyShard field-for-field.
Composite PK (host_uuid, name) future-proofs for a mothership that
runs both a local fleet and acts as a swarm master. host_uuid defaults
to the "local" sentinel — no FK to swarm_hosts because the local
mothership isn't enrolled as a worker.
Repo additions: upsert_fleet_decky, delete_fleet_decky,
list_fleet_deckies, list_running_fleet_deckies,
update_fleet_decky_state, plus list_running_deckies which unions
topology + fleet + shard sources for the orchestrator.
Smoke-tested round-trip against MySQL: upsert, list_running, union
view (source="fleet"), delete.
TTL extraction was already wired in the active prober and passive sniffer
plus profiler rollup; the checkbox was just stale. TCP/IP stack now
includes ToS/DSCP/ECN, IP-ID sequence classification, and ISN sequence
classification as of the previous three commits.
Mirrors the IP-ID classifier for TCP ISN values: per-source-IP rolling
deque (maxlen=8) populated from each inbound SYN's tcp.seq, classified
on every emission. A 'random' verdict is the modern norm; 'incremental',
'zero', or 'constant' indicates legacy stacks or hand-rolled raw-socket
tooling — a strong fingerprint signal.
Active prober now also captures server_isn (single sample, not classified
in-flight; downstream consumers correlating multi-probe results can apply
seq_class.classify_sequence themselves).
Profiler rollup carries the latest non-'unknown' label into
attacker.tcp_fingerprint. Dedup key already covers isn_class from
the previous commit, so transitions emit cleanly.
UI surfaces ISN class as a colour-coded tag with a ⚠ glyph for
non-random verdicts, since they're the genuinely interesting case.
Adds a per-source-IP rolling sample buffer (deque, maxlen=8) for IP-ID
values seen on attacker SYNs and a stdlib-only classifier in
decnet/sniffer/seq_class.py. Each new SYN appends ip.id and re-classifies
the buffer; the result is logged on tcp_syn_fingerprint events alongside
sample count.
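A sketch of what such a stdlib-only classifier might look like. The minimum sample count, the step threshold, and the function signature are all assumptions; the actual rules in decnet/sniffer/seq_class.py may differ.

```python
from collections import deque
from typing import Iterable

MIN_SAMPLES = 4   # assumed: fewer samples yields "unknown"
MAX_STEP = 1024   # assumed: largest gap still counted as incremental


def classify_sequence(samples: Iterable[int], modulus: int = 1 << 16) -> str:
    """Classify IP-ID style samples; pass modulus=1 << 32 for TCP ISNs."""
    values = list(samples)
    if len(values) < MIN_SAMPLES:
        return "unknown"
    if all(v == 0 for v in values):
        return "zero"
    if len(set(values)) == 1:
        return "constant"
    # Modular differences tolerate counter wrap-around (65535 -> 0).
    steps = [(b - a) % modulus for a, b in zip(values, values[1:])]
    if all(0 < step <= MAX_STEP for step in steps):
        return "incremental"
    return "random"


# Per-source rolling buffer as described: only the last 8 samples are kept.
buffer: deque = deque(maxlen=8)
```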
The dedup key now folds in ipid_class so a transition from 'unknown' to
a definitive verdict emits exactly one fresh event instead of being
suppressed by the old (os|options) key. Profiler rollup carries the
latest non-'unknown' label into attacker.tcp_fingerprint.
UI surfaces it as a colour-coded tag in the TCP STACK panel: random
neutral, incremental amber, zero/constant green (the strong signal).
Active prober now reads ip.tos from the SYN-ACK and emits tos/dscp/ecn
alongside the existing TTL/window/options fields. dscp is folded into the
fingerprint hash so different DSCP markings produce distinct signatures.
Passive sniffer logs the same three fields on tcp_syn_fingerprint events;
profiler rollup carries them into the attacker tcp_fingerprint snapshot;
AttackerDetail's TCP STACK panel now surfaces DSCP and ECN cells.
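For reference, the ToS byte splits into a 6-bit DSCP and a 2-bit ECN field. The split below follows the standard header layout; fingerprint_hash is a hypothetical stand-in showing how folding dscp into the hashed material separates otherwise identical stacks.

```python
import hashlib


def split_tos(tos: int) -> tuple[int, int]:
    """Split the IPv4 ToS byte into (dscp, ecn)."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("tos must fit in one byte")
    return tos >> 2, tos & 0b11


def fingerprint_hash(ttl: int, window: int, options: str, dscp: int) -> str:
    """Hypothetical hash input layout; the prober's real field order may differ."""
    material = f"{ttl}|{window}|{options}|{dscp}"
    return hashlib.sha256(material.encode()).hexdigest()[:16]
```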
Replaces inline styles + .bounty-root reuse with a dedicated
.orchestrator-root scope. Adds animated status pill (live/connecting/
error), bordered seg-group kind filter that matches DeckyFleet's
fleet-filter-group, dedicated kind chips (matrix-green for traffic,
violet for file), failure-row tint, and a brief 'fresh' tint for
just-prepended live rows that fades after 5s.
DEBT-042 — orchestrator failure-count badge is computed from the
in-memory SSE window; remediation is a dedicated stats endpoint.
DEBT-043 — no frontend test framework configured; the planned
Orchestrator.tsx component test couldn't be written without first
adding vitest + RTL.
New /orchestrator route. Paginated read-only event list with kind
filter (all|traffic|file), pause-stream toggle, in-window failure
badge ('X failures / 1h'), and an SSE-driven 'live' status pill.
Streamed rows prepend on top up to a 500-row in-memory cap.
Sidebar gains an AUTOMATION nav group; Orchestrator is the first
child. Future workers (mutator/prober activity) plug in as siblings.
Every 100 ticks, trim per-dst_decky_uuid history down to 10000 rows
(oldest first). Keeps the events table bounded on long-running fleets
without paying the cost on every write.
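A sketch of the amortised trim, written against plain sqlite3 for illustration. The table name orchestrator_events and the id / ts columns are assumptions; only dst_decky_uuid, the 100-tick cadence, and the 10000-row cap come from this message.

```python
import sqlite3

PRUNE_EVERY_TICKS = 100
PER_DST_CAP = 10_000


def prune_events(conn, per_dst_cap=PER_DST_CAP):
    """Keep only the newest per_dst_cap rows for each destination decky."""
    dsts = [row[0] for row in conn.execute(
        "SELECT DISTINCT dst_decky_uuid FROM orchestrator_events")]
    deleted = 0
    for dst in dsts:
        cur = conn.execute(
            """
            DELETE FROM orchestrator_events
            WHERE dst_decky_uuid = ?
              AND id NOT IN (
                SELECT id FROM orchestrator_events
                WHERE dst_decky_uuid = ?
                ORDER BY ts DESC, id DESC
                LIMIT ?
              )
            """,
            (dst, dst, per_dst_cap),
        )
        deleted += cur.rowcount
    conn.commit()
    return deleted


def maybe_prune(conn, tick, every=PRUNE_EVERY_TICKS):
    """Pay the pruning cost only on every Nth tick, never on the write path."""
    if tick > 0 and tick % every == 0:
        return prune_events(conn)
    return 0
```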
GET /api/v1/orchestrator/events — paginated list with optional
kind=traffic|file filter. GET /api/v1/orchestrator/events/stream —
SSE: snapshot on connect, live forward of orchestrator.> bus events
mapped to 'traffic' / 'file' SSE event names.
Repo gains list_orchestrator_events(limit, offset, kind?, since_ts?),
count_orchestrator_events(kind?), and prune_orchestrator_events
(per_dst_cap=10000) for periodic worker-side trimming.
Aligns the bus token with the DB column value: OrchestratorEvent.kind
is 'traffic'/'file' but the topic was 'activity'/'file'. The asymmetry
forced consumer code (UI filter, SSE event names) to carry a translation
layer. No external subscribers existed yet.
Adds a new decnet orchestrate worker whose job is to keep the honeypot
ecosystem from looking suspiciously static — a frozen LAN with no
inter-host traffic and no filesystem aging is its own honeypot tell.
MVP scope:
- New OrchestratorEvent table + repo methods (purpose-built sibling
to Log so synthetic events stay separable from attacker-driven ones).
- New orchestrator.{activity,file}.<decky_id> bus topics +
system.orchestrator.health heartbeat.
- SSH-only driver. Traffic action runs python3 inside src container
to TCP-connect dst:22 and read the SSH banner — real on-the-wire
SSH-protocol traffic without shipping creds. File action drops or
refreshes a small file via docker exec on the destination.
- Random scheduler (50/50 traffic/file when >=2 SSH-capable deckies
are running). Diurnal shaping, role-aware pairing, and session-aware
backoff are explicit non-goals for MVP.
- CLI registration, systemd unit (SupplementaryGroups=docker),
worker-registry entry so the dashboard shows orchestrator health.
- 11 tests: scheduler policy, driver argv shape + injection-safety,
end-to-end one-tick integration with FakeBus + SQLite.
Adds proper /identities and /campaigns list pages following the
Bounty/Attackers convention (page-header + page-title-group +
controls-row + logs-section + logs-table + EmptyState). Both pages
live-update via the existing identity / campaign SSE streams.
Sidebar: Attackers, Identities, Campaigns now group under a
THREAT DATA NavGroup, matching the SWARM grouping pattern.
CampaignDetail and IdentityDetail rewritten to use the house class
system (page-header / logs-section / chip / dim-chip) instead of
inline styles. The campaign chip on IdentityDetail navigates to
/campaigns/:uuid; both pages share a small fp-group helper for
fingerprint listings (added to Dashboard.css).
decnet-clusterer.service.j2 ships the identity clusterer that
landed last session (was overlooked) — bus-woken on attacker.>,
publishes identity.> events.
decnet-campaign-clusterer.service.j2 ships the campaign clusterer
from this session — bus-woken on identity.>, publishes campaign.>
events plus the cross-family identity.campaign.assigned. After=
decnet-clusterer.service so the identity layer is up before the
campaign layer reads its rows.
decnet.target Wants both new units. Both follow the same security
hardening profile as enrich + reuse-correlator.
API: /api/v1/campaigns (paginated list), /api/v1/campaigns/{uuid}
(soft-merge chain follow), /api/v1/campaigns/{uuid}/identities
(member identities), and /api/v1/campaigns/events (SSE under
campaign.> + JWT-via-?token=, snapshot-on-connect). Mirror of the
identity router; same auth, same shape, same OpenAPI tags pattern.
Frontend: CampaignDetail.tsx page (same visual vocabulary as
IdentityDetail), useCampaignStream hook (mirror of
useIdentityStream), /campaigns/:id route, IdentityDetail's
CAMPAIGN badge becomes clickable and navigates to the campaign.
useIdentityStream now listens for identity.campaign.assigned so
the badge appears live without a manual refresh.
Runs the chained identity + campaign clustering pipeline against all
seven fixtures via from_synthetic / from_synthetic_identity adapters
and ratchets every YAML floor to 1.0. The production clusterer and
the reference clusterers used in the per-fixture tests all score
perfectly across ARI / homogeneity / completeness /
singleton_recall on each fixture.
Three substrate fixes surfaced by the ratchet:
- Tuning: shared_infra now Jaccards payload+C2 only; decky_set moved
into cohort_weight to prevent fleet-scarcity false-merges (F1's
shared_wordlist failure mode). Tier weight raised to 1.0 so
shared payload+C2 alone crosses threshold (F5's intended pass).
- Adapter: from_synthetic_identity now reads SyntheticSession
started_at + duration_s for session_windows and per-decky
timestamps (the production-row adapter still uses start_ts/end_ts
when available).
- Fixture data: paused_campaign.yaml's JA3 collided exactly with
vpn_hopping.yaml's (same TLS extension list). The collision
fused two unrelated campaigns under the chained identity layer
in the noise_floor composite. Made paused's JA3 distinct.
Also wires Campaign / CampaignsResponse into models/__init__.py's
__all__, which the schema commit missed.
The campaign clusterer worker mirrors the identity-side worker shell
(bus connect, heartbeat, control listener, slow-tick fallback) but
wakes on identity.> instead of attacker.> — campaign-level work is
gated on identity-layer changes, not raw observations.
The connected-components implementation reads identities via
list_identities_for_clustering, projects them with from_identity_row,
runs union-find over combined_campaign_weight, writes campaigns rows,
sets attacker_identities.campaign_id, and runs the same revocable-
merge pass as the identity layer (a merged-out campaign whose
identities no longer co-cluster with the winner gets revoked).
Bus: adds campaign.> family (formed / identity.assigned / merged /
unmerged) plus the cross-family identity.campaign.assigned so
existing identity-stream subscribers see the badge update without
having to subscribe to campaign.>. Wiki Service-Bus.md updated in
wiki-checkout in the same wave per the project's bus-signals
discipline.
CLI: decnet campaign-clusterer registered as master-only via
MASTER_ONLY_COMMANDS; --poll-interval / --daemon mirror the identity
clusterer command surface.
Adds the signal taxonomy for the campaign clusterer (next commit). It
mirrors the identity-layer module, but with edge families that don't
translate 1:1: phase-handoff (load-bearing for F5 multi_operator —
the signal the identity-side fingerprint-disagreement veto deliberately
isn't), shared-infra (vetoed at identity level, primary positive
signal here), temporal-overlap (pairwise-relative — F7 invariance
preserved), cohort (weak supporting weight only).
Tier weights tuned so phase-handoff alone crosses threshold (F5),
shared-infra + temporal-overlap together cross (canonical co-op
pattern), and shared-infra + cohort together do NOT (F1
shared_wordlist's failure mode). The F7 time-shift invariant is
explicitly tested on every time-bearing edge and on the combined
weight.
Adds the campaigns table and the BaseRepository / SQLModelRepository
methods that the campaign-clusterer worker (next commit) needs to
populate it. Mirrors the AttackerIdentity layer: schema_version from
day one for federation gossip, soft-merge via merged_into_uuid with a
chain-walking get_campaign_by_uuid, list_campaigns excluding merged-
out rows while list_all_campaigns returns the unfiltered set for the
revoke pass. attacker_identities.campaign_id gets a real FK now that
the target table exists.
useIdentityStream hook mirrors useTopologyStream — opens an
EventSource against /api/v1/identities/events with the JWT in
?token=, dispatches the five named events (snapshot, formed,
observation.linked, merged, unmerged) to the consumer, reconnects
3s after any error.
AttackerDetail subscribes whenever it has an attacker id loaded.
On any event whose payload references this observation's uuid OR
the attacker's current identity_id, refetch /attackers/{id} so the
IDENTITY badge appears (or follows through merges / unmerges) live
without a tab refocus.
IdentityDetail subscribes whenever it has an identity id loaded.
On any event whose payload references this identity_id (formed for
it, merge winner / loser, unmerge resurrected / former-winner), it
refetches both the identity row and its observations list.
Both consumers filter inside onEvent — the hook itself is dumb glue
and stays unaware of which uuids any given component cares about.
Mirrors GET /api/v1/topologies/{id}/events: subscribes to identity.>
on the bus for the duration of the request and forwards each event as
a named SSE frame (formed / observation.linked / merged / unmerged).
The endpoint is broadly scoped (every identity event, not per-uuid)
because both AttackerDetail and IdentityDetail need the same
firehose: AttackerDetail watches for an identity.formed that finally
binds its identity_id; IdentityDetail watches for
observation.linked / merged / unmerged against its current row. A
per-uuid filter would force the client to know its identity before
subscribing, which it doesn't always.
JWT via ?token= (EventSource can't set headers), require_stream_viewer
gate, sse_connection_slot per-user cap, snapshot-on-connect with
the first 50 identities so the client buffer renders without a
separate REST call.
Bus-disabled / unreachable path keeps the connection alive on
keepalives so the client doesn't reconnect-storm; it can re-poll
the REST API on its own timer.
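The wire format is standard Server-Sent Events. Below is a rough sketch of how a bus topic could become a named SSE frame, with a comment-line keepalive for the bus-disabled path. The helper names are hypothetical and the real router may structure this differently.

```python
import json


def topic_to_event(topic):
    """Map a bus topic such as 'identity.observation.linked' to its SSE event name."""
    prefix = "identity."
    if not topic.startswith(prefix) or len(topic) == len(prefix):
        raise ValueError(f"not an identity topic: {topic!r}")
    return topic[len(prefix):]


def sse_frame(event, payload):
    """Serialise one named SSE frame; a blank line terminates the frame."""
    data = json.dumps(payload, separators=(",", ":"))
    return f"event: {event}\ndata: {data}\n\n"


def sse_keepalive():
    """Lines starting with ':' are SSE comments, which EventSource ignores."""
    return ": keepalive\n\n"
```

Because keepalives are comments, they hold the connection open without firing any client-side handler.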
Reworks the clusterer's tick to handle multi-identity components and
re-evaluate prior merges. Two passes per tick:
Pass 1 — per-component reconciliation:
* Fresh component → mint identity (commit 4 path).
* Single-identity component → link unassigned observations.
* Multi-identity component → soft-merge: pick the smallest-uuid
winner deterministically, set merged_into_uuid on each loser,
link unassigned observations to the winner. Observations stay
FK'd to their original identity row — the merge is a soft
pointer, not a re-point. Audit trail preserved; cached
subscribers resolve through the chain.
Pass 2 — revocable-merge undo:
* For each merged-out identity, check whether its observations
still cluster with its winner's. If not, the merge is
contradicted by new evidence — clear merged_into_uuid and emit
identities_unmerged. The resurrected identity keeps its original
uuid, so subscribers that cached it during the merged interval
re-attach without a new lookup.
A pre-built merge-chain dict feeds Pass 1 so the effective-identity
lookup is O(1) per observation. The chain has a hop cap (paranoia
against accidental cycles in the underlying state).
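The merge-chain lookup and the deterministic winner rule can be sketched as follows. The function names and the hop-cap value are assumptions; the source states only that a cap exists and that the smallest uuid wins.

```python
MAX_HOPS = 16  # assumed value for the hop cap


def build_merge_chain(identities):
    """Build {loser_uuid: winner_uuid} from (uuid, merged_into_uuid) pairs."""
    return {uuid: target for uuid, target in identities if target is not None}


def resolve_effective(uuid, chain, max_hops=MAX_HOPS):
    """Follow merged_into pointers to the canonical identity, bounded by max_hops."""
    current = uuid
    for _ in range(max_hops + 1):
        nxt = chain.get(current)
        if nxt is None:
            return current
        current = nxt
    raise RuntimeError("merge chain exceeded hop cap; possible cycle")


def pick_winner(uuids):
    """Smallest uuid wins, so every tick picks the same winner for the same set."""
    return min(uuids)
```

Each hop is one dict lookup, so resolution stays cheap as long as chains are short.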
Repo additions on BaseRepository + SQLModelRepository:
* list_all_identities() — includes merged-out rows.
* update_identity_merged_into(uuid, winner_or_None) — single
setter for both merge and unmerge.
DummyRepo coverage stub updated.
Tests:
* Two distinct identities bridged by a new observation merge with
the smaller uuid as winner.
* A pre-seeded soft-merge whose underlying observations diverge
gets revoked; resurrected uuid emerges with merged_into_uuid
cleared.
* Tick is idempotent under no state changes.
Adds two targeted invariants instead of a wholesale YAML-bounds re-use,
because the existing F6 bounds were tuned for the reference
composite_signals_clusterer (fingerprint OR C2). The production
clusterer trades that aggregation for tier discipline + the
fingerprint-disagreement veto, so its score profile differs even
when its judgments are correct — multi_operator stays as 2 truth
identities, paused_campaign's two DSL actors remain a single cluster
because they share fingerprints, etc. Wholesale bounds re-use would
fight the design.
The two production-side ratchets:
1. singleton_recall ≥ 0.95 at campaign-level scoring — truth-
singleton noise scanners must not be absorbed into real campaigns.
This is the F6 failure mode that motivates the fixture.
2. Intra-campaign recovery under cross-corpus interference:
* vpn_hopping's 5 rotations consolidate to one cluster.
* shared_wordlist A and B stay in disjoint clusters despite
sharing credentials with each other (and with the noise floor).
A future commit can revisit when the production clusterer's identity-
level truth alignment improves (e.g. when paused_campaign's DSL is
extended to mark its two actors as one truth identity).
Fixture 7 ratchet: one campaign across 3 multi-week operational
windows with stable JA3 + HASSH + C2. The production clusterer must
fold all 3 into one cluster despite multi-week silence between
windows; completeness = 1.0.
Time-shift invariance test: applying a +90 day delta to every
session start (and the per-attacker first/last seen) must produce
the same cluster membership as the baseline. This is the runtime
counterpart of the static no-time-fields check on Observation. If
either check ever fails, the clusterer has accidentally grown a
recency-aware edge — fixture 7's whole reason for existing.
Pins down the tier-discipline contract end-to-end:
- Credentials-only overlap doesn't fuse observations (F1 in
miniature).
- ASN-only overlap doesn't fuse observations (F2 in miniature).
- All three weak tiers (medium + low + very-low) stacked still
don't fuse — only a high-tier signal does.
- F1 (shared_wordlist) at identity-level: no false merges, every
row is its own predicted cluster, homogeneity = 1.0.
- F2 (vpn_hopping): 5 distinct ASNs collapse into 1 predicted
cluster, proving JA3 / HASSH dominate ASN as the design
requires.
The combination math itself was wired in commit 5; this commit is
the failure-mode regression suite that gates future tuning of the
tier weights.
Two operators cooperating on one campaign can share C2 endpoints +
stage-1 payloads while running distinct tooling — fixture 5
(multi_operator) is the canonical demonstration. The identity
clusterer must NOT fuse them: shared infra is a campaign-level
signal, not an identity-level one. The campaign clusterer (downstream
work) handles that grouping over identities.
Mechanism: when two observations have non-null fingerprints AND the
fingerprints fully disagree, the high-weight tier drops the payload
and C2 contributions to zero. JA3 / HASSH agreement still returns
1.0 directly — no veto applies when something agrees. Partial
agreement (one slot agrees, another disagrees) is treated as
agreement, since stable-tool partial overlap is more consistent
with one identity than two.
The veto only triggers when there is actual disagreement evidence —
two un-fingerprinted observations sharing a C2 still cluster, since
the absence of fingerprints is not the same as disagreement on them.
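One way to read the veto rules above, as a hedged sketch. The Obs type and its field names are hypothetical, and "fully disagree" is interpreted here as "every slot that both sides populate differs".

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Obs:
    """Hypothetical projection; field names are illustrative only."""
    ja3: Optional[str] = None
    hassh: Optional[str] = None
    payload_hashes: FrozenSet[str] = frozenset()
    c2_endpoints: FrozenSet[str] = frozenset()


def fingerprint_verdict(a, b):
    """Return 'agree', 'disagree', or 'absent' over the JA3 and HASSH slots."""
    comparable = [
        (x, y)
        for x, y in ((a.ja3, b.ja3), (a.hassh, b.hassh))
        if x is not None and y is not None
    ]
    if not comparable:
        return "absent"
    if any(x == y for x, y in comparable):
        return "agree"  # partial agreement counts as agreement
    return "disagree"


def high_weight(a, b):
    """High-tier edge score with the fingerprint-disagreement veto applied."""
    verdict = fingerprint_verdict(a, b)
    if verdict == "agree":
        return 1.0
    if verdict == "disagree":
        return 0.0  # veto: shared payload / C2 is a campaign-level signal
    shared = (a.payload_hashes & b.payload_hashes) or (a.c2_endpoints & b.c2_endpoints)
    return 1.0 if shared else 0.0
```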
Fixture 5 production-clusterer assertion added at identity level:
ARI = 1.0, homogeneity = 1.0, exactly 2 predicted clusters from
2 truth identities. Phase-handoff edges (from the TODO) belong to
the downstream campaign clusterer, not this identity clusterer.
The clusterer now drops a single high-tier function call in favor of
a tier-weighted sum. Tier multipliers (high=1.0, medium=0.6, low=0.2,
very_low=0.05) are tuned so the threshold (1.0) admits high-tier
agreement alone while leaving every weaker tier — and every
combination of weaker tiers — under threshold.
Per-tier discipline tested:
- high alone clusters
- medium alone does NOT cluster (supporting signal only)
- low alone does NOT cluster (fixture 1's failure mode)
- very-low alone does NOT cluster (fixture 2's failure mode)
- all three weak tiers stacked still don't reach threshold
- high + medium clusters (high already saturates)
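The arithmetic behind the tier discipline is small enough to show directly. The multipliers and threshold are the ones quoted above; the function names and the dict-of-scores shape are assumptions.

```python
TIER_WEIGHTS = {"high": 1.0, "medium": 0.6, "low": 0.2, "very_low": 0.05}
THRESHOLD = 1.0


def combined_weight(scores):
    """Weighted sum of per-tier scores, each score expected in [0, 1]."""
    return sum(TIER_WEIGHTS[tier] * score for tier, score in scores.items())


def should_link(scores):
    """True when the combined weight reaches the clustering threshold."""
    return combined_weight(scores) >= THRESHOLD
```

Even with every weak tier at its maximum score the sum is 0.6 + 0.2 + 0.05 = 0.85, which stays under the 1.0 threshold, while a full high-tier match reaches it alone.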
The combination is forward-compatible: low + very-low contributions
are computed today but always project to 0.0 because the production
adapter doesn't populate credentials / ASN-edge inputs into the
fixture path yet. Their contribution becomes load-bearing in commit 7
when the low-tier landing tightens the F1 / F2 bounds.
Fixture 4 (paused_campaign) ratchet added: high-tier signal carries
the multi-day-silence campaign into one identity. Time-agnostic
invariant — silence is irrelevant to the edge weight.
The connected-components clusterer now writes attacker_identities
rows + sets attackers.identity_id when high-weight signals (JA3 /
HASSH / payload-hash / C2-endpoint exact match) agree across
observations. Singletons stay un-fingerprinted and un-clustered.
Algorithm split:
- cluster_observations(observations) — pure union-find over the
high-weight edge function. Same code path for fixture validation
and production tick.
- from_attacker_row(row) — production-row adapter; recovers JA3 +
HASSH from Attacker.fingerprints JSON. Payload + C2 join from
logs in later commits; the function shape doesn't change.
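A minimal union-find sketch of the connected-components step. This is not the project's cluster_observations: the function name, the explicit edge parameter, and the index-list return shape are assumptions chosen to keep the example self-contained.

```python
def connected_components(observations, edge, threshold=1.0):
    """Group observations whose pairwise edge score reaches threshold.

    Returns a sorted list of components, each a sorted list of indexes.
    """
    parent = list(range(len(observations)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)  # smaller index becomes the root

    for i in range(len(observations)):
        for j in range(i + 1, len(observations)):
            if edge(observations[i], observations[j]) >= threshold:
                union(i, j)

    components = {}
    for i in range(len(observations)):
        components.setdefault(find(i), []).append(i)
    return sorted(components.values())
```

Keeping the function pure (no repository or bus access) is what would let the same code path serve fixture validation and the production tick.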
Repo additions on BaseRepository + SQLModelRepository:
- list_attackers_for_clustering(limit=None)
- create_attacker_identity(row)
- set_attacker_identity_id(attacker_uuid, identity_uuid)
DummyRepo coverage stub updated.
v1 behavior is conservative: only assigns identities to observations
whose identity_id is currently NULL. Multi-identity components are
skipped this pass — merge / re-assign lands in commit 10 with
revocable merges.
Fixture bounds tightened against the production clusterer:
- lone_wolf (F3) — singletons stay singletons
- shared_wordlist (F1) — credential-only overlap doesn't cluster
(high-weight tier doesn't include credentials)
- vpn_hopping (F2, identity-level) — 5 rotated IPs with stable JA3
+ HASSH fold into one identity, ARI = 1.0, completeness = 1.0
Adds the four weight-tier edge functions as pure, time-agnostic
scoring primitives over an Observation projection. Each returns a
score in [0, 1]; the connected-components impl will combine + threshold
in subsequent commits.
Tier semantics (from IDENTITY_RESOLUTION.md):
- high — JA3/HASSH/payload-hash/C2-endpoint exact match
- medium — phase-bucketed command-sequence Jaccard
- low — credential-attempt-set Jaccard (defeated alone by F1)
- very low — ASN equality (defeated alone by F2)
Time-agnostic invariant is a static test: Observation has no time
fields, so no edge function can silently start using them. Fixture 7
forbids recency-decay clustering on multi-month APT campaigns.
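The static invariant could look roughly like this. The Observation fields and the list of time-like tokens below are illustrative assumptions, not the project's actual projection.

```python
from dataclasses import dataclass, fields
from typing import FrozenSet, Optional

# Assumed vocabulary of name tokens that would indicate a time-bearing field.
TIME_TOKENS = {"time", "timestamp", "ts", "seen", "started", "ended", "at", "date", "day"}


@dataclass(frozen=True)
class Observation:
    """Illustrative projection with no time fields."""
    ja3: Optional[str] = None
    hassh: Optional[str] = None
    payload_hashes: FrozenSet[str] = frozenset()
    c2_endpoints: FrozenSet[str] = frozenset()
    command_phases: FrozenSet[str] = frozenset()
    credentials: FrozenSet[str] = frozenset()
    asn: Optional[int] = None


def time_like_fields(cls):
    """Return names of dataclass fields whose name contains a time-like token."""
    return [
        f.name
        for f in fields(cls)
        if TIME_TOKENS & set(f.name.lower().split("_"))
    ]
```

A test asserting that time_like_fields(Observation) is empty fails as soon as someone adds a field like first_seen, before any edge function can start reading it.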
A from_synthetic() adapter projects SyntheticAttacker corpora into
Observation; the production-row adapter lands when the clusterer
starts reading the attackers table.
Revocable merges (a contradiction-driven undo of identity.merged) ship
in the clusterer work; this reserves the topic up-front so identity.>
subscribers receive it day one without a re-subscribe.
The clusterer worker's ClusterResult fan-out now publishes on
identity.unmerged when populated. The skeleton clusterer never
populates it; the revocable-merge commit will.
Wiki update lives in wiki-checkout/Service-Bus.md (separate repo).
Adds the decnet clusterer master-only command + provider-subpackage
shape (base.py + factory.py + impl/connected_components.py) so
subsequent commits can land similarity-graph features without
churning callers.
The skeleton ConnectedComponentsClusterer.tick is a no-op; the
worker shell is fully wired (bus consumer on attacker.observed +
attacker.scored, slow-tick fallback, health heartbeat, control
listener, ClusterResult fan-out to identity.formed/observation.linked
/merged). Subscribers on identity.> see no events from this clusterer
until edge functions land, but the lifecycle is in place.
Models a multi-month APT campaign at real APT operational tempo: recon
over weeks, exploitation later, action-on-objectives later still.
The unique signal this fixture stresses is TIME-AGNOSTIC IDENTITY
across multi-week silences — a clusterer that silently expires old
edges fragments any campaign that operates over months.
Three DSL actors represent the operator's three operational windows
(week 2, month 2, month 3 of a 90-day campaign), all sharing JA3 +
HASSH + payload + C2 callback. Campaign-level fixture only — the
three actors mint distinct truth_identity_id rows by design (same
modeling caveat as fixtures 4 and 5).
The fixture's narrative mirrors how an APT works a deep nested
topology (DECNET MazeNET mode): map decoy networks for weeks, only
then commit to exploitation. Slow-and-low pacing is the signal.
recency_decay_clusterer added to fixture_harness — same edge
construction as composite_signals_clusterer, but each edge weighted
by exp(-time_distance / half_life_days) and dropped below a
threshold. Adversarial reference for slow_burn: with 14-day half-
life and 0.5 threshold, edges between operational windows (24+ days
apart) decay below threshold and drop. The campaign fragments into
three clusters; completeness collapses.
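The decay arithmetic can be checked in a few lines. The function names are hypothetical; the formula, the 14-day setting, and the 0.5 threshold are the ones quoted above.

```python
import math


def decayed_weight(base_weight, time_distance_days, half_life_days):
    """Weight an edge by exp(-time_distance / half_life_days)."""
    return base_weight * math.exp(-time_distance_days / half_life_days)


def keep_edge(base_weight, time_distance_days, half_life_days, threshold=0.5):
    """An edge survives only if its decayed weight reaches the threshold."""
    return decayed_weight(base_weight, time_distance_days, half_life_days) >= threshold
```

With the 14-day setting, a 24-day gap gives exp(-24/14), roughly 0.18, so the edge is dropped. Note that with this formula the weight reaches 0.5 at half_life_days * ln 2 (about 9.7 days for a 14-day setting), so the parameter behaves as an e-folding time rather than a strict half-life.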
This is the canonical production failure mode for graph clusterers
that bound memory or bias toward "what's hot" by silently expiring
old edges. Catching it in synthetic data is what fixture 7 exists
for; the replay tier will surface real-world drift / dwell patterns
that calibrate the half-life threshold the real algorithm should
tolerate.
Four tests: corpus shape (window-isolated sessions, stable
fingerprint), pipeline pass via composite_signals_clusterer (time-
agnostic — folds all three windows), adversarial fragmentation
(3 clusters at 14-day half-life), long-half-life sanity (gentle
decay unions everything; confirms behavior depends on the half-life
parameter, not on something unrelated).
Bundles all five prior fixtures' campaigns into one corpus alongside
10 fresh Delivery-only noise scanners (on top of the 8 inherited from
lone_wolf). The fixture covers cross-corpus interference: signal
collisions across fixtures' JA3/HASSH/C2 strings, factory ID re-use,
clusterer ambiguity that only manifests when multiple campaigns
score together. Each constituent fixture already ships its own
in-fixture adversarial test; this one is the control for the class
of failures that single-corpus fixtures cannot catch.
Composition is declared via a fixture-6-specific include_fixtures
block in noise_floor.yaml. The test file's loader expands it into
a full corpus.campaigns spec at runtime so the factory itself stays
unaware — no factory primitive added for what only this fixture
needs. The 8 noise scanners declared by lone_wolf flow through
naturally; the extra_noise_scanners count adds 10 more.
composite_signals_clusterer (added in the fixture-5 commit) is the
pass clusterer — union-find combining (ja3, hassh) match OR
overlapping C2 callback. Approximates the planned similarity graph
well enough that every campaign resolves and every singleton stays
singleton in the merged corpus.
Three tests: corpus integrity (every campaign id present, 12
campaign-driven attackers + 18 noise = 30 total), pipeline pass
against the global bounds, and an explicit singleton-recall
assertion (21 truth-singletons — 1 lone wolf, 18 noise, 2
shared_wordlist actors whose campaigns are size 1 — all kept
singleton by the composite clusterer). Singleton recall is the
load-bearing metric here: noise absorption is the failure mode
that makes campaign attribution useless in practice.
Three new reference clusterers in fixture_harness:
* c2_callback_clusterer — union-find on overlapping C2 callback
sets across an attacker's sessions. Pass-clusterer for fixture 5
where two operators with distinct tooling share a C2 endpoint as
the campaign signal.
* shift_clusterer — deliberately-bad reference that buckets
attackers by majority session-start hour into night/day/swing.
Adversarial reference for fixture 5; proves operational schedule
is NOT a campaign signal.
* composite_signals_clusterer — union-find combining (ja3, hassh)
match OR overlapping C2 callback. Will serve as the pass-
clusterer for fixture 6 (noise_floor) where multiple campaigns
with heterogeneous signal types are scored together.
Also factored a small _union_find helper for the new clusterers
(existing time_window/credential_jaccard left untouched to avoid
mixing refactor with feature work).
Fixture 5 (multi_operator): one campaign, two operators with
distinct UKC roles. Actor A (broker, night shift): Delivery →
Exploitation → Persistence → C2. Actor B (post-ex, day shift):
Discovery → Lateral Movement → Collection → Exfiltration.
Distinct JA3/HASSH/ASN/IPs; shared C2 + payload hash.
Four tests: corpus shape (distinct fingerprints, shared C2,
disjoint shifts), pipeline pass via c2_callback_clusterer,
explicit harness sanity that fingerprint_clusterer cannot resolve
this fixture (documents which signal carries the campaign), and
adversarial shift_clusterer fragmentation.
Phase-handoff edges (the real load-bearing signal per the design
doc) wait for the production clusterer; this fixture will prove
they're needed when it ships.
Adds the actor.active_days primitive to the campaign factory so a
DSL actor can be bound to specific day indexes. Falls back to the
non-paused day pool when absent (existing fixtures unchanged).
Intersects with pause_windows so the campaign-wide silence still
wins if both are set.
Adds time_window_clusterer reference to fixture_harness — union-find
over attackers, edge if their session time-ranges are within
gap_days of each other. Deliberately-bad reference for fixture 4:
multi-day silent stretches fragment a single campaign because the
clusterer has no signal that bridges the gap.
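A sketch of the time-window grouping. It uses a sorted sweep instead of union-find, which yields the same components when the edge rule is "ranges within gap_days of each other" on a single time axis. The input shape (a dict of day ranges) is an assumption.

```python
def time_window_clusters(ranges, gap_days):
    """Cluster attackers whose activity ranges lie within gap_days of each other.

    ranges maps attacker id -> (first_day, last_day).
    """
    ordered = sorted(ranges.items(), key=lambda item: item[1][0])
    clusters = []
    current, current_end = [], None
    for attacker_id, (start, end) in ordered:
        # A silence longer than gap_days closes the running cluster.
        if current and start - current_end > gap_days:
            clusters.append(sorted(current))
            current, current_end = [], None
        current.append(attacker_id)
        current_end = end if current_end is None else max(current_end, end)
    if current:
        clusters.append(sorted(current))
    return clusters
```

For two windows on days 1-2 and 6-7, a gap_days of 1 leaves two clusters, while a gap_days of 10 unions them.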
Fixture 4 (paused_campaign): one campaign modeled as two DSL actors
representing the operator's two operational windows (active days
1-2 and 6-7), separated by a silent stretch (days 3-5). Both share
JA3 + HASSH + payload + C2 callback; only their active_days differ.
Five tests: corpus shape (rows in their windows, shared signals),
pipeline pass via fingerprint_clusterer at level=campaign,
adversarial fragmentation via time_window_clusterer (1-day union
threshold cannot bridge the 4-day silence → completeness collapses),
huge-gap sanity (gap_days=10 unions both halves), silent-stretch
invariant (no session leaks into the configured pause window).
Identity-level scoring is fixture 2's job; this fixture is
campaign-level only — modeling caveat documented in the YAML.
One campaign, one DSL actor, ip_pool: rotating + rotation_count: 5
across 5 synthetic private-use ASNs (RFC 6996 64512-64516). Stable
JA3, HASSH, and payload_hash across every rotation — these are the
"signals the attacker can't cheaply rotate" per IDENTITY_RESOLUTION.md
and the load-bearing reason all 5 observation rows must resolve to
one identity / one campaign.
Two new reference clusterers in fixture_harness.py:
* fingerprint_clusterer — groups by (ja3, hassh). Un-fingerprinted
rows stay singleton so it doesn't trivially fuse all noise into one
mega-cluster. Approximates the stable-signal arm of the planned
similarity graph.
* asn_clusterer — deliberately-bad reference for fixture 2's
adversarial test. Group-by-ASN shatters the campaign into 5
singletons; completeness collapses to 0.
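The contrast between the two reference clusterers can be sketched with plain dicts. The row shape and label strings are assumptions for illustration, not the harness's real types.

```python
def fingerprint_clusterer(rows):
    """Label rows by (ja3, hassh); un-fingerprinted rows stay singleton."""
    labels = {}
    for row in rows:
        ja3, hassh = row.get("ja3"), row.get("hassh")
        if ja3 is None and hassh is None:
            labels[row["id"]] = f"singleton:{row['id']}"
        else:
            labels[row["id"]] = f"fp:{ja3}|{hassh}"
    return labels


def asn_clusterer(rows):
    """Deliberately bad reference: label rows by ASN alone."""
    return {row["id"]: f"asn:{row['asn']}" for row in rows}
```

On five rotated rows with one stable fingerprint, the first function yields a single label and the second yields five.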
Four tests in test_vpn_hopping_fixture.py: corpus shape (5 rows, 1
identity, 1 campaign, 5 distinct ASNs/IPs, stable fingerprints),
pass at campaign level, pass at identity level (asserts ARI exactly
1.0), asn_clusterer breaches the completeness floor.
Open Question 1 (merge revocability): adopted. The clusterer will
clear merged_into_uuid on contradicting evidence and publish a new
identity.unmerged topic alongside the existing three identity.* topics
so subscribers on identity.> get it from day one.
Open Question 2 (AttackerDetail UX on identity_id change): adopted
SSE over refresh-on-focus. New endpoint will mirror the existing
topology mutator SSE (bus subscription on identity.>, JWT via ?token=,
snapshot-on-connect + live forward).
Risk 2 (API URL stability for soft-merged loser UUIDs): struck —
already shipped in commit dc3d08d (read-only API follows
merged_into_uuid and surfaces the canonical winner).
Fifth and final commit of the identity-resolution substrate. Unblocks
fixture 2 (vpn_hopping) by making the synthetic factory match
production shape: an actor rotating across N IPs produces N
SyntheticAttacker rows that share fingerprints + truth_identity_id but
differ on ip / asn — exactly the shape the future clusterer needs to
recover via JA3/HASSH match.
Factory:
* SyntheticSession + SyntheticAttacker gain truth_identity_id field.
* DSL: ip_pool: rotating + rotation_count: N produces N observation
rows per actor. Optional rotation_asns: [...] cycles ASN per row;
defaults to the actor's primary asn.
* Sessions distribute round-robin across the actor's rotated rows.
* Noise scanners get truth_identity_id == truth_actor_id ==
truth_campaign_id (each is its own singleton at every level).
* GeneratedCorpus.truth_labels(level=) accepts "campaign" (default,
back-compat), "identity", or "actor" — picks the oracle the
metric harness scores against.
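A rough sketch of the rotation expansion described in the Factory list. ObservationRow, expand_rotating_actor, and the generated IPs are hypothetical stand-ins for the real SyntheticAttacker factory.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ObservationRow:
    """Hypothetical stand-in for one SyntheticAttacker row."""
    ip: str
    asn: int
    ja3: str
    truth_identity_id: str
    sessions: List[str] = field(default_factory=list)


def expand_rotating_actor(actor_id, ja3, primary_asn, rotation_count,
                          sessions, rotation_asns=None):
    """Emit rotation_count rows sharing fingerprints and identity, differing on ip / asn."""
    if rotation_count < 1:
        raise ValueError("rotation_count must be at least 1")
    asns = list(rotation_asns) if rotation_asns else [primary_asn]
    rows = [
        ObservationRow(
            ip=f"198.51.100.{i + 1}",   # documentation range, illustration only
            asn=asns[i % len(asns)],    # cycle through rotation_asns per row
            ja3=ja3,
            truth_identity_id=actor_id,
        )
        for i in range(rotation_count)
    ]
    # Sessions distribute round-robin across the rotated rows.
    for k, session in enumerate(sessions):
        rows[k % rotation_count].sessions.append(session)
    return rows
```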
Harness:
* assert_fixture_bounds gains truth_level kwarg (default "campaign")
so identity-resolution fixtures can score against truth_identity_id
without churning the campaign-clustering test files.
Tests: 9 new (rotation_count emits N rows, shared identity +
fingerprints, distinct IPs, rotation_asns distribution + cycling,
round-robin session distribution, identity-level truth labels,
sticky default unchanged, sessions inherit identity label).
598 tests green across clustering / factories / db / web / bus /
profiler / correlation.
Fourth of the five-step identity-resolution substrate. Constants and
builder ship now; no publishers exist yet — they land with the
clusterer worker. Subscribers (webhook worker, dashboard SSE relay)
can register against identity.> from day one.
* decnet/bus/topics.py — IDENTITY root + IDENTITY_FORMED /
IDENTITY_OBSERVATION_LINKED / IDENTITY_MERGED leaves; identity()
builder mirroring the attacker() / system() helpers. Module
docstring topic-tree updated.
* tests/bus/test_topics.py — assert builder produces the expected
three topic strings + rejects empty event_type.
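A sketch of what the builder and its test assert. The constant values and the ValueError are assumptions; the source states only that the builder produces the three topic strings and rejects an empty event_type.

```python
IDENTITY = "identity"
IDENTITY_FORMED = "formed"
IDENTITY_OBSERVATION_LINKED = "observation.linked"
IDENTITY_MERGED = "merged"


def identity(event_type):
    """Build an identity.* topic string; an empty event_type is rejected."""
    if not event_type:
        raise ValueError("event_type must be non-empty")
    return f"{IDENTITY}.{event_type}"
```

A subscriber on identity.> would then match every topic this builder produces.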
Wiki Service-Bus.md and a new Identity-Resolution.md page land in the
companion wiki-checkout commit.
Third of the five-step identity-resolution substrate. Frontend hooks
into the empty /api/v1/identities/* surface from commit 2; renders
nothing visible when identity_id is null (which is the universal state
until the clusterer ships).
* decnet_web/src/components/IdentityDetail.tsx — new page. Header with
uuid + optional CAMPAIGN / MERGED-INTO badges, stats row
(observations / JA3 / HASSH / payloads / C2), fingerprint tag lists
parsed from the JSON-in-TEXT columns, observations table that links
back to AttackerDetail, conditional analyst-notes panel.
* decnet_web/src/components/AttackerDetail.tsx — IDENTITY badge
inserted in the header row alongside TRAVERSAL. Clicking navigates
to /identities/<uuid>. AttackerData interface gains the optional
identity_id field.
* decnet_web/src/App.tsx — /identities/:id route + lazy-loaded chunk.
Verified by `tsc --noEmit` (clean) and `vite build` (clean — produces
IdentityDetail-*.js as its own lazy chunk). The repo has no JS test
harness; build + type-check are the gate.
Second of the five-step identity-resolution substrate. Ships the API
surface against the empty AttackerIdentity table from commit 1 — every
endpoint returns empty/404 cleanly until the clusterer populates rows.
Routes (auth-gated, viewer role):
* GET /api/v1/identities — paginated list, excludes merged-out rows
* GET /api/v1/identities/{uuid} — detail; transparently follows
merged_into_uuid to surface the canonical winner
* GET /api/v1/identities/{uuid}/observations — Attacker rows FK'd
to the (resolved) identity uuid
Repository (BaseRepository abstract + SQLModelRepository concrete):
* get_identity_by_uuid (with merge-chain following, hop-bounded)
* list_identities / count_identities (excluding merged-out)
* list_observations_for_identity / count_observations_for_identity
Tests: 12 new (empty-table behavior, seeded data, merge-chain
resolution, repo-level smoke against real SQLite). Also fixes the
pre-existing test_base_repo_coverage failure (DEBT-041 added abstract
methods without updating the DummyRepo stub); the fix lands here because
this PR adds 5 more abstract methods to the same stub.
474 db/web/profiler/correlation tests green.
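A minimal sketch of hop-bounded merge-chain following, as described for get_identity_by_uuid above. The dict-based lookup, the function name, and the MAX_MERGE_HOPS value are illustrative assumptions; the real repository method works against the database and its actual hop limit is not stated here.

```python
from typing import Optional

MAX_MERGE_HOPS = 8  # hypothetical bound; the source does not state the real limit


def resolve_canonical(merged_into: dict, uuid: str,
                      max_hops: int = MAX_MERGE_HOPS) -> Optional[str]:
    """Follow merged_into_uuid pointers to the canonical identity.

    merged_into maps identity uuid -> merged_into_uuid (None when the row
    is canonical). Returns None for an unknown uuid or when the chain is
    longer than max_hops, which also stops accidental cycles.
    """
    current = uuid
    for _ in range(max_hops + 1):
        if current not in merged_into:
            return None
        successor = merged_into[current]
        if successor is None:
            return current
        current = successor
    return None
```

Bounding the walk means a corrupted chain (a row that points back at itself, for example) yields a clean miss rather than an endless loop.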
Schema-only commit, first of the five-step substrate for identity
resolution. The clusterer that populates identities lands later; this
ships the table empty and the FK uniformly NULL on existing rows.
* decnet/web/db/models/attackers.py — new AttackerIdentity SQLModel
(uuid PK, schema_version, fingerprint summary lists, kd_digraph_simhash,
merged_into_uuid self-FK, all clusterer-populated fields nullable).
Attacker grows a nullable indexed identity_id FK + docstring marking
it as the per-IP observation row.
* decnet/web/db/models/__init__.py — re-exports AttackerIdentity.
* tests/db/test_identity_schema.py — 9 schema invariants: table exists,
identity_id nullable + indexed, FK targets attacker_identities.uuid,
schema_version defaults to 1, attacker rows inserted with NULL
identity_id, FK constraint blocks orphans.
463 unrelated db/web/profiler/correlation tests still green. See
development/IDENTITY_RESOLUTION.md for the full design.
Pre-implementation design for the observation/identity/campaign
three-level hierarchy. Sibling-add approach (not rename) — keep the
attackers table name, add attacker_identities as a sibling, nullable
attackers.identity_id FK. Documents the rationale, schema, bus
topics, API surface, and the 5-commit implementation sequence.
Companion to development/CAMPAIGN_CLUSTERING.md. Substrate for the
clusterer worker designed there; ships empty so the campaign
clustering fixtures can encode honest multi-row-per-actor scenarios.
Two campaigns sharing a credential wordlist; everything else (ASN, IPs,
JA3, HASSH, active hours) divergent. Pass condition: clusterer must NOT
merge. Protects against the "credential overlap is identity" failure
mode that commodity wordlists invite.
* tests/clustering/fixture_harness.py — shared assert_fixture_bounds
helper + identity_clusterer (placeholder, trivially correct on
all-singleton fixtures) + credential_jaccard_clusterer (deliberately-
bad reference used to PROVE the fixture catches what it should).
* tests/clustering/test_shared_wordlist_fixture.py — bounds pass with
identity, bounds FAIL (homogeneity → 0) with the bad credential
clusterer. The latter is the proof the fixture earns its keep.
* tests/fixtures/campaigns/shared_wordlist.{yaml,expected.yaml}.
* tests/clustering/test_lone_wolf_fixture.py — refactored onto the
shared harness. No behavior change.
Pre-implementation scaffolding for campaign clustering. The simulator is
the spec — algorithm code follows once fixtures + metrics are stable.
* decnet/clustering/ukc.py — UKCPhase enum (19 phases across In/Through/Out
stages), OBSERVABLE_PHASES set, stage_of() helper. Vocabulary aligns
with future MITRE ATT&CK tagging so synthetic data and runtime phase
inference don't need renaming when TTP-tagging lands.
* tests/factories/campaign_factory.py — YAML DSL parser + deterministic
generator emitting truth-labeled SyntheticAttacker / SyntheticSession
records. Validates phase names, warns on unobservable phases, supports
multi-campaign + noise corpora.
* tests/clustering/metrics.py — pure-Python ARI / homogeneity /
completeness / singleton_recall (no sklearn dep). Decided before any
algorithm exists, on purpose.
* tests/fixtures/campaigns/lone_wolf.{yaml,expected.yaml} — fixture 3
from the design doc; simplest of the six, exercises the full pipeline
with an identity-clusterer placeholder.
* development/CAMPAIGN_CLUSTERING.md — design spec for the feature.
* development/DEVELOPMENT_V2.md — note on DSL evolution path
(concurrent phases, multi-actor per phase) deferred post-v1.
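For illustration, homogeneity can be computed in pure Python as 1 - H(C|K) / H(C), where C is the truth labelling and K the predicted clustering. This is a sketch of the standard definition, not the code in tests/clustering/metrics.py; the function signature is an assumption.

```python
import math
from collections import Counter


def homogeneity(truth: list, predicted: list) -> float:
    """Return 1.0 when every predicted cluster contains a single truth class."""
    n = len(truth)
    if n == 0:
        return 1.0
    class_counts = Counter(truth)
    cluster_counts = Counter(predicted)
    joint = Counter(zip(truth, predicted))
    # Entropy of the truth labels, H(C).
    h_c = -sum((c / n) * math.log(c / n) for c in class_counts.values())
    if h_c == 0.0:
        return 1.0  # a single truth class is trivially homogeneous
    # Conditional entropy of truth given the predicted clusters, H(C|K).
    h_c_given_k = -sum(
        (n_ck / n) * math.log(n_ck / cluster_counts[k])
        for (_, k), n_ck in joint.items()
    )
    return 1.0 - h_c_given_k / h_c
```

A clusterer that merges two equally sized campaigns into one cluster scores 0 on this metric, while an all-singleton placeholder scores 1.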
The threat-intel surface was IP-keyed on day one as an expedient — the
worker is woken by IP-bearing bus events. ANTI's call: don't carry that
debt. NO IPs as primary keys anywhere on the attacker-intel surface.
Schema:
- attacker_uuid is now the canonical key — UNIQUE + FK to attackers.uuid.
- attacker_ip stays as a denormalised, indexed, NON-UNIQUE value column.
Updated on every upsert; useful for SIEM payloads and audit lookups,
but explicitly NOT a key. Model docstring says so.
- Pre-v1, no Alembic migration needed. SQLModel.metadata.create_all()
builds the new shape on fresh DBs.
Repo:
- upsert_attacker_intel now keys on attacker_uuid.
- get_attacker_intel_by_ip → get_attacker_intel_by_uuid.
- get_unenriched_attacker_ips → get_unenriched_attackers, returning
[{uuid, ip}] records so the worker writes by UUID and dispatches
provider calls by IP without a second round-trip.
Worker:
- _enrich_one(uuid, ip, ...) — UUID lands on the row, IP rides for
provider egress.
- attacker.intel.enriched bus payload gains attacker_uuid alongside
attacker_ip — webhook → SIEM consumers benefit; no removal.
API:
- GET /api/v1/attackers/{ip}/intel deleted outright (rip-and-replace,
never deployed beyond dev).
- GET /api/v1/attackers/{uuid}/intel is the only public route, matching
every other /attackers/* route.
Frontend:
- <IntelPanel uuid={id!} /> uses the URL param directly, fetches in
parallel with the rest of AttackerDetail rather than waiting on
attacker.ip.
Tests: re-keyed in place, 39 passed (same coverage as before the
refactor). Provider-impl tests untouched.
DEBT-041: closed in DEBT.md (entry preserved as historical rationale,
summary table flipped to ✅, remaining-open list shortened by one).
Read-only IP-keyed intel surface on the attacker detail page. Renders
the aggregate verdict (color-coded MALICIOUS/SUSPICIOUS/BENIGN/NO SIGNAL)
plus a per-provider row with verdict, queried-at timestamp, and
provider-specific detail (GreyNoise classification, AbuseIPDB
0-100 score, Feodo C2 listing + malware family, ThreatFox IOC match
+ malware family). 404 from the API renders as 'NO INTEL CACHED YET'
with a hint that decnet enrich will populate it on the next pass —
TTL drives the refresh, no manual button.
DEBT-041 documents the API/UI IP-keying as a v1 expedient that will
need a UUID-keyed sibling endpoint before federation lands. NAT
collisions, attacker.uuid consistency across attacker routes, and the
sequential-fetch UX are all callouts on that ticket; the migration
sketch is laid out so the v1.x followup is unambiguous.
Frontend build: clean (55.58 kB AttackerDetail bundle, +~5kB for the
panel). Note: not browser-tested in this session — recommend a manual
smoke against a deployed master before tagging.
Mirrors decnet-reuse-correlator.service.j2: same hardening posture
(NoNewPrivileges, ProtectSystem=full, etc.), same restart policy, same
log file convention. The decnet init renderer picks it up automatically
via the decnet-*.service.j2 glob.
Also reconciles a naming inconsistency I shipped earlier: the heartbeat
name was 'intel' (the package) but the CLI command and unit are 'enrich'
(the action). Renamed the heartbeat to 'enrich' so the workers panel
displays the same string the operator types and the same string in the
systemd unit file. Convention across the project: heartbeat name =
registry key = unit basename = CLI command name.
Registers 'enrich' in worker_registry.KNOWN_WORKERS and in the
start-all preferred order. The decnet.target Wants= list also picks
up the new unit so 'systemctl start decnet.target' brings everything
up together.
CLI command mirrors the reuse-correlate shape (--poll-interval, --ttl-hours,
--daemon). Run it under systemd as a sibling worker.
The API endpoint returns the most recent cached row for an attacker IP
or 404. Auth-gated via require_viewer like every other attacker route.
Also extends the worker test with a real FakeBus so the
attacker.intel.enriched publish path is exercised end-to-end (no longer
a no-op against NullBus).
Four concrete IntelProvider impls — three per-IP queries plus one bulk
feed:
* GreyNoiseProvider — community endpoint, optional API key for higher
rate limit. 404 = unknown (cache the absence so we don't re-query).
* AbuseIPDBProvider — score threshold mapping (>=75 malicious, >=25
suspicious, else benign). Self-disables with a clear error when no
API key is configured rather than burning quota.
* FeodoProvider — fetches the bulk botnet C2 IP feed once per refresh
window and answers every lookup from an in-memory set. Listed = C2.
* ThreatFoxProvider — POST /api/v1/ search_ioc query, optional Auth-Key
header. Match in data[] = malicious; no_result = absence-not-benign.
Every provider routes through decnet.net.http.stealth_client so the
egress UA never leaks 'DECNET'.
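The AbuseIPDB threshold mapping above can be sketched as a small pure function. The function name and the string verdict values are assumptions made for illustration; only the 75 and 25 cut-offs come from the description.

```python
def verdict_from_abuse_score(score: int) -> str:
    """Map a 0-100 abuse confidence score onto a verdict tier."""
    if score >= 75:
        return "malicious"
    if score >= 25:
        return "suspicious"
    return "benign"
```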
run_intel_loop fans out across configured providers per IP, writes the
aggregate row, and publishes attacker.intel.enriched. Mirrors the
correlation/reuse_worker.py wake-on pattern: subscribes to
attacker.observed and attacker.scored for sub-second latency, falls back
to a 60s poll when the bus is unavailable. Heartbeat + control-listener
wired so the workers panel sees it like every other supervised worker.
Aggregate verdict picks the strongest provider tier (malicious >
suspicious > benign > unknown). Provider-level errors land in
IntelResult.error and are logged without poisoning the row — partial
success is the expected case for free-tier providers under their daily
caps.
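A sketch of the strongest-tier aggregation, assuming verdicts are plain strings and that a provider which errored contributes None. The names TIER_RANK and aggregate_verdict are hypothetical.

```python
from typing import Optional

# Ordering from the description: malicious > suspicious > benign > unknown.
TIER_RANK = {"unknown": 0, "benign": 1, "suspicious": 2, "malicious": 3}


def aggregate_verdict(provider_verdicts: list) -> str:
    """Pick the strongest tier; None entries (errored providers) are skipped."""
    best = "unknown"
    for verdict in provider_verdicts:
        if verdict is None:
            continue
        if TIER_RANK.get(verdict, 0) > TIER_RANK[best]:
            best = verdict
    return best
```

Because errored providers are skipped rather than treated as benign, a row built from partial results still reflects the strongest signal that did arrive.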
Concrete provider impls land in follow-up commits; the worker is fully
exercised here against fake providers so the framing is locked in.
Outbound calls to 3rd-party services (threat-intel providers, future TI
lookups) MUST NOT advertise 'DECNET' in their user-agent — operators
running honeypots want their reconnaissance dependencies to look like
generic infra. New decnet.net.http.stealth_client() returns a fresh
httpx.AsyncClient with a curl-shaped UA (pinned to a single constant so
future siblings — browser-shaped, Go-shaped — sit next to it cleanly).
Internal egress (webhook → operator's own SIEM, swarm worker → master)
keeps its DECNET-tagged UA; the docstring is explicit about not routing
those through this client.
IntelProvider is async-first (every concrete provider does HTTP), bounded
by a per-provider asyncio.Semaphore, and contractually never raises —
errors land in IntelResult.error so a single provider's outage doesn't
poison the worker pass for an entire IP.
Factory returns a list (not a singleton like geoip) because intel
enrichment fans out across all enabled providers per IP, with row-level
partial-success handling. Lazy imports keep the module dependency-free
when intel is disabled.
Concrete providers (greynoise/abuseipdb/feodo/threatfox) land in
follow-up commits — factory references them via lazy import so tests
covering the disabled and unknown-name paths pass on their own.
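The "bounded by a semaphore, never raises" contract might look roughly like the following. IntelResult here is a simplified stand-in with assumed fields, and SketchProvider, _query, and lookup are hypothetical names; the real base class is not shown in this log.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class IntelResult:
    # Simplified stand-in; the real model's fields are an assumption here.
    provider: str
    verdict: Optional[str] = None
    error: Optional[str] = None


class SketchProvider:
    name = "sketch"

    def __init__(self, max_concurrency: int = 4) -> None:
        # Caps the number of in-flight lookups for this provider.
        self._sem = asyncio.Semaphore(max_concurrency)

    async def _query(self, ip: str) -> str:
        raise NotImplementedError  # concrete providers would do HTTP here

    async def lookup(self, ip: str) -> IntelResult:
        """Never raises: any failure is captured in IntelResult.error."""
        async with self._sem:
            try:
                return IntelResult(self.name, verdict=await self._query(ip))
            except Exception as exc:
                return IntelResult(self.name, error=f"{type(exc).__name__}: {exc}")
```

A worker can then gather lookups across providers without wrapping each call in its own try/except.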
New TTL-cached threat-intel row keyed by attacker IP, with per-provider
verdict/raw/queried_at columns for GreyNoise, AbuseIPDB, abuse.ch Feodo
Tracker and ThreatFox. Carries schema_version from day one (federation
wire-format precedent set by SessionProfile). Repo gains
upsert_attacker_intel, get_attacker_intel_by_ip, and a
get_unenriched_attacker_ips backfill primitive that picks fresh + stale
rows for the forthcoming 'decnet enrich' worker.
Also documents the open-source intel-source backlog in DEVELOPMENT_V2.
The CredentialReuse table only stores the sha256+kind hash of the
secret; the printable + b64 forms live on the underlying Credential
rows. The dashboard drawer was therefore showing only the hash, which
defeats most of the value of having a reuse view in the first place.
Repo helpers list_credential_reuses + get_credential_reuse_by_id now
issue one batched SELECT against credentials keyed on the sha256s in
the result page and graft secret_printable + secret_b64 onto each row
before returning. The drawer renders the same printable/b64 code-block
the credentials inspector uses.
Adds the systemd template for the credential-reuse correlator daemon
and wires it into decnet.target so `decnet init` installs it
automatically (the unit installer globs decnet-*.service.j2). Mirrors
the mutator template: bus-woken Type=simple service with the standard
hardening + on-failure restart.
Also registers `reuse-correlator` in the in-process worker registry
(so the dashboard panel surfaces its heartbeat instead of dropping it
as unknown) and slots it into the start-all preferred order between
mutator and webhook.
Library shape (decnet/correlation/) consumed by profiler + reuse
correlator is the right model. The `decnet correlate` CLI helper was
removed in the previous commit.
The CLI was a day-one debug helper that read a log file or stdin and
printed a traversal table. It hadn't been wired to the live data path
since the engine moved into the profiler worker (DEBT.md:218). No
deploy unit, no caller, no doc relied on it. Removed the command and
its two tests; `decnet/correlation/` stays as a library consumed by
the profiler and the reuse correlator.
Adds a CREDS/REUSE tab segment on the Credential Vault page. The REUSE
tab lists CredentialReuse rows (paginated 25 per page) ordered by
target_count desc; row-click opens a drawer mirroring the credentials
inspector with a deckies x services grid, attacker links, and a
PROFILING PENDING placeholder when attacker_uuids has not been
backfilled yet.
The CREDS tab gains a REUSE column showing a clickable target-count
badge for credentials whose (sha256, kind, principal) tuple matches a
reuse row; clicking the badge fetches and opens that row's drawer.
Section header gains a manual refresh button (no SSE/polling).
Ticks the credential-reuse line in DEVELOPMENT.md and notes the
vectorstore scaffold.
Read-only routes for the credential-reuse findings produced by the
correlator. Mirrors the /credentials route shape: JWT-gated via
require_viewer, paginated with optional secret_kind /
min_target_count filters, and a 404-on-missing detail route.
No POST/PUT/PATCH (and no body parsing) so no 400 contract is
documented.
Adds CorrelationEngine.correlate_credential_reuse + the
`decnet reuse-correlate` long-running worker. The worker mirrors the
mutator's bus-wake + slow-tick pattern: wakes on credential.captured
and attacker.observed for sub-second latency, falls back to a 60s
poll if the bus is unavailable, and publishes
credential.reuse.detected once per new or grown CredentialReuse row
(group-deduped so a 5-cred reuse doesn't emit 5 partial events).
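The bus-wake plus slow-tick pattern reduces to waiting on an event with a timeout. This sketch assumes the bus subscriber sets an asyncio.Event when a relevant message arrives; wait_for_wake is a hypothetical helper, not the worker's actual function.

```python
import asyncio


async def wait_for_wake(wake: asyncio.Event, poll_interval: float) -> str:
    """Return "bus" if woken by an event, "poll" if the timer expired."""
    try:
        await asyncio.wait_for(wake.wait(), timeout=poll_interval)
    except asyncio.TimeoutError:
        return "poll"
    # Clear so the next call blocks until a new event arrives.
    wake.clear()
    return "bus"
```

The worker loop would call this with poll_interval=60 and run a correlation pass on either outcome, so a dead bus degrades latency rather than correctness.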
The web ingester now publishes credential.captured after every
successful Credential upsert; bus + new repo helper
find_credential_reuse_candidates feed the engine pass.
Credential capture runs before the profiler mints an Attacker, so
Credential.attacker_uuid is nullable on write. The profiler now
backfills the FK after each successful upsert_attacker. Soft-fail
posture matches the surrounding behavior + smtp rollups so a backfill
error never blocks the next attacker.
Lays the storage and bus substrate for the "credential reuse patterns"
task in DEVELOPMENT.md and scaffolds decnet/vectorstore/ as the future
substrate for statistical attacker re-identification over behavioral
fingerprints. No correlator, profiler, API, or dashboard wiring in
this commit — see TODO.md for the handoff.
Schema:
- Credential.attacker_uuid (nullable FK to attackers.uuid),
backfilled by the profiler post-write to avoid coupling the
capture path to the profiler's ordering.
- CredentialReuse table — UUID PK, JSON list columns for the
accumulating attacker_uuids/ips/deckies/services, target_count
(the discriminative scalar), confidence reserved for a future
fuzzy-credential pass.
Repo:
- upsert_credential_reuse / list_credential_reuses /
get_credential_reuse_by_id / update_credential_attacker_uuid.
- Renamed pre-existing get_credential_reuse(secret_sha256) to
get_credential_attempts_for_secret(secret_sha256) — the new
findings table needs the cleaner name.
Bus topics:
- credential.captured (one per Credential upsert)
- credential.reuse.detected (correlator-emitted on insert/grow)
Vectorstore subpackage (decnet/vectorstore/, flat layout mirroring
decnet/bus/):
- BaseVectorStore ABC keyed by (kind, id) — kind discriminator
means new feature families are additive, no schema migration.
- FakeVectorStore (in-memory L2 KNN), NullVectorStore (no-op for
DECNET_VECTORSTORE_ENABLED=false), SqliteVecVectorStore (lazy
sqlite_vec extension load, one vec0 virtual table per kind).
- get_vectorstore() env-driven dispatch with graceful fallback
to FakeVectorStore when the sqlite-vec extension isn't on the
host, so workers don't crash on a missing optional dep.
Tests: 26 new (11 cred-reuse repo, 15 vectorstore). Existing
credentials and base-repo tests updated for the rename. Total: 34
passing on the touched files.
The events watcher's start-event filter previously called
_load_service_container_names(), which reads decnet-state.json on
every event. decnet deploy writes that state file without synchronizing
with docker compose up, so a container's start event could
arrive before the state was committed — the watcher then dropped
the event silently and never tailed the container's stdout. The
visible symptom was an empty Credentials view (and Logs/Bounty)
after a fresh deploy until the collector was manually restarted.
Fix: stamp decnet.fleet.{service,decky,service_name} labels on
every fleet service container at compose-time, and let the
collector recognize either the fleet or topology label without
touching the state file. The state-file name match remains as a
fallback for legacy containers that predate the new labels.
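A sketch of the label-first filter with the state-file fallback. The decnet.fleet.service label name comes from the description above; the topology label name, the function name, and the argument shapes are assumptions.

```python
FLEET_LABEL = "decnet.fleet.service"
TOPOLOGY_LABEL = "decnet.topology.service"  # hypothetical name for the topology label


def is_service_container(labels: dict, name: str,
                         legacy_names: frozenset = frozenset()) -> bool:
    """Accept a container by label first; fall back to the state-file name set."""
    if FLEET_LABEL in labels or TOPOLOGY_LABEL in labels:
        return True
    # Legacy path: containers created before the labels existed.
    return name in legacy_names
```

Since labels travel with the container's start event, the decision no longer depends on when the state file is written.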
New /credentials page mirroring the Bounty Vault pattern: list view
with search, dynamic service segment chips, plaintext vs hashed
secret rendering, and an inspector drawer with copyable SHA-256 +
service-fields JSON. Sidebar entry uses the Lock icon to keep
Bounty's Archive/Key visual identity distinct.
Surfaces the Credential table (deduped attacker auth attempts) via
a new /api/v1/credentials route. Mirrors the Bounty cache pattern
(5s TTL on the unfiltered default page) and reuses the existing
get_credentials / get_total_credentials repo methods + the already
defined CredentialsResponse DTO. Filters: search, service, attacker_ip.
When RDP_ENABLE_NLA=true (service_cfg.nla=true on the topology side),
confirm PROTOCOL_HYBRID on the X.224 Connection Confirm, upgrade the
socket to TLS using a self-signed cert generated at first start by
the entrypoint, then drive a tiny CredSSP loop:
- Read inbound TSRequest DER (bounded to MAX_TSREQUEST_LEN).
- Scan for the NTLMSSP signature, dispatch on message type:
Type 1 -> respond with a hand-built TSRequest carrying our Type 2
challenge. Type 3 -> parse_type3() and emit auth_attempt with the
universal credential SD shape (secret_kind = ntlmssp_v2).
- Hand-built DER: no pyasn1 dependency.
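Reading a DER header by hand with a length bound could look like this. The MAX_TSREQUEST_LEN value and the function name are assumptions; the real constant and reader are not shown in this log.

```python
MAX_TSREQUEST_LEN = 16 * 1024  # hypothetical cap; the real value is not stated


def read_der_header(buf: bytes) -> tuple:
    """Return (tag, header_len, content_len) for the DER element at buf[0]."""
    if len(buf) < 2:
        raise ValueError("truncated DER header")
    tag, first = buf[0], buf[1]
    if first < 0x80:
        # Short form: the length fits in the low 7 bits.
        return tag, 2, first
    num_len_bytes = first & 0x7F
    if num_len_bytes == 0 or num_len_bytes > 4:
        raise ValueError("indefinite or oversized DER length")
    if len(buf) < 2 + num_len_bytes:
        raise ValueError("truncated DER length")
    content_len = int.from_bytes(buf[2:2 + num_len_bytes], "big")
    if content_len > MAX_TSREQUEST_LEN:
        raise ValueError("DER element exceeds MAX_TSREQUEST_LEN")
    return tag, 2 + num_len_bytes, content_len
```

Checking the declared length before reading the body keeps a hostile client from making the handler buffer an arbitrarily large TSRequest.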
Also folds in a small fix-up to commit 1: SMB SERVER_CHALLENGE was
hardcoded to 0x11..0x88 across the fleet, which would let a scanner
fingerprint every DECNET decky by its NTLM challenge. Both SMB and
RDP now derive the 8-byte challenge from
instance_seed.random_bytes(8, "ntlm_challenge"), giving each decky a
deterministic-but-distinct value. SMB Dockerfile gets the
instance_seed.py copy too (was synced into the build context but not
COPYed into the image).
- decnet/services/rdp.py: optional service_cfg.nla bool flips
RDP_ENABLE_NLA in the compose env.
- decnet/templates/rdp/Dockerfile + entrypoint.sh: openssl install +
per-decky cert generation gated on RDP_ENABLE_NLA.
- 9 NLA unit tests cover the DER reader/builder, _handle_nla round-
trip with Type 1 / Type 3, oversized-DER rejection, and per-
NODE_NAME challenge divergence.
- DEBT.md: DEBT-040 closed; full TS_INFO_PACKET capture documented as
a follow-up if attacker telemetry justifies it.
Replace Twisted-based connection logger with an asyncio handler that
parses the X.224 Connection Request, extracts the mstshash routing
cookie (universal across mstsc / FreeRDP / Hydra / ncrack / MSF
rdp_login), records the rdpNegRequest.requestedProtocols flags, and
answers with a well-formed X.224 Connection Confirm selecting
PROTOCOL_RDP.
Scope-down vs. the original DEBT-040 plan: full TS_INFO_PACKET
extraction would require either Standard-RDP-Security RC4 stream-
cipher implementation (with our own RSA pair + MS-RDPBCGR signing) or
a complete MCS+GCC ASN.1/BER stack for the SSL path — both far
exceed the 150 LoC budget the DEBT cited. The mstshash cookie is the
only piece of credential information that flows in plaintext on the
wire when the attacker speaks RDP, so capturing it is the highest-
value-per-byte signal available without going down either rabbit
hole. Phase 3 (CredSSP/NLA, next commit) is where actual NTLMv2
hashes land.
- Drops Twisted dependency from rdp/Dockerfile; adds ntlmssp.py copy
ahead of the NLA path that consumes it.
- 7 unit tests cover cookie capture, requestedProtocols recording,
CC framing, no-cookie path, and oversized/non-TPKT drops.
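A sketch of pulling the mstshash cookie and requestedProtocols out of an X.224 Connection Request. It assumes the whole TPKT frame has already been read into memory; parse_connection_request and the returned dict keys are hypothetical, not the handler's real interface.

```python
import struct
from typing import Optional

COOKIE_PREFIX = b"Cookie: mstshash="
TYPE_RDP_NEG_REQ = 0x01


def parse_connection_request(pdu: bytes) -> Optional[dict]:
    """Parse TPKT + X.224 CR; return None for anything that is not one."""
    # 4-byte TPKT header + 7-byte X.224 CR header is the minimum.
    if len(pdu) < 11 or pdu[0] != 0x03:
        return None
    (total_len,) = struct.unpack(">H", pdu[2:4])
    if total_len != len(pdu) or (pdu[5] & 0xF0) != 0xE0:
        return None
    rest = pdu[11:]
    cookie = None
    if rest.startswith(COOKIE_PREFIX):
        end = rest.find(b"\r\n")
        if end == -1:
            return None
        cookie = rest[len(COOKIE_PREFIX):end].decode("ascii", "replace")
        rest = rest[end + 2:]
    requested = None
    if len(rest) >= 8 and rest[0] == TYPE_RDP_NEG_REQ:
        # rdpNegRequest: type(1) flags(1) length(2, LE) requestedProtocols(4, LE)
        (requested,) = struct.unpack("<I", rest[4:8])
    return {"mstshash": cookie, "requested_protocols": requested}
```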
Replace impacket's SimpleSMBServer with a hand-rolled asyncio SMB2
framer that walks Negotiate -> SessionSetup(Type1) -> SessionSetup(Type3)
just deep enough to extract the inner NTLMSSP Type 3 via the shared
parse_type3() parser. Always returns STATUS_LOGON_FAILURE; the
attacker's hash lands in the Credential table, the attacker doesn't
land on the host.
- decnet/engine/deployer.py: _sync_ntlmssp_sources() mirrors the
auth-helper / sessrec sync pattern, copies _shared/ntlmssp.py into
smb/ and rdp/ build contexts before docker compose up.
- Dockerfile: drop impacket dep, copy ntlmssp.py.
- 7 unit tests drive the asyncio handler in-process via
StreamReader.feed_data; assert dialect, MORE_PROCESSING_REQUIRED on
first SessionSetup, NTLMSSP Type 2 carriage in SPNEGO, credential
capture with universal SD shape, STATUS_LOGON_FAILURE on Type 3,
oversized-NBSS / SMB1 / short-PDU drops.
Ships the load-bearing primitive both Phase 5 (SMB) and Phase 7
(RDP NLA) need: a standalone NTLMSSP Type 3 (AUTHENTICATE_MESSAGE)
parser per MS-NLMP §2.2.1.3.
Surface:
parse_type3(blob) -> dict | None
find_ntlmssp(buf) -> int # locate NTLMSSP\0 inside SPNEGO outer
Returns the universal Credential SD shape:
username + domain (decoded UTF-16-LE or ASCII per NEGOTIATE_UNICODE)
principal = "DOMAIN\\username"
secret_kind = "ntlmssp_v1" (24-byte fixed) or "ntlmssp_v2" (variable)
secret_b64 = base64 of NtChallengeResponse — canonical hashcat input
(-m 5500 v1, -m 5600 v2)
Bounds-checked for untrusted-input safety. Anonymous binds (empty NT
response) return None — no credential to record.
7 unit tests cover NTLMv1/v2 distinction, ASCII vs Unicode strings,
empty-domain shape, malformed signature/type rejection, and SPNEGO-
wrapped find_ntlmssp() lookup.
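For orientation, a stripped-down Type 3 parser following the MS-NLMP AUTHENTICATE_MESSAGE field layout. This is a sketch, not the shipped parse_type3(): the name parse_type3_sketch and the handling of an empty domain (principal falls back to the bare username) are assumptions.

```python
import base64
import struct
from typing import Optional

SIGNATURE = b"NTLMSSP\x00"
NEGOTIATE_UNICODE = 0x00000001


def _field(blob: bytes, offset: int) -> Optional[bytes]:
    """Read a (Len, MaxLen, BufferOffset) descriptor and slice its payload."""
    length, _max_len, buf_off = struct.unpack_from("<HHI", blob, offset)
    if buf_off + length > len(blob):
        return None  # descriptor points outside the message
    return blob[buf_off:buf_off + length]


def parse_type3_sketch(blob: bytes) -> Optional[dict]:
    if len(blob) < 64 or not blob.startswith(SIGNATURE):
        return None
    if struct.unpack_from("<I", blob, 8)[0] != 3:
        return None
    nt = _field(blob, 20)       # NtChallengeResponseFields
    domain = _field(blob, 28)   # DomainNameFields
    user = _field(blob, 36)     # UserNameFields
    if nt is None or domain is None or user is None or not nt:
        return None             # malformed, or anonymous bind
    flags = struct.unpack_from("<I", blob, 60)[0]
    encoding = "utf-16-le" if flags & NEGOTIATE_UNICODE else "ascii"
    username = user.decode(encoding, "replace")
    domain_name = domain.decode(encoding, "replace")
    # Assumption: with no domain, the principal is just the username.
    principal = f"{domain_name}\\{username}" if domain_name else username
    return {
        "username": username,
        "domain": domain_name,
        "principal": principal,
        "secret_kind": "ntlmssp_v1" if len(nt) == 24 else "ntlmssp_v2",
        "secret_b64": base64.b64encode(nt).decode("ascii"),
    }
```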
DEBT-040 opens to track the three remaining protocol framers that
will consume this parser:
- SMB: hand-rolled SMB2 + Session Setup framer (~200 LoC) replacing
Impacket's opaque SimpleSMBServer
- RDP basic auth: TPKT/X.224/MCS framer for legacy plaintext path
(~150 LoC)
- RDP NLA: TLS upgrade + CredSSP TSRequest parser, reuses parse_type3
via the SPNEGO inner blob (~250 LoC)
These are substantial protocol implementations each — landing them
inline with Phase 1-3+6's cred coverage rollout would have inflated
the session beyond reasonable scope. Cred-reuse analytics already work
across the 12 services covered in this session; the deferred three
just round out the fleet.
Plugs the cred-coverage gap for MongoDB. The template previously
parsed only the wire opcode + length and discarded the BSON body
entirely, so SCRAM-SHA-{1,256} client-proofs flowed straight through
without ever landing in the Credential table.
Adds an inline minimal BSON walker (~100 LoC) covering the 7 type
codes auth commands actually use: string, doc, array, binary, bool,
int32, int64. Hand-rolled rather than pulling pymongo as a runtime
dep — the parser is bounds-checked for untrusted-input safety
(won't loop on malformed length fields).
Wire flow MongoDB clients use for auth:
- OP_MSG body section (kind=0) → BSON doc with `saslStart` field
carrying mechanism + payload (SCRAM client-first-message:
"n,,n=<user>,r=<nonce>"). Username extracted, pinned to the
per-connection _sasl_username + _sasl_mechanism state.
- Subsequent OP_MSG with `saslContinue` → SCRAM client-final-message
("c=biws,r=<combined>,p=<base64 client-proof>"). The `p=` value is
the credential — emitted as secret_kind=scram_sha256 (or _sha1 /
_unknown depending on the prior saslStart's mechanism), principal
= the pinned username, secret_b64 = base64 of the decoded proof.
Reuse semantics: same client-proof across two auth attempts only
matches when both server salt and password were identical (proofs
include the salt). So cross-session reuse correlates only on
credential reuse against the same MongoDB account on the same decky
— honest, non-misleading signal.
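Once the BSON payload bytes are in hand, the two SCRAM messages are plain comma-separated attribute lists. The helpers below are a sketch with hypothetical names; they take the already-decoded message strings and do not cover the BSON walking itself.

```python
import base64
from typing import Optional


def scram_username(client_first: str) -> Optional[str]:
    """Extract n=<user> from 'n,,n=<user>,r=<nonce>'."""
    parts = client_first.split(",")
    # parts[0] and parts[1] are the GS2 header (channel-binding flag, authzid).
    for part in parts[2:]:
        if part.startswith("n="):
            # RFC 5802 escapes ',' as '=2C' and '=' as '=3D' in usernames.
            return part[2:].replace("=2C", ",").replace("=3D", "=")
    return None


def scram_proof_b64(client_final: str) -> Optional[str]:
    """Extract and re-encode p=<base64 proof> from the client-final-message."""
    for part in client_final.split(","):
        if part.startswith("p="):
            try:
                raw = base64.b64decode(part[2:], validate=True)
            except ValueError:
                return None  # not valid base64; nothing to record
            return base64.b64encode(raw).decode("ascii")
    return None
```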
680 tests pass across services, service_testing, db, web/ingester,
and core/fingerprinting (the broader scope my recent commits
touched). Phases 4, 5, 7 still pending (RDP basic-auth, SMB
NTLMSSP, RDP NLA).
Login forms (wp-login.php, phpMyAdmin, Joomla, etc.) ship a
`Content-Type: application/x-www-form-urlencoded` body with field
names like username/user/email/log/pwd/password. The HTTP/HTTPS
templates already captured the body as opaque bytes; now they parse
common login-form shapes into the universal credential SD shape.
Adds canonical templates/syslog_bridge.py:
extract_form_credentials(body, content_type) -> dict | None.
Field-name matching is case-insensitive and covers:
Principal: username, user, email, login, userid, account, log,
user_login (WordPress), uname / pma_username (phpMyAdmin)
Secret: password, pass, pwd, passwd, passwort, mot_de_passe,
user_password (WordPress), pma_password (phpMyAdmin)
The HTTP/HTTPS log_request handlers now call:
cred = classify_authorization(...) or extract_form_credentials(...)
— Authorization wins when present (current session credential beats
a follow-up form change), but POSTs to /wp-login.php with no Auth
header still surface their cleartext creds.
Secret-without-principal is intentional: a reset-confirm or auto-
fill abuse may carry a password without any field that maps to our
principal list. The cred row writes with principal=None — the
sha256 still correlates across services for reuse analytics.
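A reduced sketch of the form extractor using the field lists above. The function name, the first-match-wins ordering, and the simplified return shape (plain principal/secret rather than the full SD shape) are assumptions made for illustration.

```python
from typing import Optional
from urllib.parse import parse_qs

PRINCIPAL_FIELDS = ("username", "user", "email", "login", "userid", "account",
                    "log", "user_login", "uname", "pma_username")
SECRET_FIELDS = ("password", "pass", "pwd", "passwd", "passwort",
                 "mot_de_passe", "user_password", "pma_password")


def extract_form_credentials_sketch(body: str, content_type: str) -> Optional[dict]:
    if "application/x-www-form-urlencoded" not in content_type.lower():
        return None
    # Lower-case the field names so matching is case-insensitive.
    fields = {k.lower(): v[0] for k, v in parse_qs(body).items()}
    secret = next((fields[n] for n in SECRET_FIELDS if n in fields), None)
    if secret is None:
        return None
    # principal may legitimately be None (secret-without-principal case).
    principal = next((fields[n] for n in PRINCIPAL_FIELDS if n in fields), None)
    return {"principal": principal, "secret": secret}
```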
The body capture cap is bumped from 512 → 4096 chars so reasonable
form bodies aren't truncated before the cred extractor sees them;
the body stored in fields.body stays at 512 chars (display-friendly).
36 helper + emitter tests pass. Phases 4-7 still pending.
Closes the cred-coverage gap for two database services that had been
capturing only the username:
- MySQL — extends _handle_packet to read the auth-response after the
null-terminated username. mysql_native_password puts a 1-byte
length followed by 20 bytes: SHA1(password) XOR SHA1(salt +
SHA1(SHA1(password))). Plaintext irrecoverable, lands as
secret_kind="mysql_native_password" with the 20 hash bytes in
secret_b64. Hash is canonical for "hashcat -m 11200" if an operator
ever wants to crack offline.
- MSSQL — fixes a pre-existing bug AND adds password capture. The
prior _parse_login7_username read offsets 36/38, which is actually
ibHostName/cchHostName in the Login7 layout — username sat at
40/42 and was never touched. Replaced with _parse_login7_creds()
reading the correct offsets (40 username, 44 password). The Login7
password is obfuscated per byte by a nibble swap followed by XOR
with 0xa5; _deobfuscate_login7_password reverses it (XOR, then
nibble swap). Plaintext-recoverable, lands as secret_kind="plaintext".
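Both transforms are short enough to show directly. This is a sketch under the MS-TDS and mysql_native_password definitions; the function names are hypothetical and the obfuscate helper exists only to demonstrate the round trip.

```python
import hashlib


def obfuscate_login7_password(password: str) -> bytes:
    """Client side: swap nibbles of each UTF-16-LE byte, then XOR with 0xA5."""
    out = bytearray()
    for b in password.encode("utf-16-le"):
        swapped = ((b << 4) & 0xF0) | (b >> 4)
        out.append(swapped ^ 0xA5)
    return bytes(out)


def deobfuscate_login7_password(obfuscated: bytes) -> str:
    """Server side: XOR with 0xA5 first, then swap the nibbles back."""
    out = bytearray()
    for b in obfuscated:
        b ^= 0xA5
        out.append(((b << 4) & 0xF0) | (b >> 4))
    return out.decode("utf-16-le", "replace")


def mysql_native_scramble(password: bytes, salt: bytes) -> bytes:
    """SHA1(password) XOR SHA1(salt + SHA1(SHA1(password))), 20 bytes."""
    stage1 = hashlib.sha1(password).digest()
    stage2 = hashlib.sha1(salt + hashlib.sha1(stage1).digest()).digest()
    return bytes(a ^ b for a, b in zip(stage1, stage2))
```

The Login7 transform is keyless and fully reversible, which is why it lands as plaintext, whereas the MySQL scramble depends on the per-connection salt and only yields a hash.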
The pre-existing test_login7_auth_logged_and_closes only verified the
error response ships and the connection closes; it didn't validate
the parsed username, so the hostname-as-username bug was silent. New
tests cover both the deobfuscation algorithm directly and the full
ingester round-trip for both services.
Sync: copies the canonical syslog_bridge.py into mysql/ and mssql/
template build contexts so service_testing tests load the version
with classify_authorization + encode_secret available.
37 tests pass in the touched scope. Phases 3-7 still pending.
Closes the cred-coverage gap for 7 services that already had the data
on the wire but never landed it in the Credential table:
- SNMP — community string lands as secret_kind="snmp_community",
principal=None (v1/v2c has no per-user identity, the community IS
the auth).
- SIP — Digest response hash, previously buried in the auth= header
dump, now classify_authorization()-extracted.
- HTTP / HTTPS — Authorization header was in the headers JSON but
never extracted. Now Basic decodes to plaintext, Bearer →
http_bearer (principal=None), Digest → http_digest_md5.
- K8s — already extracted Authorization but didn't normalize. Service-
account JWTs flow through as Bearer.
- Docker API — headers absent entirely. Adds the headers JSON dump
and runs Authorization through the classifier.
- Elasticsearch — five distinct request handlers; each gains a
per-handler _cred_fields() helper.
Adds canonical templates/syslog_bridge.py:classify_authorization().
Recognised: Basic / Bearer / Token / Digest. Unknown schemes (NTLM,
AWS4-HMAC, Negotiate) return None; the header still rides in the
ambient SD-block but isn't normalized as a credential. The SD shape
on the wire collapses sip_digest_md5 into http_digest_md5 — same
algorithm, so cross-protocol reuse correlates correctly when (rare)
nonce collisions allow.
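A sketch of scheme dispatch for the Authorization header. The name classify_authorization_sketch and the returned keys are hypothetical, and treating Token the same as Bearer is an assumption; the real classifier returns the universal SD shape. The naive comma split for Digest would mis-parse quoted values that themselves contain commas.

```python
import base64
from typing import Optional


def classify_authorization_sketch(header: str) -> Optional[dict]:
    scheme, _, value = header.strip().partition(" ")
    scheme, value = scheme.lower(), value.strip()
    if not value:
        return None
    if scheme == "basic":
        try:
            decoded = base64.b64decode(value, validate=True).decode("utf-8", "replace")
        except ValueError:
            return None
        user, sep, password = decoded.partition(":")
        if not sep:
            return None
        return {"principal": user, "secret_kind": "plaintext", "secret": password}
    if scheme in ("bearer", "token"):
        # Assumption: Token is normalized the same way as Bearer.
        return {"principal": None, "secret_kind": "http_bearer", "secret": value}
    if scheme == "digest":
        params = {}
        for item in value.split(","):
            key, _, val = item.strip().partition("=")
            params[key.lower()] = val.strip('"')
        if "response" not in params:
            return None
        return {"principal": params.get("username"),
                "secret_kind": "http_digest_md5",
                "secret": params["response"]}
    return None  # NTLM, AWS4-HMAC, Negotiate, and other unknown schemes
```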
Drive-by repair of tests/core/test_fingerprinting.py:
- The pre-existing `test_http_useragent_extracted` asserted both that
add_bounty was called exactly once AND that the UA payload carried
`path` and `method` fields. Both have been wrong since this session opened:
the http_quirks fingerprint added later fires too, and the UA
payload never actually included path/method despite the assertion.
- Adds `path`/`method` to the UA fingerprint payload (real operator
value: "Nikto hit /admin" beats "Nikto seen on this decky").
- Replaces `assert_awaited_once` with a `_find_ua_bounty()` helper
that filters add_bounty calls by `fingerprint_type`. New fingerprint
families landing later won't retroactively break old tests.
- Updates the two credential-bearing tests to use the post-DEBT-039
native shape (`secret_b64` / `principal`) and `upsert_credential`,
not the deleted legacy `username+password` adapter.
Also rebuilds the per-service fake `syslog_bridge` modules in
tests/service_testing/{conftest,test_imap,test_pop3,test_snmp,test_mqtt,test_smtp}.py
to expose `encode_secret` + `classify_authorization`. Service templates
that import either now no longer fail at test collection.
173 tests pass in the touched scope. Phases 2-7 still pending.
Honest correction to the "every cred-emitting service" claim. Audit
of templates/* found three gaps:
1. MQTT — was working through the legacy adapter, silently dropped
when Phase 3 (e696c2b) deleted it. Now migrated to encode_secret()
alongside the others.
2. Postgres — `auth, pw_hash=…` event captures the MD5
challenge-response the attacker sent. Plaintext irrecoverable, so
it never fit the (principal, secret_b64=raw_bytes) shape. Lands
in Credential as secret_kind="postgres_md5_challenge".
3. VNC — `auth_response, response=…hex` event captures the 16-byte
DES-encrypted challenge. Same situation as Postgres: plaintext
irrecoverable. Lands as secret_kind="vnc_des_response".
Adds a `secret_kind` discriminator column to Credential (default
"plaintext", indexed). The dedup tuple gains secret_kind so two
credentials with the same sha256 but different kinds are
fundamentally different rows — different challenges produce
different bytes for the same plaintext password, so cross-kind
reuse matches are meaningless and would only confuse analytics.
The model now genuinely covers every cred-emitting service in the
fleet:
plaintext SSH, Telnet, FTP, POP3, IMAP, SMTP, Redis, LDAP,
MQTT
postgres_md5_* Postgres
vnc_des_response VNC
Username-only services (MySQL/MSSQL — TDS pre-encryption captures
the user but never sees the password byte) intentionally don't feed
Credential — they're recon signals, not cred attempts.
40 tests pass in the touched scope. New cases: secret_kind dedups
independently in the repo; Postgres MD5 + VNC DES emitters thread
through; MQTT round-trips through the native branch.
Phase 3/3 of DEBT-039. Now that all eight cred-emitting services
(SSH, Telnet, FTP, POP3, IMAP, SMTP, Redis, LDAP) emit the universal
`secret_b64`-bearing SD shape, the ingester's legacy fork has no
live emitters to handle. Deletes:
- `_ingest_credential_legacy()` — synthesized native fields from
username+password
- The `elif _fields.get("username") and _fields.get("password")`
branch in `_extract_bounty`
- `_printable_filter()` — only the legacy adapter called it; the
native branch trusts the emitter (encode_secret() in Python or
sd_escape() in C) to have already sanitized
- The legacy-adapter test cases in tests/web/test_ingester.py;
their coverage moved to tests/services/test_cred_emitters.py
per-service in Phase 2
The cred path is now single-shape end-to-end. A pre-migration log
row carrying only username+password silently produces no Credential
write — by design, since no current emitter writes that shape and
keeping a code path alive for theoretical legacy data risks masking
emitter regressions. Pre-v1: any historical Bounty cred rows from
before commit 2f47f67 stay untouched.
DEBT-039 marked resolved with summary of the three commits and the
silent-loss bug fix for Redis + LDAP that fell out of execution.
Phase 2/3 of DEBT-039. Switches FTP, POP3, IMAP, SMTP, Redis, and
LDAP from the legacy `username=` + `password=` SD-block shape to the
universal credential shape (`principal=` + `secret_printable=` +
`secret_b64=`) the new Credential storage model expects.
Pattern is uniform across all six services:
_log("auth_attempt", username=u, principal=u, **encode_secret(pw))
Each service emits the canonical SD keys. The ingester's native-shape
branch (introduced in 2f47f67) now writes their cred attempts
directly without going through the legacy adapter. Once Phase 3
removes the adapter the contract becomes single-shape.
Per-service notes:
- POP3 / IMAP — `status="success"|"failed"` renamed to
`outcome="success"|"failure"` to match Credential.outcome's
vocabulary; the ingester reads outcome directly.
- SMTP — AUTH path migrated; in addition the existing mail_from
event now exposes a parsed `domain=` field alongside the original
`value=` so future "what domains do attackers spoof from" analytics
have an indexed field. Not stored in Credential — regular Log row.
- Redis — was silently dropped by the legacy adapter (no `username`
field). Native branch handles `principal=None` correctly. BONUS
FIX: the Redis 6+ ACL syntax `AUTH <user> <pw>` now captures the
ACL username as principal (was previously discarded).
- LDAP — was silently dropped by the legacy adapter (no `password`
recognition for the `bind` event). Now lands as
`principal=<dn>`. BONUS FIX.
Tests (tests/services/test_cred_emitters.py, 9 cases):
- per-service native-shape ingest path produces correct Credential
rows; outcome maps for POP3/IMAP; principal=None for legacy Redis
AUTH; principal=dn for LDAP.
- mail_from event does NOT trigger a credential write (it's a
Log-only observation, not auth).
- 0xff/NUL/ANSI bytes in passwords survive losslessly through
secret_b64 even when secret_printable is sanitized.
Phase 3 deletes the legacy adapter once all migrations land — the
adapter has no live emitters to handle anymore.
Phase 1/3 of DEBT-039. Adds the Python emitter-side counterpart to
auth-helper.c's sd_escape + base64 logic so service templates can
emit the universal credential SD shape with a single spread:
_log("auth_attempt", principal=user, **encode_secret(password))
secret_printable mirrors the C helper's contract (bytes outside
[0x20, 0x7f) become '?'); secret_b64 preserves the ORIGINAL utf-8
bytes losslessly so non-ASCII or control-byte payloads survive as
fingerprinting signal even when the printable form sanitizes them.
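A minimal sketch of the helper's contract as described above. This is
an assumption about its shape, not the shipped code; it omits any
RFC 5424 escaping the real helper may apply:

```python
import base64


def encode_secret(secret: str) -> dict:
    """Return the two SD fields for one attempted secret (sketch only)."""
    raw = secret.encode("utf-8")
    # Bytes outside [0x20, 0x7f) collapse to '?' in the printable form.
    printable = "".join(chr(b) if 0x20 <= b < 0x7F else "?" for b in raw)
    return {
        "secret_printable": printable,
        # The b64 form keeps the exact bytes, including NUL and non-ASCII.
        "secret_b64": base64.b64encode(raw).decode("ascii"),
    }
```

Spreading the result with `**encode_secret(password)` yields both keys
in one call, which is the usage the `_log(...)` line above relies on.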
The canonical syslog_bridge.py is what _sync_logging_helper()
propagates into per-template build contexts at deploy time, so any
service that imports its local syslog_bridge picks this up
automatically on next rebuild.
Phase 2 migrates the six cred-emitting service templates (FTP, POP3,
IMAP, SMTP, Redis, LDAP) onto this helper. Phase 3 deletes the
ingester's legacy adapter once nothing emits the old shape.
Replaces the opaque Bounty.bounty_type='credential' path with a
dedicated `credentials` table whose schema is forward-compatible
across every auth-bearing service in the fleet. Hoisted indexed
columns (secret_sha256, principal, service, attacker_ip) carry the
universal reuse-analytics signal; service-specific JSON keys ride
in `fields`. Cross-service reuse queries become an indexed lookup
on secret_sha256 instead of JSON_EXTRACT scans.
Schema decisions baked in (per ANTI):
- New `Credential` table, not extension to Bounty
- Hoisted `principal` column for cross-service principal-reuse
- Standardized JSON keys: every payload carries secret_b64 +
secret_printable + principal universally; service-specific extras
(user, domain, dn, mech, …) ride alongside
The auth-helper SD-block emits the new shape natively. The ingester
forks at _extract_bounty:
- Native shape (SSH/Telnet, future emitters): secret_b64 present →
direct upsert_credential
- Legacy shape (FTP/POP3/IMAP/SMTP today): username + password →
adapter synthesizes secret_{b64,sha256,printable} on the fly,
upserts into the same Credential table. Tracked as DEBT-039;
one-shot bridge until those service templates migrate.
Defense-in-depth across five layers (input validation):
- C helper: bytes outside [0x20, 0x7f) collapse to '?', RFC 5424
escape rules for \, ", ]; b64 preserves exact bytes
- Ingester native branch: rejects malformed secret_b64 (regex), drops
the credential row but keeps the underlying Log
- Ingester legacy adapter: same printable-ASCII filter as the C
code; sha256 + b64 over the original utf-8 bytes (lossless, even
when secret_printable is sanitized)
- DB column caps with truncation warning; sha256 always over the
full pre-truncation bytes so reuse queries match across truncation
- JSON serialized with ensure_ascii=True so utf8mb4 columns stay
safe even with non-ASCII service-specific keys
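A hedged sketch of two of the layers above. The commit validates
secret_b64 with a regex; this sketch swaps in the stdlib's strict
base64 decoder. MAX_PRINTABLE and both function names are hypothetical:

```python
import base64
import binascii
import hashlib
from typing import Optional

MAX_PRINTABLE = 256  # hypothetical column cap, not the real schema value


def decode_secret_b64(value: str) -> Optional[bytes]:
    """Return the decoded bytes, or None when the field is malformed."""
    try:
        return base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return None


def hoisted_columns(raw: bytes, printable: str) -> dict:
    """Hash the full bytes first, then truncate only the display form."""
    return {
        "secret_sha256": hashlib.sha256(raw).hexdigest(),
        "secret_printable": printable[:MAX_PRINTABLE],
        "truncated": len(printable) > MAX_PRINTABLE,
    }
```

Because the hash is taken before truncation, two rows whose printable
forms were cut at the cap still match on secret_sha256 only when the
full secrets are identical.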
Bounty.bounty_type='credential' is no longer written. Pre-v1: no
historical backfill; existing rows stay untouched but unused.
595 tests pass; new tests cover the model + repo (upsert dedup,
null-principal independence, cross-service reuse, filters), both
ingester branches, b64 validation, sanitization preserving the
fingerprinting signal in b64.
Promotes auth-helper.c to decnet/templates/_shared/auth-helper/ and
adds _sync_auth_helper_sources() — mirrors the existing sessrec sync
pattern that keeps shared sources in step with per-template build
contexts.
Telnet's image grows the same multi-stage musl build, COPY of the
static helper into /usr/sbin/auth-helper, and prepended pam_exec line
in /etc/pam.d/login. Pulls in the `login` package (real Debian
PAM-aware /bin/login, replacing busybox's PAM-less applet) and
libpam-modules transitively for pam_exec.so.
Verified inside the rebuilt telnet image:
- /bin/login is the real 53KB Debian binary (PAM-aware)
- /etc/pam.d/login top line is the auth-helper hook
- pam_exec.so present at /usr/lib/x86_64-linux-gnu/security/pam_exec.so
- helper smoke-run emits correct RFC 5424 line for `telnetpw` →
password_b64="dGVsbmV0cHc="
SSH Dockerfile updated to read auth-helper.c from auth-helper/
subdirectory so both templates use the synced layout. The canonical
source lives in _shared/; per-template copies are tracked in git AND
synced at deploy time so a drift on either side rebases on the next
deploy.
Closes the telnet half of DEBT-038's #5 follow-up.
Real OpenSSH doesn't log attempted passwords (only success/failure
with username), leaving SSH the sole auth-bearing service in the
fleet that contributes nothing to the cred corpus that FTP/MySQL/RDP/
VNC/etc. populate. Closes that gap with a tiny pam_exec shim.
A static C helper (~80 LoC, musl, ~38KB stripped) is wired into
/etc/pam.d/sshd as `auth optional pam_exec.so expose_authtok stdout
/usr/sbin/auth-helper`. pam_exec writes the attempted password to
the helper's stdin NUL-terminated; the helper formats an RFC 5424
line in the exact shape templates/syslog_bridge.py produces
(facility local0, PEN 55555, MSGID auth_attempt — same MSGID FTP
uses) and writes it to /proc/1/fd/1 so the existing collector
stdout-reader pipeline picks it up.
Two password fields ride in the SD-block:
- password= RFC 5424 escaped, ASCII-printable only, ? for non-
printables. FTP-compatible — existing dashboard
rendering picks up SSH attempts unchanged.
- password_b64= base64 of the exact PAM_AUTHTOK bytes. Preserves
NUL/0xff/control-byte fingerprinting signal that the
plain field necessarily drops.
Fail-open by design: the PAM line is `optional` so a malfunctioning
helper never blocks sshd auth. Better to miss a cred than break the
honeypot.
Verified end-to-end inside the rebuilt image:
- 38KB static ELF, runs without a dynamic linker
- correct RFC 5424 line for `hunter2` → b64 `aHVudGVyMg==`
- NUL truncation matches pam_exec's contract
- 0xff bytes survive losslessly through password_b64
- empty password produces a well-formed line (e.g. pubkey auth path)
Attacker list cards gain an AS<number> chip with the AS description
on hover. Attacker detail page adds an AS row beside ORIGIN — same
shape as the existing country/source pair so operators can read
"this attacker is in DE on AS24940 Hetzner" at a glance instead of
having to grep the IP into a separate tool.
Both fields collapse to "unknown" when the IP isn't BGP-announced
(CGNAT, dark space, RFC1918), matching the existing pattern for
country resolution.
Adds asn (int), as_name (varchar 128), asn_source (varchar 16) to
the Attacker SQLModel — direct columns, no _migrate_* helper per
feedback_no_new_migrations_prev1.
Profiler worker now calls decnet.asn.enrich_ip alongside the existing
geoip enrich_ip; both feed the upsert payload. Failure is non-fatal: if
either lookup throws or the IP is private/unannounced, the affected
fields stay None and the row still writes.
Both lookups are independent: a CGNAT address can have a country (RIR
allocation) but no ASN (no BGP origin), and vice-versa for unrouted
RIR-allocated space. Storing them separately preserves that signal.
Mirrors decnet/geoip/ end-to-end: paths/base/factory/lookup at the
package level, iptoasn/ subpackage holds the data-source-specific
fetch+parse+provider. AsnLookup is bisect-indexed over (start, end,
AsnInfo) ranges with a pickled cache invalidated on raw-file mtime
bump.
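A minimal sketch of a bisect-indexed range lookup of the kind
described. The AsnInfo fields and constructor shape are assumptions,
the pickled cache is omitted, and the sample range uses a
documentation prefix with a documentation ASN:

```python
import bisect
import ipaddress
from typing import NamedTuple, Optional


class AsnInfo(NamedTuple):
    asn: int
    name: str


class AsnLookup:
    def __init__(self, ranges):
        # ranges: iterable of (start_int, end_int, AsnInfo), non-overlapping.
        self._ranges = sorted(ranges, key=lambda r: r[0])
        self._starts = [r[0] for r in self._ranges]

    def lookup(self, ip: str) -> Optional[AsnInfo]:
        n = int(ipaddress.ip_address(ip))
        # Rightmost range whose start is <= n, then confirm n is inside it.
        i = bisect.bisect_right(self._starts, n) - 1
        if i < 0:
            return None
        start, end, info = self._ranges[i]
        return info if start <= n <= end else None


net = ipaddress.ip_network("192.0.2.0/24")
table = AsnLookup(
    [(int(net.network_address), int(net.broadcast_address), AsnInfo(64500, "EXAMPLE-AS"))]
)
```

Each lookup is one binary search over the sorted starts plus one bounds
check, so a miss (unannounced space) falls out as None.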
Why iptoasn (and not bgp.tools / Team Cymru): public-domain dump,
zero attribution, no UA mandate, daily refresh — keeps DECNET stealth
intact (the geoip/rir module's "never identify as DECNET" comment
applies the same way here). bgp.tools' ToS would have required an
identifying UA, conflicting with feedback_stealth.
Public surface: decnet.asn.enrich_ip(ip) -> (asn, name, source) or
all-None on miss/disabled. Same shape as decnet.geoip.enrich_ip so
the profiler can compose them in one call site.
Renders the swarm host (or "master") that a topology is deployed to,
both as a meta line on each topology list card and in the war-map
header. Operators can now distinguish master-local from agent-targeted
topologies at a glance — previously the only signal was the abstract
"mode: agent" label, with no hint of which agent.
Adds useSwarmHosts() hook for the uuid → host lookup. Falls back to a
short uuid prefix when the hosts list is unavailable so the UI never
hard-fails on a missing /swarm/hosts response.
TopologySummary gains target_host_uuid in the frontend type so the
field actually narrows when checked.
Adds resolve_lan_host(lan, topology) and partition_lans_by_host(h)
in topology.persistence — the single source of truth every per-host
caller (deployer, mutator, validator) consults to decide where a LAN
belongs. Resolution: lan.host_uuid → topology.target_host_uuid →
None (master).
Adds validator rule BRIDGE_HOST_SPLIT: a multi-homed (bridge) decky
attached to LANs that resolve to different hosts is rejected at
deploy-time. A bridge decky is one container with NICs into multiple
LANs; under the co-locate constraint (no overlay network), all those
LANs must share a host.
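The resolution order and the BRIDGE_HOST_SPLIT rule can be sketched as
follows. The dataclasses are stand-ins for the real models and
bridge_host_split is a hypothetical name for the validator check:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Lan:
    name: str
    host_uuid: Optional[str] = None


@dataclass
class Topology:
    target_host_uuid: Optional[str] = None


def resolve_lan_host(lan: Lan, topology: Topology) -> Optional[str]:
    """lan.host_uuid, else topology.target_host_uuid, else None (master)."""
    if lan.host_uuid is not None:
        return lan.host_uuid
    return topology.target_host_uuid


def bridge_host_split(decky_lans: List[Lan], topology: Topology) -> bool:
    """True when a multi-homed decky's LANs resolve to more than one host."""
    return len({resolve_lan_host(lan, topology) for lan in decky_lans}) > 1
```

A bridge decky attached to one unpinned LAN and one LAN pinned to a
different host would trip the check; pinning both to the same host
would not.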
Adds nullable LAN.host_uuid (FK swarm_hosts.uuid). Resolution order
when deploying a LAN: lan.host_uuid → topology.target_host_uuid →
master. A LAN is one Docker bridge so the bridge cannot span hosts;
this pin forces every decky in the LAN onto the named host.
LANCreateRequest / LANUpdateRequest accept host_uuid; both validate
that the host exists, returning 400 on unknown UUIDs. PATCH still
gated by the existing pending-only guard, so reassignment of a live
LAN is not yet possible (deferred to mutator support).
LANRow surfaces the field so the frontend can render per-host badges.
AgentClient now verifies the worker's TLS cert fingerprint against
SwarmHost.client_cert_fingerprint at __aenter__ time, on top of CA
validation. Required before fanning master-orchestrated topology
deploys out across multiple swarm hosts: CA pinning alone allows any
cert signed by the master CA, which is too coarse once a single
deploy can target N hosts.
Mismatch raises FingerprintMismatchError so callers can distinguish
"wrong worker on the wire" from a transport hiccup.
Pre: optimistic placeholders for enqueued LAN-add mutations were
indistinguishable from regular not-yet-deployed nets — same dim
mono chrome, same dotted border. User couldn't tell whether a drop
had been queued or had silently failed and re-stacked over an
existing LAN.
Tag the placeholder with `pending: true`, render it in the same
amber the REAP button uses (var(--warn, #e0a040)) with a 'PENDING'
chip-mini in the head. Visual is loud enough that there is no
chance of confusion with INACTIVE (dimmed) or regular pending-state
LANs (mono).
Reconciliation is the existing refetch pumping setNets(h.nets) on
SSE — no extra plumbing needed; placeholders disappear naturally
when the mutator's applied event lands and the canvas re-hydrates
from the server.
Two bugs sharing the same root cause: Net only carried a label
string, set to lan.name.toUpperCase() everywhere. Backend mutator
ops look up LANs by canonical lowercase name, so passing the
uppercase label through attachEdge / detachEdge / addDeckyToLan /
deleteLan failed with "LAN 'SUBNET-XXXX' not found".
Add Net.name (canonical, lowercase) alongside Net.label (display).
Every backend call site now passes name; toasts and drag ghosts
keep label.
Second bug — new LANs stacking on top of each other on live
topologies — fell out of the same UX path: createLan returns
'enqueued' when the topology is active/degraded, the existing
early-return skipped local-state insertion, so the next drop
recomputed the same grid index. Now we drop a placeholder Net
with id 'pending-lan-<name>' immediately on enqueue. Grid index
advances and the user gets a visual ack right away; SSE replaces
the placeholder by canonical id when the mutator applies it.
MySQL ERROR 1093 forbids referencing the UPDATE target inside a
subquery; the existing UPDATE ... WHERE id = (SELECT id FROM
topology_mutations ...) form blew up on every mutation claim under
the MySQL backend, so no mutation ever progressed past pending.
Wrap the inner SELECT in a derived table (SELECT id FROM (...) AS
_next). MySQL materialises the derived rowset before applying the
UPDATE, sidestepping 1093. SQLite accepts both forms, so the
single-statement atomic claim semantics are preserved on both
backends — racing watchers still serialise correctly.
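A sketch of the derived-table form, run against in-memory SQLite only
to show the statement shape. The status column, its values, and the
ORDER BY are assumptions about the table, not the real schema:

```python
import sqlite3

CLAIM_SQL = """
UPDATE topology_mutations
   SET status = 'claimed'
 WHERE id = (
       SELECT id FROM (
           SELECT id FROM topology_mutations
            WHERE status = 'pending'
            ORDER BY id
            LIMIT 1
       ) AS _next
 )
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE topology_mutations (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO topology_mutations (status) VALUES (?)",
    [("pending",), ("pending",)],
)
conn.execute(CLAIM_SQL)  # claims exactly one pending row
rows = conn.execute("SELECT id, status FROM topology_mutations ORDER BY id").fetchall()
```

The extra SELECT ... AS _next layer is the only difference from the
form MySQL rejects; the claim is still a single UPDATE statement.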
deploy_topology and teardown_topology are async, but every
_compose_with_retry / _compose call inside them was running in the
main event loop via subprocess.run — which means a multi-minute
docker compose --build froze the entire API: other endpoints,
mutator events, SSE streams, status polls. The user noticed when a
2-decky deploy blocked everything else for the duration of the build.
Wrap both calls in anyio.to_thread.run_sync. Same pattern the
mutator engine has been using at engine.py:104 since forever.
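A self-contained sketch of the offloading pattern. It uses the
stdlib's asyncio.to_thread as a stand-in for anyio.to_thread.run_sync,
and blocking_build is a hypothetical placeholder for the blocking
compose call:

```python
import asyncio
import time


def blocking_build(seconds: float) -> str:
    """Stand-in for a blocking subprocess.run([...compose build...])."""
    time.sleep(seconds)
    return "built"


async def deploy(seconds: float) -> str:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(blocking_build, seconds)


async def demo() -> tuple:
    ticks = 0

    async def heartbeat() -> None:
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.01)

    task = asyncio.create_task(heartbeat())
    result = await deploy(0.2)
    task.cancel()
    return result, ticks
```

The heartbeat task keeps ticking while the build sleeps, which is the
property the API lost when the call ran directly on the loop.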
Per-LAN bridge create/remove docker SDK calls are still synchronous
in the loop — they're individually fast (~50-200ms per LAN) and
the loops are bounded by topology size, so they don't dominate.
Worth revisiting if a 200-LAN deploy turns out to stall noticeably.
The api unit's ProtectHome=read-only made the user's HOME read-only
inside the unit's namespace. docker compose --build then tried to
write ~/.docker/buildx/activity/* and got EROFS — which we'd been
misdiagnosing as a buildx wedge for the last few iterations.
Real fix: set DOCKER_CONFIG and BUILDX_CONFIG in the unit's
Environment= to a path inside ReadWritePaths. Hardening stays on,
docker CLI writes to install_dir/.docker instead of /home/<user>/.docker.
The wedge classifier now detects this case (count==0 + /home/ in
the stderr path) and emits a recipe pointing at the env-var fix
instead of the driver-rebuild path. Test added.
Wiki gets the new branch first since it's the most common cause
on systemd-managed installs.
'docker buildx create --name default' errors with 'default is a
reserved name and cannot be used to identify builder instance'.
The bundled builder always exists under that name; the recipe
should switch to it (buildx use default), not try to recreate it.
For the count==0 driver-rebuild branch, the new builder needs a
non-reserved name — using 'decnet-builder' as the example.
The hint was one-size-fits-all and pointed at prune+restart even
when zero mounts were leaked — a false positive caused by matching
any stderr containing the activity-dir path.
Two changes:
1. Tighten the wedge classifier. Both the buildx-specific phrase
('failed to update builder last activity time') AND the EROFS
marker ('read-only file system') must appear in stderr. Either
alone is now treated as a normal transient error and retried.
2. Branch the recipe on _count_leaked_buildkit_mounts():
* count > 0 → unmount loop + daemon stop + umount -l
(prune+restart alone doesn't evict held mounts)
* count == 0 → rebuild the buildx driver (rm builder state,
buildx create --use, inspect --bootstrap)
Original compose stderr is now preserved in the hint as
'Original error: ...' so the user sees both the recipe and what
compose actually said.
Tests cover both branches plus a negative case (unrelated EROFS).
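The tightened classifier and the branch on leaked-mount count can be
sketched like this. Function names and the returned branch labels are
illustrative; the mount parsing assumes the standard /proc/self/mounts
field order (device, mount point, ...):

```python
WEDGE_PHRASE = "failed to update builder last activity time"
EROFS_MARKER = "read-only file system"
MOUNT_PREFIX = "/var/lib/docker/tmp/buildkit-mount"


def is_buildx_wedge(stderr: str) -> bool:
    """Both markers must be present; either one alone is a normal error."""
    text = stderr.lower()
    return WEDGE_PHRASE in text and EROFS_MARKER in text


def count_leaked_buildkit_mounts(mounts_text: str) -> int:
    """Count mount points under the buildkit tmp prefix."""
    count = 0
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1].startswith(MOUNT_PREFIX):
            count += 1
    return count


def recovery_branch(stderr: str, mounts_text: str) -> str:
    if not is_buildx_wedge(stderr):
        return "retry"
    if count_leaked_buildkit_mounts(mounts_text) > 0:
        return "unmount-leaked-mounts"
    return "rebuild-buildx-driver"
```

Passing the mounts text in as a string keeps the sketch testable
without touching the host's real mount table.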
str(CalledProcessError) is just 'Command ... returned non-zero exit
status N' — the stderr (where the buildx recovery hint lives) was
being silently dropped from both the deploy log line and the
persisted 'failed' status reason.
New _format_subprocess_error helper appends .stderr when the
exception is a CalledProcessError. Applied to transition_status
reason and the background-deploy log message so operators and the
UI see the real failure, not just the exit code.
This is what makes the buildx preflight hint from 86b9dec actually
reach the user.
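A minimal sketch of such a helper, assuming it only needs to append
stderr when present; the real function's exact formatting is not shown
in this log:

```python
import subprocess


def format_subprocess_error(exc: BaseException) -> str:
    """str(exc), plus captured stderr when exc is a CalledProcessError."""
    message = str(exc)
    if isinstance(exc, subprocess.CalledProcessError) and exc.stderr:
        stderr = exc.stderr
        if isinstance(stderr, bytes):
            stderr = stderr.decode("utf-8", errors="replace")
        message = f"{message}\n{stderr.strip()}"
    return message


err = subprocess.CalledProcessError(
    1, ["docker", "compose", "build"], stderr="read-only file system\n"
)
formatted = format_subprocess_error(err)
```

Other exception types pass through unchanged, so the helper is safe to
apply at every call site that records a failure reason.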
When Docker's buildx leaks bind-mounts from a failed build it starts
reporting 'read-only file system' on its own activity file, even
though nothing is actually read-only. The user's host had 20+
leaked mounts before we noticed — each retry compounds the leak.
_compose_with_retry now:
* Pre-flight counts /var/lib/docker/tmp/buildkit-mount* entries in
/proc/self/mounts; if >= 10 and the command is a build, refuses
to start and returns a clean recovery recipe instead of retrying.
* On mid-build failures that match the wedge signature
('failed to update builder last activity time' or the activity-dir
path in stderr), short-circuits the retry loop with the same
recipe. The first occurrence no longer needs a pre-flight; the
pre-flight catches repeat attempts.
Recipe points at 'docker buildx prune -af && sudo systemctl restart
docker', which is what actually clears the leaked mounts.
Tests cover all three paths: wedge preflight blocks builds, non-build
commands (down/stop) ignore the preflight, mid-build signature
detection kills the retry loop. A new autouse fixture stubs the
wedge-detector to 0 so dev-host state doesn't poison the mocked
subprocess tests.
Wiki companion commit adds Troubleshooting → 'Buildx leaked mounts'.
Port-to-port edges previously lived only in the editor's local state
— the backend's edge model is decky<->LAN membership, so the deploy
validator still saw cross-LAN pairs as orphans. Drawing a line from
dmz-gateway to a decky in subnet-d6b2 did nothing that a later
DMZ_ORPHAN check could see.
Now onAddEdge inspects endpoints: same-LAN stays visual (no bridge
to create), cross-LAN calls attachEdge with the source decky and
the target LAN, multi-homing the decky so the validator's LAN
adjacency scan threads through it. The viz edge stores the returned
backendEdgeId; removeEdge detaches that membership before dropping
the local edge. Observed entities (attacker-pool) are read-only and
never bridge.
A toast ("BRIDGED <decky> -> <lan>") surfaces the backend-persistent
side of the gesture so the user knows it's not just a cosmetic line.
POST /topologies raised a 500 with a raw SQLAlchemy IntegrityError
traceback when the name collided with an existing topology. Catch
the error at the router, verify it's the ix_topologies_name
constraint (so unrelated integrity failures still surface as 500s
with their real traceback), and return 409 with a helpful detail.
Test covers the create-then-duplicate-create flow.
The .maze-edge-dash CSS animation invalidates each path's bounding
box every frame. Inter-LAN paths span the viewport so invalidations
overlap, and past ~60 edges the compositor spends every frame
repainting — the dominant cost on the 12+ LAN screenshot, even
dwarfing pan-drag overhead.
Drop the animation class when edges.length > 60. Edges stay fully
visible and traffic-tinted, just static. A MOTION: OFF segment in
the status bar surfaces the auto-disable so it doesn't look like a
broken animation.
Threshold is a constant in Canvas.tsx; if it needs to become a
user toggle later, lift it to state + localStorage in one place.
A 30-LAN generate request already fits in 172.20.0.0/16, but trees
with depth/branching that multiply past 256 (e.g. depth=6,
branching=4 ≈ 5k LANs) hit AllocatorExhausted before the first
write.
SubnetAllocator now accepts a full CIDR base ("172.16.0.0/12" →
4096 /24s) in addition to the legacy two-octet shorthand ("172.20",
auto-lifted to /16). The parent must be ≤/24; a /24 base yields
exactly one slot. Iteration order is preserved for /16 bases so
existing topologies keep their third-octet sweep; /12 adds a
second-octet dimension underneath.
Defaults bumped to 172.16.0.0/12: TopologyConfig.subnet_base_prefix,
/next-subnet query param, and the mutator's add-LAN fallback. The
field pattern widens to accept CIDR. create-blank and manual LAN
CRUD still use "10.0" (lifts to /16) — one DMZ LAN per topology,
256 is plenty.
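A sketch of the base parsing and /24 iteration using the stdlib
ipaddress module. parse_base and iter_slots are hypothetical names,
not SubnetAllocator's real API:

```python
import ipaddress


def parse_base(base: str) -> ipaddress.IPv4Network:
    """Accept a full CIDR, or lift the legacy two-octet shorthand to /16."""
    if "/" not in base:
        base = f"{base}.0.0/16"
    net = ipaddress.ip_network(base, strict=True)
    if net.prefixlen > 24:
        raise ValueError("base must be /24 or wider")
    return net


def iter_slots(base: str):
    """Yield every /24 inside the base, lowest address first."""
    yield from parse_base(base).subnets(new_prefix=24)
```

For a /16 base this sweeps the third octet; for a /12 the third octet
sweeps first and the second octet advances each time it wraps.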
Pan/zoom previously drove a full Canvas re-render on every mousemove
via setPan() — at 30 LANs that's ~1000 SVG paths and div cards
re-evaluating 60 times a second while you drag. The browser screamed.
Three fixes, one surgical pass:
1. Pan drag writes the translate/scale transform directly to the
pan-layer DOM ref inside requestAnimationFrame; setPan is deferred
to mouseup. Grid pattern attributes (x/y/width/height) get the
same treatment so the backdrop stays glued to the canvas content.
Wheel zoom, resetPan, and zoomBy also sync refs + fire a write so
React-driven changes land in one frame.
2. Edge rendering swaps the nodes.find() inside .map() for a
Map<id, node> built once per render — O(E) instead of O(E·N).
NetBox + NodeCard are now wrapped in React.memo; Canvas hoists
the setSelection closures into useCallback so memo can actually
short-circuit instead of seeing a fresh prop every render.
3. Drag-a-single-node still mutates state and re-renders, but now
only the moved node rerenders — the other 89 skip via memo.
Everything that reads panRef.current (toWorld, context menu, drop
targeting) still sees the live value during drag because we mutate
the ref synchronously on each mousemove; only React state is lazy.
Route all lucide-react icon usage through a single src/icons.ts
re-export that imports each icon from its own per-icon module
(lucide-react/dist/esm/icons/<name>) instead of the barrel.
Bundle-size impact: none (29kB icons chunk unchanged — tree-shaking
was already effective with sideEffects:false). Dev-experience win:
Vite transforms 247 modules instead of 1848 because the dep
optimiser no longer pre-bundles the full lucide barrel — faster
cold start and HMR.
Ambient d.ts declares the wildcard module so TS accepts per-icon
imports; lucide ships .d.ts only for the barrel.
Eight icons were renamed upstream and still work through the barrel
via aliases (AlertTriangle -> triangle-alert, BarChart3 -> chart-column,
CheckCircle -> circle-check-big, Filter -> funnel, PlusCircle ->
circle-plus, Sliders -> sliders-vertical, UploadCloud -> cloud-upload,
Fingerprint -> fingerprint-pattern). Component call sites stay on
the legacy names; the renames live only in icons.ts.
Switch all navigable route components to React.lazy() and wrap
<Routes> in <Suspense>. Dashboard/Login/Layout stay eager since
they're the shell.
Initial index bundle drops 246kB -> 34.67kB (gzip 10.5kB). Each
route becomes its own 8-51kB chunk, loaded on demand.
Nav hover/focus triggers prefetchRoute(path) which fires the same
dynamic import() specifier the bundler dedups against React.lazy,
so the chunk is warm by the time the user clicks. Avoids the
Suspense flicker that would otherwise show on every first nav.
Single-bundle build was tripping vite's 500 kB warning per chunk and
forcing every user to re-download the entire app on every deploy.
Manual chunks split the bundle along natural library boundaries so:
- Rarely-changing vendor libs (react-dom, react-router, lucide-react,
asciinema-player) cache across deploys.
- App code lives in its own `index-*.js` that's the only chunk that
changes when we ship feature work.
Split shape (manualChunks fn in vite.config.ts):
- charts — recharts + d3-*
- player — asciinema-player
- icons — lucide-react
- router — react-router / react-router-dom
- react-dom, react
- vendor — everything else in node_modules
Resulting bundle sizes (raw, gzip in parentheses):
index (app): 246 kB (gz 63)
react-dom: 182 kB (gz 57)
player: 176 kB (gz 65)
router: 42 kB (gz 15)
vendor: 36 kB (gz 14)
icons: 29 kB (gz 10)
Every chunk under the 600 kB ceiling we now set explicitly. The old
~705 kB single-chunk deploy is gone. No code changes — config only.
Five was still too loud on AttackerDetail when rotation is in play.
One inline is enough to read at a glance; everything else goes
behind the expand button. Rotation tag keeps carrying the count so
no signal is lost.
`for i in $(seq 1 100); do curl -H "X-Forwarded-For: 191.100.20.$i" ...`
was dumping 100 distinct IPs into AttackerDetail's LEAKED IPs row,
drowning the rest of the ORIGIN section. The 100-IP wall is itself a
signal (WAF-bypass-list probing) that deserves a short badge, not a
flood.
Backend:
- get_attacker_ip_leaks gains `limit: int = 10` parameter — caller
only ever needs a sample, not the full set.
- New count_attacker_ip_leaks() returns the unbounded COUNT(*) via
one cheap SQL aggregate.
- Detail endpoint returns {ip_leaks: [first 10], ip_leaks_total: N}
so the UI can render a rotation badge independent of list length.
UI:
- New LeakedIPsRow component. First 5 distinct IPs rendered inline
with hover tooltips (unchanged). When > 5, a `+ N more` expand
button reveals the rest of the sample; when total exceeds the
10-row cap, a subtle `(+M beyond sample)` note appears.
- When total ≥ 20, a red `ROTATION · N` tag renders leading the
row with a tooltip explaining the semantic: "almost certainly
XFF-rotation / WAF-bypass probing, not a real attribution leak."
DB churn is deliberately not capped — 100k rows × ~500 B is tolerable.
If it becomes a problem we can add an ingester-side count-and-skip;
for now the UX fix is the whole story.
Added test_ip_leaks_total_reported_separately_from_list asserting
the endpoint shape matches what the UI consumes.
Every http_useragent bounty now carries a `category` label plus an
optional tool name and a signals list. The main analytic win is the
`nonstandard` bucket — UAs like "FUCKYOU/1.0" or custom one-off
scanner labels that don't match any known pattern, which today
silently blend into the generic fingerprint list.
Buckets (priority order):
- scanner: nmap, nuclei, sqlmap, gobuster, nikto, masscan, zgrab,
ffuf, wpscan, katana, burp, acunetix, nessus, openvas, arachni,
whatweb, wappalyzer, etc.
- cli: curl, wget, httpie, xh, fetch.
- library: python-requests, aiohttp, httpx, urllib, Go stdlib, Java,
okhttp, Apache HttpClient, axios, node-fetch, got, undici, PHP,
Guzzle, Ruby stdlib, Faraday, .NET, PostmanRuntime, Insomnia, etc.
- bot: anything containing bot / crawler / spider / slurp / monitor
(catches Googlebot, bingbot, Baiduspider — many of which ship a
Mozilla/5.0 prefix, so the bot check runs BEFORE the browser
regex).
- browser: Mozilla/5.0-prefixed UAs that aren't bots.
- nonstandard: anything else. The interesting bucket.
- empty: literal empty User-Agent header.
Side signals computed regardless of category: suspicious_short (<8
chars), suspicious_long (>512 chars), nonprintable (control chars),
injection_like (SQLi / XSS / path-traversal / Log4Shell markers).
A sqlmap UA with a literal SQL-injection payload embedded fires
category=scanner + injection_like — the combination tells the
analyst the tool is being operated manually vs. on default config.
Classification is deterministic (same UA string → same tuple) so
add_bounty's payload-hash dedup continues to collapse repeat rows.
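A reduced sketch of the priority-ordered classification. The token
lists are small subsets of the ones named above, the injection regex
is a toy, and the function name and return shape are assumptions:

```python
import re

SCANNERS = ("nmap", "nuclei", "sqlmap", "gobuster", "nikto", "masscan")
CLI_TOOLS = ("curl", "wget", "httpie")
LIBRARIES = ("python-requests", "aiohttp", "okhttp", "axios")
BOT_RE = re.compile(r"bot|crawler|spider|slurp|monitor", re.IGNORECASE)
INJECTION_RE = re.compile(r"'\s*or\s+1=1|<script|\.\./|\$\{jndi:", re.IGNORECASE)


def classify_user_agent(ua: str):
    """Return (category, tool, signals); the same input gives the same tuple."""
    signals = []
    if len(ua) < 8:
        signals.append("suspicious_short")
    if len(ua) > 512:
        signals.append("suspicious_long")
    if any(ord(c) < 0x20 or ord(c) == 0x7F for c in ua):
        signals.append("nonprintable")
    if INJECTION_RE.search(ua):
        signals.append("injection_like")
    signals = tuple(signals)

    if ua == "":
        return ("empty", None, signals)
    low = ua.lower()
    for category, tokens in (("scanner", SCANNERS), ("cli", CLI_TOOLS), ("library", LIBRARIES)):
        tool = next((t for t in tokens if t in low), None)
        if tool:
            return (category, tool, signals)
    # The bot check runs before the browser prefix check on purpose.
    if BOT_RE.search(ua):
        return ("bot", None, signals)
    if ua.startswith("Mozilla/5.0"):
        return ("browser", None, signals)
    return ("nonstandard", None, signals)
```

Signals are computed before the category is chosen, so a scanner UA
carrying an injection payload reports both facts.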
UI renderer upgraded from FpGeneric to a dedicated FpUserAgent that
colours the category tag by risk (scanner=alert-red,
nonstandard=warn-yellow, browser=accent-green, etc.) and renders
each signal as its own chip. Makes the interesting rows pop in the
fingerprints panel.
Also fixed: the ingester was using `_headers.get("User-Agent") or
_headers.get("user-agent")`, which short-circuits away empty-string
UAs. An explicit empty UA is itself a signal (real clients always
send something) — now captured.
An attacker hitting /admin with `X-Forwarded-For: 127.0.0.1` was
previously flagged as an IP leak. It isn't — that's the classic
IP-allowlist / WAF-bypass payload ("treat me as localhost and skip
your auth checks"). Misclassifying it as "LEAKED IPs" in the UI
confuses analysts and burns trust in the signal.
Split by claim category. After pulling the left-most claimed IP
from the proxy header, classify:
- public (routable) → bounty_type=ip_leak (real attribution leak;
the attacker's upstream proxy forwarded their real IP).
- loopback / private / link-local / multicast / reserved /
unspecified → bounty_type=fingerprint, fingerprint_type=
spoofed_source (WAF-bypass / allowlist-probing attempt; the
attacker is telling us they know what XFF does).
- unparseable → dropped.
Same extraction pipeline; diverges only at the last step. A new
shared _classify_proxy_header_claim returns (kind, payload);
_detect_ip_leak keeps its public-only contract for backward-
compat; _detect_spoofed_source is the new sibling.
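A sketch of the split using the stdlib ipaddress flags. The function
name, payload keys, and category check order are assumptions; the real
classifier's signature is not shown in this log:

```python
import ipaddress

# Checked in this order because several flags overlap (e.g. loopback
# addresses are also reported as private by the ipaddress module).
NON_PUBLIC_CATEGORIES = (
    "loopback", "link_local", "multicast", "unspecified", "reserved", "private",
)


def classify_proxy_header_claim(header_value: str):
    """Return ('ip_leak' | 'spoofed_source', payload), or None if unparseable."""
    first = header_value.split(",")[0].strip()  # left-most claimed IP
    try:
        ip = ipaddress.ip_address(first)
    except ValueError:
        return None
    for category in NON_PUBLIC_CATEGORIES:
        if getattr(ip, f"is_{category}"):
            return ("spoofed_source", {"claimed_ip": str(ip), "claim_category": category})
    return ("ip_leak", {"real_ip_claim": str(ip)})
```

Because a claim yields exactly one kind, the two detectors built on it
cannot both fire for the same header.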
UI renderer FpSpoofedSource shows the claimed IP in warn color with
the claim_category tag (LOOPBACK / PRIVATE / ...) and a WAF-BYPASS
ATTEMPT badge — distinct visual from the "LEAKED IPs" row which
stays reserved for genuine public-IP leaks.
Test addresses updated: RFC 5737 doc ranges (198.51.100.0/24,
203.0.113.0/24) are flagged `is_private` in Python's ipaddress
module, so they now correctly belong to the spoof bucket; tests
that meant to exercise real public IPs now use 8.8.8.8 / 1.1.1.1
(Google and Cloudflare DNS). Added eleven new tests locking the
classifier + the two detectors' mutual exclusion.
add_bounty dedups on (attacker_ip, bounty_type, full payload JSON).
Three fingerprint-family bounties (http_useragent, ip_leak,
http_quirks) were including method/path / header_count in their
payloads — fields that vary per request — so a scanner hitting 100
paths produced 100 rows instead of 1, which is what was swelling
AttackerDetail.
Payloads now carry identity-only fields:
- http_useragent: {fingerprint_type, value}. UA + path combinations
no longer produce separate rows; one row per distinct User-Agent string.
- ip_leak: {source_ip, real_ip_claim, source_header, headers_seen}.
One row per distinct (proxy source, leaked IP, leaking header)
triple; repeat hits with the same header on different paths dedup.
- http_quirks: {fingerprint_type, order_hash, order, casing_hash,
casing_category, stable_count, tool_guess}. No more header_count
(included volatile headers; Cookie-presence variance broke dedup).
Per-request context (path, method, etc.) was never load-bearing for
analysts — the logs table already answers "when + where" at
per-event resolution. The bounty table is for stable identity.
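A small sketch of why the volatile fields defeated dedup, assuming the
key is (attacker_ip, bounty_type, canonical payload JSON). `dedup_key`
and `count_rows` are hypothetical names, and the set stands in for the
bounty table:

```python
import json
from typing import Any, Dict, Set, Tuple

DedupKey = Tuple[str, str, str]


def dedup_key(attacker_ip: str, bounty_type: str, payload: Dict[str, Any]) -> DedupKey:
    # Canonical JSON (sorted keys) so key order cannot create false duplicates.
    return (attacker_ip, bounty_type, json.dumps(payload, sort_keys=True))


def count_rows(payloads) -> int:
    seen: Set[DedupKey] = set()
    for payload in payloads:
        seen.add(dedup_key("198.51.100.7", "fingerprint", payload))
    return len(seen)


paths = [f"/page/{i}" for i in range(100)]
volatile = [{"fingerprint_type": "http_useragent", "value": "curl/8.0", "path": p} for p in paths]
stable = [{"fingerprint_type": "http_useragent", "value": "curl/8.0"} for _ in paths]

print(count_rows(volatile), count_rows(stable))  # 100 1
```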
UI:
- FpHttpQuirks renderer drops the method/path footer line and the
header_count/duplicates tags; shows stable_count instead.
- LEAKED-IPs tooltip on AttackerDetail swaps "X on GET /path" for
"Leaked via X; source 203.0.113.42" — same information, stable.
Tests add a "payload stable across paths and methods" assertion on
http_quirks — locks the contract so a future regression that sneaks
a per-request field back in fails loudly.
Existing duplicate bounty rows don't retroactively collapse.
Dev: `decnet db-reset --i-know-what-im-doing drop-tables` and
restart. Prod: one SQL pass to dedup by (attacker_ip, bounty_type,
payload) — trivial but not automated.
Per-request HTTP fingerprint derived from the header dict we already
log. Captures:
- order_hash: SHA-256 prefix (16 hex) over the lowercased header-name
sequence, minus volatile/per-request headers (Content-Length,
Cookie, Authorization, XFF family, trace IDs). Stable identity for
a given client stack regardless of which target / path is hit.
- casing_hash: same shape but over the per-header casing category
(Title-Case / lower / UPPER / mixed). Attackers frequently spoof
User-Agent but forget their stack sends `user-agent` while browsers
send `User-Agent`.
- tool_guess: prefix match against curl / python-requests /
Go-http-client / nmap-nse signatures. Cheap, best-effort — the
hash is the hard signal.
- duplicates: reserved for when the HTTP template switches from
dict(request.headers) to a list form; today it always fires empty
because dict() collapses duplicates.
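A rough sketch of the order_hash computation described above. The
comma separator and the exact volatile set are assumptions made for
the example; only the SHA-256 / 16-hex-prefix shape comes from this
message:

```python
import hashlib
from typing import Iterable

# Assumed volatile set, taken from the examples named above; the real list may differ.
VOLATILE = {"content-length", "cookie", "authorization", "x-forwarded-for"}


def order_hash(header_names: Iterable[str]) -> str:
    """16-hex prefix of SHA-256 over the lowercased, non-volatile name sequence."""
    stable = [n.lower() for n in header_names if n.lower() not in VOLATILE]
    digest = hashlib.sha256(",".join(stable).encode("utf-8")).hexdigest()
    return digest[:16]
```

With this shape, adding or dropping a Cookie header leaves the hash
unchanged, while reordering the stable headers changes it.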
Payload is a fingerprint bounty (bounty_type="fingerprint",
fingerprint_type="http_quirks"). Bounty dedup collapses identical
hashes per attacker — one row per distinct fingerprint — so a chatty
scanner doesn't spam the vault, but a tool-chain change from the
same IP surfaces as a new row.
UI renderer (FpHttpQuirks) shows the two hashes, tool guess badge in
violet, casing/count tags, and a collapsible header-order list.
Added to the passiveTypes group so it nests with JA3/JA4L/etc. in
the AttackerDetail fingerprints panel.
One library note: the naive "title-case" classifier failed on tokens
like `X-Forwarded-For`. For the single-letter token `X`, `p[1:]` is the
empty string, and Python's "".islower() returns False, so the
`p[1:].islower()` check rejected it.
Fix: explicitly accept single-char tokens when uppercase.
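A sketch of a casing classifier with that fix applied. The function
name and the four category labels are illustrative, not necessarily
the ones the ingester uses:

```python
def casing_category(header_name: str) -> str:
    """Classify a header name as lower / upper / title / mixed."""
    if header_name.islower():
        return "lower"
    if header_name.isupper():
        return "upper"

    def is_title(part: str) -> bool:
        # "".islower() is False, so a one-letter uppercase token such as "X"
        # must be accepted explicitly.
        return bool(part) and part[0].isupper() and (len(part) == 1 or part[1:].islower())

    if all(is_title(part) for part in header_name.split("-")):
        return "title"
    return "mixed"
```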
Attackers routinely front their scanners with VPNs/proxies, so the
TCP source we log is the proxy egress, not the real host. But a
surprising number of attacker setups are misconfigured: the proxy
forwards the real IP in an X-Forwarded-For (or Forwarded / X-Real-IP
/ CDN-variant) header. From our side that's a free attribution leak.
New _detect_ip_leak extractor in decnet/web/ingester.py fires at
ingest time per HTTP request. Logic:
1. Require service=http, source_ip present, headers present.
2. If source_ip ∈ DECNET_TRUSTED_PROXIES (comma-separated IPs or
CIDRs) → legitimate reverse-proxy forwarding, skip.
3. Walk proxy-family headers in priority order: Forwarded (RFC 7239)
→ X-Forwarded-For → X-Real-IP → True-Client-IP → CF-Connecting-IP.
4. Extract the left-most parseable IP from the winning header.
5. If that IP differs from the TCP source → emit a bounty with
bounty_type="ip_leak" carrying {source_ip, real_ip_claim,
source_header, headers_seen, path, method}.
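Step 2 can be sketched like this, assuming DECNET_TRUSTED_PROXIES has
already been read into a string. `parse_trusted` and `is_trusted` are
hypothetical helper names, and silently skipping bad entries is a
choice made for the example:

```python
import ipaddress
from typing import List, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


def parse_trusted(raw: str) -> List[Network]:
    """Parse a comma-separated list of IPs or CIDRs; bad entries are skipped."""
    nets: List[Network] = []
    for token in raw.split(","):
        token = token.strip()
        if not token:
            continue
        try:
            # strict=False accepts host bits, so "10.1.2.3/8" is tolerated.
            nets.append(ipaddress.ip_network(token, strict=False))
        except ValueError:
            continue
    return nets


def is_trusted(source_ip: str, nets: List[Network]) -> bool:
    try:
        addr = ipaddress.ip_address(source_ip)
    except ValueError:
        return False
    return any(addr.version == net.version and addr in net for net in nets)
```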
Storage is the existing Bounty table — no schema change; de-dup is
handled by Bounty's (attacker_ip, bounty_type, payload_hash) key, so
repeat requests with the same leaked IP don't spam.
AttackerDetail renders a warn-accent "LEAKED IPs:" row under ORIGIN
listing distinct real_ip_claim values; hover tooltip shows the source
header + path of the most recent leak. Only shown when at least one
ip_leak bounty exists.
RFC 7239 Forwarded parser handles the full vocabulary — bare IPv4,
IPv4:port, quoted, IPv6 in brackets, IPv6 with port — returning only
IPs that actually parse.
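For illustration, a simplified Forwarded parser covering those forms.
This is a sketch, not the shipped parser: it splits on "," and ";"
without honouring quoted strings that contain those characters, which
is acceptable only because IP-valued `for=` nodes never do.

```python
import ipaddress
from typing import List


def _node_to_ip(node: str):
    """Turn one `for=` value into an ip_address, or None if it is not an IP."""
    node = node.strip().strip('"')
    if node.startswith("["):
        # Bracketed IPv6, optionally followed by :port
        host = node[1:].split("]", 1)[0]
    elif node.count(":") == 1:
        # IPv4:port
        host = node.split(":", 1)[0]
    else:
        host = node
    try:
        return ipaddress.ip_address(host)
    except ValueError:
        return None  # obfuscated identifiers such as "_hidden" or "unknown"


def forwarded_ips(header_value: str) -> List[str]:
    """Return parseable IPs from `for=` parameters, left to right."""
    found: List[str] = []
    for element in header_value.split(","):
        for pair in element.split(";"):
            name, sep, value = pair.strip().partition("=")
            if sep and name.strip().lower() == "for":
                ip = _node_to_ip(value)
                if ip is not None:
                    found.append(str(ip))
    return found
```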
Closes DEVELOPMENT.md "Network Topology Leakage → X-Forwarded-For
mismatches". Phase 3 of the three-phase Attacker Intelligence series
(phase 1, scanned-vs-interacted, and phase 2, PTR records, already shipped).
DECNET_TRUSTED_PROXIES env shape matches THREAT_MODEL DA-08's
"revisit when verified-proxy config lands" note — same token set
future rate-limit work will consume.
Resolve each attacker IP's rDNS name once at first sighting, store on
Attacker.ptr_record, render on AttackerDetail under ORIGIN. Many
attackers run infrastructure with forgotten rDNS that instantly
identifies them once surfaced: scan-node-42.shodan.io,
shady-vps.leasecloud.net, etc.
Resolver lives in decnet/geoip/ptr.py — colocated with enrich_ip
because the shape matches (take an IP, return supplementary
metadata, never raise). Uses the OS resolver via socket.gethostbyaddr
offloaded to the default executor, wrapped with asyncio.wait_for
timeout=2s so a slow authoritative NS can't stall the profiler tick.
Profiler side: _WorkerState grows a ptr_attempted: set[str] bounding
resolution to once per worker lifetime. Cold-start batches resolve
concurrently (Semaphore(_PTR_CONCURRENCY=10)) so a backlog doesn't
serialize 2s ceilings. _build_record gains a keyword-only ptr_record
parameter that, when _UNSET, omits the key from the record dict —
upsert_attacker's attribute-merge loop then preserves whatever's
stored on the row. Explicit None is a "fresh failed attempt" signal
and gets written through.
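The sentinel pattern in miniature. `build_record` and `merge` here are
simplified stand-ins for _build_record and upsert_attacker's merge
loop, written only to show the three cases (omitted, None, value):

```python
from typing import Any, Dict, Optional

_UNSET: Any = object()  # sentinel: "caller did not attempt a lookup"


def build_record(ip: str, *, ptr_record: Optional[str] = _UNSET) -> Dict[str, Any]:
    record: Dict[str, Any] = {"ip": ip}
    if ptr_record is not _UNSET:
        # Includes an explicit None, which records a fresh failed attempt.
        record["ptr_record"] = ptr_record
    return record


def merge(stored: Dict[str, Any], record: Dict[str, Any]) -> Dict[str, Any]:
    # Attribute-merge: only keys present in the new record overwrite the row.
    return {**stored, **record}
```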
Env kill-switch DECNET_PTR_ENABLED=false for locked-down deploys
where egress DNS is forbidden. Private / loopback / link-local /
multicast / reserved addresses short-circuit before any DNS call.
IPv6 reverse DNS works transparently through the stdlib resolver.
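A minimal sketch of a resolver with those properties. `resolve_ptr`
and `PTR_TIMEOUT_S` are names chosen for the example, and the
non-routable check is an approximation of the real short-circuit:

```python
import asyncio
import ipaddress
import socket
from typing import Optional

PTR_TIMEOUT_S = 2.0


async def resolve_ptr(ip: str, timeout: float = PTR_TIMEOUT_S) -> Optional[str]:
    """Best-effort reverse lookup: returns a hostname or None, never raises."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return None
    if not addr.is_global or addr.is_multicast or addr.is_reserved:
        return None  # non-routable: skip before any DNS call

    loop = asyncio.get_running_loop()
    try:
        # gethostbyaddr blocks, so run it on the default executor.
        host, _aliases, _addrs = await asyncio.wait_for(
            loop.run_in_executor(None, socket.gethostbyaddr, ip), timeout
        )
    except (asyncio.TimeoutError, OSError):
        return None
    return host or None
```

Note that on timeout the executor thread keeps running until the OS
resolver gives up; wait_for only stops the caller from waiting on it.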
Schema change — run once on upgrade:
ALTER TABLE attackers
ADD COLUMN ptr_record VARCHAR(256) NULL DEFAULT NULL;
Or drop-and-recreate on dev boxes (db-reset's SQLModel.metadata-driven
table discovery now picks it up automatically since ba155b7).
tests/conftest.py disables DECNET_PTR_ENABLED globally for the same
reason it disables DECNET_GEOIP_ENABLED — unit tests must never hit
the network. tests/geoip/test_ptr.py re-enables explicitly via an
autouse fixture.
Adds a new card on AttackerDetail: SCANNED · N services | INTERACTED
WITH · M services. Distinguishes port-scanners (N high, M=0) from
actual engagement (M>0) at a glance — the analyst's first question
when triaging a new attacker row.
Classifier lives in decnet/correlation/event_kinds.py, a single
source of truth for the event-type vocabulary:
- INTERACTION_EVENT_TYPES — command-family (command/exec/query/...),
SMTP engagement (mail_from/rcpt_to/message_accepted), file/payload
activity (file_captured/upload/download_attempt/retr), pub/sub
(publish/subscribe), recorded TTY sessions.
- NOISE_EVENT_TYPES — DECNET-internal (startup/shutdown/parse_error/
unknown_*).
- Everything else defaults to scan. Conservative by design: new
template verbs show up as "scanned" until explicitly promoted.
Bucket logic: a service is "interacted" if ≥1 of its events
classifies as interaction; otherwise "scanned" if ≥1 scan event;
noise-only services drop. Disjoint by construction.
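The bucket logic can be sketched as below. The two sets are small
illustrative subsets, not the canonical vocabulary, and
`bucket_services` is a hypothetical name:

```python
from typing import Dict, Iterable, Set, Tuple

# Illustrative subsets only; the canonical sets live in event_kinds.py.
INTERACTION_EVENT_TYPES = {"command", "exec", "query", "mail_from", "upload"}
NOISE_EVENT_TYPES = {"startup", "shutdown", "parse_error"}


def bucket_services(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Split (service, event_type) pairs into disjoint scanned/interacted sets."""
    interacted: Set[str] = set()
    scanned: Set[str] = set()
    for service, event_type in pairs:
        if event_type in INTERACTION_EVENT_TYPES:
            interacted.add(service)
        elif event_type not in NOISE_EVENT_TYPES:
            scanned.add(service)  # everything else defaults to scan
    # Interaction wins: a service with both kinds of event is not "scanned".
    return {"interacted": interacted, "scanned": scanned - interacted}
```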
Deliberate no-schema path: compute on-the-fly in the detail endpoint
via SELECT DISTINCT service, event_type FROM logs. Small result set
(tens of pairs per attacker), cost is trivial vs. the existing
behavior/commands queries. Trade-off: one more DB round-trip per
detail view in exchange for zero ALTER TABLE migration pain and
immediate classifier-change feedback loop.
Profiler's _COMMAND_EVENT_TYPES stays as-is (strict subset of
interactions that carry executable text), with a comment pointing at
the new canonical module.
Closes DEVELOPMENT.md "Attacker Intelligence §Service-Level Behavioral
Profiling — Services actively interacted with".
test_lifespan_db_retry patched decnet.web.api.asyncio.sleep to skip the
DB-retry backoff. Problem: asyncio is a shared module — the patch leaks
to every caller that looked up asyncio.sleep via `import asyncio`,
including run_health_heartbeat's own sleep loop. That heartbeat task
spawns inside the same lifespan; with its sleep mocked, the while-loop
spins tight, starves cancellation, and leaves an orphan task that
pytest-timeout eventually signals — surfacing as the 'Task exception
was never retrieved' warnings the user saw when running the suite.
Fix: give decnet.web.api a local binding `_retry_sleep = asyncio.sleep`
for the DB-retry wait, and have the test patch that instead. Narrowly
scoped, no impact on asyncio.sleep callers elsewhere.
Test timing before: 12s with --timeout=10 (interrupted by signal).
Test timing after: 0.58s. Full tests/web slice: 27s → 7.1s with the
spurious warnings gone.
Before: if the bus was unreachable at worker start, we logged
"running in idle mode" once and parked on shutdown forever. systemd
doesn't guarantee the bus is fully up before the webhook worker starts,
so a race on boot left the worker permanently dead until restart.
Now: wrap the whole bus-use in an outer reconnect loop.
    while not shutdown:
        try: connect()
        except: sleep(RECONNECT_SECS); continue
        try: run_with_bus(...)   # heartbeat + dispatch
        except: log + close      # reconnect on next iteration
Clean consequence: if the bus dies mid-operation the dispatch loop's
subscriptions raise inside the consumer tasks, `_run_with_bus` exits,
the outer loop closes the stale connection and reconnects. No partial
state leaks across epochs — fresh bus, fresh subs, fresh heartbeat.
Interval is 60s by default, overridable via
DECNET_WEBHOOK_BUS_RECONNECT_SECS. Shutdown wakes the wait so
systemctl stop doesn't hang for a minute.
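A runnable asyncio sketch of the same loop shape. The callables are
injected so the example is self-contained; `run_with_reconnect` is a
hypothetical name, and closing the stale connection is left out for
brevity:

```python
import asyncio
from typing import Awaitable, Callable


async def run_with_reconnect(
    connect: Callable[[], Awaitable[object]],
    run_with_bus: Callable[[object], Awaitable[None]],
    shutdown: asyncio.Event,
    reconnect_secs: float = 60.0,
) -> None:
    """Keep (re)connecting until shutdown is set."""
    while not shutdown.is_set():
        try:
            bus = await connect()
            await run_with_bus(bus)  # heartbeat + dispatch
        except asyncio.CancelledError:
            raise
        except Exception as exc:
            print(f"bus unavailable or lost: {exc!r}")
        if shutdown.is_set():
            break
        try:
            # Sleep until the next attempt, but wake early on shutdown.
            await asyncio.wait_for(shutdown.wait(), timeout=reconnect_secs)
        except asyncio.TimeoutError:
            pass
```

Waiting on the shutdown event instead of a bare sleep is what lets a
stop request interrupt the reconnect delay.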
Test added: flaky get_bus that fails once, then returns a live
FakeBus — asserts retry + successful delivery.
get_app_bus() in decnet/bus/app.py already has a 2s backoff retry so
the FastAPI hot path self-heals; this commit brings the standalone
webhook worker in line with the same posture.
Add "webhook" to KNOWN_WORKERS + the start-all preferred order so the
Config → Workers panel picks up the row automatically: heartbeat
subscription, start/stop controls via the existing systemd helper
(decnet-webhook.service.j2 already lands via decnet init's unit
glob), and the status-dot lifecycle all come for free.
Placed between mutator and the swarm-only agent/forwarder/updater
trio — matches the intended startup sequence (bus → api → data-plane
workers → egress → swarm management).
No frontend change needed; Config.tsx reads the worker list
dynamically from GET /api/v1/workers.
The hardcoded _DB_RESET_TABLES tuple had drifted — session_profile,
smtp_targets, and webhook_subscriptions were all missing, so
`decnet db-reset --i-know-what-im-doing drop-tables` silently left
them behind. Running it on a post-webhook install then letting
SQLModel.metadata.create_all() re-create tables produced a partial
schema: old rows survived, new columns didn't land, and endpoints
500'd on the missing columns (e.g. auto_disabled_at after the
circuit breaker merge).
Replace the hardcoded list with `SQLModel.metadata.sorted_tables`,
reversed for DROP safety (children first). Any future model addition
is auto-enrolled — no manual step, no more drift.
No behavior change on reset semantics; the SET FOREIGN_KEY_CHECKS=0
fence still covers any edge case the sort order misses.
After DECNET_WEBHOOK_CIRCUIT_THRESHOLD (default 5) consecutive failed
deliveries, the worker calls trip_webhook_circuit(uuid, ts) which
flips enabled=False and stamps auto_disabled_at. The worker sets its
reload flag so the next dispatch epoch stops consuming events for the
tripped sub entirely — one dead receiver can't poison the shared
egress pool anymore.
Operator clears the trip via PATCH — setting enabled=True when the
sub was previously disabled clears auto_disabled_at, zeros
consecutive_failures, and clears last_error. Admin-pause → re-enable
hits the same path harmlessly.
Three observable states now distinguishable in the UI:
- Active enabled=True, auto_disabled_at=NULL
- Admin-paused enabled=False, auto_disabled_at=NULL
- Tripped enabled=False, auto_disabled_at=<ts>
UI surfaces a TRIPPED · <ts> chip on the row (red, alert-styled) and
a "N TRIPPED" count in the page header. Hover tooltip tells the
operator how to reset ("Re-enable via Edit").
record_webhook_failure now returns the new consecutive_failures count
so the worker can compare against the threshold without a second
roundtrip. trip_webhook_circuit is idempotent — re-tripping just
re-stamps auto_disabled_at.
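An in-memory sketch of the state machine. The `Sub` dataclass and the
helper names are stand-ins for the repository methods, kept only to
show how the three states and the returned failure count fit together:

```python
from dataclasses import dataclass
from typing import Optional

CIRCUIT_THRESHOLD = 5  # mirrors the documented default


@dataclass
class Sub:
    enabled: bool = True
    consecutive_failures: int = 0
    auto_disabled_at: Optional[float] = None
    last_error: Optional[str] = None


def record_failure(sub: Sub, error: str) -> int:
    sub.consecutive_failures += 1
    sub.last_error = error
    return sub.consecutive_failures  # returned so no second read is needed


def trip(sub: Sub, ts: float) -> None:
    sub.enabled = False
    sub.auto_disabled_at = ts  # idempotent: re-tripping only re-stamps


def re_enable(sub: Sub) -> None:
    sub.enabled = True
    sub.auto_disabled_at = None
    sub.consecutive_failures = 0
    sub.last_error = None


def state(sub: Sub) -> str:
    if sub.enabled:
        return "active"
    return "tripped" if sub.auto_disabled_at is not None else "admin-paused"
```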
Closes THREAT_MODEL WH-02 and DEBT-037 §1.
The per-row test-delivery action already existed as an icon-only ⚡
zap in the ACTIONS column — backed by POST /webhooks/{uuid}/test,
which fires a synthetic test.ping event through the normal HMAC-
signed delivery path with retries disabled. Too easy to miss.
Replace the icon-only button with a labeled [⚡ FIRE] violet-accented
button so it reads as an emphasized dev-tool action right next to
edit/delete. Tooltip now spells out the backend endpoint and "fire
a synthetic test event" intent.
No backend change. Widens the actions column to 180px to accommodate
the label.
Python stdlib ThreadingHTTPServer that accepts any POST path, optionally
verifies HMAC against --secret / $DECNET_MOCK_SECRET, and pretty-prints
each delivery with topic / event-id / signature status. Pass --fail 503
to exercise the worker's retry/backoff path.
Point a webhook at http://localhost:8765/ and you'll see every delivery
land with color-coded HMAC OK / MISMATCH / UNVERIFIED badges. No deps.
The webhooks page used a bespoke .webhooks-header wrapper that didn't
line up with the rest of the dashboard (Fleet / Logs / Swarm all use
the .<page>-root + .page-header + .page-title-group + .actions
pattern). Swapped to that convention:
- .webhooks-root wrapper, matching .logs-root / .fleet-root spacing.
- H1 "WEBHOOKS" in .page-title-group; subtitle shows
`N CONFIGURED · M ENABLED [· K FAILING] [· L INSECURE]` in
.page-sub, same voice as the LOGS stream summary.
- Actions (CREATE WEBHOOK, DELETE SELECTED) sit in .actions.
- Table lives in a proper .logs-section shell with a .section-header
carrying the Webhook icon + "SUBSCRIPTIONS" title.
- All scoped button overrides (violet/alert/warn/ghost) copied from
the LiveLogs scope so theme switches behave identically.
Also improve error messaging: extractErrorDetail now maps 401 to
"Session expired" and 403 to "Insufficient permissions (admin only)"
instead of falling through to the generic "Failed to load webhooks".
Helps users who hit the page as viewer or with a stale token see why
it failed.
New /webhooks admin page with table-based subscription management:
- CREATE WEBHOOK (inline form row — no modal) with simple-event
checkboxes (AttackerDetail / DeckyStatus / SystemStatus) that
expand to bus-topic patterns server-side, and an advanced-mode
textarea for raw NATS-style patterns.
- Bulk-select + DELETE SELECTED with two-click arm pattern.
- Per-row test-ping (zap), pencil edit, and delete actions.
- Last-fired timestamp column.
- Yellow banner surfacing insecure_url warnings (WH-03): http:// is
allowed but flagged so operators see it on every page load.
- Post-create secret modal — the secret is shown exactly once with
a COPY button and a clear "won't see this again" notice.
Sidebar nav regrouped: /live-logs and /webhooks now live under a new
ALERTS NavGroup (Bell icon). The alertCount badge rides the Live
Logs sub-item. Command palette gains a "Webhooks" GO TO entry with
the `G W` chord.
Side-fix: useFocusSearch.ts was failing the build under
verbatimModuleSyntax (pre-existing, unrelated). Split the React
import to satisfy tsc; no behavioural change.
The webhook MVP shipped with deliberate deferrals; this entry names
them so future PRs know exactly what's left to close: circuit
breaker, dead-letter table, delivery audit log, batch/coalescing,
per-subscription rate limiting, payload templates per destination,
and secret encryption at rest.
Non-negotiable even at MVP scope (HMAC signing, bus-off degraded
mode, jittered retry backoff) is called out explicitly to prevent
future contributors from weakening it under the banner of
"simplification."
WebhookResponse now carries a `warnings: list[str]` field. When the
subscription's URL starts with http://, an `insecure_url` advisory is
surfaced on every GET/CREATE without blocking the request. HMAC still
detects tampering regardless of transport — only read-confidentiality
is lost over plaintext — and test/dev environments without TLS stay
usable.
Matches the operator-trust posture already established by DA-06
(admin-on-admin protection is out of scope). The alternative — hard
rejection at admin time — was considered and declined; warning-plus-
visibility is the right shape.
THREAT_MODEL WH-03 accepted risk registered; revisit triggers are
multi-admin delegation, a regulated customer, or an operator ticket
asking for a DECNET_WEBHOOK_REQUIRE_HTTPS enforcement knob.
- DEVELOPMENT.md: tick the "Real-time alerting" roadmap item with a
note that Slack/Telegram-specific senders remain per-destination
follow-ups (they accept generic webhook payloads already).
- THREAT_MODEL.md: new Component 2 — DECNET↔External webhook
destination. DFD, full STRIDE table, WH-01 (secret at rest) and
WH-02 (half-dead-receiver retry waste) registered as accepted
risks pointing at DEBT-037 for post-MVP hardening. Checklist lists
two open items: OpenAPI schema omits `secret`, and http:// URL
rejection at admin time.
Introduces the `decnet webhook` long-running worker that consumes the
internal bus and POSTs matching events to configured subscriptions.
Design: one task per (subscription, pattern) pair. Each task opens
its own bus subscription, iterates events, and dispatches via the
shared deliver() client. No intermediate queue, no in-memory filter
matching — the bus's own pattern matcher is the filter. Reloads on
`system.webhook.subscriptions_changed` signals from the CRUD router,
with a 60s fallback timer in case a signal is lost.
Shutdown propagates via CancelledError on the outer task; all inner
subscription tasks are cancelled and awaited in a finally block.
Bus unavailable → worker stays up in idle mode per the DEBT-031
pattern, logging one warning.
Registered as a master-only CLI command (agents don't configure
webhooks — the subscription store lives on master). systemd unit
mirrors the profiler template; added to decnet.target Wants= list so
`systemctl start decnet.target` brings it up alongside everything
else. `decnet init` auto-picks up the new .service.j2 via its
existing `glob("decnet-*.service.j2")` sweep.
Introduces the webhook egress foundation — a new WebhookSubscription
table, admin-gated CRUD under /api/v1/webhooks, and the shared
delivery client that both the test-ping route and the upcoming worker
will use. No worker yet; this commit is API + model + client only.
Simple-mode enum (AttackerDetail / DeckyStatus / SystemStatus) expands
to bus-topic patterns at the router layer; storage is always the raw
pattern list. Advanced mode lets admins supply raw NATS-style patterns
directly. Filter-at-subscribe: the worker (next commit) will subscribe
to the union of patterns across enabled subscriptions.
Delivery client handles HMAC-SHA256 signing (X-DECNET-Signature),
retry on 429/5xx/network errors with jittered backoff, no-retry on
4xx. Secrets never leave the server on GET/LIST — only the create
response carries the secret for copy-out.
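A sketch of the signing and retry-classification pieces using only the
stdlib. Hex encoding of the digest and signing the raw body are
assumptions for the example; the actual X-DECNET-Signature format is
not spelled out here:

```python
import hashlib
import hmac


def sign(secret: str, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def verify(secret: str, body: bytes, signature: str) -> bool:
    # compare_digest avoids leaking the match position through timing.
    expected = sign(secret, body).encode("utf-8")
    return hmac.compare_digest(expected, signature.encode("utf-8"))


def should_retry(status: int) -> bool:
    """429 and 5xx are retried; other 4xx responses are not."""
    return status == 429 or 500 <= status <= 599
```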
CRUD routes publish WEBHOOK_SUBSCRIPTIONS_CHANGED on the bus after
every mutation so the (future) worker can hot-reload.
Opens DEBT-037 for the deferred items (circuit breaker, dead-letter,
batch delivery, payload templates, secret-at-rest).
New decnet/web/sse_limits.py provides sse_connection_slot, an async
context manager that counts live SSE connections per user UUID and
raises 429 when a per-user cap is exceeded (default 5, override via
DECNET_SSE_MAX_PER_USER). Wired into both SSE generators as their
first async with, so the cap check fires before any stream data is
yielded.
The cap must sit inside the generator — StreamingResponse returns
before the generator body runs, so a handler-level wrapper would
release the slot immediately. Put prefetch + slot + loop all under
the one async with.
Also documents F6/I (role leakage) as mitigated-by-construction via
handler docstrings: every event type on both streams wraps data
already reachable via viewer-gated REST, so no per-event filter is
needed until a new event family is introduced. The invariant is
written into the handler docstrings so a future PR can't silently
add admin-only events.
Resolves THREAT_MODEL F6/I and F6/D.
Every mutation route that returned an untyped dict now declares
response_model at the decorator. MessageResponse covers the eight
{"message": ...} envelopes (change-password, mutate-decky, mutate-
interval, update-deployment-limit, update-global-mutation-interval,
delete-user, update-user-role, reset-user-password). Purpose-built
models cover the richer shapes (DeployResponse for /deckies/deploy,
PurgeResponse for /config/reinit, ReapReportResponse for /reap-orphans,
UserResponse for /config/users). 204-No-Content and Response/
ORJSONResponse routes stay as-is.
The wire shape for clients is unchanged — the envelopes already only
shipped a message field. What changes is that if a handler
accidentally returns a richer dict (e.g. a full user row including
password_hash), the extra fields are now silently stripped down to the
declared ones at serialization time.
Also flips F4/D "expensive LIKE" to accepted (new DA-09) — the /logs
and /attackers search routes LIKE-scan unbounded columns, but both are
admin-gated, limit-capped, and within operator rate-limit scope per DA-04.
FTS5 stays a performance TODO, not a security blocker.
The other five query endpoints (/logs, /attackers, /attacker-commands,
/bounties, /topologies/{id}) already declared le=2147483647 on offset;
these two were inconsistently uncapped. Bring them in line to close
the F4/D deep-pagination row.
Also resolves F4/T (ORM sort injection — already mitigated by the
regex pattern on /attackers sort_by, no other route accepts a column
name) and F4/D (limit cap — already universal) with code pointers.
New test walks app.routes, classifies each APIRoute as admin/viewer/open
by identity-matching require_admin / require_viewer closures inside the
route's dependency tree, then asserts:
- admin routes return 403 to a viewer JWT
- viewer routes return neither 401 nor 403 to a viewer JWT
SSE routes skipped (separate scope under F6). Role hints deliberately
NOT encoded in the OpenAPI spec — classification stays server-side so
/openapi.json can't be used to enumerate admin routes.
Resolves THREAT_MODEL F2/I + F5/E; paired with the existing
test_schemathesis.py::test_auth_enforcement (401-half coverage).
Harden the attacker-controlled artifact download path (F7) with explicit
response headers instead of relying on Starlette's defaults (which only
emit attachment for non-ASCII filenames and never set nosniff). Also
resolves the THREAT_MODEL F7 path-traversal row (containment check was
already in _resolve_artifact_path) and the fleet-deploy detail=str(e)
audit (all four sites are admin-gated deliberate validator UX or
structured worker-response fields).
The ~30-signature hand-rolled p0f-lite table in decnet/sniffer/p0f.py
misses most real-world attackers (yesterday's SLOW SCAN being a
textbook case — 9 hours of events, 19 hits, os_guess = NULL). The
375-sig vendored p0f v2 DB was already there; this commit actually
calls it.
New resolution chain in sniffer_rollup:
1. Enabled OS-fingerprint providers (p0f-v2 default, via
DECNET_OSFP_PROVIDERS) tried in declared order. Provider with
highest-confidence match across all enabled sources wins.
2. Modal os_guess label from the sniffer's hand-rolled p0f.py.
Kept as fallback because v2's DB predates post-2006 kernels.
3. TTL bucket (linux / windows / embedded). Coarse but never wrong.
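The chain can be sketched as a three-step fallback. Everything below is
illustrative: providers are modelled as plain callables returning
(label, confidence), and the TTL-to-bucket mapping is an assumption,
not the rollup's actual table:

```python
from typing import Callable, Iterable, Optional, Tuple

OsMatchFn = Callable[[dict], Optional[Tuple[str, float]]]  # (label, confidence)


def initial_ttl(observed: int) -> int:
    """Round an observed TTL up to the nearest common initial value."""
    for candidate in (32, 64, 128, 255):
        if observed <= candidate:
            return candidate
    return 255


def ttl_bucket(observed: int) -> Optional[str]:
    # Assumed mapping for the sketch; the real buckets may be drawn differently.
    return {64: "linux", 128: "windows", 255: "embedded"}.get(initial_ttl(observed))


def resolve_os(obs: dict, providers: Iterable[OsMatchFn], modal_guess: Optional[str]) -> Optional[str]:
    best: Optional[Tuple[str, float]] = None
    for provider in providers:
        try:
            match = provider(obs)
        except Exception:
            match = None  # a broken provider must not wedge the rebuild
        if match and (best is None or match[1] > best[1]):
            best = match
    if best:
        return best[0]
    if modal_guess:
        return modal_guess
    return ttl_bucket(obs["ttl"]) if "ttl" in obs else None
```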
Wiring details:
- _match_via_osfp_providers: never raises — factory / provider
failures collapse to None and the chain falls through to the
old modal-label / TTL path. A corrupt .fp file or misconfigured
DECNET_OSFP_PROVIDERS must never wedge a profile rebuild.
- tcp_fp_context tracks whether the LATEST tcp_fp snapshot came
from a passive SYN ('syn' → p0f.fp) or an active prober probe
('synack' → p0fa.fp). Routes to the right sig list.
- initial-TTL normalisation via decnet.sniffer.p0f.initial_ttl.
Observation's TTL may be N hops below the OS's initial; v2
signatures match on the canonical bucket.
Soft-field semantics on Signature.score(): df and total_len are now
skip-checked when the observation is missing them. Sniffer doesn't
currently emit either SD field; a literal-constraint sig
shouldn't hard-reject a match solely because of upstream
incompleteness. Hard fields (window, ttl, options_sig, quirks)
still hard-reject on absent/mismatched input — those are the real
discriminators. Promote df / total_len back to hard the moment the
sniffer starts emitting them.
+2 integration tests on TestSnifferRollup, +2 soft-field tests on
test_signature. Full regression: 166 tests across tests/prober/osfp
+ tests/profiler all green.
- decnet/prober/osfp/p0f/provider.py: P0fV2Provider loads the four
vendored .fp files into per-context signature lists (syn / synack /
rst / stray) and matches via highest-specificity score across the
relevant list. Also auto-picks up p0f-decnet.fp if present (GPL-3.0
additions land there later, empty for now).
- decnet/prober/osfp/factory.py: get_provider / get_all_providers /
reset_cache, mirrors decnet/geoip/factory exactly. Env-dispatched
via DECNET_OSFP_PROVIDERS (default "p0f-v2"). Reserved names
"nmap-osdb" (pending Fyodor's grant) and "decnet-observed" (our
future curated DB) raise NotImplementedError — visible on the
factory surface so a typo doesn't silently fall through.
- decnet/prober/osfp/__init__.py now re-exports the public API so
callers use `from decnet.prober.osfp import get_provider` without
reaching into submodules (upholds the provider-subpackage rule).
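A compact sketch of the factory surface described above. `P0fV2Stub`
is a placeholder class and lru_cache is one possible memoisation
choice; neither is claimed to match the real module:

```python
from functools import lru_cache

RESERVED = {"nmap-osdb", "decnet-observed"}


class P0fV2Stub:
    """Placeholder standing in for the real provider class."""
    name = "p0f-v2"


@lru_cache(maxsize=None)
def get_provider(name: str):
    if name == "p0f-v2":
        return P0fV2Stub()
    if name in RESERVED:
        # Visible failure: a reserved name must not silently resolve to None.
        raise NotImplementedError(f"provider {name!r} is reserved but not implemented")
    raise ValueError(f"unknown OS-fingerprint provider: {name!r}")


def reset_cache() -> None:
    get_provider.cache_clear()
```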
15 new provider+factory tests covering:
- All four DB contexts load (262/61/46/6 sigs per inventory).
- Known-good Linux 2.6 SYN + Linux 2.2 SYN-ACK match end-to-end.
- Unknown observations / contexts return None, not raise.
- Factory memoises, env override honoured, unsupported names raise.
- Reserved names raise NotImplementedError (not silent None).
`sniffer_rollup` wiring lands in the next commit.
First code layer of the OS-fingerprinting work on top of yesterday's
vendored p0f v2 database. Three new modules, all pure (no I/O outside
of the parser's file read):
- decnet/prober/osfp/base.py — Provider protocol + OsMatch dataclass
matching the established Provider convention in decnet/geoip and
decnet/bus. Docstring spells out the never-raise invariant: malformed
input returns None, so a single bad event can't wedge a whole
attacker-profile rebuild.
- decnet/prober/osfp/p0f/signature.py — Signature dataclass + three
predicate helpers (WindowSpec / IntSpec / OptionToken) encoding the
p0f v2 DSL's wildcard / modulo / MSS-multiple / MTU-multiple
semantics. Scoring is our extension on top of upstream p0f's
first-match-wins policy: each signature carries a precomputed
specificity in [0, 1] so the factory can pick the most-specific
match when multiple signatures fire against one observation.
- decnet/prober/osfp/p0f/format.py — .fp line parser. Every shipped
field variant from the DSL spec at the top of p0f.fp is covered
(Snn / Tnn / %nnn / * for window; T0 vs T; -/@/* os-genre prefixes;
quirks as concatenated single-letter flags; '.' sentinels for
no-options / no-quirks). Malformed lines log a warning and skip
instead of aborting the whole file — 1 bad row must not cost the
other 374.
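To make the window-field variants concrete, here is a sketch of how a
matcher might evaluate them. It follows the p0f v2 DSL as summarised
above (Snn as a multiple of MSS, Tnn as a multiple of MTU, taken here
as MSS + 40); `window_matches` is a hypothetical function, not the
WindowSpec API:

```python
from typing import Optional


def window_matches(spec: str, window: int, mss: Optional[int]) -> bool:
    """Check an observed TCP window against a p0f v2 window-field spec."""
    if not spec:
        return False
    try:
        if spec == "*":
            return True  # wildcard
        if spec.startswith("%"):
            modulus = int(spec[1:])
            return modulus > 0 and window % modulus == 0
        if spec[0] in "ST":
            if mss is None:
                return False  # cannot evaluate a multiple without the MSS
            factor = int(spec[1:])
            # Snn: multiple of MSS; Tnn: multiple of MTU, taken here as MSS + 40.
            unit = mss if spec[0] == "S" else mss + 40
            return window == factor * unit
        return window == int(spec)  # literal value
    except ValueError:
        return False  # malformed spec: no match rather than an exception
```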
20 parser tests + 14 scoring tests. Full vendored-DB smoke tests
confirm all 375 signatures parse round-trip (262 SYN + 61 SYN-ACK +
46 RST + 6 stray) and every computed specificity lands in [0, 1].
Ships the p0f v2.0.8 signature database for passive + active OS
fingerprinting. 375 total signatures across four probe contexts:
- p0f.fp (262 sigs) — passive SYN fingerprints
- p0fa.fp ( 61 sigs) — SYN-ACK response, for active probes
- p0fr.fp ( 46 sigs) — RST response quirks
- p0fo.fp ( 6 sigs) — "stray" packet fingerprints
Replaces reliance on the 10-signature hand-rolled p0f-lite table in
decnet/sniffer/p0f.py for any match job the upstream DB covers.
Keeping the hand-rolled table as a fallback for modern kernels the
v2 DB pre-dates — v2 froze in 2006 so post-Win10 / post-Linux-3.x
kernels won't match against upstream directly. DECNET-authored
additions will go in a sibling p0f-decnet.fp under GPLv3 (not yet
committed; added as the ingester observes real honeypot traffic).
Provenance (full chain in data/README.md):
- Source: Debian snapshot of p0f_2.0.8.orig.tar.gz
- SHA1 matches Debian-recorded 7b4d5b2f24af4b5a299979134bc7f6d7b1eaf875
- Files byte-identical to upstream tarball (verified by hash)
License chain:
- Upstream: LGPL-2.1 (doc/COPYING preserved verbatim as
data/LICENSE.p0f-upstream, Michal Zalewski's copyright intact).
- DECNET uses the LGPL-2.1 §3 explicit permission to convert to any
version of the GPL. These files, as consumed in DECNET, are
effectively GPL-3.0. Chain documented in data/README.md so an
auditor sees the full reasoning.
- LGPL-2.1 §3 → GPL-3.0 conversion is a settled compat path: the same
mechanism the kernel uses for LGPL userland glue, and one that many
other projects apply daily.
Rejected path — nmap-os-db under NPSL — because NPSL adds
restrictions GPLv3 §7 prohibits us from accepting. An email is out
to Fyodor requesting an open-source-author exception grant, but we
don't block on it: p0f v2 is a genuine accuracy improvement in
its own right, and adding nmap-osdb later (if granted) plugs into
the same provider interface with zero refactor.
Directory layout mirrors the established provider-subpackage pattern
(see decnet/geoip/, decnet/bus/) per the feedback_provider_
subpackages memory: base + factory + impl/ subpackages, no flat
files. Parser + matcher + factory wiring land in the next commit
sequence.
DECNET had no LICENSE file and no license metadata in pyproject.toml
despite intent being GPLv3. Legally that meant the code was "all
rights reserved" by default, so anyone distributing it (including via
GitHub clones, mirrors, or the forthcoming swarm enroll bundles) was
technically in violation even though the operator's own intent was
copyleft.
- Add canonical GPL-3.0 text from gnu.org/licenses/gpl-3.0.txt as
LICENSE (verbatim, 674 lines).
- Add license = "GPL-3.0-or-later" and license-files = ["LICENSE"]
to pyproject.toml [project] (SPDX identifier per PEP 639).
- Add the matching OSI classifier plus a few other standard ones
(Python 3.11, Linux, Security, Network Monitoring, Beta) that
pyproject was silently missing.
Prereq for the forthcoming p0f-db vendoring: establishing DECNET's
own license explicitly closes the first question an auditor would
ask about any third-party data we embed.
Follow-ups on 9232031 per review:
- Module-level constants KD_PAUSE_BURST_MAX_S (0.2s),
KD_PAUSE_THINK_MAX_S (1.5s), KD_START_OF_ACTION_IDLE_S (2.0s).
Docstrings reference them by name; future calibration against real
session data only has to touch one place. Threshold for "started
a new action" raised from 1s → 2s — 1s catches too much
mid-command hesitation to be empirically bimodal.
- New column kd_max_pause_gap (seconds). The distracted bucket count
alone can't distinguish three 3s pauses from three 60s pauses;
max-gap carries that signal in one cheap scalar (vs widening the
histogram to a fourth bucket).
- Scope-framing docstring above the whole kd_* section: intended
use is session clustering / tooling attribution, explicitly NOT
biometric identity, admission decisions, or ML-driven user ID.
Keeps a future well-intentioned contributor from walking the
project into legal/ethics territory by accident.
- TODO comment on kd_top_bigrams: v1's JSON-in-TEXT is fine for
"show the top digraphs on the attacker page". If bigram-similarity
queries become hot, promote to a session_bigram_stats(sid, bigram,
count, mean_iat_s) table or Postgres JSONB + GIN. Neither changes
the write-side ingester materially.
No new migration helper — pre-v1 schema additions go through
create_all on fresh DBs; the existing _migrate_session_profile_table
stays but does not get extended. Alembic lands at v1 and sweeps all
the ad-hoc migrations at once.
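
A minimal sketch of how the three-bucket histogram and the max-gap scalar relate. The two thresholds are the values quoted above; the function name and the returned dict shape are hypothetical, not the ingester's actual API.

```python
from typing import Dict, List

# Thresholds as quoted in the commit message above.
KD_PAUSE_BURST_MAX_S = 0.2
KD_PAUSE_THINK_MAX_S = 1.5


def pause_summary(iats_s: List[float]) -> Dict[str, float]:
    """Bucket inter-arrival times (seconds) and track the largest gap.

    Hypothetical helper: buckets are <0.2s / 0.2-1.5s / >1.5s.
    """
    summary: Dict[str, float] = {"burst": 0, "think": 0, "distracted": 0, "max_gap": 0.0}
    for gap in iats_s:
        if gap < KD_PAUSE_BURST_MAX_S:
            summary["burst"] += 1
        elif gap <= KD_PAUSE_THINK_MAX_S:
            summary["think"] += 1
        else:
            summary["distracted"] += 1
        # The histogram loses magnitude; max_gap keeps it in one scalar.
        summary["max_gap"] = max(summary["max_gap"], gap)
    return summary
```

Two sessions with identical bucket counts can still differ in max_gap, which is the case the extra column is meant to separate.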
Adds the three signal columns motivated by the manual keystroke
analysis in DEBT-036 directly to the SessionProfile table. Pre-v1 so
we modify the schema in place — Alembic arrives at v1.
Columns:
- kd_top_bigrams (TEXT) — JSON of top-N most-common digraphs with
mean IAT per bigram. Complements kd_digraph_simhash ("same typist?")
with "same typist in same mental state?" (tired / rested / distracted
shifts bigram-specific IATs measurably).
- kd_start_of_action_latency (REAL/DOUBLE) — median IAT of the first
keystroke after an idle gap > 1s. Separates "initiating a command"
from "executing a remembered one"; real humans have measurable
start-of-action latency, bots don't.
- kd_pause_hist_burst / _think / _distracted (INT) — three-bucket
histogram (counts, <0.2s / 0.2-1.5s / >1.5s). More discriminating
than the existing flat burst_ratio / think_ratio pair: C2 operators
concentrate in burst with a thin tail; opportunistic humans have a
fat think bucket and a long distracted tail.
Both backends get an idempotent ADD COLUMN migration
(_migrate_session_profile_table) wired into initialize() alongside
the existing _migrate_attackers_table path — guards on PRAGMA
table_info (SQLite) / information_schema.COLUMNS (MySQL) so reruns
are safe.
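
The SQLite half of that guard could look roughly like this. It is a sketch only: the helper name is made up, and the table and column identifiers are assumed to be trusted constants, since they are interpolated into the SQL.

```python
import sqlite3


def add_column_if_missing(conn: sqlite3.Connection, table: str,
                          column: str, ddl_type: str) -> bool:
    """Add `column` only when PRAGMA table_info does not already list it.

    Returns True when the column was added, False when it already existed.
    Identifiers are interpolated, so callers must pass trusted constants.
    """
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl_type}")
    return True
```

Calling it twice with the same arguments is harmless: the second call sees the column and does nothing.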
PII discipline comment on kd_digraph_simhash and kd_top_bigrams:
both operate on bigram CHARACTERS, never on raw input stream content.
Attacker passwords typed over SSH must not land here.
Test updated for the MySQL initialize() migration-order contract.
The SessionProfile SQLModel table has shipped with every column
nullable since session-recording v1 landed — because the ingester
that populates them from the [t,"i",d] events in the transcript
shards does not exist yet (known as gap #2 in SIGNAL_CAPTURE_AUDIT).
A manual keystroke-dynamics pass over one real session (wget
scanme.nmap.org) trivially recovered CoV ≈ 0.74 (human band), a 467 ms
semantic pause before the URL argument, tight intra-word bigrams
(ge 79 ms, t<space> 83 ms), and slow start-of-action latency (w→g
225 ms) — all signals the existing schema columns were designed to
hold. So the missing piece is purely the ingester.
Entry captures:
- the manual case as the motivating + sanity-check target
(ingester should produce CoV ≈ 0.74 ± 0.05 on the same shard),
- three schema extensions the manual analysis suggests beyond what
the table carries today: kd_start_of_action_latency_ms,
kd_pause_hist_{burst,think,distracted}, kd_top_bigrams,
- a non-PII discipline line: raw keystroke content (including
captured passwords) MUST NOT land in SessionProfile columns —
only timing and frequency aggregates.
Poll-driven ingestion can ship first; the bus-trigger path
piggybacks on DEBT-031's deferred session-boundary topics.
The drawer used onClick={onClose} on the backdrop + onClick={e =>
e.stopPropagation()} on the panel to stop inside-clicks from closing
the drawer. That pattern is fine for most React trees, but React's
stopPropagation() also aborts the NATIVE DOM event — and asciinema-
player wires its click-to-play handler via document-level event
delegation. So every click inside the drawer (including the big
play button) died at the panel boundary and never reached the
player's dispatcher. Confirmed end-to-end by calling window.__ap.
play() directly from DevTools: playback started, cast rendered in
full, ended event fired.
Swap to the idiomatic target===currentTarget guard on the backdrop
so only genuine backdrop clicks close the drawer; everything inside
(including native-delegated handlers) gets its events untouched.
All the debug instrumentation from b5c6b8a, 4424138, 6d031ae, and
f032ece (cast logging, lifecycle listeners, window.__ap) is
reverted here: the root cause is known, and it was event delegation,
not the parser or the cast.
The parse path works (metadata event fires with duration: 24.58s,
idle event fires); next unknown is whether clicking play even
reaches core.play(). Stash the player on window so the operator can
call __ap.play() from DevTools to diff UI-click vs direct-call
behaviour and see whether 'play' / 'playing' events fire.
To be reverted once we pin the failure.
The original short subscribe list missed 'metadata' — which is the
one that carries the parsed duration + theme + marker info AFTER
_initializeDriver (the step that actually parses the cast). Without
it we only saw 'ready' (= UI mounted, parse not yet run) and jumped
to conclusions about the parser.
Add the full lifecycle set so the next repro pins which step the
player is actually getting stuck at.
Without preload:true the player only parses the recording when the
user first clicks play. Any parse error during that lazy step
bypasses our lifecycle instrumentation (we only see "ready", which
just means UI mounted), and from the user's POV the play button
stays black because they never see the actual failure.
Forcing preload makes the driver's init() run synchronously-ish with
the "ready" dispatch, so getDuration() resolves to a real number
(or we see an "errored" event with a payload that tells us why).
The sync try/catch around AsciinemaPlayer.create() misses async
failures in the player's internal init() promise — those land as
unhandled rejections and are invisible from the component's POV.
Subscribe to every lifecycle event (ready / play / pause / ended /
error / errored / loading) and log the resolved duration. If the
parser produces zero events despite a well-formed cast, duration
resolves to 0 / NaN / rejected — one of those signals will point at
whichever frame the render path is silently failing at.
Diagnostic for the persistent "player mounts with chrome but plays
black" symptom after the blob-URL fix. The player now gets
{data: cast} correctly and parses at least enough to render the
control bar, but duration shows --:-- and the terminal stays blank.
Log the first 400 chars of the built cast + event/cols/rows so the
operator can confirm in DevTools whether the malformed input is the
cast itself or something downstream in the asciinema parser.
SessionDrawer built a cast blob, pushed it through URL.createObjectURL,
and passed the blob URL to AsciinemaPlayer.create(). That's racy with
useEffect's cleanup: each new page of events re-fires the effect, the
cleanup revokes the URL, and the player's already-in-flight async
loadRecording() lands on a dead URL with no visible error — result was
a centered play button with an empty black pane, playback never starts.
asciinema-player v3's recording driver accepts {data: <string>} as a
first-class source (see core-DnNOMtZn.js:905-930 doFetch — string/
ArrayBuffer data is wrapped in `new Response(value)` and handed to the
parser). Skip the blob detour entirely, pass the cast text inline.
Also filter events to valid asciicast channels (o/i/r) before feeding
so a future stray SD field can't derail the parser, and log mount
errors to console for next-time debugging.
Tracks the durable follow-up to 323077b. The transcripts soft-fail
shipped in that commit keeps the API from 500-ing on
/var/lib/decnet/artifacts/** permission mismatches, but the real
issue is that decoy containers write artifacts under a uid the API
can't read — today's workaround is a manual `sudo chown -R` after
every new deploy.
Three design options documented (container-runs-as-host-uid, setgid
+ shared group, inotify sidecar) with a recommendation, plus an
acceptance criterion: fresh init + deploy + record session → the
API can read the transcripts with no manual chown.
sessrec.c emits the session_recorded SD blob with sid/service/src_ip/
duration_s/bytes/truncated — it never emitted shard_path. The web
handler still asked for fields.shard_path, got "", tripped the
sessions-YYYY-MM-DD.jsonl basename regex and returned
400 "invalid shard name" for every legitimate transcript request.
Handler now:
- Fast-paths when fields.shard_path IS present and validates
(for any future emitter or ingester that backfills it).
- Otherwise enumerates sessions-YYYY-MM-DD.jsonl shards under
ARTIFACTS_ROOT/{decky}/{service}/transcripts/ (newest first) and
returns the first one whose per-sid index contains our sid.
- Security invariant preserved: only files whose basename matches the
_SHARD_BASENAME_RE are ever opened, and they always resolve inside
ARTIFACTS_ROOT. A forged fields.shard_path is silently ignored.
- Soft-fails OSError/PermissionError on the transcripts dir (decky
containers often write it with a uid the API can't read) — returns
404 instead of a 500 traceback.
test_forged_shard_path_blocked updated to match the new semantics:
forgery is ignored, the real shard is served via fallback. The
invariant (no /etc/passwd access) is still asserted by the fact
that status is 200 with data from the test shard.
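
A sketch of the invariant described above, under stated assumptions: the regex is a guess at the shape of _SHARD_BASENAME_RE, and both function names are hypothetical. The per-sid index lookup is left out.

```python
import re
from pathlib import Path
from typing import List, Optional

# Assumed shape; the real _SHARD_BASENAME_RE may differ.
SHARD_BASENAME_RE = re.compile(r"sessions-\d{4}-\d{2}-\d{2}\.jsonl")


def safe_shard(root: Path, transcripts_dir: Path, candidate: str) -> Optional[Path]:
    """Return a shard path only if the basename matches and it stays under root."""
    name = Path(candidate).name
    if name != candidate or not SHARD_BASENAME_RE.fullmatch(name):
        return None  # forged or malformed value: ignore it
    resolved = (transcripts_dir / name).resolve()
    try:
        resolved.relative_to(root.resolve())
    except ValueError:
        return None  # resolved outside the artifacts root
    return resolved


def newest_first(transcripts_dir: Path) -> List[str]:
    """List matching shard basenames, newest date first."""
    try:
        names = [p.name for p in transcripts_dir.iterdir()]
    except OSError:
        return []  # unreadable or missing dir: caller can answer 404
    # ISO dates sort lexicographically, so reverse order is newest first.
    return sorted((n for n in names if SHARD_BASENAME_RE.fullmatch(n)), reverse=True)
```

A value such as "../../etc/passwd" fails the basename check before any file is touched, so the fallback enumeration is the only path that ever opens a shard.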
decnet-bus.service.j2 ran with User={{ user }} / Group={{ group }}
but the actual bus CLI invocation hardcoded --group decnet. The bus
chowns /run/decnet/bus.sock to that group at 0660 — so when an
operator ran `decnet init --group anti`, the socket ended up
owned by decnet:decnet while every worker (agent, api, collector,
forwarder, prober, updater) ran as anti and got EACCES on connect().
Each worker's bus-wiring catches the error, logs a warning, sets
bus=None, and carries on — which is correct for the data-plane but
silently kills Workers-panel heartbeats (run_health_heartbeat(None,
...) no-ops). So half the worker grid showed UNKNOWN even though
systemctl confirmed the processes were alive.
Swap the hardcoded --group decnet for --group {{ group }} so the
socket is owned by the same group the workers run under.
polkit rule 50-decnet-workers.rules hardcoded isInGroup("decnet"),
so when 'decnet init --group anti' installed systemd units as
User=anti / Group=anti, the API (running as anti) could no longer
systemctl start/stop decnet-*.service — polkit fell back to
'interactive authentication required', which in a daemon context is
a hard fail:
START FAILED · COLLECTOR — Failed to start decnet-collector.service:
Access denied as the requested operation requires interactive
authentication.
Rename the rule to .j2, parameterise the group on {{ group }}, and
route _install_polkit through _render_template /
_write_rendered_if_changed. Now the polkit rule matches whatever
group was passed to 'decnet init'.
Test fixture updated to seed the .j2 variant.
Four templates use backslash line-continuation on ExecStart
(decnet-bus, decnet-forwarder, decnet-listener, decnet-updater). My
earlier sed inserted StandardOutput= and StandardError= right after
the first ExecStart= line, which split the command and systemd fed
those two lines back to the binary as extra positional arguments —
the bus in particular crashed with:
Got unexpected extra argument
(StandardOutput=append:/var/log/decnet/decnet.bus.log)
Walk the ExecStart block (follow \-continuation lines) and insert
the two Standard* directives AFTER the last continuation line. The
nine single-line ExecStart templates are unaffected in shape but
re-written through the same path to keep the whole set uniform.
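
The walk described above can be sketched as follows. The function name is hypothetical, and this is not the script that was actually run against the templates.

```python
from typing import List


def insert_after_execstart(unit_text: str, directives: List[str]) -> str:
    """Insert directives after the whole ExecStart block.

    A line ending in a backslash continues onto the next line, so the
    directives go after the last continuation line, never in the middle.
    """
    lines = unit_text.splitlines()
    out: List[str] = []
    i = 0
    while i < len(lines):
        out.append(lines[i])
        if lines[i].startswith("ExecStart="):
            # Consume every continuation line belonging to this command.
            while lines[i].rstrip().endswith("\\") and i + 1 < len(lines):
                i += 1
                out.append(lines[i])
            out.extend(directives)
        i += 1
    return "\n".join(out) + "\n"
```

Single-line ExecStart units take the same path: the inner loop simply never runs, and the directives land on the next line.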
_configure_logging opened InodeAwareRotatingFileHandler against
DECNET_SYSTEM_LOGS (default: relative decnet.system.log) without
guarding OSError. Under systemd with ProtectSystem=full +
ProtectHome=read-only and no writable path baked into the unit, the
first import of decnet.config raised OSError and the daemon died
before it could even print a useful error — the root-cause log line
showed up in journalctl as a stack trace rather than a warning.
Wrap the handler attachment in try/except OSError and log a single
WARNING via the already-installed stream handler. stderr is always
attached, so losing the file handler means operators tail
journalctl / docker logs instead — the daemon keeps running.
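
A minimal sketch of the guard, using the stdlib RotatingFileHandler as a stand-in for InodeAwareRotatingFileHandler. The logger name and function signature are assumptions for illustration.

```python
import logging
import sys
from logging.handlers import RotatingFileHandler


def configure_logging(log_path: str) -> logging.Logger:
    """Attach stderr unconditionally; attach the file handler best-effort."""
    logger = logging.getLogger("decnet.sketch")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.addHandler(logging.StreamHandler(sys.stderr))  # always available
    try:
        # Opening the file happens in the constructor, so OSError surfaces here.
        logger.addHandler(RotatingFileHandler(log_path, maxBytes=1_000_000, backupCount=3))
    except OSError as exc:
        # One warning through the stream handler; the process keeps running.
        logger.warning("file logging disabled for %s: %s", log_path, exc)
    return logger
```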
The agent-side enroll-bundle templates (decnet/web/templates/*) always
set DECNET_SYSTEM_LOGS + StandardOutput/StandardError to a per-unit
file under /var/log/decnet. The master-side init templates (deploy/*)
never did, so every 'decnet init'-installed service:
- inherited the default DECNET_SYSTEM_LOGS=decnet.system.log — a
relative path, landing in the unit's WorkingDirectory. All 13 units
shared the same cwd and fought for the same file, or more often
just failed to write it under ProtectSystem=full,
- emitted stdout/stderr to the journal by default, which is fine for
uvicorn's INFO banter but makes per-service grepping a pain when
you're chasing a single worker's trace.
Mirror the agent-side wiring on all 13 master templates:
- Environment=DECNET_SYSTEM_LOGS=/var/log/decnet/decnet.<name>.log
- StandardOutput=append:/var/log/decnet/decnet.<name>.log
- StandardError=append:/var/log/decnet/decnet.<name>.log
/var/log/decnet is already in ReadWritePaths so ProtectSystem=full
stays compatible. Operators now get a dedicated
/var/log/decnet/decnet.<unit>.log per service, both from the app's
structured logger and from any stray stderr; journalctl still
works too, but is no longer the only option.
Key:value chips in the live-feed event cell used the default .chip
style, which is white-space: nowrap + inline-flex. A long cmd: value
(attacker-controlled shell strings, URLs, base64 payloads) stretched
the chip horizontally past the column, pushing the whole table into
horizontal scroll and clipping subsequent columns off-screen.
Add a chip-kv variant that allows the value to wrap inside a
max-width: 100% chip (word-break: break-word, overflow-wrap: anywhere
for dense strings with no natural break). The key-label stays on the
first line via flex-shrink: 0. Short values (uid: 0, user: root)
stay tight; long ones wrap onto multiple lines inside the chip.
Also set minWidth: 0 on the EVENT td + nested flex containers so
flex children honour the column width instead of growing to fit
content. Added title={k: v} on each chip for full-value hover in
case the wrap is still clipped.
The API lifespan unconditionally spawned log_collector_worker,
appending every container line to DECNET_INGEST_LOG_FILE. On hosts
that also run decnet-collector.service (installed by 'decnet init')
that's two tailers writing the same events to the same file — the
ingester then inserts each event twice and the dashboard shows every
command duplicated.
Add DECNET_EMBED_COLLECTOR (default false), matching the existing
DECNET_EMBED_PROFILER and DECNET_EMBED_SNIFFER pattern directly
above this block. Single-process dev setups without systemd can flip
it on to restore the all-in-one behaviour; multi-process production
gets the single-writer invariant by default.
Every plain `decnet deinit` ran userdel + groupdel unconditionally. In
dev the operator may pass `--user $USER --group $USER` to avoid file
ownership churn against a source checkout — at which point deinit
would cheerfully delete their own login account.
Move user/group removal behind --purge, matching the existing
behaviour for /var/lib/decnet + /var/log/decnet. Help text updated:
--purge now clearly advertises that it also wipes the service
user/group, with an explicit warning to only run it when `decnet init`
created the account in the first place.
Test updated: plain --deinit must NOT invoke userdel/groupdel;
--deinit --purge must.
Every decnet-*.service.j2 hardcoded User=decnet / Group=decnet. The
init CLI accepted --user / --group and used them for useradd,
chown, /etc/decnet ownership and ReadWritePaths — but the Jinja
context omitted them entirely, so
sudo decnet init --install-dir $PWD --user anti --group anti
rendered
User=decnet
Group=decnet
into every unit, which at best ran the workers as a user that didn't
match the files (fails to read the venv / config), and at worst spun
a parallel system user the operator never asked for.
Swap the hardcoded lines to {{ user }} / {{ group }} across all 13
templates and add both to the Jinja context in _install_units.
The systemd unit templates hardcoded {{ install_dir }}/venv/bin/decnet.
On production hosts enroll_bootstrap.sh creates exactly that path so it
worked. On dev boxes where the operator runs `sudo decnet init` against
a source checkout with a differently-named venv (.venv, .311, .312),
every decnet-*.service looped forever in auto-restart with:
Failed at step EXEC spawning .../venv/bin/decnet: No such file or
directory
Templates now use {{ venv_dir }} as an independent Jinja2 var. `decnet
init` adds --venv-dir (explicit override), otherwise autodetects:
1. $VIRTUAL_ENV (only when inside --install-dir, so a user-home venv
never gets baked into a root-owned unit),
2. {install_dir}/venv (production default; what enroll_bootstrap
creates),
3. {install_dir}/{.venv,.311,.312,.313} (common dev conventions).
Init aborts before any file writes if nothing resolves — an
operator-friendly error beats journalctl spam on every unit restart.
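
The resolution order above could be sketched like this. The function name, the "bin/decnet exists" usability check, and the ValueError are assumptions; the real init code may validate differently.

```python
from pathlib import Path
from typing import Mapping, Optional

DEV_VENV_NAMES = (".venv", ".311", ".312", ".313")


def _usable(venv: Path) -> bool:
    # Assumed check: the unit needs <venv>/bin/decnet to exist.
    return (venv / "bin" / "decnet").is_file()


def resolve_venv_dir(install_dir: Path, env: Mapping[str, str],
                     explicit: Optional[Path] = None) -> Path:
    """Pick the venv to bake into the units, or raise before any write."""
    if explicit is not None:
        if _usable(explicit):
            return explicit
        raise ValueError(f"--venv-dir {explicit} has no bin/decnet")
    active = env.get("VIRTUAL_ENV")
    if active:
        candidate = Path(active).resolve()
        try:
            # Only honour an activated venv that lives inside install_dir.
            candidate.relative_to(install_dir.resolve())
        except ValueError:
            pass
        else:
            if _usable(candidate):
                return candidate
    for name in ("venv",) + DEV_VENV_NAMES:
        candidate = install_dir / name
        if _usable(candidate):
            return candidate
    raise ValueError(f"no usable venv found under {install_dir}")
```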
python3-venv doesn't set a persistent system variable — $VIRTUAL_ENV
lives in the activated shell only — so this has to be decided + baked
in at init time; there's no way for systemd to "inherit the current
venv" at unit start.
Test mode (--prefix) skips venv validation so the existing test suite
doesn't need to stub up a venv tree per case.
'decnet status' used to psutil-scan for cmdlines matching hand-coded
service launch args. That worked on dev boxes running workers via
'python -m decnet.cli ...' but missed the systemd reality on real
hosts: units may be installed but not started, failed, or in
auto-restart — all invisible to a cmdline grep.
New behaviour: status calls `systemctl list-units --type=service --all
--output=json 'decnet-*.service'` and renders the unit/load/active/
sub/description matrix. One view works for masters, agents, and
mixed hosts — iterates over whatever 'decnet-*' units were installed
by 'decnet init' / the enroll-bundle. Agent/master mode filtering is
no longer needed in the CLI; the host literally does not have
master-only units installed if it enrolled as an agent.
The psutil path survives as a fallback for boxes without systemd
(dev laptops, CI containers, minimal init systems) so the command
stays useful there. Clearly labelled 'psutil fallback' in the table
title so operators know which view they're looking at.
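
A rough sketch of the systemd path. The systemctl arguments are the ones quoted above; the function names and the "return None to trigger the fallback" convention are assumptions.

```python
import json
import subprocess
from typing import Dict, List, Optional

UNIT_KEYS = ("unit", "load", "active", "sub", "description")


def parse_units(systemctl_json: str) -> List[Dict[str, str]]:
    """Reduce systemctl's JSON rows to the columns the status table shows."""
    rows = json.loads(systemctl_json)
    return [{key: str(row.get(key, "")) for key in UNIT_KEYS} for row in rows]


def fetch_units() -> Optional[List[Dict[str, str]]]:
    """Return unit rows, or None when systemctl is unavailable or fails."""
    cmd = ["systemctl", "list-units", "--type=service", "--all",
           "--output=json", "decnet-*.service"]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None  # caller can switch to the psutil fallback
    return parse_units(proc.stdout)
```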
Locust spawns N virtual users (default 1000), all from 127.0.0.1 as
admin. /auth/login is rate-limited 10/5min per-IP AND per-username, so
the 11th on_start() got 429 and a RuntimeError. A @task(2) login in
the task weights turned the whole run into a 429 factory even after
ramp-up. And _login_with_retry treated 429 as non-retryable, so there
was no graceful degradation path.
Three changes, one root cause:
- decnet/web/limiter.py: read DECNET_LIMITER_ENABLED (default true).
When false, slowapi's Limiter(enabled=False) makes @limiter.limit a
no-op. Default ships unchanged; nobody should ever release with this
off.
- tests/stress/conftest.py: set DECNET_LIMITER_ENABLED=false in the
uvicorn subprocess env. Stress tests measure throughput, not rate
limiting.
- tests/stress/locustfile.py: drop the @task(2) login — it added zero
coverage (every user already logs in at on_start) and only generated
contention. Teach _login_with_retry to honour 429 + Retry-After so a
Locust pointed at a limiter-enabled server degrades gracefully
instead of crashing on_start.
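
A transport-agnostic sketch of honouring 429 + Retry-After. It is not the locustfile's actual _login_with_retry: the callable-based signature and the (status, headers, body) tuple are assumptions made so the logic can run without a server.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], str]  # (status, headers, body)


def login_with_retry(do_login: Callable[[], Response],
                     max_attempts: int = 5,
                     default_wait_s: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry on 429, waiting for Retry-After seconds between attempts."""
    for _ in range(max_attempts):
        status, headers, body = do_login()
        if status == 200:
            return body
        if status != 429:
            raise RuntimeError(f"login failed with HTTP {status}")
        # Retry-After may also be an HTTP date; only delta-seconds is handled here.
        raw = headers.get("Retry-After", "")
        sleep(float(raw) if raw.isdigit() else default_wait_s)
    raise RuntimeError("login still rate-limited after retries")
```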
Three unrelated test-correctness fixes exposed by running tests/live:
- test_mqtt_live: honeypot defaults to auth-required (post-2018
realistic broker). Anonymous CONNECT is rejected with CONNACK rc=5,
which the "accept" / "subscribe" tests misread as a failure. Pass
MQTT_ACCEPT_ALL=1 via a new env= override on the live_service factory
so only those two tests opt into accept-all.
- test_postgres_live::test_auth_hash_logged: connected with
dbname='prod', which isn't in the honeypot's per-instance DB list, so
Postgres (correctly) rejected at startup before asking for a
password — blowing past the auth event the test asserts on. Target
'postgres' (always in _BASE_DBS) to reach the auth stage.
- test_mysql_backend_live: the module-scoped mysql_test_db_url fixture
is bound to the module loop, but function-scoped tests default to
their own per-function loops. Any reuse of the asyncmy pool then
tripped "Future attached to a different loop". Pin the whole module
with pytest.mark.asyncio(loop_scope='module').
MySQL can't index a BLOB/TEXT column without a prefix length, so
create_all() on a fresh MySQL schema blew up with "BLOB/TEXT column
'kd_digraph_simhash' used in key specification without a key length".
SimHashes are a fixed 8 bytes — the variable-length type was a
SQLAlchemy-side auto-mapping from 'Optional[bytes]', not an actual
schema requirement. Switch to BINARY(8), which is portable: MySQL gets
a fixed-width indexable BINARY, SQLite treats it as BLOB and doesn't
care about key length.
- .311/ and .3[0-9][0-9]/ + .venv*/ — cpython-version-suffixed venvs
(common convention) now covered alongside the existing .venv/.
- wiki-checkout/ — local nested clone of the wiki; never a submodule.
- hang.log / schem / *.pytest.log — scratch dumps from saved pytest
output redirections.
- deps.txt — pydeps-style dependency graph from local analysis runs.
No tracked files affected; just stops new working-tree noise from
showing up in git status.
- SIGNAL_CAPTURE_AUDIT.md: end-to-end walkthrough of what attacker
signals DECNET captures at each pipeline stage, where the gaps are
(session profile ingestion, keystroke dynamics), and what ships for
v1 vs what lands post-v1.
- api-audit.md: FastAPI /api/v1 route audit — surface area, auth
requirements, status-code coverage, and where schema drift would bite
the schemathesis suite.
Both are operator/engineering reference docs, not user-facing.
Adds instance_seed.py to every service template (conpot, docker_api,
imap, k8s, llmnr, pop3, rdp, sip, smb, snmp, ssh, telnet, tftp, vnc).
Derives a stable per-instance seed from NODE_NAME (+ optional
INSTANCE_ID) and exposes deterministic helpers for the boring details
scanners would otherwise use to fingerprint the whole fleet as one
machine: cluster UUIDs, auth salts, uptime fixtures, minor version
strings. Connection-time jitter is intentionally NOT seeded — two hits
to the same decky must not replay the same latency curve.
Identical source across every template; lives next to each service so
the Docker build context picks it up without a shared package-data hop.
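
One way such a seed and its helpers might be derived, as a sketch only. The hashing scheme, the separator bytes, and every function name here are assumptions, not the contents of instance_seed.py.

```python
import hashlib
import uuid
from typing import Optional


def instance_seed(node_name: str, instance_id: Optional[str] = None) -> bytes:
    """Stable per-instance seed from NODE_NAME plus an optional INSTANCE_ID."""
    material = node_name if instance_id is None else f"{node_name}\x00{instance_id}"
    return hashlib.sha256(material.encode("utf-8")).digest()


def _derive(seed: bytes, label: str) -> bytes:
    # A label per use keeps the derived values independent of each other.
    return hashlib.sha256(seed + b"|" + label.encode("utf-8")).digest()


def derive_uuid(seed: bytes, label: str) -> uuid.UUID:
    """Deterministic UUID (version bits set to 4) for e.g. a cluster id."""
    return uuid.UUID(bytes=_derive(seed, label)[:16], version=4)


def derive_int(seed: bytes, label: str, low: int, high: int) -> int:
    """Deterministic integer in [low, high], e.g. an uptime fixture."""
    return low + int.from_bytes(_derive(seed, label)[:8], "big") % (high - low + 1)
```

The same node name always yields the same values, while two deckies with different names diverge. Anything that should vary per connection would bypass these helpers entirely.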
Exclude lists fail open — anything new at the master's repo root (venvs,
logs, dev notes, .env.local, local DB dumps) silently leaks into every
agent bundle. On this box a stray .311 venv (335 MB) + logs/ (220 MB)
bloated the tarball to ~150 MB and blew test_enroll_bundle timeouts.
Replace _EXCLUDES + _is_excluded with _INCLUDED_ROOT_FILES +
_INCLUDED_DIRS + _EXCLUDED_DECNET_SUBTREES and iterate via os.walk with
in-place dirnames[:] pruning so master-only subtrees (decnet/web,
decnet/mutator, decnet/profiler) and __pycache__ aren't descended into
at all.
Bundle contents are now strictly: pyproject.toml + the decnet/ package
minus the three master-only subtrees. Synthetic entries (INI, certs,
systemd units) unchanged — they were always added inline, not from the
tree walk.
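
The include-list walk can be sketched as below. The constant names drop the leading underscore and their contents are inferred from the description above; the generator name is hypothetical.

```python
import os
from pathlib import Path
from typing import Iterator

INCLUDED_ROOT_FILES = ("pyproject.toml",)
INCLUDED_DIRS = ("decnet",)
EXCLUDED_DECNET_SUBTREES = {"decnet/web", "decnet/mutator", "decnet/profiler"}


def iter_bundle_files(repo_root: Path) -> Iterator[Path]:
    """Yield repo-relative paths that belong in the agent bundle."""
    for name in INCLUDED_ROOT_FILES:
        if (repo_root / name).is_file():
            yield Path(name)
    for top in INCLUDED_DIRS:
        for dirpath, dirnames, filenames in os.walk(repo_root / top):
            rel_dir = Path(dirpath).relative_to(repo_root)
            # In-place assignment prunes the walk: os.walk never descends
            # into directories removed from dirnames.
            dirnames[:] = sorted(
                d for d in dirnames
                if d != "__pycache__"
                and (rel_dir / d).as_posix() not in EXCLUDED_DECNET_SUBTREES
            )
            for filename in sorted(filenames):
                yield rel_dir / filename
```

Because nothing outside the include lists is ever visited, a new stray directory at the repo root cannot leak into the bundle.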
test_enroll_bundle.py: 20/20 pass in 24s (was timing out at 15s/test).
Groups every flat test_*.py under the module it exercises, matching the
existing tests/{profiler,sniffer,prober,collector,correlation,cli,web,
topology,swarm,bus,updater,api,docker,geoip,...} layout. New folders:
services/, fleet/, config/, logging/, db/ (+ db/mysql/), telemetry/,
mutator/, core/.
Path-dependent __file__ references bumped an extra .parent in the four
files that moved one level deeper:
- tests/sniffer/test_sniffer_ja3.py (template path)
- tests/services/test_ssh_capture_emit.py (template path)
- tests/cli/test_mode_gating.py (REPO root)
- tests/web/test_env_lazy_jwt.py (repo var)
Also drops two SQLite runtime artifacts (test_decnet.db-{shm,wal}) that
were leaking into the repo from a previous test run.
Fixes two test_service_isolation cases that patched asyncio.sleep (no
longer on the profiler main-loop hot path — same pre-existing bug I
fixed earlier in test_attacker_worker.py) by patching asyncio.wait_for
and passing interval=0.
- Attackers list: small country-code chip next to the IP on each card,
title-tooltip shows the source (e.g. "rir")
- AttackerDetail: country-code tag next to the IP in the header plus an
ORIGIN field in the TIMELINE section for always-visible origin
- TypeScript interfaces updated with country_code/country_source
Since the event-driven shutdown refactor (0fbb07c), the profiler main
loop is asyncio.wait_for(shutdown.wait(), timeout=interval) — no sleep
on the hot path. The four worker tests that patched asyncio.sleep to
raise CancelledError on the Nth call were silently no-op'ing and
hanging on the real 30 s wait_for timeout.
Replace the sleep patches with a shared _cancel_after helper that
patches wait_for itself. Pass interval=0 so the loop ticks without
delay between iterations.
Populates Attacker.country_code + country_source (MVP) using the five
RIR delegated-stats files (ARIN/RIPE/APNIC/LACNIC/AFRINIC). Offline,
license-free, no outbound traffic that could burn honeypot stealth.
- decnet.geoip package with factory/base/lookup + rir/ subpackage
(fetch/parse/provider) mirroring the db + bus factory convention
- Profiler._build_record calls enrich_ip on every upsert
- Idempotent ALTER TABLE migrations for both SQLite and MySQL
- decnet geoip refresh/lookup CLI (master-only)
- /var/lib/decnet/geoip seeded by decnet init
- DECNET_GEOIP_ENABLED=false kill-switch; set in tests/conftest.py so
unit tests never trigger the first-access fetch
The config file `decnet init` dropped at /etc/decnet/config.ini was a
stub with a single [decnet] header saying 'reserved for future structured
settings.' Admins who wanted to tune DECNET_API_HOST, DECNET_DB_URL,
DECNET_BATCH_SIZE, etc. had to hunt env.py for the exact variable name
and drop it in .env.local.
Changes:
- decnet/config_ini.py — adds a _DOMAIN_MAP translation table covering
[api], [web], [database], [bus], [swarm], [logging], [ingester],
[tracing]. Loads regardless of mode; unknown keys inside a known
section log a WARNING (operator typos shouldn't be silent).
Explicit key map (not auto kebab-to-snake) so [web] admin-user lands
in DECNET_ADMIN_USER without silently renaming the env-var contract
consumers import from decnet.env.
- decnet/cli/init.py — renames the placeholder target config.ini →
decnet.ini (unifies with the name already used by load_ini_config and
the enroll bundle's _render_decnet_ini). Placeholder body now shows
every domain section as a commented example so admins learn the
shape by reading. Deinit removes both decnet.ini and the legacy
config.ini so upgrading hosts leave no orphan file.
Precedence is unchanged: real env > INI > built-in default in env.py.
os.environ.setdefault means systemd EnvironmentFile= and one-off
DECNET_FOO=bar decnet ... invocations always win.
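
A small sketch of the explicit key map plus setdefault precedence. Only [web] admin-user and the three env-var names come from the text above; the INI key names "host" and "url", the function name, and the returned list are assumptions.

```python
import configparser
from typing import Dict, List, MutableMapping, Tuple

# Explicit (section, key) -> env var map; no automatic kebab-to-snake.
KEY_MAP: Dict[Tuple[str, str], str] = {
    ("api", "host"): "DECNET_API_HOST",       # key name assumed
    ("database", "url"): "DECNET_DB_URL",     # key name assumed
    ("web", "admin-user"): "DECNET_ADMIN_USER",
}


def apply_ini(text: str, env: MutableMapping[str, str]) -> List[str]:
    """Load INI values into env without overriding anything already set.

    Returns the unknown keys found inside known sections so the caller
    can log a WARNING for each.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    known_sections = {section for section, _ in KEY_MAP}
    unknown: List[str] = []
    for section in parser.sections():
        if section not in known_sections:
            continue
        for key, value in parser.items(section):
            env_name = KEY_MAP.get((section, key))
            if env_name is None:
                unknown.append(f"[{section}] {key}")
                continue
            env.setdefault(env_name, value)  # real env wins over INI
    return unknown
```

Passing os.environ as env gives the precedence described above: a value exported by systemd or on the command line is already present, so setdefault leaves it alone.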
Secrets explicitly NOT moved to the INI:
- DECNET_JWT_SECRET
- DECNET_ADMIN_PASSWORD
- DECNET_DB_PASSWORD
They stay in .env.local / EnvironmentFile= — never in a group-readable
INI, never in a diff, never on the dashboard.
Dev/profiling flags (DECNET_DEVELOPER, DECNET_EMBED_*, DECNET_PROFILE_*)
also stay env-only per maintainer direction: a dev knob shouldn't be
one 'I'll flip this for tonight' edit away from a persistent config file.
Tests: +5 in test_config_ini.py (domain sections load regardless of mode,
env beats INI for domain keys, unknown key warns, absent section is
no-op, role section beats domain section via setdefault precedence). +1
in test_init.py (placeholder writes decnet.ini with every section
header present as commented guidance).
31 tests pass across the two files (was 26).
Distros reserve /opt for different things (some package managers own it
outright), and a DECNET install that wants to live at /srv/decnet or
/usr/local/decnet had to hand-edit 13 service files post-install.
Converts every deploy/decnet-*.service to a .j2 template keyed on
{{ install_dir }}, rendered by `decnet init` at install time. All other
paths (log_dir, state_dir, runtime_dir, user, group) stay standard —
only install_dir varies.
Changes:
- deploy/decnet-*.service → deploy/decnet-*.service.j2 (13 files).
- decnet init gains --install-dir (default /opt/decnet, preserves
existing behaviour byte-for-byte). Validates absolute-path at the
CLI boundary. Threads through useradd --home-dir and the dir-creation
list so the filesystem layout matches the rendered templates.
- _install_units renders via Jinja2 with StrictUndefined (typo → loud
error, not a silent broken unit). SHA over rendered output so
operators with a custom install_dir get idempotent re-runs.
- decnet.target, tmpfiles.d, polkit rule stay static — they don't
reference install paths.
- 4 new tests: custom install_dir renders into units, default remains
/opt/decnet, relative paths rejected, second run with same custom
dir is idempotent.
Worker bus instances (collector, ingester) close their private buses
in finally blocks on shutdown, but stream threads holding closure
references kept calling publish after close — one `RuntimeError:
publish on closed bus` per stream line, caught by publish_safely
and logged per call, flooding server logs.
Changes:
- `UnixSocketBus.publish()` now drops post-close calls. First drop
WARNs loudly (bus is critical infra — silent drops would hide real
problems); subsequent drops on the same instance log at DEBUG to
prevent the flood. Sticky `_closed_publish_warned` flag, reset
naturally per new bus instance.
- `make_thread_safe_publisher` short-circuits on a closed bus before
marshalling a coroutine onto the loop. Avoids the wasted scheduling
work in the hot shutdown path.
Degradation is safe: callers go through `publish_safely`, which
already treats exceptions as 'dropped notification, DB is source of
truth.' We just stop manufacturing the exception in the first place
for a known-benign condition.
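
The warn-once behaviour can be sketched with a toy bus. Only the _closed_publish_warned flag name comes from the text above; the class, its in-memory "sent" list, and the bool return value are illustrative assumptions, not UnixSocketBus itself.

```python
import logging
from typing import Any, List, Tuple

log = logging.getLogger("bus.sketch")


class SketchBus:
    """Toy stand-in that records publishes instead of writing to a socket."""

    def __init__(self) -> None:
        self._closed = False
        self._closed_publish_warned = False  # sticky, per instance
        self.sent: List[Tuple[str, Any]] = []

    def close(self) -> None:
        self._closed = True

    def publish(self, topic: str, payload: Any) -> bool:
        if self._closed:
            # First drop is loud, later drops are quiet to avoid a log flood.
            level = logging.DEBUG if self._closed_publish_warned else logging.WARNING
            self._closed_publish_warned = True
            log.log(level, "dropping publish on closed bus: %s", topic)
            return False
        self.sent.append((topic, payload))
        return True
```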
A startup race between `decnet bus` being ready and the API's lifespan
hitting `get_app_bus()` at api.py:135 would set `_tried = True`
permanently, poisoning the singleton for the rest of the process: the
dashboard shows BUS OFFLINE, topology SSE falls into the bus-is-None
snapshot-only branch, mutator publish calls no-op. Only an API
restart recovered.
Replaces the one-shot veto with a time-gated retry keyed on a
`_last_failure_ts` monotonic timestamp plus a 2 s backoff. Publishers
on the hot path still pay at most one connect attempt every 2 s when
the bus is down, but the singleton auto-recovers within 5 s (one
dashboard poll) once the bus comes up.
The asyncio lock still serialises concurrent callers so the bus server
doesn't get stampeded with parallel connect attempts on startup.
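
A sketch of the time-gated retry. The class, the injected connect callable, and the injectable clock are assumptions made so the behaviour can be exercised without a real bus; only _last_failure_ts and the 2 s backoff come from the text above.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Optional


class LazyBus:
    """Singleton-style accessor that retries a failed connect after a backoff."""

    def __init__(self, connect: Callable[[], Awaitable[Any]],
                 backoff_s: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._connect = connect
        self._backoff_s = backoff_s
        self._clock = clock
        self._bus: Optional[Any] = None
        self._last_failure_ts: Optional[float] = None
        self._lock = asyncio.Lock()

    async def get(self) -> Optional[Any]:
        if self._bus is not None:
            return self._bus
        async with self._lock:  # serialise concurrent connect attempts
            if self._bus is not None:
                return self._bus
            last = self._last_failure_ts
            if last is not None and self._clock() - last < self._backoff_s:
                return None  # still inside the backoff window
            try:
                self._bus = await self._connect()
            except OSError:
                self._last_failure_ts = self._clock()
                return None
            return self._bus
```

A failure only blocks further attempts for backoff_s seconds, so a bus that comes up late is picked up by the next caller after the window expires.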
Registers a generic @app.exception_handler(Exception) that catches anything
uncaught in route handlers / dependencies. Prod response is opaque:
{detail: 'Internal Server Error', error_id: <uuid4 hex>}. Dev mode
(DECNET_DEVELOPER=True) adds exception_type and traceback fields so
failures are debuggable without tailing server logs.
The error_id is logged alongside the full traceback server-side, letting
operators correlate a user's 500 report with the exact exception via
`grep <error_id> /var/log/decnet.log`.
FastAPI's own HTTPException routing and the existing
RequestValidationError / ValidationError / RateLimitExceeded handlers
still take precedence — this handler only fires on genuinely-uncaught
exceptions.
Flips threat model F1/I 'traceback / stack trace leakage' from ? to M
and logs a follow-up checklist entry for 4 detail=str(e) sites in the
fleet deploy router (admin-gated, different threat class, separate
audit).
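A framework-free sketch of the response body the handler builds. The function name `build_500_body` is hypothetical and the FastAPI wiring is omitted; the field names follow the description above.

```python
import traceback
import uuid


def build_500_body(exc: BaseException, developer: bool) -> dict:
    """Build the JSON body for an uncaught exception.

    `developer` mirrors DECNET_DEVELOPER. Prod callers get only the opaque
    detail and an error_id that can be grepped in the server log.
    """
    body = {"detail": "Internal Server Error", "error_id": uuid.uuid4().hex}
    if developer:
        body["exception_type"] = type(exc).__name__
        body["traceback"] = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
    return body
```

The same error_id would be written next to the full traceback in the server log, which is what makes the grep correlation work.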
Adds a slowapi two-bucket rate limit on /auth/login: 10 attempts per
5 minutes per-IP AND per-username; tripping either bucket returns 429.
Per-IP catches a single source hammering many accounts; per-username
catches distributed credential stuffing against one account. In-memory
storage: the dashboard API is single-process, so Redis is
disproportionate for v1.
X-Forwarded-For is deliberately NOT trusted (spoofable); reverse-proxy
deployments get one shared bucket per proxy IP. Logged in the threat
model as accepted risk DA-08, to be revisited when a verified-proxy
config lands.
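To illustrate the two-bucket idea without slowapi, here is a plain sliding-window sketch. `TwoBucketLimiter` is a hypothetical stand-in, not slowapi's API; the 10-per-5-minutes numbers come from the text above.

```python
import time
from collections import defaultdict, deque

LIMIT = 10          # attempts per window
WINDOW_SECS = 300   # 5 minutes


class TwoBucketLimiter:
    """In-memory limiter keyed independently on client IP and on username."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._hits = defaultdict(deque)  # bucket key -> timestamps of attempts

    def _over(self, key, now):
        q = self._hits[key]
        while q and now - q[0] >= WINDOW_SECS:
            q.popleft()  # expire attempts that fell out of the window
        return len(q) >= LIMIT

    def allow(self, ip: str, username: str) -> bool:
        """False means the caller should answer 429."""
        now = self._clock()
        keys = (("ip", ip), ("user", username))
        if any(self._over(k, now) for k in keys):
            return False
        for k in keys:
            self._hits[k].append(now)
        return True
```

This sketch only counts attempts that were let through; whether rejected attempts also consume budget is a policy choice it does not settle.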
Also scaffolds development/THREAT_MODEL.md with STRIDE-per-element
methodology, system-context DFD, and Dashboard↔API as the first fully
worked component (7 sub-flows, ~50 threat entries). F1 Authn ships
with 3 threats mitigated: rate limit (new), uniform 401 (verified
already in place), bcrypt length clamp (verified already in place via
Pydantic max_length=72).
Adds GET /attackers/{uuid}/smtp-targets (viewer) and GET /attackers/{uuid}/mail
(admin) endpoints, plus two new sections on the attacker detail page:
VICTIM DOMAINS rollup (aggregate-only, federation-gossip-safe) and STORED MAIL
with a drawer that decodes headers, lists attachments, and downloads the raw
.eml via the existing artifact endpoint (?service=smtp).
New SmtpTarget table records each (attacker, domain) pair observed via
the SMTP honeypots. Only the domain is stored — local-parts are dropped
at ingestion, so this table holds no user-identifying data beyond the
target organisation's identity.
The profiler worker extracts domains from rcpt_to / rcpt_denied /
message_accepted events, normalizes them (lowercase, strip local-part,
drop blocked TLDs), and upserts one row per pair with a running count +
first_seen / last_seen.
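The normalisation step might look like the sketch below. The function name and the `BLOCKED_TLDS` contents are assumptions made for illustration; the real blocklist is not shown in this change.

```python
BLOCKED_TLDS = {"local", "localhost", "invalid"}  # hypothetical list


def normalize_rcpt_domain(rcpt: str):
    """Return the lowercase domain of an RCPT address, or None if unusable.

    The local-part is discarded here, so nothing user-identifying is kept.
    """
    addr = rcpt.strip().strip("<>")
    _, sep, domain = addr.rpartition("@")
    domain = domain.strip().rstrip(".").lower()
    if not sep or not domain:
        return None
    if domain.rsplit(".", 1)[-1] in BLOCKED_TLDS:
        return None
    return domain
```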
Three repo methods shipped:
* increment_smtp_target(attacker, domain) — upsert + bump
* list_smtp_targets(attacker) — per-attacker view
* smtp_target_seen(domain) — cross-attacker aggregate, shaped as the
federation-gossip RPC that V2 will expose.
The gossip-query shape is load-bearing: each operator can answer
"have any of your attackers targeted corp1.com?" without leaking
which attackers or when — the aggregate returns a bool + total count
+ first/last seen, nothing else.
SMTP template now writes each accepted DATA body as a .eml file into a
bind-mounted per-decky quarantine dir and emits a `message_stored` log
with sha256, size, decoded headers, and an attachment manifest
(filename + sha256 + size + content-type). Attachment hashing uses the
*decoded* payload so operators can match against VT / MalwareBazaar
directly. Body accumulator is capped at SMTP_MAX_BODY_BYTES (default
10 MB, matching the EHLO SIZE advert) so a streaming client can't OOM
the container.
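Hashing the decoded attachment payload can be sketched with the standard library `email` package. `attachment_manifest` is a hypothetical helper and the SMTP template may parse messages differently; the manifest keys follow the fields listed above.

```python
import hashlib
from email import message_from_bytes, policy


def attachment_manifest(raw_eml: bytes) -> list:
    """List attachments with hashes taken over the decoded bytes."""
    msg = message_from_bytes(raw_eml, policy=policy.default)
    manifest = []
    for part in msg.iter_attachments():
        # decode=True undoes base64 / quoted-printable, so the digest matches
        # the file as an analyst would see it on disk.
        payload = part.get_payload(decode=True) or b""
        manifest.append({
            "filename": part.get_filename(),
            "sha256": hashlib.sha256(payload).hexdigest(),
            "size": len(payload),
            "content_type": part.get_content_type(),
        })
    return manifest
```

Hashing the transfer-encoded text instead would produce a digest that no malware database has ever seen.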
The existing /api/v1/artifacts/{decky}/{stored_as} endpoint now takes
an optional ?service= query param (defaults to ssh for back-compat)
and can serve .eml files out of the smtp subdir. Forensic metadata
rides the normal log pipeline, same as SSH file_captured.
decnet/web/db/models.py was approaching 1000 lines across User/Log/
Attacker/Swarm/Topology/Workers/Updater/Health domains. Split into a
package with one module per domain; __init__.py re-exports every symbol
so all 52 call sites keep importing from decnet.web.db.models
unchanged.
New purpose-built table with schema_version column committed from day one
so V2 federation gossip can cluster sessions across operators without
retrofitting. Ships with the empty write path (upsert_session_profile);
ingestion of keystroke features (IKI moments, control-char rates, digraph
SimHash) is tracked as V2 work.
Closes gap #2 from SIGNAL_CAPTURE_AUDIT.md.
Parse RFC 4253 §4.2 identification strings from the first attacker→decky
data segment on TCP/22; emit ssh_client_banner syslog events and bus
fan-out. Profiler's sniffer_rollup dedupes observed banners into a new
AttackerBehavior.ssh_client_banners JSON column.
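A rough sketch of the identification-string parse, assuming the `SSH-protoversion-softwareversion SP comments CR LF` layout from RFC 4253 §4.2. The function name and return shape are illustrative, not the sniffer's actual code.

```python
import re

# RFC 4253 section 4.2 caps the identification string at 255 bytes incl. CR LF.
_BANNER_RE = re.compile(
    rb"^SSH-(?P<proto>[^-\s]+)-(?P<software>[^\s]+)"
    rb"(?: (?P<comments>[^\r\n]*))?\r?\n"
)


def parse_ssh_client_banner(segment: bytes):
    """Parse the first client data segment; return None if it is not a banner."""
    m = _BANNER_RE.match(segment[:255])
    if m is None:
        return None
    return {
        key: value.decode("ascii", "replace") if value is not None else None
        for key, value in m.groupdict().items()
    }
```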
Closes gap #3 from SIGNAL_CAPTURE_AUDIT.md.
Prober already emits kex_algorithms in hassh_fingerprint syslog events, but
the raw ordered list was only queryable via the generic bounty store. Add a
dedicated AttackerBehavior.kex_order_raw column (TEXT, JSON list) so
post-v1 KEX-order fingerprinting has a typed, indexable home.
Pipeline:
- sniffer_rollup() now consumes hassh_fingerprint events and collects
distinct kex_algorithms strings across ports.
- build_behavior_record() JSON-encodes the list (NULL when empty).
- sqlmodel_repo._deserialize_behavior() parses it back into a list.
Closes pre-v1 gap #1 from SIGNAL_CAPTURE_AUDIT.md.
Break the 603-line behavioral.py into timing/classify/tools/phases/fingerprint
sibling modules plus a slim orchestrator. Public API unchanged: behavioral.py
re-exports every previously-exposed symbol, so worker.py and existing tests
keep working with zero import changes.
No behavior change; all 64 profiler tests pass.
- TopologyList header now uses .page-header + .page-title-group +
.page-sub like Dashboard/Attackers/DeckyFleet; title typography and
separator match the rest of the app.
- Pluralisation fix: '0 topologyies' → '0 TOPOLOGIES', singular '1
TOPOLOGY'.
- When the list is empty the EmptyState renders in its own flex
container that fills the viewport so the card is centered on both
axes, with bumped icon/title/hint sizing for the hero treatment.
delete_topology_cascade manually deletes status_events, edges, deckies
and lans but overlooked topology_mutations, so deleting any topology
that ever had a mutation enqueued (i.e. edits while active|degraded)
failed with an FK IntegrityError. Add the missing DELETE and extend
the cascade test to seed a mutation row.
MazeNET header now reports '{running}/{total} DECKIES RUNNING' so
operators can see per-topology runtime status at a glance.
Dashboard ACTIVE DECKIES counters used to reflect only the fleet state
file; TopologyDecky rows (MazeNET deployments) are now added in —
deployed_deckies = fleet + all topology rows, active_deckies = fleet
(no runtime field) + topology rows whose state is 'running'.
Hovering the empty-state row in LiveLogs/Dashboard tables briefly lit
the full-width td with the data-row glow. Tag the placeholder tr with
.empty-row and scope the .logs-table hover rule to :not(.empty-row).
Base .empty-state now flex-centers its icon/title/hint/CTA with a
140px min-height so icon-bearing empty states in the Dashboard side
panels (DECKIES UNDER SIEGE, TOP ATTACKERS) stop looking cramped.
Component-scoped rules (attackers-root, bounty-root, logs-root)
remain more specific and are unaffected.
- New ShortcutsHelp modal enumerates global, nav G-chord and palette
bindings; openable via ? (Shift+/) or the command palette.
- / dispatches a global decnet:focus-search event; Attackers, Bounty
and LiveLogs listen and focus their in-page search inputs (pages
without a local search are skipped per plan).
- Respects the existing editable-element guard and Alt+K palette
toggle; no rebinds to prior shortcuts.
Replace ad-hoc empty-state markup across Dashboard, TopologyList,
LiveLogs, Attackers, Bounty, AttackerDetail, SwarmHosts, RemoteUpdates
and CommandPalette with the new <EmptyState> component. Themed icons
+ hints improve discoverability; TopologyList and SwarmHosts gain
CTAs to their respective creation flows.
Each page gets its own scoped stylesheet and is rewritten around the
shared design language: filter bars, paginated lists, empty-state
blocks, BountyInspector drawer. Behavioural surface is unchanged —
same API calls, same routes, same RBAC gating.
Rewrites Dashboard.tsx around three stacked panels — live interactions,
deckies-under-siege, and top-attackers — each with its own header,
empty state, and status accents. Dashboard.css fills in the supporting
grid + type system.
- CommandPalette (Alt+K): fuzzy action launcher with keyboard nav.
- Toasts: ephemeral notification stack + provider.
- useGlobalHotkeys: Alt+K palette toggle, G-chord navigation
(G D/F/M/L/B/A/S/U/E/C), respects editable-element focus.
- Layout/App: wire ToastProvider at root, mount the palette inside the
authed shell, introduce the global search box in the top bar.
- MazeNETRoute now renders TopologyList inline when no ?topology is
present, instead of bouncing through a redirect.
- index.css: a few global token tweaks consumed by the new chrome.
Fixes a latent breakage: Config.tsx and MazeNET already imported
./Toasts/useToast but the directory was never committed.
The DELETE path on a topology whose containers are still up is a
footgun — even if the backend rejects the delete, surfacing the
button invites mistakes. Gate it so DELETE only shows for pending,
failed, and torn-down topologies. Active/degraded/deploying topologies
must be torn down first, which then reveals DELETE again.
POST /topologies/{id}/lans previously called _auto_attach_gateway()
whenever a non-DMZ LAN was created, which wired the DMZ gateway decky
to every new subnet. That's why a deployed gateway ended up with
eth0..ethN on every LAN regardless of what the user drew in MazeNET.
Drop the auto-attach helper entirely. The DMZ_ORPHAN deploy-time
validator (decnet/topology/validate.py:65-110) stays strict — users
must explicitly wire the gateway to each subnet they want bridged,
which is the whole point of having a topology editor.
useMazeApi.ts: drop stale auto-bridge reference from comment.
ArtifactDrawer, SessionDrawer, CreateTopologyWizard all now:
- close on ESC
- trap Tab/Shift+Tab focus within the panel
- lock body scroll while open
- restore prior focus on unmount
Uses the new useEscapeKey + useFocusTrap hooks. No visual changes;
the bespoke CSS shells (ctw-*, inline drawer styling) are preserved.
- Modal: shared backdrop/panel with ESC-close, backdrop-click-close,
focus trap, body scroll lock; supports center + drawer-right variants,
matrix/violet accents, default/wide widths.
- EmptyState: icon + title + hint + optional CTA; compact variant
for tight rails.
- useEscapeKey, useFocusTrap: reusable hooks powering Modal; will also
be adopted by CommandPalette and ContextMenu in follow-up commits.
No retrofits yet — primitives only. tsc clean.
Pan drag previously required mousedown on the bare canvas (target ===
currentTarget). When zoomed in, net-boxes cover most of the viewport
so there was no bare grid to grab. Drop the guard — node/header/port/
resize handlers all call stopPropagation() already, so only net-box
body mousedowns bubble up to start the pan, which is exactly what
we want.
Wheel-to-zoom anchored at the cursor, ZOOM IN/OUT toolbar buttons, and
a live zoom% in the status bar. Pan layer gets transform-origin 0 0 and
a scale(zoom) factor; grid pattern tile scales with zoom; edge SVG is
overflow:visible so long edges don't clip at high zoom. World-space
hit-testing, resize deltas, and palette drops all divide by zoom.
Reset View zeroes pan AND zoom.
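The cursor-anchored zoom reduces to one identity. A worked sketch in Python (the editor itself is TypeScript), assuming the pan layer maps world to screen as screen = pan + world * zoom, which is what transform-origin 0 0 with a translate followed by a scale gives:

```python
def zoom_at_cursor(pan, zoom, cursor, new_zoom):
    """Return the pan that keeps the world point under `cursor` fixed.

    All arguments are in screen pixels except the zoom factors.
    """
    # Screen -> world: undo the pan, then divide by the current zoom.
    world_x = (cursor[0] - pan[0]) / zoom
    world_y = (cursor[1] - pan[1]) / zoom
    # Choose the new pan so the same world point lands under the cursor again.
    return (cursor[0] - world_x * new_zoom, cursor[1] - world_y * new_zoom)
```

The same divide-by-zoom step is what hit-testing, resize deltas and palette drops need to get back into world space.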
Clicking a service tag selects it (stops node drag), extends Selection
discriminant with {type:'service',id,nodeId}, and renders an inspector
panel showing proto/port/subnet/risk chip + REMOVE SERVICE button
(gated off for observed nodes and degraded topologies). Service-tag
styling now pulls `risk` from DEFAULT_SERVICES metadata instead of
node.status alone.
Reverse of init, step-by-step: systemctl disable --now decnet.target,
remove every decnet-*.service + decnet.target unit file, drop the
polkit rule, drop the tmpfiles.d entry, daemon-reload, remove
/etc/decnet + /etc/decnet/config.ini, /run/decnet, /opt/decnet, and
userdel/groupdel the decnet identity.
Preserves /var/lib/decnet and /var/log/decnet by default — those
hold operator data. Pass `--deinit --purge` to rm -rf them too.
Idempotent on a clean host (every step prints [SKIP]). Honours
--dry-run.
5 new tests cover the full-undo path, --purge, idempotent clean-host
deinit, dry-run side-effect-free behaviour, and the --purge without
--deinit guard.
Creates the decnet system user/group, installs every unit file from
deploy/ into /etc/systemd/system, drops the polkit rule, seeds
/opt/decnet + /var/{lib,log}/decnet + /etc/decnet + /run/decnet,
writes a placeholder /etc/decnet/config.ini, applies the new
tmpfiles.d entry so /run/decnet is recreated on every boot, daemon-reloads,
and runs `systemctl enable --now decnet.target`.
Idempotent (re-runs print [SKIP] on already-configured items),
--dry-run previews the plan without touching anything, --no-start
defers the target start, --force overwrites even matching unit
files. Master-only (added to MASTER_ONLY_COMMANDS).
9 orchestration tests cover the non-root gate, dry-run, useradd/
groupadd argv, SKIP on present user/group, unit-file idempotency,
--force overwrite, --no-start suppression, happy path, and the
"deploy/ not found" error message.
Units + polkit rule + systemd_control helper + start endpoints +
installed flag + UI wiring all landed. SWARM-host start/stop and
crash-quarantine policy stay as named deferrals.
Per-row START button enabled iff `installed && status !== 'ok'`;
tooltip explains why it's disabled ("Unit not installed" /
"Already running"). Transient `starting` state shows `...` on the
button and auto-clears after 15s so the UI never gets stuck if the
heartbeat is slow.
START ALL WORKERS button in the header calls /workers/start-all and
renders the three counts in the toast:
`STARTED · N · ALREADY RUNNING · M · FAILED · K (first failure: …)`.
Tone flips to alert when K > 0.
POST /api/v1/workers/{name}/start — 202 on acceptance, 404 unknown
worker, 503 if the unit file is not installed, 502 if systemctl
returns non-zero (stderr snippet in detail, full stack logged).
Admin only.
POST /api/v1/workers/start-all — best-effort: walks the worker list
in dependency order (bus → api → data-plane), skips already-active
and uninstalled units, aggregates outcomes into
{started, already_running, failed[]}. Returns 200 even on partial
failure; the caller reads the three lists.
Both endpoints delegate to the systemd_control helper, so the attack
surface for "what gets executed" is locked to `decnet-<validated-name>
.service` at two layers (router KNOWN_WORKERS + helper regex).
Ships the backend half of Config → Workers:
* Worker registry aggregates `system.*.health` + `system.bus.health`
heartbeats into a last-seen dict; OK / STALE / UNKNOWN tiers drop
out of a 90s window (3× the 30s heartbeat interval).
* `GET /api/v1/workers` returns the snapshot plus `bus_connected`
(so the UI can explain "all UNKNOWN" when the bus socket is down)
and a per-row `installed` flag populated from
`systemctl list-unit-files decnet-*.service` (cached 30s).
* `POST /api/v1/workers/{name}/stop` publishes a stop intent on
`system.<name>.control`; workers listen via the shared control
listener in `bus/publish.py`.
* Heartbeat + control listener wired into collector / profiler /
sniffer / prober / mutator worker loops. API self-heartbeats too
so the panel always has one ground-truth row.
* Topic helper `system_control(name)` + tests covering builder
validation, control listener shutdown path, and the API surface
(auth gating, bus-connected field, unknown-name 404).
Adds `StartFailure` / `StartAllResponse` models in anticipation of
the upcoming start endpoints (DEBT-034).
Thin async wrapper over `systemctl` — never shell=True, always
create_subprocess_exec. Unit names are built from
`decnet-<validated-name>.service`; the regex check is defence in depth
on top of the router-level KNOWN_WORKERS validation.
Exposes start / stop / is_active / list_installed; the last is cached for
30s to keep the Workers panel cheap under REFRESH spam. On non-systemd
hosts list_installed returns an empty set, so the UI renders with
every row marked not-installed instead of 500-ing.
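A minimal sketch of the helper's shape. The regex below is an assumption (the real pattern is not shown), as are the names `unit_name` and `systemctl`; the exec-without-shell call is the part taken from the description above.

```python
import asyncio
import re

_NAME_RE = re.compile(r"[a-z][a-z0-9-]*")  # hypothetical pattern


def unit_name(worker: str) -> str:
    """Build the only unit name this helper will ever pass to systemctl."""
    if not _NAME_RE.fullmatch(worker):
        raise ValueError(f"invalid worker name: {worker!r}")
    return f"decnet-{worker}.service"


async def systemctl(verb: str, worker: str):
    """Run `systemctl <verb> decnet-<worker>.service` without a shell."""
    proc = await asyncio.create_subprocess_exec(
        "systemctl", verb, unit_name(worker),
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    _out, err = await proc.communicate()
    return proc.returncode, err.decode(errors="replace")
```

Because arguments go to exec as a list, a name that slipped past both validation layers would still be a single argv entry rather than shell syntax.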
Scoped rule — matches only `decnet-<name>.service` and `decnet.target`.
Any unit outside that regex falls through to the default polkit policy.
Required so the API (running as the `decnet` user) can invoke
`systemctl start decnet-<name>.service` non-interactively.
Adds the five missing worker units plus a grouping target so
`systemctl start decnet.target` brings the whole fleet up in order.
Sniffer gets CAP_NET_RAW for scapy; collector and mutator join the
docker supplementary group for docker.sock access. Repoints
Documentation= across all existing units to the canonical
git.resacachile.cl wiki.
Add tests/service_testing/test_instance_seed.py — pins NODE_NAME to assert
determinism of seeded functions and sweeps NODE_NAMEs to assert cross-fleet
divergence. Conftest gains load_real_instance_seed() so template tests see
the real seeding behavior instead of a stub. Existing template tests updated
to pin NODE_NAME and match seeded outputs.
Every service template now pulls version strings, cluster/node UUIDs, auth
salts, greeting banners, and uptime from the seeded per-instance RNG instead
of hard-coded defaults. Scanners sweeping the fleet now see legitimately
diverging fingerprints per decky while each decky's own responses stay
internally consistent across restarts.
Covers elasticsearch, ftp, http, https, ldap, mongodb, mqtt, mssql, mysql,
postgres, redis, and smtp templates.
Each decky now gets a deterministic-per-instance seeded RNG derived from
NODE_NAME, so cluster UUIDs, version strings, uptime, and credentials diverge
across the fleet while staying stable within one container. The canonical
helper lives at decnet/templates/instance_seed.py; the deployer copies it into
every active template build context alongside syslog_bridge.py. Dockerfiles
COPY it to /opt/ so server.py can import it.
Connection-time jitter intentionally stays unseeded — two hits to the same
decky must not replay the same latency curve.
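One way to derive such an RNG, as a sketch only. `instance_rng`, `seeded_uuid` and the `purpose` salt are hypothetical names; the real helper in decnet/templates/instance_seed.py may hash or salt differently.

```python
import hashlib
import os
import random
import uuid


def instance_rng(node_name=None, purpose=""):
    """Return a random.Random seeded from NODE_NAME (plus an optional purpose tag)."""
    name = node_name if node_name is not None else os.environ.get("NODE_NAME", "")
    digest = hashlib.sha256(f"{name}|{purpose}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def seeded_uuid(node_name, purpose):
    """A UUID that is stable per decky but differs across the fleet."""
    return uuid.UUID(int=instance_rng(node_name, purpose).getrandbits(128), version=4)
```

Connection-time jitter would keep using an unseeded generator, so only identity-like values replay across restarts.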
The ssh and telnet services hard-coded /var/lib/decnet/artifacts as the host
quarantine mount. Read it from DECNET_ARTIFACTS_ROOT with the same default so
dev/rootless deploys can point it elsewhere.
Paging, truncation surfacing, admin gate, path traversal, sid-regex and
decky-mismatch rejection for /transcripts; mirror coverage for
/attackers/{uuid}/transcripts. Flips the Session Recording box in the
roadmap (sessrec pty relay now shipping end-to-end).
Adds asciinema-player dependency, SessionDrawer.tsx that pages the
transcripts API (500 events per request) and rebuilds a v2 .cast blob
for playback, and a Session Transcripts section in AttackerDetail that
deep-links into the drawer. Truncation banner surfaces the 10 MB
per-session cap when it's been hit.
Adds get_attacker_transcripts (mirror of artifacts for session_recorded
logs) and get_session_log for sid→shard resolution. New
/api/v1/transcripts/{decky}/{sid}?offset=&limit= pages asciinema events
out of the shared JSONL day-shard via an mtime-keyed byte-offset index
— never scans the whole shard per request. New
/api/v1/attackers/{uuid}/transcripts lists sessions for drilldown. Both
endpoints admin-gated.
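The paging idea can be sketched as below. The record shape (a JSON object with a `sid` key) and both function names are assumptions made for illustration; the point is that the full scan happens once per shard mtime, and each page is served by seeking.

```python
import json
import os

# path -> (mtime_ns, {sid: [byte offset of each line]})
_INDEX_CACHE = {}


def _line_offsets(path, sid):
    """Byte offsets of every line for `sid`; rebuilt only when mtime changes."""
    mtime = os.stat(path).st_mtime_ns
    cached = _INDEX_CACHE.get(path)
    if cached is None or cached[0] != mtime:
        index = {}
        pos = 0
        with open(path, "rb") as fh:
            for line in fh:
                try:
                    rec = json.loads(line)
                except ValueError:
                    rec = None  # tolerate a torn trailing line
                if isinstance(rec, dict) and "sid" in rec:
                    index.setdefault(rec["sid"], []).append(pos)
                pos += len(line)
        cached = (mtime, index)
        _INDEX_CACHE[path] = cached
    return cached[1].get(sid, [])


def page_events(path, sid, offset=0, limit=500):
    """Return one page of records for `sid` by seeking to indexed offsets."""
    events = []
    with open(path, "rb") as fh:
        for start in _line_offsets(path, sid)[offset:offset + limit]:
            fh.seek(start)
            events.append(json.loads(fh.readline()))
    return events
```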
Build login-session into both images as the swapped root shell, add a
quarantine bind mount for telnet (symmetric to SSH), seed transcripts/
dir and service discriminant at entrypoint. Deployer syncs sessrec.c +
Makefile into each build context alongside the existing syslog_bridge
helper. sessrec falls back to /etc/sessrec.service when env is stripped
(busybox /bin/login).
New decnet/templates/_shared/sessrec/ — a small C program installed as the
login shell in SSH / Telnet deckies. Forkpty-relays /bin/bash, records each
chunk as an asciinema v2 event into a shared JSONL day-shard keyed by sid,
and emits one RFC 5424 session_recorded line on exit (direct to PID 1's
stdout, same pattern syslog_bridge.py uses).
Storage: one shard per (decky, UTC day) at
/var/lib/systemd/coredump/transcripts/sessions-YYYY-MM-DD.jsonl. Concurrent
appends are lock-free: each write is chunked below PIPE_BUF so O_APPEND
interleaves atomically. Per-session cap 10 MB with a trunc sentinel; disk-
free precheck (<200 MB) falls through to plain bash with a session_skipped
log event. Attacker src_ip resolves from $SSH_CONNECTION, getpeername(0),
or utmp in that order. SIGWINCH appends an 'r' resize event so ncurses
replays stay aligned.
Stealth for v1: /etc/passwd shell-swap to /usr/libexec/login-session
(plausible login-machinery path) + prctl comm disguise. Full LD_PRELOAD
argv-zap is deferred — sshd strips LD_PRELOAD from the session env, so
wiring the existing argv_zap.so into this path needs a separate wrapper.
DEBT-033 opened for size-based day-shard rotation; v1's disk-free precheck
covers the worst case but can be blinded by a one-shot disk fill.
Exposes POST /topologies/reap-orphans via an arm-to-confirm button in
the topology list header. Shows a transient status line with removal
counts or the error. Admin-only on the backend; non-admins see the 403.
Topology rows deleted without a proper teardown leave Docker containers
and bridge networks behind, holding IPAM pools that cause 403 "Pool
overlaps" on the next deploy at the same subnet.
- engine/reaper.py walks the local Docker daemon, extracts the 8-char
topology prefix from every decnet_t_* resource, and force-removes
containers + networks whose prefix is not in the repo.
- POST /api/v1/topologies/reap-orphans (admin-only) returns a report
of live/orphan prefixes and what was removed.
- Resources belonging to live topologies are never touched; per-resource
errors are captured without aborting the sweep.
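The orphan selection can be sketched as a pure function over resource names. The character class of the prefix and the name `find_orphans` are assumptions; the Docker calls that list and remove resources are left out.

```python
import re

# Assumes the 8-char prefix directly follows "decnet_t_" and is alphanumeric.
_PREFIX_RE = re.compile(r"^decnet_t_([0-9A-Za-z]{8})")


def find_orphans(resource_names, live_prefixes):
    """Group decnet_t_* resources whose topology prefix is not in the repo."""
    orphans = {}
    for name in resource_names:
        m = _PREFIX_RE.match(name)
        if m is None:
            continue  # not a topology resource, never touched
        prefix = m.group(1)
        if prefix not in live_prefixes:
            orphans.setdefault(prefix, []).append(name)
    return orphans
```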
When create_bridge_network or compose-up raised mid-deploy, the
deployer marked the topology FAILED and re-raised — but left every
network it had already created alive. The next deploy attempt tripped
over the orphans with 'Pool overlaps with other one on this address
space' (IPAM conflict).
Track networks created in the current attempt; on exception, tear down
the started compose stack (if any), remove the networks in reverse
order, and delete the compose file before marking FAILED. Rollback
errors are logged but never mask the original failure.
Covered by a new regression test that drives a docker client which
succeeds once then raises, and asserts every created network is also
removed.
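The rollback pattern, sketched with injected callables instead of a Docker client. `deploy_with_rollback` and its parameters are hypothetical; the compose teardown and the FAILED status write are omitted.

```python
def deploy_with_rollback(lans, create_network, remove_network, log=print):
    """Create one network per LAN; on failure, undo the ones already created."""
    created = []
    try:
        for lan in lans:
            create_network(lan)
            created.append(lan)  # tracked only after a successful create
    except Exception:
        for lan in reversed(created):
            try:
                remove_network(lan)
            except Exception as cleanup_err:
                # Cleanup problems are logged and never mask the original error.
                log(f"rollback failed for {lan}: {cleanup_err}")
        raise
    return created
```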
useTopologyEditor imported 'UseMazeApi' but the actual exported type
is 'MazeApi'. tsc --noEmit missed it because the file isn't in the
default tsconfig include path; tsc -b (project references, used by
'npm run build') catches it.
apply_attach_decky requires an existing decky, so the MazeNET editor
had no way to grow a live topology: creating a new decky on active
topologies 409'd on the direct-CRUD createDecky call.
- Backend: new apply_add_decky that creates the decky row + its
home-LAN edge atomically, auto-allocating an IP if none pinned.
Post-apply validation still runs. Added to DISPATCH + _MUTATION_OPS
Literal + CLI help text.
- Tests: 3 new ops tests (happy path, duplicate-name rejection,
missing-LAN rejection) plus dispatch coverage update.
- Frontend: useTopologyEditor gains addDeckyToLan() composite. Pending
routes through createDecky + attachEdge as before; active routes
through a single add_decky enqueue. MazeNET.tsx drag-archetype,
duplicate, DMZ-gateway, and ctx-menu add-decky paths all use the
composite so active topologies stop 409'ing on new-decky drops.
useTopologyEditor now branches on topoStatus: pending keeps direct CRUD,
active/degraded routes through enqueueMutation with expected_version.
Every primitive returns a tagged PrimitiveResult; callers skip local
state updates on enqueued and wait for the SSE mutation.applied refetch
to reflect DB truth.
- remove_lan/remove_decky/detach_decky: direct name-keyed enqueues.
- update_decky/update_lan: services/x/y lifted to top-level payload keys,
remainder placed under patch (matches apply_update_* contract).
- attach_decky: enqueued with decky+lan names; requires the decky to
already exist (Phase B step 3 adds the create+attach composite).
- createDecky stays direct-CRUD this pass — no add_decky op yet, so
new-decky drag will 409 on active until a follow-up commit.
- MazeNET surfaces mutation.failed payload.reason/error into actionErr
so the status bar tells the user WHY a queue op was rejected.
apply_update_decky only merged payload.patch into decky_config. Since
services is a separate DB column, there was no way to replace a decky's
services list via a mutation. Add a top-level services key to the op
payload that maps straight onto the services column.
Unblocks the MazeNET editor routing service-add/service-drop actions
through the mutation queue on active topologies.
Phase B step 1 of DEBT-030: introduce a status-aware editor hook that
wraps useMazeApi. Every primitive currently passes through to direct
CRUD and returns {kind: 'applied', data} — behavior is unchanged.
Follow-up commits route active/degraded topologies through
enqueueMutation when status != pending.
Also tighten the SSE LIVE indicator: flip setStreamLive(true) only on
snapshot, mutation.*, or status events, not on any incidental frame.
The mutation-event stream that landed this session closes the "deckies are
atomic nodes" gap for service-list changes, but substrate identity is
really ``(service, implementation_fingerprint)``. A base-image
rebuild that rotates OpenSSH 8.4 → 9.2 without changing the service
list is invisible to the correlation graph today because the prober's
dedup set is in-memory and per-run — no cross-run diff, no
"fingerprint changed" event.
DEBT-032 documents the required piece: a per-(decky, service,
probe_type) persistence layer + diff-on-change emission, with the
correlator's existing mutation-marker interleaving pattern as the
model for fingerprint markers. A mutation-vs-fingerprint divergence
detector then falls out of the data model for free — fingerprint drift
without a preceding mutation ⇒ substrate_divergence finding.
Parser now tags ``mutator`` / ``decky_mutated`` lines with ``kind="mutation"``
so the engine can route them into a sibling ``_mutations`` index keyed
by decky name instead of the per-IP attacker index. ``traversals()``
joins the two streams: every attacker gets a ``mutations_during`` list
of markers from touched deckies bounded by their first/last-seen
window. ``AttackerTraversal.to_dict()`` grows a ``mutations_during``
field and a ``timeline`` that chronologically interleaves hops and
markers, so an ``SSH at T5 → mutation at T6 → HTTP at T7`` substrate
transition is visible to UI consumers instead of reading as a silent
discontinuity.
The existing hops-only JSON shape is preserved; old clients that
ignore unknown keys keep working.
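The join can be sketched as follows, assuming hops and markers are plain dicts with `decky` and `ts` keys. That shape and the name `build_timeline` are illustrative, not the engine's actual types.

```python
def build_timeline(hops, mutations):
    """Attach mutation markers to one attacker's hops and interleave by time."""
    if not hops:
        return {"mutations_during": [], "timeline": []}
    first_seen = min(h["ts"] for h in hops)
    last_seen = max(h["ts"] for h in hops)
    touched = {h["decky"] for h in hops}
    # Keep only markers on deckies this attacker touched, inside their window.
    during = [
        m for m in mutations
        if m["decky"] in touched and first_seen <= m["ts"] <= last_seen
    ]
    timeline = sorted(
        [{"kind": "hop", **h} for h in hops]
        + [{"kind": "mutation", **m} for m in during],
        key=lambda entry: entry["ts"],
    )
    return {"mutations_during": during, "timeline": timeline}
```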
Close the lifecycle loop for the correlation graph: every decky now
enters the substrate with an explicit `trigger=creation` event
(old_services=[] ⇒ new_services=<initial>) and leaves it with
`trigger=retirement` (old=<current> ⇒ new=[]). With scheduled/operator
mutations already flowing through emit_decky_mutated, the entire decky
lifecycle is now a well-formed sequence of mutation events — the
correlator can fold substrate_state(t) at any T by replaying them.
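A sketch of that fold, assuming each event is a dict with `decky`, `ts` and `new_services` keys (a hypothetical shape; the correlator's own representation may differ):

```python
def substrate_state(events, at):
    """Replay mutation events up to time `at`; return {decky: services}."""
    state = {}
    for ev in sorted(events, key=lambda e: e["ts"]):
        if ev["ts"] > at:
            break
        if ev["new_services"]:
            state[ev["decky"]] = list(ev["new_services"])
        else:
            # Retirement emits new_services=[], so the decky leaves the substrate.
            state.pop(ev["decky"], None)
    return state
```

The empty-set convention on creation and retirement is what lets this loop run without special cases for the first and last event.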
Lazy-imports mutator.events to dodge the engine↔mutator circular
dependency. Bus is None at CLI sites; the syslog write is what the
correlator consumes. Emission is soft-failing so a broken log path
never aborts a deploy.
Mutator now emits one decky_mutated event (RFC 5424 + bus) per
successful mutation instead of the inline decky.<id>.state bus
publish. The previous state topic published new_services only;
mutation events carry old/new/trigger, which is what the correlation
engine needs to interleave substrate-change markers into attacker
traversals.
- mutate_decky gains trigger: MutationTrigger = "operator" and
captures old_services before the shuffle; replaces the inline
_publish_safely(decky.<id>.state) with emit_decky_mutated(...).
- mutate_all derives trigger internally: operator when force or
only-filter is set (CLI --all, API mutate-now, UI bus request);
scheduled on interval ticks. Passed through to each mutate_decky
call.
- Tests updated: the old decky.<id>.state assertion is replaced
with decky.<id>.mutation topic + mutation payload shape; 3 new
tests cover trigger derivation for scheduled / force / only paths.
26 tests in test_mutator.py green; 116 across mutator + topology
+ bus.
First step toward making mutation events first-class nodes in the
correlation graph. Today the graph silently reflects post-mutation
state with no marker of the transition; this helper lands the
emitter the mutator and deploy paths will call.
- decnet/mutator/events.py: emit_decky_mutated(bus, *, decky,
old_services, new_services, trigger, actor=None, log_path=None)
writes an RFC 5424 line (service=mutator, hostname=<decky>,
MSGID=decky_mutated, SD params for old/new services + trigger +
optional actor) to DECNET_INGEST_LOG_FILE, then fire-and-forget
publishes on decky.<id>.mutation. Either side failing is soft —
the other path still completes.
- MutationTrigger Literal covers creation, retirement, scheduled,
operator, behavioral, healer, federation. Reserved values for v2/v3
(behavioral + federation) stay nullable so the schema is stable.
- decnet/bus/topics.py: DECKY_MUTATION constant + decky_mutation(id)
builder. Distinct from DECKY_STATE ("current shape") because a
mutation is a transition event, not a steady-state snapshot.
- Empty-set symmetry: creation emits old_services=[], retirement
emits new_services=[]. Every decky lifecycle becomes a well-formed
fold sequence on the correlator side.
- 4 new tests: FakeBus + correlator parser round-trip; creation and
retirement empty-set cases; bus=None still writes syslog;
unwritable log path doesn't block bus publish. 95 tests green
across test_mutator + tests/bus.
The flat-fleet mutator was DB-poll-only and noisy — it logged
"no active deployment found" every 10s on idle hosts and ran
mutate_all at a fixed tick regardless of when the next decky
was due.
- mutate_all returns seconds-until-next-due; watch loop sleeps
min(next_due, poll_interval_secs) with a 1s floor.
- "No deployment" is now idle, not an error: edge-triggered log
on present<->absent transition instead of every tick.
- mutate_decky publishes decky.<name>.state on successful compose
so UIs react in real time.
- New decky.*.mutate_request subscription lets API/CLI/UI force
an immediate mutation of a specific decky without waiting for
its interval; target name feeds mutate_all(only={...}).
- system.mutator.health heartbeat via run_health_heartbeat helper,
bringing the mutator in line with DEBT-031 workers.
Tests: next_due return, only= filter, decky.<name>.state publish
on success, no publish on compose failure. Full mutator+topology-
mutator+bus suite (109) green.
All nine service workers now participate in the host-local bus: sniffer,
prober, correlator (via profiler), profiler, collector, ingester, agent,
forwarder, updater. Pre-bus behavior is preserved end-to-end for
DECNET_BUS_ENABLED=false and get_bus() failures.
Three items intentionally deferred: realism-probe decky.{id}.state
(needs a realism probe path that doesn't exist yet), correlator session
boundaries (needs session state), and bus-wake subscriptions (publishes
landed; no subscriber is wired to the wake side yet).
All three workers now share a run_health_heartbeat helper in
decnet.bus.publish. Each publishes system.<worker>.health on a 30s tick
with {worker, ts} plus optional per-worker extras. Subscribers can
watch system.*.health to see every DECNET worker on a host at once.
- agent: heartbeat runs inside the FastAPI lifespan alongside the
existing master-facing heartbeat; bus-disabled path is a no-op.
- forwarder: heartbeat task spawned at run_forwarder entry, cancelled
in the finally block so a crashed master loop never leaks the task.
- updater: new FastAPI lifespan hosts the heartbeat.
Heartbeat helper swallows extra() failures and is cancellation-safe so
lifespan teardown never hangs on it.
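A heartbeat loop with the two properties named above (extras failures swallowed, cancellation-safe) might look roughly like the following. It is a sketch under assumptions: the function name, the publish(topic, payload) callable, and the parameter names are hypothetical and do not reproduce the real run_health_heartbeat signature.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict, Optional


async def heartbeat_sketch(
    publish: Callable[[str, Dict[str, Any]], Awaitable[None]],
    worker: str,
    interval_secs: float = 30.0,
    extra: Optional[Callable[[], Dict[str, Any]]] = None,
) -> None:
    """Publish system.<worker>.health forever, until the task is cancelled."""
    topic = f"system.{worker}.health"
    while True:
        payload: Dict[str, Any] = {"worker": worker, "ts": time.time()}
        if extra is not None:
            try:
                payload.update(extra())
            except Exception:
                # A broken extras callback must not stop the heartbeat.
                pass
        try:
            await publish(topic, payload)
        except Exception:
            # Publish failures are swallowed; the next tick tries again.
            pass
        # asyncio.CancelledError derives from BaseException, so none of the
        # handlers above catch it and task.cancel() ends the loop promptly.
        await asyncio.sleep(interval_secs)
```

Catching Exception rather than BaseException is the design choice that keeps teardown from hanging: cancellation is never absorbed by the error handling.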
Ingester connects the bus at startup, emits a batch-committed summary
(component/flushed/position) after each successful _flush_batch. Zero-
row flushes are suppressed so the topic stays meaningful.
Complements the collector's per-line system.log publishes: collector
signals ingress, ingester signals DB-persisted progress. Federation
forwarder (worker 8) will subscribe to the batch-committed leaf to
trigger its upstream push.
Bus stays optional: publish_safely swallows failures, get_bus() can
return None, DECNET_BUS_ENABLED=false leaves the ingestion loop fully
functional.
log_collector_worker connects the bus at startup, builds a thread-safe
system.log publisher, and hands it to each container-stream thread
through _stream_container's new publish_fn parameter. Publishing fires
right after the JSON record is written — same rate-limiter path, no
extra parsing, compact payload (decky/service/event_type/attacker_ip/
timestamp) so subscribers can redraw without re-reading the DB.
Bus stays optional: if get_bus() fails or DECNET_BUS_ENABLED=false the
factory returns a no-op publisher and the stream thread calls it
unconditionally. Hook failures are logged and never abort the thread.
The profiler worker threads its bus publisher through _WorkerState so
_update_profiles can emit a compact attacker.scored event for every
upsert. Payload carries the headline counts (event/service/decky/
bounty/credential) plus is_traversal, so the MazeNET attacker pool can
redraw without a round-trip.
Bus stays optional: publish_attacker=None when DECNET_BUS_ENABLED=false
or get_bus() fails, and hook exceptions are logged without breaking the
upsert path.
CorrelationEngine gains an optional publish_fn hook fired once per unique
attacker IP. The profiler worker (sole caller of the engine today)
owns the bus connection, builds a thread-safe publisher, and wraps it
with the attacker.observed topic before handing it in.
Bus stays optional: if get_bus() fails or DECNET_BUS_ENABLED=false, the
engine runs publish_fn=None and the worker degrades to DB-only. Hook
failures log a warning and never break ingestion.
Each successful JARM / HASSH / TCPfp probe fans out an
attacker.fingerprinted event; the probe family goes in event.type so a
single subscription covers all three. Payload carries the attacker IP,
port, and probe-specific hash — enough for the MazeNET live map to
render fingerprint info on observed attackers.
Lifts the thread-safe publisher helper out of the sniffer worker into
decnet/bus/publish.py so the prober (and every future worker with a
to_thread hot path) can reuse it without copy-pasting the
run_coroutine_threadsafe dance. Sniffer rewires onto the shared helper
in passing.
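For readers unfamiliar with the run_coroutine_threadsafe pattern mentioned above, here is a minimal sketch of a thread-safe publisher factory. The names make_threadsafe_publisher and publish are hypothetical; the real helper in decnet/bus/publish.py may be shaped differently.

```python
import asyncio
import threading
from typing import Any, Awaitable, Callable, Dict


def make_threadsafe_publisher(
    loop: asyncio.AbstractEventLoop,
    publish: Callable[[str, Dict[str, Any]], Awaitable[None]],
) -> Callable[[str, Dict[str, Any]], None]:
    """Wrap an async publish so a plain worker thread can call it."""

    def publish_from_thread(topic: str, payload: Dict[str, Any]) -> None:
        try:
            # Schedule the coroutine on the loop's own thread; do not block.
            future = asyncio.run_coroutine_threadsafe(publish(topic, payload), loop)
        except Exception:
            # E.g. the loop is already closed: drop the event, never raise.
            return
        # Retrieve any exception so failures stay inside the callback.
        future.add_done_callback(lambda f: f.cancelled() or f.exception())

    return publish_from_thread
```

The returned callable is fire-and-forget: the calling thread never waits on the result, which matches the at-most-once contract of the bus.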
Adds ATTACKER_FINGERPRINTED as a new leaf — distinct from
ATTACKER_OBSERVED (correlator's first-sight signal) because an active
probe result is additional evidence about an already-observed attacker.
Note: the plan's decky.{id}.state realism-probe publish path is
deferred — the current prober fingerprints attackers, not decky
realism. Will revisit when realism probes exist.
SnifferEngine gains an optional publish_fn hook, invoked after the
dedup + syslog write for traffic-summary events only (tls_session,
tcp_flow_timing, tcp_syn_fingerprint) — intermediate parser artifacts
like tls_client_hello stay off the bus.
The sniffer worker wires get_bus() + a thread-safe shim that marshals
sync calls from the scapy sniff thread back onto the asyncio loop via
run_coroutine_threadsafe. Bus failure at startup degrades cleanly to
publish-off mode; publish failures at runtime never escape the sniff
thread.
Shared publish_safely helper at decnet/bus/publish.py so the nine
workers about to be wired into the bus don't each copy-paste the
"never raise back at the caller" contract. Mutator drops its private
copy and imports the canonical one.
topics.py gains the attacker.* hierarchy (observed, scored,
session.started, session.ended) and a system_health(worker) builder
for per-worker health heartbeats — both prerequisites for the worker
rollout under DEBT-031.
Per-worker integration of the service bus shipped in DEBT-029. Publishes
are fire-and-forget; subscribes wake polling loops. Bus stays optional —
if get_bus() fails or DECNET_BUS_ENABLED=false, workers log once and
continue in poll-only mode (mirrors decnet/mutator/engine.py:run_watch_loop).
- scripts/bus/smoke-mutator.sh: boots decnet bus, subscribes to
topology.>, publishes one event per mutation-lifecycle state plus
a topology.status transition, asserts all four land on the
subscriber. Cheap E2E for the topic hierarchy the mutator + SSE
route rely on.
- development/DEBT.md: mark DEBT-030 ✅ resolved (Phase A) with a
summary of what shipped; flag the optimistic staged-buffer editor
as Phase B follow-up, not debt.
- tests/topology/test_mutator.py: reconcile_topologies publishes
applying+applied on success, applying+failed+status on failure; and
stays safe when bus=None. _wake_on_enqueue sets its asyncio.Event
on every matching enqueue event.
- tests/api/topology/test_mutations.py: POST /mutations publishes
mutation.enqueued after a successful DB write, via a FakeBus
injected in place of the app-wide bus singleton.
- tests/api/topology/test_events_stream.py: SSE route returns 401
unauthenticated, 404 for unknown topologies, and (driving the
async generator directly) emits a snapshot on connect plus
forwards a published mutation.applied as an `event: mutation.applied`
SSE frame.
Wire the MazeNET editor to the new /topologies/{id}/events SSE route
so live (active|degraded) topologies reflect mutator state transitions
without reload:
- useTopologyStream hook opens an EventSource against
/topologies/{id}/events?token=<jwt>, with 3s reconnect matching the
dashboard's /stream consumer. Callback refs avoid tearing down the
connection on consumer rerenders.
- useMazeApi gains enqueueMutation(topologyId, op, payload,
expectedVersion?) — thin wrapper over POST /mutations.
- MazeNET.tsx opens the stream only when topoStatus is active|degraded
(pending editors have nothing to stream) and refetches on
mutation.applied|failed|status events. Header shows a LIVE /
CONNECTING… indicator.
Phase A slice — Apply (N changes) with an optimistic staged buffer
lands in a follow-up; the hooks + API method it'll need are already
here.
Wire the mutator and web API into the service bus so live-topology
edits flow sub-second from enqueue to UI:
- Mutator publishes every state transition on the bus (mutation.applying
/applied/failed + topology.status). Fire-and-forget; DB stays source
of truth.
- Mutator watch loop subscribes to topology.*.mutation.enqueued and
wakes early via asyncio.Event — the 10s poll becomes a fallback
heartbeat, not the primary dispatch trigger.
- POST /topologies/{id}/mutations publishes mutation.enqueued after
the DB write succeeds.
- New GET /topologies/{id}/events SSE route: snapshot on connect
(status + in-flight mutations), live forwards topology.{id}.>
bus events, 15s keepalive. ?token= auth mirrors /stream.
- New decnet/bus/app.py — process-wide lazy bus singleton for the
API, closed cleanly on lifespan shutdown.
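The subscriptions above rely on NATS-style wildcards, where `*` stands for exactly one token and a trailing `>` stands for one or more tokens. A minimal matcher following those conventions is sketched below; the function name is hypothetical and DECNET's own matcher may differ in edge cases.

```python
def topic_matches(pattern: str, topic: str) -> bool:
    """Return True when a dot-separated topic matches a wildcard pattern."""
    p_tokens = pattern.split(".")
    t_tokens = topic.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' is only valid as the last token and needs at least one
            # remaining topic token to consume.
            return i == len(p_tokens) - 1 and len(t_tokens) > i
        if i >= len(t_tokens):
            return False
        if p != "*" and p != t_tokens[i]:
            return False
    # Without a trailing '>', the token counts must agree exactly.
    return len(p_tokens) == len(t_tokens)
```

Under these rules topology.*.mutation.enqueued matches the enqueue event of any single topology, while topology.abc.> matches every event below topology.abc but not topology.abc itself.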
start.sh boots a local bus on /tmp (no root, no decnet group).
sub.py / pub.py are thin CLIs over UnixSocketBus for manual poking.
smoke.sh is a self-contained end-to-end check — spawns a worker,
subscribes, publishes, asserts delivery, cleans up.
Land the `decnet bus` worker and `get_bus()` factory. Transport is a
host-local UNIX-domain socket (0660, group=decnet); authz is the file
mode. Wire framing is a tiny verb-line + 4-byte-BE length + orjson body.
NATS-style wildcard topics (`*`, `>`). At-most-once, fire-and-forget —
DB stays the source of truth. `FakeBus` / `NullBus` for tests and the
disabled path. Cross-host federation is deferred to a future
`--bridge-tcp` mode; DEBT-030 is master-only and unblocked.
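The framing described above (verb line, 4-byte big-endian length, JSON body) can be illustrated with a short encoder/decoder pair. This is a sketch only: the exact verb-line layout ("VERB topic" terminated by a newline) is an assumption, and the standard-library json module stands in for orjson so the example has no third-party dependency.

```python
import json
import struct
from typing import Any, Tuple


def encode_frame(verb: str, topic: str, payload: Any) -> bytes:
    """Build one frame: verb line, 4-byte big-endian length, JSON body."""
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    verb_line = f"{verb} {topic}\n".encode("utf-8")  # hypothetical layout
    return verb_line + struct.pack(">I", len(body)) + body


def decode_frame(data: bytes) -> Tuple[str, str, Any, bytes]:
    """Parse one frame; return (verb, topic, payload, leftover bytes)."""
    line, sep, rest = data.partition(b"\n")
    if not sep or len(rest) < 4:
        raise ValueError("incomplete frame")
    verb, _, topic = line.decode("utf-8").partition(" ")
    (length,) = struct.unpack(">I", rest[:4])
    body = rest[4:4 + length]
    if len(body) < length:
        raise ValueError("incomplete frame")
    return verb, topic, json.loads(body), rest[4 + length:]
```

The explicit length prefix lets a reader pull exactly one body off a stream socket without scanning for a delimiter inside the JSON.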
Port the design-handoff layout into a scoped DeckyFleet.css (no more
piggybacking on Dashboard.css). Add an archetype-first creation wizard
that consumes /api/v1/topologies/archetypes, falling back to the
MazeNET ARCHETYPES constant when the endpoint is unavailable.
Canvas grew a deployed prop so nodes can visually distinguish "live in
docker" from "planned". ContextMenu learned nested submenus with
ChevronRight affordance; NetBox renders a ShieldAlert for DMZ LANs;
Palette got additional lucide icons. Dead PendingChange union pulled
out of types.ts — Phase-3 mutation ops are driven by the API layer now,
not a frontend type.
New /topologies page lists topologies; a bare /mazenet now redirects
there since the editor has no meaning without ?topology=<id>. Wizard
picks up a note style + tweaked copy.
test_compose asserts the new decnet.topology.* labels land on both base
deckies (role=base, no service marker) and service fragments
(service=true). The stub docker client in test_deploy grew a filters
kwarg so it keeps matching the real .networks.list(filters=...) call
signature now used by the deployer.
/api/v1/topologies/archetypes returns the archetype registry (slug,
display name, description, preferred services/distros, nmap_os
fingerprint) so the frontend wizard can render a live catalog instead
of hardcoding a copy.
The web bundle proxy handled GET/POST/PUT/DELETE but not PATCH or
preflight OPTIONS, which broke browser calls to PATCH endpoints behind
the static-bundle server. CORS middleware had the same gap.
db reset drops-and-recreates a fixed table set in FK order. Topology
tables weren't in the list, so reset left orphan topology rows behind
and a fresh MazeNET deploy could collide with stale child records.
topology delete cascades children (LANs, deckies, edges, mutations) but
refuses while containers are still running; teardown is a prerequisite.
show stopped assuming every decky carried a full decky_config blob;
MazeNET-generated deckies only get hydrated on deploy, so fall back to
top-level name/services when the config isn't there.
Legacy fleet deckies live in decnet-state.json; MazeNET topology
containers don't. Tag them at compose-time with
decnet.topology.service=true and let the collector match on that label.
Spin up the agent's log collector on the first successful /topology/apply
(not in the lifespan — that would break the no-docker-on-boot invariant)
and tear it down with the app. Land log lines in DECNET_AGENT_LOG_FILE,
separate from master-side DECNET_INGEST_LOG_FILE, so a dev box running
both roles can't forward its own ingest back to itself.
When master pushes a topology that differs from whatever is pinned
locally, tear down the predecessor and accept the new one. Refusing with
409 left the agent stranded after partial deploys. record_error now
persists the hydrated blob so a later teardown can still walk the LAN
list — otherwise a half-failed apply strands containers + bridges with
no breadcrumb back to them.
Replaces the single-line name input with a modal that mirrors the
design-handoff DeployWizard shape (backdrop + violet-bordered panel,
wizard-step tabs, card-picker body):
- Step 1 — TARGET: a RUN LOCALLY card plus one card per enrolled
swarm host. Non-routable hosts render disabled with their status as
the tooltip. Selecting an agent pins the topology via
target_host_uuid; local stays unihost.
- Step 2 — TYPE: BLANK (POST /topologies/blank) or SEED-BASED
(POST /topologies/ with depth, branching, deckies-per-LAN, optional
seed). Name is required on both.
Existing navigate-to-editor-on-create behavior is preserved.
Two small observability follow-ups to the phase-1 agent/topology wiring:
TopologySummary now carries needs_resync so operators can see the
heartbeat's resync flag via the topology list/detail API without
dropping into the DB.
TopologyStore.record_error becomes an upsert — when a docker/compose
failure fires during the first materialise (put() never reached), we
still land a marker row so GET /topology/state surfaces the error and
the next heartbeat carries an empty applied_version_hash. That empty
hash is what master's heartbeat check relies on to flag the topology
for resync instead of assuming the apply succeeded.
Four regression tests guarding Step 8 of the agent/topology wiring:
- Lifespan startup must not call docker.from_env even with a populated
topology.db — replace docker with a boom-stub and assert zero calls.
- GET /topology/state returns the cached row verbatim without
re-materialising bridges/containers; live observation is read-only.
- Static guard: TopologyStore must not grow a restore/replay/reapply
method without someone re-reading the module docstring.
- Raw sqlite read + a second TopologyStore instance confirm the store
is passive — nothing scrubs stale rows on open, which is the
behaviour master's resync flow depends on.
Agent heartbeats now carry an applied-topology snapshot. The master
heartbeat handler compares the reported version_hash against what
canonical_hash yields for the hydrated topology pinned to that host
and flags Topology.needs_resync on divergence (or when the agent
reports no topology at all while master expects one).
The mutator watch loop gains reconcile_agent_resyncs, which re-pushes
the current hydrated blob via AgentClient.apply_topology without
touching status, then clears the flag on success. Push failures leave
the flag set so the next tick retries.
deploy_topology and teardown_topology now branch on
target_host_uuid. When set:
- Hydrate the topology locally (validator runs exactly as before).
- Compute canonical_hash; push {hydrated, version_hash} to the
pinned agent through AgentClient.apply_topology.
- Status machine still moves PENDING -> DEPLOYING -> ACTIVE on 2xx,
PENDING -> DEPLOYING -> FAILED on error; master remains the sole
owner of the row.
Teardown flips to TEARING_DOWN, fires /topology/teardown, then
TORN_DOWN — we log a warning on agent error but still settle to
TORN_DOWN so operators can delete the row (agent garbage is cleaned
on the next re-enroll).
Unihost deploys are unchanged — the field defaults to NULL so every
existing flow takes the local path.
Step 6 of the agent <-> topology integration.
Three new RPCs mirroring the existing deploy/teardown/status pattern:
- apply_topology(hydrated, version_hash) — long-timeout (600s) for
image pulls + compose up.
- teardown_topology(topology_id) — 300s timeout; enough for a
stubborn compose-down without hanging a heartbeat.
- get_topology_state() — short control-plane read for reconcile.
The per-call timeout swap uses the same trick as .deploy().
Step 5 of the agent <-> topology integration.
New mTLS-protected routes on the agent:
- POST /topology/apply — master pushes {hydrated, version_hash}.
Validates the hash matches locally (serialisation drift guard),
runs the topology through the same validator/composer pipeline
used master-side, then creates bridges + compose up + records the
apply in topology.db.
- POST /topology/teardown — dismantles compose, removes bridges,
clears topology.db. Idempotent.
- GET /topology/state — returns applied row + live docker
observation for the heartbeat.
Implementation lives in decnet/agent/topology_ops.py; it reuses the
private compose helpers from decnet.engine.deployer so we don't
duplicate compose/project-name plumbing. The apply path is sync
under the hood (docker SDK + subprocess); we hop to a thread so the
event loop keeps servicing other agent traffic.
v1 is one-topology-per-agent; cross-topology apply returns 409.
Step 4 of the agent <-> topology integration.
Single-row sqlite tracking which topology the agent last applied and
its version hash. Sync/stdlib, same pattern as the log-forwarder
offset store. v1 is one-topology-per-agent; attempting to apply a
different topology over a populated row raises AlreadyApplied so the
endpoint can return 409. observed() snapshots live docker state
(decnet-topology-* bridges + decnet-* containers) for the heartbeat.
The store is a cache, not authority — no auto-restore on boot.
Master remains the only source of truth.
Step 3 of the agent <-> topology integration.
Tiny pure helper both master and agent will use to answer "is the
applied state the one we expect?". SHA-256 of canonical JSON with
volatile keys (timestamps, status, version, canvas x/y/w/h) stripped
so the hash only captures deployment-relevant state.
Step 2 of the agent <-> topology integration.
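The canonical-hash helper described above can be approximated in a few lines. The sketch below is not the shipped canonical_hash: the function names and the exact set of volatile key names are assumptions chosen to mirror the categories listed (timestamps, status, version, canvas x/y/w/h).

```python
import hashlib
import json
from typing import Any

# Hypothetical key names standing in for the volatile fields.
VOLATILE_KEYS = frozenset(
    {"created_at", "updated_at", "status", "version", "x", "y", "w", "h"}
)


def strip_volatile(value: Any) -> Any:
    """Recursively drop volatile keys from nested dicts and lists."""
    if isinstance(value, dict):
        return {k: strip_volatile(v) for k, v in value.items() if k not in VOLATILE_KEYS}
    if isinstance(value, list):
        return [strip_volatile(v) for v in value]
    return value


def canonical_hash_sketch(hydrated: Any) -> str:
    """SHA-256 over key-sorted, whitespace-free JSON of the stable fields."""
    canonical = json.dumps(strip_volatile(hydrated), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting keys and fixing the separators is what makes the digest independent of dict ordering and serializer whitespace, so master and agent agree as long as the deployment-relevant fields agree.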
Adds the `target_host_uuid` FK on `Topology` plus wiring through the
two create endpoints (`POST /topologies`, `POST /topologies/blank`).
Validates the mode/host pair: `mode='agent'` now requires a known,
routable host; `mode='unihost'` must leave the field unset.
Surfaced on `TopologySummary` so list/detail responses expose it.
Purely additive at the schema level — existing unihost flows unchanged
(field defaults to `NULL`).
Step 1 of the agent <-> topology integration.
Dragging a LAN or decky, or resizing a NetBox, updates React state
but previously vanished on reload because the grid-layout adapter
rewrote everything from the graph. Add a per-topology localStorage
snapshot (key: mazenet.layout.<topologyId>) that captures net
x/y/w/h and decky x/y; useLayoutPersistor writes it debounced, and
getTopology merges it over adaptTopology's grid so entities without
a stored entry still fall back to a clean auto-layout. Deleting a
topology calls clearLayout to drop its snapshot.
Dropping more than one LAN near the same spot stacked the NetBox
rectangles on top of each other, and multiple deckies in a LAN
landed on identical per-LAN coordinates. Since canvas position
persistence is deferred (localStorage pass), the stored x/y are
not load-bearing — compute layout from the topology graph instead.
adaptTopology now lays LANs out in a 3-col grid with the DMZ first
and stacks deckies 2-wide inside their home LAN. New LAN palette
drops append to the same grid, ignoring the raw drop point.
Active/degraded/failed/deploying topologies cannot be deleted
without first transitioning to torn_down, but the UI had no way
to trigger that. Add POST /topologies/{id}/teardown mirroring the
deploy endpoint (background task, 202 Accepted), and a
click-to-arm TEARDOWN button on the topology list card that shows
whenever the row is in a teardown-eligible state.
MazeNET publishes gateway ports on the host via Docker. With the
default userland-proxy enabled, attacker connections appear to
originate from the bridge gateway instead of the real remote IP.
Log a soft warning at deploy time when the topology publishes any
ports and docker info reports UserlandProxy=true, pointing the
operator at the daemon.json toggle. Best-effort: failures talking to
the daemon silently no-op.
Rebuild the inspector panel to match the handoff mock: crosshair-titled
header with dim type label and close X, status-dot + archetype-chip
head rows, connection list with directional arrows, member list with
click-to-select, and a pending-diff block at the foot. Carry the
gateway/observed disable titles over from the ctx menu so the 'remove'
action stays honest.
Also prefix the subtitle with 'NETWORK OF NETWORKS' so the purpose of
this editor reads at a glance.
A prior half-torn-down topology can leave a bridge network alive under
a different name that still owns our intended subnet. Docker then
rejects our create with 'Pool overlaps with other one on this address
space', and the topology deploy fails.
Extend create_bridge_network to sweep any unused bridge whose IPAM
subnet matches the one we're about to claim (skipping networks with
running containers — those are live use).
UI-created deckies (api_decky_crud, api_create_blank_topology) write
decky_config as sent by the client — typically just archetype flags,
without the name/ips_by_lan fields compose.py requires. The generator
path populates them at persist() time, so compose worked for generated
topologies but KeyError'd on UI-created ones.
Normalise in hydrate() so every write path feeds the same shape
downstream: mirror decky.name into decky_config.name, and allocate
per-LAN IPs deterministically (reserving the primary decky.ip where it
falls in-subnet, then filling remaining edges with next_free).
Gateway detection in the editor previously matched
archetype === 'host-gateway' (a fictional archetype that never
existed in decnet/archetypes.py). Switch to
decky_config.forwards_l3 — the real runtime marker the composer
already reads — so deletion guards, drag-pinning, context menu
locking, and NodeCard DMZ-gateway styling all line up with what
actually ships at deploy time.
On DMZ palette drop, create the gateway with archetype=deaddeck,
services=['ssh'], forwards_l3=true, and mark the edge
is_bridge=true, forwards_l3=true. attachEdge now accepts those
flags so callers can seed a real bridge attachment.
Add check_no_host_port_collision: enumerate the ports the topology's
gateways will publish (forwards_l3=True × svc.ports), probe live
listeners via psutil, emit a 'warning'-severity PORT_COLLISION
issue per overlap. Live-only — invoked from deploy_topology just
after dry-run branching, so unit tests that exercise validate()
stay hermetic.
Warning rather than error because docker-compose up will hard-fail
on a real collision anyway; this just gives operators a cleaner log
line ahead of the compose failure.
When a non-DMZ LAN is created via POST /lans, look up the topology's
gateway (decky with forwards_l3=True attached to the DMZ) and insert
an edge binding it to the new LAN. The gateway becomes multi-homed
to every internal LAN automatically, so DMZ_ORPHAN cannot arise
from ordinary editor use.
Also fixes delete_lan: the home-decky guard used scalar_one_or_none,
which blew up when the gateway already had >1 'other' LAN edge.
Switch to scalars().first() — we only need to know *some* other
edge exists, not a unique one.
Gateway deckies (forwards_l3=True) are the DMZ's ingress. Their
service containers share the base namespace via network_mode:service,
so any listener inside the gateway is reachable through the base
container's published ports. Emit 'ports: [<p>:<p>, ...]' on the
gateway base from svc.ports across the decky's service list.
This is the principled replacement for the broken network_mode: host
stub — with docker-proxy publishing, the DMZ works on any single-NIC
VPS (no MACVLAN, no promiscuous mode required).
POST /topologies/blank seeded the gateway decky with
archetype=host-gateway + network_mode=host, but neither was wired:
no compose writer reads network_mode and host-gateway is not a real
archetype. Replace with archetype=deaddeck + forwards_l3=true so the
gateway is a normal multi-homed bridge decky, consistent with how
compose.py interprets forwards_l3 (sysctl + NET_ADMIN).
Edge marked is_bridge=true, forwards_l3=true so downstream readers
(generator, compose, validator) see a real bridge attachment.
JA3/JA3S fingerprints are defined by their specs as MD5 digests of
the ClientHello/ServerHello feature tuples — they are identifiers,
not security primitives. Pass usedforsecurity=False at the two call
sites so bandit stops flagging them as B324 High when scanning
outside the templates/ exclude.
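As an illustration of the change above, the snippet below hashes a JA3-style feature string with the usedforsecurity flag. The helper name and the sample string are made up for the example; the keyword argument itself requires Python 3.9 or newer.

```python
import hashlib


def ja3_digest(ja3_string: str) -> str:
    """MD5 of a JA3 feature string, flagged as a non-security use."""
    # usedforsecurity=False marks the digest as an identifier, not a
    # security primitive; the resulting hex digest is unchanged by the flag.
    return hashlib.md5(ja3_string.encode("ascii"), usedforsecurity=False).hexdigest()


# Sample tuple in the JA3 field order: version, ciphers, extensions,
# elliptic curves, point formats.
sample = "771,4865-4866,0-23,29-23,0"
fingerprint = ja3_digest(sample)
```

The flag also keeps the call working on FIPS-restricted builds where plain MD5 construction is blocked.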
DECNET's app-level RequestValidationError handler remaps structural
422→400, including query/path constraint violations (limit bounds,
the next-subnet base pattern, etc.). Schemathesis fuzzing will drive
those code paths and fail response_schema_conformance unless 400 is
declared in responses={}. Adds the entry to every phase-3 read route.
GET /api/v1/topologies — paginated list with status filter. Extends
repo.list_topologies() to accept limit/offset and adds count_topologies()
for the total envelope field.
GET /api/v1/topologies/{id} — hydrated TopologyDetail; 404 if missing.
GET /api/v1/topologies/{id}/status-events — audit trail, limit-capped.
Catalog helpers for the phase-4 canvas UI:
* GET /topologies/services — full service catalog.
* GET /topologies/next-subnet?base=172.20 — wraps SubnetAllocator against
reserved_subnets across non-torn-down topologies.
* GET /topologies/{id}/lans/{lan_id}/next-ip — IPAllocator pre-seeded
with existing decky IPs in that LAN.
All read routes are viewer-or-admin. Sub-routers are included in an
order that keeps literal catalog paths (/services, /next-subnet) from
being shadowed by the /{topology_id} trie branch.
Add Pydantic DTOs in decnet/web/db/models.py covering every phase-3
endpoint shape: TopologyGenerateRequest, TopologySummary/Detail, child
create/update requests, MutationEnqueueRequest (Literal op guard),
MutationRow with JSON-payload decoder, validation/version/not-editable
error envelopes, and the three catalog responses.
Create decnet/web/router/topology/ as an import-safe package exporting
topology_router (prefix /topologies) — sub-routers land step-by-step in
subsequent commits. Mount under the main api router alongside swarm_mgmt.
tests/api/topology/test_models.py pins repo-dict ↔ DTO parity so future
repo-row drift breaks the contract test before the endpoints.
Adds the live-mutation pipeline for active/degraded topologies:
* TopologyMutation table with composite index (state, topology_id)
so the watch-loop guard query stays O(log n).
* claim_next_mutation is a single atomic UPDATE ... WHERE
state='pending' so racing reconcilers deterministically pick one
winner; losers see rowcount=0 and skip.
* reconcile_topologies drains pending rows per live topology, applies
via decnet.mutator.ops.dispatch, and on failure marks the mutation
failed + transitions topology to degraded.
* run_watch_loop gains a gated branch: flat-fleet mutate_all runs
every tick unchanged; the reconciler only enters when the cheap
has_pending_topology_mutation guard returns True.
* apply_* ops re-check hard invariants (names, IP collisions, subnet
overlap, known services, service_config shape) after every mutation
so the repo never lands in an invalid state.
* CLI: 'decnet topology mutate' / 'mutations' subcommands.
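The claim step in the list above hinges on a conditional UPDATE whose rowcount tells each reconciler whether it won. A minimal sqlite3 sketch of that idea follows; the table name, column names, and the 'applying' state value are hypothetical, and the real claim_next_mutation picks the next pending row itself rather than taking an id.

```python
import sqlite3


def claim_mutation(conn: sqlite3.Connection, mutation_id: int) -> bool:
    """Try to move one mutation from 'pending' to 'applying'.

    Returns True for the single caller that wins the race.
    """
    cur = conn.execute(
        "UPDATE topology_mutation SET state = 'applying' "
        "WHERE id = ? AND state = 'pending'",
        (mutation_id,),
    )
    conn.commit()
    # rowcount is 1 only if the row was still pending when the UPDATE ran;
    # a caller that arrives later matches no row and sees 0.
    return cur.rowcount == 1
```

Because the state check and the state change happen in one statement, there is no window between "read pending" and "write applying" for a second reconciler to slip into.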
MazeNET phase 2 step 6. Equips the repo layer with the CRUD the web
editor needs before deploy.
- TopologyNotEditable exception: raised when a pending-only method hits
a non-pending topology. The intent is "free-form edits stop at deploy;
the mutator (step 7) takes over for live topologies."
- _assert_pending helper checks status inside the session.
- update_lan / update_topology_decky accept enforce_pending=True for
pre-deploy callers (existing internal callers default to False so
behavior is unchanged).
- delete_lan: cascades edges; refuses if any decky has only one edge
(= this LAN is its home) to prevent orphans.
- delete_topology_decky: cascades edges.
- delete_topology_edge: bare-bones removal.
All four mutators accept expected_version for optimistic concurrency.
Existing tests continue to pass (no behavior change for persist/deploy).
MazeNET phase 2 step 5. Pure storage — the generator emits None for
x/y and the web canvas fills them in later. No logic changes; no
compose, deploy, or validator impact.
MazeNET phase 2 step 4. Readies the repo layer for concurrent editors
(web canvas + CLI + mutator) without lost-write races.
- Topology.version: monotonically bumped on supervised child-row writes.
- VersionConflict exception carries {current, expected} for the UI.
- _check_and_bump_version helper reads Topology in the same session,
compares against expected_version, raises on mismatch, bumps on match.
Commit happens in the caller's existing transaction so check+bump+write
are atomic per mutation.
- add_lan / update_lan / add_topology_decky / update_topology_decky /
add_topology_edge accept expected_version=None by default, preserving
every existing caller's behavior.
When expected_version is None, no check runs and version stays put —
internal callers (persist) that don't care about concurrency keep
working unchanged.
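The check-and-bump behavior described above can be shown without a database. In this sketch the class and function names are stand-ins (suffixes added to avoid implying they are the repo's own), and an in-memory dataclass replaces the Topology row read inside the session.

```python
from dataclasses import dataclass
from typing import Optional


class VersionConflictSketch(Exception):
    """Raised when the caller's expected version is stale."""

    def __init__(self, current: int, expected: int) -> None:
        super().__init__(f"version conflict: current={current}, expected={expected}")
        self.current = current
        self.expected = expected


@dataclass
class TopologyRow:
    version: int = 0


def check_and_bump(row: TopologyRow, expected_version: Optional[int] = None) -> int:
    """Compare against expected_version, bump on match, return the version."""
    if expected_version is None:
        # No check requested: leave the version untouched.
        return row.version
    if row.version != expected_version:
        raise VersionConflictSketch(row.version, expected_version)
    row.version += 1
    return row.version
```

Two editors that both loaded version 3 illustrate the point: the first write succeeds and moves the row to 4, and the second fails with current=4, expected=3 instead of silently overwriting.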
MazeNET phase 2 step 3. Blocks deploys of hand-authored topologies that
would fail mid-bring-up (orphan deckies, duplicate IPs, overlapping
subnets, unknown services) with a structured error list instead of a
docker error at startup.
Rules (one function each, composable by the editor for inline hints):
- exactly one DMZ
- every LAN has a bridge chain to the DMZ (BFS via multi-homed deckies)
- no orphan deckies
- unique LAN and decky names per topology
- no IP collisions + IPs inside their LAN's subnet
- no LAN subnet overlaps
- every service in decnet.fleet.all_service_names()
- service_config keys match the decky's declared services
deploy_topology runs the validator after hydrate, before any status
transition or Docker call; errors raise ValidationError and status
stays at pending.
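The bridge-chain rule in the list above is a reachability check: starting from the DMZ, walk across every decky attached to more than one LAN. A minimal BFS sketch is shown below; the function name and the (decky, lan) attachment-pair input shape are assumptions, not the validator's real interface.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple


def lans_unreachable_from_dmz(
    lans: Iterable[str],
    dmz: str,
    attachments: Iterable[Tuple[str, str]],
) -> List[str]:
    """Return the LANs with no chain of multi-homed deckies back to the DMZ."""
    lans_by_decky: Dict[str, Set[str]] = defaultdict(set)
    deckies_by_lan: Dict[str, Set[str]] = defaultdict(set)
    for decky, lan in attachments:
        lans_by_decky[decky].add(lan)
        deckies_by_lan[lan].add(decky)

    seen = {dmz}
    queue = deque([dmz])
    while queue:
        lan = queue.popleft()
        # Every decky on this LAN bridges to each other LAN it is attached to.
        for decky in deckies_by_lan[lan]:
            for neighbour in lans_by_decky[decky]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
    return sorted(set(lans) - seen)
```

An empty result means every LAN passes the rule; any names returned are natural candidates for entries in a structured error list.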
MazeNET phase 2 step 2. Mirrors the flat-fleet service_config pattern
(DeckyConfig.service_config → composer → svc.compose_fragment) into the
topology compose pipeline, so a hand-authored decky can carry overrides
like {"ssh": {"password": "megapassword"}} and the ssh fragment reads
them just like the flat path does.
- _PlannedDecky gains service_config: dict[str, dict].
- persist() stores it under decky_config["service_config"].
- topology/compose.py passes cfg.get("service_config", {}).get(svc, {})
to svc.compose_fragment(service_cfg=...).
Schema unchanged — service_config lives inside the existing
decky_config JSON blob. Zero changes in decnet/services/*.
MazeNET phase 2 step 1. Pulls inline IP/subnet allocation out of the
generator into decnet/topology/allocator.py so the editor + reconciler
can reuse the same primitives without duplicating logic.
- IPAllocator: stateful host-IP handout with reserve/release/is_free.
- SubnetAllocator: /24 handout under a base prefix, skips reservations.
- reserved_subnets(repo): collects claimed subnets across every
non-torn_down topology so concurrent drafts cannot collide.
- generate() accepts reserved_subnets= to skip existing claims.
Generator output is byte-identical under seed (behavior preserved).
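To make the allocator primitives above concrete, here is a stateless sketch built on the standard ipaddress module. The function names are hypothetical, and the mapping from a two-octet base prefix such as "172.20" to candidate /24 networks is an assumption; the real IPAllocator and SubnetAllocator are stateful classes.

```python
import ipaddress
from typing import Iterable, Union

IPv4Or6Address = Union[ipaddress.IPv4Address, ipaddress.IPv6Address]
IPv4Or6Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


def next_free_subnet(base_prefix: str, reserved: Iterable[str]) -> IPv4Or6Network:
    """Return the first /24 under base_prefix that overlaps no reservation."""
    taken = [ipaddress.ip_network(r) for r in reserved]
    for third_octet in range(256):
        candidate = ipaddress.ip_network(f"{base_prefix}.{third_octet}.0/24")
        if not any(candidate.overlaps(t) for t in taken):
            return candidate
    raise RuntimeError(f"no free /24 left under {base_prefix}")


def next_free_ip(subnet: IPv4Or6Network, used: Iterable[str]) -> IPv4Or6Address:
    """Return the first host address in subnet that is not already used."""
    used_addrs = {ipaddress.ip_address(u) for u in used}
    # hosts() skips the network and broadcast addresses; callers that want
    # to keep the gateway address free should pass it in `used`.
    for host in subnet.hosts():
        if host not in used_addrs:
            return host
    raise RuntimeError(f"no free host address left in {subnet}")
```

Using overlaps() rather than string equality means a wider reservation such as a /16 correctly blocks every /24 inside it.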
Covers dry-run compose emission (no status change), FAILED transition
with reason logged on daemon errors, teardown from FAILED, and a
live-marked end-to-end test that creates/removes bridge networks
against a real docker daemon (skipped on CI).
decnet topology {generate,list,show,deploy,teardown} wraps the new
persistence and deployer APIs. Structured text output, no ASCII art —
visual DAG rendering belongs in the web dashboard. Group is master-only
via MASTER_ONLY_GROUPS and a _require_master_mode guard on each body.
Adds per-topology compose generation (one Docker bridge network per
LAN, multi-homed bridge deckies, ip_forward sysctl for L3 forwarders)
plus async deploy_topology/teardown_topology in the engine. Leaf-first
teardown via BFS-named LAN reverse sort; partial-state safe on failure.
Adds decnet/topology/ with:
- config.TopologyConfig: pydantic model driving generation (depth,
branching_factor, deckies_per_lan_min/max, bridge_forward_probability,
cross_edge_probability, subnet_base_prefix, service selection, seed).
Emits GeneratedTopology dataclass (lans, deckies, edges).
- status.TopologyStatus + assert_transition: seven-state machine with
an explicit legal-transition table. torn_down is terminal; degraded
is schema-reserved for future Healer use.
- generator.generate: deterministic DAG generation under config.seed.
Builds a tree of LANs (DMZ at root), plants deckies in each LAN,
promotes one decky per non-DMZ LAN to a parent bridge, and rolls
cross-edges per cross_edge_probability for DAG shape.
- persistence: persist() writes a plan to the repo as pending;
transition_status() enforces state-machine legality; hydrate() loads
topology + children into a single dict.
Covered by tests/topology/{test_status,test_generator,test_persistence}.
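A state machine with an explicit legal-transition table, as described above, can be sketched like this. The seven state names come from the surrounding text, but the transition table itself is illustrative: the real table in decnet/topology/status.py is not shown here, and the class and function names are stand-ins.

```python
from enum import Enum
from typing import Dict, FrozenSet


class Status(str, Enum):
    PENDING = "pending"
    DEPLOYING = "deploying"
    ACTIVE = "active"
    DEGRADED = "degraded"
    FAILED = "failed"
    TEARING_DOWN = "tearing_down"
    TORN_DOWN = "torn_down"


# Illustrative table only; the shipped one may allow different edges.
LEGAL: Dict[Status, FrozenSet[Status]] = {
    Status.PENDING: frozenset({Status.DEPLOYING}),
    Status.DEPLOYING: frozenset({Status.ACTIVE, Status.FAILED}),
    Status.ACTIVE: frozenset({Status.DEGRADED, Status.TEARING_DOWN}),
    Status.DEGRADED: frozenset({Status.ACTIVE, Status.TEARING_DOWN}),
    Status.FAILED: frozenset({Status.TEARING_DOWN}),
    Status.TEARING_DOWN: frozenset({Status.TORN_DOWN}),
    Status.TORN_DOWN: frozenset(),  # terminal: nothing leaves this state
}


class IllegalTransition(ValueError):
    pass


def assert_transition_sketch(current: Status, target: Status) -> None:
    """Raise IllegalTransition unless current -> target is in the table."""
    if target not in LEGAL[current]:
        raise IllegalTransition(f"{current.value} -> {target.value} is not allowed")
```

Keeping the table as data rather than scattered if-statements makes the terminal state visible at a glance and lets a test enumerate every pair.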
Adds topology CRUD to BaseRepository (NotImplementedError defaults) and
implements them in SQLModelRepository: create/get/list/delete topologies,
add/update/list LANs and TopologyDeckies, add/list edges, plus an atomic
update_topology_status that appends a TopologyStatusEvent in the same
transaction. Cascade delete sweeps children before the topology row.
Covered by tests/topology/test_repo.py (roundtrip, per-topology name
uniqueness, status event log, cascade delete, status filter) and an
extension to tests/test_base_repo.py for the NotImplementedError surface.
Introduces five new SQLModel tables for MazeNET (nested deception
topologies): Topology, LAN, TopologyDecky, TopologyEdge, and
TopologyStatusEvent. DeckyShard is intentionally not touched —
TopologyDecky is a purpose-built sibling for MazeNET's lifecycle
(topology-scoped UUIDs, per-topology name uniqueness).
Part of MazeNET v1 (nested self-contained network-of-networks).
Schemathesis was failing CI on routes that returned status codes not
declared in their OpenAPI responses= dicts. Adds the missing codes
across swarm_updates, swarm_mgmt, swarm, fleet and attackers routers.
Also adds 400 to every POST/PUT/PATCH that accepts a JSON body —
Starlette returns 400 on malformed/non-UTF8 bodies before FastAPI's
422 validation runs, which schemathesis fuzzing trips every time.
No handler logic changed.
- tests/**: update templates/ → decnet/templates/ paths after module move
- tests/mysql_spinup.sh: use root:root and asyncmy driver
- tests/test_auto_spawn.py: patch decnet.cli.utils._pid_dir (package split)
- tests/test_cli.py: set DECNET_MODE=master in api-command tests
- tests/stress/conftest.py: run locust out-of-process via its CLI + CSV
stats shim to avoid urllib3 RecursionError from late gevent monkey-patch;
raise uvicorn startup timeout to 60s, accept 401 from auth-gated health,
strip inherited DECNET_* env, surface stderr on 0-request runs
- tests/stress/test_stress.py: loosen baseline thresholds to match hardware
The 1,878-line cli.py held every Typer command plus process/HTTP helpers
and mode-gating logic. Split into one module per command using a
register(app) pattern so submodules never import app at module scope,
eliminating circular-import risk.
- utils.py: process helpers, _http_request, _kill_all_services, console, log
- gating.py: MASTER_ONLY_* sets, _require_master_mode, _gate_commands_by_mode
- deploy.py: deploy + _deploy_swarm (tightly coupled)
- lifecycle.py: status, teardown, redeploy
- workers.py: probe, collect, mutate, correlate
- inventory.py, swarm.py, db.py, and one file per remaining command
__init__.py calls register(app) on each module then runs the mode gate
last, and re-exports the private symbols tests patch against
(_db_reset_mysql_async, _kill_all_services, _require_master_mode, etc.).
Test patches retargeted to the submodule where each name now resolves.
Enroll-bundle tarball test updated to assert decnet/cli/__init__.py.
No behavioral change.
Uvicorn's h11/httptools HTTP protocols don't populate
scope['extensions']['tls'], so /swarm/heartbeat's per-request cert
pinning was 403ing every call despite CERT_REQUIRED validating the cert
at handshake. Patch RequestResponseCycle.__init__ on both protocol
modules to read the peer cert off the asyncio transport and write DER
bytes into scope['extensions']['tls']['client_cert_chain']. Importing
the module from swarm_api.py auto-installs the patch in the swarmctl
uvicorn worker before any request is served.
DeckyFleet now branches on /system/deployment-mode: in swarm mode it
pulls /swarm/deckies and normalises DeckyShardView into the shared
Decky shape so the same card grid renders either way. Swarm cards gain
a host badge (host_name @ address), a state pill (running/degraded/
tearing_down/failed/teardown_failed with matching colors), an inline
last_error snippet, and a two-click arm/commit Teardown button lifted
from the old SwarmDeckies component. Mutate + interval controls are
hidden in swarm mode since the worker /mutate endpoint still 501s —
swarm-side rotation is a separate ticket.
Drops the standalone /swarm/deckies route + nav entry; SwarmDeckies.tsx
is deleted. The SWARM nav group keeps SwarmHosts, Remote Updates, and
Agent Enrollment.
New decnet.agent.heartbeat asyncio loop wired into the agent FastAPI
lifespan. Every 30 s the worker POSTs executor.status() to the master's
/swarm/heartbeat with its DECNET_HOST_UUID for self-identity; the
existing agent mTLS bundle provides the client cert the master pins
against SwarmHost.client_cert_fingerprint.
start() is a silent no-op when identity env (HOST_UUID, MASTER_HOST) is
unset or the worker bundle is missing, so dev runs and un-enrolled hosts
don't crash the agent app. On non-204 responses the loop logs loudly but
keeps ticking: an operator may re-enroll mid-session, and fail-closed
pinning shouldn't be self-silencing.
swarmctl CLI gains --tls/--cert/--key/--client-ca flags. With --tls the
controller runs uvicorn under HTTPS + mTLS (CERT_REQUIRED) so worker
heartbeats can reach it cross-host. Default is still 127.0.0.1 plaintext
for backwards compat with the master-CLI enrollment flow.
Auto-issue path (no --cert/--key given): a server cert signed by the
existing DECNET CA is issued once and parked under ~/.decnet/swarmctl/.
Workers already ship that CA's ca.crt from the enroll bundle, so they
verify the endpoint with no extra trust config. BYOC via --cert/--key
when the operator wants a publicly-trusted or externally-managed cert.
The auto-cert path is idempotent across restarts to keep a stable
fingerprint for any long-lived mTLS sessions.
The rendered /etc/decnet/decnet.ini now carries host-uuid and
swarmctl-port in [agent], which config_ini seeds into DECNET_HOST_UUID
and DECNET_SWARMCTL_PORT. Gives the worker a stable self-identity for
the heartbeat loop — the INI never has to be rewritten because cert
pinning is the real gate (a rotated UUID with a matching CA-signed
cert would still be blocked by SHA-256 fingerprint mismatch against
the stored SwarmHost row).
Also adds DECNET_MASTER_HOST so the agent can find the swarmctl URL
via the INI's existing master-host key.
New POST /swarm/heartbeat on the swarm controller. Workers post every
~30s with the output of executor.status(); the master bumps
SwarmHost.last_heartbeat and re-upserts each DeckyShard with a fresh
DeckyConfig snapshot and runtime-derived state (running/degraded).
Security: CA-signed mTLS alone is not sufficient — a decommissioned
worker's still-valid cert could resurrect ghost shards. The endpoint
extracts the presented peer cert (primary: scope["extensions"]["tls"],
fallback: transport.get_extra_info("ssl_object")) and SHA-256-pins it
to the SwarmHost.client_cert_fingerprint stored for the claimed
host_uuid. Extraction is factored into _extract_peer_fingerprint so
tests can exercise both uvicorn scope shapes and the both-unavailable
fail-closed path without mocking uvicorn's TLS pipeline.
Adds get_swarm_host_by_fingerprint to the repo interface (SQLModel
impl reuses the indexed client_cert_fingerprint column).
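The pinning step can be sketched as follows. This is a simplified
stand-in, not the endpoint code: it assumes the stored fingerprint is
hex SHA-256 over the DER-encoded certificate, and the function names
are made up for the example.

```python
import hashlib
import hmac
from typing import Optional


def cert_fingerprint(der_bytes: bytes) -> str:
    """Hex SHA-256 over the DER encoding of a certificate."""
    return hashlib.sha256(der_bytes).hexdigest()


def peer_matches_pin(der_bytes: Optional[bytes], stored: Optional[str]) -> bool:
    """Fail closed: a missing cert or a missing pin is never a match."""
    if not der_bytes or not stored:
        return False
    presented = cert_fingerprint(der_bytes)
    # Constant-time comparison; assumes the stored value is hex text.
    return hmac.compare_digest(presented, stored.lower())
```

A CA-signed cert that belongs to a different (or decommissioned) host
hashes to a different value, so it fails the pin even though the TLS
handshake accepted it.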
Dispatch now writes the full serialised DeckyConfig into
DeckyShard.decky_config (plus decky_ip as a cheap extract), so the
master can render the same rich per-decky card the local-fleet view
uses — hostname, distro, archetype, service_config, mutate_interval,
last_mutated — without round-tripping to the worker on every page
render. DeckyShardView gains the corresponding fields; the repository
flattens the snapshot at read time. Pre-migration rows keep working
(fields fall through as None/defaults).
Columns are additive + nullable so SQLModel.metadata.create_all handles
the change on both SQLite and MySQL. Backfill happens organically on
the next dispatch or (in a follow-up) agent heartbeat.
The reaper was being SIGTERM'd mid-rm because `start_new_session=True`
only puts the child in a new POSIX session; it does not escape decnet-agent.service's
cgroup. When the reaper ran `systemctl stop decnet-agent`, systemd
tore down the whole cgroup (reaper included) before `rm -rf /opt/decnet*`
finished, leaving the install on disk.
Spawn the reaper via `systemd-run --collect --unit decnet-reaper-<pid>`
so it runs in its own transient unit, outside the agent unit's cgroup. Falls
back to bare Popen for non-systemd hosts.
Decommissioning a worker from the dashboard (or swarm controller) now
asks the agent to wipe its own install before the master forgets it.
The agent stops decky containers + every decnet-* systemd unit, then
deletes /opt/decnet*, /etc/systemd/system/decnet-*, /var/lib/decnet/*,
and /usr/local/bin/decnet*. Logs under /var/log are preserved.
The reaper runs as a detached /tmp script (start_new_session=True) so
it survives the agent process being killed. Self-destruct dispatch is
best-effort — a dead worker doesn't block master-side cleanup.
Teardowns were synchronous all the way through: POST blocked on the
worker's docker-compose-down cycle (seconds to minutes), the frontend
locked tearingDown to a single string so only one button could be armed
at a time, and operators couldn't queue a second teardown until the
first returned. On a flaky worker that meant staring at a spinner for
the whole RTT.
Backend: POST /swarm/hosts/{uuid}/teardown returns 202 the instant the
request is validated. Affected shards flip to state='tearing_down'
synchronously before the response so the UI reflects progress
immediately, then the actual AgentClient call + DB cleanup run in an
asyncio.create_task (tracked in a module-level set to survive GC and
to be drainable by tests). On failure the shard flips to
'teardown_failed' with the error recorded — nothing is re-raised,
since there's no caller to catch it.
Frontend: swap tearingDown / decommissioning from 'string | null' to
'Set<string>'. Each button tracks its own in-flight state; the poll
loop picks up the final shard state from the backend. Multiple
teardowns can now be queued without blocking each other.
Submitting an INI with a single [decky1] was silently redeploying the
deckies from the *previous* deploy too. POST /deckies/deploy merged the
new INI into the stored DecnetConfig by name, so a 1-decky INI on top of
a prior 3-decky run still pushed 3 deckies to the worker. Those stale
decky2/decky3 kept their old IPs, collided on the parent NIC, and the
agent failed with 'Address already in use' on deckies the operator
never asked for.
The INI is the source of truth for which deckies exist this deploy.
Full replace: config.deckies = list(new_decky_configs). Operators who
want to add more deckies should list them all in the INI.
Update the deploy-limit test to reflect the new replace semantics, and
add a regression test asserting prior state is dropped.
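The difference between the old merge-by-name and the new full replace,
sketched with plain dicts instead of DeckyConfig objects (the dict shape
and both function names are assumptions made for the example):

```python
def merge_by_name(stored: list, new: list) -> list:
    """Old behaviour: new entries overwrite by name, stale ones survive."""
    by_name = {d["name"]: d for d in stored}
    by_name.update({d["name"]: d for d in new})
    return list(by_name.values())


def full_replace(stored: list, new: list) -> list:
    """New behaviour: the submitted INI alone decides which deckies exist."""
    return list(new)
```

With three stored deckies and a one-decky INI, the merge still yields
three entries while the replace yields one.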
Teardown and Decommission buttons were silently dead in the browser.
Root cause: every handler started with 'if (!window.confirm(...)) return;'
and browsers permanently disable confirm() for a tab once the user ticks
'Prevent this page from creating additional dialogs'. From then on confirm()
returns false with no UI, the handler early-exits, and no request is ever
fired: no network traffic, no console error, no backend activity.
Swap to an inline two-click pattern: first click arms the button (label
flips to 'Click again to confirm', resets after 4s); second click within
the window commits. Same safety against misclicks, zero dependency on
browser-native dialog primitives.
docker compose up is partial-success-friendly — a build failure on one
service doesn't roll back the others. But the master was catching the
agent's 500 and tagging every decky in the shard as 'failed' with the
same error message. From the UI that looked like all three deckies died
even though two were live on the worker.
On dispatch exception, probe the agent's /status to learn which deckies
actually have running containers, and upsert per-decky state accordingly.
Only fall back to marking the whole shard failed if the status probe
itself is unreachable.
Enhance agent.executor.status() to include a 'runtime' map keyed by
decky name with per-service container state, so the master has something
concrete to consult.
Two compounding root causes produced the recurring 'Address already in use'
error on redeploy:
1. _ensure_network only compared driver+name; if a prior deploy's IPAM
pool drifted (different subnet/gateway/range), Docker kept handing out
addresses from the old pool and raced the real LAN. Now also compares
Subnet/Gateway/IPRange and rebuilds on drift.
2. A prior half-failed 'up' could leave containers still holding the IPs
and ports the new run wants. Run 'compose down --remove-orphans' as a
best-effort pre-up cleanup so IPAM starts from a clean state.
Also surface docker compose stderr to the structured log on failure so
the agent's journal captures Docker's actual message (which IP, which
port) instead of just the exit code.
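The drift check in point 1 might look roughly like this. The input is
assumed to be the dict Docker returns when inspecting a network (Driver
plus IPAM.Config entries carrying Subnet, Gateway and IPRange); the
function name and signature are illustrative, not the real
_ensure_network.

```python
def network_drifted(existing: dict, driver: str, subnet: str,
                    gateway: str, ip_range: str) -> bool:
    """True when the existing network must be rebuilt."""
    if existing.get("Driver") != driver:
        return True
    pools = (existing.get("IPAM") or {}).get("Config") or []
    wanted = {"Subnet": subnet, "Gateway": gateway, "IPRange": ip_range}
    # At least one pool has to match all three values, otherwise it drifted.
    return not any(
        all(pool.get(key) == value for key, value in wanted.items())
        for pool in pools
    )
```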
Operators want to know what address to poke when triaging a swarm decky;
the compose-hash column was debug scaffolding that never paid off.
DeckyShard has no IP column (the deploy-time IP lives on DecnetConfig),
so the list endpoint resolves it at read time by joining shards against
the stored deployment state by decky_name. Missing lookups render as "—"
rather than erroring — the list stays useful even after a master restart
that hasn't persisted a config yet.
The nested list-comp `[f"{id}-{svc}" for svc in [d.services for d ...]]`
iterated over a list of lists, so `svc` was the whole services list and
f-string stringified it -> `decky3-['sip']`. docker compose saw "no such
service" and the per-decky teardown failed 500.
Flatten: find the matching decky once, then iterate its services. Return
early (no-op) on unknown decky_id and on empty service lists. Regression test
asserts the emitted compose args have no '[' or quote characters.
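A small reproduction of the bug and the fix. The original comprehension
is elided in the message, so the filter on d.name and the Decky
dataclass here are assumptions made to get a runnable example.

```python
from dataclasses import dataclass, field


@dataclass
class Decky:
    name: str
    services: list = field(default_factory=list)


def buggy_targets(deckies, decky_id):
    # The inner comprehension yields a list of lists, so svc is a whole
    # services list and the f-string stringifies it.
    return [f"{decky_id}-{svc}"
            for svc in [d.services for d in deckies if d.name == decky_id]]


def service_targets(deckies, decky_id):
    # Find the matching decky once, then iterate its services.
    decky = next((d for d in deckies if d.name == decky_id), None)
    if decky is None or not decky.services:
        return []  # unknown decky or nothing to tear down
    return [f"{decky_id}-{svc}" for svc in decky.services]
```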
Agents already exposed POST /teardown; the master was missing the plumbing
to reach it. Add:
- POST /api/v1/swarm/hosts/{uuid}/teardown — admin-gated. Body
{decky_id: str|null}: null tears the whole host, a value tears one decky.
On worker failure the master returns 502 and leaves DB shards intact so
master and agent stay aligned.
- BaseRepository.delete_decky_shard(name) + sqlmodel impl for per-decky
cleanup after a single-decky teardown.
- SwarmHosts page: "Teardown all" button (keeps host enrolled).
- SwarmDeckies page: per-row "Teardown" button.
Also exclude setuptools' build/ staging dir from the enrollment tarball —
`pip install -e` on the master generates build/lib/decnet_web/node_modules
and the bundle walker was leaking it to agents. Align pyproject's bandit
exclude with the git-hook invocation so both skip decnet/templates/.
The docker build contexts and syslog_bridge.py lived at repo root, which
meant setuptools (include = ["decnet*"]) never shipped them. Agents
installed via `pip install $RELEASE_DIR` got site-packages/decnet/** but no
templates/, so every deploy blew up in deployer._sync_logging_helper with
FileNotFoundError on templates/syslog_bridge.py.
Move templates/ -> decnet/templates/ and declare it as setuptools
package-data. Path resolutions in services/*.py and engine/deployer.py drop
one .parent since templates now lives beside the code. Test fixtures,
bandit exclude path, and coverage omit glob updated to match.
Agents now ship with collector/prober/sniffer as systemd services; mutator,
profiler, web, and API stay master-only (profiler rebuilds attacker profiles
against the master DB — no per-host DB exists). Expand _EXCLUDES to drop the
full decnet/web, decnet/mutator, decnet/profiler, and decnet_web trees from
the enrollment bundle.
Updater now calls _heal_path_symlink + _sync_systemd_units after rotation so
fleets pick up new unit files and /usr/local/bin/decnet tracks the shared venv
without a manual reinstall. daemon-reload runs once per update when any unit
changed.
Fix _service_registry matchers to accept systemd-style /usr/local/bin/decnet
cmdlines (psutil returns a list — join to string before substring-checking)
so agent-mode `decnet status` reports collector/prober/sniffer correctly.
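Why the join matters, as a sketch: psutil hands back the command line as
a list of arguments, and `in` on a list tests element equality rather
than substring containment. cmdline_matches is a made-up helper name,
not the real matcher.

```python
def cmdline_matches(cmdline: list, *needles: str) -> bool:
    """Substring-match every needle against the joined command line."""
    joined = " ".join(cmdline)
    return all(needle in joined for needle in needles)


# Shape of what psutil.Process(...).cmdline() returns for a systemd unit.
example = ["/usr/local/bin/decnet", "collect", "--daemon"]
naive_hit = "decnet collect" in example          # list membership: False
joined_hit = cmdline_matches(example, "decnet", "collect")
```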
The bootstrap installer copies etc/systemd/system/*.service into
/etc/systemd/system at enrollment time, but the updater was skipping
that step — a code push could not ship a new unit (e.g. the four
per-host microservices added this session) or change ExecStart on an
existing one. systemctl alone doesn't re-read unit files; daemon-reload
is required.
run_update / run_update_self now call _sync_systemd_units after
rotation: diff each .service file against the live copy, atomically
replace changed ones, then issue a single `systemctl daemon-reload`.
No-op on legacy tarballs that don't ship etc/systemd/system/.
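One way to sketch the diff-and-replace step. This is not the real
_sync_systemd_units: the function name, the 0o644 mode and the choice to
return the changed file names are assumptions, and the caller is
expected to run `systemctl daemon-reload` once when the list is
non-empty.

```python
import filecmp
import os
import tempfile
from pathlib import Path


def sync_units(src_dir: Path, dst_dir: Path) -> list:
    """Copy changed *.service files from src_dir to dst_dir atomically."""
    changed = []
    if not src_dir.is_dir():
        return changed  # legacy tarball without unit files
    for src in sorted(src_dir.glob("*.service")):
        dst = dst_dir / src.name
        if dst.exists() and filecmp.cmp(src, dst, shallow=False):
            continue  # byte-identical, leave it alone
        # Write next to the target, then rename over it in one step.
        fd, tmp = tempfile.mkstemp(dir=dst_dir, prefix=src.name + ".")
        with os.fdopen(fd, "wb") as fh:
            fh.write(src.read_bytes())
        os.chmod(tmp, 0o644)
        os.replace(tmp, dst)
        changed.append(src.name)
    return changed
```

os.replace within one directory is atomic on POSIX, so systemd never
sees a half-written unit file.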
Previously `decnet status` on an agent showed every microservice as DOWN
because deploy's auto-spawn was unihost-scoped and the agent CLI gate
hid the per-host commands. Now:
- collect, probe, profiler, sniffer drop out of MASTER_ONLY_COMMANDS
(they run per-host; master-side work stays master-gated).
- mutate stays master-only (it orchestrates swarm-wide respawns).
- decnet/mutator/ excluded from agent tarballs — never invoked there.
- decnet/web exclusion tightened: ship db/ + auth.py + dependencies.py
(profiler needs the repo singleton), drop api.py, swarm_api.py,
ingester.py, router/, templates/.
- Four new systemd unit templates (decnet-collector/prober/profiler/
sniffer) shipped in every enrollment tarball.
- enroll_bootstrap.sh enables + starts all four alongside agent and
forwarder at install time.
- updater restarts the aux units on code push so they pick up the new
release (best-effort — legacy enrollments without the units won't
fail the update).
- status table hides Mutator + API rows in agent mode.
Agents never run the FastAPI master app (decnet/web/) or serve the React
frontend (decnet_web/) — they run decnet.agent, decnet.updater, and
decnet.forwarder, none of which import decnet.web. Shipping the master
tree bloats every enrollment payload and needlessly widens the worker's
attack surface.
Excluded paths are unreachable on the worker (all cli.py imports of
decnet.web are inside master-only command bodies that the agent-mode
gate strips). Tests assert neither tree leaks into the tarball.
The bootstrap was installing into /opt/decnet/.venv with an editable
`pip install -e .`, and /usr/local/bin/decnet pointed there. The updater
writes releases to /opt/decnet/releases/active/ with a shared venv at
/opt/decnet/venv — a parallel tree nothing on the box actually runs.
Result: updates appeared to succeed (release dir rotated, SHA changed)
but systemd kept executing the untouched bootstrap code.
Changes:
- Bootstrap now installs directly into /opt/decnet/releases/active
with the shared venv at /opt/decnet/venv and /opt/decnet/current
symlinked. Same layout the updater rotates in and out of.
- /usr/local/bin/decnet -> /opt/decnet/venv/bin/decnet.
- run_update / run_update_self heal /usr/local/bin/decnet on every
push so already-enrolled hosts recover on the next update instead
of needing a re-enroll.
- run_update / run_update_self now log each phase (receive, extract,
pip install, rotate, restart, probe) so the updater log actually
shows what happened.
Agents run deckies locally and need to inspect their own state. Removed
`status` from MASTER_ONLY_COMMANDS so it survives the agent-mode gate.
Useful for validating remote updater pushes from the master.
Three holes in the systemd integration:
1. _spawn_agent_via_systemd only restarted decnet-agent.service, leaving
decnet-forwarder.service running the pre-update code (same /opt/decnet
tree, stale import cache).
2. run_update_self used os.execv regardless of environment — the re-execed
process kept the updater's existing cgroup/capability inheritance but
systemd would notice MainPID change and mark the unit degraded.
3. No path to surface a failed forwarder restart (legacy enrollments have
no forwarder unit).
Now: agent restart first, forwarder restart as best-effort (logged but
non-fatal so legacy workers still update), MainPID still read from the
agent unit. For update-self under systemd, spawn a detached sleep+
systemctl restart so the HTTP response flushes before the unit cycles.
Bootstrap used to end with `decnet updater --daemon` which forks and
detaches — invisible to systemctl, no auto-restart, dies on reboot.
Ships a decnet-updater.service template matching the pattern of the
other units (Restart=on-failure, log to /var/log/decnet/decnet.updater.log,
certs from /etc/decnet/updater, install tree at /opt/decnet), bundles
it alongside agent/forwarder/engine units, and the installer now
`systemctl enable --now`s it when --with-updater is set.
The create helpers short-circuited on name alone, so a prior macvlan
deploy left Docker's decnet_lan network in place. A subsequent ipvlan
deploy would no-op the network create, then container attach would try
to add a macvlan port on enp0s3 that already had an ipvlan slave —
EBUSY, agent 500, docker ps empty.
Now: when the existing network's driver disagrees with the requested
one, disconnect any live containers and DROP it before recreating.
A parent NIC can host only one driver at a time.
Also: setup_host_{macvlan,ipvlan} opportunistically delete the opposite
host-side helper so we don't leave cruft across driver swaps.
_DB_RESET_TABLES was missing the swarm tables, so drop-tables mode left
them intact. create_all doesn't alter columns on existing tables, so any
schema change to SwarmHost (like use_ipvlan) never took effect after a
reset. Add them to the reset list, child FK first (decky_shards -> swarm_hosts).
Wi-Fi APs bind one MAC per associated station, so VirtualBox/VMware
guests bridged over Wi-Fi rotate the VM's DHCP lease when Docker's
macvlan starts emitting container-MAC frames through the vNIC. Adds a
`use_ipvlan` toggle on the Agent Enrollment tab (mirrors the updater
daemon checkbox): flips the flag on SwarmHost, bakes `ipvlan=true` into
the agent's decnet.ini, and `_worker_config` forces ipvlan=True on the
per-host shard at dispatch. Safe no-op on wired/bare-metal agents.
Deckies merged in from a prior deployment's saved state kept their
original host_uuid — which dispatch_decnet_config then 404'd on if that
host had since been decommissioned or re-enrolled at a different uuid.
Before round-robin assignment, drop any host_uuid that isn't in the live
swarm_hosts set so orphaned entries get reassigned instead of exploding
with 'unknown host_uuid'.
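The pruning step followed by round-robin assignment, sketched over plain
dicts. The real code works on decky config objects; the assign_hosts
name and the dict shape are illustrative only.

```python
from itertools import cycle


def assign_hosts(deckies: list, live_host_uuids: list) -> list:
    """Drop host_uuids that are no longer enrolled, then fill the gaps."""
    live = list(live_host_uuids)
    if not live:
        raise ValueError("no live swarm hosts to assign to")
    live_set = set(live)
    next_host = cycle(live)
    for decky in deckies:
        if decky.get("host_uuid") not in live_set:
            decky["host_uuid"] = None  # orphaned or never assigned
        if decky["host_uuid"] is None:
            decky["host_uuid"] = next(next_host)
    return deckies
```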
tar_working_tree (walks repo + gzips several MB) and detect_git_sha
(shells out) were called directly on the event loop, so /swarm-updates/push
and /swarm-updates/push-self froze every other request until the tarball
was ready. Wrap both in asyncio.to_thread.
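A self-contained illustration of the effect, using time.sleep as a
stand-in for the tarball build. The function names are invented; the
point is only that other coroutines keep running while the blocking
call sits in a worker thread.

```python
import asyncio
import time


def build_tarball_blocking() -> bytes:
    time.sleep(0.2)  # stands in for walking the repo and gzipping
    return b"tarball"


async def heartbeat(ticks: list) -> None:
    for _ in range(5):
        await asyncio.sleep(0.02)
        ticks.append(time.monotonic())


async def push():
    ticks = []
    hb = asyncio.create_task(heartbeat(ticks))
    start = time.monotonic()
    # Offload the blocking work so the event loop stays responsive.
    data = await asyncio.to_thread(build_tarball_blocking)
    end = time.monotonic()
    await hb
    served_meanwhile = sum(1 for t in ticks if start <= t <= end)
    return data, served_meanwhile
```

Calling build_tarball_blocking() directly inside push() would leave
served_meanwhile at zero, because the loop could not schedule the
heartbeat until the build returned.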
systemd daemons run with WorkingDirectory=/ by default; docker compose
derives the project name from basename(cwd), which is empty at '/', and
aborts with 'project name must not be empty'. Pass -p decnet explicitly
so the project name is independent of cwd, and set WorkingDirectory=/opt/decnet
on the three DECNET units so compose artifacts (decnet-compose.yml,
build contexts) also land in the install dir.
POST /deckies/deploy now branches on DECNET_MODE + enrolled host presence:
when the caller is a master with at least one reachable swarm host,
host_uuids are assigned round-robin across the new deckies and the config
is dispatched via AgentClient. Falls back to local docker-compose otherwise.
Extracts the dispatch loop from api_deploy_swarm into dispatch_decnet_config
so both endpoints share the same shard/dispatch/persist path. Adds
GET /system/deployment-mode for the UI to show 'will shard across N hosts'
vs 'will deploy locally' before the operator clicks deploy.
Stateless /api/v1/deckies/deploy previously instantiated DecnetConfig with
deckies=[] so it could merge entries later — but DecnetConfig.deckies is
min_length=1, so Pydantic raised and the global handler mapped it to 422
'Internal data consistency error'. Construct the config after
build_deckies_from_ini returns at least one DeckyConfig.
Rename log-file-path -> log-directory (maps to DECNET_LOG_DIRECTORY). Bundle
now ships three systemd units rendered with agent_name/master_host and installs
them into /etc/systemd/system/. Bootstrap replaces direct 'decnet X --daemon'
calls with systemctl enable --now. Each unit pins DECNET_SYSTEM_LOGS so agent,
forwarder, and deckies logs land at decnet.{agent,forwarder}.log and decnet.log
under /var/log/decnet.
Mirrors the agent→forwarder pattern: `decnet swarmctl` now fires the
syslog-TLS listener as a detached Popen sibling so a single master
invocation brings the full receive pipeline online. --no-listener opts
out for operators who want to run the listener on a different host (or
under their own systemd unit).
Listener bind host / port come from DECNET_LISTENER_HOST and
DECNET_SWARM_SYSLOG_PORT — both seedable from /etc/decnet/decnet.ini.
PID at $(pid_dir)/listener.pid so operators can kill/restart manually.
decnet.ini.example ships alongside env.config.example as the
documented surface for the new role-scoped config. Mode, forwarder
targets, listener bind, and master ports all live there — no more
memorizing flag trees.
Extends tests/test_auto_spawn.py with two swarmctl cases: listener is
spawned with the expected argv + PID file, and --no-listener suppresses.
New _spawn_detached(argv, pid_file) helper uses Popen with
start_new_session=True + close_fds=True + DEVNULL stdio to launch a
DECNET subcommand as a fully independent process. The parent does NOT
wait(); if it dies the child survives under init. This is deliberately
not a supervisor — if the child dies the operator restarts it manually.
_pid_dir() picks /opt/decnet when writable else ~/.decnet, so both
root-run production and non-root dev work without ceremony.
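A minimal sketch of such a helper, assuming a POSIX host
(start_new_session has no effect on Windows). It mirrors the kwargs
listed above but is not the project's _spawn_detached; creating the
PID file's parent directory is an added convenience.

```python
import subprocess
import sys
from pathlib import Path


def spawn_detached(argv: list, pid_file: Path) -> int:
    """Launch argv as an independent process and record its PID."""
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # new session, detached from our terminal
        close_fds=True,
    )
    # Deliberately no proc.wait(): the child is not supervised.
    pid_file.parent.mkdir(parents=True, exist_ok=True)
    pid_file.write_text(f"{proc.pid}\n")
    return proc.pid
```

Usage would look like
spawn_detached([sys.executable, "-c", "pass"], Path("/tmp/demo.pid")).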
`decnet agent` now auto-spawns `decnet forwarder --daemon ...` as
that detached sibling, pulling master host + syslog port from
DECNET_SWARM_MASTER_HOST / DECNET_SWARM_SYSLOG_PORT. --no-forwarder
opts out. If DECNET_SWARM_MASTER_HOST is unset the auto-spawn is
silently skipped (single-host dev or operator wants to start the
forwarder separately).
tests/test_auto_spawn.py monkeypatches subprocess.Popen and verifies:
the detach kwargs are passed, the PID file exists and contains a
valid positive integer (PID-file corruption is a real operational
headache — catching bad writes at the test level is free), the
--no-forwarder flag suppresses the spawn, and the unset-master-host
path silently skips.
- MASTER_ONLY_COMMANDS / MASTER_ONLY_GROUPS frozensets enumerate every
command a worker host must not see. Comment block at the declaration
puts the maintenance obligation in front of anyone touching command
registration.
- _gate_commands_by_mode() filters both app.registered_commands (for
@app.command() registrations) and app.registered_groups (for
add_typer sub-apps) so the 'swarm' group disappears along with
'api', 'swarmctl', 'deploy', etc. on agent hosts.
- _require_master_mode() is the belt-and-braces in-function guard,
added to the four highest-risk commands (api, swarmctl, deploy,
teardown). Protects against direct function imports that would
bypass Typer.
- DECNET_DISALLOW_MASTER=false is the escape hatch for hybrid dev
hosts that legitimately play both roles.
tests/test_mode_gating.py exercises help-text listings via subprocess
and the defence-in-depth guard via direct import.
- decnet/__init__.py now calls load_ini_config() on first import of any
decnet.* module, seeding os.environ via setdefault() so env.py's
module-level reads pick up INI values without the shell having to export
them. Real env vars still win.
- env.py exposes DECNET_MODE (default 'master') and
DECNET_DISALLOW_MASTER (default true), consumed by the upcoming
master-command gating in cli.py.
Back-compat: missing /etc/decnet/decnet.ini is a no-op. Existing
.env.local + flag-based launches behave identically.
- decnet/agent/app.py /health: drop leftover 'push-test-2' canary
planted during live VM push verification and never cleaned up;
test_health_endpoint asserts the exact dict shape.
- tests/test_factory.py: switch the lazy-engine check from
mysql+aiomysql (not in pyproject) to mysql+asyncmy (the driver the
project actually ships). The test does not hit the wire so the
dialect swap is safe.
Both were red on `pytest tests/` before any config/auto-spawn work
began; fixing them here so the upcoming commits land on a green
full-suite baseline.
New decnet/config_ini.py parses a role-scoped INI file via stdlib
configparser and seeds os.environ via setdefault — real env vars still
win, keeping full back-compat with .env.local flows.
[decnet] holds role-agnostic keys (mode, disallow-master, log-file-path);
the role section matching `mode` is loaded, the other is ignored
silently so a worker never reads master-only keys (and vice versa).
Loader is standalone in this commit — not wired into cli.py yet.
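The loading rule can be sketched like this. The key-to-variable mapping
(DECNET_ prefix, upper case, dashes to underscores) is inferred from the
examples in these notes, and the [master] section name, the api-port key
and the function name are assumptions, not the real config_ini code.

```python
import configparser


def seed_env_from_ini(text: str, environ: dict) -> None:
    """Seed environ from [decnet] plus the section named by its mode key."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section("decnet"):
        return
    mode = parser.get("decnet", "mode", fallback="master")
    for section in ("decnet", mode):
        if not parser.has_section(section):
            continue
        for key, value in parser.items(section):
            name = "DECNET_" + key.upper().replace("-", "_")
            environ.setdefault(name, value)  # real env vars still win
```

Passing os.environ as the second argument gives the seeding behaviour
described above; passing a plain dict keeps the sketch easy to test.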
The module-level _require_env('DECNET_JWT_SECRET') call blocked
`decnet agent` and `decnet updater` from starting on workers that
legitimately have no business knowing the master's JWT signing key.
Move the resolution into a module `__getattr__`: only consumers that
actually read `decnet.env.DECNET_JWT_SECRET` trigger the validation,
which in practice means only decnet.web.auth (master-side).
Adds tests/test_env_lazy_jwt.py covering both the in-process lazy path
and an out-of-process `decnet agent --help` subprocess check with a
fully sanitized environment.
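A sketch of the lazy-attribute idea. In a real module file the hook is
just a top-level `def __getattr__(name)`; here a module object is built
by hand so the example is self-contained. make_env_module and the error
message are invented for the illustration.

```python
import types


def make_env_module(environ: dict) -> types.ModuleType:
    mod = types.ModuleType("env_sketch")

    def __getattr__(name: str):
        # Only called when normal attribute lookup on the module fails.
        if name == "DECNET_JWT_SECRET":
            value = environ.get("DECNET_JWT_SECRET")
            if not value:
                raise ValueError("DECNET_JWT_SECRET is required")
            return value
        raise AttributeError(f"module {mod.__name__!r} has no attribute {name!r}")

    mod.__getattr__ = __getattr__
    # Eager settings stay as ordinary module attributes.
    mod.DECNET_MODE = environ.get("DECNET_MODE", "master")
    return mod
```

Importing the module never validates the secret; only code that reads
the attribute pays for (and can fail on) the check.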
React component for /swarm-updates: per-host table polled every 10s,
row actions for Push Update / Update Updater / Rollback, a fleet-wide
'Push to All' modal with the include_self toggle, and toast feedback
per result.
Admin-only (both server-gated and UI-gated). Unreachable hosts surface
as an explicit state; actions are disabled on them. Rollback is
disabled when the worker has no previous release slot (previous_sha
null from /hosts).
Adds /api/v1/swarm-updates/{hosts,push,push-self,rollback} behind
require_admin. Reuses the existing UpdaterClient + tar_working_tree + the
per-host asyncio.gather pattern from api_deploy_swarm.py; tarball is
built exactly once per /push request and fanned out to every selected
worker. /hosts filters out decommissioned hosts and agent-only
enrollments (no updater bundle = not a target).
Connection drops during /update-self are treated as success — the
updater re-execs itself mid-response, so httpx always raises.
Pydantic models live in decnet/web/db/models.py (single source of
truth). 24 tests cover happy paths, rollback, transport failures,
include_self ordering (skip on rolled-back agents), validation, and
RBAC gating.
Add deploy/ unit files for every DECNET daemon (agent, updater, api, web,
swarmctl, listener, forwarder). All run as User=decnet with NoNewPrivileges,
ProtectSystem, PrivateTmp, LockPersonality; AmbientCapabilities=CAP_NET_ADMIN
CAP_NET_RAW only on the agent (MACVLAN/scapy). Existing api/web units migrated
to /opt/decnet layout and the same hardening stanza.
Make the updater's _spawn_agent systemd-aware: under systemd (detected via
INVOCATION_ID + systemctl on PATH), `systemctl restart decnet-agent.service`
replaces the Popen path so the new agent inherits the unit's ambient caps
instead of the updater's empty set. _stop_agent becomes a no-op in that mode
to avoid racing systemctl's own stop phase.
Tests cover the dispatcher branch selection, MainPID parsing, and the
systemd no-op stop.
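The detection rule reduces to two checks, sketched here with injectable
inputs so it can be exercised without a systemd host. Function names are
placeholders; the INVOCATION_ID and systemctl-on-PATH conditions and the
restart command come from the description above.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def under_systemd(environ: Optional[Mapping] = None,
                  which: Callable = shutil.which) -> bool:
    """True when systemd started us and systemctl is available."""
    env = os.environ if environ is None else environ
    return bool(env.get("INVOCATION_ID")) and which("systemctl") is not None


def restart_argv(unit: str = "decnet-agent.service") -> list:
    """Command used instead of the Popen path when under systemd."""
    return ["systemctl", "restart", unit]
```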
- _run_pip: on first venv use, install decnet with its full dep tree so the
bootstrapped environment actually has typer/fastapi/uvicorn. Subsequent
updates keep --no-deps for a near-no-op refresh.
- run_update_self: do not reuse sys.argv to re-exec the updater. Inside the
live process, sys.argv is the uvicorn subprocess invocation (--ssl-keyfile
etc.), which 'decnet updater' CLI rejects. Reconstruct the operator-visible
command from env vars set by updater.server.run.
If the agent was started outside the updater (manually, during dev,
or from a prior systemd unit), there is no agent.pid for _stop_agent
to target, so a successful code install leaves the old in-memory
agent process still serving requests. Scan /proc for any decnet agent
command and SIGTERM all matches so restart is reliable regardless of
how the agent was originally launched.
Adds a separate `decnet updater` daemon on each worker that owns the
agent's release directory and installs tarball pushes from the master
over mTLS. A normal `/update` never touches the updater itself, so the
updater is always a known-good rescuer if a bad agent push breaks
/health — the rotation is reversed and the agent restarted against the
previous release. `POST /update-self` handles updater upgrades
explicitly (no auto-rollback).
- decnet/updater/: executor, FastAPI app, uvicorn launcher
- decnet/swarm/updater_client.py, tar_tree.py: master-side push
- cli: `decnet updater`, `decnet swarm update [--host|--all]
[--include-self] [--dry-run]`, `--updater` on `swarm enroll`
- enrollment API issues a second cert (CN=updater@<host>) signed by the
same CA; SwarmHost records updater_cert_fingerprint
- tests: executor, app, CLI, tar tree, enroll-with-updater (37 new)
- wiki: Remote-Updates page + sidebar + SWARM-Mode cross-link
`swarm list` only shows enrolled workers — there was no way to see which
deckies are running and where. Adds GET /swarm/deckies on the controller
(joins DeckyShard with SwarmHost for name/address/status) plus the CLI
wrapper with --host / --state filters and --json.
deploy --mode swarm was failing on every heterogeneous fleet: the master
populates config.interface from its own box (detect_interface() → its
default NIC), then ships that verbatim. The worker's deployer then calls
get_host_ip(config.interface), hits 'ip addr show wlp6s0' on a VM whose
NIC is enp0s3, and 500s.
Fix: agent.executor._relocalize() runs on every swarm-mode deploy.
Re-detects the worker's interface/subnet/gateway/host_ip locally and
swaps them into the config before calling deployer.deploy(). When the
worker's subnet doesn't match the master's, decky IPs are re-allocated
from the worker's subnet via allocate_ips() so they're reachable.
Unihost-mode configs are left untouched — they're already built against
the local box and second-guessing them would be wrong.
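A minimal sketch of the subnet check and re-allocation, using only the stdlib ipaddress module. `relocalize_ips` is a hypothetical stand-in; the real `_relocalize()` and `allocate_ips()` signatures are not shown in this log and may differ.

```python
import ipaddress
from typing import Iterable, List


def relocalize_ips(
    decky_ips: List[str],
    worker_subnet: str,
    reserved: Iterable[str] = (),
) -> List[str]:
    """Keep decky IPs that already fit the worker subnet, else re-allocate."""
    net = ipaddress.ip_network(worker_subnet)
    if all(ipaddress.ip_address(ip) in net for ip in decky_ips):
        return list(decky_ips)  # matching subnet: preserve the master's choice
    taken = {ipaddress.ip_address(ip) for ip in reserved}
    free = (host for host in net.hosts() if host not in taken)
    # Raises StopIteration if the subnet has too few free hosts.
    return [str(next(free)) for _ in decky_ips]
```

For example, `relocalize_ips(["192.168.1.50"], "10.0.2.0/24", reserved=["10.0.2.1"])` hands out the first free host in the worker's subnet instead of the unreachable master-side address.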
Validated against anti@192.168.1.13: master dispatched interface=wlp6s0,
agent logged 'relocalized interface=enp0s3', deployer ran successfully,
dry-run returned ok=deployed.
4 new tests cover both branches (matching-subnet preserves decky IPs;
mismatch re-allocates), the end-to-end executor.deploy() path, and the
unihost short-circuit.
The swarmctl API already exposes POST /swarm/check — an active mTLS
probe that refreshes SwarmHost.status + last_heartbeat for every
enrolled worker. The CLI was missing a wrapper, so operators had to
curl the endpoint directly (which is how the VM validation run did it,
and how the wiki Deployment-Modes / SWARM-Mode pages ended up doc'ing
a command that didn't exist yet).
Matches the existing list/enroll/decommission pattern: typer subcommand
under swarm_app, --url override, Rich table output plus --json for
scripting. Three tests: populated table, empty-swarm path, and --json
emission.
New `decnet listener` command runs the master-side RFC 5425 syslog-TLS
receiver as a standalone process (mirrors `decnet api` / `decnet swarmctl`
pattern, SIGTERM/SIGINT handlers, --daemon support).
`decnet agent` now accepts --agent-dir so operators running the worker
agent under sudo/root can point at a bundle outside /root/.decnet/agent
(the default location, since HOME resolves to /root under sudo).
Both flags were needed to stand up the full SWARM pipeline end-to-end on
a throwaway VM: mTLS control plane reachable, syslog-over-TLS wire
confirmed via tcpdump, master-crash/resume proved with zero loss and
zero duplication across 10 forwarded lines.
pyproject: bump asyncmy floor to 0.2.11 (resolver already pulled this in).
Covers failure modes the happy-path tests miss:
- log rotation (copytruncate): st_size shrinks under the forwarder, it
resets offset=0 and reships the new contents instead of getting wedged
past EOF;
- listener restart: forwarder retries, resumes from the persisted offset,
and the previously-acked lines are NOT duplicated on the master;
- listener tolerates a well-authenticated client that sends a partial
octet-count frame and drops — the server must stay up and accept
follow-on connections;
- peer_cn / fingerprint_from_ssl degrade to 'unknown' / None when no
peer cert is available (defensive path that otherwise rarely fires).
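The copytruncate case from the first bullet comes down to one size comparison. The sketch below is illustrative, not the forwarder's code: `read_new_bytes` is a hypothetical helper, and a real forwarder would also hold back a trailing partial line.

```python
import os
import tempfile
from typing import Tuple


def read_new_bytes(path: str, offset: int) -> Tuple[bytes, int]:
    """Return bytes appended since `offset`, restarting at 0 after truncation."""
    size = os.stat(path).st_size
    if size < offset:
        offset = 0  # file shrank under us: rotated via copytruncate
    with open(path, "rb") as fh:
        fh.seek(offset)
        data = fh.read()
    return data, offset + len(data)


def demo() -> Tuple[bytes, bytes, int]:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "decky.log")
        with open(path, "wb") as fh:
            fh.write(b"line1\nline2\n")
        first, offset = read_new_bytes(path, 0)
        with open(path, "wb") as fh:  # truncate, then the producer writes again
            fh.write(b"new\n")
        second, offset = read_new_bytes(path, offset)
        return first, second, offset
```

A size check alone cannot detect a file that is truncated and regrows past the old offset between two polls.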
New sub-app talks HTTP to the local swarm controller (127.0.0.1:8770 by
default; override with --url or $DECNET_SWARMCTL_URL).
- enroll: POSTs /swarm/enroll, prints fingerprint, optionally writes
ca.crt/worker.crt/worker.key to --out-dir for scp to the worker.
- list: renders enrolled workers as a rich table (with --status filter).
- decommission: looks up uuid by --name, confirms, DELETEs.
deploy --mode swarm now:
1. fetches enrolled+active workers from the controller,
2. round-robin-assigns host_uuid to each decky,
3. POSTs the sharded DecnetConfig to /swarm/deploy,
4. renders per-worker pass/fail in a results table.
Exits non-zero if no workers exist or any worker's dispatch failed.
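Step 2 can be sketched in a few lines. `assign_round_robin` is a hypothetical helper and the plain-string names and uuids are assumptions; the real code assigns host_uuid on DeckyConfig objects.

```python
from itertools import cycle
from typing import Dict, List


def assign_round_robin(decky_names: List[str], worker_uuids: List[str]) -> Dict[str, str]:
    """Map each decky to a worker uuid, cycling through workers in order."""
    if not worker_uuids:
        raise ValueError("no enrolled workers")  # mirrors the non-zero exit
    workers = cycle(worker_uuids)
    return {name: next(workers) for name in decky_names}
```

With three deckies and two workers, the first worker receives the first and third decky.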
The forwarder module existed but had no runner — closes that gap so the
worker-side process can actually be launched and runs isolated from the
agent (asyncio.run + SIGTERM/SIGINT → stop_event).
Guards: refuses to start without a worker cert bundle or a resolvable
master host ($DECNET_SWARM_MASTER_HOST or --master-host).
Worker-side log_forwarder tails the local RFC 5424 log file and ships
each line as an octet-counted frame to the master over mTLS. Offset is
persisted in a tiny local SQLite so master outages never cause loss or
duplication — reconnect resumes from the exact byte where the previous
session left off. Impostor workers (cert not signed by DECNET CA) are
rejected at TLS handshake.
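For reference, RFC 5425 octet counting frames each message as MSG-LEN SP SYSLOG-MSG, with the length in decimal octets. The sketch below shows that framing. `encode_frame` and `decode_frames` are hypothetical names, not the project's codec.

```python
from typing import List, Tuple


def encode_frame(line: bytes) -> bytes:
    """Prefix one syslog message with its octet count and a space."""
    return str(len(line)).encode("ascii") + b" " + line


def decode_frames(buffer: bytes) -> Tuple[List[bytes], bytes]:
    """Split complete frames off `buffer`; return (frames, unconsumed rest)."""
    frames: List[bytes] = []
    while True:
        sep = buffer.find(b" ")
        if sep == -1:
            break  # length prefix not complete yet
        length = int(buffer[:sep])  # ValueError on a malformed prefix
        end = sep + 1 + length
        if len(buffer) < end:
            break  # partial frame: wait for more bytes
        frames.append(buffer[sep + 1:end])
        buffer = buffer[end:]
    return frames, buffer
```

The returned remainder is what a receiver carries over to the next read, so a client that disconnects mid-frame just leaves discarded leftovers.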
Master-side log_listener terminates mTLS on 0.0.0.0:6514, validates the
client cert, extracts the peer CN as authoritative worker provenance,
and appends each frame to the master's ingest log files. Attacker-
controlled syslog HOSTNAME field is ignored — the CA-controlled CN is
the only source of provenance.
7 tests added covering framing codec, offset persistence across
reopens, end-to-end mTLS delivery, crash-resilience (offset survives
restart, no duplicate shipping), and impostor-CA rejection.
DECNET_SWARM_SYSLOG_PORT / DECNET_SWARM_MASTER_HOST env bindings
added.
_schemas.py was a local exception to the codebase convention. The rest
of the app keeps all API request/response DTOs in decnet/web/db/models.py
alongside UserResponse, DeployIniRequest, etc. — the swarm endpoints now
follow the same convention (SwarmEnrollRequest, SwarmHostView, etc).
Deletes decnet/web/router/swarm/_schemas.py.
Splits the three grouped router files into eight api_<verb>_<resource>.py
modules under decnet/web/router/swarm/ to match the convention used by
router/fleet/ and router/config/. Shared request/response models live in
_schemas.py. Keeps each endpoint easy to locate and modify without
stepping on siblings.
Adds decnet/web/swarm_api.py as an independent FastAPI app with routers
for host enrollment, deployment dispatch (sharding DecnetConfig across
enrolled workers via AgentClient), and active health probing. Runs as
its own uvicorn subprocess via 'decnet swarmctl', mirroring the isolation
pattern used by 'decnet api'. Also wires up 'decnet agent' CLI entry for
the worker side.
29 tests added under tests/swarm/test_swarm_api.py cover enrollment
(including bundle generation + duplicate rejection), host CRUD, sharding
correctness, non-swarm-mode rejection, teardown, and health probes with
a stubbed AgentClient.
- decnet.models.DeckyConfig grows an optional 'host_uuid' (the SwarmHost
that runs this decky). Defaults to None so legacy unihost state files
deserialize unchanged.
- decnet.agent.executor: replace non-existent config.name references
with config.mode / config.interface in logs and status payload.
- tests/swarm/test_state_schema.py covers legacy-dict roundtrip, field
default, and swarm-mode assignments.
decnet.swarm.client exposes:
- MasterIdentity / ensure_master_identity(): the master's own CA-signed
client bundle, issued once into ~/.decnet/ca/master/.
- AgentClient: async-context httpx wrapper that talks to a worker agent
over mTLS. health/status/deploy/teardown methods mirror the agent API.
SSL context is built from a bare ssl.SSLContext(PROTOCOL_TLS_CLIENT)
instead of httpx.create_ssl_context — the latter layers on default-CA
and purpose logic that broke private-CA mTLS. Server cert is pinned by
CA + chain, not DNS (workers enroll with arbitrary SANs).
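A bare context of this kind might be assembled as below. This is a sketch under assumptions: `build_mtls_client_context` and its optional file arguments are hypothetical, and the real AgentClient may set further options.

```python
import ssl
from typing import Optional


def build_mtls_client_context(
    ca_file: Optional[str] = None,
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Client context that trusts only the given CA and skips hostname checks."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False  # workers enroll with arbitrary SANs
    ctx.verify_mode = ssl.CERT_REQUIRED  # the chain must still validate
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)  # private CA only
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

Because no default CA bundle is loaded, a context built without `ca_file` trusts nothing and every handshake fails closed. httpx accepts such an object through its `verify=` argument.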
tests/swarm/test_client_agent_roundtrip.py spins uvicorn in-process
with real certs on disk and verifies:
- A CA-signed master client passes health + status calls.
- An impostor whose cert comes from a different CA cannot connect.
Worker agent (decnet.agent):
- mTLS FastAPI service exposing /deploy, /teardown, /status, /health,
/mutate. uvicorn enforces CERT_REQUIRED with the DECNET CA pinned.
- executor.py offloads the blocking deployer onto asyncio.to_thread so
the event loop stays responsive.
- server.py refuses to start without an enrolled bundle in
~/.decnet/agent/ — unauthenticated agents are not a supported mode.
- docs/openapi disabled on the agent — narrow attack surface.
tests/test_base_repo.py: DummyRepo was missing get_attacker_artifacts
(pre-existing abstractmethod) and so could not be instantiated. Added
the stub + coverage for the new swarm CRUD surface on BaseRepository.
decnet.swarm.pki provides:
- generate_ca() / ensure_ca() — self-signed root, PKCS8 PEM, 4096-bit.
- issue_worker_cert() — per-worker keypair + cert signed by the CA with
serverAuth + clientAuth EKU so the same identity backs the agent's
HTTPS endpoint AND the syslog-over-TLS upstream.
- write_worker_bundle() / load_worker_bundle() — persist with 0600 on
private keys.
- fingerprint() — SHA-256 DER hex for master-side pinning.
tests/swarm/test_pki.py covers:
- CA idempotency on disk.
- Signed chain validates against CA subject.
- SAN population (DNS + IP).
- Bundle roundtrip with 0600 key perms.
- End-to-end mTLS handshake between two CA-issued peers.
- Cross-CA client rejection (handshake fails).
Introduces the master-side persistence layer for swarm mode:
- SwarmHost: enrolled worker metadata, cert fingerprint, heartbeat.
- DeckyShard: per-decky host assignment, state, last error.
Repo methods are added as default-raising on BaseRepository so unihost
deployments are untouched; SQLModelRepository implements them (shared
between the sqlite and mysql subclasses per the existing pattern).
decnet.collector.log / decnet.system.log and the *.db-shm / *.db-wal
sidecars produced by the sqlite WAL journal were slipping through the
existing rules. Extend the patterns so runtime state doesn't show up
in git status.
Reference template for .env / .env.local showing every variable that
decnet/env.py consumes, with short rationale per section (system
logging, embedded workers, profiling, API server, …). Copy to .env
and fill in secrets; .env itself stays gitignored.
Exercises the JSON → syslog formatter end to end: flat fields ride as
SD params, bulky nested metadata collapses into the meta_json_b64 blob,
and the event_type / hostname / service mapping lands in the right
RFC 5424 header slots.
Frontend now handles syslog lines from producers that don't use
structured-data (notably the SSH PROMPT_COMMAND hook, which emits
'CMD uid=0 user=root src=IP pwd=… cmd=…' as a plain logger message).
A new parseEventBody utility splits the body into head + key/value
pairs and preserves the final value verbatim so commands stay intact.
Dashboard and LiveLogs use this parser to render consistent pills
whether the structure came from SD params or from the MSG body.
The host-side sniffer interface depends on the deploy's driver choice
(--ipvlan flag). Instead of hardcoding HOST_MACVLAN_IFACE, probe both
names and pick whichever exists; warn and disable cleanly if neither
is present. Explicit DECNET_SNIFFER_IFACE still wins.
- Relaxed RFC 5424 regex to accept either NILVALUE or a numeric PROCID;
sshd / sudo go through rsyslog with their real PID, while
syslog_bridge emitters keep using '-'.
- Added a fallback pass that scans the MSG body for IP-shaped
key=value tokens. This rescues attacker attribution for plain logger
callers like the SSH PROMPT_COMMAND shim, which emits
'CMD … src=IP …' without SD-element params.
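The fallback pass can be pictured as a regex scan plus validation. The sketch below is an assumption about its shape, not the collector's code: `ips_from_body` is a hypothetical name and only IPv4 is handled.

```python
import ipaddress
import re
from typing import Dict

# key=value tokens whose value looks like a dotted quad
_KV_IP = re.compile(r"\b([A-Za-z_][\w-]*)=(\d{1,3}(?:\.\d{1,3}){3})(?![\w.])")


def ips_from_body(msg: str) -> Dict[str, str]:
    """Return {key: ip} for every key=value token holding a valid IPv4."""
    found: Dict[str, str] = {}
    for key, candidate in _KV_IP.findall(msg):
        try:
            ipaddress.ip_address(candidate)  # rejects e.g. 999.1.1.1
        except ValueError:
            continue
        found.setdefault(key, candidate)  # first occurrence of a key wins
    return found
```

Validating with ipaddress keeps version-like strings that happen to match the pattern from being taken as attacker addresses.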
Each honeypot container now carries its own copy of the shared RFC 5424
formatter. Services that previously rolled their own ad-hoc syslog
lines can now import syslog_line / write_syslog_file for a consistent
SD-element format that the collector already knows how to parse.
Adds the server-side wiring and frontend UI to surface files captured
by the SSH honeypot for a given attacker.
- New repository method get_attacker_artifacts (abstract + SQLModel
impl) that joins the attacker's IP to `file_captured` log rows.
- New route GET /attackers/{uuid}/artifacts.
- New router /artifacts/{decky}/{service}/{stored_as} that streams a
quarantined file back to an authenticated viewer.
- AttackerDetail grows an ArtifactDrawer panel with per-file metadata
(sha256, size, orig_path) and a download action.
- ssh service fragment now sets NODE_NAME=decky_name so logs and the
host-side artifacts bind-mount share the same decky identifier.
The /opt/emit_capture.py, /opt/syslog_bridge.py, and
/usr/libexec/udev/journal-relay files were plaintext and world-readable
to any attacker root-shelled into the SSH honeypot — revealing the full
capture logic on a single cat.
Pack all three into /entrypoint.sh as XOR+gzip+base64 blobs at build
time (_build_stealth.py), then decode in-memory at container start and
exec the capture loop from a bash -c string. No .py files under /opt,
no journal-relay file under /usr/libexec/udev, no argv_zap name
anywhere. The LD_PRELOAD shim is installed as
/usr/lib/x86_64-linux-gnu/libudev-shared.so.1 — sits next to the real
libudev.so.1 and blends into the multiarch layout.
A 1-byte random XOR key is chosen at image build so a bare
'base64 -d | gunzip' probe on the visible entrypoint returns binary
noise instead of readable Python.
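The packing scheme can be illustrated as follows. The layer order (gzip, then XOR, then base64) is an assumption inferred from the remark that `base64 -d | gunzip` yields noise. `pack`, `unpack`, and `random_key` are hypothetical names, not the functions in _build_stealth.py.

```python
import base64
import gzip
import secrets


def random_key() -> int:
    """Pick a non-zero single-byte key (a zero key would be a no-op)."""
    return secrets.randbelow(255) + 1


def pack(source: bytes, key: int) -> str:
    """gzip the payload, XOR every byte with `key`, then base64 the result."""
    if not 1 <= key <= 255:
        raise ValueError("key must fit in one non-zero byte")
    masked = bytes(b ^ key for b in gzip.compress(source))
    return base64.b64encode(masked).decode("ascii")


def unpack(blob: str, key: int) -> bytes:
    """Reverse pack(): base64-decode, undo the XOR, then gunzip."""
    masked = base64.b64decode(blob)
    return gzip.decompress(bytes(b ^ key for b in masked))
```

The XOR layer only hides the gzip magic bytes from a casual probe. A one-byte key is obfuscation, not encryption.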
Docker-dependent tests live under tests/docker/ behind a new 'docker'
pytest marker (excluded from the default run, same pattern as fuzz /
live / bench).
The named pipe at /run/systemd/journal/syslog-relay had two problems
beyond its argv leak: any root-in-container process could (a) `cat`
the pipe and watch the live SIEM feed, and (b) write to it and inject
forged log lines. Since an attacker with a shell is already root
inside the honeypot, file permissions can't fix it.
Point rsyslog's auth/user actions directly at /proc/1/fd/1 — the
container-stdout fd Docker attached to PID 1 — and delete the
mkfifo + cat relay from the entrypoint. No pipe on disk, nothing to
read, nothing to inject, and one fewer cloaked process in `ps`.
Three leaks remained after the inotifywait argv fix:
1. The bash running journal-relay showed its argv[1] (the script path)
in /proc/PID/cmdline, producing a line like
'journal-relay /usr/libexec/udev/journal-relay'
Apply argv_zap.so to that bash too.
2. argv_zap previously hardcoded PR_SET_NAME to 'kmsg-watch', which was
wrong for any caller other than inotifywait. The comm name now comes
from ARGV_ZAP_COMM so each caller can pick its own (kmsg-watch for
inotifywait, journal-relay for the watcher bash).
3. The capture.sh header started with 'SSH honeypot file-catcher' —
fatal if an attacker runs 'cat' on it. Rewritten as a plausible
systemd-journal relay helper; stray 'attacker' / 'honeypot' words
in mid-script comments stripped too.
A lived-in Linux box ships with iputils-ping, ca-certificates, and nmap
available. Their absence is a cheap tell, and they're handy for letting
the attacker move laterally in ways we want to observe. iproute2 (ip a)
was already installed for attribution — noted here for completeness.
The kmsg-watch (inotifywait) process was the last honest giveaway in
`ps aux` — its watch paths and event flags betrayed the honeypot. The
argv_zap.so shim hooks __libc_start_main, heap-copies argv for the real
main, then memsets the contiguous argv[1..] region to NUL so the kernel's
cmdline reader returns just argv[0].
gcc is installed and purged in the same Docker layer to keep the image
slim. The shim also calls prctl(PR_SET_NAME) so /proc/self/comm mirrors
the argv[0] disguise.
exec -a replaces argv[0] so ps shows 'journal-relay /usr/libexec/udev/journal-relay'
instead of '/bin/bash /usr/libexec/udev/journal-relay' — no interpreter
hint on the watcher process.
Piping inotifywait into a while loop spawns a subshell for the tail of the pipeline, so
two bash processes (the script itself and the while-loop subshell)
showed up under /usr/libexec/udev/journal-relay in ps aux. Enable
lastpipe so the while loop runs in the main shell — ps now shows
one bash plus the inotify child, matching a simple udev helper.
Rename the container-side logging module decnet_logging → syslog_bridge
(canonical at templates/syslog_bridge.py, synced into each template by
the deployer). Drop the stale per-template copies; setuptools find was
picking them up anyway. Swap useradd/USER/chown "decnet" for "logrelay"
so no obvious token appears in the rendered container image.
Apply the same cloaking pattern to the telnet template that SSH got:
syslog pipe moves to /run/systemd/journal/syslog-relay and the relay
is cat'd via exec -a "systemd-journal-fwd". rsyslog.d conf rename
99-decnet.conf → 50-journal-forward.conf. SSH capture script:
/var/decnet/captured → /var/lib/systemd/coredump (real systemd path),
logger tag decnet-capture → systemd-journal. Compose volume updated
to match the new in-container quarantine path.
SD element ID shifts decnet@55555 → relay@55555; synced across
collector, parser, sniffer, prober, formatter, tests, and docs so the
host-side pipeline still matches what containers emit.
Rename the rsyslog→stdout pipe from /var/run/decnet-logs (dead giveaway)
to /run/systemd/journal/syslog-relay, and launch the relay via
exec -a "systemd-journal-fwd" so ps shows a plausible systemd forwarder
instead of a bare cat. Casual ps/ls inspection now shows nothing
with "decnet" in the name.
Old ps output was a dead giveaway: two "decnet-capture" bash procs
and a raw "inotifywait". Install script at /usr/libexec/udev/journal-relay
and invoke inotifywait through a /usr/libexec/udev/kmsg-watch symlink so
both now render as plausible udev/journal helpers under casual inspection.
fuser and /proc fd walks race scp/wget/sftp — by close_write the writer
has already closed the fd, so pid-chain attribution always resolved to
unknown for non-interactive drops. Fall back to the ss snapshot: one
established session → ss-only, multiple → ss-ambiguous (still record
src_ip from the first, analysts cross-check concurrent_sessions).
inotifywait watches writable paths in the SSH decky and mirrors any
file close_write/moved_to into a per-decky host-mounted quarantine dir.
Each artifact carries a .meta.json with attacker attribution resolved
by walking the writer PID's PPid chain to the sshd session leader,
then cross-referencing ss and utmp for source IP/user/login time.
Also emits an RFC 5424 syslog line per capture for SIEM correlation.
Commit-by-commit evidence of the perf work: each CSV is the raw
Locust output for the commit hash in its filename, plus the four
fb69a06 variants (single worker, tracing on/off, single-core pinned,
12 workers) referenced in the README baseline table.
Some pyinstrument frame trees contain branches where an identifier is
missing (typically at the very top or with certain async boundaries),
which crashed the aggregator with a KeyError mid-run. Short-circuit
on None frames and missing identifiers so a single ugly HTML no
longer kills the summary of the other few hundred.
asyncmy needs cryptography for caching_sha2_password (the MySQL 8
default auth plugin). Without it, connection handshake fails the
moment the server negotiates the modern plugin.
Capture Locust numbers from the fb69a06 branch across five
configurations so future regressions have something to measure against.
- 500u tracing-on single-worker: ~960 RPS / p99 2.9 s
- 1500u tracing-on single-worker: ~880 RPS / p99 9.5 s
- 1500u tracing-off single-worker: ~990 RPS / p99 8.4 s
- 1500u tracing-off pinned to one core: ~46 RPS / p99 122 s
- 1500u tracing-off 12 workers: ~1585 RPS / p99 4.2 s
Also note MySQL max_connections math ((pool_size + max_overflow) *
workers = 720) to explain why the default 151 needs bumping, and the
Python 3.14 GC segfault so nobody repeats that mistake.
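As a worked version of the connection math: SQLAlchemy's QueuePool opens at most pool_size + max_overflow connections per engine, and each uvicorn worker process holds its own engine. The values 20 and 40 below are assumptions chosen to reproduce the 720 figure, not settings quoted from the repo.

```python
def max_db_connections(pool_size: int, max_overflow: int, workers: int) -> int:
    """Upper bound on connections the whole API can open against MySQL."""
    per_engine = pool_size + max_overflow  # QueuePool ceiling per process
    return per_engine * workers


MYSQL_DEFAULT_MAX_CONNECTIONS = 151

# Hypothetical pool settings with the 12-worker run from the list above.
needed = max_db_connections(pool_size=20, max_overflow=40, workers=12)
exceeds_default = needed > MYSQL_DEFAULT_MAX_CONNECTIONS
```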
Previous attempt (shield + sync invalidate fallback) didn't work
because shield only protects against cancellation from *other* tasks.
When the caller task itself is cancelled mid-query, its next await
re-raises CancelledError as soon as the shielded coroutine yields —
rollback inside session.close() never completes, the aiomysql
connection is orphaned, and the pool logs 'non-checked-in connection'
when GC finally reaches it.
Hand exception-path cleanup to loop.create_task() so the new task
isn't subject to the caller's pending cancellation. close() (and the
invalidate() fallback for a dead connection) runs to completion.
Success path is unchanged — still awaits close() inline so callers
see commit visibility and pool release before proceeding.
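The hand-off described above can be sketched with a fake session. Everything here is illustrative: `FakeSession`, `session_scope`, and `demo` are hypothetical, and the real helper also covers the invalidate() fallback.

```python
import asyncio
from contextlib import asynccontextmanager

_cleanup_tasks: set = set()  # strong refs so pending cleanups are not GC'd


class FakeSession:
    """Stand-in for an async DB session whose close() must await I/O."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        await asyncio.sleep(0.01)  # simulated ROLLBACK + pool release
        self.closed = True


@asynccontextmanager
async def session_scope(session):
    try:
        yield session
    except BaseException:
        # Covers CancelledError. Awaiting close() here may be interrupted by
        # the caller's pending cancellation, so run it in a separate task.
        task = asyncio.get_running_loop().create_task(session.close())
        _cleanup_tasks.add(task)
        task.add_done_callback(_cleanup_tasks.discard)
        raise
    else:
        await session.close()  # success path stays inline


async def demo() -> bool:
    session = FakeSession()

    async def handler() -> None:
        async with session_scope(session):
            await asyncio.sleep(10)  # pretend query, cancelled below

    task = asyncio.create_task(handler())
    await asyncio.sleep(0)  # let the handler enter the scope
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    await asyncio.gather(*_cleanup_tasks)  # wait for background cleanup
    return session.closed
```

The module-level set matters because the event loop keeps only weak references to tasks.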
Under high-concurrency MySQL load, uvicorn cancels request tasks when
clients disconnect. If cancellation lands mid-query, session.close()
tries to ROLLBACK on a connection that aiomysql has already marked as
closed — raising InterfaceError("Cancelled during execution") and
leaving the connection checked-out until GC, which the pool then
warns about as a 'non-checked-in connection'.
The old fallback tried sync.rollback() + sync.close(), but those still
go through the async driver and fail the same way on a dead connection.
Replace them with session.sync_session.invalidate(), which just flips
the pool's internal record — no I/O, so it can't be cancelled — and
tells the pool to drop the connection immediately instead of waiting
for garbage collection.
Locust @task(2) hammers /auth/login in steady state on top of the
on_start burst. After caching the uuid-keyed user lookup and every
other read endpoint, login alone accounted for 47% of total
_execute at 500c/u — pure DB queueing on SELECT users WHERE
username=?.
5s TTL, positive hits only (misses bypass so a freshly-created
user can log in immediately). Password verify still runs against
the cached hash, so security is unchanged — the only staleness
window is: a changed password accepts the old password for up to
5s until invalidate_user_cache fires (it's called on every write).
The per-request SELECT users WHERE uuid=? in require_role was the
hidden tax behind every authed endpoint — it kept _execute at ~60%
across the profile even after the page caches landed. Even /health
(with its DB and Docker probes cached) was still 52% _execute from
this one query.
- dependencies.py: 10s TTL cache on get_user_by_uuid, well below JWT
expiry. invalidate_user_cache(uuid) is called on password change,
role change, and user delete.
- api_get_config.py: 5s TTL cache on the admin branch's list_users()
(previously fetched every /config call). Invalidated on user
create/update/delete.
- api_change_pass.py + api_manage_users.py: invalidation hooks on
all user-mutating endpoints.
Round-2 follow-up: profile at 500c/u showed _execute still dominating
the uncached read endpoints (/bounty 76%, /logs/histogram 73%,
/deckies 56%). Same router-level TTL pattern as /stats — 5s window,
asyncio.Lock to collapse concurrent calls into one DB hit.
- /bounty: cache default unfiltered page (limit=50, offset=0,
bounty_type=None, search=None). Filtered requests bypass.
- /logs/histogram: cache default (interval_minutes=15, no filters).
Filtered / non-default interval requests bypass.
- /deckies: cache full response (endpoint takes no params).
- /config: bump _STATE_TTL from 1.0 to 5.0 — admin writes are rare,
1s was too short for bursts to coalesce at high concurrency.
SQLite is a local file — a SELECT 1 per session checkout is pure
overhead. Env var DECNET_DB_POOL_PRE_PING stays for anyone running
on a network-mounted volume. MySQL backend keeps its current default.
Popen moved inside the try so a missing uvicorn falls through to the
existing error message instead of crashing the CLI. test_cli was still
patching the old subprocess.run entrypoint; switched both api command
tests to patch subprocess.Popen / os.killpg to match the current path.
Every /stats call ran SELECT count(*) FROM logs + SELECT count(DISTINCT
attacker_ip) FROM logs; every /logs and /attackers call ran an
unfiltered count for the paginator. At 500 concurrent users these
serialize through aiosqlite's worker threads and dominate wall time.
Cache at the router layer (repo stays dialect-agnostic):
- /stats response: 5s TTL
- /logs total (only when no filters): 2s TTL
- /attackers total (only when no filters): 2s TTL
Filtered paths bypass the cache. Pattern reused from api_get_config
and api_get_health (asyncio.Lock + time.monotonic window + lazy lock).
require_role._check previously chained from get_current_user, which
already loaded the user — then looked it up again. Inline the decode +
single user fetch + must_change_password + role check so every
authenticated request costs one SELECT users WHERE uuid=? instead of
two.
Only database, docker, and ingestion_worker now count as critical
(→ 503 unhealthy). attacker/sniffer/collector failures drop overall
status to degraded (still 200) so the dashboard doesn't panic when a
non-essential worker isn't running.
The ingester now accumulates up to DECNET_BATCH_SIZE rows (default 100)
or DECNET_BATCH_MAX_WAIT_MS (default 250ms) before flushing through
repo.add_logs — one transaction, one COMMIT per batch instead of per
row. Under attacker traffic this collapses N commits into ⌈N/100⌉ and
takes most of the SQLite writer-lock contention off the hot path.
Flush semantics are cancel-safe: _position only advances after a batch
commits successfully, and the flush helper bails without touching the
DB if the enclosing task is being cancelled (lifespan teardown).
Un-flushed lines stay in the file and are re-read on next startup.
Tests updated to assert on add_logs (bulk) instead of the per-row
add_log that the ingester no longer uses, plus a new test that 250
lines flush in ≤5 calls.
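The size-or-time flush rule can be sketched with an injected clock. `Batcher` is a hypothetical class, not the ingester's implementation; the defaults mirror DECNET_BATCH_SIZE and DECNET_BATCH_MAX_WAIT_MS.

```python
from typing import Any, Callable, List, Optional


class Batcher:
    """Collect rows and flush when the batch is full or has waited too long."""

    def __init__(
        self,
        flush: Callable[[List[Any]], None],
        batch_size: int = 100,
        max_wait_ms: float = 250,
    ) -> None:
        self._flush = flush
        self._batch_size = batch_size
        self._max_wait_ms = max_wait_ms
        self._rows: List[Any] = []
        self._first_at: Optional[float] = None

    def add(self, row: Any, now: float) -> None:
        if not self._rows:
            self._first_at = now  # the wait clock starts at the first row
        self._rows.append(row)
        self.tick(now)

    def tick(self, now: float) -> None:
        """Flush if either bound is reached; call periodically when idle."""
        if not self._rows:
            return
        full = len(self._rows) >= self._batch_size
        stale = (now - self._first_at) * 1000 >= self._max_wait_ms
        if full or stale:
            self._flush(list(self._rows))  # if this raises, rows are kept
            self._rows.clear()
```

Clearing the buffer only after the flush returns corresponds to advancing the file position only once a batch has committed.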
Adds BaseRepository.add_logs (default: loops add_log for backwards
compatibility) and a real single-session/single-commit implementation
on SQLModelRepository. Introduces DECNET_BATCH_SIZE (default 100) and
DECNET_BATCH_MAX_WAIT_MS (default 250) so the ingester can flush on
either a size or a time bound when it adopts the new method.
Ingester wiring is deferred to a later pass — the single-log path was
deadlocking tests when flushed during lifespan teardown, so this change
ships the DB primitive alone.
A module-level asyncio.Lock binds to the loop it was first awaited on.
Under pytest-anyio (and xdist) each test spins up a new loop; any later
test that hit /health or /config would wait on a lock owned by a dead
loop and the whole worker would hang.
Create the lock on first use and drop it in the test-reset helpers so a
fresh loop always gets a fresh lock.
Under CPU saturation the sync docker.from_env()/ping() calls could miss
their socket timeout, cache _docker_healthy=False, and return 503 for
the full 5s TTL window. Both calls now run on a thread so the event
loop keeps serving other requests while Docker is being probed.
With --workers > 1, SIGINT from the terminal raced uvicorn's supervisor:
some workers got signaled directly, the supervisor respawned them, and
the result behaved like a forkbomb. Start uvicorn in its own session and
signal the whole process group (SIGTERM → 10s grace → SIGKILL) when we
catch KeyboardInterrupt.
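On POSIX, the own-session launch and group shutdown could look like the sketch below. `start_in_own_session` and `stop_group` are hypothetical names. The 10s grace period comes from the text.

```python
import os
import signal
import subprocess
import sys
from typing import List


def start_in_own_session(argv: List[str]) -> subprocess.Popen:
    # start_new_session=True runs setsid() in the child, so it leads its own
    # process group and a terminal Ctrl+C no longer reaches it directly.
    return subprocess.Popen(argv, start_new_session=True)


def stop_group(proc: subprocess.Popen, grace: float = 10.0) -> int:
    """SIGTERM the whole group, wait up to `grace` seconds, then SIGKILL."""
    try:
        pgid = os.getpgid(proc.pid)
    except ProcessLookupError:
        return proc.wait()  # already gone and reaped elsewhere
    os.killpg(pgid, signal.SIGTERM)
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)
        return proc.wait()
```

A caller would wrap `proc.wait()` in `try/except KeyboardInterrupt` and call `stop_group(proc)` in the handler, so the supervisor and its workers are all signalled together.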
Forwards straight to uvicorn's --workers. Default stays at 1 so the
single-worker efficiency direction is preserved; raising it is available
for threat-actor load scenarios where the honeypot needs to soak real
attack traffic without queueing on one event loop.
Previously every user did login → change-pass → re-login in on_start
regardless of whether the server actually required a password change.
With bcrypt at ~250ms/call that's 3 bcrypt-bound requests per user.
At 2500 users the on_start queue was ~10k bcrypt ops — users never
escaped warmup, so @task endpoints never fired.
Login already returns must_change_password; only run the change-pass
+ re-login dance when the server says we have to. Cuts on_start from
3 requests to 1 for every user after the first DB initialization.
stdlib json was FastAPI's default. Every response body, every SSE frame,
and every add_log/state/payload write paid the stdlib encode cost.
- pyproject.toml: add orjson>=3.10 as a core dep.
- decnet/web/api.py: default_response_class=ORJSONResponse on the
FastAPI app, so every endpoint return goes through orjson without
touching call sites. Explicit JSONResponse sites in the validation
exception handlers migrated to ORJSONResponse for consistency.
- health endpoint's explicit JSONResponse → ORJSONResponse.
- SSE stream (api_stream_events.py): 6 json.dumps call sites →
orjson.dumps(...).decode() — the per-event frames that fire on every
sse tick.
- sqlmodel_repo.py: encode sites on the log-insert path switched to
orjson (fields, payload, state value). Parser sites (json.loads)
left as-is for now — not on the measured hot path.
Locust hit /health and /config on every @task(3), so each request was
firing repo.get_total_logs() and two repo.get_state() calls against
aiosqlite — filling the driver queue for data that changes on the order
of seconds, not milliseconds.
Both caches follow the shape already used by the existing Docker cache:
- asyncio.Lock with double-checked TTL so concurrent callers collapse
into one DB hit per 1s window.
- _reset_* helpers called from tests/api/conftest.py::setup_db so the
module-level cache can't leak across tests.
tests/test_health_config_cache.py asserts 50 concurrent callers
produce exactly 1 repo call, and the cache expires after TTL.
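A compact sketch of the double-checked TTL pattern with a lazily created lock follows. `TTLCache` and `demo` are hypothetical. The real caches are module-level state with `_reset_*` helpers rather than a class.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Optional


class TTLCache:
    """Cache one async result for `ttl` seconds; concurrent misses share a load."""

    def __init__(self, loader: Callable[[], Awaitable[Any]], ttl: float) -> None:
        self._loader = loader
        self._ttl = ttl
        self._lock: Optional[asyncio.Lock] = None  # created on first use
        self._value: Any = None
        self._expires = 0.0

    def reset(self) -> None:
        """Drop value and lock so a new event loop starts clean (for tests)."""
        self._lock = None
        self._value = None
        self._expires = 0.0

    async def get(self) -> Any:
        if time.monotonic() < self._expires:
            return self._value  # fast path, no lock
        if self._lock is None:
            self._lock = asyncio.Lock()
        async with self._lock:
            if time.monotonic() < self._expires:
                return self._value  # another caller refreshed while we waited
            self._value = await self._loader()
            self._expires = time.monotonic() + self._ttl
            return self._value


async def demo() -> int:
    calls = 0

    async def load_total_logs() -> int:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # simulated DB round trip
        return 1234

    cache = TTLCache(load_total_logs, ttl=1.0)
    results = await asyncio.gather(*(cache.get() for _ in range(50)))
    assert results == [1234] * 50
    return calls
```

The second expiry check inside the lock is what collapses the 50 waiters into a single loader call.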
Creating a new docker.from_env() client per /health request opened a
fresh unix-socket connection each time. Under load that's wasteful and
hammers dockerd.
Keep a module-level client + last-check timestamp; actually ping every
5 seconds, return cached state in between. Reset helper provided for
tests.
- aiomysql → asyncmy on both sides of the URL/import (faster, maintained).
- Pool sizing now reads DECNET_DB_POOL_SIZE / MAX_OVERFLOW / RECYCLE /
PRE_PING for both SQLite and MySQL engines so stress runs can bump
without code edits.
- MySQL initialize() now wraps schema DDL in a GET_LOCK advisory lock so
concurrent uvicorn workers racing create_all() don't hit 'Table was
skipped since its definition is being modified by concurrent DDL'.
- sqlite & mysql repo get_log_histogram use the shared _session() helper
instead of session_factory() for consistency with the rest of the repo.
- SSE stream_events docstring updated to asyncmy.
verify_password / get_password_hash are CPU-bound and take ~250ms each
at rounds=12. Called directly from async endpoints, they stall every
other coroutine for that window, which makes them the largest
single-worker bottleneck on the login path.
Adds averify_password / ahash_password that wrap the sync versions in
asyncio.to_thread. Sync versions stay put because _ensure_admin_user and
tests still use them.
5 call sites updated across login, change-password, create-user, and
reset-password.
tests/test_auth_async.py asserts parallel averify runs concurrently (~1x
of a single verify, not 2x).
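The wrapper pattern is small enough to show in full. In this sketch PBKDF2 from hashlib stands in for bcrypt so the example needs no third-party package, and the explicit `salt` argument is an assumption of the stand-in; the real function signatures may differ.

```python
import asyncio
import hashlib
import hmac

ROUNDS = 100_000  # stand-in work factor, not the project's bcrypt rounds=12


def get_password_hash(password: str, salt: bytes) -> bytes:
    """Blocking, CPU-bound hash (PBKDF2 here instead of bcrypt)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ROUNDS)


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Blocking verify using a constant-time comparison."""
    return hmac.compare_digest(get_password_hash(password, salt), expected)


async def ahash_password(password: str, salt: bytes) -> bytes:
    # Run the sync function on a worker thread so the event loop stays free.
    return await asyncio.to_thread(get_password_hash, password, salt)


async def averify_password(password: str, salt: bytes, expected: bytes) -> bool:
    return await asyncio.to_thread(verify_password, password, salt, expected)
```

An async endpoint then calls `await averify_password(...)`, while sync callers keep using `verify_password` unchanged.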
_ensure_admin_user was strict insert-if-missing: once a stale hash landed
in decnet.db (e.g. from a deploy that used a different DECNET_ADMIN_PASSWORD),
login silently 401'd because changing the env var later had no effect.
Now on startup: if the admin still has must_change_password=True (they
never finalized their own password), re-sync the hash from the current
env var. Once the admin sets a real password, we leave it alone.
Found via locustfile.py login storm — see tests/test_admin_seed.py.
Note: this commit also bundles uncommitted pool-management work already
present in sqlmodel_repo.py from prior sessions.
Parses every HTML in profiles/, reattributes [self]/[await] synthetic
leaves to their parent function, and reports per-endpoint wall-time
(mean/p50/p95/max) plus top hot functions by cumulative self-time.
Makes post-locust profile dirs actually readable — otherwise they're
just a pile of hundred-plus HTML files.
When decnet.system.log is root-owned (e.g. created by a pre-fix 'sudo
decnet deploy') and a subsequent non-root process tries to log, the
InodeAwareRotatingFileHandler raised PermissionError out of emit(),
which propagated up through logger.debug/info and killed the collector's
log stream loop ('log stream ended ... reason=[Errno 13]').
Now matches stdlib behaviour: wrap _open() in try/except OSError and
defer to handleError() on failure. Adds a regression test.
Also: scripts/profile/view.sh 'pyinstrument' keyword was matching
memray-flamegraph-*.html files. Exclude the memray-* prefix.
Reads the memray usage CSV and emits a verdict based on tail-drop-from-
peak: CLIMB-AND-DROP, MOSTLY-RELEASED, or SUSTAINED-AT-PEAK. Deliberately
ignores net-growth-vs-baseline since any active workload grows vs. a cold
interpreter — that metric is misleading as a leak signal.
Mirrors the inode-check fix from 935a9a5 (collector worker) for the
stdlib-handler-based log paths. Both decnet.system.log (config.py) and
decnet.log (logging/file_handler.py) now use a subclass that stats the
target path before each emit and reopens on inode/device mismatch —
matching the behavior of stdlib WatchedFileHandler while preserving
size-based rotation.
Previously: rm decnet.system.log → handler kept writing to the orphaned
inode until maxBytes triggered; all lines between were lost.
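A rough sketch of the stat-before-emit idea, assuming a RotatingFileHandler subclass. It shows the technique only and may differ from the real handler in locking and error handling.

```python
import logging
import logging.handlers
import os


class InodeAwareRotatingFileHandler(logging.handlers.RotatingFileHandler):
    """Reopen the log file when the path no longer points at our open inode."""

    def _is_stale(self) -> bool:
        if self.stream is None:
            return False  # nothing open yet; emit() will open lazily
        try:
            on_disk = os.stat(self.baseFilename)
        except FileNotFoundError:
            return True   # file was deleted or renamed away
        ours = os.fstat(self.stream.fileno())
        return (on_disk.st_ino, on_disk.st_dev) != (ours.st_ino, ours.st_dev)

    def emit(self, record):
        try:
            if self._is_stale():
                self.stream.close()
                self.stream = None
                self.stream = self._open()
        except OSError:
            # Match stdlib behaviour: report, never raise out of emit().
            self.handleError(record)
            return
        super().emit(record)
```

Size-based rotation is untouched because the parent class's emit() still runs its rollover check.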
'sudo decnet deploy' needs root for MACVLAN, but the log files it creates
(decnet.log and decnet.system.log) end up owned by root. A subsequent
non-root 'decnet api' then crashes on PermissionError appending to them.
New decnet.privdrop helper reads SUDO_UID/SUDO_GID and chowns files/dirs
back to the invoking user. Best-effort: no-op when not root, not under
sudo, path missing, or chown fails. Applied at both log-file creation
sites (config.py system log, logging/file_handler.py syslog file).
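The best-effort chown can be sketched as below. The function name and return value are hypothetical; only the SUDO_UID/SUDO_GID lookup and the no-op conditions come from the description above.

```python
import os


def chown_to_invoking_user(path) -> bool:
    """Hand `path` back to the user who ran sudo. Returns True on success.

    No-op (returns False) when not root, not under sudo, the path is
    missing, or chown fails for any other reason.
    """
    if os.geteuid() != 0:
        return False
    uid = os.environ.get("SUDO_UID")
    gid = os.environ.get("SUDO_GID")
    if uid is None or gid is None:
        return False
    try:
        os.chown(path, int(uid), int(gid))
    except (OSError, ValueError):
        # Missing path, permission problem, or a non-numeric env value.
        return False
    return True
```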
The API's lifespan unconditionally spawned a MACVLAN sniffer task, which
duplicated the standalone 'decnet sniffer --daemon' process that
'decnet deploy' always starts — causing two workers to sniff the same
interface, double events, and wasted CPU.
Mirror the existing DECNET_EMBED_PROFILER pattern: sniffer is OFF by
default, opt in explicitly. Static regression tests guard against
accidental removal of the gate.
Without it, memray stats report 'Total number of frames seen: 0' and
flamegraphs render empty / C-only. Also added --follow-fork so uvicorn
workers spawned as child processes are tracked.
Dispatches by extension: .prof -> snakeviz, memray .bin -> memray flamegraph
(overridable via VIEW=table|tree|stats|summary|leaks), .svg/.html -> xdg-open.
Positional arg can be a file path or a type keyword (cprofile, memray, pyspy,
pyinstrument).
Root cause of 'No python processes found in process <pid>': py-spy needs
per-release ABI knowledge and 0.4.1 (latest PyPI) predates 3.14. Wrapper
now detects the interpreter and points users at pyinstrument/memray/cProfile.
The builder in decnet/web/db/mysql/database.py emits 'mysql+asyncmy://' URLs
(asyncmy is the declared dep in pyproject.toml). Tests were stale from a
prior aiomysql era.
New `profile` optional-deps group, opt-in Pyinstrument ASGI middleware
gated by DECNET_PROFILE_REQUESTS, bench marker + tests/perf/ micro-benchmarks
for repository hot paths, and scripts/profile/ helpers for py-spy/cProfile/memray.
Root cause: test_schemathesis.py mutates decnet.web.auth.SECRET_KEY at
module-level import time, poisoning JWT verification for all other tests
in the same process — even when fuzz tests are deselected.
- Add pytest_ignore_collect hook in tests/api/conftest.py to skip
collecting test_schemathesis.py unless -m fuzz is selected
- Add --dist loadscope to addopts so xdist groups by module (protects
module-scoped fixtures in live tests)
- Remove now-unnecessary xdist_group markers from live test classes
- Add 403 response to all RBAC-gated endpoints (schemathesis UndefinedStatusCode)
- Add 400 response to all endpoints accepting JSON bodies (malformed input)
- Add required 'title' field to schemathesis.toml for schemathesis 4.15+
- Add xdist_group markers to live tests with module-scoped fixtures to
prevent xdist from distributing them across workers (fixture isolation)
Extends tracing to every remaining module: all 23 API route handlers,
correlation engine, sniffer (fingerprint/p0f/syslog), prober (jarm/hassh/tcpfp),
profiler behavioral analysis, logging subsystem, engine, and mutator.
Bridges the ingester→SSE trace gap by persisting trace_id/span_id columns on
the logs table and creating OTEL span links in the SSE endpoint. Adds log-trace
correlation via _TraceContextFilter injecting otel_trace_id into Python LogRecords.
Includes development/docs/TRACING.md with full span reference (76 spans),
pipeline propagation architecture, quick start guide, and troubleshooting.
Collector now creates a span per event and injects W3C trace context
into JSON records. Ingester extracts that context and creates child
spans, connecting the full event journey: collector -> ingester ->
db.add_log + extract_bounty -> db.add_bounty.
Profiler now creates per-IP spans inside update_profiles with rich
attributes (event_count, is_traversal, bounty_count, command_count).
Traces in Jaeger now show the complete execution map from capture
through ingestion and profiling.
Replace brittle explicit method-by-method proxy with __getattr__-based
dynamic proxy that forwards all args/kwargs to the inner repo. Fixes
TypeError on get_logs_after_id() where concrete repo accepts extra
kwargs beyond the ABC signature.
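A minimal sketch of a `__getattr__`-based forwarding proxy. The `on_call` hook is a hypothetical stand-in for span creation, and the real TracedRepository may differ.

```python
import functools


class TracedRepository:
    """Forward every attribute to the wrapped repo, whatever its signature."""

    def __init__(self, inner, on_call=None):
        self._inner = inner
        self._on_call = on_call or (lambda name: None)

    def __getattr__(self, name):
        # Only reached when normal lookup fails, i.e. for the inner repo's API.
        attr = getattr(self._inner, name)
        if not callable(attr):
            return attr

        @functools.wraps(attr)
        def wrapper(*args, **kwargs):
            self._on_call(name)            # e.g. start a span here
            return attr(*args, **kwargs)   # extra kwargs pass straight through

        return wrapper
```

Because the wrapper accepts `*args, **kwargs`, a concrete repo can grow optional parameters without the proxy raising TypeError. For async methods the wrapper returns the coroutine, which the caller awaits as usual.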
Pin DECNET_DEVELOPER_TRACING=false in conftest.py so .env.local
settings don't leak into the test suite.
Gated by DECNET_DEVELOPER_TRACING env var (default off, zero overhead).
When enabled, traces flow through FastAPI routes, background workers
(collector, ingester, profiler, sniffer, prober), engine/mutator
operations, and all DB calls via TracedRepository proxy.
Includes Jaeger docker-compose for local dev and 18 unit tests.
resp.read(4096) blocks until 4096 bytes accumulate, which stalls SSE
events (~100-500 bytes each) in the proxy buffer indefinitely. Switch
to read1() which returns bytes immediately available without waiting
for more. Also disable the 120s socket timeout for SSE connections.
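The difference can be reproduced without a network, assuming a pipe stands in for the upstream socket. `read1()` exists on both `io.BufferedReader` and `http.client.HTTPResponse`; the demo function name is illustrative.

```python
import io
import os


def read_one_sse_event() -> bytes:
    """Write a small SSE event into a pipe and read it back with read1()."""
    r_fd, w_fd = os.pipe()
    reader = io.BufferedReader(io.FileIO(r_fd, "r"))
    try:
        os.write(w_fd, b"data: ping\n\n")
        # read1() does at most one underlying read and returns what is
        # available; read(4096) would wait here for 4096 bytes or EOF.
        return reader.read1(4096)
    finally:
        os.close(w_fd)
        reader.close()
```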
The collector spawned one permanent thread per Docker container via
asyncio.to_thread(), saturating the default asyncio executor. This
starved short-lived to_thread(load_state) calls in get_deckies() and
get_stats_summary(), causing the SSE stream and deckies endpoints to
hang indefinitely while other DB-only endpoints worked fine.
Give the collector and sniffer their own ThreadPoolExecutor so they
never compete with the default pool.
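A sketch of the separation, assuming `loop.run_in_executor` with a dedicated pool. The pool size, thread-name prefix and function names are illustrative.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor

# Long-lived blocking streams get their own pool (size is an assumption).
COLLECTOR_EXECUTOR = ThreadPoolExecutor(max_workers=8, thread_name_prefix="collector")


def stream_container_logs(container_id: str) -> str:
    """Stand-in for a blocking per-container log stream."""
    return threading.current_thread().name


def load_state() -> str:
    """Stand-in for a short blocking call made by request handlers."""
    return threading.current_thread().name


async def main():
    loop = asyncio.get_running_loop()
    long_lived = await loop.run_in_executor(
        COLLECTOR_EXECUTOR, stream_container_logs, "abc123"
    )
    # to_thread() keeps using the default executor, which stays uncontended.
    short_lived = await asyncio.to_thread(load_state)
    return long_lived, short_lived
```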
decnet deploy spawns a standalone profiler daemon, and api.py was also starting
attacker_profile_worker as an asyncio task inside the web server. Both instances
shared the same attacker_worker_cursor key in the state table, causing a race
where one instance could skip events already claimed by the other or overwrite
the cursor mid-batch.
Default is now OFF (embedded profiler disabled). The standalone daemon started
by decnet deploy is the single authoritative instance. Set DECNET_EMBED_PROFILER=true
only when running decnet api in isolation without a full deploy.
The active prober emits tcpfp_fingerprint events with TTL, window, MSS etc.
from the attacker's SYN-ACK. These were invisible to the behavioral profiler
for two reasons:
1. target_ip (prober's field name for attacker IP) was not in _IP_FIELDS in
collector/worker.py or correlation/parser.py, so the profiler re-parsed
raw_lines and got attacker_ip=None, never attributing prober events to
the attacker profile.
2. sniffer_rollup only handled tcp_syn_fingerprint (passive sniffer) and
ignored tcpfp_fingerprint (active prober). Prober events use different
field names: window_size/window_scale/sack_ok vs window/wscale/has_sack.
Changes:
- Add target_ip to _IP_FIELDS in collector and parser
- Add _PROBER_TCPFP_EVENT and _INITIAL_TTL table to behavioral.py
- sniffer_rollup now processes tcpfp_fingerprint: maps field names, derives
OS from TTL via _os_from_ttl, computes hop_distance = initial_ttl - observed
- Expand prober DEFAULT_TCPFP_PORTS to [22,80,443,8080,8443,445,3389] for
better SYN-ACK coverage on attacker machines
- Add 4 tests covering prober OS detection, hop distance, and field mapping
Templates for http, https, k8s, and docker_api log the client IP as
remote_addr (Flask's request.remote_addr) instead of src_ip. The collector
and correlation parser only checked src_ip/src/client_ip/remote_ip/ip, so
every request event from those services was stored with attacker_ip="Unknown"
and never associated with any attacker profile.
Adding remote_addr to _IP_FIELDS in both collector/worker.py and
correlation/parser.py fixes attribution. The profiler cursor was also reset
to 0 so the worker performs a cold rebuild and re-ingests existing events with
the corrected field mapping.
templates/decnet_logging.py calls str(v) on all SD-PARAM values, turning a
headers dict into Python repr ("{'User-Agent': ...}") rather than JSON.
detect_tools_from_headers() called json.loads() on that string and silently
swallowed the error, returning [] for every HTTP event. Same bug prevented
the ingester from extracting User-Agent bounty fingerprints.
- templates/http/server.py: wrap headers dict in json.dumps() before passing
to syslog_line so the value is a valid JSON string in the syslog record
- behavioral.py: add ast.literal_eval fallback for existing DB rows that were
stored with the old Python repr format
- ingester.py: parse headers as JSON string in _extract_bounty so User-Agent
fingerprints are stored correctly going forward
- tests: add test_json_string_headers and test_python_repr_headers_fallback
to exercise both formats in detect_tools_from_headers
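A sketch of the two-format parse, assuming headers arrive either as a dict, a JSON string, or a legacy Python repr string. The helper name `parse_headers` is hypothetical.

```python
import ast
import json


def parse_headers(raw) -> dict:
    """Accept a dict, a JSON string, or a legacy Python-repr string."""
    if isinstance(raw, dict):
        return raw
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        # Old rows were stored via str(dict); literal_eval reads that safely
        # because it evaluates literals only, never arbitrary code.
        try:
            parsed = ast.literal_eval(raw)
        except (ValueError, SyntaxError, TypeError):
            return {}
    return parsed if isinstance(parsed, dict) else {}
```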
Replaces the single persistent open() with inode-based reopen logic.
If decnet.log or decnet.json is deleted or renamed by logrotate, the
next write detects the stale inode, closes the old handle, and creates
a fresh file — preventing silent data loss to orphaned inodes.
- Ingester now loads byte-offset from DB on startup (key: ingest_worker_position)
and saves it after each batch — prevents full re-read on every API restart
- On file truncation/rotation the saved offset is reset to 0
- Profiler worker now loads last_log_id from DB on startup — every restart
becomes an incremental update instead of a full cold rebuild
- Updated all affected tests to mock get_state/set_state; added new tests
covering position restore, set_state call, truncation reset, and cursor
restore/cold-start paths
Cold start fetched all logs in one bulk query then processed them in a tight
synchronous loop with no yields, blocking the asyncio event loop for seconds
on datasets of 30K+ rows. This stalled every concurrent await — including the
SSE stream generator's initial DB calls — causing the dashboard to show
INITIALIZING SENSORS indefinitely.
Changes:
- Drop _cold_start() and get_all_logs_raw(); uninitialized state now runs the
same cursor loop as incremental, starting from last_log_id=0
- Yield to the event loop after every _BATCH_SIZE rows (asyncio.sleep(0))
- Add SSE keepalive comment as first yield so the connection flushes before
any DB work begins
- Add Cache-Control/X-Accel-Buffering headers to StreamingResponse
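The cooperative yield can be sketched like this. `_BATCH_SIZE` is named in the change list, but its value and the `process_rows` helper are assumptions.

```python
import asyncio

_BATCH_SIZE = 500  # illustrative value


async def process_rows(rows, handle, batch_size=_BATCH_SIZE):
    """Run a synchronous handler over rows, yielding once per batch."""
    for count, row in enumerate(rows, start=1):
        handle(row)
        if count % batch_size == 0:
            # sleep(0) suspends this coroutine for one loop iteration so
            # other tasks (for example an SSE generator) can make progress.
            await asyncio.sleep(0)
```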
Existing MySQL databases hit a DataError when the commands/fingerprints
JSON blobs exceed 64 KiB (TEXT limit). _BIG_TEXT emits MEDIUMTEXT only
at CREATE TABLE time; create_all() is a no-op on existing columns.
Add MySQLRepository._migrate_column_types() that queries
information_schema and issues ALTER TABLE … MODIFY COLUMN … MEDIUMTEXT
for the five affected columns (commands, fingerprints, services, deckies,
state.value) whenever they are still TEXT. Called from an overridden
initialize() after _migrate_attackers_table() and before create_all().
Add tests/test_mysql_migration.py covering: ALTER issued for TEXT columns,
no-op for already-MEDIUMTEXT, idempotency, DEFAULT clause correctness,
and initialize() call order.
- test_mysql_backend_live.py: live integration tests for MySQL connections
- test_mysql_histogram_sql.py: dialect-specific histogram query tests
- test_mysql_url_builder.py: MySQL connection string construction
- mysql_spinup.sh: Docker spinup script for local MySQL testing
- templates/sniffer/decnet_logging.py: add logging configuration for sniffer integration
- templates/ssh/decnet_logging.py: add SSH service logging template
- development/DEVELOPMENT.md: document new MySQL backend, p0f, profiler, config API features
- pyproject.toml: update dependencies for MySQL, p0f, profiler functionality
- decnet/profiler/: analyze attacker behavior timings, command sequences, service probing patterns
- Enables detection of coordinated attacks vs random scanning
- Feeds into attacker scoring and risk assessment
- Implement MySQLRepository extending BaseRepository
- Add SQLAlchemy/SQLModel ORM abstraction layer (sqlmodel_repo.py)
- Support connection pooling and tuning via DECNET_DB_URL env var
- Cross-compatible with SQLite backend via factory pattern
- Prepared for production deployment with MySQL SIEM/ELK integration
- Add @require_role() decorators to all GET/POST/PUT endpoints
- Centralize role-based access control: the RBAC null-role bug required server-side gating
- Admin (manage_admins), Editor (write ops), Viewer (read ops), Public endpoints
- Remove client-side role checks; server-side gating is mandatory
- Refactor deploy command to support service randomization and selective service deployment
- Add --services flag to filter deployed services by name
- Improve status and teardown command output formatting
- Update help text for clarity
- Extract dialect-agnostic methods to BaseRepository
- Keep only SQLite-specific SQL and initialization in SQLiteRepository
- Reduces duplication for upcoming MySQL backend
- Maintains 100% backward compatibility
- Add `get_repository()` factory function to select DB implementation at runtime via DECNET_DB_TYPE env var
- Extract BaseRepository abstract interface from SQLiteRepository
- Update dependencies to use factory-based repository injection
- Add DECNET_DB_TYPE env var support (defaults to sqlite)
- Refactor models and repository base class for cross-dialect compatibility
Connection-lifecycle events (connect, disconnect, accept, close) fire once
per TCP connection. During a portscan or credential-stuffing run this
firehoses the SQLite ingester with tiny WAL writes and starves all reads
until the queue drains.
The collector now deduplicates these events by
(attacker_ip, decky, service, event_type) over a 1-second window before
writing to the .json ingestion stream. The raw .log file is untouched, so
rsyslog/SIEM still see every event for forensic fidelity.
Tunable via DECNET_COLLECTOR_RL_WINDOW_SEC and DECNET_COLLECTOR_RL_EVENT_TYPES.
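A minimal sketch of the windowed de-duplication, assuming a per-key last-seen timestamp. The class name, the injected clock and the default event-type set are illustrative, and a production version would also need to prune old keys.

```python
import time

_LIFECYCLE_EVENTS = frozenset({"connect", "disconnect", "accept", "close"})


class LifecycleRateLimiter:
    """Let one lifecycle event per key through per window; pass the rest."""

    def __init__(self, window_sec=1.0, event_types=_LIFECYCLE_EVENTS,
                 clock=time.monotonic):
        self._window = window_sec
        self._event_types = event_types
        self._clock = clock
        self._last_seen = {}

    def should_emit(self, attacker_ip, decky, service, event_type) -> bool:
        if event_type not in self._event_types:
            return True  # only connection-lifecycle events are limited
        key = (attacker_ip, decky, service, event_type)
        now = self._clock()
        last = self._last_seen.get(key)
        if last is not None and now - last < self._window:
            return False
        self._last_seen[key] = now
        return True
```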
Remove unused imports (ruff F401), suppress B324 false positives on
spec-mandated MD5 in HASSH/JA3/JA3S fingerprinting, drop unused
record_version assignment in JARM parser, and pin pip>=26.0 in dev
deps to address CVE-2025-8869 and CVE-2026-1703.
The live test modules set DECNET_CONTRACT_TEST=true at module level,
which persisted across xdist workers and caused the mutate endpoint
to short-circuit before the mock was reached. Clear the env var in
affected tests with monkeypatch.delenv.
21 live tests covering all background workers against real resources:
collector (real Docker daemon), ingester (real filesystem + DB),
attacker worker (real DB profiles), sniffer (real network interfaces),
API lifespan (real health endpoint), and cross-service cascade isolation.
9 tests covering auth enforcement, component reporting, status
transitions, degraded mode, and real DB/Docker state validation.
Runs with -m live alongside other live service tests.
23 tests verifying that each background worker degrades gracefully
when its dependencies are unavailable, and that failures don't cascade:
- Collector: Docker unavailable, no state file, empty fleet
- Ingester: missing log file, unset env var, malformed JSON, fatal DB
- Attacker: DB errors, empty database
- Sniffer: missing interface, no state, scapy crash, non-decky traffic
- API lifespan: all workers failing, DB init failure, sniffer import fail
- Cascade: collector→ingester, ingester→attacker, sniffer→collector, DB→sniffer
Replace per-decky sniffer containers with a single host-side sniffer
that monitors all traffic on the MACVLAN interface. Runs as a background
task in the FastAPI lifespan alongside the collector, fully fault-isolated
so failures never crash the API.
- Add fleet_singleton flag to BaseService; sniffer marked as singleton
- Composer skips fleet_singleton services in compose generation
- Fleet builder excludes singletons from random service assignment
- Extract TLS fingerprinting engine from templates/sniffer/server.py
into decnet/sniffer/ package (parameterized for fleet-wide use)
- Sniffer worker maps packets to deckies via IP→name state mapping
- Original templates/sniffer/server.py preserved for future use
All info sections (Timeline, Services, Deckies, Commands, Fingerprints)
now have clickable headers with a chevron toggle to expand/collapse
content. Pagination controls in Commands stay clickable without
triggering the collapse. All sections default to open.
Replace flat fingerprint card list with a structured section that
groups fingerprints by type under two categories: Active Probes
(JARM, HASSH, TCP/IP) and Passive Fingerprints (TLS, certificates,
latency, etc.). Each group shows its icon, label, and count.
AttackerDetail: dedicated render components for JARM (hash + target),
HASSHServer (hash, banner, expandable KEX/encryption algorithms), and
TCP/IP stack (TTL, window, MSS as bold stats, DF/SACK/TS as tags,
options order string).
Bounty: add fingerprint field labels and priority keys so prober
bounties display structured rows instead of raw JSON. Add FINGERPRINTS
filter option to the type dropdown.
Extends the prober with two new active probe types alongside JARM:
- HASSHServer: SSH server fingerprinting via KEX_INIT algorithm ordering
(MD5 hash of kex;enc_s2c;mac_s2c;comp_s2c, pure stdlib)
- TCP/IP stack: OS/tool fingerprinting via SYN-ACK analysis using scapy
(TTL, window size, DF bit, MSS, TCP options ordering, SHA256 hash)
Worker probe cycle now runs three phases per IP with independent
per-type port tracking. Ingester extracts bounties for all three
fingerprint types.
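The HASSHServer hash described above can be sketched with the standard library alone. This assumes the four algorithm lists have already been parsed out of the server's KEX_INIT packet; the function name is illustrative.

```python
import hashlib


def hassh_server(kex, enc_s2c, mac_s2c, comp_s2c) -> str:
    """MD5 over 'kex;enc_s2c;mac_s2c;comp_s2c', each list comma-joined
    in the order the server advertised it."""
    raw = ";".join(",".join(algos) for algos in (kex, enc_s2c, mac_s2c, comp_s2c))
    # MD5 is mandated by the fingerprint format; it is not used for security.
    return hashlib.md5(raw.encode("ascii"), usedforsecurity=False).hexdigest()
```

Ordering matters: two servers offering the same algorithms in a different order produce different fingerprints, which is what makes the hash useful.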
Reverts commits 8c249f6, a6c7cfd, 7ff5703. The SSH log relay approach
requires container redeployment and doesn't retroactively fix existing
attacker profiles. Rolling back to reassess the approach.
New log_relay.py replaces raw 'cat' on the rsyslog pipe. Intercepts
sshd and bash lines and re-emits them as structured RFC 5424 events:
login_success, session_opened, disconnect, connection_closed, command.
Parsers updated to accept non-nil PROCID (sshd uses PID).
The SSH honeypot logs commands via PROMPT_COMMAND logger as:
<14>1 ... bash - - - CMD uid=0 pwd=/root cmd=ls
These lines had service=bash and event_type=-, so the attacker worker
never recognized them as commands. Both the collector and correlation
parsers now detect the CMD pattern and normalize to service=ssh,
event_type=command, with uid/pwd/command in fields.
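A sketch of the normalization step, assuming the message body has the `CMD uid=… pwd=… cmd=…` shape shown above. The regex and function name are hypothetical, and a working directory containing spaces would need a stricter format than this handles.

```python
import re

_CMD_RE = re.compile(r"^CMD uid=(?P<uid>\d+) pwd=(?P<pwd>\S+) cmd=(?P<command>.*)$")


def normalize_bash_cmd(service: str, event_type: str, msg: str):
    """Rewrite bash PROMPT_COMMAND lines as ssh/command events.

    Returns (service, event_type, fields); non-matching input is
    returned unchanged with empty fields.
    """
    match = _CMD_RE.match(msg)
    if service != "bash" or match is None:
        return service, event_type, {}
    return "ssh", "command", match.groupdict()
```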
New GET /attackers/{uuid}/commands?limit=&offset=&service= endpoint
serves commands with server-side pagination and optional service filter.
AttackerDetail frontend fetches commands from this endpoint with
page controls. Service badge filter now drives both the API query
and the local fingerprint filter.
Clicking a service badge in the attacker detail view now filters the
commands and fingerprints sections on that page instead of navigating
away. Click again to clear. Header shows filtered/total counts.
API now accepts ?service=https to filter attackers by targeted service.
Service badges are clickable in both the attacker list and detail views,
navigating to a filtered view. Active filter shows as a dismissable tag.
Same (src_ip, event_type, fingerprint) tuple is now suppressed within a
5-minute window (configurable via DEDUP_TTL env var). Prevents the bounty
vault from filling up with identical JA3/JA4 rows from repeated connections.
TLS-wrapped variant of the HTTP honeypot. Auto-generates a self-signed
certificate on startup if none is provided. Supports all the same persona
options (fake_app, server_header, custom_body, etc.) plus TLS_CERT,
TLS_KEY, and TLS_CN configuration.
EHLO/HELO require a domain or address-literal argument. Previously
the server accepted bare EHLO with no argument and responded 250,
which deviates from the spec and makes the honeypot easier to
fingerprint.
The collector kept streaming stale container IDs after a redeploy,
causing new service logs to never reach decnet.log. Now _kill_api()
also matches and SIGTERMs any running decnet.cli collect process.
Every service's _log() called print() then write_syslog_file() which also
calls print(), causing every log line to appear twice in Docker logs. The
collector streamed both copies, doubling ingested events. Removed the
redundant print() from all 22 service server.py files.
Two bugs fixed:
- data_received only split on CRLF, so clients sending bare LF (telnet, nc,
some libraries) got no responses at all. Now splits on LF and strips
trailing CR, matching real Postfix behavior.
- AUTH PLAIN without inline credentials set state to "await_plain" but no
handler existed for that state, causing the next line to be dispatched as
a normal command. Added the missing state handler.
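The line-splitting fix can be sketched as a pure function over the receive buffer. The helper name is illustrative; the real server does this inside data_received.

```python
def split_lines(buffer: bytes):
    """Split on LF, strip a trailing CR, and keep the unfinished tail.

    Returns (complete_lines, remaining_buffer) so both CRLF and bare-LF
    clients are handled.
    """
    *complete, rest = buffer.split(b"\n")
    return [line.rstrip(b"\r") for line in complete], rest
```

The remainder is carried over to the next call, so a command split across two TCP segments is dispatched only once it is complete.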
Migrate Attacker model from IP-based to UUID-based primary key with
auto-migration for old schema. Add GET /attackers (paginated, search,
sort) and GET /attackers/{uuid} API routes. Rewrite Attackers.tsx as
a card grid with full threat info and create AttackerDetail.tsx as a
dedicated detail page with back navigation, stats, commands table,
and fingerprints.
- Modify Rfc5424Formatter to read decnet_component from LogRecord
and use it as RFC 5424 APP-NAME field (falls back to 'decnet')
- Add get_logger(component) factory in decnet/logging/__init__.py
with _ComponentFilter that injects decnet_component on each record
- Wire all five layers to their component tag:
cli -> 'cli', engine -> 'engine', api -> 'api' (api.py, ingester,
routers), mutator -> 'mutator', collector -> 'collector'
- Add structured INFO/DEBUG/WARNING/ERROR log calls throughout each
layer per the defined vocabulary; DEBUG calls are suppressed unless
DECNET_DEVELOPER=true
- Add tests/test_logging.py covering factory, filter, formatter
component-awareness, fallback behaviour, and level gating
- Fixed CLI tests by patching local imports at source (psutil, os, Path).
- Fixed Collector tests by globalizing docker.from_env mock.
- Stabilized SSE stream tests via AsyncMock and immediate generator termination to prevent hangs.
- Achieved >80% coverage on CLI (84%), Collector (97%), and DB Repository (100%).
- Implemented SMTP Relay service tests (100%).
- Add merge-to-testing job: after all CI checks pass on dev, auto-merge
into testing with --no-ff for clear merge history
- Move open-pr job to trigger on testing branch instead of dev
- PR now opens testing → main instead of dev → main
- Add bandit and pip-audit jobs to pr.yml PR gate for full suite coverage
- PR gate test job now installs dev dependencies consistently
Spins up each service's server.py in a real subprocess via a free ephemeral
port (PORT env var), connects with real protocol clients, and asserts both
correct protocol behavior and RFC 5424 log output.
- 44 live tests across 10 services: http, ftp, smtp, redis, mqtt,
mysql, postgres, mongodb, pop3, imap
- Shared conftest.py: _ServiceProcess (bg reader thread + queue),
free_port, live_service fixture, assert_rfc5424 helper
- PORT env var added to all 10 targeted server.py templates
- New pytest marker `live`; excluded from default addopts run
- requirements-live-tests.txt: flask, twisted + protocol clients
MongoDB had the same infinite-loop bug as MSSQL (msg_len=0 → buffer never
shrinks in while loop). Postgres, MySQL, and MQTT had related length-field
issues (stuck state, resource exhaustion, overlong remaining-length).
Also fixes an existing MongoDB _op_reply struct.pack format bug (extra 'q'
specifier caused struct.error on any OP_QUERY response).
Adds 53 regression + protocol boundary tests across MSSQL, MongoDB,
Postgres, MySQL, and MQTT, including a _run_with_timeout threading harness
to catch infinite loops and @pytest.mark.fuzz hypothesis tests for each.
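The infinite-loop class of bug can be illustrated with a generic length-prefixed framer. This is a sketch, not the templates' parser: it assumes a little-endian int32 length that counts the whole message including a 16-byte header, as in MongoDB's wire protocol, and the size cap is an arbitrary illustrative limit.

```python
import struct

HEADER_LEN = 16          # MongoDB header: four int32 fields
MAX_MESSAGE_LEN = 1 << 24  # illustrative cap against resource exhaustion


class ProtocolError(ValueError):
    pass


def extract_frames(buffer: bytes):
    """Split complete frames off the front of `buffer`.

    Returns (frames, remainder). A declared length smaller than the
    header would leave the buffer unchanged forever, so it is rejected
    instead of looping.
    """
    frames = []
    while len(buffer) >= 4:
        (msg_len,) = struct.unpack_from("<i", buffer)
        if msg_len < HEADER_LEN or msg_len > MAX_MESSAGE_LEN:
            raise ProtocolError(f"invalid message length: {msg_len}")
        if len(buffer) < msg_len:
            break  # wait for more bytes
        frames.append(buffer[:msg_len])
        buffer = buffer[msg_len:]
    return frames, buffer
```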
Cowrie was exposing an SSH daemon on port 22 alongside the telnet service
even when COWRIE_SSH_ENABLED=false, contaminating deployments that did not
request an SSH service.
New implementation mirrors the SSH service pattern:
- busybox telnetd in foreground mode on port 23
- /bin/login for real PAM authentication (brute-force attempts logged)
- rsyslog RFC 5424 bridge piped to stdout for Docker log capture
- Configurable root password and hostname via env vars
- No Cowrie dependency
real_ssh was a separate service name pointing to the same template and
behaviour as ssh. Merged them: ssh is now the single real-OpenSSH service.
- Rename templates/real_ssh/ → templates/ssh/
- Remove decnet/services/real_ssh.py
- Deaddeck archetype updated: services=["ssh"]
- Merge test_real_ssh.py into test_ssh.py (includes deaddeck + logging tests)
- Drop decnet.services.real_ssh from test_build module list
The collector subprocess was spawned via 'python3 -m decnet.cli collect'
but cli.py had no 'if __name__ == __main__: app()' guard. Python executed
the module, defined all functions, then exited cleanly with code 0 without
ever calling the collect command. No output, no log file, exit 0 — silent
non-start every time.
Also route collector stderr to <log_file>.collector.log so future crashes
are visible instead of disappearing into DEVNULL.
Collector and mutator watcher subprocesses were spawned without
start_new_session=True, leaving them in the parent's process group.
SIGHUP (sent when the controlling terminal closes) killed both
processes silently — stdout/stderr were DEVNULL so the crash was
invisible.
Also update test_services and test_composer to reflect the ssh plugin
no longer using Cowrie env vars (replaced with SSH_ROOT_PASSWORD /
SSH_HOSTNAME matching the real_ssh plugin).
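The detached-spawn pattern from the start_new_session fix can be sketched as below. The helper name and log-file handling are illustrative, and the behaviour is POSIX-only.

```python
import subprocess
import sys


def spawn_detached(argv, log_path):
    """Start a background worker in its own session.

    start_new_session=True makes the child call setsid(), so it leaves
    the parent's process group and does not receive the SIGHUP sent when
    the controlling terminal closes. Output goes to a file, not DEVNULL,
    so crashes stay visible.
    """
    with open(log_path, "ab") as log:
        return subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=log,
            start_new_session=True,
        )
```

A caller might pass `[sys.executable, "-m", "decnet.cli", "collect", ...]` as `argv`; the exact arguments are not taken from the source.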
Scraps the Cowrie emulation layer. The real_ssh template now runs a
genuine sshd backed by a three-layer logging stack forwarded to stdout
as RFC 5424 for the DECNET collector:
auth,authpriv.* → rsyslogd → named pipe → stdout (logins/failures)
user.* → rsyslogd → named pipe → stdout (PROMPT_COMMAND cmds)
sudo syslog=auth → rsyslogd → named pipe → stdout (privilege escalation)
sudo logfile → /var/log/sudo.log (local backup with I/O)
The ssh.py service plugin now points to templates/real_ssh and drops all
COWRIE_* / NODE_NAME env vars, sharing the same compose fragment shape as
real_ssh.py.
_load_service_container_names() reads decnet-state.json and builds the
exact set of expected container names ({decky}-{service}). is_service_container()
and is_service_event() do a direct set lookup — no regex, no label
inspection, no heuristics.
Two bugs caused the log file to never be written:
1. is_service_container() used regex '^decky-\d+-\w' which only matched
the old decky-01-smtp naming style. Actual containers are named
omega-decky-smtp, relay-decky-smtp, etc. Fixed by using Docker Compose
labels instead: com.docker.compose.project=decnet + non-empty
depends_on discriminates service containers from base (sleep infinity)
containers reliably regardless of decky naming convention.
Added is_service_event() for the Docker events path.
2. The collector was only started when --api was used. Added a 'collect'
CLI subcommand (decnet collect --log-file <path>) and wired it into
deploy as an auto-started background process when --api is not in use.
Default log path: /var/log/decnet/decnet.log
When --parallel is set:
- DOCKER_BUILDKIT=1 is injected into the subprocess environment to
ensure BuildKit is active regardless of host daemon config
- docker compose build runs first (all images built concurrently)
- docker compose up -d follows without --build (no redundant checks)
Without --parallel the original up --build path is preserved.
--parallel and --no-cache compose correctly (build --no-cache).
Conpot is a third-party app with its own Python logger — it never calls
decnet_logging. Added entrypoint.py as a subprocess wrapper that:
- Launches conpot and captures its stdout/stderr
- Classifies each line (startup/request/warning/error/log)
- Extracts source IPs via regex
- Emits RFC 5424 syslog lines to stdout for Docker/collector pickup
Entrypoint is self-contained (no import of shared decnet_logging.py)
because the conpot base image runs Python 3.6, which cannot parse the
dict[str, Any] / str | None type syntax used in the canonical file.
The BASE_IMAGE build arg was being unconditionally overwritten by
composer.py with the decky's distro build_base (debian:bookworm-slim),
turning the conpot container into a bare Debian image with no conpot
installation — hence the silent restart loop.
Two fixes:
1. composer.py: use args.setdefault() so services that pre-declare
BASE_IMAGE in their compose_fragment() win over the distro default.
2. conpot.py: pre-declare BASE_IMAGE=honeynet/conpot:latest in build
args so it always uses the upstream image regardless of decky distro.
Also removed the USER decnet switch from the conpot Dockerfile. The
upstream image already runs as the non-root 'conpot' user; switching to
'decnet' broke pkg_resources because conpot's eggs live under
/home/conpot/.local and are only on sys.path for that user.
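The setdefault fix from the first item can be sketched on a plain dict. The fragment shape and function name are assumptions about how composer.py represents a compose service.

```python
def apply_distro_base(fragment: dict, build_base: str) -> dict:
    """Fill in BASE_IMAGE only when the service did not declare one."""
    args = fragment.setdefault("build", {}).setdefault("args", {})
    # setdefault keeps a pre-declared value; plain assignment would clobber it.
    args.setdefault("BASE_IMAGE", build_base)
    return fragment
```

With this, a service that pins its own upstream image keeps it, while services without a preference still inherit the decky's distro base.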
Windows: both 0 (no ICMP rate limiting — matches real Windows behavior)
Linux: 1000ms / mask 6168 (kernel defaults)
BSD: 250ms / mask 6168 (FreeBSD default is faster than Linux)
Embedded/Cisco: both 0 (most firmware doesn't rate-limit ICMP)
These affect nmap's IE and U1 probe groups which measure ICMP error
response timing to closed UDP ports. Windows responds to all probes
instantly while Linux throttles to ~1/sec.
Tests: 10 new cases (5 per sysctl). Suite: 822 passed.
Phase 1 is complete. Live testing revealed:
- Window size (64240) is already correct — Phase 2 window mangling unnecessary
- TI=Z (IP ID = 0) is the single remaining blocker for Windows spoofing
- ip_no_pmtu_disc does NOT fix TI=Z (tested and confirmed)
Revised phase plan:
- Phase 2: ICMP tuning (icmp_ratelimit + icmp_ratemask sysctls)
- Phase 3: NFQUEUE daemon for IP ID rewriting (fixes TI=Z)
- Phase 4: diminishing returns, not recommended
Added detailed NFQUEUE architecture, TCPOPTSTRIP notes, and a note
clarifying the P= field in nmap output.
ip_no_pmtu_disc controls PMTU discovery for UDP/ICMP paths only.
TI=Z originates from ip_select_ident() in the kernel TCP stack setting
IP ID=0 for DF=1 TCP packets — a namespace-scoped sysctl cannot change this.
The previous commit was based on incorrect root-cause analysis.
When ip_no_pmtu_disc=0 the Linux kernel sets DF=1 on TCP packets and uses
IP ID=0 (RFC 6864). nmap's TI=Z fingerprint has no Windows match in its DB,
causing 91% confidence guesses of 'Linux 2.4/2.6 embedded' regardless of
TTL being 128. Setting ip_no_pmtu_disc=1 allows non-zero IP ID generation.
Trade-off: DF bit is not set on outgoing packets (slightly wrong for Windows)
but TI=Z is far more damaging to the spoof than losing DF accuracy.
The entrypoint.sh was present in the build context but never COPYed into
the image, causing 'stat /entrypoint.sh: no such file or directory' at
container start. Added COPY+chmod before the USER decnet instruction so
the script is installed as root and is executable by all users.
Add tcp_timestamps, tcp_window_scaling, tcp_sack, tcp_ecn, ip_no_pmtu_disc,
and tcp_fin_timeout to every OS profile in OS_SYSCTLS.
All 6 are network-namespace-scoped and safe to set per-container without
--privileged. They directly influence nmap's OPS, WIN, ECN, and T2-T6
probe groups, making OS family detection significantly more convincing.
Key changes:
- tcp_timestamps=0 for windows/embedded/cisco (strongest Windows discriminator)
- tcp_ecn=2 for linux (ECN offer), 0 for all others
- tcp_sack=0 / tcp_window_scaling=0 for embedded/cisco
- ip_no_pmtu_disc=1 for embedded/cisco (DF bit ICMP behaviour)
- Expose _REQUIRED_SYSCTLS frozenset for completeness assertions
Tests: 88 new test cases across all OS families and composer integration.
Total suite: 812 passed.
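A completeness assertion of the kind `_REQUIRED_SYSCTLS` enables might look like the sketch below. The helper `missing_sysctls` and the sample profile are hypothetical; the profile only uses values stated above for embedded/cisco and deliberately omits `tcp_fin_timeout`, whose value this message does not give.

```python
# The six sysctls named in the commit message, as full kernel key names.
_REQUIRED_SYSCTLS = frozenset({
    "net.ipv4.tcp_timestamps",
    "net.ipv4.tcp_window_scaling",
    "net.ipv4.tcp_sack",
    "net.ipv4.tcp_ecn",
    "net.ipv4.ip_no_pmtu_disc",
    "net.ipv4.tcp_fin_timeout",
})


def missing_sysctls(profile: dict[str, str]) -> frozenset[str]:
    """Return the required sysctl keys that a profile does not define."""
    return _REQUIRED_SYSCTLS.difference(profile)


# Hypothetical, intentionally incomplete profile for illustration.
embedded_partial = {
    "net.ipv4.tcp_timestamps": "0",
    "net.ipv4.tcp_window_scaling": "0",
    "net.ipv4.tcp_sack": "0",
    "net.ipv4.tcp_ecn": "0",
    "net.ipv4.ip_no_pmtu_disc": "1",
}

if __name__ == "__main__":
    print(sorted(missing_sysctls(embedded_partial)))
```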
- Add dynamic challenge nonces to Postgres, VNC, and SIP.
- Add basic keyspace lookup and mock data to Redis.
- Correct MSSQL TDS pre-login offset bounds.
- Support MongoDB OP_MSG handshake version checking.
- Suppress Werkzeug HTTP server headers and normalize FTPAnonymousShell response.
- Add tracking for Dynamic Bait Store (DEBT-027) via DEBT.md.
- decnet/services/smtp_relay.py: open relay variant of smtp, same template
with SMTP_OPEN_RELAY=1 baked into the environment
- tests/service_testing/__init__.py: init so pytest discovers the subdirectory
- Buffer DATA body until CRLF.CRLF terminator — fixes 502-on-every-body-line bug
- SMTP_OPEN_RELAY=1: AUTH accepted (235), RCPT TO accepted for any domain,
full DATA pipeline with queued-as message ID
- Default (SMTP_OPEN_RELAY=0): credential harvester — AUTH rejected (535)
but connection stays open, RCPT TO returns 554 relay denied
- SASL PLAIN and LOGIN multi-step AUTH both decoded and logged
- RSET clears all per-transaction state
- Add development/SMTP_RELAY.md, IMAP_BAIT.md, ICS_SCADA.md, BUG_FIXES.md
(live-tested service realism plans)
- Add # nosec B104 to all intentional 0.0.0.0 binds in honeypot servers
(hardcoded_bind_all_interfaces is by design — deckies must accept attacker connections)
- Add # nosec B101 to assert statements used for protocol validation in ldap/snmp
- Add # nosec B105 to fake SASL placeholder in ldap
- Add # nosec B108 to /tmp usage in smb template
- Exclude root-owned auto-generated decnet_logging.py copies from bandit scan
via pyproject.toml [tool.bandit] config (synced by _sync_logging_helper at deploy)
Services now print RFC 5424 to stdout; Docker captures via json-file driver.
A new host-side collector (decnet.web.collector) streams docker logs from all
running decky service containers and writes RFC 5424 + parsed JSON to the host
log file. The existing ingester continues to tail the .json file unchanged.
rsyslog can consume the .log file independently — no DECNET involvement needed.
Removes: bind-mount volume injection, _LOG_NETWORK bridge, log_target config
field and --log-target CLI flag, TCP syslog forwarding from service templates.
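To illustrate the "RFC 5424 + parsed JSON" step, here is a rough sketch of turning one RFC 5424 line into a JSON-ready dict. It is not the collector's actual code: `parse_rfc5424`, the regex, and the sample line (including the `decnet@55555` structured-data ID) are assumptions made for the example, and the pattern is deliberately simplified.

```python
import json
import re
from typing import Optional

# Simplified RFC 5424 pattern. It does not handle an escaped "]" inside
# structured-data values, which a production parser would need to.
_RFC5424 = re.compile(
    r"^<(?P<pri>\d{1,3})>(?P<version>\d{1,2}) "
    r"(?P<timestamp>\S+) (?P<hostname>\S+) (?P<app>\S+) "
    r"(?P<procid>\S+) (?P<msgid>\S+) "
    r"(?P<sd>-|(?:\[.*?\])+)"
    r"(?: (?P<msg>.*))?$"
)


def parse_rfc5424(line: str) -> Optional[dict]:
    """Parse one syslog line; return None if it is not RFC 5424 shaped."""
    match = _RFC5424.match(line.rstrip("\n"))
    if match is None:
        return None
    record = match.groupdict()
    pri = int(record.pop("pri"))
    record["facility"] = pri >> 3  # PRI = facility * 8 + severity
    record["severity"] = pri & 7
    record["version"] = int(record["version"])
    return record


# Hypothetical sample line, not taken from real DECNET output.
SAMPLE = (
    '<134>1 2024-01-01T00:00:00Z decky-01 ssh - login '
    '[decnet@55555 src_ip="10.0.0.5"] auth attempt'
)

if __name__ == "__main__":
    print(json.dumps(parse_rfc5424(SAMPLE)))
```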
- Rebuild repo.engine and repo.session_factory per-test using unique
in-memory SQLite URIs — fixes KeyError: 'access_token' caused by
stale session_factory pointing at production DB
- Add @pytest.mark.fuzz to all Hypothesis and Schemathesis tests;
default run excludes them (addopts = -m 'not fuzz')
- Add missing fuzz tests to bounty, fleet, histogram, and repository
- Use tmp_path for state file in patch_state_file/mock_state_file to
eliminate file-path race conditions under xdist parallelism
- Set default addopts: -v -q -x -n logical (26 tests in ~7s)
- decnet/env.py: DECNET_JWT_SECRET and DECNET_ADMIN_PASSWORD are now
required env vars; startup raises ValueError if unset or set to a
known-bad default ("admin", "password", etc.)
- decnet/env.py: add DECNET_CORS_ORIGINS (comma-separated, defaults to
http://localhost:8080) replacing the previous allow_origins=["*"]
- decnet/web/api.py: use DECNET_CORS_ORIGINS and tighten allow_methods
and allow_headers to explicit lists
- tests/conftest.py: set required env vars at module level so test
collection works without real credentials
- tests/test_web_api.py, test_web_api_fuzz.py: use DECNET_ADMIN_PASSWORD
from env instead of hardcoded "admin"
Closes DEBT-001, DEBT-002, DEBT-004
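The startup checks described in this commit could be sketched roughly as follows. The function names and the exact contents of `_KNOWN_BAD` are assumptions for illustration (the message only names "admin" and "password" explicitly); the real `decnet/env.py` may be structured differently.

```python
import os
from typing import Mapping, Optional

# Assumed subset of the rejected defaults; the commit lists "admin", "password", etc.
_KNOWN_BAD = frozenset({"admin", "password"})


def require_secret(name: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the value of a required env var or raise ValueError."""
    env = os.environ if environ is None else environ
    value = env.get(name, "").strip()
    if not value:
        raise ValueError(f"{name} must be set")
    if value.lower() in _KNOWN_BAD:
        raise ValueError(f"{name} is set to a known-bad default")
    return value


def cors_origins(environ: Optional[Mapping[str, str]] = None) -> list[str]:
    """Split the comma-separated DECNET_CORS_ORIGINS, defaulting to localhost:8080."""
    env = os.environ if environ is None else environ
    raw = env.get("DECNET_CORS_ORIGINS", "http://localhost:8080")
    return [origin.strip() for origin in raw.split(",") if origin.strip()]
```

Passing a plain dict as `environ` keeps the checks testable without touching the real process environment.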
DECNET is a honeypot/deception network framework. It deploys fake machines (called **deckies**) with realistic services (RDP, SMB, SSH, FTP, etc.) to lure and profile attackers. All attacker interactions are aggregated to an isolated logging network (ELK stack / SIEM).
## Deployment Models
**UNIHOST** — one real host spins up _n_ deckies via a container orchestrator. Simpler, single-machine deployment.
**SWARM (MULTIHOST)** — _n_ real hosts each running deckies. Orchestrated via Ansible/sshpass or similar tooling.
## Core Technology Choices
- **Containers**: Docker Compose is the starting point but other orchestration frameworks should be evaluated if they serve the project better. `debian:bookworm-slim` is the default base image; mixing in Ubuntu, CentOS, or other distros is encouraged to make the decoy network look heterogeneous.
- **Networking**: Deckies need to appear as real machines on the LAN (own MACs/IPs). MACVLAN and IPVLAN are candidates; the right driver depends on the host environment. WSL has known limitations — bare metal or a VM is preferred for testing.
- **Log pipeline**: Logstash → ELK stack → SIEM (isolated network, not reachable from decoy network)
## Architecture Constraints
- The decoy network must be reachable from the outside (attacker-facing).
- The logging/aggregation network must be isolated from the decoy network.
- A publicly accessible real server acts as the bridge between the two networks.
- Deckies should differ in exposed services and OS fingerprints to appear as a heterogeneous network.
## Development and testing
- For every new feature, pytest tests must be written.
- Pytest is the main testing framework in use.
- NEVER pass broken code to the user.
- Broken means: not running, not passing 100% tests, etc.
- After tests pass with 100%, always git commit your changes.
- NEVER add "Co-Authored-By" or any Claude attribution lines to git commit messages.
In the unihost model, DECNET deploys _n_ deckies from a single real machine. The deckies live in a decoy network that an attacker can reach from the outside.
Each decky (a child of the DECNET unihost) should expose different services (RDP, SMB, SSH, FTP, etc.), and all of them should report to an external, isolated network that aggregates the data and allows
visualizations to be made. Think of the ELK stack. That data is then passed on via Logstash or other methods to a SIEM device or anything else that can benefit from it.
## DECNET-MULTIHOST (SWARM) model
The SWARM model is similar to the UNIHOST model, except that instead of one real machine we have n>1 machines. Same thought process really, but deployment may differ.
A low-cost and fairly automatable option is to use Ansible, sshpass, or similar tools.
# Modus operandi
## Docker-Compose
I will use Docker Compose extensively for this project. The reasons are:
- Easily managed.
- Easily extensible.
- Less overhead.
To be completely transparent: I asked Deepseek to write the initial `docker-compose.yml` file. It was mostly boilerplate, and most of it was later modified or deleted. It doesn't exist anymore.
## Distro to use.
I will be using the `debian:bookworm-slim` image for all the containers. I might think about mixing in some Ubuntu or CentOS, but for now, Debian will do just fine.
The distro I'm running is WSL Kali Linux. Let's hope this doesn't cause any problems down the road.
## Networking
It was a hassle, but I think MACVLAN or IPVLAN (thanks @Deepseek!) might work. The reasoning behind picking this networking driver is that, for the project to work, each container must be fully accessible from the network. This is an attempt to masquerade them as real, live machines.
Now, we will need a publicly accessible, real server that has access to this "internal" network. I'll try MACVLAN first.
### MACVLAN Tests
I will first use the default network to see what happens.
```
docker network create -d macvlan \
--subnet=192.168.1.0/24 \
--gateway=192.168.1.1 \
-o parent=eth0 localnet
```
#### Issues
This initial test doesn't seem to be working. It might be because I'm using WSL, so I downloaded an Ubuntu 22.04 Server ISO. I'll try the MACVLAN network on it. Now, if that doesn't work, I don't see how 802.1q would work, at least on _my network_. Perhaps if I had a switch I could make it work, but currently I don't have one :c
- [ ] **Canary tokens** — Embed canary URLs, fake AWS keys, fake API tokens, and honeydocs (PDF/DOCX with phone-home URLs) into decky filesystems. Fire an alert the moment one is used.
- [ ] **Tarpit mode** — Slow down attackers by making services respond extremely slowly (e.g., SSH that takes 60s to reject, HTTP that drip-feeds bytes). Wastes attacker time and resources.
- [ ] **Dynamic decky mutation** — Deckies that change their exposed services or OS fingerprint over time to confuse port-scan caching and appear more "alive."
- [ ] **Credential harvesting DB** — Every username/password attempt across all services lands in a queryable database. Expose via CLI (`decnet creds`) and flag reuse across deckies.
- [ ] **Session recording** — Full session capture for SSH/Telnet (keystroke logs, commands run, files downloaded). Cowrie already does this — surface it better in the CLI and correlation engine.
- [ ] **Payload capture** — Store every file uploaded or command executed by an attacker. Hash and auto-submit to VirusTotal or a local sandbox.
## Detection & Intelligence
- [ ] **Real-time alerting** — Webhook/Slack/Telegram notifications when an attacker hits a decky for the first time, crosses N deckies (lateral movement), or uses a known bad IP.
- [ ] **Threat intel enrichment** — Auto-lookup attacker IPs against AbuseIPDB, Shodan, GreyNoise, and AlienVault OTX. Tag known scanners vs. targeted attackers.
- [ ] **Attack campaign clustering** — Group attacker sessions by tooling signatures, timing patterns, and credential sets. Identify coordinated campaigns hitting multiple deckies.
- [ ] **GeoIP mapping** — Attacker origin on a world map. Correlate with ASN data to identify cloud exit nodes, VPNs, and Tor exits.
- [ ] **TTPs tagging** — Map observed attacker behaviors to MITRE ATT&CK techniques automatically. Tag events in the correlation engine.
- [ ] **Honeypot interaction scoring** — Score attackers on a scale: casual scanner vs. persistent targeted attacker, based on depth of interaction and commands run.
## Dashboard & Visibility
- [ ] **Web dashboard** — Real-time web UI showing live decky status, attacker activity, traversal graphs, and credential stats. Could be a simple FastAPI + HTMX or a full React app.
- [ ] **Pre-built Kibana/Grafana dashboards** — Ship dashboard JSON exports out of the box so ELK/Grafana deployments are plug-and-play.
- [ ] **CLI live feed** — `decnet watch` command: tail all decky logs in a unified, colored terminal stream (like `docker-compose logs -f` but prettier).
- [ ] **Traversal graph export** — Export attacker traversal graphs as DOT/Graphviz or JSON for visualization in external tools.
- [ ] **Daily digest** — Automated daily summary email/report: new attackers, top credentials tried, most-hit services.
## Deployment & Infrastructure
- [ ] **SWARM / multihost mode** — Full Ansible-based orchestration for deploying deckies across N real hosts.
- [ ] **Terraform/Pulumi provider** — Spin up cloud-hosted deckies on AWS/GCP/Azure with one command. Useful for internet-facing honeynets.
- [ ] **Auto-scaling** — When attack traffic increases, automatically spawn more deckies to absorb and log more activity.
- [ ] **Kubernetes deployment mode** — Run deckies as Kubernetes pods for environments already running k8s.
- [ ] **Proxmox/libvirt backend** — Full VM-based deckies instead of containers, for even more realistic OS fingerprints and behavior. Docker for speed; VMs for realism.
- [ ] **Raspberry Pi / ARM support** — Low-cost physical honeynets using RPis. Validate ARM image builds.
- [ ] **Decky health monitoring** — Watchdog that auto-restarts crashed deckies and alerts if a service goes dark.
## Services & Realism
- [ ] **HTTPS/TLS support** — HTTP honeypot with a self-signed or Let's Encrypt cert. Many real-world services use HTTPS; plain HTTP stands out.
- [ ] **Fake Active Directory** — A convincing fake AD/LDAP with fake users, groups, and GPOs. Attacker tools like BloodHound should get juicy (fake) data.
- [ ] **Fake file shares** — SMB/NFS shares pre-populated with enticing but fake files: "passwords.xlsx", "vpn_config.ovpn", "backup_keys.tar.gz". All instrumented to detect access.
- [ ] **Realistic web apps** — HTTP honeypot serving convincing fake apps: a fake WordPress, a fake phpMyAdmin, a fake Grafana login — all logging every interaction.
- [ ] **OT/ICS profiles** — Expand Conpot support: Modbus, DNP3, BACnet, EtherNet/IP. Convincing industrial control system decoys.
- [ ] **Printer/IoT archetypes** — Expand existing printer/camera archetypes with actual service emulation (IPP, ONVIF, WS-Discovery).
- [ ] **Service interaction depth** — Some services currently just log the connection. Deepen interaction: fake MySQL that accepts queries and returns realistic fake data, fake Redis that stores and retrieves dummy keys.
## Developer Experience
- [ ] **Plugin SDK docs** — Full documentation and an example plugin for adding custom services. Lower the barrier for community contributions.
- [ ] **Integration tests** — Full deploy/teardown cycle tests against a real Docker daemon (not just unit tests).
- [ ] **Per-service tests** — Each of the 29 service implementations deserves its own test coverage.
- [ ] **CI/CD pipeline** — GitHub/Gitea Actions: run tests on push, lint, build Docker images, publish releases.
- [ ] **Config validation CLI** — `decnet validate my.ini` to dry-check an INI config before deploying.
- [ ] **Config generator wizard** — `decnet wizard` interactive prompt to generate an INI config without writing one by hand.
@@ -470,6 +490,34 @@ See [`test-full.ini`](test-full.ini) — covers all 25 services across 10 role-t
---
## Environment Configuration (.env)
DECNET supports loading configuration from `.env.local` and `.env` files located in the project root. This is useful for securing secrets like the JWT key and configuring default ports without passing flags every time.
An example `.env.example` is provided:
```ini
# API Options
DECNET_API_HOST=0.0.0.0
DECNET_API_PORT=8000
DECNET_JWT_SECRET=supersecretkey12345
DECNET_INGEST_LOG_FILE=/var/log/decnet/decnet.log
# Web Dashboard Options
DECNET_WEB_HOST=0.0.0.0
DECNET_WEB_PORT=8080
DECNET_ADMIN_USER=admin
DECNET_ADMIN_PASSWORD=admin
# Database pool tuning (applies to both SQLite and MySQL)
DECNET_DB_POOL_SIZE=20 # base pool connections (default: 20)
DECNET_DB_MAX_OVERFLOW=40 # extra connections under burst (default: 40)
```
Copy `.env.example` to `.env.local` and modify it to suit your environment.
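For readers unfamiliar with the file format, the sketch below shows how such a file can be read into key/value pairs, including the inline `# ...` comments used in the pool-tuning lines. It is a minimal illustration with a hypothetical `parse_env` helper; DECNET's own loader may use a library and behave differently (for example around quoting).

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment introduced by whitespace + "#".
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


EXAMPLE = """\
# API Options
DECNET_API_PORT=8000
DECNET_DB_POOL_SIZE=20        # base pool connections (default: 20)
"""

if __name__ == "__main__":
    print(parse_env(EXAMPLE))
```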
---
## Logging
All attacker interactions are forwarded off the decoy network to an isolated logging sink. The log pipeline lives on a separate internal Docker bridge (`decnet_logs`) that is not reachable from the fake LAN.
@@ -631,3 +679,115 @@ The test suite covers:
| `test_cli_service_pool.py` | CLI service resolution |
Every new feature requires passing tests before merging.
### Stress Testing
A [Locust](https://locust.io)-based stress test suite lives in `tests/stress/`. It hammers every API endpoint with realistic traffic patterns to find throughput ceilings and latency degradation.
`/deckies`, `/config`) collapse concurrent duplicate work onto a
single DB hit per window — essential to reach this RPS on one worker.
- Turning off request tracing (`DECNET_TRACING=false`) is the next
free headroom: tracing was still on during the run above.
- On SQLite, `DECNET_DB_POOL_PRE_PING=false` skips the per-checkout
`SELECT 1`. On MySQL, keep it `true` — network disconnects are real.
#### System tuning: open file limit
Under heavy load (500+ concurrent users), the server will exhaust the default Linux open file limit (`ulimit -n`), causing `OSError: [Errno 24] Too many open files`. Most distros default to **1024**, which is far too low for stress testing or production use.
**Before running stress tests:**
```bash
# Check current limit
ulimit -n
# Bump for this shell session
ulimit -n 65536
```
**Permanent fix** — add to `/etc/security/limits.conf`:
```
* soft nofile 65536
* hard nofile 65536
```
Or for systemd-managed services, add `LimitNOFILE=65536` to the unit file.
> This applies to production deployments too — any server handling hundreds of concurrent connections needs a raised file descriptor limit.
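The same check can be done from Python with the standard-library `resource` module (Unix only). This is a sketch rather than something the project ships: `raise_soft_nofile` is a hypothetical helper that raises only the soft limit, which an unprivileged process may do up to the hard limit.

```python
import resource


def nofile_limits() -> tuple[int, int]:
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_nofile(target: int = 65536) -> int:
    """Raise the soft limit towards `target`, capped at the hard limit.

    Returns the soft limit in effect afterwards; never lowers it.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if new_soft <= soft:
        return soft
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft


if __name__ == "__main__":
    print("before:", nofile_limits())
    print("soft limit now:", raise_soft_nofile())
```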
# AI Disclosure
This project has been made with lots, and I mean lots of help from AIs. While most of the design was made by me, most of the coding was done by AI models.
Nevertheless, this project will be kept under high scrutiny by humans.
**Critical Pre-v1 Gaps** (blockers if signals are roadmap-committed):
1. **KEX algorithm ordering** — HASSH hash is stored, but raw `kex_algorithms` string is only emitted to syslog, not persisted to DB. Future extractor must parse syslog archives.
2. **Per-keystroke timing** — Asciinema v2 `"i"` events with `t` timestamps are written to day-shard files on disk, but no database ingestion. Requires filesystem polling + parsing path.
3. **TCP options order** — Captured in PCAP + sniffer logs (`options_sig`), but `options_sig` is a rolled-up signature string, not the raw per-connection sequence.
4. **Terminal size (COLS×ROWS)** — Not captured from pty-req at all; would require SSH protocol-level interception.
5. **SSH client version** — Server-side only sees RFC 4253 banner; full version string would require TLS cert inspection or prober modification.
**Biggest ROI capture improvements** (cheap, high-value):
1. Add `ssh_client_banner` column to Attacker table and capture the SSH-2.0-* identification string from the RFC 4253 version exchange.
2. Ingest asciinema keystroke timing into new `SessionProfile` table (v2 roadmap already designs this).
3. Store raw KEX algorithm lists in `AttackerBehavior.kex_order_raw` (MEDIUMTEXT) instead of relying on syslog dedup.
- **Where**: SSH server can read TERM from pty-req; emitted in syslog by `emit_capture.py` if implemented.
- **Current path**: Not found in active code path. Check `decnet/templates/ssh/emit_capture.py` or syslog bridge.
- **Missing**: Database column in a `SessionProfile` table; no structured ingestion.
- **Cheap fix**: Modify SSH syslog bridge to emit `session_event` with `term=<value>`. Create `SessionProfile` table with `session_term` TEXT column.
- **Priority**: V2 backlog (nice-to-have for human vs. automation, low discriminative power).
#### LANG / LC_ALL
- **Status**: `not_captured`
- **Why**: Server-side locale is baked into container image, not attacker-controlled. Attacker's client locale is not visible over SSH.
- **Priority**: defer (non-capturable from server vantage point).
#### SSH client version string (full SSH-2.0-OpenSSH_9.2p1…)
- **Status**: `partial`
- **Where**: RFC 4253 banner string is transmitted in plaintext before encryption. Sniffer could capture it from TCP stream; prober `hassh.py` captures server banner (lines 58–101), not client.
- **Missing**: Client-side banner capture. Sniffer would need TCP stream reconstruction to pluck the SSH banner from the raw payload.
- **Cheap fix**: Extend sniffer to parse SSH banners from the TCP stream (they are sent before encryption starts); emit `ssh_client_banner` event. Store in Attacker.`ssh_client_banners` (JSON list).
- **Priority**: v1 blocker if client-profiling is committed. Currently partial via TLS fingerprint fallback.
#### Terminal size (COLS × ROWS)
- **Status**: `not_captured`
- **Why**: The SSH pty-req channel request carries the TERM value, the terminal dimensions (COLS, ROWS), and the encoded terminal modes (including speeds); server-side sshd parses this but does not log it by default. Would require patching sshd or intercepting at the protocol layer.
- **Missing**: No access to pty-req payload without protocol-level instrumentation.
- **Cheap fix**: Patch SSH entrypoint to log pty-req to syslog before accepting the request (requires custom OpenSSH build).
- **Priority**: V2 backlog (interesting for typing-space reconstruction, but not a blocker).
#### Per-keystroke timing (t in asciinema "i" events)
- **Status**: `partial`
- **Where**: Sessrec pipeline (`decnet/templates/ssh/sessrec/`) writes asciinema v2 day-shards with per-keystroke `"i"` (input) events carrying `t` (timestamp in seconds since session start). Files on disk: `/var/lib/decnet/session_recordings/<decky>/<date>.json` (or similar).
- **Missing**: No ingestion into database. Extractors must read asciinema files from filesystem and parse the `"i"` event stream post-hoc.
- **Cheap fix**: Ingest keystroke timing stream into new `SessionProfile` table (design already in DEVELOPMENT_V2.md). Add job to parse day-shard files on rotation and compute IKI moments, burst ratio, etc.
- **Priority**: v1 blocker if keystroke dynamics is roadmap-committed. Data exists but not queryable.
- **Where**: Asciinema captures every keystroke as UTF-8/control byte in `"i"` events. Raw byte sequence is preserved.
- **Missing**: Same as above — files on disk, no DB ingestion. Future extractor can parse control bytes from the `"data"` field of each `"i"` event.
- **Cheap fix**: Same as keystroke timing — ingest asciinema events and compute `kd_ctrl_*` rates in SessionProfile.
- **Priority**: v2 (depends on SessionProfile schema).
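As a rough illustration of the post-hoc parsing described for keystroke timing, the sketch below pulls `"i"` events out of asciicast v2 lines and computes inter-key intervals (IKI). The helper names are hypothetical, and it assumes each event is a JSON array `[t, "i", data]` on its own line, as in the asciicast v2 format; the actual day-shard layout may differ.

```python
import json
from statistics import mean


def input_events(lines):
    """Return (t, data) tuples for every "i" event; skip headers and other events."""
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if isinstance(record, list) and len(record) == 3 and record[1] == "i":
            events.append((float(record[0]), record[2]))
    return events


def inter_key_intervals(events):
    """Return the gaps in seconds between consecutive input events."""
    times = [t for t, _data in events]
    return [later - earlier for earlier, later in zip(times, times[1:])]


# Hypothetical recording built with json.dumps to avoid escaping mistakes.
SAMPLE_LINES = [
    json.dumps({"version": 2, "width": 80, "height": 24}),
    json.dumps([0.5, "i", "l"]),
    json.dumps([0.75, "i", "s"]),
    json.dumps([0.8, "o", "ls"]),
    json.dumps([1.25, "i", "\r"]),
]

if __name__ == "__main__":
    gaps = inter_key_intervals(input_events(SAMPLE_LINES))
    print(gaps, mean(gaps))
```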
#### Inter-command think time (prompt-return to next-command-start gap)
- **Status**: `not_captured`
- **Why**: Requires prompt boundary detection in the asciinema stream (heuristic: line ending in `$` or `#` + pause > 100ms). No active code marks prompts.
- **Missing**: Prompt-boundary markers in asciinema. Would require ML or regex-based post-processing.
- **Cheap fix**: Add prompt-regex configuration + marker injection during sessrec playback, or post-hoc analysis over asciinema.
- **Priority**: V2 (interesting but requires heuristic or attacker-side annotation).
#### Pause before sensitive commands
- **Status**: `not_captured`
- **Why**: Requires command-boundary detection (typing a full command, then detecting gap before Enter). Asciinema captures this timing, but no code marks command boundaries.
- **Missing**: Command-line parsing + gap detection logic.
- **Cheap fix**: Off-line analysis: parse `"i"` events, detect Enter (`\r`), measure gap before Enter. Correlate with command content from `"o"` (output) events.
- **Priority**: V2 backlog (post-extraction analysis; interesting for psychological profiling).
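The off-line analysis suggested here could start from something like the sketch below, which measures the gap between the last keystroke of a command and the Enter that submits it. It is a simplification with a hypothetical helper name: it assumes one keystroke per `"i"` event, takes the command text from the input stream rather than from `"o"` events, and ignores backspaces.

```python
def pre_enter_pauses(events):
    """Return (typed_text, pause_seconds) for each Enter in a list of (t, data) events."""
    pauses = []
    typed = []
    prev_t = None
    for t, data in events:
        if data == "\r":
            # Only count an Enter that follows at least one typed character.
            if prev_t is not None and typed:
                pauses.append(("".join(typed), t - prev_t))
            typed = []
        else:
            typed.append(data)
        prev_t = t
    return pauses


if __name__ == "__main__":
    demo = [(0.0, "r"), (0.25, "m"), (2.25, "\r"), (3.0, "l"), (3.25, "s"), (3.5, "\r")]
    print(pre_enter_pauses(demo))
```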
#### Command n-grams
- **Status**: `partial`
- **Where**: SSH service logs individual commands to syslog when pty input is detected. Attacker.`commands` JSON array stores seen commands (but coarse-grained per service/decky, not per-session).
- **Missing**: Per-session, per-command sequencing. No n-gram bigrams/trigrams computed.
- **Cheap fix**: Parse asciinema `"i"` + `"o"` stream to extract full command lines, store as JSON list in SessionProfile.`cmd_sequence` or new `SessionCommand` table.
- **Priority**: V2 (foundation for command chaining fingerprint).
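Once full command lines are available per session, computing n-grams is straightforward. The sketch below counts n-grams over the first token of each command; `command_ngrams` and the choice to key on the program name only are assumptions for illustration, not an existing DECNET function.

```python
from collections import Counter


def command_ngrams(commands, n=2):
    """Count n-grams over the first token (program name) of each command line."""
    heads = [cmd.split()[0] for cmd in commands if cmd.split()]
    return Counter(tuple(heads[i:i + n]) for i in range(len(heads) - n + 1))


if __name__ == "__main__":
    session = ["whoami", "uname -a", "cat /etc/passwd", "uname -a", "cat /etc/shadow"]
    for gram, count in command_ngrams(session).most_common():
        print(gram, count)
```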
#### Flag preferences (ls -la vs ls -al, ps -ef vs ps aux)
- **Status**: `not_captured`
- **Why**: Asciinema records the **typed** command line exactly, but no code parses flag ordering or normalizes commands for pattern comparison.
- **Cheap fix**: Off-line: regex-parse commands from asciinema, extract flag sequences, compute n-grams over flag positions.
- **Priority**: V2 (cheap post-processing, good human-vs-tool separator).
#### Typo patterns (suod, sl)
- **Status**: `not_captured`
- **Why**: The echoed `"o"` output ends up showing only the corrected command line after backspacing, and no code recovers the typos from the raw keystrokes.
- **Example**: typing `suod`, pressing backspace twice, then typing `do` shows as `sudo` in the `"o"` output; the intermediate typos are **visible** in the `"i"` event stream but require careful keystroke-by-keystroke parsing.
- **Missing**: Raw keystroke stream parsing to detect backspace/correction patterns.
- **Cheap fix**: Parse `"i"` events, reconstruct line state keystroke-by-keystroke, log (typed_text, final_text) pairs to detect corrections.
- **Priority**: V2 (unique human fingerprint, but requires manual asciinema parsing).
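A keystroke-by-keystroke reconstruction might look like the sketch below. It replays a list of input keys, applies backspaces, and returns both the final text and the characters that were deleted along the way. `replay_line` is a hypothetical helper; it assumes one key per event and ignores cursor movement and other escape sequences.

```python
BACKSPACES = {"\x7f", "\x08"}  # DEL and BS, the usual terminal backspace bytes


def replay_line(keystrokes):
    """Replay keys up to the first Enter; return (final_text, deleted_chars)."""
    buffer = []
    deleted = []
    for key in keystrokes:
        if key in BACKSPACES:
            if buffer:
                deleted.append(buffer.pop())
        elif key in ("\r", "\n"):
            break
        elif key.isprintable():
            buffer.append(key)
        # Anything else (escape sequences, control bytes) is ignored here.
    return "".join(buffer), "".join(deleted)


if __name__ == "__main__":
    keys = list("suod") + ["\x7f", "\x7f"] + list("do") + ["\r"]
    print(replay_line(keys))
```

A non-empty second element marks a line that was corrected before submission, which is the signal this entry is after.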
#### Editor choice (vi/vim/nano/ed)
- **Status**: `partial`
- **Where**: Command launch (`vi`, `nano`, `ed`) is visible in asciinema `"i"` + `"o"` stream and captured in Attacker.`commands`.
- **Missing**: No aggregation of editor invocations or time-in-editor statistics.
- **Cheap fix**: Post-process commands, count editor launches, extract editor type. Could add to AttackerBehavior.`preferred_editor` or new SessionProfile.`editor_used`.
- **Where**: Command input stream captures the actual invocation (if attacker types `!!`, it's visible in `"i"`). Output `"o"` shows the expanded command.
- **Missing**: No parsing of history expansion syntax; requires post-processing to identify `!` / `^` patterns.
- **Cheap fix**: Regex-scan asciinema input for shell history operators; count occurrences.
- **Priority**: V2 (interesting tool-chain signal, but low volume).
---
### Per-Attacker, SSH Transport (AttackerBehavior candidates)
#### HASSH / HASSHServer
- **Status**: `captured`
- **Where**: Prober (`decnet/prober/hassh.py`) computes HASSHServer fingerprint; stored as `Attacker.fingerprints` JSON list (generic bounty store). Also emitted to syslog by prober worker.
- **Note**: Roadmap says `[x]` (captured); verified in code at lines 244–252 of `hassh.py`.
- **Storage**: `Attacker.fingerprints` (JSON list of `{type, value, ...}` dicts); not per-attacker-behavior, but queryable.
- **Priority**: ✓ captured; v2: consider normalizing to `AttackerBehavior.hassh_server` for faster lookup.
#### KEX algorithm preference ORDER (beyond HASSH hash)
- **Status**: `partial`
- **Where**: Sniffer logs raw `kex_algorithms`, `encryption_s2c`, `mac_s2c`, `compression_s2c` strings to syslog in `tls_session` and `tcp_syn_fingerprint` events (fingerprint.py lines 240–252).
- **Missing**: Stored in **syslog only**, not in DB. Attacker table has `fingerprints` (bounty store) but no dedicated `kex_order_raw` column.
- **Path to recovery**: Read syslog archives and parse `kex_algorithms` field. But this is not queryable at scale.
- **Cheap fix**: Add `Attacker.kex_order_raw` (MEDIUMTEXT, JSON string list) and `kd_kex_order_hash` (similar to digraph simhash). Populate during sniffer event ingestion.
- **Priority**: v1 blocker if KEX ordering is committed to roadmap (currently only hash stored, raw data must be re-parsed from syslog).
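To make the gap concrete: a HASSH-style fingerprint is an MD5 over the semicolon-joined algorithm name-lists, so the ordering influences the hash but cannot be recovered from it. The sketch below shows the hash next to a JSON encoding of the raw list of the kind proposed for `kex_order_raw`. The function name and sample algorithm lists are illustrative only, not code from the sniffer or prober.

```python
import hashlib
import json


def hassh(kex, enc, mac, comp) -> str:
    """MD5 over 'kex;enc;mac;comp', each a comma-joined algorithm name-list."""
    joined = ";".join(",".join(names) for names in (kex, enc, mac, comp))
    # usedforsecurity=False (Python 3.9+): MD5 is used as a fingerprint, not for security.
    return hashlib.md5(joined.encode("ascii"), usedforsecurity=False).hexdigest()


# Hypothetical client offer, for illustration only.
KEX = ["curve25519-sha256", "ecdh-sha2-nistp256", "diffie-hellman-group14-sha256"]
ENC = ["chacha20-poly1305@openssh.com", "aes128-ctr"]
MAC = ["hmac-sha2-256"]
COMP = ["none"]

if __name__ == "__main__":
    print("hash:", hassh(KEX, ENC, MAC, COMP))
    # The raw, ordered list is what would need to be persisted to stay queryable.
    print("kex_order_raw:", json.dumps(KEX))
```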
#### Public key comment field
- **Status**: `not_captured`
- **Why**: SSH key comment is part of the OpenSSH wire format (only transmitted if key auth is used). Server-side sshd does not log it by default; would require PAM/auth hook instrumentation.
- **Missing**: No interception of public key authentication payloads.
- **Cheap fix**: Patch SSH server to emit auth_pubkey event with key comment extracted from wire format. Or use `net.ssh` library instrumentation.
- **Priority**: V2 backlog (valuable for key reuse fingerprinting, but rare).
#### Private key type advertised (Ed25519 / RSA / ECDSA)
- **Status**: `partial`
- **Where**: SSH transport carries key type in the public key authentication message. Sniffer cannot decode this (traffic is encrypted once key exchange completes, i.e. after SSH_MSG_NEWKEYS). Server-side sshd doesn't log it.
- **Missing**: Requires either passive PCAP of SSH-TRANSPORT (not available; encrypted) or server-side auth hook.
- **Cheap fix**: Patch sshd to emit `auth_pubkey_type` event during authentication.
- **Priority**: V2 (interesting but lower signal than key comment).
#### Agent forwarding requested?
- **Status**: `not_captured`
- **Why**: Agent forwarding is requested with an SSH_MSG_CHANNEL_REQUEST of type `auth-agent-req@openssh.com` on the session channel, which is sent encrypted after KEX.
- **Missing**: Would require decrypting SSH transport or instrumenting sshd auth hook.
- **Cheap fix**: Have sshd log the agent-forwarding request (or the presence of `SSH_AUTH_SOCK` in the session) to syslog.
- **Priority**: V2 (useful for lateral-movement detection).
#### Channel multiplexing pattern
- **Status**: `partial`
- **Where**: SSH service logs each command separately. Channel open/close events could be tracked, but no code currently does.
- **Missing**: Per-session channel state machine (open channels, their types, lifetime).
- **Cheap fix**: Instrument sshd or use SSH_MSG_CHANNEL_OPEN events in syslog to track simultaneous channels.
- **Priority**: V2 (rare; most attackers use sequential commands).
- **Where**: SSH server **always** sets `SSH_CLIENT` and `SSH_CONNECTION` in the child shell. Server-side user code (bashrc, commands) can read them. If attacker runs `echo $SSH_CLIENT`, it's visible in asciinema output.
- **Missing**: No **automatic** logging of these vars. Requires parsing asciinema for intentional queries or patching sshd to emit them.
- **Cheap fix**: Patch SSH PAM or auth hook to log `SSH_CLIENT` on successful auth. Or parse asciinema for `echo $SSH_*` commands.
- **Priority**: V2 (low value; mostly redundant with src_ip already in logs).
- **Where**: PCAP contains TCP timestamps (if present). Sniffer code extracts MSS, window size, options (fingerprint.py line 77–94). TCP options include timestamp flag (`has_timestamps`).
- **Missing**: Raw timestamp values (`opt_value` for "Timestamp" in scapy) are NOT extracted. Only boolean `has_timestamps` flag is stored. To compute clock skew, need timestamp values across multiple packets.
- **Path to recovery**: Raw PCAP analysis (if PCAPs are retained on disk). Each TCP packet has `[TCP option: Timestamp x, y]` which can be parsed post-hoc.
- **Cheap fix**: Extend sniffer to extract timestamp sequence numbers and RTT deltas. Store as per-flow timing summary in `tcp_flow_timing` event (which already captures flow metrics).
- **Priority**: V2 (requires PCAP or extended sniffer capture; useful for OS fingerprinting).
#### TCP ISN generator characteristics
- **Status**: `not_captured`
- **Why**: ISN is visible in PCAP (TCP seq number on SYN). Sniffer code tracks flow seqs for retransmit detection (line 850) but does not extract the initial SYN seq across multiple connections to analyze ISN patterns.
- **Missing**: No per-connection ISN logging. Would need to roll up ISN sequences across multiple SYNs to the same port.
- **Cheap fix**: On every SYN, log `syn_seq` in `tcp_syn_fingerprint` event. Post-hoc analysis can compute randomness metrics.
- **Priority**: V2 backlog (weak signal; ISN randomization is standard on modern OS).
#### TCP options ordering in SYN
- **Status**: `partial`
- **Where**: Sniffer extracts `options_sig` (line 87) via `_extract_options_order()` from scapy TCP options. This is a **signature string** (e.g., `"MSS,WScale,SAckOK,Timestamp"`).
- **Missing**: The signature is **aggregated**; we don't store the raw per-packet ordering. Also, `options_sig` is deduplicated in logs (only one event per unique signature per dedup window).
- **Path to recovery**: Raw PCAP analysis or re-parsing sniffer logs to extract the signature. But the signature is a good enough feature for OS fingerprinting.
- **Cheap fix**: Store `tcp_fingerprint` JSON in AttackerBehavior with raw options list (not just signature). Current schema (models.py line 174–177) only stores aggregated `{window, wscale, mss, options_sig}`.
- **Priority**: v1 improvement (low effort, already have options_sig; add raw list).
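
To make the difference between the aggregated signature and the raw list concrete, here is a small sketch that derives both from a scapy-style options list (a list of `(name, value)` tuples). The helper names are hypothetical, and whether the real `_extract_options_order()` filters entries such as `NOP` is not known from this document.

```python
import json
from typing import Any, List, Sequence, Tuple

Option = Tuple[Any, Any]  # scapy-style (name, value) pair


def options_signature(options: Sequence[Option]) -> str:
    """Comma-joined option names in on-the-wire order."""
    return ",".join(str(name) for name, _ in options)


def raw_options_json(options: Sequence[Option]) -> str:
    """JSON-encode the raw options list, keeping order and values."""

    def encode(value: Any) -> Any:
        if isinstance(value, bytes):
            return value.hex()  # bytes are not JSON-serialisable
        if isinstance(value, tuple):
            return list(value)  # e.g. Timestamp (tsval, tsecr)
        return value

    rows: List[List[Any]] = [[str(name), encode(value)] for name, value in options]
    return json.dumps(rows)
```

Storing the raw list next to `options_sig` keeps the existing dedup key intact while preserving values (MSS, WScale, padding) that the signature string discards.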
#### Initial congestion window ramp-up
- **Status**: `not_captured`
- **Why**: Requires detailed TCP state machine tracking (SYN, SYN-ACK, ACK sequence with packet sizes). Sniffer tracks `packets` count and `bytes` total per flow (line 844–868), but not per-packet sequence or ACK-clock dynamics.
- **Missing**: Per-packet payload sizes and ACK timing.
- **Cheap fix**: Extend `tcp_flow_timing` event to include per-packet sizes (as JSON list) or CWND estimation from ACK patterns.
- **Priority**: V2 backlog (very niche; useful for Reno vs. Cubic vs. BBR detection, but rare in honeypot context).
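
Given a per-packet size list, one crude way to approximate the sender's initial flight is to sum payload bytes until the first pause that is long relative to the RTT. The sketch below assumes a hypothetical input of `(capture_time, payload_len)` tuples for one direction of a flow and an RTT estimate supplied by the caller; the `gap_fraction` threshold is an arbitrary illustrative default.

```python
from typing import Sequence, Tuple


def initial_flight_bytes(
    packets: Sequence[Tuple[float, int]],
    rtt_s: float,
    gap_fraction: float = 0.5,
) -> int:
    """Bytes sent back-to-back before the sender first waits for ACKs.

    `packets` holds (capture_time_seconds, payload_len) for data segments
    from a single sender, in capture order.
    """
    total = 0
    prev_time = None
    for ts, size in packets:
        if size <= 0:
            continue  # skip pure ACKs and other empty segments
        if prev_time is not None and (ts - prev_time) >= gap_fraction * rtt_s:
            break  # a long pause marks the end of the first flight
        total += size
        prev_time = ts
    return total
```

Dividing the result by the negotiated MSS gives an approximate initial window in segments. Short application-limited bursts will under-estimate it, so this is only meaningful for flows that have more data to send than the window allows.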
#### Retransmit timing and backoff
- **Status**: `partial`
- **Where**: Sniffer tracks `retransmits` count per flow (lines 873–877, 922). Emitted in `tcp_flow_timing` event. No **timing** of retransmits, only count.
- **Why**: Sniffer computes mean/min/max inter-arrival time in milliseconds (lines 904–906), not microseconds. Modern pacing requires sub-millisecond precision.
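- **Missing**: Sniffer timestamps packets in userspace with `time.monotonic()` and reports gaps in milliseconds. The clock itself is finer-grained on Linux, but userspace scheduling jitter limits the usable precision; reliable sub-millisecond pacing would need OS-level timing hooks or PCAP with hardware timestamps.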
- **Cheap fix**: Upgrade sniffer to use PCAP timestamps (pcap.ts_resolution) if available; log microsecond-resolution inter-packet gaps.
- **Priority**: V2 backlog (requires infrastructure upgrade; marginal value on honeypots).
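
Assuming higher-resolution capture timestamps were available, the microsecond-resolution inter-packet gaps mentioned in the cheap fix could be summarised along these lines. `gap_stats_us` and its output keys are hypothetical; they only mirror the mean/min/max shape the sniffer is described as computing in milliseconds.

```python
from typing import Dict, Optional, Sequence


def gap_stats_us(timestamps_s: Sequence[float]) -> Optional[Dict[str, float]]:
    """Mean/min/max inter-packet gap in microseconds.

    `timestamps_s` are per-packet capture times in seconds, in capture
    order. Returns None when fewer than two packets were seen.
    """
    gaps = [round((b - a) * 1e6) for a, b in zip(timestamps_s, timestamps_s[1:])]
    if not gaps:
        return None
    return {
        "mean_us": sum(gaps) / len(gaps),
        "min_us": float(min(gaps)),
        "max_us": float(max(gaps)),
    }
```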
#### Window scaling multipliers
- **Status**: `captured`
- **Where**: Sniffer extracts `wscale` from TCP options (line 80); stored in `tcp_fingerprint` JSON and emitted in `tcp_syn_fingerprint` event.
- **Why**: Traffic is TLS-encrypted, and tracking requires an HTTP state machine (Set-Cookie responses vs. Cookie requests).
- **Missing**: Would need server-side HTTP middleware or browser instrumentation.
- **Cheap fix**: Add cookie jar logging to HTTP service (track which attacker cookies were accepted, rejected, resent).
- **Priority**: V2 (behavioral signal; interesting but niche).
---
### Per-Attacker, Aggregated/Derived (would live in new `AttackerAggregate` table)
#### Time-of-day activity distribution (chronotyping)
- **Status**: `partial`
- **Where**: Log entries have `timestamp` (datetime). All events are timestamped. Can compute hour-of-day histogram post-hoc.
- **Missing**: No aggregation table or computed features. Would live in new AttackerAggregate.
- **Cheap fix**: Batch job: group events by attacker + hour-of-day, compute distribution histogram. Store as JSON or new table.
- **Priority**: V2 (simple aggregation; good for clustering).
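
The batch job described above could be sketched roughly as follows. The input shape (`(attacker_id, timestamp)` pairs) and the function names are assumptions for illustration; naive datetimes are treated as UTC, which may not match how the real log timestamps are stored.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Tuple


def _utc_hour(ts: datetime) -> int:
    # Assumption: naive timestamps are already in UTC.
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc).hour


def hour_histogram(timestamps: Iterable[datetime]) -> List[int]:
    """24-bucket histogram of event counts by UTC hour of day."""
    counts = Counter(_utc_hour(ts) for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


def chronotype_by_attacker(
    events: Iterable[Tuple[str, datetime]],
) -> Dict[str, List[int]]:
    """Group (attacker_id, timestamp) events and build one histogram each."""
    grouped: Dict[str, List[datetime]] = defaultdict(list)
    for attacker_id, ts in events:
        grouped[attacker_id].append(ts)
    return {attacker: hour_histogram(stamps) for attacker, stamps in grouped.items()}
```

A fixed-length list of 24 integers serialises cleanly to JSON and can be compared across attackers without any key alignment.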
#### Session duration distribution
- **Status**: `partial`
- **Where**: SessionProfile schema (DEVELOPMENT_V2.md) includes `session_duration_s`. Asciinema files are per-decky-per-day, so duration can be computed.
- **Missing**: No SessionProfile table yet; no aggregation of durations across sessions.
- **Missing**: No per-attacker ratio column in AttackerAggregate. Would be simple division: `exfil_events / recon_events`.
- **Cheap fix**: Compute ratio in profiler job; store in new AttackerAggregate or as extension to AttackerBehavior.
- **Priority**: V2 (low effort; useful for threat level scoring).
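
A minimal sketch of the `exfil_events / recon_events` division, with a guard for attackers that never produced a recon event. Both event-type sets below are hypothetical placeholders; the profiler's real `EXFIL_EVENT_TYPES` contents are not shown in this document.

```python
from typing import Iterable, Optional

# Hypothetical placeholder catalogs, for illustration only.
RECON_EVENT_TYPES = frozenset({"port_scan", "banner_grab"})
EXFIL_EVENT_TYPES = frozenset({"file_download", "data_upload"})


def recon_to_action_ratio(event_types: Iterable[str]) -> Optional[float]:
    """Return exfil_events / recon_events, or None when there is no recon."""
    recon = 0
    exfil = 0
    for event_type in event_types:
        if event_type in RECON_EVENT_TYPES:
            recon += 1
        elif event_type in EXFIL_EVENT_TYPES:
            exfil += 1
    if recon == 0:
        return None  # keep the column NULL instead of dividing by zero
    return exfil / recon
```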
#### Lateral movement style
- **Status**: `not_captured`
- **Why**: Requires graph traversal (attacker hopping between deckies). Correlation engine (correlation/engine.py) should track this, but no explicit "lateral movement style" feature (sequential vs. parallel, target selection heuristic).
- **Missing**: No code analyzing lateral movement pattern (which deckies were touched, in what order, dwell time per decky).
- **Cheap fix**: Extend CorrelationEngine to build per-attacker decky traversal graph; compute metrics (average dwell time, fan-out ratio, revisit frequency).
- **Why**: Requires semantic tagging of events (is this persistence activity? exfil activity?). Profiler has `EXFIL_EVENT_TYPES` (line 59–62) but no persistence catalog.
- **Missing**: No code to classify persistence attempts (cron jobs, reverse shells, privilege escalation).
- **Cheap fix**: Add PERSISTENCE_EVENT_TYPES list; compute persistence_start vs. exfil_start timestamps; store in AttackerBehavior or AttackerAggregate.
- **Priority**: V2 (requires event taxonomy; valuable for threat classification).
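
The `persistence_start` vs. `exfil_start` comparison could look something like the sketch below once a persistence catalog exists. The event tuple shape, the function names, and the example type names in the usage note are assumptions; timestamps only need to be mutually comparable (floats or datetimes).

```python
from typing import Any, Collection, Dict, Iterable, Optional, Tuple

Event = Tuple[str, Any]  # (event_type, timestamp)


def first_seen(events: Iterable[Event], types: Collection[str]) -> Optional[Any]:
    """Earliest timestamp among events whose type is in `types`."""
    times = [ts for event_type, ts in events if event_type in types]
    return min(times) if times else None


def persistence_vs_exfil(
    events: Iterable[Event],
    persistence_types: Collection[str],
    exfil_types: Collection[str],
) -> Dict[str, Any]:
    """Report which phase an attacker reached first, if both occurred."""
    events = list(events)  # the iterable is scanned twice
    p_start = first_seen(events, persistence_types)
    e_start = first_seen(events, exfil_types)
    if p_start is None or e_start is None:
        order = None
    elif p_start < e_start:
        order = "persistence_first"
    elif e_start < p_start:
        order = "exfil_first"
    else:
        order = "simultaneous"
    return {"persistence_start": p_start, "exfil_start": e_start, "order": order}
```

For example, calling it with a hypothetical `{"cron_install"}` persistence set and `{"file_download"}` exfil set yields `order == "persistence_first"` when the cron event carries the earlier timestamp.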
#### Tool-chain ordering
- **Status**: `partial`
- **Where**: Profiler logs tool guesses in `AttackerBehavior.tool_guesses` (line 183, behavioral.py lines 76–105). Tools are matched by beacon timing + header patterns.
- **Missing**: No **ordering** — tools are listed but not sequenced by first-appearance time.
- **Cheap fix**: Sort tool_guesses by first event timestamp; store as ordered list. Compute tool transition graph (tool A → tool B over time).
- **Priority**: V2 (interesting; small extension to existing tool attribution).
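
Both halves of the cheap fix, ordering by first appearance and a tool transition graph, fit in a few lines. This sketch assumes tool observations are available as `(tool_name, timestamp)` pairs, which is not how `tool_guesses` is stored today; the function names are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple

Observation = Tuple[str, Any]  # (tool_name, timestamp)


def order_tools(observations: Iterable[Observation]) -> List[str]:
    """Tool names sorted by the time each was first observed."""
    first: Dict[str, Any] = {}
    for tool, ts in observations:
        if tool not in first or ts < first[tool]:
            first[tool] = ts
    # Tie-break on the name so the output is deterministic.
    return sorted(first, key=lambda tool: (first[tool], tool))


def tool_transitions(observations: Iterable[Observation]) -> Dict[Tuple[str, str], int]:
    """Count A -> B switches between consecutive observations over time."""
    ordered = sorted(observations, key=lambda obs: obs[1])
    counts: Counter = Counter()
    previous = None
    for tool, _ in ordered:
        if previous is not None and tool != previous:
            counts[(previous, tool)] += 1
        previous = tool
    return dict(counts)
```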
#### Error-response psychology
- **Status**: `not_captured`
- **Why**: Requires analyzing how attacker reacts to failures (e.g., retry frequency after auth failure, command error recovery). Would need per-command success/failure tracking.
- **Missing**: No error-categorization in logs; would need service-level event typing (auth_failure vs. auth_success, exec_error vs. exec_success).
- **Cheap fix**: Extend service events to include success/failure indicators; compute attacker error-response metrics (retry rate, time-to-recovery, behavior change after error).
- **Priority**: V2 backlog (niche; good for human vs. bot discrimination).
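
Once service events carry success/failure indicators, retry rate and time-to-retry could be derived as in the sketch below. The `(timestamp_seconds, kind, ok)` tuple shape, the 30-second window, and the metric names are all assumptions chosen for illustration.

```python
from typing import Dict, Iterable, List, Optional, Tuple

ServiceEvent = Tuple[float, str, bool]  # (timestamp_seconds, kind, ok)


def error_response_metrics(
    events: Iterable[ServiceEvent],
    retry_window_s: float = 30.0,
) -> Dict[str, Optional[float]]:
    """Retry rate and mean retry delay after failed events.

    A failure counts as retried when another event of the same kind
    (successful or not) follows within `retry_window_s` seconds.
    """
    ordered = sorted(events, key=lambda event: event[0])
    failures = 0
    delays: List[float] = []
    for index, (ts, kind, ok) in enumerate(ordered):
        if ok:
            continue
        failures += 1
        for next_ts, next_kind, _ in ordered[index + 1:]:
            if next_ts - ts > retry_window_s:
                break  # events are time-sorted, nothing later can qualify
            if next_kind == kind:
                delays.append(next_ts - ts)
                break
    return {
        "failures": float(failures),
        "retry_rate": len(delays) / failures if failures else None,
        "mean_retry_delay_s": sum(delays) / len(delays) if delays else None,
    }
```

Very short, very regular retry delays tend to point at automation, while longer and more variable delays are more typical of a human operator; any thresholds for that distinction would need to be tuned on real data.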
---
## Table Recommendations
### `AttackerBehavior` — Current & Recommended Additions
**Currently captured** (verified in models.py lines 161–194):
| Per-keystroke timing | partial | Asciinema "i" events with t timestamps | Files on disk, not ingested to DB | Implement SessionProfile table + ingest job | v1 blocker |
| Control-character stream | partial | Asciinema keystroke bytes | Same as above (files only) | Same as above | v1 blocker |
| Inter-command think time | not_captured | Requires prompt detection | Heuristic (line ending in $/#) not implemented | Post-hoc: regex + gap detection over asciinema | V2 |
| Pause before sensitive cmd | not_captured | Would be in asciinema timing | Requires command-line parsing + gap detection | Off-line analysis of asciinema | V2 |
| Flag preferences | not_captured | Asciinema input has typed flags | No parsing or normalization | Regex-parse and canonicalize flags from asciinema | V2 |
| Typo patterns | not_captured | Raw keystroke sequence in asciinema "i" | Requires keystroke-by-keystroke reconstruction | Parse "i" events with backspace markers; reconstruct line state | V2 |
| Editor choice | partial | Attacker.commands shows editor launch | No aggregation or time-in-editor | Count editor invocations; store preference in SessionProfile | V2 |
| Shell history usage | partial | Command input shows !, ^, !! | No parsing for history operators | Regex-scan for shell history syntax; count | V2 |
| KEX algorithm order | partial | Syslog event kex_algorithms= field | Not persisted to DB (only in syslog) | Add AttackerBehavior.kex_order_raw (MEDIUMTEXT, JSON) | v1 blocker |
| Public key comment | not_captured | SSH wire format (auth_pubkey) | Requires server-side auth hook | Patch sshd to emit auth_pubkey_comment event | V2 |
| Private key type | partial | SSH wire format (auth algorithm OID) | Encrypted after KEX; needs sshd hook | Patch sshd to emit auth_key_type event | V2 |
| Channel multiplexing | partial | SSH service logs commands separately | No channel state machine | Instrument sshd SSH_MSG_CHANNEL_OPEN events | V2 |
| SSH_CLIENT env vars | captured | Server sets automatically; queryable via shell | No automatic logging | Patch sshd PAM to emit SSH_CLIENT on auth | V2 |
| **Network/Transport** |
| TCP timestamp skew | partial | PCAP + sniffer has has_timestamps flag | Only boolean; not timestamp values | Extract timestamp seq numbers in sniffer | V2 |
| TCP ISN generator | not_captured | PCAP SYN seq field | No per-connection ISN logging | Log syn_seq in tcp_syn_fingerprint event | V2 |
| TCP options ordering | partial | Sniffer extracts options_sig signature | Aggregated string; no raw order per-packet | Extend tcp_fingerprint JSON with raw options list | v1 improvement |
| Initial congestion window | not_captured | Would require per-packet ACK analysis | Not tracked in sniffer | Extend tcp_flow_timing to include payload sizes list | V2 |
| Retransmit timing+backoff | partial | Sniffer counts retransmits; no timing | RTO/backoff timing not logged | Extend event to include RTO deltas | V2 |
| MTU/path-MTU discovery | partial | MSS in TCP SYN; byte counts per flow | No ICMP fragmentation-needed events | Add ICMP processing; correlate with TCP flows | V2 |
| Packet pacing (μs) | not_captured | Sniffer uses millisecond granularity | Needs PCAP hardware timestamps or OS hooks | Upgrade to sub-millisecond timing | V2+ |
| HTTP header ordering | not_captured | Encrypted; requires service logging | Service doesn't log raw headers | Patch HTTP service to log header order | V2 |
| Cookie handling | not_captured | Requires HTTP state machine | Not tracked | Add cookie jar logging to HTTP service | V2 |
| **Aggregated/Derived** |
| Time-of-day distribution | partial | Timestamps on all events | No aggregation table | Batch job: hour-of-day histogram → AttackerAggregate | V2 |
| Session duration dist | partial | SessionProfile would have duration | No SessionProfile table yet | Implement SessionProfile + duration stats | V2 |
| Recon-to-action ratio | partial | AttackerBehavior.phase_sequence | No per-attacker ratio column | Compute ratio in profiler; store in AttackerAggregate | V2 |
| Lateral movement style | not_captured | Correlation engine has traversal path | No traversal pattern analysis | Extend engine to compute dwell time + fan-out metrics | V2 |
The `SessionProfile` schema (table, schema_version field, numeric features) is designed to be the federation wire format. **No changes needed for v1**, but ensure schema_version is in the table definition from day one so gossip compatibility is straightforward in v2.
port: int = typer.Option(8765, "--port", help="Port for the worker agent"),
host: str = typer.Option("0.0.0.0", "--host", help="Bind address for the worker agent"),  # nosec B104
agent_dir: Optional[str] = typer.Option(None, "--agent-dir", help="Worker cert bundle dir (default: ~/.decnet/agent, expanded under the running user's HOME — set this when running as sudo/root)"),
daemon: bool = typer.Option(False, "--daemon", "-d", help="Detach to background as a daemon process"),
no_forwarder: bool = typer.Option(False, "--no-forwarder", help="Do not auto-spawn the log forwarder alongside the agent"),
) -> None:
"""Run the DECNET SWARM worker agent (requires a cert bundle in ~/.decnet/agent/).
By default, `decnet agent` auto-spawns `decnet forwarder` as a fully-
detached sibling process so worker logs start flowing to the master
without a second manual invocation. The forwarder survives agent
restarts and crashes — if it dies on its own, restart it manually
with `decnet forwarder --daemon …`. Pass --no-forwarder to skip.
deckies: Optional[int] = typer.Option(None, "--deckies", "-n", help="Number of deckies to deploy (required without --config)", min=1),
interface: Optional[str] = typer.Option(None, "--interface", "-i", help="Host NIC (auto-detected if omitted)"),
subnet: Optional[str] = typer.Option(None, "--subnet", help="LAN subnet CIDR (auto-detected if omitted)"),
ip_start: Optional[str] = typer.Option(None, "--ip-start", help="First decky IP (auto if omitted)"),
services: Optional[str] = typer.Option(None, "--services", help="Comma-separated services, e.g. ssh,smb,rdp"),
randomize_services: bool = typer.Option(False, "--randomize-services", help="Assign random services to each decky"),
distro: Optional[str] = typer.Option(None, "--distro", help="Comma-separated distro slugs, e.g. debian,ubuntu22,rocky9"),
randomize_distros: bool = typer.Option(False, "--randomize-distros", help="Assign a random distro to each decky"),
log_file: Optional[str] = typer.Option(DECNET_INGEST_LOG_FILE, "--log-file", help="Host path for the collector to write RFC 5424 logs (e.g. /var/log/decnet/decnet.log)"),