Commit Graph

855 Commits

Author SHA1 Message Date
99adbebe75 refactor(db): extract SwarmMixin
Moves the 7 swarm-host CRUD methods into sqlmodel_repo/swarm.py.
2026-04-28 14:42:58 -04:00
85c914e754 refactor(db): extract AttackerIntelMixin
Moves upsert_attacker_intel, get_attacker_intel_by_uuid,
and get_unenriched_attackers into sqlmodel_repo/attacker_intel.py.
Composed onto SQLModelRepository via mixin inheritance.
2026-04-28 14:40:36 -04:00
e16f47ad24 refactor(db): extract _safe_session/_detach_close to _helpers.py
Module-level session helpers move into sqlmodel_repo/_helpers.py.
__init__.py re-exports them so external import paths
(decnet.web.db.sqlmodel_repo._safe_session) keep resolving.
2026-04-28 14:38:26 -04:00
4167345d51 refactor(db): convert sqlmodel_repo.py to a package
Pure rename — the old monolithic 3505-line file becomes
decnet/web/db/sqlmodel_repo/__init__.py. No code changes.
Subsequent commits will extract per-domain mixins out of __init__.py
to mirror the topical layout used by decnet/web/db/models/.
2026-04-28 14:37:18 -04:00
6d8c90777d chore: remove vulture-flagged dead code, add whitelist
- plain.py: drop `or True` short-circuit + unreachable return; drop now-unused _HASH_HINTS
- ingester.py: drop unused `current_position` param from _flush_batch
- vulture_whitelist.py: document remaining false positives (FastAPI Depends side-effects, IMAP uid_mode where UID==seq)
2026-04-28 14:30:12 -04:00
b994250ef6 dev(ci): added CVE-2026-3219 to ignore vulns; no fix is yet available
Some checks failed
CI / Lint (ruff) (push) Successful in 13s
CI / SAST (bandit) (push) Successful in 21s
CI / Dependency audit (pip-audit) (push) Successful in 34s
CI / Test (Standard) (3.11) (push) Successful in 13m21s
CI / Test (Live) (3.11) (push) Successful in 1m45s
CI / Test (Fuzz) (3.11) (push) Failing after 3h1m24s
CI / Merge dev → testing (push) Has been cancelled
CI / Prepare Merge to Main (push) Has been cancelled
CI / Finalize Merge to Main (push) Has been cancelled
2026-04-28 14:24:57 -04:00
b4adc7246f fixed: deleted line from pyproject
Some checks failed
CI / Lint (ruff) (push) Successful in 17s
CI / SAST (bandit) (push) Successful in 24s
CI / Dependency audit (pip-audit) (push) Failing after 32s
CI / Test (Standard) (3.11) (push) Has been skipped
CI / Test (Live) (3.11) (push) Has been skipped
CI / Test (Fuzz) (3.11) (push) Has been skipped
CI / Merge dev → testing (push) Has been skipped
CI / Prepare Merge to Main (push) Has been skipped
CI / Finalize Merge to Main (push) Has been skipped
2026-04-28 13:03:14 -04:00
674ac7dd13 test(db): cover BaseRepository.update_identity_fingerprints
DummyRepo couldn't instantiate — TLS-cert fingerprint rollup added a new
abstract method without a stub here. Add the override and a call site so
the abstract pass body is hit.
2026-04-28 13:01:37 -04:00
cc6abf7256 fix(tests/stress): eliminate 0-request flakes in locust runs
Three independent issues conspired to make stress tests record 0 requests:

1. Every virtual user did /auth/login in on_start. With 1000 users in a
   spike window, bcrypt-bound logins never finished and on_start failed
   for all users — aggregated requests stayed at 0. Pre-fetch a single
   admin token in the fixture (cached per-host) and pass it via
   DECNET_STRESS_TOKEN so locust users skip the login storm.

2. Locust exits non-zero on any request failure by default, causing
   run_locust to throw away an otherwise valid stats CSV. Pass
   --exit-code-on-error 0 so per-test assertions are the only fail gate.

3. test_stress_sustained ran two locust subprocesses against the same
   uvicorn. Phase 1's keep-alive connections wedged phase 2 into 0
   recorded requests ~2/3 of the time. Refactored stress_server into a
   start_stress_server() context manager and gave each phase its own
   uvicorn.

Stable 3/3 on full suite, 3/3 on test_stress_sustained alone.
2026-04-28 13:01:11 -04:00
681931d9bb docs(roadmap): tick certificate details and three sibling roadmap items 2026-04-28 11:41:17 -04:00
72cc928ebf feat(prober-cert): roll up fingerprints onto AttackerIdentity
Brings the federation-gossip columns on AttackerIdentity to life —
ja3_hashes, hassh_hashes, and the new tls_cert_sha256 — by projecting
the union of every member observation's fingerprints JSON onto the
identity at clusterer create / link / merge time.

- decnet/profiler/identity_rollup.py: pure extract_fp_summaries()
  reads the production bounty shape (payload.fingerprint_type +
  payload.{ja3,hash,cert_sha256}) and returns deduped+sorted JSON
  list[str] per family, or None when a family has no signal so the
  column stays NULL instead of '[]'.
- BaseRepository.update_identity_fingerprints + SQLModel impl: one
  idempotent write that overwrites the three summary columns and
  bumps updated_at.
- ConnectedComponentsClusterer: after every per-component
  reconciliation (fresh-create OR existing-merge+link), recomputes
  and writes the rollup for the target identity. Wrapped in a
  best-effort helper so a write failure logs but never breaks the
  tick.
- Tests: extract_fp_summaries unit (dedup, sort determinism,
  unknown types ignored, malformed JSON, nested-stringified
  payloads, non-string values); end-to-end clusterer ticks
  populate the columns on create + on later observation links;
  no-fingerprint clusters keep the columns NULL.
2026-04-28 11:28:54 -04:00
9ab43b4ea4 feat(prober-cert): UI for active TLS certificate captures
- FpCertificate renders the new cert_sha256 field (truncated, with
  full hash on hover) and a FROM line carrying the prober-side
  target_ip/port so the source is visible.
- tls_certificate payloads split on target_ip presence: prober certs
  land under ACTIVE PROBES, sniffer certs under PASSIVE FINGERPRINTS.
  Two synthetic fpType keys (tls_certificate_active /
  tls_certificate_passive) drive the bucketing without disturbing
  the on-the-wire fingerprint_type.
2026-04-28 11:23:34 -04:00
5f8149daee feat(prober-cert): capture leaf TLS cert after successful JARM
JARM probes are crafted ClientHellos with weird ciphers — they never
complete a real handshake, so the peer cert isn't reachable from
those sockets. After a non-empty JARM hash proves the port speaks
TLS, do a separate ssl.wrap_socket() against the same (ip, port) to
fetch and parse the leaf cert.

- decnet/prober/tlscert.py: fetch + parse via cryptography lib;
  swallows all connect/handshake/parse failures (returns None).
- decnet/prober/worker.py::_capture_tls_cert: emits a tls_certificate
  event with subject_cn / issuer / SANs / validity / SHA-256 +
  publishes on the bus. Wired from _jarm_phase only when JARM
  succeeds, so non-TLS ports never trigger a second connect.
- Tests cover happy path, cert-fetch failure, defense-in-depth crash,
  empty-JARM skip, publish_fn, and parser edge cases (garbage DER,
  empty bytes, missing SAN extension, non-self-signed).
2026-04-28 11:14:44 -04:00
4749c972e5 feat(prober-cert): schema for active TLS cert capture
Adds storage for TLS certificate details collected from attacker-run
servers by the active prober (sibling to the existing JARM probe).

- AttackerIdentity.tls_cert_sha256 / Campaign.tls_cert_sha256:
  JSON list[str] columns mirroring ja3_hashes / hassh_hashes for
  federation gossip.
- ingester clause 9b: emits a 'tls_certificate' fingerprint bounty
  when a prober event carries subject_cn (disjoint from the existing
  sniffer-gated clause).
- Prober-side capture (ssl.wrap_socket follow-up after JARM) and
  profiler rollup land in sibling commits.
2026-04-28 11:09:25 -04:00
e986e81421 fix(test-schemathesis): drop unsupported_method check
The check expects 405 for any HTTP method not declared on a path.
DECNET's topology router has a static `/topologies/services` (GET only)
sibling to a parameterized `/topologies/{topology_id}` (DELETE), so a
DELETE on the static path falls through to the parameterized route and
hits auth, which returns 401 — by design. Leaking 405-vs-401 would let
unauthenticated callers enumerate valid topology UUIDs.

The same shape applies to other static/dynamic sibling pairs across
the API. The check is fundamentally incompatible with that routing
strategy; document the omission inline.
2026-04-28 10:20:43 -04:00
ccc8619387 fix(test-schemathesis): disable rate limiter in fuzz subprocess
Schemathesis fires up to 3000 examples per endpoint. POST /auth/login
caps at 10/5min per IP, so the second example onward returns 429 and
the positive_data_acceptance check flags it as RejectedPositiveData
(its allowed-status list is hardcoded in schemathesis to
2xx/401/403/404/409/5xx, so OpenAPI tweaks can't fix it).

DECNET_LIMITER_ENABLED=false exists for exactly this case (see
limiter.py docstring on stress/load testing).

Reverts the custom_openapi shim from 5d88346 / 9b1168c — the endpoint
already declares 429 in its responses= map (api_login.py:38), and the
shim turned out to address a problem that wasn't there. Drop the
companion test along with it.
2026-04-28 09:51:49 -04:00
9b1168ce0b fix(api): scope 429 OpenAPI injection to rate-limited routes
Previous commit advertised 429 on every operation. Only routes
decorated with @limiter.limit can actually return slowapi's 429 —
currently just POST /api/v1/auth/login. Documenting it elsewhere is
dishonest and would mislead clients into expecting a response the
server cannot produce.

Walk slowapi's _route_limits / _dynamic_route_limits registries to
identify decorated endpoints, match them to FastAPI routes by
{module}.{name}, and only inject 429 on those.

Existing per-route 429 declarations (e.g. SSE connection-cap on
events streams via sse_limits) are untouched.
2026-04-28 01:00:34 -04:00
5d883466a2 fix(api): advertise 429 on every operation in OpenAPI
SlowAPI middleware can short-circuit any request with 429 once a
per-route or per-IP rate limit fires (e.g. POST /api/v1/auth/login is
capped at 10/5min). The OpenAPI spec did not declare 429 on any
operation, so schemathesis flagged legitimate rate-limit responses as
RejectedPositiveData / status-code-nonconformance failures.

Override app.openapi to inject a generic 429 response object on every
HTTP operation in the generated schema. Add a contract test that fails
if any operation drops the 429 advertisement.
2026-04-28 00:58:37 -04:00
6b407e8c9c fix(tests): align stale tests with current behavior
- swarm/test_swarm_api, swarm/test_heartbeat: replace deprecated
  asyncio.get_event_loop().run_until_complete() with asyncio.run();
  the former raises in 3.11 once another test has set+closed a loop on
  the main thread.
- prober/test_prober_bus, prober/test_prober_worker: extend tcp_fingerprint
  mocks with tos/dscp/ecn/server_isn so the worker doesn't KeyError into
  the prober_error branch.
- services/test_service_isolation: collector now retries on event-stream
  errors instead of exiting; assert it stays running and cancel cleanly.
- live/test_imap_live, live/test_pop3_live: log format emits
  outcome="failure", not "failed".
- live/test_service_isolation_live: is_service_container accepts label
  OR state-name; rewrite the empty-state test against a synthetic
  unlabeled container instead of the host's real fleet.
2026-04-28 00:44:40 -04:00
8344b539c8 fix(ssh-template): drop sshd/pam_unix native chatter at rsyslog
OpenSSH's native syslog ("Failed password", "Connection from",
"Connection closed by …") and the pam_unix lines emitted from sshd's
PAM stack add no signal beyond what auth-helper already captures as
structured login_attempt events. They cluttered the dashboard and
arrived without an SD wrapper, forcing prose-IP heuristics in the
collector.

Add a `:programname, isequal, "sshd" stop` rule above the forwarding
actions in /etc/rsyslog.d/50-journal-forward.conf. pam_unix lines from
sshd inherit programname=sshd so the same rule covers both. sudo /
login / su pam_unix lines keep flowing (different programname), so
post-login privilege escalation telemetry is preserved.
2026-04-27 23:26:53 -04:00
9350ce195a fix(collector,correlation): extract attacker IP from sshd/pam free-form prose
Native sshd and pam_unix lines route through rsyslog without the
relay@55555 SD wrapper and without key=value pairs, so attacker_ip
fell through to "Unknown". Add a prose-IP fallback to both parsers:
anchored patterns (from/rhost/client/src) win first so we never pick
the local listener in "Connection from X port Y on Z port 22", with
a bare-IPv4 scan as the last resort.
2026-04-27 23:16:42 -04:00
3c571cce5a fix(correlation): prober events no longer count as attacker traversal
The prober writes events with hostname=decnet-prober and target_ip=
<the attacker being fingerprinted>. The parser pulls target_ip into
attacker_ip (it's one of _IP_FIELDS), which is correct for indexing
fingerprints under the attacker — but it had a side effect: every
fingerprinted attacker had two distinct deckies on file (the real
decoy they touched + decnet-prober) and the correlation engine's
traversals() classified that as lateral movement. Live dashboard
showed bogus "dmz-gateway -> decnet-prober" paths and TRAVERSAL
badges on attackers who'd done nothing but knock on the front door.

The prober is internal infrastructure, not a hop. Filter the
"decnet-" namespace out of distinct-decky counts and hop paths in
the engine. Fingerprints stay attached to the attacker profile via
the existing per-IP event index — just no longer as traversal.
2026-04-27 23:02:23 -04:00
e03a6d10a0 fix(collector): retry on event-stream errors and add periodic reconciler
Hit live on first VPS deploy: a window between the initial
client.containers.list() snapshot and the client.events() start-event
stream let topology service containers slip through, requiring an
operator restart for them to be picked up.

Two fixes:

* `_watch_events` now wraps the events() call in a retry loop with
  exponential backoff (1s -> 30s cap). A docker.errors.APIError, daemon
  reload, or SDK stream-decode hiccup used to make the executor task
  return cleanly, leaving the collector "running" with no event
  subscription. Future container starts were silently dropped until
  the unit was restarted.

* New `_reconcile_loop` async task ticks every
  DECNET_COLLECTOR_RECONCILE_S (default 30s), re-scans
  client.containers.list(), and calls _spawn for any service container
  not already in `active`. Belt to the event watcher's suspenders:
  even if a start event is dropped during a reconnect window, the
  reconciler picks it up within one cycle. Also prunes finished
  futures from `active` so the dict's bounded by current container
  count rather than agent lifetime churn.
2026-04-27 22:56:13 -04:00
c5db1d7ba2 fix(config-ini): strip inline # and ; comments from values
The module docstring teaches inline comments — `mode = master    # or
"agent"` is the canonical example for the [decnet] section. Python's
configparser ignores those by default unless inline_comment_prefixes
is set explicitly, so the comment became part of the value and
downstream validators rejected it ("mode must be 'agent' or 'master',
got 'master                       # or \"agent\"'").

Hit live on first VPS deploy: every CLI invocation crashed at import
time with a stack trace that didn't make it obvious the docstring's
example was the trigger. Now the parser does what the docs promise.
2026-04-27 22:55:58 -04:00
0b1a17b4eb fix(agent): pass --always-recreate-deps so service netns shares stay fresh
Decky service containers join their base via `network_mode:
container:<base>` and Docker binds that share at service start time. If
`docker compose up` recreates a base (e.g. ports: changes after a
forwards_l3 toggle) but decides services are unchanged, services keep
a stale FD into the destroyed namespace and end up with only `lo` — so
external traffic hits a closed port on the live base and gets RST.

Hit live on the first VPS deploy: external SSH to the dmz-gateway was
refused while sshd was listening, because base and service netns
inodes had drifted apart. `--always-recreate-deps` makes compose
rebuild every dependent whenever its base is recreated, removing the
race entirely.
2026-04-27 22:55:48 -04:00
0a525ebd37 fix(web): proxy follows DECNET_API_HOST instead of hardcoding 127.0.0.1
The dashboard's /api/* proxy hardcoded 127.0.0.1 as the target host.
That works when the API binds to a wildcard or to loopback, but
breaks the moment an operator binds the API to a specific address —
e.g. a Tailscale IP for tailnet-only deploys: the API stops listening
on loopback entirely and the proxy gets ECONNREFUSED on every request.

The web command now reads DECNET_API_HOST and falls back to loopback
only when the API is on a wildcard (0.0.0.0 / :: / unset). A new
--api-host flag overrides at the CLI level.
2026-04-27 22:55:25 -04:00
673bc5b819 ops(init): ship logrotate config so /var/log/decnet can't fill the disk
Without rotation, the syslog listener and per-host collector grow
/var/log/decnet/ without bound — a noisy attacker (or an active
probe storm) fills the disk in hours on a small VPS. New
deploy/logrotate.d/decnet caps at 7 daily rotations or 100 MiB,
whichever comes first, and uses copytruncate because the ingester
and forwarder hold the files open via Python and won't reopen on
a rename rotation.

Wire install / remove into `decnet init` and `decnet init --deinit`
alongside the existing tmpfiles.d / polkit handling.
2026-04-27 21:26:13 -04:00
5415e98458 sec(api): mode-gate and eager-load JWT secret in lifespan
Refuse to start decnet.web.api when DECNET_MODE=agent (unless the
operator explicitly opts into dual-role with DECNET_DISALLOW_MASTER=
false). The Typer CLI already hides master-only commands on agents,
but a misconfigured systemd unit or a direct uvicorn invocation
would bypass that — now the lifespan itself refuses, before any
worker, DB or bus comes up.

Resolve DECNET_JWT_SECRET eagerly at startup so a missing or known-
bad value fails at boot rather than on the first auth-gated request.
The lazy-load shape stays useful for non-master CLIs.
2026-04-27 21:26:03 -04:00
1a7da33375 sec(env): refuse to start master API with footgun public-binding config
Add validate_public_binding() called from the master API lifespan: when
DECNET_API_HOST is non-loopback, refuse to start if DECNET_CORS_ORIGINS
still contains a loopback origin (catches the "operator flipped to
0.0.0.0 to make it work and forgot to update CORS" footgun) or if
DECNET_CANARY_HTTP_BASE is plaintext http:// to a non-loopback host.
Log CRITICAL when DECNET_LIMITER_ENABLED=false on a public binding.
The validator no-ops under pytest so unrelated suites don't trip on it.

Add DECNET_VERIFY_HOSTNAME env knob; AgentClient and UpdaterClient
consult it when verify_hostname is None, giving production deploys
TLS hostname verification on top of the existing CA + fingerprint pin.
Default off so dev enrollments with mismatched SANs keep working.
2026-04-27 21:15:15 -04:00
28e2a93355 sec(updater): harden tarball extraction and verify sha256 before extract
Reject symlinks, hardlinks, device nodes and FIFOs in update tarballs;
validate each member's resolved path stays under dest after symlink
resolution; cap uncompressed size at 256 MiB to bound gzip-bomb damage;
strip setuid/setgid bits from extracted modes.

Add an optional sha256 form field to /update and /update-self; the
master client computes and sends it on every push, the executor
refuses to extract on mismatch. mTLS already authenticates the
master, so this is defence-in-depth against in-transit corruption
and gives operators a way to pin "exactly these bytes" for vetted
releases.
2026-04-27 21:14:48 -04:00
1de4136ed9 style(realism-ui): adopt the persona-page design language
Both pages now layer on DeckyFleet.css + PersonaGeneration.css and use
the project's house vocabulary — fleet-root shell, page-header with
title-group + actions, btn / btn.violet / btn.ghost, info-banner with
the violet left rule, and the dim/matrix/alert text accents.

RealismConfig: inputs are flush-styled weight-input fields with a
violet focus ring; section heads carry a TOTAL badge; canary rows get
the project's amber accent; canary probability lives in a panel-bordered
slider row.

SyntheticFiles: the inline-styled table is now a styled .files-table
with the standard hover affordance, the filter-row uses tweak-group
label+select pairs, the drawer carries .drawer-eyebrow / .drawer-title
/ .meta-grid in the same style as the canary token drawer, and pager
buttons share the .btn.ghost.small treatment.

No behavioural change.
2026-04-27 18:08:58 -04:00
2950fc216e feat(realism-ui): human-readable content_class labels
Single source of truth in decnet_web/src/realism/labels.ts: maps each
ContentClass enum value to a friendly display name ("Note",
"Cron Log", "Canary · AWS Credentials", …). Used by RealismConfig
(weight tables + class filter dropdown) and SyntheticFiles (table row
+ drawer detail).

Canary classes get a subtle amber accent so the dashboard's read of
"this row is callback-bearing" doesn't depend on prefix-spotting in
mono text. Raw enum value still appears in dim mono next to the label
so an operator copy/pasting from logs or grepping the codebase still
finds it.

No backend change: the wire shape is still the snake_case enum; the
beautification is render-time only.
2026-04-27 18:04:33 -04:00
56a88d7bd4 feat(realism-ui): operator panel for planner weights + canary probability
New /realism-config page sits next to Persona Generation and
Synthetic Files under the Automation nav. Editable weight tables for
user / system / canary content classes (with live percent share),
plus a slider for canary_probability.

Wires GET/PUT /api/v1/realism/config — viewer can read; admin
required to save. Validation errors from the API are surfaced inline
rather than swallowed; the SAVE button refreshes from the server's
canonical snapshot so the operator sees exactly what landed (matters
because cross-list entries are silently dropped server-side).
2026-04-27 18:01:35 -04:00
2cc60bd677 feat(realism): operator-tunable planner weights via realism_config
New realism_config table (uuid PK + unique key) + two repo methods
(get/set) backs an admin-only GET/PUT /api/v1/realism/config surface.

The planner now exposes apply_payload(payload) / current_payload() /
reset_to_defaults() and reads its weights through mutable module
globals; pick() resolves the live values each call. Validation
catches negative weights, zero totals, out-of-range canary_probability,
unknown content_class names, and silently drops cross-list entries
(canary class on the user list, etc).

The orchestrator worker calls _refresh_realism_config(repo) on
startup and every 5 ticks (~5min at 60s interval). Operator changes
land within one refresh window with no bus signal — the simpler path
for a knob whose latency tolerance is minutes.
2026-04-27 18:00:08 -04:00
da3c35c6a4 fix(realism): synthetic_files path fits MySQL utf8mb4 index cap
The (decky_uuid VARCHAR(64), path VARCHAR(1024)) UNIQUE constraint
generated a 4352-byte composite key under utf8mb4 (4 bytes/char),
busting MySQL's 3072-byte cap and crashing decnet api on init with:

    Specified key was too long; max key length is 3072 bytes

Tighten path to VARCHAR(512) — (64+512)*4 = 2304 bytes, well under
the cap. Real realism + canary placement paths are short
(/home/<persona>/Documents/<file>, ~70 chars); 512 keeps headroom
without the index hassle. Pre-v1, no migration helper.

Adds a regression test pinning the (decky_uuid + path) byte budget so
a future widening fails loudly in CI rather than at MySQL deploy
time.
2026-04-27 17:55:35 -04:00
397a1a111e feat(realism): LLM/breaker status on orchestrator heartbeat
Surfaces realism subsystem state on the existing worker heartbeat
extra hook (system.orchestrator.health) — no new bus topic. Payload
carries {llm_enabled, llm_backend, llm_model, llm_breaker_state}, so
the dashboard's worker panel renders a live LLM badge with a colored
breaker-state dot:

  closed (green)   — LLM healthy
  half_open (amber) — cooldown elapsed; next call is a probe
  open (red)       — short-circuiting to deterministic templates

Heartbeat is the canonical worker self-report channel; piggybacking on
extra(...) avoids a new topic family while keeping the snapshot
recomputed each beat (30s).
2026-04-27 17:51:00 -04:00
55e86f606c feat(realism-ui): synthetic files browser
New /synthetic-files page sits next to Persona Generation and Canary
Tokens under the Automation nav group. Operators get a paginated
inventory of files realism has grown across the fleet (decky, path,
persona, content_class, last_modified, edit_count, hash) with filters
on decky / persona / content_class.

Decky filter is a dropdown sourced from /deckies — never free text.
Row click opens a drawer with the body preview; the drawer surfaces a
TRUNCATED chip when the stored body is at the 64KB cap.
2026-04-27 17:48:05 -04:00
87cb61c8b2 feat(realism): synthetic-files browser API
Adds GET /api/v1/realism/synthetic-files (paginated list, filters by
decky_uuid, persona, content_class) and
GET /api/v1/realism/synthetic-files/{uuid} (single row with last_body
and a truncated:bool flag set when the stored body is at the 64KB cap).

Repo gains count_synthetic_files() and get_synthetic_file(uuid). The
list view drops last_body to keep the wire payload bounded; the detail
endpoint is the only path that returns it. Read-only — orchestrator
remains the sole writer.
2026-04-27 17:44:53 -04:00
2eeec15f9c feat(orchestrator-ui): mark file-edit events with an EDIT badge
FileAction and EditAction both write kind="file" — the discriminator
is action="file:create" vs "file:edit". The dashboard timeline used
to render both identically; now an EDIT sub-chip surfaces edits without
widening the kind enum (which doubles as the bus topic family).

No schema or API change. Polish only.
2026-04-27 17:42:21 -04:00
147f52467f feat(canary): kind reflects trip surface per generator
decnet/canary/cultivator wrote kind="http" for every cultivated
token, even DNS-trip ones (ssh_key, mysql_dump) and passive bait
(aws_creds). The canary worker uses kind to route attacker callbacks
to the right token; a misaligned kind means a real DNS resolution of
ssh_key or mysql_dump never attributes to the planted slug.

Add _GENERATOR_TO_KIND aligned with CanaryKind in models/canary.py
and look it up at create_canary_token time.
2026-04-27 17:40:37 -04:00
49da15823f refactor(realism): single source of truth for persona→login
decnet/realism/naming._home and decnet/canary/cultivator._persona_login
both normalised "John Smith"→"johnsmith" with identical logic. Lift
to decnet.realism.personas.login_for(persona) and have both consumers
import it. Drift between the two would have left canary placement and
realism path naming using different login derivations.
2026-04-27 17:39:04 -04:00
7e9bc6d49a refactor(realism): enforce synthetic_files 64KB cap at the repo
The orchestrator worker clipped last_body at write time, but the repo
didn't enforce. A future caller that forgot the clip would write the
full body. Move the clip to record_synthetic_file and
update_synthetic_file via SYNTHETIC_FILE_BODY_LIMIT in
decnet/web/db/models/realism.py. Worker now passes the full body and
trusts the repo. Tests retargeted to assert repo enforcement.
2026-04-27 17:37:36 -04:00
b86129e35e tests: realism migration regression coverage
Four gaps from the realism migration plan, plus one flaky test
fixed.

Added:

- tests/deploy/test_orchestrator_unit.py — replaces the dead
  test_emailgen_unit.py. Asserts:
  * decnet-orchestrator.service.j2 carries the DECNET_REALISM_*
    env block (LLM, MODEL, TIMEOUT, PERSONAS) so per-host tuning
    works without editing the .j2.
  * Legacy DECNET_EMAILGEN_* vars are NOT referenced — clean break
    contract from stage 5.
  * decnet.target wants orchestrator + canary, does NOT want
    decnet-emailgen.service. Anti-regression for service-collapse.
  * deploy/decnet-emailgen.service.j2 stays deleted.

- tests/orchestrator/test_worker_integration.py — new
  test_one_tick_email_branch_records_orchestrator_email. Pins the
  action-roll to email, seeds a topology with an IMAP mail decky +
  two personas, stubs LLM + docker-exec write paths, verifies an
  orchestrator_emails row + bus event land. Restores end-to-end
  email coverage that was lost when the pre-collapse
  test_worker_integration.py was deleted.

- tests/realism/test_synthetic_files_truncation.py — pins the 64KB
  last_body cap on create + edit, and documents the consequence:
  edit candidates carry a truncated snapshot of files that exceeded
  the cap. If a future change lifts the cap, _LIMIT in the test
  must lift with it.

Fixed flaky:

- tests/orchestrator/test_scheduler.py — two pick_file tests
  pinned to random.Random(1). Without a seed, the 3% canary gate
  (stage 7) and 10% leave-alone roll occasionally flaked the
  assertions because the _FakeRepo doesn't carry a
  create_canary_token method.

Note: the existing
test_realism_subprocess_import_personas_rejects_in_agent_mode
already covers agent-mode rejection of decnet realism
import-personas; no new gating test needed.
2026-04-27 17:29:25 -04:00
a07fb3fe08 feat(realism): canary cultivator on the realism contract
Stage 7 — final stage of the realism migration. Canary plants are
now scheduled by the same realism planner that handles inert content,
keeping the orchestrator as the single decision point and avoiding
duplicate diurnal / persona / rate-limit logic in the canary
subsystem.

New surface:

- decnet/canary/cultivator.py: cultivate(plan, repo) builds a
  CanaryContext, calls the right generator (canary_aws_creds ->
  aws_creds, canary_mysql_dump -> mysql_dump, …), persists the
  canary_tokens row before plant so the canary worker can attribute
  callbacks even on plant-time previews. Resolves canary placements
  to credible operator paths (~/.aws/credentials, ~/.ssh/id_rsa,
  /var/backups/db_backup.sql).
- realism/planner.py adds 8 canary content_classes uniformly weighted
  inside a 3% probability gate. Hard-capped: each tick at most one
  canary; create branch falls through to inert otherwise.
- scheduler.pick_file dispatches canary content_class to the
  cultivator; FileAction grows an optional content_bytes field so
  binary canary artifacts (DOCX/PDF/honeydoc) survive the wire
  intact instead of being utf-8 round-tripped.
- SSHDriver._run_file uses content_bytes when set, falls back to
  encoding the str content otherwise.

Stealth (per feedback_stealth.md): cultivator does not introduce
any DECNET literal; the underlying generators are already
stealth-clean and the test suite asserts the contract holds.

Tests cover round-tripping every canary class through the cultivator,
verifying placement-path conventions, persona-login normalisation
("John Smith" -> /home/johnsmith/.aws/credentials), and the
no-DECNET-leak invariant.
2026-04-27 16:47:59 -04:00
4e436da569 feat(realism): LLM enrichment for user-class file bodies
Stage 6 of the realism migration. User-class file bodies (note,
todo, draft, script) optionally get LLM-authored content; system
classes (cron / daemon logs, /tmp caches) stay template-only because
formulaic *is* the right look for them.

New surface:

- realism.llm.circuit.LLMCircuitBreaker — process-local sliding-window
  breaker. 3 consecutive failures trip open; 60s cooldown to half-open;
  half-open success closes, failure re-opens. Protects the orchestrator
  tick from sustained Ollama wedges (per-call timeout already covers
  one-shot hangs).
- realism.prompts._style — em-dash suppression lifted from the
  email prompt. Persona.uses_llms_heavily opts out per the
  feedback_em_dash_llm_tell.md memory. Includes strip_em_dashes
  belt-and-braces sub for output that slipped past the prompt rule.
- realism.prompts.filebody — class-conditioned prompts (note / todo
  / draft / script) with persona context, language pinning, output
  shape rule.
- realism.bodies.make_body_with_llm — async wrapper around make_body
  that calls the LLM when one is provided AND the breaker allows.
  Falls back to template on timeout / error / empty / system-class.

Wiring:

- scheduler.pick_file accepts optional llm + llm_breaker + llm_timeout.
  When the planner picks a create action and the content_class is a
  user-class, the body_hint is replaced with the LLM-authored body
  (or falls back to the deterministic body_hint).
- orchestrator.worker constructs get_llm() at startup gated by
  DECNET_REALISM_LLM env var (any non-empty value enables; empty /
  "off" / "none" / "0" disables). Passes llm + breaker through every
  tick.
- decnet orchestrate gains --llm/--no-llm flag overriding the env var.
2026-04-27 16:42:58 -04:00
b321e29002 feat(realism): EditAction read-modify-write of planted files
Stage 3b of the realism migration. A TODO.md planted on Monday gets a
checkbox flipped on Tuesday; a notes file grows a follow-up line; a
cron log gets a fresh entry tacked on. The synthetic_files row's
edit_count, last_modified, and content_hash advance.

New surface:

- EditAction dataclass (peer of FileAction in scheduler.py): carries
  decky, path, persona, content_class, previous_body, mtime, and
  synthetic_file_uuid for the worker's update path.
- realism.bodies.next_iteration(cls, persona, prev, rng): per-class
  deterministic mutators. TODO flips an unchecked box and/or appends;
  notes/drafts/scripts append; logs are append-only (mirroring real
  log behaviour). Canary, cache_tmp, email raise KeyError —
  unsupported.
- realism.planner.pick gains an edit branch: 60% create, 30% edit
  (when an edit_candidate is supplied), 10% leave-alone. Returns
  None on leave-alone — quiet ticks are realism too.
- scheduler.pick_file pre-fetches a single edit candidate via
  repo.pick_random_synthetic_file_for_edit ~50% of ticks; the
  planner decides whether to use it.
- SSHDriver._run_edit: turns next_iteration output into a
  plant_file call (mtime-bumped, mode 0o644). Stashes new_body in
  result.payload so the worker can hash it for synthetic_files.
- worker._bump_synthetic_file_after_edit: patches edit_count + 1,
  last_modified=now, content_hash, last_body for the row UUID.
  No-op when the row was pruned mid-flight.
- events.to_row / topic_for / event_type_for now recognise
  EditAction (kind="file", action="file:edit").
2026-04-27 16:38:17 -04:00
32eeb0c813 refactor(orchestrator): collapse decnet-emailgen.service into orchestrator
Stage 5 of the realism migration. Email generation is no longer a
separate worker / systemd unit / CLI subcommand — the orchestrator's
single tick loop covers SSH traffic, file plants, and email drops.
Going from 21 services to 20.

Worker:
- _one_tick rolls between traffic / file / email (45/45/10 weights).
  The 10% email weight at a 60s orchestrator interval produces ~one
  email per 10 minutes, close to the pre-collapse 5-minute cadence.
- get_driver_for(action) (stage 4) handles SSH vs Email dispatch.
- Quiet branches fall through so a (decky-set, persona-pool,
  mail-decky) shape that silences one branch doesn't waste the tick.
- Periodic prune covers both orchestrator_events and
  orchestrator_emails tables.

Deletions:
- deploy/decnet-emailgen.service.j2
- decnet/orchestrator/emailgen/worker.py
- decnet/cli/emailgen.py
- tests/orchestrator/emailgen/test_worker_integration.py

Renames (history-preserving):
- decnet/web/router/emailgen/ -> decnet/web/router/realism/
- tests/api/emailgen/        -> tests/api/realism/
- tests/cli/test_emailgen_*  -> tests/cli/test_realism_*

Public surface changes (clean break, pre-v1):
- API URL /api/v1/emailgen/personas -> /api/v1/realism/personas
- CLI `decnet emailgen import-personas` -> `decnet realism
  import-personas`. `decnet emailgen run` is gone — the orchestrator
  covers it.
- gating.py: emailgen master-only group replaced by realism.
- decnet-orchestrator.service.j2: DECNET_REALISM_* env block added.
- decnet.target: decnet-emailgen.service entry removed.
- frontend: PersonaGeneration.tsx fetches /realism/personas.
2026-04-27 16:33:04 -04:00
cb1872c52f feat(realism): synthetic_files table + planner wiring + scheduler swap
Stage 3 of the realism migration. Replaces orchestrator/scheduler.py's
hardcoded _FILE_TEMPLATES/_USERS (3 templates emitting epoch-suffixed
filenames like notes-1777315854.txt with identical bodies per
template) with a persona-driven realism engine.

New surface:

- SyntheticFile SQLModel (synthetic_files table, UNIQUE on
  decky_uuid+path) — per-(decky, path) state for the future
  edit-in-place flow. Pre-v1, no _migrate_* helper.
- BaseRepository methods: record_synthetic_file,
  update_synthetic_file, list_synthetic_files,
  pick_random_synthetic_file_for_edit (used by stage 3b).
- realism/naming.py: per-content-class filename templates,
  persona-conditioned. /var/log/cron.log + logrotate skeleton for
  system-class; /home/<persona>/TODO.md, scratch.md, etc. for
  user-class. Anti-regression test pins "no 8+ digit decimals in
  basenames" (the realism failure today).
- realism/bodies.py: deterministic body templates per content_class.
  TODO body uses checkbox markdown, script body has a shebang, cron
  body matches syslog cron shape ("CRON[PID]: (user) CMD (...)").
- realism/planner.py: pick(deckies, now, rng) returns a Plan.
  Diurnal-gated, weighted user/system content split (70/30 user
  bias). Create-only in stage 3; edit branch lands in stage 3b.

Scheduler split:

- scheduler.pick is now traffic-only (sync).
- scheduler.pick_file is async, takes a repo, resolves personas
  (Topology.email_personas for topology-source deckies; global
  realism.personas_pool otherwise), and maps Plan -> FileAction.
- FileAction gains persona/content_class/mtime fields.

Worker:

- _one_tick rolls 50/50 between traffic and file each tick. After a
  successful FileAction plant, _record_synthetic_file persists or
  patches the synthetic_files row (catching the unique-constraint
  collision on re-plant of the same path).
- SSHDriver._run_file passes action.mtime through to plant_file so
  files don't all stamp at wall-clock-now.
2026-04-27 16:22:07 -04:00
636c057cc5 refactor(orchestrator): extract ActivityDriver ABC + driver factory
Stage 4 of the realism migration. Lifts the driver Protocol into a
proper ABC with default plant_file/read_file methods (raise
NotImplementedError), and adds get_driver_for(action) so the
orchestrator worker can dispatch by action shape without isinstance
chains.

SSHDriver now inherits ActivityDriver and implements:
- plant_file: streams base64 via stdin (ARG_MAX-safe, mirrors
  decnet.canary.planter; commit c17b9e0). Honours mtime via touch -d
  so realism-planned files don't all stamp at wall-clock-now.
- read_file: docker exec cat with FileNotFoundError on rc=1, used by
  the upcoming EditAction (stage 3b).

EmailDriver inherits ActivityDriver. Driver alias kept for back-compat
during the migration; removed once realism stages 5-7 land.
2026-04-27 16:09:46 -04:00
0b9873982d refactor(realism): move emailgen LLM/personas/prompt into shared library
Lift the format-agnostic pieces from decnet/orchestrator/emailgen/
into the new decnet/realism/ library so file-class content generation
(stage 3 of the realism migration) can reuse them. Email-specific
delivery (RFC 2822 EML, IMAP/POP3 spool, thread chains) stays in
orchestrator/.

Renames (history-preserving git mv):
  emailgen/personas.py     -> realism/personas.py
  emailgen/prompt.py       -> realism/prompts/email.py
  emailgen/global_pool.py  -> realism/personas_pool.py
  emailgen/llm/            -> realism/llm/

Env-var clean break (pre-v1, no aliases):
  DECNET_EMAILGEN_LLM      -> DECNET_REALISM_LLM
  DECNET_EMAILGEN_MODEL    -> DECNET_REALISM_MODEL
  DECNET_EMAILGEN_TIMEOUT  -> DECNET_REALISM_TIMEOUT
  DECNET_EMAILGEN_PERSONAS -> DECNET_REALISM_PERSONAS
  DECNET_EMAILGEN_FAKE_OUTPUT -> DECNET_REALISM_FAKE_OUTPUT

Importers rewritten in: orchestrator/emailgen/scheduler.py,
orchestrator/drivers/email.py, web/router/{emailgen,topology}/
api_personas.py, cli/emailgen.py. Tests for moved modules relocated
to tests/realism/; tests for stay-put modules updated in place.

API URL `/api/v1/emailgen/personas` and CLI `decnet emailgen
import-personas` keep their public names until the service-collapse
commit (stage 5).
2026-04-27 16:05:43 -04:00