Commit Graph

350 Commits

Author SHA1 Message Date
323077b383 fix(web/transcripts): fall back to shard-scan when Log row has no shard_path
sessrec.c emits the session_recorded SD blob with sid/service/src_ip/
duration_s/bytes/truncated — it never emitted shard_path. The web
handler still asked for fields.shard_path, got "", tripped the
sessions-YYYY-MM-DD.jsonl basename regex and returned
400 "invalid shard name" for every legitimate transcript request.

Handler now:
- Fast-paths when fields.shard_path IS present and validates
  (for any future emitter or ingester that backfills it).
- Otherwise enumerates sessions-YYYY-MM-DD.jsonl shards under
  ARTIFACTS_ROOT/{decky}/{service}/transcripts/ (newest first) and
  returns the first one whose per-sid index contains our sid.
- Security invariant preserved: only files whose basename matches the
  _SHARD_BASENAME_RE are ever opened, and they always resolve inside
  ARTIFACTS_ROOT. A forged fields.shard_path is silently ignored.
- Soft-fails OSError/PermissionError on the transcripts dir (decky
  containers often write it with a uid the API can't read) — returns
  404 instead of a 500 traceback.

test_forged_shard_path_blocked updated to match the new semantics:
forgery is ignored, the real shard is served via fallback. The
invariant (no /etc/passwd access) is still asserted by the fact
that status is 200 with data from the test shard.
2026-04-24 01:18:40 -04:00
e4ccf30133 fix(init): template the polkit rule on --group too
polkit rule 50-decnet-workers.rules hardcoded isInGroup("decnet"),
so when 'decnet init --group anti' installed systemd units as
User=anti / Group=anti, the API (running as anti) could no longer
systemctl start/stop decnet-*.service — polkit fell back to
'interactive authentication required', which in a daemon context is
a hard fail:

  START FAILED · COLLECTOR — Failed to start decnet-collector.service:
  Access denied as the requested operation requires interactive
  authentication.

Rename the rule to .j2, parameterise the group on {{ group }}, and
route _install_polkit through _render_template /
_write_rendered_if_changed. Now the polkit rule matches whatever
group was passed to 'decnet init'.

Test fixture updated to seed the .j2 variant.
2026-04-24 01:07:16 -04:00
311da4098e fix(logging): don't crash the process if the system log can't be opened
_configure_logging opened InodeAwareRotatingFileHandler against
DECNET_SYSTEM_LOGS (default: relative decnet.system.log) without
guarding OSError. Under systemd with ProtectSystem=full +
ProtectHome=read-only and no writable path baked into the unit, the
first import of decnet.config raised OSError and the daemon died
before it could even print a useful error — the root-cause log line
showed up in journalctl as a stack trace rather than a warning.

Wrap the handler attachment in try/except OSError and log a single
WARNING via the already-installed stream handler. stderr is always
attached, so losing the file handler means operators tail
journalctl / docker logs instead — the daemon keeps running.
2026-04-24 01:00:42 -04:00
bfff212a05 fix(api): gate embedded Docker-log collector on DECNET_EMBED_COLLECTOR
The API lifespan unconditionally spawned log_collector_worker,
appending every container line to DECNET_INGEST_LOG_FILE. On hosts
that also run decnet-collector.service (installed by 'decnet init')
that's two tailers writing the same events to the same file — the
ingester then inserts each event twice and the dashboard shows every
command duplicated.

Add DECNET_EMBED_COLLECTOR (default false), matching the existing
DECNET_EMBED_PROFILER and DECNET_EMBED_SNIFFER pattern directly
above this block. Single-process dev setups without systemd can flip
it on to restore the all-in-one behaviour; multi-process production
gets the single-writer invariant by default.
2026-04-24 00:47:37 -04:00
edc8297af3 fix(init): gate userdel/groupdel on --purge to avoid nuking the operator
Every plain `decnet deinit` ran userdel + groupdel unconditionally. In
dev the operator may pass `--user $USER --group $USER` to avoid file
ownership churn against a source checkout — at which point deinit
would cheerfully delete their own login account.

Move user/group removal behind --purge, matching the existing
behaviour for /var/lib/decnet + /var/log/decnet. Help text updated:
--purge now clearly advertises that it also wipes the service
user/group, with an explicit warning to only run it when `decnet init`
created the account in the first place.

Test updated: plain --deinit must NOT invoke userdel/groupdel;
--deinit --purge must.
2026-04-24 00:38:51 -04:00
38832d87d5 fix(init): thread --user / --group through systemd unit templates
Every decnet-*.service.j2 hardcoded User=decnet / Group=decnet. The
init CLI accepted --user / --group and used them for useradd,
chown, /etc/decnet ownership and ReadWritePaths — but the Jinja
context omitted them entirely, so

  sudo decnet init --install-dir $PWD --user anti --group anti

rendered

  User=decnet
  Group=decnet

into every unit, which at best ran the workers as a user that didn't
match the files (fails to read the venv / config), and at worst spun
a parallel system user the operator never asked for.

Swap the hardcoded lines to {{ user }} / {{ group }} across all 13
templates and add both to the Jinja context in _install_units.
2026-04-24 00:36:23 -04:00
51012eaa67 feat(init): decouple venv from install_dir; fail loud if no venv exists
The systemd unit templates hardcoded {{ install_dir }}/venv/bin/decnet.
On production hosts enroll_bootstrap.sh creates exactly that path so it
worked. On dev boxes where the operator runs `sudo decnet init` against
a source checkout with a differently-named venv (.venv, .311, .312),
every decnet-*.service looped forever in auto-restart with:

  Failed at step EXEC spawning .../venv/bin/decnet: No such file or
  directory

Templates now use {{ venv_dir }} as an independent Jinja2 var. `decnet
init` adds --venv-dir (explicit override), otherwise autodetects:

  1. $VIRTUAL_ENV (only when inside --install-dir, so a user-home venv
     never gets baked into a root-owned unit),
  2. {install_dir}/venv (production default; what enroll_bootstrap
     creates),
  3. {install_dir}/{.venv,.311,.312,.313} (common dev conventions).

Init aborts before any file writes if nothing resolves — an
operator-friendly error beats journalctl spam on every unit restart.

python3-venv doesn't set a persistent system variable — $VIRTUAL_ENV
lives in the activated shell only — so this has to be decided + baked
in at init time; there's no way for systemd to "inherit the current
venv" at unit start.

Test mode (--prefix) skips venv validation so the existing test suite
doesn't need to stub up a venv tree per case.
2026-04-24 00:29:49 -04:00
cb692d570a feat(cli): status queries systemd for every decnet-* unit
'decnet status' used to psutil-scan for cmdlines matching hand-coded
service launch args. That worked on dev boxes running workers via
'python -m decnet.cli ...' but missed the systemd reality on real
hosts: units may be installed but not started, failed, or in
auto-restart — all invisible to a cmdline grep.

New behaviour: status calls `systemctl list-units --type=service --all
--output=json 'decnet-*.service'` and renders the unit/load/active/
sub/description matrix. One view works for masters, agents, and
mixed hosts — iterates over whatever 'decnet-*' units were installed
by 'decnet init' / the enroll-bundle. Agent/master mode filtering is
no longer needed in the CLI; the host literally does not have
master-only units installed if it enrolled as an agent.

The psutil path survives as a fallback for boxes without systemd
(dev laptops, CI containers, minimal init systems) so the command
stays useful there. Clearly labelled 'psutil fallback' in the table
title so operators know which view they're looking at.
2026-04-24 00:23:00 -04:00
d61e143b71 fix(stress): unblock Locust runs from login rate-limit self-DoS
Locust spawns N virtual users (default 1000), all from 127.0.0.1 as
admin. /auth/login is rate-limited 10/5min per-IP AND per-username, so
the 11th on_start() got 429 and a RuntimeError. A @task(2) login in
the task weights turned the whole run into a 429 factory even after
ramp-up. And _login_with_retry treated 429 as non-retryable, so there
was no graceful degradation path.

Three changes, one root cause:

- decnet/web/limiter.py: read DECNET_LIMITER_ENABLED (default true).
  When false, slowapi's Limiter(enabled=False) makes @limiter.limit a
  no-op. Default ships unchanged; nobody should ever release with this
  off.
- tests/stress/conftest.py: set DECNET_LIMITER_ENABLED=false in the
  uvicorn subprocess env. Stress tests measure throughput, not rate
  limiting.
- tests/stress/locustfile.py: drop the @task(2) login — it added zero
  coverage (every user already logs in at on_start) and only generated
  contention. Teach _login_with_retry to honour 429 + Retry-After so a
  Locust pointed at a limiter-enabled server degrades gracefully
  instead of crashing on_start.
2026-04-24 00:13:15 -04:00
26d04d5eb8 fix(db): SessionProfile.kd_digraph_simhash must be BINARY(8), not BLOB
MySQL can't index a BLOB/TEXT column without a prefix length, so
create_all() on a fresh MySQL schema blew up with "BLOB/TEXT column
'kd_digraph_simhash' used in key specification without a key length".

SimHashes are a fixed 8 bytes — the variable-length type was a
SQLAlchemy-side auto-mapping from 'Optional[bytes]', not an actual
schema requirement. Switch to BINARY(8), which is portable: MySQL gets
a fixed-width indexable BINARY, SQLite treats it as BLOB and doesn't
care about key length.
2026-04-23 22:06:38 -04:00
2ff392511b feat(templates): per-instance stealth seeding helper
Adds instance_seed.py to every service template (conpot, docker_api,
imap, k8s, llmnr, pop3, rdp, sip, smb, snmp, ssh, telnet, tftp, vnc).

Derives a stable per-instance seed from NODE_NAME (+ optional
INSTANCE_ID) and exposes deterministic helpers for the boring details
scanners would otherwise use to fingerprint the whole fleet as one
machine: cluster UUIDs, auth salts, uptime fixtures, minor version
strings. Connection-time jitter is intentionally NOT seeded — two hits
to the same decky must not replay the same latency curve.

Identical source across every template; lives next to each service so
the Docker build context picks it up without a shared package-data hop.
2026-04-23 21:51:51 -04:00
0eb0b32c7a refactor(swarm): enroll bundle switches from exclude list to include list
Exclude lists fail open — anything new at the master's repo root (venvs,
logs, dev notes, .env.local, local DB dumps) silently leaks into every
agent bundle. On this box a stray .311 venv (335 MB) + logs/ (220 MB)
bloated the tarball to ~150 MB and blew test_enroll_bundle timeouts.

Replace _EXCLUDES + _is_excluded with _INCLUDED_ROOT_FILES +
_INCLUDED_DIRS + _EXCLUDED_DECNET_SUBTREES and iterate via os.walk with
in-place dirnames[:] pruning so master-only subtrees (decnet/web,
decnet/mutator, decnet/profiler) and __pycache__ aren't descended into
at all.

Bundle contents are now strictly: pyproject.toml + the decnet/ package
minus the three master-only subtrees. Synthetic entries (INI, certs,
systemd units) unchanged — they were always added inline, not from the
tree walk.

test_enroll_bundle.py: 20/20 pass in 24s (was timing out at 15s/test).
2026-04-23 21:47:47 -04:00
ffc275f051 feat(geoip): country-code enrichment via RIR delegated-stats
Populates Attacker.country_code + country_source (MVP) using the five
RIR delegated-stats files (ARIN/RIPE/APNIC/LACNIC/AFRINIC). Offline,
license-free, no outbound traffic that could burn honeypot stealth.

- decnet.geoip package with factory/base/lookup + rir/ subpackage
  (fetch/parse/provider) mirroring the db + bus factory convention
- Profiler._build_record calls enrich_ip on every upsert
- Idempotent ALTER TABLE migrations for both SQLite and MySQL
- decnet geoip refresh/lookup CLI (master-only)
- /var/lib/decnet/geoip seeded by decnet init
- DECNET_GEOIP_ENABLED=false kill-switch; set in tests/conftest.py so
  unit tests never trigger the first-access fetch
2026-04-23 21:12:38 -04:00
07bf3dc8cb feat(config): promote /etc/decnet/decnet.ini to real config with domain sections
The config file `decnet init` dropped at /etc/decnet/config.ini was a
stub with a single [decnet] header saying 'reserved for future structured
settings.' Admins who wanted to tune DECNET_API_HOST, DECNET_DB_URL,
DECNET_BATCH_SIZE, etc. had to hunt env.py for the exact variable name
and drop it in .env.local.

Changes:
- decnet/config_ini.py — adds a _DOMAIN_MAP translation table covering
  [api], [web], [database], [bus], [swarm], [logging], [ingester],
  [tracing]. Loads regardless of mode; unknown keys inside a known
  section log a WARNING (operator typos shouldn't be silent).
  Explicit key map (not auto kebab-to-snake) so [web] admin-user lands
  in DECNET_ADMIN_USER without silently renaming the env-var contract
  consumers import from decnet.env.
- decnet/cli/init.py — renames the placeholder target config.ini →
  decnet.ini (unifies with the name already used by load_ini_config and
  the enroll bundle's _render_decnet_ini). Placeholder body now shows
  every domain section as a commented example so admins learn the
  shape by reading. Deinit removes both decnet.ini and the legacy
  config.ini so upgrading hosts leave no orphan file.

Precedence is unchanged: real env > INI > built-in default in env.py.
os.environ.setdefault means systemd EnvironmentFile= and one-off
DECNET_FOO=bar decnet ... invocations always win.

Secrets explicitly NOT moved to the INI:
 - DECNET_JWT_SECRET
 - DECNET_ADMIN_PASSWORD
 - DECNET_DB_PASSWORD
They stay in .env.local / EnvironmentFile= — never in a group-readable
INI, never in a diff, never on the dashboard.

Dev/profiling flags (DECNET_DEVELOPER, DECNET_EMBED_*, DECNET_PROFILE_*)
also stay env-only per maintainer direction — dev knobs shouldn't
be one 'I'll flip this for tonight' away.

Tests: +5 in test_config_ini.py (domain sections load regardless of mode,
env beats INI for domain keys, unknown key warns, absent section is
no-op, role section beats domain section via setdefault precedence). +1
in test_init.py (placeholder writes decnet.ini with every section
header present as commented guidance).

31 tests pass across the two files (was 26).
2026-04-23 18:21:00 -04:00
1753eca198 feat(deploy): templatize systemd services on install_dir via Jinja2
Distros reserve /opt for different things (some package managers own it
outright), and a DECNET install that wants to live at /srv/decnet or
/usr/local/decnet had to hand-edit 13 service files post-install.

Converts every deploy/decnet-*.service to a .j2 template keyed on
{{ install_dir }}, rendered by `decnet init` at install time. All other
paths (log_dir, state_dir, runtime_dir, user, group) stay standard —
only install_dir varies.

Changes:
- deploy/decnet-*.service → deploy/decnet-*.service.j2 (13 files).
- decnet init gains --install-dir (default /opt/decnet, preserves
  existing behaviour byte-for-byte). Validates absolute-path at the
  CLI boundary. Threads through useradd --home-dir and the dir-creation
  list so the filesystem layout matches the rendered templates.
- _install_units renders via Jinja2 with StrictUndefined (typo → loud
  error, not a silent broken unit). SHA over rendered output so
  operators with a custom install_dir get idempotent re-runs.
- decnet.target, tmpfiles.d, polkit rule stay static — they don't
  reference install paths.
- 4 new tests: custom install_dir renders into units, default remains
  /opt/decnet, relative paths rejected, second run with same custom
  dir is idempotent.
2026-04-23 18:08:26 -04:00
4418608a54 fix(bus): silently drop publishes on closed bus instead of raising
Worker bus instances (collector, ingester) close their private buses
in finally blocks on shutdown, but stream threads holding closure
references kept calling publish after close — one `RuntimeError:
publish on closed bus` per stream line, caught by publish_safely
and logged per call, flooding server logs.

Changes:
- `UnixSocketBus.publish()` now drops post-close calls. First drop
  WARNs loudly (bus is critical infra — silent drops would hide real
  problems); subsequent drops on the same instance log at DEBUG to
  prevent the flood. Sticky `_closed_publish_warned` flag, reset
  naturally per new bus instance.
- `make_thread_safe_publisher` short-circuits on a closed bus before
  marshalling a coroutine onto the loop. Avoids the wasted scheduling
  work in the hot shutdown path.

Degradation is safe: callers go through `publish_safely`, which
already treats exceptions as 'dropped notification, DB is source of
truth.' We just stop manufacturing the exception in the first place
for a known-benign condition.
2026-04-23 18:00:47 -04:00
eb2308d9e1 fix(bus): retry app-bus connect with backoff instead of one-shot veto
A startup race between `decnet bus` being ready and the API's lifespan
hitting `get_app_bus()` at api.py:135 would set `_tried = True`
permanently, poisoning the singleton for the rest of the process: the
dashboard shows BUS OFFLINE, topology SSE falls into the bus-is-None
snapshot-only branch, mutator publish calls no-op. Only an API
restart recovered.

Replaces the one-shot veto with a time-gated retry keyed on a
`_last_failure_ts` monotonic timestamp plus a 2 s backoff. Publishers
on the hot path still pay at most one connect attempt every 2 s when
the bus is down, but the singleton auto-recovers within 5 s (one
dashboard poll) once the bus comes up.

The asyncio lock still serialises concurrent callers so the bus server
doesn't get stampeded with parallel connect attempts on startup.
2026-04-23 17:59:17 -04:00
ef4179ea1f feat(api): opaque 500 handler + error_id correlation for unhandled exceptions
Registers a generic @app.exception_handler(Exception) that catches anything
uncaught in route handlers / dependencies. Prod response is opaque:
{detail: 'Internal Server Error', error_id: <uuid4 hex>}. Dev mode
(DECNET_DEVELOPER=True) adds exception_type and traceback fields so
failures are debuggable without tailing server logs.

The error_id is logged alongside the full traceback server-side, letting
operators correlate a user's 500 report with the exact exception via
`grep <error_id> /var/log/decnet.log`.

FastAPI's own HTTPException routing and the existing
RequestValidationError / ValidationError / RateLimitExceeded handlers
still take precedence — this handler only fires on genuinely-uncaught
exceptions.

Flips threat model F1/I 'traceback / stack trace leakage' from ? to M
and logs a follow-up checklist entry for 4 detail=str(e) sites in the
fleet deploy router (admin-gated, different threat class, separate
audit).
2026-04-23 14:07:32 -04:00
2f4f81e5de feat(api): rate-limit /auth/login + scaffold threat model
Adds slowapi two-bucket rate limit on /auth/login — 10 attempts per
5 minutes per-IP AND per-username, tripping either → 429. Per-IP
catches botnets hitting one account; per-username catches distributed
credential stuffing against one account. In-memory storage: dashboard
API is single-process, Redis is disproportionate for v1.

X-Forwarded-For is deliberately NOT trusted (spoofable); reverse-proxy
deployments get one shared bucket per proxy IP. Logged in the threat
model as accepted risk DA-08, to be revisited when a verified-proxy
config lands.

Also scaffolds development/THREAT_MODEL.md with STRIDE-per-element
methodology, system-context DFD, and Dashboard↔API as the first fully
worked component (7 sub-flows, ~50 threat entries). F1 Authn ships
with 3 threats mitigated: rate limit (new), uniform 401 (verified
already in place), bcrypt length clamp (verified already in place via
Pydantic max_length=72).
2026-04-23 13:25:28 -04:00
8cbb7834ef feat(web): SMTP victim-domain + stored-mail panels on attacker detail
Adds GET /attackers/{uuid}/smtp-targets (viewer) and GET /attackers/{uuid}/mail
(admin) endpoints, plus two new sections on the attacker detail page:
VICTIM DOMAINS rollup (aggregate-only, federation-gossip-safe) and STORED MAIL
with a drawer that decodes headers, lists attachments, and downloads the raw
.eml via the existing artifact endpoint (?service=smtp).
2026-04-22 22:33:53 -04:00
d43303251d feat(profiler): track SMTP victim domains per attacker
New SmtpTarget table records each (attacker, domain) pair observed via
the SMTP honeypots. Only the domain is stored — local-parts are dropped
at ingestion, so this table holds no user-identifying data beyond the
target organisation's identity.

The profiler worker extracts domains from rcpt_to / rcpt_denied /
message_accepted events, normalizes them (lowercase, strip local-part,
drop blocked TLDs), and upserts one row per pair with a running count +
first_seen / last_seen.

Three repo methods shipped:
  * increment_smtp_target(attacker, domain) — upsert + bump
  * list_smtp_targets(attacker) — per-attacker view
  * smtp_target_seen(domain) — cross-attacker aggregate, shaped as the
    federation-gossip RPC that V2 will expose.

The gossip-query shape is load-bearing: each operator can answer
"have any of your attackers targeted corp1.com?" without leaking
which attackers or when — the aggregate returns a bool + total count
+ first/last seen, nothing else.
2026-04-22 22:23:27 -04:00
c50448995b feat(smtp): capture full messages + attachments to disk
SMTP template now writes each accepted DATA body as a .eml file into a
bind-mounted per-decky quarantine dir and emits a `message_stored` log
with sha256, size, decoded headers, and an attachment manifest
(filename + sha256 + size + content-type). Attachment hashing uses the
*decoded* payload so operators can match against VT / MalwareBazaar
directly. Body accumulator is capped at SMTP_MAX_BODY_BYTES (default
10 MB, matching the EHLO SIZE advert) so a streaming client can't OOM
the container.

The existing /api/v1/artifacts/{decky}/{stored_as} endpoint now takes
an optional ?service= query param (defaults to ssh for back-compat)
and can serve .eml files out of the smtp subdir. Forensic metadata
rides the normal log pipeline, same as SSH file_captured.
2026-04-22 22:17:50 -04:00
d47a84c90b refactor(models): split models.py into topical submodules
decnet/web/db/models.py was approaching 1000 lines across User/Log/
Attacker/Swarm/Topology/Workers/Updater/Health domains. Split into a
package with one module per domain; __init__.py re-exports every symbol
so all 52 call sites keep importing from decnet.web.db.models
unchanged.
2026-04-22 21:55:41 -04:00
119b4e8724 feat(db): add session_profile table for keystroke-dynamics fingerprints
New purpose-built table with schema_version column committed from day one
so V2 federation gossip can cluster sessions across operators without
retrofitting. Ships with the empty write path (upsert_session_profile);
ingestion of keystroke features (IKI moments, control-char rates, digraph
SimHash) is tracked as V2 work.

Closes gap #2 from SIGNAL_CAPTURE_AUDIT.md.
2026-04-22 21:39:17 -04:00
d3321324eb feat(sniffer): capture SSH client banner from TCP stream
Parse RFC 4253 §4.2 identification strings from the first attacker→decky
data segment on TCP/22; emit ssh_client_banner syslog events and bus
fan-out. Profiler's sniffer_rollup dedupes observed banners into a new
AttackerBehavior.ssh_client_banners JSON column.

Closes gap #3 from SIGNAL_CAPTURE_AUDIT.md.
2026-04-22 21:37:01 -04:00
8181f39ae2 feat(profiler): persist raw SSH KEX algorithm ordering
Prober already emits kex_algorithms in hassh_fingerprint syslog events, but
the raw ordered list was only queryable via the generic bounty store. Add a
dedicated AttackerBehavior.kex_order_raw column (TEXT, JSON list) so
post-v1 KEX-order fingerprinting has a typed, indexable home.

Pipeline:
  - sniffer_rollup() now consumes hassh_fingerprint events and collects
    distinct kex_algorithms strings across ports.
  - build_behavior_record() JSON-encodes the list (NULL when empty).
  - sqlmodel_repo._deserialize_behavior() parses it back into a list.

Closes pre-v1 gap #1 from SIGNAL_CAPTURE_AUDIT.md.
2026-04-22 21:29:46 -04:00
25838eb9f3 refactor(profiler): split behavioral.py into topical modules
Break the 603-line behavioral.py into timing/classify/tools/phases/fingerprint
sibling modules plus a slim orchestrator. Public API unchanged: behavioral.py
re-exports every previously-exposed symbol, so worker.py and existing tests
keep working with zero import changes.

No behavior change; all 64 profiler tests pass.
2026-04-22 21:10:19 -04:00
5704e8fcce fix(topology): delete topology_mutations in delete-cascade
delete_topology_cascade manually deletes status_events, edges, deckies
and lans but overlooked topology_mutations, so deleting any topology
that ever had a mutation enqueued (i.e. edits while active|degraded)
failed with an FK IntegrityError. Add the missing DELETE and extend
the cascade test to seed a mutation row.
2026-04-22 17:50:30 -04:00
3f460bab84 feat(web): show MazeNET decky running count + roll into dashboard
MazeNET header now reports '{running}/{total} DECKIES RUNNING' so
operators can see per-topology runtime status at a glance.

Dashboard ACTIVE DECKIES counters used to reflect only the fleet state
file; TopologyDecky rows (MazeNET deployments) are now added in —
deployed_deckies = fleet + all topology rows, active_deckies = fleet
(no runtime field) + topology rows whose state is 'running'.
2026-04-22 17:48:04 -04:00
6f537f52c2 fix(topology): remove DMZ gateway auto-attach on LAN create
POST /topologies/{id}/lans previously called _auto_attach_gateway()
whenever a non-DMZ LAN was created, which wired the DMZ gateway decky
to every new subnet. That's why a deployed gateway ended up with
eth0..ethN on every LAN regardless of what the user drew in MazeNET.

Drop the auto-attach helper entirely. The DMZ_ORPHAN deploy-time
validator (decnet/topology/validate.py:65-110) stays strict — users
must explicitly wire the gateway to each subnet they want bridged,
which is the whole point of having a topology editor.

useMazeApi.ts: drop stale auto-bridge reference from comment.
2026-04-22 17:14:09 -04:00
91111ea7ee feat(cli): add decnet init --deinit to undo a previous bootstrap
Reverse of init, step-by-step: systemctl disable --now decnet.target,
remove every decnet-*.service + decnet.target unit file, drop the
polkit rule, drop the tmpfiles.d entry, daemon-reload, remove
/etc/decnet + /etc/decnet/config.ini, /run/decnet, /opt/decnet, and
userdel/groupdel the decnet identity.

Preserves /var/lib/decnet and /var/log/decnet by default — those
hold operator data. Pass `--deinit --purge` to rm -rf them too.
Idempotent on a clean host (every step prints [SKIP]). Honours
--dry-run.

5 new tests cover the full-undo path, --purge, idempotent clean-host
deinit, dry-run side-effect-free behaviour, and the --purge without
--deinit guard.
2026-04-22 14:31:56 -04:00
3dae44c652 feat(cli): add decnet init one-shot master-host bootstrap
Creates the decnet system user/group, installs every unit file from
deploy/ into /etc/systemd/system, drops the polkit rule, seeds
/opt/decnet + /var/{lib,log}/decnet + /etc/decnet + /run/decnet,
writes a placeholder /etc/decnet/config.ini, applies the new
tmpfiles.d entry so /run/decnet survives reboots, daemon-reloads,
and `systemctl enable --now decnet.target`.

Idempotent (re-runs print [SKIP] on already-configured items),
--dry-run previews the plan without touching anything, --no-start
defers the target start, --force overwrites even matching unit
files. Master-only (added to MASTER_ONLY_COMMANDS).

9 orchestration tests cover the non-root gate, dry-run, useradd/
groupadd argv, SKIP on present user/group, unit-file idempotency,
--force overwrite, --no-start suppression, happy path, and the
"deploy/ not found" error message.
2026-04-22 14:28:11 -04:00
13ea916943 feat(workers): add start + start-all endpoints (systemd supervisor)
POST /api/v1/workers/{name}/start — 202 on acceptance, 404 unknown
worker, 503 if the unit file is not installed, 502 if systemctl
returns non-zero (stderr snippet in detail, full stack logged).
Admin only.

POST /api/v1/workers/start-all — best-effort: walks the worker list
in dependency order (bus → api → data-plane), skips already-active
and uninstalled units, aggregates outcomes into
{started, already_running, failed[]}. Returns 200 even on partial
failure; the caller reads the three lists.

Both endpoints delegate to the systemd_control helper, so the attack
surface for "what gets executed" is locked to `decnet-<validated-name>
.service` at two layers (router KNOWN_WORKERS + helper regex).
2026-04-22 14:12:29 -04:00
0fbb07c2ec feat(workers): bus-backed Workers panel (registry, control, installed flag)
Ships the backend half of Config → Workers:

* Worker registry aggregates `system.*.health` + `system.bus.health`
  heartbeats into a last-seen dict; OK / STALE / UNKNOWN tiers drop
  out of a 90s window (3× the 30s heartbeat interval).
* `GET /api/v1/workers` returns the snapshot plus `bus_connected`
  (so the UI can explain "all UNKNOWN" when the bus socket is down)
  and a per-row `installed` flag populated from
  `systemctl list-unit-files decnet-*.service` (cached 30s).
* `POST /api/v1/workers/{name}/stop` publishes a stop intent on
  `system.<name>.control`; workers listen via the shared control
  listener in `bus/publish.py`.
* Heartbeat + control listener wired into collector / profiler /
  sniffer / prober / mutator worker loops. API self-heartbeats too
  so the panel always has one ground-truth row.
* Topic helper `system_control(name)` + tests covering builder
  validation, control listener shutdown path, and the API surface
  (auth gating, bus-connected field, unknown-name 404).

Adds `StartFailure` / `StartAllResponse` models in anticipation of
the upcoming start endpoints (DEBT-034).
2026-04-22 14:10:39 -04:00
fcaac648a4 feat(web): add systemd_control helper for worker unit management
Thin async wrapper over `systemctl` — never shell=True, always
create_subprocess_exec. Unit names are built from
`decnet-<validated-name>.service`; the regex check is defence in depth
on top of the router-level KNOWN_WORKERS validation.

Exposes start / stop / is_active / list_installed; last is cached for
30s to keep the Workers panel cheap under REFRESH spam. On non-systemd
hosts list_installed returns an empty set, so the UI renders with
every row marked not-installed instead of 500-ing.
2026-04-22 14:08:35 -04:00
3fb84ac5d0 feat(templates): per-instance stealth via instance_seed in service servers
Every service template now pulls version strings, cluster/node UUIDs, auth
salts, greeting banners, and uptime from the seeded per-instance RNG instead
of hard-coded defaults. Scanners sweeping the fleet now see legitimately
diverging fingerprints per decky while each decky's own responses stay
internally consistent across restarts.

Covers elasticsearch, ftp, http, https, ldap, mongodb, mqtt, mssql, mysql,
postgres, redis, and smtp templates.
2026-04-22 09:24:16 -04:00
51e9e263ca feat(templates): add instance_seed stealth helper and wire into template builds
Each decky now gets a deterministic-per-instance seeded RNG derived from
NODE_NAME, so cluster UUIDs, version strings, uptime, and credentials diverge
across the fleet while staying stable within one container. The canonical
helper lives at decnet/templates/instance_seed.py; the deployer copies it into
every active template build context alongside syslog_bridge.py. Dockerfiles
COPY it to /opt/ so server.py can import it.

Connection-time jitter intentionally stays unseeded — two hits to the same
decky must not replay the same latency curve.
2026-04-22 09:24:04 -04:00
6bbb2376f7 refactor(services): make artifact root configurable via DECNET_ARTIFACTS_ROOT
The ssh and telnet services hard-coded /var/lib/decnet/artifacts as the host
quarantine mount. Read it from DECNET_ARTIFACTS_ROOT with the same default so
dev/rootless deploys can point it elsewhere.
2026-04-22 09:23:36 -04:00
6725197d58 test(web): transcripts API + attacker-transcripts router coverage
Paging, truncation surfacing, admin gate, path traversal, sid-regex and
decky-mismatch rejection for /transcripts; mirror coverage for
/attackers/{uuid}/transcripts. Flips the Session Recording box in the
roadmap (sessrec pty relay now shipping end-to-end).
2026-04-21 23:11:40 -04:00
6e522c5a55 feat(web): transcripts API + repository lookups
Adds get_attacker_transcripts (mirror of artifacts for session_recorded
logs) and get_session_log for sid→shard resolution. New
/api/v1/transcripts/{decky}/{sid}?offset=&limit= pages asciinema events
out of the shared JSONL day-shard via an mtime-keyed byte-offset index
— never scans the whole shard per request. New
/api/v1/attackers/{uuid}/transcripts lists sessions for drilldown. Both
endpoints admin-gated.
2026-04-21 23:06:39 -04:00
a58d42e492 feat(templates): wire SSH+Telnet to sessrec transcript recorder
Build login-session into both images as the swapped root shell, add a
quarantine bind mount for telnet (symmetric to SSH), seed transcripts/
dir and service discriminant at entrypoint. Deployer syncs sessrec.c +
Makefile into each build context alongside the existing syslog_bridge
helper. sessrec falls back to /etc/sessrec.service when env is stripped
(busybox /bin/login).
2026-04-21 23:03:42 -04:00
4596c1d69a feat(templates): add sessrec pty transcript recorder
New decnet/templates/_shared/sessrec/ — a small C program installed as the
login shell in SSH / Telnet deckies. Forkpty-relays /bin/bash, records each
chunk as an asciinema v2 event into a shared JSONL day-shard keyed by sid,
and emits one RFC 5424 session_recorded line on exit (direct to PID 1's
stdout, same pattern syslog_bridge.py uses).

Storage: one shard per (decky, UTC day) at
/var/lib/systemd/coredump/transcripts/sessions-YYYY-MM-DD.jsonl. Concurrent
appends are lock-free: each write is chunked below PIPE_BUF so O_APPEND
interleaves atomically. Per-session cap 10 MB with a trunc sentinel; disk-
free precheck (<200 MB) falls through to plain bash with a session_skipped
log event. Attacker src_ip resolves from \$SSH_CONNECTION, getpeername(0),
or utmp in that order. SIGWINCH appends a 'r' resize event so ncurses
replays stay aligned.

Stealth for v1: /etc/passwd shell-swap to /usr/libexec/login-session
(plausible login-machinery path) + prctl comm disguise. Full LD_PRELOAD
argv-zap is deferred — sshd strips LD_PRELOAD from the session env, so
wiring the existing argv_zap.so into this path needs a separate wrapper.

DEBT-033 opened for size-based day-shard rotation; v1's disk-free precheck
covers the worst case but can be blinded by a one-shot disk fill.
2026-04-21 22:56:42 -04:00
8f25ff677f feat(engine,api): add orphan topology resource reaper
Topology rows deleted without a proper teardown leave Docker containers
and bridge networks behind, holding IPAM pools that cause 403 "Pool
overlaps" on the next deploy at the same subnet.

- engine/reaper.py walks the local Docker daemon, extracts the 8-char
  topology prefix from every decnet_t_* resource, and force-removes
  containers + networks whose prefix is not in the repo.
- POST /api/v1/topologies/reap-orphans (admin-only) returns a report
  of live/orphan prefixes and what was removed.
- Resources belonging to live topologies are never touched; per-resource
  errors are captured without aborting the sweep.
2026-04-21 22:13:44 -04:00
85bb0e2f65 fix(engine): roll back partial Docker state on deploy failure
When create_bridge_network or compose-up raised mid-deploy, the
deployer marked the topology FAILED and re-raised — but left every
network it had already created alive. The next deploy attempt tripped
over the orphans with 'Pool overlaps with other one on this address
space' (IPAM conflict).

Track networks created in the current attempt; on exception, tear down
the started compose stack (if any), remove the networks in reverse
order, and delete the compose file before marking FAILED. Rollback
errors are logged but never mask the original failure.

Covered by a new regression test that drives a docker client which
succeeds once then raises, and asserts every created network is also
removed.
2026-04-21 20:23:03 -04:00
c266d1b6e3 feat(mutator,web): add_decky op — create-and-attach in one mutation
apply_attach_decky requires an existing decky, so the MazeNET editor
had no way to grow a live topology: creating a new decky on active
topologies 409'd on the direct-CRUD createDecky call.

- Backend: new apply_add_decky that creates the decky row + its
  home-LAN edge atomically, auto-allocating an IP if none pinned.
  Post-apply validation still runs. Added to DISPATCH + _MUTATION_OPS
  Literal + CLI help text.
- Tests: 3 new ops tests (happy path, duplicate-name rejection,
  missing-LAN rejection) plus dispatch coverage update.
- Frontend: useTopologyEditor gains addDeckyToLan() composite. Pending
  routes through createDecky + attachEdge as before; active routes
  through a single add_decky enqueue. MazeNET.tsx drag-archetype,
  duplicate, DMZ-gateway, and ctx-menu add-decky paths all use the
  composite so active topologies stop 409'ing on new-decky drops.
2026-04-21 20:13:39 -04:00
a93cbe76f9 feat(mutator): update_decky payload accepts top-level services list
apply_update_decky only merged payload.patch into decky_config. Since
services is a separate DB column, there was no way to replace a decky's
services list via a mutation. Add a top-level services key to the op
payload that maps straight onto the services column.

Unblocks the MazeNET editor routing service-add/service-drop actions
through the mutation queue on active topologies.
2026-04-21 19:56:58 -04:00
d4d8a2ad0d feat(correlation): interleave mutation markers into attacker traversals
Parser now tags ``mutator`` / ``decky_mutated`` lines with ``kind="mutation"``
so the engine can route them into a sibling ``_mutations`` index keyed
by decky name instead of the per-IP attacker index.  ``traversals()``
joins the two streams: every attacker gets a ``mutations_during`` list
of markers from touched deckies bounded by their first/last-seen
window.  ``AttackerTraversal.to_dict()`` grows a ``mutations_during``
field and a ``timeline`` that chronologically interleaves hops and
markers, so an ``SSH at T5 → mutation at T6 → HTTP at T7`` substrate
transition is visible to UI consumers instead of reading as a silent
discontinuity.

The existing hops-only JSON shape is preserved; old clients that
ignore unknown keys keep working.
2026-04-21 19:37:35 -04:00
bf5ed7abbb feat(engine): emit creation/retirement mutation events on deploy/teardown
Close the lifecycle loop for the correlation graph: every decky now
enters the substrate with an explicit `trigger=creation` event
(old_services=[] ⇒ new_services=<initial>) and leaves it with
`trigger=retirement` (old=<current> ⇒ new=[]).  With scheduled/operator
mutations already flowing through emit_decky_mutated, the entire decky
lifecycle is now a well-formed sequence of mutation events — the
correlator can fold substrate_state(t) at any T by replaying them.

Lazy-imports mutator.events to dodge the engine↔mutator circular
dependency.  Bus is None at CLI sites; the syslog write is what the
correlator consumes.  Emission is soft-failing so a broken log path
never aborts a deploy.
2026-04-21 19:35:05 -04:00
fa0cdb3ab5 feat(mutator): route mutate_decky through emit_decky_mutated with trigger
Mutator now emits one decky_mutated event (RFC 5424 + bus) per
successful mutation instead of the inline decky.<id>.state bus
publish.  The previous state topic published new_services only;
mutation events carry old/new/trigger, which is what the correlation
engine needs to interleave substrate-change markers into attacker
traversals.

- mutate_decky gains trigger: MutationTrigger = "operator" and
  captures old_services before the shuffle; replaces the inline
  _publish_safely(decky.<id>.state) with emit_decky_mutated(...).
- mutate_all derives trigger internally: operator when force or
  only-filter is set (CLI --all, API mutate-now, UI bus request);
  scheduled on interval ticks.  Passed through to each mutate_decky
  call.
- Tests updated: the old decky.<id>.state assertion is replaced
  with decky.<id>.mutation topic + mutation payload shape; 3 new
  tests cover trigger derivation for scheduled / force / only paths.
  26 tests in test_mutator.py green; 116 across mutator + topology
  + bus.
2026-04-21 19:31:31 -04:00
f875350d75 feat(mutator): emit_decky_mutated helper — RFC 5424 + bus in one call
First step toward making mutation events first-class nodes in the
correlation graph. Today the graph silently reflects post-mutation
state with no marker of the transition; this helper lands the
emitter the mutator and deploy paths will call.

- decnet/mutator/events.py: emit_decky_mutated(bus, *, decky,
  old_services, new_services, trigger, actor=None, log_path=None)
  writes an RFC 5424 line (service=mutator, hostname=<decky>,
  MSGID=decky_mutated, SD params for old/new services + trigger +
  optional actor) to DECNET_INGEST_LOG_FILE, then fire-and-forget
  publishes on decky.<id>.mutation. Either side failing is soft —
  the other path still completes.
- MutationTrigger Literal covers creation, retirement, scheduled,
  operator, behavioral, healer, federation. Reserved values for v2/v3
  (behavioral + federation) stay nullable so the schema is stable.
- decnet/bus/topics.py: DECKY_MUTATION constant + decky_mutation(id)
  builder. Distinct from DECKY_STATE ("current shape") because a
  mutation is a transition event, not a steady-state snapshot.
- Empty-set symmetry: creation emits old_services=[], retirement
  emits new_services=[]. Every decky lifecycle becomes a well-formed
  fold sequence on the correlator side.
- 4 new tests: FakeBus + correlator parser round-trip; creation and
  retirement empty-set cases; bus=None still writes syslog;
  unwritable log path doesn't block bus publish. 95 tests green
  across test_mutator + tests/bus.
2026-04-21 19:29:21 -04:00