DECNET

Author	SHA1	Message	Date
anti	d61e143b71	fix(stress): unblock Locust runs from login rate-limit self-DoS Locust spawns N virtual users (default 1000), all from 127.0.0.1 as admin. /auth/login is rate-limited 10/5min per-IP AND per-username, so the 11th on_start() got 429 and a RuntimeError. A @task(2) login in the task weights turned the whole run into a 429 factory even after ramp-up. And _login_with_retry treated 429 as non-retryable, so there was no graceful degradation path. Three changes, one root cause: - decnet/web/limiter.py: read DECNET_LIMITER_ENABLED (default true). When false, slowapi's Limiter(enabled=False) makes @limiter.limit a no-op. Default ships unchanged; nobody should ever release with this off. - tests/stress/conftest.py: set DECNET_LIMITER_ENABLED=false in the uvicorn subprocess env. Stress tests measure throughput, not rate limiting. - tests/stress/locustfile.py: drop the @task(2) login — it added zero coverage (every user already logs in at on_start) and only generated contention. Teach _login_with_retry to honour 429 + Retry-After so a Locust pointed at a limiter-enabled server degrades gracefully instead of crashing on_start.	2026-04-24 00:13:15 -04:00
anti	ae92948e22	test(live): align mqtt/postgres/mysql live tests with honeypot + loop realities Three unrelated test-correctness fixes exposed by running tests/live: - test_mqtt_live: honeypot defaults to auth-required (post-2018 realistic broker). Anonymous CONNECT is rejected with CONNACK rc=5, which the "accept" / "subscribe" tests misread as a failure. Pass MQTT_ACCEPT_ALL=1 via a new env= override on the live_service factory so only those two tests opt into accept-all. - test_postgres_live::test_auth_hash_logged: connected with dbname='prod', which isn't in the honeypot's per-instance DB list, so Postgres (correctly) rejected at startup before asking for a password — blowing past the auth event the test asserts on. Target 'postgres' (always in _BASE_DBS) to reach the auth stage. - test_mysql_backend_live: the module-scoped mysql_test_db_url fixture is bound to the module loop, but function-scoped tests default to their own per-function loops. Any reuse of the asyncmy pool then tripped "Future attached to a different loop". Pin the whole module with pytest.mark.asyncio(loop_scope='module').	2026-04-23 22:06:55 -04:00
anti	ea95a009df	refactor(tests): move flat tests/*.py into per-subsystem subfolders Groups every flat test_*.py under the module it exercises, matching the existing tests/{profiler,sniffer,prober,collector,correlation,cli,web,topology,swarm,bus,updater,api,docker,geoip,...} layout. New folders: services/, fleet/, config/, logging/, db/ (+ db/mysql/), telemetry/, mutator/, core/. Path-dependent __file__ references bumped an extra .parent in three files that moved one level deeper: - tests/sniffer/test_sniffer_ja3.py (template path) - tests/services/test_ssh_capture_emit.py (template path) - tests/cli/test_mode_gating.py (REPO root) - tests/web/test_env_lazy_jwt.py (repo var) Also drops two SQLite runtime artifacts (test_decnet.db-{shm,wal}) that were leaking into the repo from a previous test run. Fixes two test_service_isolation cases that patched asyncio.sleep (no longer on the profiler main-loop hot path — same pre-existing bug I fixed earlier in test_attacker_worker.py) by patching asyncio.wait_for and passing interval=0.	2026-04-23 21:34:25 -04:00
anti	1854f9de28	fix(tests): profiler worker tests patched asyncio.sleep, but main loop uses wait_for Since the event-driven shutdown refactor (`0fbb07c`), the profiler main loop is asyncio.wait_for(shutdown.wait(), timeout=interval) — no sleep on the hot path. The four worker tests that patched asyncio.sleep to raise CancelledError on the Nth call were silently no-op'ing and hanging on the real 30 s wait_for timeout. Replace the sleep patches with a shared _cancel_after helper that patches wait_for itself. Pass interval=0 so the loop ticks without delay between iterations.	2026-04-23 21:14:45 -04:00
anti	ffc275f051	feat(geoip): country-code enrichment via RIR delegated-stats Populates Attacker.country_code + country_source (MVP) using the five RIR delegated-stats files (ARIN/RIPE/APNIC/LACNIC/AFRINIC). Offline, license-free, no outbound traffic that could burn honeypot stealth. - decnet.geoip package with factory/base/lookup + rir/ subpackage (fetch/parse/provider) mirroring the db + bus factory convention - Profiler._build_record calls enrich_ip on every upsert - Idempotent ALTER TABLE migrations for both SQLite and MySQL - decnet geoip refresh/lookup CLI (master-only) - /var/lib/decnet/geoip seeded by decnet init - DECNET_GEOIP_ENABLED=false kill-switch; set in tests/conftest.py so unit tests never trigger the first-access fetch	2026-04-23 21:12:38 -04:00
anti	07bf3dc8cb	feat(config): promote /etc/decnet/decnet.ini to real config with domain sections The config file `decnet init` dropped at /etc/decnet/config.ini was a stub with a single [decnet] header saying 'reserved for future structured settings.' Admins who wanted to tune DECNET_API_HOST, DECNET_DB_URL, DECNET_BATCH_SIZE, etc. had to hunt env.py for the exact variable name and drop it in .env.local. Changes: - decnet/config_ini.py — adds a _DOMAIN_MAP translation table covering [api], [web], [database], [bus], [swarm], [logging], [ingester], [tracing]. Loads regardless of mode; unknown keys inside a known section log a WARNING (operator typos shouldn't be silent). Explicit key map (not auto kebab-to-snake) so [web] admin-user lands in DECNET_ADMIN_USER without silently renaming the env-var contract consumers import from decnet.env. - decnet/cli/init.py — renames the placeholder target config.ini → decnet.ini (unifies with the name already used by load_ini_config and the enroll bundle's _render_decnet_ini). Placeholder body now shows every domain section as a commented example so admins learn the shape by reading. Deinit removes both decnet.ini and the legacy config.ini so upgrading hosts leave no orphan file. Precedence is unchanged: real env > INI > built-in default in env.py. os.environ.setdefault means systemd EnvironmentFile= and one-off DECNET_FOO=bar decnet ... invocations always win. Secrets explicitly NOT moved to the INI: - DECNET_JWT_SECRET - DECNET_ADMIN_PASSWORD - DECNET_DB_PASSWORD They stay in .env.local / EnvironmentFile= — never in a group-readable INI, never in a diff, never on the dashboard. Dev/profiling flags (DECNET_DEVELOPER, DECNET_EMBED_*, DECNET_PROFILE_*) also stay env-only per maintainer direction — dev knobs shouldn't be one 'I'll flip this for tonight' away. Tests: +5 in test_config_ini.py (domain sections load regardless of mode, env beats INI for domain keys, unknown key warns, absent section is no-op, role section beats domain section via setdefault precedence). +1 in test_init.py (placeholder writes decnet.ini with every section header present as commented guidance). 31 tests pass across the two files (was 26).	2026-04-23 18:21:00 -04:00
anti	1753eca198	feat(deploy): templatize systemd services on install_dir via Jinja2 Distros reserve /opt for different things (some package managers own it outright), and a DECNET install that wants to live at /srv/decnet or /usr/local/decnet had to hand-edit 13 service files post-install. Converts every deploy/decnet-*.service to a .j2 template keyed on {{ install_dir }}, rendered by `decnet init` at install time. All other paths (log_dir, state_dir, runtime_dir, user, group) stay standard — only install_dir varies. Changes: - deploy/decnet-*.service → deploy/decnet-*.service.j2 (13 files). - decnet init gains --install-dir (default /opt/decnet, preserves existing behaviour byte-for-byte). Validates absolute-path at the CLI boundary. Threads through useradd --home-dir and the dir-creation list so the filesystem layout matches the rendered templates. - _install_units renders via Jinja2 with StrictUndefined (typo → loud error, not a silent broken unit). SHA over rendered output so operators with a custom install_dir get idempotent re-runs. - decnet.target, tmpfiles.d, polkit rule stay static — they don't reference install paths. - 4 new tests: custom install_dir renders into units, default remains /opt/decnet, relative paths rejected, second run with same custom dir is idempotent.	2026-04-23 18:08:26 -04:00
anti	4418608a54	fix(bus): silently drop publishes on closed bus instead of raising Worker bus instances (collector, ingester) close their private buses in finally blocks on shutdown, but stream threads holding closure references kept calling publish after close — one `RuntimeError: publish on closed bus` per stream line, caught by publish_safely and logged per call, flooding server logs. Changes: - `UnixSocketBus.publish()` now drops post-close calls. First drop WARNs loudly (bus is critical infra — silent drops would hide real problems); subsequent drops on the same instance log at DEBUG to prevent the flood. Sticky `_closed_publish_warned` flag, reset naturally per new bus instance. - `make_thread_safe_publisher` short-circuits on a closed bus before marshalling a coroutine onto the loop. Avoids the wasted scheduling work in the hot shutdown path. Degradation is safe: callers go through `publish_safely`, which already treats exceptions as 'dropped notification, DB is source of truth.' We just stop manufacturing the exception in the first place for a known-benign condition.	2026-04-23 18:00:47 -04:00
anti	eb2308d9e1	fix(bus): retry app-bus connect with backoff instead of one-shot veto A startup race between `decnet bus` being ready and the API's lifespan hitting `get_app_bus()` at api.py:135 would set `_tried = True` permanently, poisoning the singleton for the rest of the process: the dashboard shows BUS OFFLINE, topology SSE falls into the bus-is-None snapshot-only branch, mutator publish calls no-op. Only an API restart recovered. Replaces the one-shot veto with a time-gated retry keyed on a `_last_failure_ts` monotonic timestamp plus a 2 s backoff. Publishers on the hot path still pay at most one connect attempt every 2 s when the bus is down, but the singleton auto-recovers within 5 s (one dashboard poll) once the bus comes up. The asyncio lock still serialises concurrent callers so the bus server doesn't get stampeded with parallel connect attempts on startup.	2026-04-23 17:59:17 -04:00
anti	ef4179ea1f	feat(api): opaque 500 handler + error_id correlation for unhandled exceptions Registers a generic @app.exception_handler(Exception) that catches anything uncaught in route handlers / dependencies. Prod response is opaque: {detail: 'Internal Server Error', error_id: <uuid4 hex>}. Dev mode (DECNET_DEVELOPER=True) adds exception_type and traceback fields so failures are debuggable without tailing server logs. The error_id is logged alongside the full traceback server-side, letting operators correlate a user's 500 report with the exact exception via `grep <error_id> /var/log/decnet.log`. FastAPI's own HTTPException routing and the existing RequestValidationError / ValidationError / RateLimitExceeded handlers still take precedence — this handler only fires on genuinely-uncaught exceptions. Flips threat model F1/I 'traceback / stack trace leakage' from ? to M and logs a follow-up checklist entry for 4 detail=str(e) sites in the fleet deploy router (admin-gated, different threat class, separate audit).	2026-04-23 14:07:32 -04:00
anti	2f4f81e5de	feat(api): rate-limit /auth/login + scaffold threat model Adds slowapi two-bucket rate limit on /auth/login — 10 attempts per 5 minutes per-IP AND per-username, tripping either → 429. Per-IP catches botnets hitting one account; per-username catches distributed credential stuffing against one account. In-memory storage: dashboard API is single-process, Redis is disproportionate for v1. X-Forwarded-For is deliberately NOT trusted (spoofable); reverse-proxy deployments get one shared bucket per proxy IP. Logged in the threat model as accepted risk DA-08, to be revisited when a verified-proxy config lands. Also scaffolds development/THREAT_MODEL.md with STRIDE-per-element methodology, system-context DFD, and Dashboard↔API as the first fully worked component (7 sub-flows, ~50 threat entries). F1 Authn ships with 3 threats mitigated: rate limit (new), uniform 401 (verified already in place), bcrypt length clamp (verified already in place via Pydantic max_length=72).	2026-04-23 13:25:28 -04:00
anti	8cbb7834ef	feat(web): SMTP victim-domain + stored-mail panels on attacker detail Adds GET /attackers/{uuid}/smtp-targets (viewer) and GET /attackers/{uuid}/mail (admin) endpoints, plus two new sections on the attacker detail page: VICTIM DOMAINS rollup (aggregate-only, federation-gossip-safe) and STORED MAIL with a drawer that decodes headers, lists attachments, and downloads the raw .eml via the existing artifact endpoint (?service=smtp).	2026-04-22 22:33:53 -04:00
anti	d43303251d	feat(profiler): track SMTP victim domains per attacker New SmtpTarget table records each (attacker, domain) pair observed via the SMTP honeypots. Only the domain is stored — local-parts are dropped at ingestion, so this table holds no user-identifying data beyond the target organisation's identity. The profiler worker extracts domains from rcpt_to / rcpt_denied / message_accepted events, normalizes them (lowercase, strip local-part, drop blocked TLDs), and upserts one row per pair with a running count + first_seen / last_seen. Three repo methods shipped: * increment_smtp_target(attacker, domain) — upsert + bump * list_smtp_targets(attacker) — per-attacker view * smtp_target_seen(domain) — cross-attacker aggregate, shaped as the federation-gossip RPC that V2 will expose. The gossip-query shape is load-bearing: each operator can answer "have any of your attackers targeted corp1.com?" without leaking which attackers or when — the aggregate returns a bool + total count + first/last seen, nothing else.	2026-04-22 22:23:27 -04:00
anti	c50448995b	feat(smtp): capture full messages + attachments to disk SMTP template now writes each accepted DATA body as a .eml file into a bind-mounted per-decky quarantine dir and emits a `message_stored` log with sha256, size, decoded headers, and an attachment manifest (filename + sha256 + size + content-type). Attachment hashing uses the decoded payload so operators can match against VT / MalwareBazaar directly. Body accumulator is capped at SMTP_MAX_BODY_BYTES (default 10 MB, matching the EHLO SIZE advert) so a streaming client can't OOM the container. The existing /api/v1/artifacts/{decky}/{stored_as} endpoint now takes an optional ?service= query param (defaults to ssh for back-compat) and can serve .eml files out of the smtp subdir. Forensic metadata rides the normal log pipeline, same as SSH file_captured.	2026-04-22 22:17:50 -04:00
anti	119b4e8724	feat(db): add session_profile table for keystroke-dynamics fingerprints New purpose-built table with schema_version column committed from day one so V2 federation gossip can cluster sessions across operators without retrofitting. Ships with the empty write path (upsert_session_profile); ingestion of keystroke features (IKI moments, control-char rates, digraph SimHash) is tracked as V2 work. Closes gap #2 from SIGNAL_CAPTURE_AUDIT.md.	2026-04-22 21:39:17 -04:00
anti	d3321324eb	feat(sniffer): capture SSH client banner from TCP stream Parse RFC 4253 §4.2 identification strings from the first attacker→decky data segment on TCP/22; emit ssh_client_banner syslog events and bus fan-out. Profiler's sniffer_rollup dedupes observed banners into a new AttackerBehavior.ssh_client_banners JSON column. Closes gap #3 from SIGNAL_CAPTURE_AUDIT.md.	2026-04-22 21:37:01 -04:00
anti	8181f39ae2	feat(profiler): persist raw SSH KEX algorithm ordering Prober already emits kex_algorithms in hassh_fingerprint syslog events, but the raw ordered list was only queryable via the generic bounty store. Add a dedicated AttackerBehavior.kex_order_raw column (TEXT, JSON list) so post-v1 KEX-order fingerprinting has a typed, indexable home. Pipeline: - sniffer_rollup() now consumes hassh_fingerprint events and collects distinct kex_algorithms strings across ports. - build_behavior_record() JSON-encodes the list (NULL when empty). - sqlmodel_repo._deserialize_behavior() parses it back into a list. Closes pre-v1 gap #1 from SIGNAL_CAPTURE_AUDIT.md.	2026-04-22 21:29:46 -04:00
anti	5704e8fcce	fix(topology): delete topology_mutations in delete-cascade delete_topology_cascade manually deletes status_events, edges, deckies and lans but overlooked topology_mutations, so deleting any topology that ever had a mutation enqueued (i.e. edits while active|degraded) failed with an FK IntegrityError. Add the missing DELETE and extend the cascade test to seed a mutation row.	2026-04-22 17:50:30 -04:00
anti	91111ea7ee	feat(cli): add `decnet init --deinit` to undo a previous bootstrap Reverse of init, step-by-step: systemctl disable --now decnet.target, remove every decnet-*.service + decnet.target unit file, drop the polkit rule, drop the tmpfiles.d entry, daemon-reload, remove /etc/decnet + /etc/decnet/config.ini, /run/decnet, /opt/decnet, and userdel/groupdel the decnet identity. Preserves /var/lib/decnet and /var/log/decnet by default — those hold operator data. Pass `--deinit --purge` to rm -rf them too. Idempotent on a clean host (every step prints [SKIP]). Honours --dry-run. 5 new tests cover the full-undo path, --purge, idempotent clean-host deinit, dry-run side-effect-free behaviour, and the --purge without --deinit guard.	2026-04-22 14:31:56 -04:00
anti	3dae44c652	feat(cli): add `decnet init` one-shot master-host bootstrap Creates the decnet system user/group, installs every unit file from deploy/ into /etc/systemd/system, drops the polkit rule, seeds /opt/decnet + /var/{lib,log}/decnet + /etc/decnet + /run/decnet, writes a placeholder /etc/decnet/config.ini, applies the new tmpfiles.d entry so /run/decnet survives reboots, daemon-reloads, and `systemctl enable --now decnet.target`. Idempotent (re-runs print [SKIP] on already-configured items), --dry-run previews the plan without touching anything, --no-start defers the target start, --force overwrites even matching unit files. Master-only (added to MASTER_ONLY_COMMANDS). 9 orchestration tests cover the non-root gate, dry-run, useradd/groupadd argv, SKIP on present user/group, unit-file idempotency, --force overwrite, --no-start suppression, happy path, and the "deploy/ not found" error message.	2026-04-22 14:28:11 -04:00
anti	13ea916943	feat(workers): add start + start-all endpoints (systemd supervisor) POST /api/v1/workers/{name}/start — 202 on acceptance, 404 unknown worker, 503 if the unit file is not installed, 502 if systemctl returns non-zero (stderr snippet in detail, full stack logged). Admin only. POST /api/v1/workers/start-all — best-effort: walks the worker list in dependency order (bus → api → data-plane), skips already-active and uninstalled units, aggregates outcomes into {started, already_running, failed[]}. Returns 200 even on partial failure; the caller reads the three lists. Both endpoints delegate to the systemd_control helper, so the attack surface for "what gets executed" is locked to `decnet-<validated-name>.service` at two layers (router KNOWN_WORKERS + helper regex).	2026-04-22 14:12:29 -04:00
anti	0fbb07c2ec	feat(workers): bus-backed Workers panel (registry, control, installed flag) Ships the backend half of Config → Workers: * Worker registry aggregates `system.*.health` + `system.bus.health` heartbeats into a last-seen dict; OK / STALE / UNKNOWN tiers drop out of a 90s window (3× the 30s heartbeat interval). `GET /api/v1/workers` returns the snapshot plus `bus_connected` (so the UI can explain "all UNKNOWN" when the bus socket is down) and a per-row `installed` flag populated from `systemctl list-unit-files decnet-*.service` (cached 30s). `POST /api/v1/workers/{name}/stop` publishes a stop intent on `system.<name>.control`; workers listen via the shared control listener in `bus/publish.py`. * Heartbeat + control listener wired into collector / profiler / sniffer / prober / mutator worker loops. API self-heartbeats too so the panel always has one ground-truth row. * Topic helper `system_control(name)` + tests covering builder validation, control listener shutdown path, and the API surface (auth gating, bus-connected field, unknown-name 404). Adds `StartFailure` / `StartAllResponse` models in anticipation of the upcoming start endpoints (DEBT-034).	2026-04-22 14:10:39 -04:00
anti	fcaac648a4	feat(web): add systemd_control helper for worker unit management Thin async wrapper over `systemctl` — never shell=True, always create_subprocess_exec. Unit names are built from `decnet-<validated-name>.service`; the regex check is defence in depth on top of the router-level KNOWN_WORKERS validation. Exposes start / stop / is_active / list_installed; last is cached for 30s to keep the Workers panel cheap under REFRESH spam. On non-systemd hosts list_installed returns an empty set, so the UI renders with every row marked not-installed instead of 500-ing.	2026-04-22 14:08:35 -04:00
anti	a63708a3d1	test(templates): cover instance_seed helper and update service tests Add tests/service_testing/test_instance_seed.py — pins NODE_NAME to assert determinism of seeded functions and sweeps NODE_NAMEs to assert cross-fleet divergence. Conftest gains load_real_instance_seed() so template tests see the real seeding behavior instead of a stub. Existing template tests updated to pin NODE_NAME and match seeded outputs.	2026-04-22 09:24:28 -04:00
anti	6725197d58	test(web): transcripts API + attacker-transcripts router coverage Paging, truncation surfacing, admin gate, path traversal, sid-regex and decky-mismatch rejection for /transcripts; mirror coverage for /attackers/{uuid}/transcripts. Flips the Session Recording box in the roadmap (sessrec pty relay now shipping end-to-end).	2026-04-21 23:11:40 -04:00
anti	6e522c5a55	feat(web): transcripts API + repository lookups Adds get_attacker_transcripts (mirror of artifacts for session_recorded logs) and get_session_log for sid→shard resolution. New /api/v1/transcripts/{decky}/{sid}?offset=&limit= pages asciinema events out of the shared JSONL day-shard via an mtime-keyed byte-offset index — never scans the whole shard per request. New /api/v1/attackers/{uuid}/transcripts lists sessions for drilldown. Both endpoints admin-gated.	2026-04-21 23:06:39 -04:00
anti	8f25ff677f	feat(engine,api): add orphan topology resource reaper Topology rows deleted without a proper teardown leave Docker containers and bridge networks behind, holding IPAM pools that cause 403 "Pool overlaps" on the next deploy at the same subnet. - engine/reaper.py walks the local Docker daemon, extracts the 8-char topology prefix from every decnet_t_* resource, and force-removes containers + networks whose prefix is not in the repo. - POST /api/v1/topologies/reap-orphans (admin-only) returns a report of live/orphan prefixes and what was removed. - Resources belonging to live topologies are never touched; per-resource errors are captured without aborting the sweep.	2026-04-21 22:13:44 -04:00
anti	85bb0e2f65	fix(engine): roll back partial Docker state on deploy failure When create_bridge_network or compose-up raised mid-deploy, the deployer marked the topology FAILED and re-raised — but left every network it had already created alive. The next deploy attempt tripped over the orphans with 'Pool overlaps with other one on this address space' (IPAM conflict). Track networks created in the current attempt; on exception, tear down the started compose stack (if any), remove the networks in reverse order, and delete the compose file before marking FAILED. Rollback errors are logged but never mask the original failure. Covered by a new regression test that drives a docker client which succeeds once then raises, and asserts every created network is also removed.	2026-04-21 20:23:03 -04:00
anti	c266d1b6e3	feat(mutator,web): add_decky op — create-and-attach in one mutation apply_attach_decky requires an existing decky, so the MazeNET editor had no way to grow a live topology: creating a new decky on active topologies 409'd on the direct-CRUD createDecky call. - Backend: new apply_add_decky that creates the decky row + its home-LAN edge atomically, auto-allocating an IP if none pinned. Post-apply validation still runs. Added to DISPATCH + _MUTATION_OPS Literal + CLI help text. - Tests: 3 new ops tests (happy path, duplicate-name rejection, missing-LAN rejection) plus dispatch coverage update. - Frontend: useTopologyEditor gains addDeckyToLan() composite. Pending routes through createDecky + attachEdge as before; active routes through a single add_decky enqueue. MazeNET.tsx drag-archetype, duplicate, DMZ-gateway, and ctx-menu add-decky paths all use the composite so active topologies stop 409'ing on new-decky drops.	2026-04-21 20:13:39 -04:00
anti	a93cbe76f9	feat(mutator): update_decky payload accepts top-level services list apply_update_decky only merged payload.patch into decky_config. Since services is a separate DB column, there was no way to replace a decky's services list via a mutation. Add a top-level services key to the op payload that maps straight onto the services column. Unblocks the MazeNET editor routing service-add/service-drop actions through the mutation queue on active topologies.	2026-04-21 19:56:58 -04:00
anti	d4d8a2ad0d	feat(correlation): interleave mutation markers into attacker traversals Parser now tags ``mutator`` / ``decky_mutated`` lines with ``kind="mutation"`` so the engine can route them into a sibling ``_mutations`` index keyed by decky name instead of the per-IP attacker index. ``traversals()`` joins the two streams: every attacker gets a ``mutations_during`` list of markers from touched deckies bounded by their first/last-seen window. ``AttackerTraversal.to_dict()`` grows a ``mutations_during`` field and a ``timeline`` that chronologically interleaves hops and markers, so an ``SSH at T5 → mutation at T6 → HTTP at T7`` substrate transition is visible to UI consumers instead of reading as a silent discontinuity. The existing hops-only JSON shape is preserved; old clients that ignore unknown keys keep working.	2026-04-21 19:37:35 -04:00
anti	bf5ed7abbb	feat(engine): emit creation/retirement mutation events on deploy/teardown Close the lifecycle loop for the correlation graph: every decky now enters the substrate with an explicit `trigger=creation` event (old_services=[] ⇒ new_services=<initial>) and leaves it with `trigger=retirement` (old=<current> ⇒ new=[]). With scheduled/operator mutations already flowing through emit_decky_mutated, the entire decky lifecycle is now a well-formed sequence of mutation events — the correlator can fold substrate_state(t) at any T by replaying them. Lazy-imports mutator.events to dodge the engine↔mutator circular dependency. Bus is None at CLI sites; the syslog write is what the correlator consumes. Emission is soft-failing so a broken log path never aborts a deploy.	2026-04-21 19:35:05 -04:00
anti	fa0cdb3ab5	feat(mutator): route mutate_decky through emit_decky_mutated with trigger Mutator now emits one decky_mutated event (RFC 5424 + bus) per successful mutation instead of the inline decky.<id>.state bus publish. The previous state topic published new_services only; mutation events carry old/new/trigger, which is what the correlation engine needs to interleave substrate-change markers into attacker traversals. - mutate_decky gains trigger: MutationTrigger = "operator" and captures old_services before the shuffle; replaces the inline _publish_safely(decky.<id>.state) with emit_decky_mutated(...). - mutate_all derives trigger internally: operator when force or only-filter is set (CLI --all, API mutate-now, UI bus request); scheduled on interval ticks. Passed through to each mutate_decky call. - Tests updated: the old decky.<id>.state assertion is replaced with decky.<id>.mutation topic + mutation payload shape; 3 new tests cover trigger derivation for scheduled / force / only paths. 26 tests in test_mutator.py green; 116 across mutator + topology + bus.	2026-04-21 19:31:31 -04:00
anti	f875350d75	feat(mutator): emit_decky_mutated helper — RFC 5424 + bus in one call First step toward making mutation events first-class nodes in the correlation graph. Today the graph silently reflects post-mutation state with no marker of the transition; this helper lands the emitter the mutator and deploy paths will call. - decnet/mutator/events.py: emit_decky_mutated(bus, *, decky, old_services, new_services, trigger, actor=None, log_path=None) writes an RFC 5424 line (service=mutator, hostname=<decky>, MSGID=decky_mutated, SD params for old/new services + trigger + optional actor) to DECNET_INGEST_LOG_FILE, then fire-and-forget publishes on decky.<id>.mutation. Either side failing is soft — the other path still completes. - MutationTrigger Literal covers creation, retirement, scheduled, operator, behavioral, healer, federation. Reserved values for v2/v3 (behavioral + federation) stay nullable so the schema is stable. - decnet/bus/topics.py: DECKY_MUTATION constant + decky_mutation(id) builder. Distinct from DECKY_STATE ("current shape") because a mutation is a transition event, not a steady-state snapshot. - Empty-set symmetry: creation emits old_services=[], retirement emits new_services=[]. Every decky lifecycle becomes a well-formed fold sequence on the correlator side. - 4 new tests: FakeBus + correlator parser round-trip; creation and retirement empty-set cases; bus=None still writes syslog; unwritable log path doesn't block bus publish. 95 tests green across test_mutator + tests/bus.	2026-04-21 19:29:21 -04:00
anti	e23c6c4ee4	feat(mutator): bus-wake on decky mutate_request; adaptive sleep; heartbeat The flat-fleet mutator was DB-poll-only and noisy — it logged "no active deployment found" every 10s on idle hosts and ran mutate_all at a fixed tick regardless of when the next decky was due. - mutate_all returns seconds-until-next-due; watch loop sleeps min(next_due, poll_interval_secs) with a 1s floor. - "No deployment" is now idle, not an error: edge-triggered log on present<->absent transition instead of every tick. - mutate_decky publishes decky.<name>.state on successful compose so UIs react in real time. - New decky.*.mutate_request subscription lets API/CLI/UI force an immediate mutation of a specific decky without waiting for its interval; target name feeds mutate_all(only={...}). - system.mutator.health heartbeat via run_health_heartbeat helper, bringing the mutator in line with DEBT-031 workers. Tests: next_due return, only= filter, decky.<name>.state publish on success, no publish on compose failure. Full mutator+topology-mutator+bus suite (109) green.	2026-04-21 19:28:01 -04:00
anti	5c0631e12c	feat(agent,forwarder,updater): publish system.<worker>.health heartbeats (DEBT-031 workers 7-9) All three workers now share a run_health_heartbeat helper in decnet.bus.publish. Each publishes system.<worker>.health on a 30s tick with {worker, ts} plus optional per-worker extras. Subscribers can watch system.*.health to see every DECNET worker on a host at once. - agent: heartbeat runs inside the FastAPI lifespan alongside the existing master-facing heartbeat; bus-disabled path is a no-op. - forwarder: heartbeat task spawned at run_forwarder entry, cancelled in the finally block so a crashed master loop never leaks the task. - updater: new FastAPI lifespan hosts the heartbeat. Heartbeat helper swallows extra() failures and is cancellation-safe so lifespan teardown never hangs on it.	2026-04-21 17:02:10 -04:00
anti	cbb394a160	feat(ingester): publish system.log per committed batch (DEBT-031 worker 6) Ingester connects the bus at startup, emits a batch-committed summary (component/flushed/position) after each successful _flush_batch. Zero-row flushes are suppressed so the topic stays meaningful. Complements the collector's per-line system.log publishes: collector signals ingress, ingester signals DB-persisted progress. Federation forwarder (worker 8) will subscribe to the batch-committed leaf to trigger its upstream push. Bus stays optional: publish_safely swallows failures, get_bus() can return None, DECNET_BUS_ENABLED=false leaves the ingestion loop fully functional.	2026-04-21 16:58:49 -04:00
anti	a448dbe283	feat(collector): publish system.log per ingested event (DEBT-031 worker 5) log_collector_worker connects the bus at startup, builds a thread-safe system.log publisher, and hands it to each container-stream thread through _stream_container's new publish_fn parameter. Publishing fires right after the JSON record is written — same rate-limiter path, no extra parsing, compact payload (decky/service/event_type/attacker_ip/timestamp) so subscribers can redraw without re-reading the DB. Bus stays optional: if get_bus() fails or DECNET_BUS_ENABLED=false the factory returns a no-op publisher and the stream thread calls it unconditionally. Hook failures are logged and never abort the thread.	2026-04-21 16:57:21 -04:00
anti	67c2e30f89	feat(profiler): publish attacker.scored per profile upsert (DEBT-031 worker 4) The profiler worker threads its bus publisher through _WorkerState so _update_profiles can emit a compact attacker.scored event for every upsert. Payload carries the headline counts (event/service/decky/bounty/credential) plus is_traversal, so the MazeNET attacker pool can redraw without a round-trip. Bus stays optional: publish_attacker=None when DECNET_BUS_ENABLED=false or get_bus() fails, and hook exceptions are logged without breaking the upsert path.	2026-04-21 16:54:40 -04:00
anti	e51b65d7c3	feat(correlation,profiler): publish attacker.observed on first sighting (DEBT-031 worker 3) CorrelationEngine gains an optional publish_fn hook fired once per unique attacker IP. The profiler worker — sole caller of the engine today — carries the bus physically, builds a thread-safe publisher, and wraps it with the attacker.observed topic before handing it in. Bus stays optional: if get_bus() fails or DECNET_BUS_ENABLED=false, the engine runs publish_fn=None and the worker degrades to DB-only. Hook failures log a warning and never break ingestion.	2026-04-21 16:53:03 -04:00
anti	34d9e37ab0	feat(prober): publish attacker.fingerprinted on the bus (DEBT-031) Each successful JARM / HASSH / TCPfp probe fans out an attacker.fingerprinted event; the probe family goes in event.type so a single subscription covers all three. Payload carries the attacker IP, port, and probe-specific hash — enough for the MazeNET live map to render fingerprint info on observed attackers. Lifts the thread-safe publisher helper out of the sniffer worker into decnet/bus/publish.py so the prober (and every future worker with a to_thread hot path) can reuse it without copy-pasting the run_coroutine_threadsafe dance. Sniffer rewires onto the shared helper in passing. Adds ATTACKER_FINGERPRINTED as a new leaf — distinct from ATTACKER_OBSERVED (correlator's first-sight signal) because an active probe result is additional evidence about an already-observed attacker. Note: the plan's decky.{id}.state realism-probe publish path is deferred — the current prober fingerprints attackers, not decky realism. Will revisit when realism probes exist.	2026-04-21 16:47:55 -04:00
anti	7f497ac552	feat(sniffer): publish decky.{id}.traffic on the bus (DEBT-031) SnifferEngine gains an optional publish_fn hook, invoked after the dedup + syslog write for traffic-summary events only (tls_session, tcp_flow_timing, tcp_syn_fingerprint) — intermediate parser artifacts like tls_client_hello stay off the bus. The sniffer worker wires get_bus() + a thread-safe shim that marshals sync calls from the scapy sniff thread back onto the asyncio loop via run_coroutine_threadsafe. Bus failure at startup degrades cleanly to publish-off mode; publish failures at runtime never escape the sniff thread.	2026-04-21 16:35:50 -04:00
anti	f3eaab5d37	refactor(bus): extract publish_safely + extend topics for DEBT-031 Shared publish_safely helper at decnet/bus/publish.py so the nine workers about to be wired into the bus don't each copy-paste the "never raise back at the caller" contract. Mutator drops its private copy and imports the canonical one. topics.py gains the attacker.* hierarchy (observed, scored, session.started, session.ended) and a system_health(worker) builder for per-worker health heartbeats — both prerequisites for the worker rollout under DEBT-031.	2026-04-21 16:32:30 -04:00
anti	1968f6e741	test(mutator,web): cover bus publishes, bus-wake, and SSE events route - tests/topology/test_mutator.py: reconcile_topologies publishes applying+applied on success, applying+failed+status on failure; and stays safe when bus=None. _wake_on_enqueue sets its asyncio.Event on every matching enqueue event. - tests/api/topology/test_mutations.py: POST /mutations publishes mutation.enqueued after a successful DB write, via a FakeBus injected in place of the app-wide bus singleton. - tests/api/topology/test_events_stream.py: SSE route returns 401 unauthenticated, 404 for unknown topologies, and (driving the async generator directly) emits a snapshot on connect plus forwards a published mutation.applied as an `event: mutation.applied` SSE frame.	2026-04-21 14:39:12 -04:00
anti	fbf289ff63	feat(bus): host-local UNIX-socket pub/sub worker (DEBT-029) Land the `decnet bus` worker and `get_bus()` factory. Transport is a host-local UNIX-domain socket (0660, group=decnet); authz is the file mode. Wire framing is a tiny verb-line + 4-byte-BE length + orjson body. NATS-style wildcard topics (`*`, `>`). At-most-once, fire-and-forget — DB stays the source of truth. `FakeBus` / `NullBus` for tests and the disabled path. Cross-host federation is deferred to a future `--bridge-tcp` mode; DEBT-030 is master-only and unblocked.	2026-04-21 13:49:02 -04:00
anti	d9f3824086	test(topology): cover compose labels and tolerate docker filter kwarg test_compose asserts the new decnet.topology.* labels land on both base deckies (role=base, no service marker) and service fragments (service=true). The stub docker client in test_deploy grew a filters kwarg so it keeps matching the real .networks.list(filters=...) call signature now used by the deployer.	2026-04-21 10:24:15 -04:00
anti	0cdcfe2653	feat(agent/collector): topology-label discovery and master-authoritative supersede Legacy fleet deckies live in decnet-state.json; MazeNET topology containers don't. Tag them at compose-time with decnet.topology.service=true and let the collector match on that label. Spin up the agent's log collector on the first successful /topology/apply (not in the lifespan — that would break the no-docker-on-boot invariant) and tear it down with the app. Land log lines in DECNET_AGENT_LOG_FILE, separate from master-side DECNET_INGEST_LOG_FILE, so a dev box running both roles can't forward its own ingest back to itself. When master pushes a topology that differs from whatever is pinned locally, tear down the predecessor and accept the new one. Refusing with 409 left the agent stranded after partial deploys. record_error now persists the hydrated blob so a later teardown can still walk the LAN list — otherwise a half-failed apply strands containers + bridges with no breadcrumb back to them.	2026-04-21 10:23:10 -04:00
anti	12e18b75db	feat(swarm): expose needs_resync on TopologySummary + upsert record_error Two small observability follow-ups to the phase-1 agent/topology wiring: TopologySummary now carries needs_resync so operators can see the heartbeat's resync flag via the topology list/detail API without dropping into the DB. TopologyStore.record_error becomes an upsert — when a docker/compose failure fires during the first materialise (put() never reached), we still land a marker row so GET /topology/state surfaces the error and the next heartbeat carries an empty applied_version_hash. That empty hash is what master's heartbeat check relies on to flag the topology for resync instead of assuming the apply succeeded.	2026-04-21 01:41:30 -04:00
anti	0a14dbc9f4	test(agent): pin no-auto-restore-on-boot invariant for topology cache Four regression tests guarding Step 8 of the agent/topology wiring: - Lifespan startup must not call docker.from_env even with a populated topology.db — replace docker with a boom-stub and assert zero calls. - GET /topology/state returns the cached row verbatim without re-materialising bridges/containers; live observation is read-only. - Static guard: TopologyStore must not grow a restore/replay/reapply method without someone re-reading the module docstring. - Raw sqlite read + a second TopologyStore instance confirm the store is passive — nothing scrubs stale rows on open, which is the behaviour master's resync flow depends on.	2026-04-21 01:37:05 -04:00
anti	e8f9c955b3	feat(swarm): heartbeat-driven topology resync for agent-pinned deployments Agent heartbeats now carry an applied-topology snapshot. The master heartbeat handler compares the reported version_hash against what canonical_hash yields for the hydrated topology pinned to that host and flags Topology.needs_resync on divergence (or when the agent reports no topology at all while master expects one). The mutator watch loop gains reconcile_agent_resyncs, which re-pushes the current hydrated blob via AgentClient.apply_topology without touching status, then clears the flag on success. Push failures leave the flag set so the next tick retries.	2026-04-21 01:35:12 -04:00
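
Two mechanisms named in the log above lend themselves to a short illustration: the mutator's adaptive sleep (e23c6c4ee4: sleep min(next_due, poll_interval_secs) with a 1s floor) and the bus's NATS-style wildcard topics (fbf289ff63: `*`, `>`). The sketch below is not taken from the DECNET source. The function names `next_sleep` and `topic_matches` are hypothetical, and the wildcard semantics are assumed to follow the usual NATS convention, where `*` matches exactly one dot-separated token and `>` matches one or more trailing tokens and must come last.

```python
from typing import Optional


def next_sleep(next_due: Optional[float], poll_interval_secs: float,
               floor: float = 1.0) -> float:
    """Hypothetical helper: how long a watch loop might sleep before its next tick.

    Takes min(next_due, poll_interval_secs) and clamps the result to at
    least `floor` seconds. `next_due=None` (nothing scheduled) is an
    assumption of this sketch and falls back to the poll interval.
    """
    if next_due is None:
        candidate = poll_interval_secs
    else:
        candidate = min(next_due, poll_interval_secs)
    return max(floor, candidate)


def topic_matches(pattern: str, topic: str) -> bool:
    """Hypothetical matcher for NATS-style wildcard subscriptions.

    '*' matches exactly one token; '>' matches one or more remaining
    tokens and is only valid as the final pattern token.
    """
    p_tokens = pattern.split(".")
    t_tokens = topic.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # Valid only in last position, and at least one topic token must remain.
            return i == len(p_tokens) - 1 and len(t_tokens) > i
        if i >= len(t_tokens):
            # Pattern is longer than the topic.
            return False
        if p != "*" and p != t_tokens[i]:
            return False
    # Without a trailing '>', both sides must have the same number of tokens.
    return len(p_tokens) == len(t_tokens)
```

Under these assumptions, a subscription to `system.*.health` would see `system.mutator.health` but not `system.health`, and a loop with a 10s poll interval and a decky due in 0.2s would sleep for the 1s floor rather than spinning.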

276 Commits