Commit Graph

164 Commits

Author SHA1 Message Date
b4df9ea0a1 fix(swarm-mgmt): bundle URLs target master_host, not dashboard base_url 2026-04-19 04:52:20 -04:00
c6f7de30d2 feat(swarm-mgmt): agent enrollment bundle flow + admin swarm endpoints 2026-04-19 04:25:57 -04:00
37b22b76a5 feat(cli): auto-spawn listener as detached sibling from decnet swarmctl
Mirrors the agent→forwarder pattern: `decnet swarmctl` now fires the
syslog-TLS listener as a detached Popen sibling so a single master
invocation brings the full receive pipeline online. --no-listener opts
out for operators who want to run the listener on a different host (or
under their own systemd unit).

Listener bind host / port come from DECNET_LISTENER_HOST and
DECNET_SWARM_SYSLOG_PORT — both seedable from /etc/decnet/decnet.ini.
PID at $(pid_dir)/listener.pid so operators can kill/restart manually.

decnet.ini.example ships alongside env.config.example as the
documented surface for the new role-scoped config. Mode, forwarder
targets, listener bind, and master ports all live there — no more
memorizing flag trees.

Extends tests/test_auto_spawn.py with two swarmctl cases: listener is
spawned with the expected argv + PID file, and --no-listener suppresses.
2026-04-19 03:25:40 -04:00
43f140a87a feat(cli): auto-spawn forwarder as detached sibling from decnet agent
New _spawn_detached(argv, pid_file) helper uses Popen with
start_new_session=True + close_fds=True + DEVNULL stdio to launch a
DECNET subcommand as a fully independent process. The parent does NOT
wait(); if it dies the child survives under init. This is deliberately
not a supervisor — if the child dies the operator restarts it manually.

_pid_dir() picks /opt/decnet when writable else ~/.decnet, so both
root-run production and non-root dev work without ceremony.

`decnet agent` now auto-spawns `decnet forwarder --daemon ...` as
that detached sibling, pulling master host + syslog port from
DECNET_SWARM_MASTER_HOST / DECNET_SWARM_SYSLOG_PORT. --no-forwarder
opts out. If DECNET_SWARM_MASTER_HOST is unset the auto-spawn is
silently skipped (single-host dev or operator wants to start the
forwarder separately).

tests/test_auto_spawn.py monkeypatches subprocess.Popen and verifies:
the detach kwargs are passed, the PID file exists and contains a
valid positive integer (PID-file corruption is a real operational
headache — catching bad writes at the test level is free), the
--no-forwarder flag suppresses the spawn, and the unset-master-host
path silently skips.
2026-04-19 03:23:42 -04:00
3223bec615 feat(cli): gate master-only commands when DECNET_MODE=agent
- MASTER_ONLY_COMMANDS / MASTER_ONLY_GROUPS frozensets enumerate every
  command a worker host must not see. Comment block at the declaration
  puts the maintenance obligation in front of anyone touching command
  registration.
- _gate_commands_by_mode() filters both app.registered_commands (for
  @app.command() registrations) and app.registered_groups (for
  add_typer sub-apps) so the 'swarm' group disappears along with
  'api', 'swarmctl', 'deploy', etc. on agent hosts.
- _require_master_mode() is the belt-and-braces in-function guard,
  added to the four highest-risk commands (api, swarmctl, deploy,
  teardown). Protects against direct function imports that would
  bypass Typer.
- DECNET_DISALLOW_MASTER=false is the escape hatch for hybrid dev
  hosts that legitimately play both roles.

tests/test_mode_gating.py exercises help-text listings via subprocess
and the defence-in-depth guard via direct import.
2026-04-19 03:20:48 -04:00
65fc9ac2b9 fix(tests): clean up two pre-existing failures before config work
- decnet/agent/app.py /health: drop leftover 'push-test-2' canary
  planted during live VM push verification and never cleaned up;
  test_health_endpoint asserts the exact dict shape.
- tests/test_factory.py: switch the lazy-engine check from
  mysql+aiomysql (not in pyproject) to mysql+asyncmy (the driver the
  project actually ships). The test does not hit the wire so the
  dialect swap is safe.

Both were red on `pytest tests/` before any config/auto-spawn work
began; fixing them here so the upcoming commits land on a green
full-suite baseline.
2026-04-19 03:17:17 -04:00
1e8b73c361 feat(config): add /etc/decnet/decnet.ini loader
New decnet/config_ini.py parses a role-scoped INI file via stdlib
configparser and seeds os.environ via setdefault — real env vars still
win, keeping full back-compat with .env.local flows.

[decnet] holds role-agnostic keys (mode, disallow-master, log-file-path);
the role section matching `mode` is loaded, the other is ignored
silently so a worker never reads master-only keys (and vice versa).

Loader is standalone in this commit — not wired into cli.py yet.
2026-04-19 03:10:51 -04:00
9b1299458d fix(env): resolve DECNET_JWT_SECRET lazily so agent/updater subcommands don't need it
The module-level _require_env('DECNET_JWT_SECRET') call blocked
`decnet agent` and `decnet updater` from starting on workers that
legitimately have no business knowing the master's JWT signing key.

Move the resolution into a module `__getattr__`: only consumers that
actually read `decnet.env.DECNET_JWT_SECRET` trigger the validation,
which in practice means only decnet.web.auth (master-side).

Adds tests/test_env_lazy_jwt.py covering both the in-process lazy path
and an out-of-process `decnet agent --help` subprocess check with a
fully sanitized environment.
2026-04-19 02:43:25 -04:00
a266d6b17e feat(web): Remote Updates API — dashboard endpoints for pushing code to workers
Adds /api/v1/swarm-updates/{hosts,push,push-self,rollback} behind
require_admin. Reuses the existing UpdaterClient + tar_working_tree + the
per-host asyncio.gather pattern from api_deploy_swarm.py; tarball is
built exactly once per /push request and fanned out to every selected
worker. /hosts filters out decommissioned hosts and agent-only
enrollments (no updater bundle = not a target).

Connection drops during /update-self are treated as success — the
updater re-execs itself mid-response, so httpx always raises.

Pydantic models live in decnet/web/db/models.py (single source of
truth). 24 tests cover happy paths, rollback, transport failures,
include_self ordering (skip on rolled-back agents), validation, and
RBAC gating.
2026-04-19 01:01:09 -04:00
f5a5fec607 feat(deploy): systemd units w/ capability-based hardening; updater restarts agent via systemctl
Add deploy/ unit files for every DECNET daemon (agent, updater, api, web,
swarmctl, listener, forwarder). All run as User=decnet with NoNewPrivileges,
ProtectSystem, PrivateTmp, LockPersonality; AmbientCapabilities=CAP_NET_ADMIN
CAP_NET_RAW only on the agent (MACVLAN/scapy). Existing api/web units migrated
to /opt/decnet layout and the same hardening stanza.

Make the updater's _spawn_agent systemd-aware: under systemd (detected via
INVOCATION_ID + systemctl on PATH), `systemctl restart decnet-agent.service`
replaces the Popen path so the new agent inherits the unit's ambient caps
instead of the updater's empty set. _stop_agent becomes a no-op in that mode
to avoid racing systemctl's own stop phase.

Tests cover the dispatcher branch selection, MainPID parsing, and the
systemd no-op stop.
2026-04-19 00:44:06 -04:00
ebeaf08a49 fix(updater): fall back to /proc scan when agent.pid is missing
If the agent was started outside the updater (manually, during dev,
or from a prior systemd unit), there is no agent.pid for _stop_agent
to target, so a successful code install leaves the old in-memory
agent process still serving requests. Scan /proc for any decnet agent
command and SIGTERM all matches so restart is reliable regardless of
how the agent was originally launched.
2026-04-18 23:42:26 -04:00
7765b36c50 feat(updater): remote self-update daemon with auto-rollback
Adds a separate `decnet updater` daemon on each worker that owns the
agent's release directory and installs tarball pushes from the master
over mTLS. A normal `/update` never touches the updater itself, so the
updater is always a known-good rescuer if a bad agent push breaks
/health — the rotation is reversed and the agent restarted against the
previous release. `POST /update-self` handles updater upgrades
explicitly (no auto-rollback).

- decnet/updater/: executor, FastAPI app, uvicorn launcher
- decnet/swarm/updater_client.py, tar_tree.py: master-side push
- cli: `decnet updater`, `decnet swarm update [--host|--all]
  [--include-self] [--dry-run]`, `--updater` on `swarm enroll`
- enrollment API issues a second cert (CN=updater@<host>) signed by the
  same CA; SwarmHost records updater_cert_fingerprint
- tests: executor, app, CLI, tar tree, enroll-with-updater (37 new)
- wiki: Remote-Updates page + sidebar + SWARM-Mode cross-link
2026-04-18 21:40:21 -04:00
8914c27220 feat(swarm): add decnet swarm deckies to list deployed shards by host
`swarm list` only shows enrolled workers — there was no way to see which
deckies are running and where. Adds GET /swarm/deckies on the controller
(joins DeckyShard with SwarmHost for name/address/status) plus the CLI
wrapper with --host / --state filters and --json.
2026-04-18 21:10:07 -04:00
4db9c7464c fix(swarm): relocalize master-built config on worker before deploy
deploy --mode swarm was failing on every heterogeneous fleet: the master
populates config.interface from its own box (detect_interface() → its
default NIC), then ships that verbatim. The worker's deployer then calls
get_host_ip(config.interface), hits 'ip addr show wlp6s0' on a VM whose
NIC is enp0s3, and 500s.

Fix: agent.executor._relocalize() runs on every swarm-mode deploy.
Re-detects the worker's interface/subnet/gateway/host_ip locally and
swaps them into the config before calling deployer.deploy(). When the
worker's subnet doesn't match the master's, decky IPs are re-allocated
from the worker's subnet via allocate_ips() so they're reachable.

Unihost-mode configs are left untouched — they're already built against
the local box and second-guessing them would be wrong.

Validated against anti@192.168.1.13: master dispatched interface=wlp6s0,
agent logged 'relocalized interface=enp0s3', deployer ran successfully,
dry-run returned ok=deployed.

4 new tests cover both branches (matching-subnet preserves decky IPs;
mismatch re-allocates), the end-to-end executor.deploy() path, and the
unihost short-circuit.
2026-04-18 20:41:21 -04:00
411a797120 feat(cli): add decnet swarm check wrapper for POST /swarm/check
The swarmctl API already exposes POST /swarm/check — an active mTLS
probe that refreshes SwarmHost.status + last_heartbeat for every
enrolled worker. The CLI was missing a wrapper, so operators had to
curl the endpoint directly (which is how the VM validation run did it,
and how the wiki Deployment-Modes / SWARM-Mode pages ended up doc'ing
a command that didn't exist yet).

Matches the existing list/enroll/decommission pattern: typer subcommand
under swarm_app, --url override, Rich table output plus --json for
scripting. Three tests: populated table, empty-swarm path, and --json
emission.
2026-04-18 20:28:34 -04:00
bfc7af000a test(swarm): add forwarder/listener resilience scenarios
Covers failure modes the happy-path tests miss:

- log rotation (copytruncate): st_size shrinks under the forwarder, it
  resets offset=0 and reships the new contents instead of getting wedged
  past EOF;
- listener restart: forwarder retries, resumes from the persisted offset,
  and the previously-acked lines are NOT duplicated on the master;
- listener tolerates a well-authenticated client that sends a partial
  octet-count frame and drops — the server must stay up and accept
  follow-on connections;
- peer_cn / fingerprint_from_ssl degrade to 'unknown' / None when no
  peer cert is available (defensive path that otherwise rarely fires).
2026-04-18 19:56:51 -04:00
1e8ca4cc05 feat(swarm-cli): add decnet swarm {enroll,list,decommission} + deploy --mode swarm
New sub-app talks HTTP to the local swarm controller (127.0.0.1:8770 by
default; override with --url or $DECNET_SWARMCTL_URL).

- enroll: POSTs /swarm/enroll, prints fingerprint, optionally writes
  ca.crt/worker.crt/worker.key to --out-dir for scp to the worker.
- list: renders enrolled workers as a rich table (with --status filter).
- decommission: looks up uuid by --name, confirms, DELETEs.

deploy --mode swarm now:
  1. fetches enrolled+active workers from the controller,
  2. round-robin-assigns host_uuid to each decky,
  3. POSTs the sharded DecnetConfig to /swarm/deploy,
  4. renders per-worker pass/fail in a results table.

Exits non-zero if no workers exist or any worker's dispatch failed.
2026-04-18 19:52:37 -04:00
a6430cac4c feat(swarm): add decnet forwarder CLI to run syslog-over-TLS forwarder
The forwarder module existed but had no runner — closes that gap so the
worker-side process can actually be launched and runs isolated from the
agent (asyncio.run + SIGTERM/SIGINT → stop_event).

Guards: refuses to start without a worker cert bundle or a resolvable
master host ($DECNET_SWARM_MASTER_HOST or --master-host).
2026-04-18 19:41:37 -04:00
39d2077a3a feat(swarm): syslog-over-TLS log pipeline (RFC 5425, TCP 6514)
Worker-side log_forwarder tails the local RFC 5424 log file and ships
each line as an octet-counted frame to the master over mTLS. Offset is
persisted in a tiny local SQLite so master outages never cause loss or
duplication — reconnect resumes from the exact byte where the previous
session left off. Impostor workers (cert not signed by DECNET CA) are
rejected at TLS handshake.

Master-side log_listener terminates mTLS on 0.0.0.0:6514, validates the
client cert, extracts the peer CN as authoritative worker provenance,
and appends each frame to the master's ingest log files. Attacker-
controlled syslog HOSTNAME field is ignored — the CA-controlled CN is
the only source of provenance.

7 tests added covering framing codec, offset persistence across
reopens, end-to-end mTLS delivery, crash-resilience (offset survives
restart, no duplicate shipping), and impostor-CA rejection.

DECNET_SWARM_SYSLOG_PORT / DECNET_SWARM_MASTER_HOST env bindings
added.
2026-04-18 19:33:58 -04:00
811136e600 refactor(swarm): one file per endpoint, matching existing router layout
Splits the three grouped router files into eight api_<verb>_<resource>.py
modules under decnet/web/router/swarm/ to match the convention used by
router/fleet/ and router/config/. Shared request/response models live in
_schemas.py. Keeps each endpoint easy to locate and modify without
stepping on siblings.
2026-04-18 19:23:06 -04:00
63b0a58527 feat(swarm): master-side SWARM controller (swarmctl) + agent CLI
Adds decnet/web/swarm_api.py as an independent FastAPI app with routers
for host enrollment, deployment dispatch (sharding DecnetConfig across
enrolled workers via AgentClient), and active health probing. Runs as
its own uvicorn subprocess via 'decnet swarmctl', mirroring the isolation
pattern used by 'decnet api'. Also wires up 'decnet agent' CLI entry for
the worker side.

29 tests added under tests/swarm/test_swarm_api.py cover enrollment
(including bundle generation + duplicate rejection), host CRUD, sharding
correctness, non-swarm-mode rejection, teardown, and health probes with
a stubbed AgentClient.
2026-04-18 19:18:33 -04:00
cd0057c129 feat(swarm): DeckyConfig.host_uuid + fix agent log/status field refs
- decnet.models.DeckyConfig grows an optional 'host_uuid' (the SwarmHost
  that runs this decky). Defaults to None so legacy unihost state files
  deserialize unchanged.
- decnet.agent.executor: replace non-existent config.name references
  with config.mode / config.interface in logs and status payload.
- tests/swarm/test_state_schema.py covers legacy-dict roundtrip, field
  default, and swarm-mode assignments.
2026-04-18 19:10:25 -04:00
0c77cdab32 feat(swarm): master AgentClient — mTLS httpx wrapper around worker API
decnet.swarm.client exposes:
- MasterIdentity / ensure_master_identity(): the master's own CA-signed
  client bundle, issued once into ~/.decnet/ca/master/.
- AgentClient: async-context httpx wrapper that talks to a worker agent
  over mTLS. health/status/deploy/teardown methods mirror the agent API.

SSL context is built from a bare ssl.SSLContext(PROTOCOL_TLS_CLIENT)
instead of httpx.create_ssl_context — the latter layers on default-CA
and purpose logic that broke private-CA mTLS. Server cert is pinned by
CA + chain, not DNS (workers enroll with arbitrary SANs).

tests/swarm/test_client_agent_roundtrip.py spins uvicorn in-process
with real certs on disk and verifies:
- A CA-signed master client passes health + status calls.
- An impostor whose cert comes from a different CA cannot connect.
2026-04-18 19:08:36 -04:00
8257bcc031 feat(swarm): worker agent + fix pre-existing base_repo coverage test
Worker agent (decnet.agent):
- mTLS FastAPI service exposing /deploy, /teardown, /status, /health,
  /mutate. uvicorn enforces CERT_REQUIRED with the DECNET CA pinned.
- executor.py offloads the blocking deployer onto asyncio.to_thread so
  the event loop stays responsive.
- server.py refuses to start without an enrolled bundle in
  ~/.decnet/agent/ — unauthenticated agents are not a supported mode.
- docs/openapi disabled on the agent — narrow attack surface.

tests/test_base_repo.py: DummyRepo was missing get_attacker_artifacts
(pre-existing abstractmethod) and so could not be instantiated. Added
the stub + coverage for the new swarm CRUD surface on BaseRepository.
2026-04-18 07:15:53 -04:00
d3b90679c5 feat(swarm): PKI module — self-managed CA for master/worker mTLS
decnet.swarm.pki provides:
- generate_ca() / ensure_ca() — self-signed root, PKCS8 PEM, 4096-bit.
- issue_worker_cert() — per-worker keypair + cert signed by the CA with
  serverAuth + clientAuth EKU so the same identity backs the agent's
  HTTPS endpoint AND the syslog-over-TLS upstream.
- write_worker_bundle() / load_worker_bundle() — persist with 0600 on
  private keys.
- fingerprint() — SHA-256 DER hex for master-side pinning.

tests/swarm/test_pki.py covers:
- CA idempotency on disk.
- Signed chain validates against CA subject.
- SAN population (DNS + IP).
- Bundle roundtrip with 0600 key perms.
- End-to-end mTLS handshake between two CA-issued peers.
- Cross-CA client rejection (handshake fails).
2026-04-18 07:09:58 -04:00
7864c72948 test(ssh): add unit coverage for emit_capture RFC 5424 output
Exercises the JSON → syslog formatter end to end: flat fields ride as
SD params, bulky nested metadata collapses into the meta_json_b64 blob,
and the event_type / hostname / service mapping lands in the right
RFC 5424 header slots.
2026-04-18 05:37:42 -04:00
8bdc5b98c9 feat(collector): parse real PROCID and extract IPs from logger kv pairs
- Relaxed RFC 5424 regex to accept either NILVALUE or a numeric PROCID;
  sshd / sudo go through rsyslog with their real PID, while
  syslog_bridge emitters keep using '-'.
- Added a fallback pass that scans the MSG body for IP-shaped
  key=value tokens. This rescues attacker attribution for plain logger
  callers like the SSH PROMPT_COMMAND shim, which emits
  'CMD … src=IP …' without SD-element params.
2026-04-18 05:37:08 -04:00
41fd496128 feat(web): attacker artifacts endpoint + UI drawer
Adds the server-side wiring and frontend UI to surface files captured
by the SSH honeypot for a given attacker.

- New repository method get_attacker_artifacts (abstract + SQLModel
  impl) that joins the attacker's IP to `file_captured` log rows.
- New route GET /attackers/{uuid}/artifacts.
- New router /artifacts/{decky}/{service}/{stored_as} that streams a
  quarantined file back to an authenticated viewer.
- AttackerDetail grows an ArtifactDrawer panel with per-file metadata
  (sha256, size, orig_path) and a download action.
- ssh service fragment now sets NODE_NAME=decky_name so logs and the
  host-side artifacts bind-mount share the same decky identifier.
2026-04-18 05:36:48 -04:00
39dafaf384 feat(ssh-stealth): hide capture artifacts via XOR+gzip entrypoint blob
The /opt/emit_capture.py, /opt/syslog_bridge.py, and
/usr/libexec/udev/journal-relay files were plaintext and world-readable
to any attacker root-shelled into the SSH honeypot — revealing the full
capture logic on a single cat.

Pack all three into /entrypoint.sh as XOR+gzip+base64 blobs at build
time (_build_stealth.py), then decode in-memory at container start and
exec the capture loop from a bash -c string. No .py files under /opt,
no journal-relay file under /usr/libexec/udev, no argv_zap name
anywhere. The LD_PRELOAD shim is installed as
/usr/lib/x86_64-linux-gnu/libudev-shared.so.1 — sits next to the real
libudev.so.1 and blends into the multiarch layout.

A 1-byte random XOR key is chosen at image build so a bare
'base64 -d | gunzip' probe on the visible entrypoint returns binary
noise instead of readable Python.

Docker-dependent tests live under tests/docker/ behind a new 'docker'
pytest marker (excluded from the default run, same pattern as fuzz /
live / bench).
2026-04-18 05:34:50 -04:00
b0e00a6cc4 fix(ssh-capture): drop relay FIFO, rsyslog→/proc/1/fd/1 direct
The named pipe at /run/systemd/journal/syslog-relay had two problems
beyond its argv leak: any root-in-container process could (a) `cat`
the pipe and watch the live SIEM feed, and (b) write to it and inject
forged log lines. Since an attacker with a shell is already root
inside the honeypot, file permissions can't fix it.

Point rsyslog's auth/user actions directly at /proc/1/fd/1 — the
container-stdout fd Docker attached to PID 1 — and delete the
mkfifo + cat relay from the entrypoint. No pipe on disk, nothing to
read, nothing to inject, and one fewer cloaked process in `ps`.
2026-04-18 02:12:32 -04:00
2843aafa1a fix(ssh-capture): hide watcher bash argv and sanitize script header
Two leaks remained after the inotifywait argv fix:

1. The bash running journal-relay showed its argv[1] (the script path)
   in /proc/PID/cmdline, producing a line like
     'journal-relay /usr/libexec/udev/journal-relay'
   Apply argv_zap.so to that bash too.

2. argv_zap previously hardcoded PR_SET_NAME to 'kmsg-watch', which was
   wrong for any caller other than inotifywait. The comm name now comes
   from ARGV_ZAP_COMM so each caller can pick its own (kmsg-watch for
   inotifywait, journal-relay for the watcher bash).

3. The capture.sh header started with 'SSH honeypot file-catcher' —
   fatal if an attacker runs 'cat' on it. Rewritten as a plausible
   systemd-journal relay helper; stray 'attacker' / 'honeypot' words
   in mid-script comments stripped too.
2026-04-18 02:06:36 -04:00
766eeb3d83 feat(ssh): add ping/nmap/ca-certificates to base image
A lived-in Linux box ships with iputils-ping, ca-certificates, and nmap
available. Their absence is a cheap tell, and they're handy for letting
the attacker move laterally in ways we want to observe. iproute2 (ip a)
was already installed for attribution — noted here for completeness.
2026-04-18 01:53:33 -04:00
f462835373 feat(ssh-capture): LD_PRELOAD shim to zero inotifywait argv
The kmsg-watch (inotifywait) process was the last honest giveaway in
`ps aux` — its watch paths and event flags betrayed the honeypot. The
argv_zap.so shim hooks __libc_start_main, heap-copies argv for the real
main, then memsets the contiguous argv[1..] region to NUL so the kernel's
cmdline reader returns just argv[0].

gcc is installed and purged in the same Docker layer to keep the image
slim. The shim also calls prctl(PR_SET_NAME) so /proc/self/comm mirrors
the argv[0] disguise.
2026-04-18 01:52:30 -04:00
8dd4c78b33 refactor: strip DECNET tokens from container-visible surface
Rename the container-side logging module decnet_logging → syslog_bridge
(canonical at templates/syslog_bridge.py, synced into each template by
the deployer). Drop the stale per-template copies; setuptools find was
picking them up anyway. Swap useradd/USER/chown "decnet" for "logrelay"
so no obvious token appears in the rendered container image.

Apply the same cloaking pattern to the telnet template that SSH got:
syslog pipe moves to /run/systemd/journal/syslog-relay and the relay
is cat'd via exec -a "systemd-journal-fwd". rsyslog.d conf rename
99-decnet.conf → 50-journal-forward.conf. SSH capture script:
/var/decnet/captured → /var/lib/systemd/coredump (real systemd path),
logger tag decnet-capture → systemd-journal. Compose volume updated
to match the new in-container quarantine path.

SD element ID shifts decnet@55555 → relay@55555; synced across
collector, parser, sniffer, prober, formatter, tests, and docs so the
host-side pipeline still matches what containers emit.
2026-04-17 22:57:53 -04:00
69510fb880 fix(ssh-capture): cloak syslog relay pipe and cat process
Rename the rsyslog→stdout pipe from /var/run/decnet-logs (dead giveaway)
to /run/systemd/journal/syslog-relay, and launch the relay via
exec -a "systemd-journal-fwd" so ps shows a plausible systemd forwarder
instead of a bare cat. Casual ps/ls inspection now shows nothing
with "decnet" in the name.
2026-04-17 22:51:34 -04:00
09d9f8595e fix(ssh-capture): disguise watcher as udev helper in ps output
Old ps output was a dead giveaway: two "decnet-capture" bash procs
and a raw "inotifywait". Install script at /usr/libexec/udev/journal-relay
and invoke inotifywait through a /usr/libexec/udev/kmsg-watch symlink so
both now render as plausible udev/journal helpers under casual inspection.
2026-04-17 22:44:47 -04:00
a773dddd5c feat(ssh): capture attacker-dropped files with session attribution
inotifywait watches writable paths in the SSH decky and mirrors any
file close_write/moved_to into a per-decky host-mounted quarantine dir.
Each artifact carries a .meta.json with attacker attribution resolved
by walking the writer PID's PPid chain to the sshd session leader,
then cross-referencing ss and utmp for source IP/user/login time.
Also emits an RFC 5424 syslog line per capture for SIEM correlation.
2026-04-17 22:20:05 -04:00
255c2e5eb7 perf: cache auth user-lookup and admin list_users
The per-request SELECT users WHERE uuid=? in require_role was the
hidden tax behind every authed endpoint — it kept _execute at ~60%
across the profile even after the page caches landed. Even /health
(with its DB and Docker probes cached) was still 52% _execute from
this one query.

- dependencies.py: 10s TTL cache on get_user_by_uuid, well below JWT
  expiry. invalidate_user_cache(uuid) is called on password change,
  role change, and user delete.
- api_get_config.py: 5s TTL cache on the admin branch's list_users()
  (previously fetched every /config call). Invalidated on user
  create/update/delete.
- api_change_pass.py + api_manage_users.py: invalidation hooks on
  all user-mutating endpoints.
2026-04-17 19:56:39 -04:00
2dd86fb3bb perf: cache /bounty, /logs/histogram, /deckies; bump /config TTL to 5s
Round-2 follow-up: profile at 500c/u showed _execute still dominating
the uncached read endpoints (/bounty 76%, /logs/histogram 73%,
/deckies 56%). Same router-level TTL pattern as /stats — 5s window,
asyncio.Lock to collapse concurrent calls into one DB hit.

- /bounty: cache default unfiltered page (limit=50, offset=0,
  bounty_type=None, search=None). Filtered requests bypass.
- /logs/histogram: cache default (interval_minutes=15, no filters).
  Filtered / non-default interval requests bypass.
- /deckies: cache full response (endpoint takes no params).
- /config: bump _STATE_TTL from 1.0 to 5.0 — admin writes are rare,
  1s was too short for bursts to coalesce at high concurrency.
2026-04-17 19:30:11 -04:00
3cc5ba36e8 fix(cli): keep FileNotFoundError handling on decnet api
Popen moved inside the try so a missing uvicorn falls through to the
existing error message instead of crashing the CLI. test_cli was still
patching the old subprocess.run entrypoint; switched both api command
tests to patch subprocess.Popen / os.killpg to match the current path.
2026-04-17 19:09:15 -04:00
6301504c0e perf(api): TTL-cache /stats + unfiltered pagination counts
Every /stats call ran SELECT count(*) FROM logs + SELECT count(DISTINCT
attacker_ip) FROM logs; every /logs and /attackers call ran an
unfiltered count for the paginator. At 500 concurrent users these
serialize through aiosqlite's worker threads and dominate wall time.

Cache at the router layer (repo stays dialect-agnostic):
  - /stats response: 5s TTL
  - /logs total (only when no filters): 2s TTL
  - /attackers total (only when no filters): 2s TTL

Filtered paths bypass the cache. Pattern reused from api_get_config
and api_get_health (asyncio.Lock + time.monotonic window + lazy lock).
2026-04-17 19:09:15 -04:00
b5d7bf818f feat(health): 3-tier status (healthy / degraded / unhealthy)
Only database, docker, and ingestion_worker now count as critical
(→ 503 unhealthy). attacker/sniffer/collector failures drop overall
status to degraded (still 200) so the dashboard doesn't panic when a
non-essential worker isn't running.
2026-04-17 17:48:42 -04:00
a10aee282f perf(ingester): batch log writes into bulk commits
The ingester now accumulates up to DECNET_BATCH_SIZE rows (default 100)
or DECNET_BATCH_MAX_WAIT_MS (default 250ms) before flushing through
repo.add_logs — one transaction, one COMMIT per batch instead of per
row. Under attacker traffic this collapses N commits into ⌈N/100⌉ and
takes most of the SQLite writer-lock contention off the hot path.

Flush semantics are cancel-safe: _position only advances after a batch
commits successfully, and the flush helper bails without touching the
DB if the enclosing task is being cancelled (lifespan teardown).
Un-flushed lines stay in the file and are re-read on next startup.

Tests updated to assert on add_logs (bulk) instead of the per-row
add_log that the ingester no longer uses, plus a new test that 250
lines flush in ≤5 calls.
2026-04-17 16:37:34 -04:00
11b9e85874 feat(db): bulk add_logs for one-commit ingestion batches
Adds BaseRepository.add_logs (default: loops add_log for backwards
compatibility) and a real single-session/single-commit implementation
on SQLModelRepository. Introduces DECNET_BATCH_SIZE (default 100) and
DECNET_BATCH_MAX_WAIT_MS (default 250) so the ingester can flush on
either a size or a time bound when it adopts the new method.

Ingester wiring is deferred to a later pass — the single-log path was
deadlocking tests when flushed during lifespan teardown, so this change
ships the DB primitive alone.
2026-04-17 16:23:09 -04:00
d3f4bbb62b perf(locust): skip change-password in on_start when not required
Previously every user did login → change-pass → re-login in on_start
regardless of whether the server actually required a password change.
With bcrypt at ~250ms/call that's 3 bcrypt-bound requests per user.
At 2500 users the on_start queue was ~10k bcrypt ops — users never
escaped warmup, so @task endpoints never fired.

Login already returns must_change_password; only run the change-pass
+ re-login dance when the server says we have to. Cuts on_start from
3 requests to 1 for every user after the first DB initialization.
2026-04-17 15:15:59 -04:00
f1e14280c0 perf: 1s TTL cache for /health DB probe and /config state reads
Locust hit /health and /config on every @task(3), so each request was
firing repo.get_total_logs() and two repo.get_state() calls against
aiosqlite — filling the driver queue for data that changes on the order
of seconds, not milliseconds.

Both caches follow the shape already used by the existing Docker cache:
- asyncio.Lock with double-checked TTL so concurrent callers collapse
  into one DB hit per 1s window.
- _reset_* helpers called from tests/api/conftest.py::setup_db so the
  module-level cache can't leak across tests.

tests/test_health_config_cache.py asserts 50 concurrent callers
produce exactly 1 repo call, and the cache expires after TTL.
2026-04-17 15:05:18 -04:00
3945e72e11 perf: run bcrypt on a thread so it doesn't block the event loop
verify_password / get_password_hash are CPU-bound and take ~250ms each
at rounds=12. Called directly from async endpoints, they stall every
other coroutine for that window — the single biggest single-worker
bottleneck on the login path.

Adds averify_password / ahash_password that wrap the sync versions in
asyncio.to_thread. Sync versions stay put because _ensure_admin_user and
tests still use them.

5 call sites updated: login, change-password, create-user, reset-password.
tests/test_auth_async.py asserts parallel averify runs concurrently (~1x
of a single verify, not 2x).
2026-04-17 14:52:22 -04:00
bd406090a7 fix: re-seed admin password when still unfinalized (must_change_password=True)
_ensure_admin_user was strict insert-if-missing: once a stale hash landed
in decnet.db (e.g. from a deploy that used a different DECNET_ADMIN_PASSWORD),
login silently 401'd because changing the env var later had no effect.

Now on startup: if the admin still has must_change_password=True (they
never finalized their own password), re-sync the hash from the current
env var. Once the admin sets a real password, we leave it alone.

Found via locustfile.py login storm — see tests/test_admin_seed.py.

Note: this commit also bundles uncommitted pool-management work already
present in sqlmodel_repo.py from prior sessions.
2026-04-17 14:49:13 -04:00
cb12e7c475 fix: logging handler must not crash its caller on reopen failure
When decnet.system.log is root-owned (e.g. created by a pre-fix 'sudo
decnet deploy') and a subsequent non-root process tries to log, the
InodeAwareRotatingFileHandler raised PermissionError out of emit(),
which propagated up through logger.debug/info and killed the collector's
log stream loop ('log stream ended ... reason=[Errno 13]').

Now matches stdlib behaviour: wrap _open() in try/except OSError and
defer to handleError() on failure. Adds a regression test.

Also: scripts/profile/view.sh 'pyinstrument' keyword was matching
memray-flamegraph-*.html files. Exclude the memray-* prefix.
2026-04-17 14:01:36 -04:00
bf4afac70f fix: RotatingFileHandler reopens on external deletion/rotation
Mirrors the inode-check fix from 935a9a5 (collector worker) for the
stdlib-handler-based log paths. Both decnet.system.log (config.py) and
decnet.log (logging/file_handler.py) now use a subclass that stats the
target path before each emit and reopens on inode/device mismatch —
matching the behavior of stdlib WatchedFileHandler while preserving
size-based rotation.

Previously: rm decnet.system.log → handler kept writing to the orphaned
inode until maxBytes triggered; all lines between were lost.
2026-04-17 13:42:15 -04:00