Teardowns were synchronous all the way through: POST blocked on the
worker's docker-compose-down cycle (seconds to minutes), the frontend
locked tearingDown to a single string so only one button could be armed
at a time, and operators couldn't queue a second teardown until the
first returned. On a flaky worker that meant staring at a spinner for
the whole RTT.
Backend: POST /swarm/hosts/{uuid}/teardown returns 202 the instant the
request is validated. Affected shards flip to state='tearing_down'
synchronously before the response so the UI reflects progress
immediately, then the actual AgentClient call + DB cleanup run in an
asyncio.create_task (tracked in a module-level set to survive GC and
to be drainable by tests). On failure the shard flips to
'teardown_failed' with the error recorded — nothing is re-raised,
since there's no caller to catch it.
Frontend: swap tearingDown / decommissioning from 'string | null' to
'Set<string>'. Each button tracks its own in-flight state; the poll
loop picks up the final shard state from the backend. Multiple
teardowns can now be queued without blocking each other.
Submitting an INI with a single [decky1] was silently redeploying the
deckies from the *previous* deploy too. POST /deckies/deploy merged the
new INI into the stored DecnetConfig by name, so a 1-decky INI on top of
a prior 3-decky run still pushed 3 deckies to the worker. Those stale
decky2/decky3 kept their old IPs, collided on the parent NIC, and the
agent failed with 'Address already in use' — the deploy the operator
never asked for.
The INI is the source of truth for which deckies exist this deploy.
Full replace: config.deckies = list(new_decky_configs). Operators who
want to add more deckies should list them all in the INI.
Update the deploy-limit test to reflect the new replace semantics, and
add a regression test asserting prior state is dropped.
Agents already exposed POST /teardown; the master was missing the plumbing
to reach it. Add:
- POST /api/v1/swarm/hosts/{uuid}/teardown — admin-gated. Body
{decky_id: str|null}: null tears the whole host, a value tears one decky.
On worker failure the master returns 502 and leaves DB shards intact so
master and agent stay aligned.
- BaseRepository.delete_decky_shard(name) + sqlmodel impl for per-decky
cleanup after a single-decky teardown.
- SwarmHosts page: "Teardown all" button (keeps host enrolled).
- SwarmDeckies page: per-row "Teardown" button.
Also exclude setuptools' build/ staging dir from the enrollment tarball —
`pip install -e` on the master generates build/lib/decnet_web/node_modules
and the bundle walker was leaking it to agents. Align pyproject's bandit
exclude with the git-hook invocation so both skip decnet/templates/.
Agents now ship with collector/prober/sniffer as systemd services; mutator,
profiler, web, and API stay master-only (profiler rebuilds attacker profiles
against the master DB — no per-host DB exists). Expand _EXCLUDES to drop the
full decnet/web, decnet/mutator, decnet/profiler, and decnet_web trees from
the enrollment bundle.
Updater now calls _heal_path_symlink + _sync_systemd_units after rotation so
fleets pick up new unit files and /usr/local/bin/decnet tracks the shared venv
without a manual reinstall. daemon-reload runs once per update when any unit
changed.
Fix _service_registry matchers to accept systemd-style /usr/local/bin/decnet
cmdlines (psutil returns a list — join to string before substring-checking)
so agent-mode `decnet status` reports collector/prober/sniffer correctly.
Previously `decnet status` on an agent showed every microservice as DOWN
because deploy's auto-spawn was unihost-scoped and the agent CLI gate
hid the per-host commands. Now:
- collect, probe, profiler, sniffer drop out of MASTER_ONLY_COMMANDS
(they run per-host; master-side work stays master-gated).
- mutate stays master-only (it orchestrates swarm-wide respawns).
- decnet/mutator/ excluded from agent tarballs — never invoked there.
- decnet/web exclusion tightened: ship db/ + auth.py + dependencies.py
(profiler needs the repo singleton), drop api.py, swarm_api.py,
ingester.py, router/, templates/.
- Four new systemd unit templates (decnet-collector/prober/profiler/
sniffer) shipped in every enrollment tarball.
- enroll_bootstrap.sh enables + starts all four alongside agent and
forwarder at install time.
- updater restarts the aux units on code push so they pick up the new
release (best-effort — legacy enrollments without the units won't
fail the update).
- status table hides Mutator + API rows in agent mode.
Agents never run the FastAPI master app (decnet/web/) or serve the React
frontend (decnet_web/) — they run decnet.agent, decnet.updater, and
decnet.forwarder, none of which import decnet.web. Shipping the master
tree bloats every enrollment payload and needlessly widens the worker's
attack surface.
Excluded paths are unreachable on the worker (all cli.py imports of
decnet.web are inside master-only command bodies that the agent-mode
gate strips). Tests assert neither tree leaks into the tarball.
Bootstrap used to end with `decnet updater --daemon` which forks and
detaches — invisible to systemctl, no auto-restart, dies on reboot.
Ships a decnet-updater.service template matching the pattern of the
other units (Restart=on-failure, log to /var/log/decnet/decnet.updater.log,
certs from /etc/decnet/updater, install tree at /opt/decnet), bundles
it alongside agent/forwarder/engine units, and the installer now
`systemctl enable --now`s it when --with-updater is set.
Wi-Fi APs bind one MAC per associated station, so VirtualBox/VMware
guests bridged over Wi-Fi rotate the VM's DHCP lease when Docker's
macvlan starts emitting container-MAC frames through the vNIC. Adds a
`use_ipvlan` toggle on the Agent Enrollment tab (mirrors the updater
daemon checkbox): flips the flag on SwarmHost, bakes `ipvlan=true` into
the agent's decnet.ini, and `_worker_config` forces ipvlan=True on the
per-host shard at dispatch. Safe no-op on wired/bare-metal agents.
Deckies merged in from a prior deployment's saved state kept their
original host_uuid — which dispatch_decnet_config then 404'd on if that
host had since been decommissioned or re-enrolled at a different uuid.
Before round-robin assignment, drop any host_uuid that isn't in the live
swarm_hosts set so orphaned entries get reassigned instead of exploding
with 'unknown host_uuid'.
POST /deckies/deploy now branches on DECNET_MODE + enrolled host presence:
when the caller is a master with at least one reachable swarm host, round-
robin host_uuids are assigned over new deckies and the config is dispatched
via AgentClient. Falls back to local docker-compose otherwise.
Extracts the dispatch loop from api_deploy_swarm into dispatch_decnet_config
so both endpoints share the same shard/dispatch/persist path. Adds
GET /system/deployment-mode for the UI to show 'will shard across N hosts'
vs 'will deploy locally' before the operator clicks deploy.
Rename log-file-path -> log-directory (maps to DECNET_LOG_DIRECTORY). Bundle
now ships three systemd units rendered with agent_name/master_host and installs
them into /etc/systemd/system/. Bootstrap replaces direct 'decnet X --daemon'
calls with systemctl enable --now. Each unit pins DECNET_SYSTEM_LOGS so agent,
forwarder, and deckies logs land at decnet.{agent,forwarder}.log and decnet.log
under /var/log/decnet.
Adds /api/v1/swarm-updates/{hosts,push,push-self,rollback} behind
require_admin. Reuses the existing UpdaterClient + tar_working_tree + the
per-host asyncio.gather pattern from api_deploy_swarm.py; tarball is
built exactly once per /push request and fanned out to every selected
worker. /hosts filters out decommissioned hosts and agent-only
enrollments (no updater bundle = not a target).
Connection drops during /update-self are treated as success — the
updater re-execs itself mid-response, so httpx always raises.
Pydantic models live in decnet/web/db/models.py (single source of
truth). 24 tests cover happy paths, rollback, transport failures,
include_self ordering (skip on rolled-back agents), validation, and
RBAC gating.
Adds the server-side wiring and frontend UI to surface files captured
by the SSH honeypot for a given attacker.
- New repository method get_attacker_artifacts (abstract + SQLModel
impl) that joins the attacker's IP to `file_captured` log rows.
- New route GET /attackers/{uuid}/artifacts.
- New router /artifacts/{decky}/{service}/{stored_as} that streams a
quarantined file back to an authenticated viewer.
- AttackerDetail grows an ArtifactDrawer panel with per-file metadata
(sha256, size, orig_path) and a download action.
- ssh service fragment now sets NODE_NAME=decky_name so logs and the
host-side artifacts bind-mount share the same decky identifier.
The per-request SELECT users WHERE uuid=? in require_role was the
hidden tax behind every authed endpoint — it kept _execute at ~60%
across the profile even after the page caches landed. Even /health
(with its DB and Docker probes cached) was still 52% _execute from
this one query.
- dependencies.py: 10s TTL cache on get_user_by_uuid, well below JWT
expiry. invalidate_user_cache(uuid) is called on password change,
role change, and user delete.
- api_get_config.py: 5s TTL cache on the admin branch's list_users()
(previously fetched every /config call). Invalidated on user
create/update/delete.
- api_change_pass.py + api_manage_users.py: invalidation hooks on
all user-mutating endpoints.
Round-2 follow-up: profile at 500c/u showed _execute still dominating
the uncached read endpoints (/bounty 76%, /logs/histogram 73%,
/deckies 56%). Same router-level TTL pattern as /stats — 5s window,
asyncio.Lock to collapse concurrent calls into one DB hit.
- /bounty: cache default unfiltered page (limit=50, offset=0,
bounty_type=None, search=None). Filtered requests bypass.
- /logs/histogram: cache default (interval_minutes=15, no filters).
Filtered / non-default interval requests bypass.
- /deckies: cache full response (endpoint takes no params).
- /config: bump _STATE_TTL from 1.0 to 5.0 — admin writes are rare,
1s was too short for bursts to coalesce at high concurrency.
Every /stats call ran SELECT count(*) FROM logs + SELECT count(DISTINCT
attacker_ip) FROM logs; every /logs and /attackers call ran an
unfiltered count for the paginator. At 500 concurrent users these
serialize through aiosqlite's worker threads and dominate wall time.
Cache at the router layer (repo stays dialect-agnostic):
- /stats response: 5s TTL
- /logs total (only when no filters): 2s TTL
- /attackers total (only when no filters): 2s TTL
Filtered paths bypass the cache. Pattern reused from api_get_config
and api_get_health (asyncio.Lock + time.monotonic window + lazy lock).
Only database, docker, and ingestion_worker now count as critical
(→ 503 unhealthy). attacker/sniffer/collector failures drop overall
status to degraded (still 200) so the dashboard doesn't panic when a
non-essential worker isn't running.
Adds BaseRepository.add_logs (default: loops add_log for backwards
compatibility) and a real single-session/single-commit implementation
on SQLModelRepository. Introduces DECNET_BATCH_SIZE (default 100) and
DECNET_BATCH_MAX_WAIT_MS (default 250) so the ingester can flush on
either a size or a time bound when it adopts the new method.
Ingester wiring is deferred to a later pass — the single-log path was
deadlocking tests when flushed during lifespan teardown, so this change
ships the DB primitive alone.
Locust hit /health and /config on every @task(3), so each request was
firing repo.get_total_logs() and two repo.get_state() calls against
aiosqlite — filling the driver queue for data that changes on the order
of seconds, not milliseconds.
Both caches follow the shape already used by the existing Docker cache:
- asyncio.Lock with double-checked TTL so concurrent callers collapse
into one DB hit per 1s window.
- _reset_* helpers called from tests/api/conftest.py::setup_db so the
module-level cache can't leak across tests.
tests/test_health_config_cache.py asserts 50 concurrent callers
produce exactly 1 repo call, and the cache expires after TTL.
verify_password / get_password_hash are CPU-bound and take ~250ms each
at rounds=12. Called directly from async endpoints, they stall every
other coroutine for that window — the single biggest single-worker
bottleneck on the login path.
Adds averify_password / ahash_password that wrap the sync versions in
asyncio.to_thread. Sync versions stay put because _ensure_admin_user and
tests still use them.
5 call sites updated: login, change-password, create-user, reset-password.
tests/test_auth_async.py asserts parallel averify runs concurrently (~1x
of a single verify, not 2x).
Root cause: test_schemathesis.py mutates decnet.web.auth.SECRET_KEY at
module-level import time, poisoning JWT verification for all other tests
in the same process — even when fuzz tests are deselected.
- Add pytest_ignore_collect hook in tests/api/conftest.py to skip
collecting test_schemathesis.py unless -m fuzz is selected
- Add --dist loadscope to addopts so xdist groups by module (protects
module-scoped fixtures in live tests)
- Remove now-unnecessary xdist_group markers from live test classes
The live test modules set DECNET_CONTRACT_TEST=true at module level,
which persisted across xdist workers and caused the mutate endpoint
to short-circuit before the mock was reached. Clear the env var in
affected tests with monkeypatch.delenv.
- Fixed CLI tests by patching local imports at source (psutil, os, Path).
- Fixed Collector tests by globalizing docker.from_env mock.
- Stabilized SSE stream tests via AsyncMock and immediate generator termination to prevent hangs.
- Achieved >80% coverage on CLI (84%), Collector (97%), and DB Repository (100%).
- Implemented SMTP Relay service tests (100%).
- Rebuild repo.engine and repo.session_factory per-test using unique
in-memory SQLite URIs — fixes KeyError: 'access_token' caused by
stale session_factory pointing at production DB
- Add @pytest.mark.fuzz to all Hypothesis and Schemathesis tests;
default run excludes them (addopts = -m 'not fuzz')
- Add missing fuzz tests to bounty, fleet, histogram, and repository
- Use tmp_path for state file in patch_state_file/mock_state_file to
eliminate file-path race conditions under xdist parallelism
- Set default addopts: -v -q -x -n logical (26 tests in ~7s)