157 Commits

Author SHA1 Message Date
201d246c07 fix(ci): fix indentation on ci.yaml
Some checks failed
CI / Lint (ruff) (push) Successful in 14s
CI / SAST (bandit) (push) Successful in 15s
CI / Test (Standard) (3.11) (push) Has been cancelled
CI / Test (Live) (3.11) (push) Has been cancelled
CI / Test (Fuzz) (3.11) (push) Has been cancelled
CI / Merge dev → testing (push) Has been cancelled
CI / Prepare Merge to Main (push) Has been cancelled
CI / Finalize Merge to Main (push) Has been cancelled
CI / Dependency audit (pip-audit) (push) Has been cancelled
2026-04-20 16:46:30 -04:00
47cd200e1d feat(mazenet): repo methods for topology/LAN/decky/edge/status events
Adds topology CRUD to BaseRepository (NotImplementedError defaults) and
implements them in SQLModelRepository: create/get/list/delete topologies,
add/update/list LANs and TopologyDeckies, add/list edges, plus an atomic
update_topology_status that appends a TopologyStatusEvent in the same
transaction.  Cascade delete sweeps children before the topology row.

Covered by tests/topology/test_repo.py (roundtrip, per-topology name
uniqueness, status event log, cascade delete, status filter) and an
extension to tests/test_base_repo.py for the NotImplementedError surface.
2026-04-20 16:43:49 -04:00
096a35b24a feat(mazenet): add topology schema to models.py
Introduces five new SQLModel tables for MazeNET (nested deception
topologies): Topology, LAN, TopologyDecky, TopologyEdge, and
TopologyStatusEvent.  DeckyShard is intentionally not touched —
TopologyDecky is a purpose-built sibling for MazeNET's lifecycle
(topology-scoped UUIDs, per-topology name uniqueness).

Part of MazeNET v1 (nested self-container network-of-networks).
2026-04-20 16:40:10 -04:00
8a2876fe86 fix(api): document missing HTTP status codes on router endpoints
All checks were successful
CI / Lint (ruff) (push) Successful in 16s
CI / SAST (bandit) (push) Successful in 18s
CI / Dependency audit (pip-audit) (push) Successful in 26s
CI / Test (Standard) (3.11) (push) Successful in 2m41s
CI / Test (Live) (3.11) (push) Successful in 1m6s
CI / Test (Fuzz) (3.11) (push) Successful in 1h9m14s
CI / Finalize Merge to Main (push) Has been skipped
CI / Merge dev → testing (push) Successful in 12s
CI / Prepare Merge to Main (push) Has been skipped
Schemathesis was failing CI on routes that returned status codes not
declared in their OpenAPI responses= dicts. Adds the missing codes
across swarm_updates, swarm_mgmt, swarm, fleet and attackers routers.

Also adds 400 to every POST/PUT/PATCH that accepts a JSON body —
Starlette returns 400 on malformed/non-UTF8 bodies before FastAPI's
422 validation runs, which schemathesis fuzzing trips every time.

No handler logic changed.
2026-04-20 15:25:02 -04:00
3e8e4c9e1c fix(ci): run less harsh tests on CI, let local runners run harder ones 2026-04-20 14:07:34 -04:00
64bc6fcb1d chores(pyproj): modified some values 2026-04-20 13:22:49 -04:00
af9d59d3ee fixed(api): documentation 2026-04-20 13:20:42 -04:00
4197441c01 fix(ci): skip live service isolation
Some checks failed
CI / Lint (ruff) (push) Successful in 12s
CI / SAST (bandit) (push) Successful in 15s
CI / Dependency audit (pip-audit) (push) Successful in 22s
CI / Test (Standard) (3.11) (push) Successful in 2m47s
CI / Test (Live) (3.11) (push) Successful in 1m7s
CI / Test (Fuzz) (3.11) (push) Failing after 45m40s
CI / Merge dev → testing (push) Has been skipped
CI / Prepare Merge to Main (push) Has been skipped
CI / Finalize Merge to Main (push) Has been skipped
2026-04-20 13:14:48 -04:00
1b70d6db87 fix(ci): added skipif on mysql absence
Some checks failed
CI / Lint (ruff) (push) Successful in 12s
CI / SAST (bandit) (push) Successful in 15s
CI / Dependency audit (pip-audit) (push) Successful in 24s
CI / Test (Standard) (3.11) (push) Successful in 2m51s
CI / Test (Live) (3.11) (push) Failing after 1m2s
CI / Test (Fuzz) (3.11) (push) Has been skipped
CI / Merge dev → testing (push) Has been skipped
CI / Prepare Merge to Main (push) Has been skipped
CI / Finalize Merge to Main (push) Has been skipped
2026-04-20 13:07:31 -04:00
038596776a feat(ci): added live mysql service on test-live
Some checks failed
CI / Lint (ruff) (push) Successful in 12s
CI / SAST (bandit) (push) Successful in 15s
CI / Dependency audit (pip-audit) (push) Successful in 23s
CI / Test (Standard) (3.11) (push) Successful in 2m49s
CI / Test (Live) (3.11) (push) Failing after 1m1s
CI / Test (Fuzz) (3.11) (push) Has been skipped
CI / Merge dev → testing (push) Has been skipped
CI / Prepare Merge to Main (push) Has been skipped
CI / Finalize Merge to Main (push) Has been skipped
2026-04-20 12:54:03 -04:00
692ac35ee4 modification(versions): drop 3.12 tests and support only 3.11
Some checks failed
CI / Lint (ruff) (push) Successful in 17s
CI / SAST (bandit) (push) Successful in 19s
CI / Dependency audit (pip-audit) (push) Successful in 26s
CI / Test (Standard) (3.11) (push) Successful in 2m58s
CI / Test (Live) (3.11) (push) Failing after 58s
CI / Test (Fuzz) (3.11) (push) Has been skipped
CI / Merge dev → testing (push) Has been skipped
CI / Prepare Merge to Main (push) Has been skipped
CI / Finalize Merge to Main (push) Has been skipped
2026-04-20 12:43:03 -04:00
f064690452 fixed(tests): jwt_lazy
Some checks failed
CI / Lint (ruff) (push) Successful in 12s
CI / SAST (bandit) (push) Successful in 15s
CI / Dependency audit (pip-audit) (push) Successful in 23s
CI / Test (Standard) (3.11) (push) Successful in 5m6s
CI / Test (Standard) (3.12) (push) Failing after 3h14m38s
CI / Test (Live) (3.11) (push) Has been cancelled
CI / Test (Fuzz) (3.11) (push) Has been cancelled
CI / Merge dev → testing (push) Has been cancelled
CI / Prepare Merge to Main (push) Has been cancelled
CI / Finalize Merge to Main (push) Has been cancelled
2026-04-20 02:26:54 -04:00
dd82cd3f39 fixed(tests): mode_gating
Some checks failed
CI / Lint (ruff) (push) Successful in 12s
CI / SAST (bandit) (push) Successful in 15s
CI / Dependency audit (pip-audit) (push) Successful in 23s
CI / Test (Standard) (3.11) (push) Failing after 5m0s
CI / Test (Live) (3.11) (push) Has been cancelled
CI / Test (Fuzz) (3.11) (push) Has been cancelled
CI / Merge dev → testing (push) Has been cancelled
CI / Prepare Merge to Main (push) Has been cancelled
CI / Finalize Merge to Main (push) Has been cancelled
CI / Test (Standard) (3.12) (push) Has been cancelled
2026-04-20 02:18:11 -04:00
ff3e376726 modified(actions): modified actions to bypass bandit on decnet/templates
Some checks failed
CI / Lint (ruff) (push) Successful in 12s
CI / SAST (bandit) (push) Successful in 16s
CI / Dependency audit (pip-audit) (push) Successful in 31s
CI / Test (Standard) (3.11) (push) Failing after 4m52s
CI / Test (Live) (3.11) (push) Has been cancelled
CI / Test (Fuzz) (3.11) (push) Has been cancelled
CI / Merge dev → testing (push) Has been cancelled
CI / Prepare Merge to Main (push) Has been cancelled
CI / Finalize Merge to Main (push) Has been cancelled
CI / Test (Standard) (3.12) (push) Has been cancelled
2026-04-20 02:05:36 -04:00
47f2ca8d5f added(tests): schemathesis contract fuzzing at the agent and swarmctl level
Some checks failed
CI / Lint (ruff) (push) Successful in 17s
CI / SAST (bandit) (push) Failing after 19s
CI / Dependency audit (pip-audit) (push) Failing after 38s
CI / Test (Standard) (3.11) (push) Has been skipped
CI / Test (Standard) (3.12) (push) Has been skipped
CI / Test (Live) (3.11) (push) Has been skipped
CI / Test (Fuzz) (3.11) (push) Has been skipped
CI / Merge dev → testing (push) Has been skipped
CI / Prepare Merge to Main (push) Has been skipped
CI / Finalize Merge to Main (push) Has been skipped
2026-04-20 01:27:39 -04:00
da3e675f86 fix(tests): fixed locust fixtures and rampups, since >100 generally isn't very well managed 2026-04-20 01:26:56 -04:00
2febd921bc fix(models): added lenght validation to the common name, which per RFC 5280 must be max =< 64 2026-04-20 01:26:07 -04:00
12b5c25cd7 fix(agent-routes): added undocumented responses 2026-04-20 01:24:05 -04:00
5b70a34c94 fix(routes): added undocumented responses 2026-04-20 01:23:07 -04:00
4abfac1a98 fix: monkeypatch test db URL 2026-04-20 00:03:12 -04:00
9eca33938d chore: deleted swp file 2026-04-19 23:51:59 -04:00
195580c74d test: fix templates paths, CLI gating, and stress-suite harness
- tests/**: update templates/ → decnet/templates/ paths after module move
- tests/mysql_spinup.sh: use root:root and asyncmy driver
- tests/test_auto_spawn.py: patch decnet.cli.utils._pid_dir (package split)
- tests/test_cli.py: set DECNET_MODE=master in api-command tests
- tests/stress/conftest.py: run locust out-of-process via its CLI + CSV
  stats shim to avoid urllib3 RecursionError from late gevent monkey-patch;
  raise uvicorn startup timeout to 60s, accept 401 from auth-gated health,
  strip inherited DECNET_* env, surface stderr on 0-request runs
- tests/stress/test_stress.py: loosen baseline thresholds to match hw
2026-04-19 23:50:53 -04:00
262a84ca53 refactor(cli): split decnet/cli.py monolith into decnet/cli/ package
The 1,878-line cli.py held every Typer command plus process/HTTP helpers
and mode-gating logic. Split into one module per command using a
register(app) pattern so submodules never import app at module scope,
eliminating circular-import risk.

- utils.py: process helpers, _http_request, _kill_all_services, console, log
- gating.py: MASTER_ONLY_* sets, _require_master_mode, _gate_commands_by_mode
- deploy.py: deploy + _deploy_swarm (tightly coupled)
- lifecycle.py: status, teardown, redeploy
- workers.py: probe, collect, mutate, correlate
- inventory.py, swarm.py, db.py, and one file per remaining command

__init__.py calls register(app) on each module then runs the mode gate
last, and re-exports the private symbols tests patch against
(_db_reset_mysql_async, _kill_all_services, _require_master_mode, etc.).

Test patches retargeted to the submodule where each name now resolves.
Enroll-bundle tarball test updated to assert decnet/cli/__init__.py.

No behavioral change.
2026-04-19 22:42:52 -04:00
d1b7e94325 fix(swarm): inject peer cert into ASGI scope for uvicorn <= 0.44
Uvicorn's h11/httptools HTTP protocols don't populate scope['extensions']['tls'], so /swarm/heartbeat's per-request cert pinning was 403ing every call despite CERT_REQUIRED validating the cert at handshake. Patch RequestResponseCycle.__init__ on both protocol modules to read the peer cert off the asyncio transport and write DER bytes into scope['extensions']['tls']['client_cert_chain']. Importing the module from swarm_api.py auto-installs the patch in the swarmctl uvicorn worker before any request is served.
2026-04-19 22:09:11 -04:00
33d954a61c feat(web-ui): unify SwarmDeckies into DeckyFleet with swarm card mode
DeckyFleet now branches on /system/deployment-mode: in swarm mode it
pulls /swarm/deckies and normalises DeckyShardView into the shared
Decky shape so the same card grid renders either way. Swarm cards gain
a host badge (host_name @ address), a state pill (running/degraded/
tearing_down/failed/teardown_failed with matching colors), an inline
last_error snippet, and a two-click arm/commit Teardown button lifted
from the old SwarmDeckies component. Mutate + interval controls are
hidden in swarm mode since the worker /mutate endpoint still 501s —
swarm-side rotation is a separate ticket.

Drops the standalone /swarm/deckies route + nav entry; SwarmDeckies.tsx
is deleted. The SWARM nav group keeps SwarmHosts, Remote Updates, and
Agent Enrollment.
2026-04-19 21:53:26 -04:00
bf01804736 feat(agent): periodic heartbeat loop posting status to swarmctl
New decnet.agent.heartbeat asyncio loop wired into the agent FastAPI
lifespan. Every 30 s the worker POSTs executor.status() to the master's
/swarm/heartbeat with its DECNET_HOST_UUID for self-identity; the
existing agent mTLS bundle provides the client cert the master pins
against SwarmHost.client_cert_fingerprint.

start() is a silent no-op when identity env (HOST_UUID, MASTER_HOST) is
unset or the worker bundle is missing, so dev runs and un-enrolled hosts
don't crash the agent app. On non-204 responses the loop logs loudly but
keeps ticking — an operator may re-enrol mid-session, and fail-closed
pinning shouldn't be self-silencing.
2026-04-19 21:49:34 -04:00
62f7c88b90 feat(swarmctl): --tls with auto-issued or BYOC server cert
swarmctl CLI gains --tls/--cert/--key/--client-ca flags. With --tls the
controller runs uvicorn under HTTPS + mTLS (CERT_REQUIRED) so worker
heartbeats can reach it cross-host. Default is still 127.0.0.1 plaintext
for backwards compat with the master-CLI enrollment flow.

Auto-issue path (no --cert/--key given): a server cert signed by the
existing DECNET CA is issued once and parked under ~/.decnet/swarmctl/.
Workers already ship that CA's ca.crt from the enroll bundle, so they
verify the endpoint with no extra trust config. BYOC via --cert/--key
when the operator wants a publicly-trusted or externally-managed cert.
The auto-cert path is idempotent across restarts to keep a stable
fingerprint for any long-lived mTLS sessions.
2026-04-19 21:46:32 -04:00
e411063075 feat(swarm): ship host_uuid + swarmctl-port in agent enroll bundle
The rendered /etc/decnet/decnet.ini now carries host-uuid and
swarmctl-port in [agent], which config_ini seeds into DECNET_HOST_UUID
and DECNET_SWARMCTL_PORT. Gives the worker a stable self-identity for
the heartbeat loop — the INI never has to be rewritten because cert
pinning is the real gate (a rotated UUID with a matching CA-signed
cert would still be blocked by SHA-256 fingerprint mismatch against
the stored SwarmHost row).

Also adds DECNET_MASTER_HOST so the agent can find the swarmctl URL
via the INI's existing master-host key.
2026-04-19 21:44:23 -04:00
148e51011c feat(swarm): agent→master heartbeat with per-host cert pinning
New POST /swarm/heartbeat on the swarm controller. Workers post every
~30s with the output of executor.status(); the master bumps
SwarmHost.last_heartbeat and re-upserts each DeckyShard with a fresh
DeckyConfig snapshot and runtime-derived state (running/degraded).

Security: CA-signed mTLS alone is not sufficient — a decommissioned
worker's still-valid cert could resurrect ghost shards. The endpoint
extracts the presented peer cert (primary: scope["extensions"]["tls"],
fallback: transport.get_extra_info("ssl_object")) and SHA-256-pins it
to the SwarmHost.client_cert_fingerprint stored for the claimed
host_uuid. Extraction is factored into _extract_peer_fingerprint so
tests can exercise both uvicorn scope shapes and the both-unavailable
fail-closed path without mocking uvicorn's TLS pipeline.

Adds get_swarm_host_by_fingerprint to the repo interface (SQLModel
impl reuses the indexed client_cert_fingerprint column).
2026-04-19 21:37:15 -04:00
3ebd206bca feat(swarm): persist DeckyConfig snapshot per shard + enrich list API
Dispatch now writes the full serialised DeckyConfig into
DeckyShard.decky_config (plus decky_ip as a cheap extract), so the
master can render the same rich per-decky card the local-fleet view
uses — hostname, distro, archetype, service_config, mutate_interval,
last_mutated — without round-tripping to the worker on every page
render. DeckyShardView gains the corresponding fields; the repository
flattens the snapshot at read time. Pre-migration rows keep working
(fields fall through as None/defaults).

Columns are additive + nullable so SQLModel.metadata.create_all handles
the change on both SQLite and MySQL. Backfill happens organically on
the next dispatch or (in a follow-up) agent heartbeat.
2026-04-19 21:29:45 -04:00
f576564f02 fix(agent): also wipe /etc/decnet during self-destruct 2026-04-19 21:04:31 -04:00
00d5799a79 fix(agent): escape systemd cgroup when spawning self-destruct reaper
The reaper was being SIGTERM'd mid-rm because `start_new_session=True`
only forks a new POSIX session — it does not escape decnet-agent.service's
cgroup. When the reaper ran `systemctl stop decnet-agent`, systemd
tore down the whole cgroup (reaper included) before `rm -rf /opt/decnet*`
finished, leaving the install on disk.

Spawn the reaper via `systemd-run --collect --unit decnet-reaper-<pid>`
so it runs in a fresh transient scope, outside the agent unit. Falls
back to bare Popen for non-systemd hosts.
2026-04-19 21:00:43 -04:00
14250cacad feat(swarm): self-destruct agent on decommission
Decommissioning a worker from the dashboard (or swarm controller) now
asks the agent to wipe its own install before the master forgets it.
The agent stops decky containers + every decnet-* systemd unit, then
deletes /opt/decnet*, /etc/systemd/system/decnet-*, /var/lib/decnet/*,
and /usr/local/bin/decnet*. Logs under /var/log are preserved.

The reaper runs as a detached /tmp script (start_new_session=True) so
it survives the agent process being killed. Self-destruct dispatch is
best-effort — a dead worker doesn't block master-side cleanup.
2026-04-19 20:47:09 -04:00
9d68bb45c7 feat(web): async teardowns — 202 + background task, UI allows parallel queue
Teardowns were synchronous all the way through: POST blocked on the
worker's docker-compose-down cycle (seconds to minutes), the frontend
locked tearingDown to a single string so only one button could be armed
at a time, and operators couldn't queue a second teardown until the
first returned. On a flaky worker that meant staring at a spinner for
the whole RTT.

Backend: POST /swarm/hosts/{uuid}/teardown returns 202 the instant the
request is validated. Affected shards flip to state='tearing_down'
synchronously before the response so the UI reflects progress
immediately, then the actual AgentClient call + DB cleanup run in an
asyncio.create_task (tracked in a module-level set to survive GC and
to be drainable by tests). On failure the shard flips to
'teardown_failed' with the error recorded — nothing is re-raised,
since there's no caller to catch it.

Frontend: swap tearingDown / decommissioning from 'string | null' to
'Set<string>'. Each button tracks its own in-flight state; the poll
loop picks up the final shard state from the backend. Multiple
teardowns can now be queued without blocking each other.
2026-04-19 20:30:56 -04:00
07ec4bc269 fix(fleet): INI fully replaces prior decky state on redeploy
Submitting an INI with a single [decky1] was silently redeploying the
deckies from the *previous* deploy too. POST /deckies/deploy merged the
new INI into the stored DecnetConfig by name, so a 1-decky INI on top of
a prior 3-decky run still pushed 3 deckies to the worker. Those stale
decky2/decky3 kept their old IPs, collided on the parent NIC, and the
agent failed with 'Address already in use' — the deploy the operator
never asked for.

The INI is the source of truth for which deckies exist this deploy.
Full replace: config.deckies = list(new_decky_configs). Operators who
want to add more deckies should list them all in the INI.

Update the deploy-limit test to reflect the new replace semantics, and
add a regression test asserting prior state is dropped.
2026-04-19 20:24:29 -04:00
a63301c7a3 fix(web): replace window.confirm with two-click arm/commit on swarm actions
Teardown and Decommission buttons were silently dead in the browser.
Root cause: every handler started with 'if (!window.confirm(...)) return;'
and browsers permanently disable confirm() for a tab once the user ticks
'Prevent this page from creating additional dialogs'. That returns false
with no UI, the handler early-exits, and no request is ever fired — no
network traffic, no console error, no backend activity.

Swap to an inline two-click pattern: first click arms the button (label
flips to 'Click again to confirm', resets after 4s); second click within
the window commits. Same safety against misclicks, zero dependency on
browser-native dialog primitives.
2026-04-19 20:16:51 -04:00
df18cb44cc fix(swarm): don't paint healthy deckies as failed when a shard-sibling fails
docker compose up is partial-success-friendly — a build failure on one
service doesn't roll back the others. But the master was catching the
agent's 500 and tagging every decky in the shard as 'failed' with the
same error message. From the UI that looked like all three deckies died
even though two were live on the worker.

On dispatch exception, probe the agent's /status to learn which deckies
actually have running containers, and upsert per-decky state accordingly.
Only fall back to marking the whole shard failed if the status probe
itself is unreachable.

Enhance agent.executor.status() to include a 'runtime' map keyed by
decky name with per-service container state, so the master has something
concrete to consult.
2026-04-19 20:11:08 -04:00
91549e6936 fix(deploy): prevent 'Address already in use' from stale IPAM and half-torn-down containers
Two compounding root causes produced the recurring 'Address already in use'
error on redeploy:

1. _ensure_network only compared driver+name; if a prior deploy's IPAM
   pool drifted (different subnet/gateway/range), Docker kept handing out
   addresses from the old pool and raced the real LAN. Now also compares
   Subnet/Gateway/IPRange and rebuilds on drift.

2. A prior half-failed 'up' could leave containers still holding the IPs
   and ports the new run wants. Run 'compose down --remove-orphans' as a
   best-effort pre-up cleanup so IPAM starts from a clean state.

Also surface docker compose stderr to the structured log on failure so
the agent's journal captures Docker's actual message (which IP, which
port) instead of just the exit code.
2026-04-19 19:59:06 -04:00
e8e11b2896 feat(web-ui): show decky IP on SwarmDeckies, drop compose-hash column
Operators want to know what address to poke when triaging a swarm decky;
the compose-hash column was debug scaffolding that never paid off.

DeckyShard has no IP column (the deploy-time IP lives on DecnetConfig),
so the list endpoint resolves it at read time by joining shards against
the stored deployment state by decky_name. Missing lookups render as "—"
rather than erroring — the list stays useful even after a master restart
that hasn't persisted a config yet.
2026-04-19 19:48:27 -04:00
585541016f fix(engine): teardown(decky_id=...) built malformed service names
The nested list-comp `[f"{id}-{svc}" for svc in [d.services for d ...]]`
iterated over a list of lists, so `svc` was the whole services list and
f-string stringified it -> `decky3-['sip']`. docker compose saw "no such
service" and the per-decky teardown failed 500.

Flatten: find the matching decky once, then iterate its services. Noop
early on unknown decky_id and on empty service lists. Regression test
asserts the emitted compose args have no '[' or quote characters.
2026-04-19 19:42:42 -04:00
5dad1bb315 feat(swarm): remote teardown API + UI (per-decky and per-host)
Agents already exposed POST /teardown; the master was missing the plumbing
to reach it. Add:

- POST /api/v1/swarm/hosts/{uuid}/teardown — admin-gated. Body
  {decky_id: str|null}: null tears the whole host, a value tears one decky.
  On worker failure the master returns 502 and leaves DB shards intact so
  master and agent stay aligned.
- BaseRepository.delete_decky_shard(name) + sqlmodel impl for per-decky
  cleanup after a single-decky teardown.
- SwarmHosts page: "Teardown all" button (keeps host enrolled).
- SwarmDeckies page: per-row "Teardown" button.

Also exclude setuptools' build/ staging dir from the enrollment tarball —
`pip install -e` on the master generates build/lib/decnet_web/node_modules
and the bundle walker was leaking it to agents. Align pyproject's bandit
exclude with the git-hook invocation so both skip decnet/templates/.
2026-04-19 19:39:28 -04:00
6708f26e6b fix(packaging): move templates/ into decnet/ package so they ship with pip install
The docker build contexts and syslog_bridge.py lived at repo root, which
meant setuptools (include = ["decnet*"]) never shipped them. Agents
installed via `pip install $RELEASE_DIR` got site-packages/decnet/** but no
templates/, so every deploy blew up in deployer._sync_logging_helper with
FileNotFoundError on templates/syslog_bridge.py.

Move templates/ -> decnet/templates/ and declare it as setuptools
package-data. Path resolutions in services/*.py and engine/deployer.py drop
one .parent since templates now lives beside the code. Test fixtures,
bandit exclude path, and coverage omit glob updated to match.
2026-04-19 19:30:04 -04:00
2bef3edb72 feat(swarm): unbundle master-only code from agent tarball + sync systemd units on update
Agents now ship with collector/prober/sniffer as systemd services; mutator,
profiler, web, and API stay master-only (profiler rebuilds attacker profiles
against the master DB — no per-host DB exists). Expand _EXCLUDES to drop the
full decnet/web, decnet/mutator, decnet/profiler, and decnet_web trees from
the enrollment bundle.

Updater now calls _heal_path_symlink + _sync_systemd_units after rotation so
fleets pick up new unit files and /usr/local/bin/decnet tracks the shared venv
without a manual reinstall. daemon-reload runs once per update when any unit
changed.

Fix _service_registry matchers to accept systemd-style /usr/local/bin/decnet
cmdlines (psutil returns a list — join to string before substring-checking)
so agent-mode `decnet status` reports collector/prober/sniffer correctly.
2026-04-19 19:19:17 -04:00
d2cf1e8b3a feat(updater): sync systemd unit files and daemon-reload on update
The bootstrap installer copies etc/systemd/system/*.service into
/etc/systemd/system at enrollment time, but the updater was skipping
that step — a code push could not ship a new unit (e.g. the four
per-host microservices added this session) or change ExecStart on an
existing one. systemctl alone doesn't re-read unit files; daemon-reload
is required.

run_update / run_update_self now call _sync_systemd_units after
rotation: diff each .service file against the live copy, atomically
replace changed ones, then issue a single `systemctl daemon-reload`.
No-op on legacy tarballs that don't ship etc/systemd/system/.
2026-04-19 19:07:24 -04:00
6d7877c679 feat(swarm): per-host microservices as systemd units, mutator off agents
Previously `decnet status` on an agent showed every microservice as DOWN
because deploy's auto-spawn was unihost-scoped and the agent CLI gate
hid the per-host commands. Now:

  - collect, probe, profiler, sniffer drop out of MASTER_ONLY_COMMANDS
    (they run per-host; master-side work stays master-gated).
  - mutate stays master-only (it orchestrates swarm-wide respawns).
  - decnet/mutator/ excluded from agent tarballs — never invoked there.
  - decnet/web exclusion tightened: ship db/ + auth.py + dependencies.py
    (profiler needs the repo singleton), drop api.py, swarm_api.py,
    ingester.py, router/, templates/.
  - Four new systemd unit templates (decnet-collector/prober/profiler/
    sniffer) shipped in every enrollment tarball.
  - enroll_bootstrap.sh enables + starts all four alongside agent and
    forwarder at install time.
  - updater restarts the aux units on code push so they pick up the new
    release (best-effort — legacy enrollments without the units won't
    fail the update).
  - status table hides Mutator + API rows in agent mode.
2026-04-19 18:58:48 -04:00
ee9ade4cd5 feat(enroll): strip master API and frontend from agent tarball
Agents never run the FastAPI master app (decnet/web/) or serve the React
frontend (decnet_web/) — they run decnet.agent, decnet.updater, and
decnet.forwarder, none of which import decnet.web. Shipping the master
tree bloats every enrollment payload and needlessly widens the worker's
attack surface.

Excluded paths are unreachable on the worker (all cli.py imports of
decnet.web are inside master-only command bodies that the agent-mode
gate strips). Tests assert neither tree leaks into the tarball.
2026-04-19 18:47:03 -04:00
dad29249de fix(updater): align bootstrap layout with updater; log update phases
The bootstrap was installing into /opt/decnet/.venv with an editable
`pip install -e .`, and /usr/local/bin/decnet pointed there. The updater
writes releases to /opt/decnet/releases/active/ with a shared venv at
/opt/decnet/venv — a parallel tree nothing on the box actually runs.
Result: updates appeared to succeed (release dir rotated, SHA changed)
but systemd kept executing the untouched bootstrap code.

Changes:
  - Bootstrap now installs directly into /opt/decnet/releases/active
    with the shared venv at /opt/decnet/venv and /opt/decnet/current
    symlinked. Same layout the updater rotates in and out of.
  - /usr/local/bin/decnet -> /opt/decnet/venv/bin/decnet.
  - run_update / run_update_self heal /usr/local/bin/decnet on every
    push so already-enrolled hosts recover on the next update instead
    of needing a re-enroll.
  - run_update / run_update_self now log each phase (receive, extract,
    pip install, rotate, restart, probe) so the updater log actually
    shows what happened.
2026-04-19 18:39:11 -04:00
f91ba9a16e feat(cli): allow decnet status in agent mode
Agents run deckies locally and need to inspect their own state. Removed
`status` from MASTER_ONLY_COMMANDS so it survives the agent-mode gate.
Useful for validating remote updater pushes from the master.
2026-04-19 18:29:41 -04:00
43b92c7bd6 fix(updater): restart agent+forwarder+self via systemd on push
Three holes in the systemd integration:
1. _spawn_agent_via_systemd only restarted decnet-agent.service, leaving
   decnet-forwarder.service running the pre-update code (same /opt/decnet
   tree, stale import cache).
2. run_update_self used os.execv regardless of environment — the re-execed
   process kept the updater's existing cgroup/capability inheritance but
   systemd would notice MainPID change and mark the unit degraded.
3. No path to surface a failed forwarder restart (legacy enrollments have
   no forwarder unit).

Now: agent restart first, forwarder restart as best-effort (logged but
non-fatal so legacy workers still update), MainPID still read from the
agent unit. For update-self under systemd, spawn a detached sleep+
systemctl restart so the HTTP response flushes before the unit cycles.
2026-04-19 18:23:10 -04:00
a0a241f65d feat(enroll): decnet-updater now runs under systemd, not a --daemon fork
Bootstrap used to end with `decnet updater --daemon` which forks and
detaches — invisible to systemctl, no auto-restart, dies on reboot.
Ships a decnet-updater.service template matching the pattern of the
other units (Restart=on-failure, log to /var/log/decnet/decnet.updater.log,
certs from /etc/decnet/updater, install tree at /opt/decnet), bundles
it alongside agent/forwarder/engine units, and the installer now
`systemctl enable --now`s it when --with-updater is set.
2026-04-19 18:19:24 -04:00
42b5e4cd06 fix(network): replace decnet_lan when driver differs (macvlan<->ipvlan)
The create helpers short-circuited on name alone, so a prior macvlan
deploy left Docker's decnet_lan network in place. A subsequent ipvlan
deploy would no-op the network create, then container attach would try
to add a macvlan port on enp0s3 that already had an ipvlan slave —
EBUSY, agent 500, docker ps empty.

Now: when the existing network's driver disagrees with the requested
one, disconnect any live containers and DROP it before recreating.
Parent-NIC can host one driver at a time.

Also: setup_host_{macvlan,ipvlan} opportunistically delete the opposite
host-side helper so we don't leave cruft across driver swaps.
2026-04-19 18:12:28 -04:00
6245786289 fix(cli): db-reset now drops swarm_hosts + decky_shards
_DB_RESET_TABLES was missing the swarm tables, so drop-tables mode left
them intact. create_all doesn't alter columns on existing tables, so any
schema change to SwarmHost (like use_ipvlan) never took effect after a
reset. Child FK first (decky_shards -> swarm_hosts).
2026-04-19 18:04:35 -04:00
5df995fda1 feat(enroll): opt-in IPvlan per-agent for Wi-Fi-bridged VMs
Wi-Fi APs bind one MAC per associated station, so VirtualBox/VMware
guests bridged over Wi-Fi rotate the VM's DHCP lease when Docker's
macvlan starts emitting container-MAC frames through the vNIC. Adds a
`use_ipvlan` toggle on the Agent Enrollment tab (mirrors the updater
daemon checkbox): flips the flag on SwarmHost, bakes `ipvlan=true` into
the agent's decnet.ini, and `_worker_config` forces ipvlan=True on the
per-host shard at dispatch. Safe no-op on wired/bare-metal agents.
2026-04-19 17:57:45 -04:00
6d7567b6bb fix(fleet): reset stale host_uuid on carried-over deckies before dispatch
Deckies merged in from a prior deployment's saved state kept their
original host_uuid — which dispatch_decnet_config then 404'd on if that
host had since been decommissioned or re-enrolled at a different uuid.
Before round-robin assignment, drop any host_uuid that isn't in the live
swarm_hosts set so orphaned entries get reassigned instead of exploding
with 'unknown host_uuid'.
2026-04-19 06:27:34 -04:00
dbaccde143 fix(swarm-updates): offload tarball build to worker thread
tar_working_tree (walks repo + gzips several MB) and detect_git_sha
(shells out) were called directly on the event loop, so /swarm-updates/push
and /swarm-updates/push-self froze every other request until the tarball
was ready. Wrap both in asyncio.to_thread.
2026-04-19 06:21:27 -04:00
b883f24ba2 fix(engine): pin docker compose project name to avoid empty-basename failure
systemd daemons run with WorkingDirectory=/ by default; docker compose
derives the project name from basename(cwd), which is empty at '/', and
aborts with 'project name must not be empty'. Pass -p decnet explicitly
so the project name is independent of cwd, and set WorkingDirectory=/opt/decnet
on the three DECNET units so compose artifacts (decnet-compose.yml,
build contexts) also land in the install dir.
2026-04-19 06:17:30 -04:00
79db999030 feat(fleet): auto-swarm deploy — shard across enrolled workers when master
POST /deckies/deploy now branches on DECNET_MODE + enrolled host presence:
when the caller is a master with at least one reachable swarm host, round-
robin host_uuids are assigned over new deckies and the config is dispatched
via AgentClient. Falls back to local docker-compose otherwise.

Extracts the dispatch loop from api_deploy_swarm into dispatch_decnet_config
so both endpoints share the same shard/dispatch/persist path. Adds
GET /system/deployment-mode for the UI to show 'will shard across N hosts'
vs 'will deploy locally' before the operator clicks deploy.
2026-04-19 06:09:08 -04:00
cb1a1d1270 fix(fleet): defer DecnetConfig build until deckies are expanded
Stateless /api/v1/deckies/deploy previously instantiated DecnetConfig with
deckies=[] so it could merge entries later — but DecnetConfig.deckies is
min_length=1, so Pydantic raised and the global handler mapped it to 422
'Internal data consistency error'. Construct the config after
build_deckies_from_ini returns at least one DeckyConfig.
2026-04-19 06:02:26 -04:00
899ea559d9 feat(enroll): systemd units for agent/forwarder/engine + log-directory INI key
Rename log-file-path -> log-directory (maps to DECNET_LOG_DIRECTORY). Bundle
now ships three systemd units rendered with agent_name/master_host and installs
them into /etc/systemd/system/. Bootstrap replaces direct 'decnet X --daemon'
calls with systemctl enable --now. Each unit pins DECNET_SYSTEM_LOGS so agent,
forwarder, and deckies logs land at decnet.{agent,forwarder}.log and decnet.log
under /var/log/decnet.
2026-04-19 05:46:08 -04:00
e67b6d7f73 refactor(swarm-mgmt): move agent/updater certs to /etc/decnet (root-owned) 2026-04-19 05:32:39 -04:00
bc5f43c3f7 feat(swarm-mgmt): probe-on-read for GET /swarm/hosts heartbeat + status 2026-04-19 05:26:35 -04:00
ff4c993617 refactor(swarm-mgmt): backfill host address from agent's .tgz source IP 2026-04-19 05:20:29 -04:00
e32fdf9cbf feat(swarm-mgmt): agent_host + updater opt-in; prevent duplicate forwarder spawn 2026-04-19 05:12:55 -04:00
95ae175e1b fix(swarm-mgmt): exclude .env from bundle, chmod +x decnet, mkdir log 2026-04-19 04:58:55 -04:00
b4df9ea0a1 fix(swarm-mgmt): bundle URLs target master_host, not dashboard base_url 2026-04-19 04:52:20 -04:00
02f07c7962 feat(web-ui): SWARM nav group + Hosts/Deckies/AgentEnrollment pages 2026-04-19 04:29:07 -04:00
c6f7de30d2 feat(swarm-mgmt): agent enrollment bundle flow + admin swarm endpoints 2026-04-19 04:25:57 -04:00
37b22b76a5 feat(cli): auto-spawn listener as detached sibling from decnet swarmctl
Mirrors the agent→forwarder pattern: `decnet swarmctl` now fires the
syslog-TLS listener as a detached Popen sibling so a single master
invocation brings the full receive pipeline online. --no-listener opts
out for operators who want to run the listener on a different host (or
under their own systemd unit).

Listener bind host / port come from DECNET_LISTENER_HOST and
DECNET_SWARM_SYSLOG_PORT — both seedable from /etc/decnet/decnet.ini.
PID at $(pid_dir)/listener.pid so operators can kill/restart manually.

decnet.ini.example ships alongside env.config.example as the
documented surface for the new role-scoped config. Mode, forwarder
targets, listener bind, and master ports all live there — no more
memorizing flag trees.

Extends tests/test_auto_spawn.py with two swarmctl cases: listener is
spawned with the expected argv + PID file, and --no-listener suppresses.
2026-04-19 03:25:40 -04:00
43f140a87a feat(cli): auto-spawn forwarder as detached sibling from decnet agent
New _spawn_detached(argv, pid_file) helper uses Popen with
start_new_session=True + close_fds=True + DEVNULL stdio to launch a
DECNET subcommand as a fully independent process. The parent does NOT
wait(); if it dies the child survives under init. This is deliberately
not a supervisor — if the child dies the operator restarts it manually.

_pid_dir() picks /opt/decnet when writable else ~/.decnet, so both
root-run production and non-root dev work without ceremony.

`decnet agent` now auto-spawns `decnet forwarder --daemon ...` as
that detached sibling, pulling master host + syslog port from
DECNET_SWARM_MASTER_HOST / DECNET_SWARM_SYSLOG_PORT. --no-forwarder
opts out. If DECNET_SWARM_MASTER_HOST is unset the auto-spawn is
silently skipped (single-host dev or operator wants to start the
forwarder separately).

tests/test_auto_spawn.py monkeypatches subprocess.Popen and verifies:
the detach kwargs are passed, the PID file exists and contains a
valid positive integer (PID-file corruption is a real operational
headache — catching bad writes at the test level is free), the
--no-forwarder flag suppresses the spawn, and the unset-master-host
path silently skips.
2026-04-19 03:23:42 -04:00
3223bec615 feat(cli): gate master-only commands when DECNET_MODE=agent
- MASTER_ONLY_COMMANDS / MASTER_ONLY_GROUPS frozensets enumerate every
  command a worker host must not see. Comment block at the declaration
  puts the maintenance obligation in front of anyone touching command
  registration.
- _gate_commands_by_mode() filters both app.registered_commands (for
  @app.command() registrations) and app.registered_groups (for
  add_typer sub-apps) so the 'swarm' group disappears along with
  'api', 'swarmctl', 'deploy', etc. on agent hosts.
- _require_master_mode() is the belt-and-braces in-function guard,
  added to the four highest-risk commands (api, swarmctl, deploy,
  teardown). Protects against direct function imports that would
  bypass Typer.
- DECNET_DISALLOW_MASTER=false is the escape hatch for hybrid dev
  hosts that legitimately play both roles.

tests/test_mode_gating.py exercises help-text listings via subprocess
and the defence-in-depth guard via direct import.
2026-04-19 03:20:48 -04:00
2b1b962849 feat(env): run decnet.ini loader at package import; expose DECNET_MODE
- decnet/__init__.py now calls load_ini_config() on first import of any
  decnet.* module, seeding os.environ via setdefault() so env.py's
  module-level reads pick up INI values before the shell had to export
  them. Real env vars still win.
- env.py exposes DECNET_MODE (default 'master') and
  DECNET_DISALLOW_MASTER (default true), consumed by the upcoming
  master-command gating in cli.py.

Back-compat: missing /etc/decnet/decnet.ini is a no-op. Existing
.env.local + flag-based launches behave identically.
2026-04-19 03:17:25 -04:00
65fc9ac2b9 fix(tests): clean up two pre-existing failures before config work
- decnet/agent/app.py /health: drop leftover 'push-test-2' canary
  planted during live VM push verification and never cleaned up;
  test_health_endpoint asserts the exact dict shape.
- tests/test_factory.py: switch the lazy-engine check from
  mysql+aiomysql (not in pyproject) to mysql+asyncmy (the driver the
  project actually ships). The test does not hit the wire so the
  dialect swap is safe.

Both were red on `pytest tests/` before any config/auto-spawn work
began; fixing them here so the upcoming commits land on a green
full-suite baseline.
2026-04-19 03:17:17 -04:00
1e8b73c361 feat(config): add /etc/decnet/decnet.ini loader
New decnet/config_ini.py parses a role-scoped INI file via stdlib
configparser and seeds os.environ via setdefault — real env vars still
win, keeping full back-compat with .env.local flows.

[decnet] holds role-agnostic keys (mode, disallow-master, log-file-path);
the role section matching `mode` is loaded, the other is ignored
silently so a worker never reads master-only keys (and vice versa).

Loader is standalone in this commit — not wired into cli.py yet.
2026-04-19 03:10:51 -04:00
9b1299458d fix(env): resolve DECNET_JWT_SECRET lazily so agent/updater subcommands don't need it
The module-level _require_env('DECNET_JWT_SECRET') call blocked
`decnet agent` and `decnet updater` from starting on workers that
legitimately have no business knowing the master's JWT signing key.

Move the resolution into a module `__getattr__`: only consumers that
actually read `decnet.env.DECNET_JWT_SECRET` trigger the validation,
which in practice means only decnet.web.auth (master-side).

Adds tests/test_env_lazy_jwt.py covering both the in-process lazy path
and an out-of-process `decnet agent --help` subprocess check with a
fully sanitized environment.
2026-04-19 02:43:25 -04:00
7894b9e073 feat(web-ui): Remote Updates dashboard page — push code to workers from the UI
React component for /swarm-updates: per-host table polled every 10s,
row actions for Push Update / Update Updater / Rollback, a fleet-wide
'Push to All' modal with the include_self toggle, and toast feedback
per result.

Admin-only (both server-gated and UI-gated). Unreachable hosts surface
as an explicit state; actions are disabled on them. Rollback is
disabled when the worker has no previous release slot (previous_sha
null from /hosts).
2026-04-19 01:03:04 -04:00
a266d6b17e feat(web): Remote Updates API — dashboard endpoints for pushing code to workers
Adds /api/v1/swarm-updates/{hosts,push,push-self,rollback} behind
require_admin. Reuses the existing UpdaterClient + tar_working_tree + the
per-host asyncio.gather pattern from api_deploy_swarm.py; tarball is
built exactly once per /push request and fanned out to every selected
worker. /hosts filters out decommissioned hosts and agent-only
enrollments (no updater bundle = not a target).

Connection drops during /update-self are treated as success — the
updater re-execs itself mid-response, so httpx always raises.

Pydantic models live in decnet/web/db/models.py (single source of
truth). 24 tests cover happy paths, rollback, transport failures,
include_self ordering (skip on rolled-back agents), validation, and
RBAC gating.
2026-04-19 01:01:09 -04:00
f5a5fec607 feat(deploy): systemd units w/ capability-based hardening; updater restarts agent via systemctl
Add deploy/ unit files for every DECNET daemon (agent, updater, api, web,
swarmctl, listener, forwarder). All run as User=decnet with NoNewPrivileges,
ProtectSystem, PrivateTmp, LockPersonality; AmbientCapabilities=CAP_NET_ADMIN
CAP_NET_RAW only on the agent (MACVLAN/scapy). Existing api/web units migrated
to /opt/decnet layout and the same hardening stanza.

Make the updater's _spawn_agent systemd-aware: under systemd (detected via
INVOCATION_ID + systemctl on PATH), `systemctl restart decnet-agent.service`
replaces the Popen path so the new agent inherits the unit's ambient caps
instead of the updater's empty set. _stop_agent becomes a no-op in that mode
to avoid racing systemctl's own stop phase.

Tests cover the dispatcher branch selection, MainPID parsing, and the
systemd no-op stop.
2026-04-19 00:44:06 -04:00
40d3e86e55 fix(updater): bootstrap fresh venv with deps; rebuild self-update argv from env
- _run_pip: on first venv use, install decnet with its full dep tree so the
  bootstrapped environment actually has typer/fastapi/uvicorn. Subsequent
  updates keep --no-deps for a near-no-op refresh.
- run_update_self: do not reuse sys.argv to re-exec the updater. Inside the
  live process, sys.argv is the uvicorn subprocess invocation (--ssl-keyfile
  etc.), which 'decnet updater' CLI rejects. Reconstruct the operator-visible
  command from env vars set by updater.server.run.
2026-04-18 23:51:41 -04:00
ebeaf08a49 fix(updater): fall back to /proc scan when agent.pid is missing
If the agent was started outside the updater (manually, during dev,
or from a prior systemd unit), there is no agent.pid for _stop_agent
to target, so a successful code install leaves the old in-memory
agent process still serving requests. Scan /proc for any decnet agent
command and SIGTERM all matches so restart is reliable regardless of
how the agent was originally launched.
2026-04-18 23:42:26 -04:00
7765b36c50 feat(updater): remote self-update daemon with auto-rollback
Adds a separate `decnet updater` daemon on each worker that owns the
agent's release directory and installs tarball pushes from the master
over mTLS. A normal `/update` never touches the updater itself, so the
updater is always a known-good rescuer if a bad agent push breaks
/health — the rotation is reversed and the agent restarted against the
previous release. `POST /update-self` handles updater upgrades
explicitly (no auto-rollback).

- decnet/updater/: executor, FastAPI app, uvicorn launcher
- decnet/swarm/updater_client.py, tar_tree.py: master-side push
- cli: `decnet updater`, `decnet swarm update [--host|--all]
  [--include-self] [--dry-run]`, `--updater` on `swarm enroll`
- enrollment API issues a second cert (CN=updater@<host>) signed by the
  same CA; SwarmHost records updater_cert_fingerprint
- tests: executor, app, CLI, tar tree, enroll-with-updater (37 new)
- wiki: Remote-Updates page + sidebar + SWARM-Mode cross-link
2026-04-18 21:40:21 -04:00
8914c27220 feat(swarm): add decnet swarm deckies to list deployed shards by host
`swarm list` only shows enrolled workers — there was no way to see which
deckies are running and where. Adds GET /swarm/deckies on the controller
(joins DeckyShard with SwarmHost for name/address/status) plus the CLI
wrapper with --host / --state filters and --json.
2026-04-18 21:10:07 -04:00
4db9c7464c fix(swarm): relocalize master-built config on worker before deploy
deploy --mode swarm was failing on every heterogeneous fleet: the master
populates config.interface from its own box (detect_interface() → its
default NIC), then ships that verbatim. The worker's deployer then calls
get_host_ip(config.interface), hits 'ip addr show wlp6s0' on a VM whose
NIC is enp0s3, and 500s.

Fix: agent.executor._relocalize() runs on every swarm-mode deploy.
Re-detects the worker's interface/subnet/gateway/host_ip locally and
swaps them into the config before calling deployer.deploy(). When the
worker's subnet doesn't match the master's, decky IPs are re-allocated
from the worker's subnet via allocate_ips() so they're reachable.

Unihost-mode configs are left untouched — they're already built against
the local box and second-guessing them would be wrong.

Validated against anti@192.168.1.13: master dispatched interface=wlp6s0,
agent logged 'relocalized interface=enp0s3', deployer ran successfully,
dry-run returned ok=deployed.

4 new tests cover both branches (matching-subnet preserves decky IPs;
mismatch re-allocates), the end-to-end executor.deploy() path, and the
unihost short-circuit.
2026-04-18 20:41:21 -04:00
411a797120 feat(cli): add decnet swarm check wrapper for POST /swarm/check
The swarmctl API already exposes POST /swarm/check — an active mTLS
probe that refreshes SwarmHost.status + last_heartbeat for every
enrolled worker. The CLI was missing a wrapper, so operators had to
curl the endpoint directly (which is how the VM validation run did it,
and how the wiki Deployment-Modes / SWARM-Mode pages ended up doc'ing
a command that didn't exist yet).

Matches the existing list/enroll/decommission pattern: typer subcommand
under swarm_app, --url override, Rich table output plus --json for
scripting. Three tests: populated table, empty-swarm path, and --json
emission.
2026-04-18 20:28:34 -04:00
3da5a2c4ee feat(cli): add decnet listener + --agent-dir on agent
New `decnet listener` command runs the master-side RFC 5425 syslog-TLS
receiver as a standalone process (mirrors `decnet api` / `decnet swarmctl`
pattern, SIGTERM/SIGINT handlers, --daemon support).

`decnet agent` now accepts --agent-dir so operators running the worker
agent under sudo/root can point at a bundle outside /root/.decnet/agent
(the HOME under sudo propagation).

Both flags were needed to stand up the full SWARM pipeline end-to-end on
a throwaway VM: mTLS control plane reachable, syslog-over-TLS wire
confirmed via tcpdump, master-crash/resume proved with zero loss and
zero duplication across 10 forwarded lines.

pyproject: bump asyncmy floor to 0.2.11 (resolver already pulled this in).
2026-04-18 20:15:25 -04:00
bfc7af000a test(swarm): add forwarder/listener resilience scenarios
Covers failure modes the happy-path tests miss:

- log rotation (copytruncate): st_size shrinks under the forwarder, it
  resets offset=0 and reships the new contents instead of getting wedged
  past EOF;
- listener restart: forwarder retries, resumes from the persisted offset,
  and the previously-acked lines are NOT duplicated on the master;
- listener tolerates a well-authenticated client that sends a partial
  octet-count frame and drops — the server must stay up and accept
  follow-on connections;
- peer_cn / fingerprint_from_ssl degrade to 'unknown' / None when no
  peer cert is available (defensive path that otherwise rarely fires).
2026-04-18 19:56:51 -04:00
1e8ca4cc05 feat(swarm-cli): add decnet swarm {enroll,list,decommission} + deploy --mode swarm
New sub-app talks HTTP to the local swarm controller (127.0.0.1:8770 by
default; override with --url or $DECNET_SWARMCTL_URL).

- enroll: POSTs /swarm/enroll, prints fingerprint, optionally writes
  ca.crt/worker.crt/worker.key to --out-dir for scp to the worker.
- list: renders enrolled workers as a rich table (with --status filter).
- decommission: looks up uuid by --name, confirms, DELETEs.

deploy --mode swarm now:
  1. fetches enrolled+active workers from the controller,
  2. round-robin-assigns host_uuid to each decky,
  3. POSTs the sharded DecnetConfig to /swarm/deploy,
  4. renders per-worker pass/fail in a results table.

Exits non-zero if no workers exist or any worker's dispatch failed.
2026-04-18 19:52:37 -04:00
a6430cac4c feat(swarm): add decnet forwarder CLI to run syslog-over-TLS forwarder
The forwarder module existed but had no runner — closes that gap so the
worker-side process can actually be launched and runs isolated from the
agent (asyncio.run + SIGTERM/SIGINT → stop_event).

Guards: refuses to start without a worker cert bundle or a resolvable
master host ($DECNET_SWARM_MASTER_HOST or --master-host).
2026-04-18 19:41:37 -04:00
39d2077a3a feat(swarm): syslog-over-TLS log pipeline (RFC 5425, TCP 6514)
Worker-side log_forwarder tails the local RFC 5424 log file and ships
each line as an octet-counted frame to the master over mTLS. Offset is
persisted in a tiny local SQLite so master outages never cause loss or
duplication — reconnect resumes from the exact byte where the previous
session left off. Impostor workers (cert not signed by DECNET CA) are
rejected at TLS handshake.

Master-side log_listener terminates mTLS on 0.0.0.0:6514, validates the
client cert, extracts the peer CN as authoritative worker provenance,
and appends each frame to the master's ingest log files. Attacker-
controlled syslog HOSTNAME field is ignored — the CA-controlled CN is
the only source of provenance.

7 tests added covering framing codec, offset persistence across
reopens, end-to-end mTLS delivery, crash-resilience (offset survives
restart, no duplicate shipping), and impostor-CA rejection.

DECNET_SWARM_SYSLOG_PORT / DECNET_SWARM_MASTER_HOST env bindings
added.
2026-04-18 19:33:58 -04:00
e2d6f857b5 refactor(swarm): move router DTOs into decnet/web/db/models.py
_schemas.py was a local exception to the codebase convention. The rest
of the app keeps all API request/response DTOs in decnet/web/db/models.py
alongside UserResponse, DeployIniRequest, etc. — the swarm endpoints now
follow the same convention (SwarmEnrollRequest, SwarmHostView, etc).
Deletes decnet/web/router/swarm/_schemas.py.
2026-04-18 19:28:15 -04:00
811136e600 refactor(swarm): one file per endpoint, matching existing router layout
Splits the three grouped router files into eight api_<verb>_<resource>.py
modules under decnet/web/router/swarm/ to match the convention used by
router/fleet/ and router/config/. Shared request/response models live in
_schemas.py. Keeps each endpoint easy to locate and modify without
stepping on siblings.
2026-04-18 19:23:06 -04:00
63b0a58527 feat(swarm): master-side SWARM controller (swarmctl) + agent CLI
Adds decnet/web/swarm_api.py as an independent FastAPI app with routers
for host enrollment, deployment dispatch (sharding DecnetConfig across
enrolled workers via AgentClient), and active health probing. Runs as
its own uvicorn subprocess via 'decnet swarmctl', mirroring the isolation
pattern used by 'decnet api'. Also wires up 'decnet agent' CLI entry for
the worker side.

29 tests added under tests/swarm/test_swarm_api.py cover enrollment
(including bundle generation + duplicate rejection), host CRUD, sharding
correctness, non-swarm-mode rejection, teardown, and health probes with
a stubbed AgentClient.
2026-04-18 19:18:33 -04:00
cd0057c129 feat(swarm): DeckyConfig.host_uuid + fix agent log/status field refs
- decnet.models.DeckyConfig grows an optional 'host_uuid' (the SwarmHost
  that runs this decky). Defaults to None so legacy unihost state files
  deserialize unchanged.
- decnet.agent.executor: replace non-existent config.name references
  with config.mode / config.interface in logs and status payload.
- tests/swarm/test_state_schema.py covers legacy-dict roundtrip, field
  default, and swarm-mode assignments.
2026-04-18 19:10:25 -04:00
0c77cdab32 feat(swarm): master AgentClient — mTLS httpx wrapper around worker API
decnet.swarm.client exposes:
- MasterIdentity / ensure_master_identity(): the master's own CA-signed
  client bundle, issued once into ~/.decnet/ca/master/.
- AgentClient: async-context httpx wrapper that talks to a worker agent
  over mTLS. health/status/deploy/teardown methods mirror the agent API.

SSL context is built from a bare ssl.SSLContext(PROTOCOL_TLS_CLIENT)
instead of httpx.create_ssl_context — the latter layers on default-CA
and purpose logic that broke private-CA mTLS. Server cert is pinned by
CA + chain, not DNS (workers enroll with arbitrary SANs).

tests/swarm/test_client_agent_roundtrip.py spins uvicorn in-process
with real certs on disk and verifies:
- A CA-signed master client passes health + status calls.
- An impostor whose cert comes from a different CA cannot connect.
2026-04-18 19:08:36 -04:00
8257bcc031 feat(swarm): worker agent + fix pre-existing base_repo coverage test
Worker agent (decnet.agent):
- mTLS FastAPI service exposing /deploy, /teardown, /status, /health,
  /mutate. uvicorn enforces CERT_REQUIRED with the DECNET CA pinned.
- executor.py offloads the blocking deployer onto asyncio.to_thread so
  the event loop stays responsive.
- server.py refuses to start without an enrolled bundle in
  ~/.decnet/agent/ — unauthenticated agents are not a supported mode.
- docs/openapi disabled on the agent — narrow attack surface.

tests/test_base_repo.py: DummyRepo was missing get_attacker_artifacts
(pre-existing abstractmethod) and so could not be instantiated. Added
the stub + coverage for the new swarm CRUD surface on BaseRepository.
2026-04-18 07:15:53 -04:00
d3b90679c5 feat(swarm): PKI module — self-managed CA for master/worker mTLS
decnet.swarm.pki provides:
- generate_ca() / ensure_ca() — self-signed root, PKCS8 PEM, 4096-bit.
- issue_worker_cert() — per-worker keypair + cert signed by the CA with
  serverAuth + clientAuth EKU so the same identity backs the agent's
  HTTPS endpoint AND the syslog-over-TLS upstream.
- write_worker_bundle() / load_worker_bundle() — persist with 0600 on
  private keys.
- fingerprint() — SHA-256 DER hex for master-side pinning.

tests/swarm/test_pki.py covers:
- CA idempotency on disk.
- Signed chain validates against CA subject.
- SAN population (DNS + IP).
- Bundle roundtrip with 0600 key perms.
- End-to-end mTLS handshake between two CA-issued peers.
- Cross-CA client rejection (handshake fails).
2026-04-18 07:09:58 -04:00
6657d3e097 feat(swarm): add SwarmHost and DeckyShard tables + repo CRUD
Introduces the master-side persistence layer for swarm mode:
- SwarmHost: enrolled worker metadata, cert fingerprint, heartbeat.
- DeckyShard: per-decky host assignment, state, last error.
Repo methods are added as default-raising on BaseRepository so unihost
deployments are untouched; SQLModelRepository implements them (shared
between the sqlite and mysql subclasses per the existing pattern).
2026-04-18 07:09:29 -04:00
293da364a6 chores: fix linting 2026-04-18 06:46:10 -04:00
d5e6ca1949 chore(gitignore): ignore sqlite WAL files and per-subsystem runtime logs
decnet.collector.log / decnet.system.log and the *.db-shm / *.db-wal
sidecars produced by the sqlite WAL journal were slipping through the
existing rules. Extend the patterns so runtime state doesn't show up
in git status.
2026-04-18 05:38:09 -04:00
a97696fa23 chore: add env.config.example documenting DECNET env vars
Reference template for .env / .env.local showing every variable that
decnet/env.py consumes, with short rationale per section (system
logging, embedded workers, profiling, API server, …). Copy to .env
and fill in secrets; .env itself stays gitignored.
2026-04-18 05:37:50 -04:00
7864c72948 test(ssh): add unit coverage for emit_capture RFC 5424 output
Exercises the JSON → syslog formatter end to end: flat fields ride as
SD params, bulky nested metadata collapses into the meta_json_b64 blob,
and the event_type / hostname / service mapping lands in the right
RFC 5424 header slots.
2026-04-18 05:37:42 -04:00
47a0480994 feat(web-ui): event-body parser and dashboard/live-logs polish
Frontend now handles syslog lines from producers that don't use
structured-data (notably the SSH PROMPT_COMMAND hook, which emits
'CMD uid=0 user=root src=IP pwd=… cmd=…' as a plain logger message).
A new parseEventBody utility splits the body into head + key/value
pairs and preserves the final value verbatim so commands stay intact.

Dashboard and LiveLogs use this parser to render consistent pills
whether the structure came from SD params or from the MSG body.
2026-04-18 05:37:31 -04:00
2bf886e18e feat(sniffer): probe ipvlan host iface when macvlan is absent
The host-side sniffer interface depends on the deploy's driver choice
(--ipvlan flag). Instead of hardcoding HOST_MACVLAN_IFACE, probe both
names and pick whichever exists; warn and disable cleanly if neither
is present. Explicit DECNET_SNIFFER_IFACE still wins.
2026-04-18 05:37:20 -04:00
8bdc5b98c9 feat(collector): parse real PROCID and extract IPs from logger kv pairs
- Relaxed RFC 5424 regex to accept either NILVALUE or a numeric PROCID;
  sshd / sudo go through rsyslog with their real PID, while
  syslog_bridge emitters keep using '-'.
- Added a fallback pass that scans the MSG body for IP-shaped
  key=value tokens. This rescues attacker attribution for plain logger
  callers like the SSH PROMPT_COMMAND shim, which emits
  'CMD … src=IP …' without SD-element params.
2026-04-18 05:37:08 -04:00
aa39be909a feat(templates): ship syslog_bridge.py to every service template
Each honeypot container now carries its own copy of the shared RFC 5424
formatter. Services that previously rolled their own ad-hoc syslog
lines can now import syslog_line / write_syslog_file for a consistent
SD-element format that the collector already knows how to parse.
2026-04-18 05:36:56 -04:00
41fd496128 feat(web): attacker artifacts endpoint + UI drawer
Adds the server-side wiring and frontend UI to surface files captured
by the SSH honeypot for a given attacker.

- New repository method get_attacker_artifacts (abstract + SQLModel
  impl) that joins the attacker's IP to `file_captured` log rows.
- New route GET /attackers/{uuid}/artifacts.
- New router /artifacts/{decky}/{service}/{stored_as} that streams a
  quarantined file back to an authenticated viewer.
- AttackerDetail grows an ArtifactDrawer panel with per-file metadata
  (sha256, size, orig_path) and a download action.
- ssh service fragment now sets NODE_NAME=decky_name so logs and the
  host-side artifacts bind-mount share the same decky identifier.
2026-04-18 05:36:48 -04:00
39dafaf384 feat(ssh-stealth): hide capture artifacts via XOR+gzip entrypoint blob
The /opt/emit_capture.py, /opt/syslog_bridge.py, and
/usr/libexec/udev/journal-relay files were plaintext and world-readable
to any attacker root-shelled into the SSH honeypot — revealing the full
capture logic on a single cat.

Pack all three into /entrypoint.sh as XOR+gzip+base64 blobs at build
time (_build_stealth.py), then decode in-memory at container start and
exec the capture loop from a bash -c string. No .py files under /opt,
no journal-relay file under /usr/libexec/udev, no argv_zap name
anywhere. The LD_PRELOAD shim is installed as
/usr/lib/x86_64-linux-gnu/libudev-shared.so.1 — sits next to the real
libudev.so.1 and blends into the multiarch layout.

A 1-byte random XOR key is chosen at image build so a bare
'base64 -d | gunzip' probe on the visible entrypoint returns binary
noise instead of readable Python.

Docker-dependent tests live under tests/docker/ behind a new 'docker'
pytest marker (excluded from the default run, same pattern as fuzz /
live / bench).
2026-04-18 05:34:50 -04:00
b0e00a6cc4 fix(ssh-capture): drop relay FIFO, rsyslog→/proc/1/fd/1 direct
The named pipe at /run/systemd/journal/syslog-relay had two problems
beyond its argv leak: any root-in-container process could (a) `cat`
the pipe and watch the live SIEM feed, and (b) write to it and inject
forged log lines. Since an attacker with a shell is already root
inside the honeypot, file permissions can't fix it.

Point rsyslog's auth/user actions directly at /proc/1/fd/1 — the
container-stdout fd Docker attached to PID 1 — and delete the
mkfifo + cat relay from the entrypoint. No pipe on disk, nothing to
read, nothing to inject, and one fewer cloaked process in `ps`.
2026-04-18 02:12:32 -04:00
2843aafa1a fix(ssh-capture): hide watcher bash argv and sanitize script header
Two leaks remained after the inotifywait argv fix:

1. The bash running journal-relay showed its argv[1] (the script path)
   in /proc/PID/cmdline, producing a line like
     'journal-relay /usr/libexec/udev/journal-relay'
   Apply argv_zap.so to that bash too.

2. argv_zap previously hardcoded PR_SET_NAME to 'kmsg-watch', which was
   wrong for any caller other than inotifywait. The comm name now comes
   from ARGV_ZAP_COMM so each caller can pick its own (kmsg-watch for
   inotifywait, journal-relay for the watcher bash).

3. The capture.sh header started with 'SSH honeypot file-catcher' —
   fatal if an attacker runs 'cat' on it. Rewritten as a plausible
   systemd-journal relay helper; stray 'attacker' / 'honeypot' words
   in mid-script comments stripped too.
2026-04-18 02:06:36 -04:00
766eeb3d83 feat(ssh): add ping/nmap/ca-certificates to base image
A lived-in Linux box ships with iputils-ping, ca-certificates, and nmap
available. Their absence is a cheap tell, and they're handy for letting
the attacker move laterally in ways we want to observe. iproute2 (ip a)
was already installed for attribution — noted here for completeness.
2026-04-18 01:53:33 -04:00
f462835373 feat(ssh-capture): LD_PRELOAD shim to zero inotifywait argv
The kmsg-watch (inotifywait) process was the last honest giveaway in
`ps aux` — its watch paths and event flags betrayed the honeypot. The
argv_zap.so shim hooks __libc_start_main, heap-copies argv for the real
main, then memsets the contiguous argv[1..] region to NUL so the kernel's
cmdline reader returns just argv[0].

gcc is installed and purged in the same Docker layer to keep the image
slim. The shim also calls prctl(PR_SET_NAME) so /proc/self/comm mirrors
the argv[0] disguise.
2026-04-18 01:52:30 -04:00
e356829234 fix(ssh-capture): drop bash token from journal-relay ps line
exec -a replaces argv[0] so ps shows 'journal-relay /usr/libexec/udev/journal-relay'
instead of '/bin/bash /usr/libexec/udev/journal-relay' — no interpreter
hint on the watcher process.
2026-04-18 01:45:38 -04:00
a5d6860124 fix(ssh-capture): collapse duplicate journal-relay bash in ps
inotify | while spawns a subshell for the tail of the pipeline, so
two bash processes (the script itself and the while-loop subshell)
showed up under /usr/libexec/udev/journal-relay in ps aux. Enable
lastpipe so the while loop runs in the main shell — ps now shows
one bash plus the inotify child, matching a simple udev helper.
2026-04-17 23:04:33 -04:00
8dd4c78b33 refactor: strip DECNET tokens from container-visible surface
Rename the container-side logging module decnet_logging → syslog_bridge
(canonical at templates/syslog_bridge.py, synced into each template by
the deployer). Drop the stale per-template copies; setuptools find was
picking them up anyway. Swap useradd/USER/chown "decnet" for "logrelay"
so no obvious token appears in the rendered container image.

Apply the same cloaking pattern to the telnet template that SSH got:
syslog pipe moves to /run/systemd/journal/syslog-relay and the relay
is cat'd via exec -a "systemd-journal-fwd". rsyslog.d conf rename
99-decnet.conf → 50-journal-forward.conf. SSH capture script:
/var/decnet/captured → /var/lib/systemd/coredump (real systemd path),
logger tag decnet-capture → systemd-journal. Compose volume updated
to match the new in-container quarantine path.

SD element ID shifts decnet@55555 → relay@55555; synced across
collector, parser, sniffer, prober, formatter, tests, and docs so the
host-side pipeline still matches what containers emit.
2026-04-17 22:57:53 -04:00
69510fb880 fix(ssh-capture): cloak syslog relay pipe and cat process
Rename the rsyslog→stdout pipe from /var/run/decnet-logs (dead giveaway)
to /run/systemd/journal/syslog-relay, and launch the relay via
exec -a "systemd-journal-fwd" so ps shows a plausible systemd forwarder
instead of a bare cat. Casual ps/ls inspection now shows nothing
with "decnet" in the name.
2026-04-17 22:51:34 -04:00
09d9f8595e fix(ssh-capture): disguise watcher as udev helper in ps output
Old ps output was a dead giveaway: two "decnet-capture" bash procs
and a raw "inotifywait". Install script at /usr/libexec/udev/journal-relay
and invoke inotifywait through a /usr/libexec/udev/kmsg-watch symlink so
both now render as plausible udev/journal helpers under casual inspection.
2026-04-17 22:44:47 -04:00
bfb3edbd4a fix(ssh-capture): add ss-only attribution fallback
fuser and /proc fd walks race scp/wget/sftp — by close_write the writer
has already closed the fd, so pid-chain attribution always resolved to
unknown for non-interactive drops. Fall back to the ss snapshot: one
established session → ss-only, multiple → ss-ambiguous (still record
src_ip from the first, analysts cross-check concurrent_sessions).
2026-04-17 22:36:06 -04:00
a773dddd5c feat(ssh): capture attacker-dropped files with session attribution
inotifywait watches writable paths in the SSH decky and mirrors any
file close_write/moved_to into a per-decky host-mounted quarantine dir.
Each artifact carries a .meta.json with attacker attribution resolved
by walking the writer PID's PPid chain to the sshd session leader,
then cross-referencing ss and utmp for source IP/user/login time.
Also emits an RFC 5424 syslog line per capture for SIEM correlation.
2026-04-17 22:20:05 -04:00
edc5c59f93 docs(profiles): archive locust run artifacts under development/profiles
Commit-by-commit evidence of the perf work: each CSV is the raw
Locust output for the commit hash in its filename, plus the four
fb69a06 variants (single worker, tracing on/off, single-core pinned,
12 workers) referenced in the README baseline table.
2026-04-17 22:05:35 -04:00
1f758a3669 chore(profile): tolerate null/empty frames in walk_self_time
Some pyinstrument frame trees contain branches where an identifier is
missing (typically at the very top or with certain async boundaries),
which crashed the aggregator with a KeyError mid-run. Short-circuit
on None frames and missing identifiers so a single ugly HTML no
longer kills the summary of the other few hundred.
2026-04-17 22:04:29 -04:00
6c22f9ba59 fix(deps): add cryptography for asyncmy MySQL auth
asyncmy needs cryptography for caching_sha2_password (the MySQL 8
default auth plugin). Without it, connection handshake fails the
moment the server negotiates the modern plugin.
2026-04-17 22:04:24 -04:00
20fa1f9a63 docs: record single-worker / multi-worker perf baseline
Capture Locust numbers from the fb69a06 branch across five
configurations so future regressions have something to measure against.

- 500u tracing-on single-worker: ~960 RPS / p99 2.9 s
- 1500u tracing-on single-worker: ~880 RPS / p99 9.5 s
- 1500u tracing-off single-worker: ~990 RPS / p99 8.4 s
- 1500u tracing-off pinned to one core: ~46 RPS / p99 122 s
- 1500u tracing-off 12 workers: ~1585 RPS / p99 4.2 s

Also note MySQL max_connections math (pool_size * max_overflow *
workers = 720) to explain why the default 151 needs bumping, and the
Python 3.14 GC segfault so nobody repeats that mistake.
2026-04-17 22:03:50 -04:00
fb69a06ab3 fix(db): detach session cleanup onto fresh task on cancellation
Previous attempt (shield + sync invalidate fallback) didn't work
because shield only protects against cancellation from *other* tasks.
When the caller task itself is cancelled mid-query, its next await
re-raises CancelledError as soon as the shielded coroutine yields —
rollback inside session.close() never completes, the aiomysql
connection is orphaned, and the pool logs 'non-checked-in connection'
when GC finally reaches it.

Hand exception-path cleanup to loop.create_task() so the new task
isn't subject to the caller's pending cancellation. close() (and the
invalidate() fallback for a dead connection) runs to completion.
Success path is unchanged — still awaits close() inline so callers
see commit visibility and pool release before proceeding.
2026-04-17 21:13:43 -04:00
1446f6da94 fix(db): invalidate pool connection when cancelled close fails
Under high-concurrency MySQL load, uvicorn cancels request tasks when
clients disconnect.  If cancellation lands mid-query, session.close()
tries to ROLLBACK on a connection that aiomysql has already marked as
closed — raising InterfaceError("Cancelled during execution") and
leaving the connection checked-out until GC, which the pool then
warns about as a 'non-checked-in connection'.

The old fallback tried sync.rollback() + sync.close(), but those still
go through the async driver and fail the same way on a dead connection.
Replace them with session.sync_session.invalidate(), which just flips
the pool's internal record — no I/O, so it can't be cancelled — and
tells the pool to drop the connection immediately instead of waiting
for garbage collection.
2026-04-17 21:04:04 -04:00
e967aaabfb perf: cache get_user_by_username on the login hot path
Locust @task(2) hammers /auth/login in steady state on top of the
on_start burst. After caching the uuid-keyed user lookup and every
other read endpoint, login alone accounted for 47% of total
_execute at 500c/u — pure DB queueing on SELECT users WHERE
username=?.

5s TTL, positive hits only (misses bypass so a freshly-created
user can log in immediately). Password verify still runs against
the cached hash, so security is unchanged — the only staleness
window is: a changed password accepts the old password for up to
5s until invalidate_user_cache fires (it's called on every write).
2026-04-17 20:36:39 -04:00
255c2e5eb7 perf: cache auth user-lookup and admin list_users
The per-request SELECT users WHERE uuid=? in require_role was the
hidden tax behind every authed endpoint — it kept _execute at ~60%
across the profile even after the page caches landed. Even /health
(with its DB and Docker probes cached) was still 52% _execute from
this one query.

- dependencies.py: 10s TTL cache on get_user_by_uuid, well below JWT
  expiry. invalidate_user_cache(uuid) is called on password change,
  role change, and user delete.
- api_get_config.py: 5s TTL cache on the admin branch's list_users()
  (previously fetched every /config call). Invalidated on user
  create/update/delete.
- api_change_pass.py + api_manage_users.py: invalidation hooks on
  all user-mutating endpoints.
2026-04-17 19:56:39 -04:00
2dd86fb3bb perf: cache /bounty, /logs/histogram, /deckies; bump /config TTL to 5s
Round-2 follow-up: profile at 500c/u showed _execute still dominating
the uncached read endpoints (/bounty 76%, /logs/histogram 73%,
/deckies 56%). Same router-level TTL pattern as /stats — 5s window,
asyncio.Lock to collapse concurrent calls into one DB hit.

- /bounty: cache default unfiltered page (limit=50, offset=0,
  bounty_type=None, search=None). Filtered requests bypass.
- /logs/histogram: cache default (interval_minutes=15, no filters).
  Filtered / non-default interval requests bypass.
- /deckies: cache full response (endpoint takes no params).
- /config: bump _STATE_TTL from 1.0 to 5.0 — admin writes are rare,
  1s was too short for bursts to coalesce at high concurrency.
2026-04-17 19:30:11 -04:00
3106d03135 perf(db): default pool_pre_ping=false for SQLite
SQLite is a local file — a SELECT 1 per session checkout is pure
overhead. Env var DECNET_DB_POOL_PRE_PING stays for anyone running
on a network-mounted volume. MySQL backend keeps its current default.
2026-04-17 19:11:07 -04:00
3cc5ba36e8 fix(cli): keep FileNotFoundError handling on decnet api
Popen moved inside the try so a missing uvicorn falls through to the
existing error message instead of crashing the CLI. test_cli was still
patching the old subprocess.run entrypoint; switched both api command
tests to patch subprocess.Popen / os.killpg to match the current path.
2026-04-17 19:09:15 -04:00
6301504c0e perf(api): TTL-cache /stats + unfiltered pagination counts
Every /stats call ran SELECT count(*) FROM logs + SELECT count(DISTINCT
attacker_ip) FROM logs; every /logs and /attackers call ran an
unfiltered count for the paginator. At 500 concurrent users these
serialize through aiosqlite's worker threads and dominate wall time.

Cache at the router layer (repo stays dialect-agnostic):
  - /stats response: 5s TTL
  - /logs total (only when no filters): 2s TTL
  - /attackers total (only when no filters): 2s TTL

Filtered paths bypass the cache. Pattern reused from api_get_config
and api_get_health (asyncio.Lock + time.monotonic window + lazy lock).
2026-04-17 19:09:15 -04:00
de4b64d857 perf(auth): avoid duplicate user lookup in require_role
require_role._check previously chained from get_current_user, which
already loaded the user — then looked it up again. Inline the decode +
single user fetch + must_change_password + role check so every
authenticated request costs one SELECT users WHERE uuid=? instead of
two.
2026-04-17 17:48:42 -04:00
b5d7bf818f feat(health): 3-tier status (healthy / degraded / unhealthy)
Only database, docker, and ingestion_worker now count as critical
(→ 503 unhealthy). attacker/sniffer/collector failures drop overall
status to degraded (still 200) so the dashboard doesn't panic when a
non-essential worker isn't running.
2026-04-17 17:48:42 -04:00
257f780d0f docs(bugs): document SSE /api/v1/stream BrokenPipe storm (BUG-003) 2026-04-17 17:48:42 -04:00
a10aee282f perf(ingester): batch log writes into bulk commits
The ingester now accumulates up to DECNET_BATCH_SIZE rows (default 100)
or DECNET_BATCH_MAX_WAIT_MS (default 250ms) before flushing through
repo.add_logs — one transaction, one COMMIT per batch instead of per
row. Under attacker traffic this collapses N commits into ⌈N/100⌉ and
takes most of the SQLite writer-lock contention off the hot path.

Flush semantics are cancel-safe: _position only advances after a batch
commits successfully, and the flush helper bails without touching the
DB if the enclosing task is being cancelled (lifespan teardown).
Un-flushed lines stay in the file and are re-read on next startup.

Tests updated to assert on add_logs (bulk) instead of the per-row
add_log that the ingester no longer uses, plus a new test that 250
lines flush in ≤5 calls.
2026-04-17 16:37:34 -04:00
11b9e85874 feat(db): bulk add_logs for one-commit ingestion batches
Adds BaseRepository.add_logs (default: loops add_log for backwards
compatibility) and a real single-session/single-commit implementation
on SQLModelRepository. Introduces DECNET_BATCH_SIZE (default 100) and
DECNET_BATCH_MAX_WAIT_MS (default 250) so the ingester can flush on
either a size or a time bound when it adopts the new method.

Ingester wiring is deferred to a later pass — the single-log path was
deadlocking tests when flushed during lifespan teardown, so this change
ships the DB primitive alone.
2026-04-17 16:23:09 -04:00
45039bd621 fix(cache): lazy-init TTL cache locks to survive event-loop turnover
A module-level asyncio.Lock binds to the loop it was first awaited on.
Under pytest-anyio (and xdist) each test spins up a new loop; any later
test that hit /health or /config would wait on a lock owned by a dead
loop and the whole worker would hang.

Create the lock on first use and drop it in the test-reset helpers so a
fresh loop always gets a fresh lock.
2026-04-17 16:23:00 -04:00
4ea1c2ff4f fix(health): move Docker client+ping off the event loop
Under CPU saturation the sync docker.from_env()/ping() calls could miss
their socket timeout, cache _docker_healthy=False, and return 503 for
the full 5s TTL window. Both calls now run on a thread so the event
loop keeps serving other requests while Docker is being probed.
2026-04-17 15:43:51 -04:00
bb8d782e42 fix(cli): kill uvicorn worker tree on Ctrl+C
With --workers > 1, SIGINT from the terminal raced uvicorn's supervisor:
some workers got signaled directly, the supervisor respawned them, and
the result behaved like a forkbomb. Start uvicorn in its own session and
signal the whole process group (SIGTERM → 10s grace → SIGKILL) when we
catch KeyboardInterrupt.
2026-04-17 15:32:08 -04:00
342916ca63 feat(cli): expose --workers on decnet api
Forwards straight to uvicorn's --workers. Default stays at 1 so the
single-worker efficiency direction is preserved; raising it is available
for threat-actor load scenarios where the honeypot needs to soak real
attack traffic without queueing on one event loop.
2026-04-17 15:22:45 -04:00
d3f4bbb62b perf(locust): skip change-password in on_start when not required
Previously every user did login → change-pass → re-login in on_start
regardless of whether the server actually required a password change.
With bcrypt at ~250ms/call that's 3 bcrypt-bound requests per user.
At 2500 users the on_start queue was ~10k bcrypt ops — users never
escaped warmup, so @task endpoints never fired.

Login already returns must_change_password; only run the change-pass
+ re-login dance when the server says we have to. Cuts on_start from
3 requests to 1 for every user after the first DB initialization.
2026-04-17 15:15:59 -04:00
32340bea0d perf: migrate hot-path JSON serialization to orjson
stdlib json was FastAPI's default. Every response body, every SSE frame,
and every add_log/state/payload write paid the stdlib encode cost.

- pyproject.toml: add orjson>=3.10 as a core dep.
- decnet/web/api.py: default_response_class=ORJSONResponse on the
  FastAPI app, so every endpoint return goes through orjson without
  touching call sites. Explicit JSONResponse sites in the validation
  exception handlers migrated to ORJSONResponse for consistency.
- health endpoint's explicit JSONResponse → ORJSONResponse.
- SSE stream (api_stream_events.py): 6 json.dumps call sites →
  orjson.dumps(...).decode() — the per-event frames that fire on every
  sse tick.
- sqlmodel_repo.py: encode sites on the log-insert path switched to
  orjson (fields, payload, state value). Parser sites (json.loads)
  left as-is for now — not on the measured hot path.
2026-04-17 15:07:28 -04:00
f1e14280c0 perf: 1s TTL cache for /health DB probe and /config state reads
Locust hit /health and /config on every @task(3), so each request was
firing repo.get_total_logs() and two repo.get_state() calls against
aiosqlite — filling the driver queue for data that changes on the order
of seconds, not milliseconds.

Both caches follow the shape already used by the existing Docker cache:
- asyncio.Lock with double-checked TTL so concurrent callers collapse
  into one DB hit per 1s window.
- _reset_* helpers called from tests/api/conftest.py::setup_db so the
  module-level cache can't leak across tests.

tests/test_health_config_cache.py asserts 50 concurrent callers
produce exactly 1 repo call, and the cache expires after TTL.
2026-04-17 15:05:18 -04:00
931f33fb06 perf: cache Docker daemon ping in /health (5s TTL)
Creating a new docker.from_env() client per /health request opened a
fresh unix-socket connection each time. Under load that's wasteful and
hammers dockerd.

Keep a module-level client + last-check timestamp; actually ping every
5 seconds, return cached state in between. Reset helper provided for
tests.
2026-04-17 15:01:53 -04:00
467511e997 db: switch MySQL driver to asyncmy, env-tune pool, serialize DDL
- aiomysql → asyncmy on both sides of the URL/import (faster, maintained).
- Pool sizing now reads DECNET_DB_POOL_SIZE / MAX_OVERFLOW / RECYCLE /
  PRE_PING for both SQLite and MySQL engines so stress runs can bump
  without code edits.
- MySQL initialize() now wraps schema DDL in a GET_LOCK advisory lock so
  concurrent uvicorn workers racing create_all() don't hit 'Table was
  skipped since its definition is being modified by concurrent DDL'.
- sqlite & mysql repo get_log_histogram use the shared _session() helper
  instead of session_factory() for consistency with the rest of the repo.
- SSE stream_events docstring updated to asyncmy.
2026-04-17 15:01:49 -04:00
3945e72e11 perf: run bcrypt on a thread so it doesn't block the event loop
verify_password / get_password_hash are CPU-bound and take ~250ms each
at rounds=12. Called directly from async endpoints, they stall every
other coroutine for that window — the single biggest single-worker
bottleneck on the login path.

Adds averify_password / ahash_password that wrap the sync versions in
asyncio.to_thread. Sync versions stay put because _ensure_admin_user and
tests still use them.

5 call sites updated: login, change-password, create-user, reset-password.
tests/test_auth_async.py asserts parallel averify runs concurrently (~1x
of a single verify, not 2x).
2026-04-17 14:52:22 -04:00
bd406090a7 fix: re-seed admin password when still unfinalized (must_change_password=True)
_ensure_admin_user was strict insert-if-missing: once a stale hash landed
in decnet.db (e.g. from a deploy that used a different DECNET_ADMIN_PASSWORD),
login silently 401'd because changing the env var later had no effect.

Now on startup: if the admin still has must_change_password=True (they
never finalized their own password), re-sync the hash from the current
env var. Once the admin sets a real password, we leave it alone.

Found via locustfile.py login storm — see tests/test_admin_seed.py.

Note: this commit also bundles uncommitted pool-management work already
present in sqlmodel_repo.py from prior sessions.
2026-04-17 14:49:13 -04:00
e22d057e68 added: scripts/profile/aggregate_requests.py — roll up pyinstrument request profiles
Parses every HTML in profiles/, reattributes [self]/[await] synthetic
leaves to their parent function, and reports per-endpoint wall-time
(mean/p50/p95/max) plus top hot functions by cumulative self-time.

Makes post-locust profile dirs actually readable — otherwise they're
just a pile of hundred-plus HTML files.
2026-04-17 14:48:59 -04:00
cb12e7c475 fix: logging handler must not crash its caller on reopen failure
When decnet.system.log is root-owned (e.g. created by a pre-fix 'sudo
decnet deploy') and a subsequent non-root process tries to log, the
InodeAwareRotatingFileHandler raised PermissionError out of emit(),
which propagated up through logger.debug/info and killed the collector's
log stream loop ('log stream ended ... reason=[Errno 13]').

Now matches stdlib behaviour: wrap _open() in try/except OSError and
defer to handleError() on failure. Adds a regression test.

Also: scripts/profile/view.sh 'pyinstrument' keyword was matching
memray-flamegraph-*.html files. Exclude the memray-* prefix.
2026-04-17 14:01:36 -04:00
c29ca977fd added: scripts/profile/classify_usage.py — classify memray usage_over_time.csv
Reads the memray usage CSV and emits a verdict based on tail-drop-from-
peak: CLIMB-AND-DROP, MOSTLY-RELEASED, or SUSTAINED-AT-PEAK. Deliberately
ignores net-growth-vs-baseline since any active workload grows vs. a cold
interpreter — that metric is misleading as a leak signal.
2026-04-17 13:54:37 -04:00
bf4afac70f fix: RotatingFileHandler reopens on external deletion/rotation
Mirrors the inode-check fix from 935a9a5 (collector worker) for the
stdlib-handler-based log paths. Both decnet.system.log (config.py) and
decnet.log (logging/file_handler.py) now use a subclass that stats the
target path before each emit and reopens on inode/device mismatch —
matching the behavior of stdlib WatchedFileHandler while preserving
size-based rotation.

Previously: rm decnet.system.log → handler kept writing to the orphaned
inode until maxBytes triggered; all lines between were lost.
2026-04-17 13:42:15 -04:00
4b15b7eb35 fix: chown log files to sudo-invoking user so non-root API can append
'sudo decnet deploy' needs root for MACVLAN, but the log files it creates
(decnet.log and decnet.system.log) end up owned by root. A subsequent
non-root 'decnet api' then crashes on PermissionError appending to them.

New decnet.privdrop helper reads SUDO_UID/SUDO_GID and chowns files/dirs
back to the invoking user. Best-effort: no-op when not root, not under
sudo, path missing, or chown fails. Applied at both log-file creation
sites (config.py system log, logging/file_handler.py syslog file).
2026-04-17 13:39:09 -04:00
140d2fbaad fix: gate embedded sniffer behind DECNET_EMBED_SNIFFER (default off)
The API's lifespan unconditionally spawned a MACVLAN sniffer task, which
duplicated the standalone 'decnet sniffer --daemon' process that
'decnet deploy' always starts — causing two workers to sniff the same
interface, double events, and wasted CPU.

Mirror the existing DECNET_EMBED_PROFILER pattern: sniffer is OFF by
default, opt in explicitly. Static regression tests guard against
accidental removal of the gate.
2026-04-17 13:35:43 -04:00
064c8760b6 fix: memray run needs --trace-python-allocators for frame attribution
Without it, 'Total number of frames seen: 0' in memray stats and flamegraphs
render empty / C-only. Also added --follow-fork so uvicorn workers spawned
as child processes are tracked.
2026-04-17 13:24:55 -04:00
6572c5cbaf added: scripts/profile/view.sh — auto-pick newest artifact and open viewer
Dispatches by extension: .prof -> snakeviz, memray .bin -> memray flamegraph
(overridable via VIEW=table|tree|stats|summary|leaks), .svg/.html -> xdg-open.
Positional arg can be a file path or a type keyword (cprofile, memray, pyspy,
pyinstrument).
2026-04-17 13:20:05 -04:00
ba448bae13 docs: py-spy 0.4.1 lacks Python 3.14 support; wrapper aborts early
Root cause of 'No python processes found in process <pid>': py-spy needs
per-release ABI knowledge and 0.4.1 (latest PyPI) predates 3.14. Wrapper
now detects the interpreter and points users at pyinstrument/memray/cProfile.
2026-04-17 13:17:23 -04:00
1a18377b0a fix: mysql url builder tests expect asyncmy, not aiomysql
The builder in decnet/web/db/mysql/database.py emits 'mysql+asyncmy://' URLs
(asyncmy is the declared dep in pyproject.toml). Tests were stale from a
prior aiomysql era.
2026-04-17 13:13:36 -04:00
319c1dbb61 added: profiling toolchain (py-spy, pyinstrument, pytest-benchmark, memray, snakeviz)
New `profile` optional-deps group, opt-in Pyinstrument ASGI middleware
gated by DECNET_PROFILE_REQUESTS, bench marker + tests/perf/ micro-benchmarks
for repository hot paths, and scripts/profile/ helpers for py-spy/cProfile/memray.
2026-04-17 13:13:00 -04:00
c1d8102253 modified: DEVELOPMENT roadmap. one step closer to v1 2026-04-16 11:39:07 -04:00
443 changed files with 27336 additions and 4286 deletions

View File

@@ -28,7 +28,7 @@ jobs:
with: with:
python-version: "3.11" python-version: "3.11"
- run: pip install bandit - run: pip install bandit
- run: bandit -r decnet/ -ll -x decnet/services/registry.py - run: bandit -r decnet/ -ll -x decnet/services/registry.py -x decnet/templates/
pip-audit: pip-audit:
name: Dependency audit (pip-audit) name: Dependency audit (pip-audit)
@@ -40,7 +40,7 @@ jobs:
python-version: "3.11" python-version: "3.11"
- run: pip install pip-audit - run: pip install pip-audit
- run: pip install -e .[dev] - run: pip install -e .[dev]
- run: pip-audit --skip-editable - run: pip-audit --skip-editable --ignore-vuln CVE-2025-65896
test-standard: test-standard:
name: Test (Standard) name: Test (Standard)
@@ -48,7 +48,7 @@ jobs:
needs: [lint, bandit, pip-audit] needs: [lint, bandit, pip-audit]
strategy: strategy:
matrix: matrix:
python-version: ["3.11", "3.12"] python-version: ["3.11"]
steps: steps:
- uses: actions/checkout@v4 - uses: actions/checkout@v4
- uses: actions/setup-python@v5 - uses: actions/setup-python@v5
@@ -61,6 +61,19 @@ jobs:
name: Test (Live) name: Test (Live)
runs-on: ubuntu-latest runs-on: ubuntu-latest
needs: [test-standard] needs: [test-standard]
services:
mysql:
image: mysql:8.0
env:
MYSQL_ROOT_PASSWORD: root
MYSQL_DATABASE: decnet_test
ports:
- 3307:3306
options: >-
--health-cmd="mysqladmin ping -h 127.0.0.1"
--health-interval=10s
--health-timeout=5s
--health-retries=5
strategy: strategy:
matrix: matrix:
python-version: ["3.11"] python-version: ["3.11"]
@@ -71,6 +84,12 @@ jobs:
python-version: ${{ matrix.python-version }} python-version: ${{ matrix.python-version }}
- run: pip install -e .[dev] - run: pip install -e .[dev]
- run: pytest -m live - run: pytest -m live
env:
DECNET_MYSQL_HOST: 127.0.0.1
DECNET_MYSQL_PORT: 3307
DECNET_MYSQL_USER: root
DECNET_MYSQL_PASSWORD: root
DECNET_MYSQL_DATABASE: decnet_test
test-fuzz: test-fuzz:
name: Test (Fuzz) name: Test (Fuzz)
@@ -86,6 +105,8 @@ jobs:
python-version: ${{ matrix.python-version }} python-version: ${{ matrix.python-version }}
- run: pip install -e .[dev] - run: pip install -e .[dev]
- run: pytest -m fuzz - run: pytest -m fuzz
env:
SCHEMATHESIS_CONFIG: schemathesis.ci.toml
merge-to-testing: merge-to-testing:
name: Merge dev → testing name: Merge dev → testing

8
.gitignore vendored
View File

@@ -1,6 +1,7 @@
.venv/ .venv/
logs/ logs/
.claude/ .claude/*
CLAUDE.md
__pycache__/ __pycache__/
*.pyc *.pyc
*.pyo *.pyo
@@ -17,8 +18,13 @@ linterfails.log
webmail webmail
windows1 windows1
*.db *.db
*.db-shm
*.db-wal
decnet.*.log
decnet.json decnet.json
.env* .env*
.env.local .env.local
.coverage .coverage
.hypothesis/ .hypothesis/
profiles/*
tests/test_decnet.db*

View File

@@ -1,58 +0,0 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Commands
```bash
# Install (dev)
pip install -e .
# List registered service plugins
decnet services
# Dry-run (generates compose, no containers)
decnet deploy --mode unihost --deckies 3 --randomize-services --dry-run
# Full deploy (requires root for MACVLAN)
sudo decnet deploy --mode unihost --deckies 5 --interface eth0 --randomize-services
sudo decnet deploy --mode unihost --deckies 3 --services ssh,smb --log-target 192.168.1.5:5140
# Status / teardown
decnet status
sudo decnet teardown --all
sudo decnet teardown --id decky-01
```
## Project Overview
DECNET is a honeypot/deception network framework. It deploys fake machines (called **deckies**) with realistic services (RDP, SMB, SSH, FTP, etc.) to lure and profile attackers. All attacker interactions are aggregated to an isolated logging network (ELK stack / SIEM).
## Deployment Models
**UNIHOST** — one real host spins up _n_ deckies via a container orchestrator. Simpler, single-machine deployment.
**SWARM (MULTIHOST)**_n_ real hosts each running deckies. Orchestrated via Ansible/sshpass or similar tooling.
## Core Technology Choices
- **Containers**: Docker Compose is the starting point but other orchestration frameworks should be evaluated if they serve the project better. `debian:bookworm-slim` is the default base image; mixing in Ubuntu, CentOS, or other distros is encouraged to make the decoy network look heterogeneous.
- **Networking**: Deckies need to appear as real machines on the LAN (own MACs/IPs). MACVLAN and IPVLAN are candidates; the right driver depends on the host environment. WSL has known limitations — bare metal or a VM is preferred for testing.
- **Log pipeline**: Logstash → ELK stack → SIEM (isolated network, not reachable from decoy network)
## Architecture Constraints
- The decoy network must be reachable from the outside (attacker-facing).
- The logging/aggregation network must be isolated from the decoy network.
- A publicly accessible real server acts as the bridge between the two networks.
- Deckies should differ in exposed services and OS fingerprints to appear as a heterogeneous network.
- **IMPORTANT**: The system now strictly enforces dependency injection for storage. Do not import `SQLiteRepository` directly in new features; instead, use `get_repository()` from the factory or the FastAPI `get_repo` dependency.
## Development and testing
- For every new feature, pytests must me made.
- Pytest is the main testing framework in use.
- NEVER pass broken code to the user.
- Broken means: not running, not passing 100% tests, etc.
- After tests pass with 100%, always git commit your changes.
- NEVER add "Co-Authored-By" or any Claude attribution lines to git commit messages.

104
GEMINI.md
View File

@@ -1,104 +0,0 @@
# DECNET (Deception Network) Project Context
DECNET is a high-fidelity honeypot framework designed to deploy heterogeneous fleets of fake machines (called **deckies**) that appear as real hosts on a local network.
## Project Overview
- **Core Purpose:** To lure, profile, and log attacker interactions within a controlled, deceptive environment.
- **Key Technology:** Linux-native container networking (MACVLAN/IPvlan) combined with Docker to give each decoy its own MAC address, IP, and realistic TCP/IP stack behavior.
- **Main Components:**
- **Deckies:** Group of containers sharing a network namespace (one base container + multiple service containers).
- **Archetypes:** Pre-defined machine profiles (e.g., `windows-workstation`, `linux-server`) that bundle services and OS fingerprints.
- **Services:** Modular honeypot plugins (SSH, SMB, RDP, etc.) built as `BaseService` subclasses.
- **OS Fingerprinting:** Sysctl-based TCP/IP stack tuning to spoof OS detection (nmap).
- **Logging Pipeline:** RFC 5424 syslog forwarding to an isolated SIEM/ELK stack.
## Technical Stack
- **Language:** Python 3.11+
- **CLI Framework:** [Typer](https://typer.tiangolo.com/)
- **Data Validation:** [Pydantic v2](https://docs.pydantic.dev/)
- **Orchestration:** Docker Engine 24+ (via Docker SDK for Python)
- **Networking:** MACVLAN (default) or IPvlan L2 (for WiFi/restricted environments).
- **Testing:** Pytest (100% pass requirement).
- **Formatting/Linting:** Ruff, Bandit (SAST), pip-audit.
## Architecture
```text
Host NIC (eth0)
└── MACVLAN Bridge
├── Decky-01 (192.168.1.10) -> [Base] + [SSH] + [HTTP]
├── Decky-02 (192.168.1.11) -> [Base] + [SMB] + [RDP]
└── ...
```
- **Base Container:** Owns the IP/MAC, sets `sysctls` for OS spoofing, and runs `sleep infinity`.
- **Service Containers:** Use `network_mode: service:<base>` to share the identity and networking of the base container.
- **Isolation:** Decoy traffic is strictly separated from the logging network.
## Key Commands
### Development & Maintenance
- **Install (Dev):**
- `rm .venv -rf`
- `python3 -m venv .venv`
- `source .venv/bin/activate`
- `pip install -e .`
- **Run Tests:** `pytest` (Run before any commit)
- **Linting:** `ruff check .`
- **Security Scan:** `bandit -r decnet/`
- **Web Git:** git.resacachile.cl (Gitea)
### CLI Usage
- **List Services:** `decnet services`
- **List Archetypes:** `decnet archetypes`
- **Dry Run (Compose Gen):** `decnet deploy --deckies 3 --randomize-services --dry-run`
- **Deploy (Full):** `sudo .venv/bin/decnet deploy --interface eth0 --deckies 5 --randomize-services`
- **Status:** `decnet status`
- **Teardown:** `sudo .venv/bin/decnet teardown --all`
## Development Conventions
- **Code Style:**
- Strict adherence to Ruff/PEP8.
- **Always use typed variables**. If any non-types variables are found, they must be corrected.
- The correct way is `x: int = 1`, never `x : int = 1`.
- If assignment is present, always use a space between the type and the equal sign `x: int = 1`.
- **Never** use lowercase L (l), uppercase o (O) or uppercase i (i) in single-character names.
- **Internal vars are to be declared with an underscore** (_internal_variable_name).
- **Internal to internal vars are to be declared with double underscore** (__internal_variable_name).
- Always use snake_case for code.
- Always use PascalCase for classes and generics.
- **Testing:** New features MUST include a `pytest` case. 100% test pass rate is mandatory before merging.
- **Plugin System:**
- New services go in `decnet/services/<name>.py`.
- Subclass `decnet.services.base.BaseService`.
- The registry uses auto-discovery; no manual registration required.
- **Configuration:**
- Use Pydantic models in `decnet/config.py` for any new settings.
- INI file parsing is handled in `decnet/ini_loader.py`.
- **State Management:**
- Runtime state is persisted in `decnet-state.json`.
- Do not modify this file manually.
- **General Development Guidelines**:
- **Never** commit broken code, or before running `pytest`s or `bandit` at the project level.
- **No matter how small** the changes, they must be committed.
- **If new features are addedd** new tests must be added, too.
- **Never present broken code to the user**. Test, validate, then present.
- **Extensive testing** for every function must be created.
- **Always develop in the `dev` branch, never in `main`.**
- **Test in the `testing` branch.**
- **IMPORTANT**: The system now strictly enforces dependency injection for storage. Do not import `SQLiteRepository` directly in new features; instead, use `get_repository()` from the factory or the FastAPI `get_repo` dependency.
## Directory Structure
- `decnet/`: Main source code.
- `services/`: Honeypot service implementations.
- `logging/`: Syslog formatting and forwarding logic.
- `correlation/`: (In Progress) Logic for grouping attacker events.
- `templates/`: Dockerfiles and entrypoint scripts for services.
- `tests/`: Pytest suite.
- `pyproject.toml`: Dependency and entry point definitions.
- `CLAUDE.md`: Claude-specific environment guidance.
- `DEVELOPMENT.md`: Roadmap and TODOs.

110
README.md
View File

@@ -508,6 +508,10 @@ DECNET_WEB_HOST=0.0.0.0
DECNET_WEB_PORT=8080 DECNET_WEB_PORT=8080
DECNET_ADMIN_USER=admin DECNET_ADMIN_USER=admin
DECNET_ADMIN_PASSWORD=admin DECNET_ADMIN_PASSWORD=admin
# Database pool tuning (applies to both SQLite and MySQL)
DECNET_DB_POOL_SIZE=20 # base pool connections (default: 20)
DECNET_DB_MAX_OVERFLOW=40 # extra connections under burst (default: 40)
``` ```
Copy `.env.example` to `.env.local` and modify it to suit your environment. Copy `.env.example` to `.env.local` and modify it to suit your environment.
@@ -676,6 +680,112 @@ The test suite covers:
Every new feature requires passing tests before merging. Every new feature requires passing tests before merging.
### Stress Testing
A [Locust](https://locust.io)-based stress test suite lives in `tests/stress/`. It hammers every API endpoint with realistic traffic patterns to find throughput ceilings and latency degradation.
```bash
# Run via pytest (starts its own server)
pytest -m stress tests/stress/ -v -x -n0 -s
# Crank it up
STRESS_USERS=2000 STRESS_SPAWN_RATE=200 STRESS_DURATION=120 pytest -m stress tests/stress/ -v -x -n0 -s
# Standalone Locust web UI against a running server
locust -f tests/stress/locustfile.py --host http://localhost:8000
```
| Env var | Default | Description |
|---|---|---|
| `STRESS_USERS` | `500` | Total simulated users |
| `STRESS_SPAWN_RATE` | `50` | Users spawned per second |
| `STRESS_DURATION` | `60` | Test duration in seconds |
| `STRESS_WORKERS` | CPU count (max 4) | Uvicorn workers for the test server |
| `STRESS_MIN_RPS` | `500` | Minimum RPS to pass baseline test |
| `STRESS_MAX_P99_MS` | `200` | Maximum p99 latency (ms) to pass |
| `STRESS_SPIKE_USERS` | `1000` | Users for thundering herd test |
| `STRESS_SUSTAINED_USERS` | `200` | Users for sustained load test |
#### Measured baseline
Reference numbers from recent Locust runs against a MySQL backend
(asyncmy driver). All runs hold zero failures throughout.
**Single worker** (unless noted):
| Metric | 500u, tracing on | 1500u, tracing on | 1500u, tracing **off** | 1500u, tracing off, **pinned to 1 core** | 1500u, tracing off, **12 workers** |
|---|---|---|---|---|---|
| Requests served | 396,672 | 232,648 | 277,214 | 3,532 | 308,024 |
| Failures | 0 | 0 | 0 | 0 | 0 |
| Throughput (current RPS) | ~960 | ~880 | ~990 | ~46 | ~1,585 |
| Average latency | 465 ms | 1,774 ms | 1,489 ms | 21.7 s | 930 ms |
| Median (p50) | 100 ms | 690 ms | 340 ms | 270 ms | 700 ms |
| p95 | 1.9 s | 6.5 s | 5.7 s | 115 s | 2.7 s |
| p99 | 2.9 s | 9.5 s | 8.4 s | 122 s | 4.2 s |
| Max observed | 8.3 s | 24.4 s | 20.9 s | 124.5 s | 16.5 s |
Ramp is 15 users/s for the 500u column, 40 users/s otherwise.
Takeaways:
- **Tracing off**: at 1500 users, flipping `DECNET_TRACING=false`
halves p50 (690 → 340 ms) and pushes RPS from ~880 past the
500-user figure on a single worker.
- **12 workers**: RPS scales ~1.6× over a single worker (~990 →
~1585). Sublinear because the workload is DB-bound — MySQL and the
connection pool become the new ceiling, not Python. p99 drops from
8.4 s to 4.2 s.
- **Connection math**: `DECNET_DB_POOL_SIZE=20` × `DECNET_DB_MAX_OVERFLOW=40`
× 12 workers = 720 connections at peak. MySQL's default
`max_connections=151` needs bumping (we used 2000) before running
multi-worker load.
- **Single-core pinning**: ~46 RPS with p95 near two minutes. Interesting
as a "physics floor" datapoint — not a production config.
Top endpoints by volume: `/api/v1/attackers`, `/api/v1/deckies`,
`/api/v1/bounty`, `/api/v1/logs/histogram`, `/api/v1/config`,
`/api/v1/health`, `/api/v1/auth/login`, `/api/v1/logs`.
Notes on tuning:
- **Python 3.14 is currently a no-go for the API server.** Under heavy
concurrent async load the reworked 3.14 GC segfaults inside
`mark_all_reachable` (observed in `_PyGC_Collect` during pending-GC
on 3.14.3). Stick to Python 3.113.13 until upstream stabilises.
- Router-level TTL caches on hot count/stats endpoints (`/stats`,
`/logs` count, `/attackers` count, `/bounty`, `/logs/histogram`,
`/deckies`, `/config`) collapse concurrent duplicate work onto a
single DB hit per window — essential to reach this RPS on one worker.
- Turning off request tracing (`DECNET_TRACING=false`) is the next
free headroom: tracing was still on during the run above.
- On SQLite, `DECNET_DB_POOL_PRE_PING=false` skips the per-checkout
`SELECT 1`. On MySQL, keep it `true` — network disconnects are real.
#### System tuning: open file limit
Under heavy load (500+ concurrent users), the server will exhaust the default Linux open file limit (`ulimit -n`), causing `OSError: [Errno 24] Too many open files`. Most distros default to **1024**, which is far too low for stress testing or production use.
**Before running stress tests:**
```bash
# Check current limit
ulimit -n
# Bump for this shell session
ulimit -n 65536
```
**Permanent fix** — add to `/etc/security/limits.conf`:
```
* soft nofile 65536
* hard nofile 65536
```
Or for systemd-managed services, add `LimitNOFILE=65536` to the unit file.
> This applies to production deployments too — any server handling hundreds of concurrent connections needs a raised file descriptor limit.
# AI Disclosure # AI Disclosure
This project has been made with lots, and I mean lots of help from AIs. While most of the design was made by me, most of the coding was done by AI models. This project has been made with lots, and I mean lots of help from AIs. While most of the design was made by me, most of the coding was done by AI models.

64
decnet.ini.example Normal file
View File

@@ -0,0 +1,64 @@
; /etc/decnet/decnet.ini — DECNET host configuration
;
; Copy to /etc/decnet/decnet.ini and edit. Values here seed os.environ at
; CLI startup via setdefault() — real env vars still win, so you can
; override any value on the shell without editing this file.
;
; A missing file is fine; every daemon has sensible defaults. The main
; reason to use this file is to skip typing the same flags on every
; `decnet` invocation and to pin a host's role via `mode`.
[decnet]
; mode = agent | master
; agent — worker host (runs `decnet agent`, `decnet forwarder`, `decnet updater`).
; Master-only commands (api, swarmctl, swarm, deploy, teardown, ...)
; are hidden from `decnet --help` and refuse to run.
; master — central server (runs `decnet api`, `decnet web`, `decnet swarmctl`,
; `decnet listener`). All commands visible.
mode = agent
; disallow-master = true (default when mode=agent)
; Set to false for hybrid dev hosts that legitimately run both roles.
disallow-master = true
; log-directory — root for DECNET's per-component logs. Systemd units set
; DECNET_SYSTEM_LOGS=<log-directory>/decnet.<component>.log so agent, forwarder,
; and engine each get their own file. The forwarder tails decnet.log.
log-directory = /var/log/decnet
; ─── Agent-only settings (read when mode=agent) ───────────────────────────
[agent]
; Where the master's syslog-TLS listener lives. DECNET_SWARM_MASTER_HOST.
master-host = 192.168.1.50
; Master listener port (RFC 5425 default 6514). DECNET_SWARM_SYSLOG_PORT.
swarm-syslog-port = 6514
; Bind address/port for this worker's agent API (mTLS).
agent-port = 8765
; Cert bundle dir — must contain ca.crt, worker.crt, worker.key from enroll.
; DECNET_AGENT_DIR — honored by the forwarder child as well.
agent-dir = /home/anti/.decnet/agent
; Updater cert bundle (required for `decnet updater`).
updater-dir = /home/anti/.decnet/updater
; ─── Master-only settings (read when mode=master) ─────────────────────────
[master]
; Main API (REST for the React dashboard). DECNET_API_HOST / _PORT.
api-host = 0.0.0.0
api-port = 8000
; React dev-server dashboard (`decnet web`). DECNET_WEB_HOST / _PORT.
web-host = 0.0.0.0
web-port = 8080
; Swarm controller (master-internal). DECNET_SWARMCTL_HOST isn't exposed
; under that name today — this block is the forward-compatible spelling.
; swarmctl-host = 127.0.0.1
; swarmctl-port = 8770
; Syslog-over-TLS listener bind address and port. DECNET_LISTENER_HOST and
; DECNET_SWARM_SYSLOG_PORT. The listener is auto-spawned by `decnet swarmctl`.
listener-host = 0.0.0.0
swarm-syslog-port = 6514
; Master CA dir (for enroll / swarm cert issuance).
; ca-dir = /home/anti/.decnet/ca
; JWT secret for the web API. MUST be set; 32+ bytes. Keep out of git.
; jwt-secret = REPLACE_ME_WITH_A_32_BYTE_SECRET

View File

@@ -0,0 +1,12 @@
"""DECNET — honeypot deception-network framework.
This __init__ runs once, on the first `import decnet.*`. It seeds
os.environ from /etc/decnet/decnet.ini (if present) so that later
module-level reads in decnet.env pick up the INI values as if they had
been exported by the shell. Real env vars always win via setdefault().
Kept minimal on purpose — any heavier work belongs in a submodule.
"""
from decnet.config_ini import load_ini_config as _load_ini_config
_load_ini_config()

7
decnet/agent/__init__.py Normal file
View File

@@ -0,0 +1,7 @@
"""DECNET worker agent — runs on every SWARM worker host.
Exposes an mTLS-protected FastAPI service the master's SWARM controller
calls to deploy, mutate, and tear down deckies locally. The agent reuses
the existing `decnet.engine.deployer` code path unchanged, so a worker runs
deckies the same way `decnet deploy --mode unihost` does today.
"""

144
decnet/agent/app.py Normal file
View File

@@ -0,0 +1,144 @@
"""Worker-side FastAPI app.
Protected by mTLS at the ASGI/uvicorn transport layer: uvicorn is started
with ``--ssl-ca-certs`` + ``--ssl-cert-reqs 2`` (CERT_REQUIRED), so any
client that cannot prove a cert signed by the DECNET CA is rejected before
reaching a handler. Once past the TLS handshake, all peers are trusted
equally (the only entity holding a CA-signed cert is the master
controller).
Endpoints mirror the existing unihost CLI verbs:
* ``POST /deploy`` — body: serialized ``DecnetConfig``
* ``POST /teardown`` — body: optional ``{"decky_id": "..."}``
* ``POST /mutate`` — body: ``{"decky_id": "...", "services": [...]}``
* ``GET /status`` — deployment snapshot
* ``GET /health`` — liveness probe, does NOT require mTLS? No — mTLS
still required; master pings it with its cert.
"""
from __future__ import annotations
from contextlib import asynccontextmanager
from typing import Optional
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from decnet.agent import executor as _exec
from decnet.agent import heartbeat as _heartbeat
from decnet.config import DecnetConfig
from decnet.logging import get_logger
log = get_logger("agent.app")
@asynccontextmanager
async def _lifespan(app: FastAPI):
# Best-effort: if identity/bundle plumbing isn't configured (e.g. dev
# runs or non-enrolled hosts), heartbeat.start() is a silent no-op.
_heartbeat.start()
try:
yield
finally:
await _heartbeat.stop()
app = FastAPI(
title="DECNET SWARM Agent",
version="0.1.0",
docs_url=None, # no interactive docs on worker — narrow attack surface
redoc_url=None,
openapi_url=None,
lifespan=_lifespan,
responses={
400: {"description": "Malformed request body"},
500: {"description": "Executor error"},
},
)
# ------------------------------------------------------------------ schemas
class DeployRequest(BaseModel):
config: DecnetConfig = Field(..., description="Full DecnetConfig to materialise on this worker")
dry_run: bool = False
no_cache: bool = False
class TeardownRequest(BaseModel):
decky_id: Optional[str] = None
class MutateRequest(BaseModel):
decky_id: str
services: list[str]
# ------------------------------------------------------------------ routes
@app.get("/health")
async def health() -> dict[str, str]:
return {"status": "ok"}
@app.get("/status")
async def status() -> dict:
return await _exec.status()
@app.post(
"/deploy",
responses={500: {"description": "Deployer raised an exception materialising the config"}},
)
async def deploy(req: DeployRequest) -> dict:
try:
await _exec.deploy(req.config, dry_run=req.dry_run, no_cache=req.no_cache)
except Exception as exc:
log.exception("agent.deploy failed")
raise HTTPException(status_code=500, detail=str(exc)) from exc
return {"status": "deployed", "deckies": len(req.config.deckies)}
@app.post(
"/teardown",
responses={500: {"description": "Teardown raised an exception"}},
)
async def teardown(req: TeardownRequest) -> dict:
try:
await _exec.teardown(req.decky_id)
except Exception as exc:
log.exception("agent.teardown failed")
raise HTTPException(status_code=500, detail=str(exc)) from exc
return {"status": "torn_down", "decky_id": req.decky_id}
@app.post(
"/self-destruct",
responses={500: {"description": "Reaper could not be scheduled"}},
)
async def self_destruct() -> dict:
"""Stop all DECNET services on this worker and delete the install
footprint. Called by the master during decommission. Logs under
/var/log/decnet* are preserved. Fire-and-forget — returns 202 before
the reaper starts deleting files."""
try:
await _exec.self_destruct()
except Exception as exc:
log.exception("agent.self_destruct failed")
raise HTTPException(status_code=500, detail=str(exc)) from exc
return {"status": "self_destruct_scheduled"}
@app.post(
"/mutate",
responses={501: {"description": "Worker-side mutate not yet implemented"}},
)
async def mutate(req: MutateRequest) -> dict:
# TODO: implement worker-side mutate. Currently the master performs
# mutation by re-sending a full /deploy with the updated DecnetConfig;
# this avoids duplicating mutation logic on the worker for v1. When
# ready, replace the 501 with a real redeploy-of-a-single-decky path.
raise HTTPException(
status_code=501,
detail="Per-decky mutate is performed via /deploy with updated services",
)

223
decnet/agent/executor.py Normal file
View File

@@ -0,0 +1,223 @@
"""Thin adapter between the agent's HTTP endpoints and the existing
``decnet.engine.deployer`` code path.
Kept deliberately small: the agent does not re-implement deployment logic,
it only translates a master RPC into the same function calls the unihost
CLI already uses. Everything runs in a worker thread (the deployer is
blocking) so the FastAPI event loop stays responsive.
"""
from __future__ import annotations
import asyncio
from ipaddress import IPv4Network
from typing import Any
from decnet.engine import deployer as _deployer
from decnet.config import DecnetConfig, load_state, clear_state
from decnet.logging import get_logger
from decnet.network import (
allocate_ips,
detect_interface,
detect_subnet,
get_host_ip,
)
log = get_logger("agent.executor")
def _relocalize(config: DecnetConfig) -> DecnetConfig:
"""Rewrite a master-built config to the worker's local network reality.
The master populates ``interface``/``subnet``/``gateway`` from its own
box before dispatching, which blows up the deployer on any worker whose
NIC name differs (common in heterogeneous fleets — master on ``wlp6s0``,
worker on ``enp0s3``). We always re-detect locally; if the worker sits
on a different subnet than the master, decky IPs are re-allocated from
the worker's subnet so they're actually reachable.
"""
local_iface = detect_interface()
local_subnet, local_gateway = detect_subnet(local_iface)
local_host_ip = get_host_ip(local_iface)
updates: dict[str, Any] = {
"interface": local_iface,
"subnet": local_subnet,
"gateway": local_gateway,
}
master_net = IPv4Network(config.subnet, strict=False) if config.subnet else None
local_net = IPv4Network(local_subnet, strict=False)
if master_net is None or master_net != local_net:
log.info(
"agent.deploy subnet mismatch master=%s local=%s — re-allocating decky IPs",
config.subnet, local_subnet,
)
fresh_ips = allocate_ips(
subnet=local_subnet,
gateway=local_gateway,
host_ip=local_host_ip,
count=len(config.deckies),
)
new_deckies = [d.model_copy(update={"ip": ip}) for d, ip in zip(config.deckies, fresh_ips)]
updates["deckies"] = new_deckies
return config.model_copy(update=updates)
async def deploy(config: DecnetConfig, dry_run: bool = False, no_cache: bool = False) -> None:
"""Run the blocking deployer off-loop. The deployer itself calls
save_state() internally once the compose file is materialised."""
log.info(
"agent.deploy mode=%s deckies=%d interface=%s (incoming)",
config.mode, len(config.deckies), config.interface,
)
if config.mode == "swarm":
config = _relocalize(config)
log.info(
"agent.deploy relocalized interface=%s subnet=%s gateway=%s",
config.interface, config.subnet, config.gateway,
)
await asyncio.to_thread(_deployer.deploy, config, dry_run, no_cache, False)
async def teardown(decky_id: str | None = None) -> None:
log.info("agent.teardown decky_id=%s", decky_id)
await asyncio.to_thread(_deployer.teardown, decky_id)
if decky_id is None:
await asyncio.to_thread(clear_state)
def _decky_runtime_states(config: DecnetConfig) -> dict[str, dict[str, Any]]:
"""Map decky_name → {"running": bool, "services": {svc: container_state}}.
Queried so the master can tell, after a partial-failure deploy, which
deckies actually came up instead of tainting the whole shard as failed.
Best-effort: a docker error returns an empty map, not an exception.
"""
try:
import docker # local import — agent-only path
client = docker.from_env()
live = {c.name: c.status for c in client.containers.list(all=True, ignore_removed=True)}
except Exception: # pragma: no cover — defensive
log.exception("_decky_runtime_states: docker query failed")
return {}
out: dict[str, dict[str, Any]] = {}
for d in config.deckies:
svc_states = {
svc: live.get(f"{d.name}-{svc.replace('_', '-')}", "absent")
for svc in d.services
}
out[d.name] = {
"running": bool(svc_states) and all(s == "running" for s in svc_states.values()),
"services": svc_states,
}
return out
_REAPER_SCRIPT = r"""#!/bin/bash
# DECNET agent self-destruct reaper.
# Runs detached from the agent process so it survives the agent's death.
# Waits briefly for the HTTP response to drain, then stops services,
# wipes install paths, and preserves logs.
set +e
sleep 3
# Stop decky containers started by the local deployer (best-effort).
if command -v docker >/dev/null 2>&1; then
docker ps -q --filter "label=com.docker.compose.project=decnet" | xargs -r docker stop
docker ps -aq --filter "label=com.docker.compose.project=decnet" | xargs -r docker rm -f
docker network rm decnet_lan 2>/dev/null
fi
# Stop+disable every systemd unit the installer may have dropped.
for unit in decnet-agent decnet-engine decnet-collector decnet-forwarder decnet-prober decnet-sniffer decnet-updater; do
systemctl stop "$unit" 2>/dev/null
systemctl disable "$unit" 2>/dev/null
done
# Nuke install paths. Logs under /var/log/decnet* are intentionally
# preserved — the operator typically wants them for forensic review.
rm -rf /opt/decnet* /var/lib/decnet/* /usr/local/bin/decnet* /etc/decnet
rm -f /etc/systemd/system/decnet-*.service /etc/systemd/system/decnet-*.timer
systemctl daemon-reload 2>/dev/null
rm -f "$0"
"""
async def self_destruct() -> None:
"""Tear down deckies, then spawn a detached reaper that wipes the
install footprint. Returns immediately so the HTTP response can drain
before the reaper starts deleting files out from under the agent."""
import os
import shutil
import subprocess # nosec B404
import tempfile
# Best-effort teardown first — the reaper also runs docker stop, but
# going through the deployer gives the host-macvlan/ipvlan helper a
# chance to clean up routes cleanly.
try:
await asyncio.to_thread(_deployer.teardown, None)
await asyncio.to_thread(clear_state)
except Exception:
log.exception("self_destruct: pre-reap teardown failed — reaper will force-stop containers")
# Reaper lives under /tmp so it survives rm -rf /opt/decnet*.
fd, path = tempfile.mkstemp(prefix="decnet-reaper-", suffix=".sh", dir="/tmp") # nosec B108 — reaper must outlive /opt/decnet removal
try:
os.write(fd, _REAPER_SCRIPT.encode())
finally:
os.close(fd)
os.chmod(path, 0o700) # nosec B103 — root-owned reaper, needs exec
# The reaper MUST run outside decnet-agent.service's cgroup — otherwise
# `systemctl stop decnet-agent` SIGTERMs the whole cgroup (reaper included)
# before rm -rf completes. `start_new_session=True` gets us a fresh POSIX
# session but does NOT escape the systemd cgroup. So we prefer
# `systemd-run --scope` (launches the command in a transient scope
# detached from the caller's service), falling back to a bare Popen if
# systemd-run is unavailable (non-systemd host / container).
systemd_run = shutil.which("systemd-run")
if systemd_run:
argv = [
systemd_run,
"--collect",
"--unit", f"decnet-reaper-{os.getpid()}",
"--description", "DECNET agent self-destruct reaper",
"/bin/bash", path,
]
spawn_kwargs = {"start_new_session": True}
else:
argv = ["/bin/bash", path]
spawn_kwargs = {"start_new_session": True}
subprocess.Popen( # nosec B603
argv,
stdin=subprocess.DEVNULL,
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
close_fds=True,
**spawn_kwargs,
)
log.warning(
"self_destruct: reaper spawned path=%s via=%s — agent will die in ~3s",
path, "systemd-run" if systemd_run else "popen",
)
async def status() -> dict[str, Any]:
state = await asyncio.to_thread(load_state)
if state is None:
return {"deployed": False, "deckies": []}
config, _compose_path = state
runtime = await asyncio.to_thread(_decky_runtime_states, config)
return {
"deployed": True,
"mode": config.mode,
"compose_path": str(_compose_path),
"deckies": [d.model_dump() for d in config.deckies],
"runtime": runtime,
}

134
decnet/agent/heartbeat.py Normal file
View File

@@ -0,0 +1,134 @@
"""Agent → master liveness heartbeat loop.
Every ``INTERVAL_S`` seconds the worker posts ``executor.status()`` to
``POST <master>/swarm/heartbeat`` over mTLS. The master pins the
presented client cert's SHA-256 against the ``SwarmHost`` row for the
claimed ``host_uuid``; a match refreshes ``last_heartbeat`` + each
``DeckyShard``'s snapshot + runtime state.
Identity comes from ``/etc/decnet/decnet.ini`` (seeded by the enroll
bundle) — specifically ``DECNET_HOST_UUID`` and ``DECNET_MASTER_HOST``.
The worker's existing ``~/.decnet/agent/`` bundle (or
``/etc/decnet/agent/``) provides the mTLS client cert.
Started/stopped via the agent FastAPI app's lifespan. If identity
plumbing is missing (pre-enrollment dev runs) the loop logs at DEBUG and
declines to start — callers don't have to guard it.
"""
from __future__ import annotations
import asyncio
import pathlib
from typing import Optional
import httpx
from decnet.agent import executor as _exec
from decnet.logging import get_logger
from decnet.swarm import pki
from decnet.swarm.log_forwarder import build_worker_ssl_context
log = get_logger("agent.heartbeat")
INTERVAL_S = 30.0
_TIMEOUT = httpx.Timeout(connect=5.0, read=10.0, write=5.0, pool=5.0)
_task: Optional[asyncio.Task] = None
def _resolve_agent_dir() -> pathlib.Path:
"""Match the agent-dir resolution order used by the agent server:
DECNET_AGENT_DIR env, else /etc/decnet/agent (production install),
else ~/.decnet/agent (dev)."""
import os
env = os.environ.get("DECNET_AGENT_DIR")
if env:
return pathlib.Path(env)
system = pathlib.Path("/etc/decnet/agent")
if system.exists():
return system
return pki.DEFAULT_AGENT_DIR
async def _tick(client: httpx.AsyncClient, url: str, host_uuid: str, agent_version: str) -> None:
snap = await _exec.status()
resp = await client.post(
url,
json={
"host_uuid": host_uuid,
"agent_version": agent_version,
"status": snap,
},
)
# 403 / 404 are terminal-ish — we still keep looping because an
# operator may re-enrol the host mid-session, but we log loudly so
# prod ops can spot cert-pinning drift.
if resp.status_code == 204:
return
log.warning(
"heartbeat rejected status=%d body=%s",
resp.status_code, resp.text[:200],
)
async def _loop(url: str, host_uuid: str, agent_version: str, ssl_ctx) -> None:
log.info("heartbeat loop starting url=%s host_uuid=%s interval=%ss",
url, host_uuid, INTERVAL_S)
async with httpx.AsyncClient(verify=ssl_ctx, timeout=_TIMEOUT) as client:
while True:
try:
await _tick(client, url, host_uuid, agent_version)
except asyncio.CancelledError:
raise
except Exception:
log.exception("heartbeat tick failed — will retry in %ss", INTERVAL_S)
await asyncio.sleep(INTERVAL_S)
def start() -> Optional[asyncio.Task]:
"""Kick off the background heartbeat task. No-op if identity is
unconfigured (dev mode) — the caller doesn't need to check."""
global _task
from decnet.env import (
DECNET_HOST_UUID,
DECNET_MASTER_HOST,
DECNET_SWARMCTL_PORT,
)
if _task is not None and not _task.done():
return _task
if not DECNET_HOST_UUID or not DECNET_MASTER_HOST:
log.debug("heartbeat not starting — DECNET_HOST_UUID or DECNET_MASTER_HOST unset")
return None
agent_dir = _resolve_agent_dir()
try:
ssl_ctx = build_worker_ssl_context(agent_dir)
except Exception:
log.exception("heartbeat not starting — worker SSL context unavailable at %s", agent_dir)
return None
try:
from decnet import __version__ as _v
agent_version = _v
except Exception:
agent_version = "unknown"
url = f"https://{DECNET_MASTER_HOST}:{DECNET_SWARMCTL_PORT}/swarm/heartbeat"
_task = asyncio.create_task(
_loop(url, DECNET_HOST_UUID, agent_version, ssl_ctx),
name="agent-heartbeat",
)
return _task
async def stop() -> None:
global _task
if _task is None:
return
_task.cancel()
try:
await _task
except (asyncio.CancelledError, Exception):
pass
_task = None

70
decnet/agent/server.py Normal file
View File

@@ -0,0 +1,70 @@
"""Worker-agent uvicorn launcher.
Starts ``decnet.agent.app:app`` over HTTPS with mTLS enforcement. The
worker must already have a bundle in ``~/.decnet/agent/`` (delivered by
``decnet swarm enroll`` from the master); if it does not, we refuse to
start — unauthenticated agents are not a supported mode.
"""
from __future__ import annotations
import os
import pathlib
import signal
import subprocess # nosec B404
import sys
from decnet.logging import get_logger
from decnet.swarm import pki
log = get_logger("agent.server")
def run(host: str, port: int, agent_dir: pathlib.Path = pki.DEFAULT_AGENT_DIR) -> int:
bundle = pki.load_worker_bundle(agent_dir)
if bundle is None:
print(
f"[agent] No cert bundle at {agent_dir}. "
f"Run `decnet swarm enroll` from the master first.",
file=sys.stderr,
)
return 2
keyfile = agent_dir / "worker.key"
certfile = agent_dir / "worker.crt"
cafile = agent_dir / "ca.crt"
cmd = [
sys.executable,
"-m",
"uvicorn",
"decnet.agent.app:app",
"--host",
host,
"--port",
str(port),
"--ssl-keyfile",
str(keyfile),
"--ssl-certfile",
str(certfile),
"--ssl-ca-certs",
str(cafile),
# 2 == ssl.CERT_REQUIRED — clients MUST present a CA-signed cert.
"--ssl-cert-reqs",
"2",
]
log.info("agent starting host=%s port=%d bundle=%s", host, port, agent_dir)
# Own process group for clean Ctrl+C / SIGTERM propagation to uvicorn
# workers (same pattern as `decnet api`).
proc = subprocess.Popen(cmd, start_new_session=True) # nosec B603
try:
return proc.wait()
except KeyboardInterrupt:
try:
os.killpg(proc.pid, signal.SIGTERM)
try:
return proc.wait(timeout=10)
except subprocess.TimeoutExpired:
os.killpg(proc.pid, signal.SIGKILL)
return proc.wait()
except ProcessLookupError:
return 0

View File

@@ -1,972 +0,0 @@
"""
DECNET CLI — entry point for all commands.
Usage:
decnet deploy --mode unihost --deckies 5 --randomize-services
decnet status
decnet teardown [--all | --id decky-01]
decnet services
"""
import signal
from typing import Optional
import typer
from rich.console import Console
from rich.table import Table
from decnet.logging import get_logger
from decnet.env import (
DECNET_API_HOST,
DECNET_API_PORT,
DECNET_INGEST_LOG_FILE,
DECNET_WEB_HOST,
DECNET_WEB_PORT,
)
from decnet.archetypes import Archetype, all_archetypes, get_archetype
from decnet.config import (
DecnetConfig,
)
from decnet.distros import all_distros, get_distro
from decnet.fleet import all_service_names, build_deckies, build_deckies_from_ini
from decnet.ini_loader import load_ini
from decnet.network import detect_interface, detect_subnet, allocate_ips, get_host_ip
from decnet.services.registry import all_services
log = get_logger("cli")
def _daemonize() -> None:
"""Fork the current process into a background daemon (Unix double-fork)."""
import os
import sys
if os.fork() > 0:
raise SystemExit(0)
os.setsid()
if os.fork() > 0:
raise SystemExit(0)
sys.stdout = open(os.devnull, "w") # noqa: SIM115
sys.stderr = open(os.devnull, "w") # noqa: SIM115
sys.stdin = open(os.devnull, "r") # noqa: SIM115
app = typer.Typer(
name="decnet",
help="Deploy a deception network of honeypot deckies on your LAN.",
no_args_is_help=True,
)
console = Console()
def _kill_all_services() -> None:
"""Find and kill all running DECNET microservice processes."""
import os
registry = _service_registry(str(DECNET_INGEST_LOG_FILE))
killed = 0
for name, match_fn, _launch_args in registry:
pid = _is_running(match_fn)
if pid is not None:
console.print(f"[yellow]Stopping {name} (PID {pid})...[/]")
os.kill(pid, signal.SIGTERM)
killed += 1
if killed:
console.print(f"[green]{killed} background process(es) stopped.[/]")
else:
console.print("[dim]No DECNET services were running.[/]")
@app.command()
def api(
port: int = typer.Option(DECNET_API_PORT, "--port", help="Port for the backend API"),
host: str = typer.Option(DECNET_API_HOST, "--host", help="Host IP for the backend API"),
log_file: str = typer.Option(DECNET_INGEST_LOG_FILE, "--log-file", help="Path to the DECNET log file to monitor"),
daemon: bool = typer.Option(False, "--daemon", "-d", help="Detach to background as a daemon process"),
) -> None:
"""Run the DECNET API and Web Dashboard in standalone mode."""
import subprocess # nosec B404
import sys
import os
if daemon:
log.info("API daemonizing host=%s port=%d", host, port)
_daemonize()
log.info("API command invoked host=%s port=%d", host, port)
console.print(f"[green]Starting DECNET API on {host}:{port}...[/]")
_env: dict[str, str] = os.environ.copy()
_env["DECNET_INGEST_LOG_FILE"] = str(log_file)
try:
subprocess.run( # nosec B603 B404
[sys.executable, "-m", "uvicorn", "decnet.web.api:app", "--host", host, "--port", str(port)],
env=_env
)
except KeyboardInterrupt:
pass
except (FileNotFoundError, subprocess.SubprocessError):
console.print("[red]Failed to start API. Ensure 'uvicorn' is installed in the current environment.[/]")
@app.command()
def deploy(
mode: str = typer.Option("unihost", "--mode", "-m", help="Deployment mode: unihost | swarm"),
deckies: Optional[int] = typer.Option(None, "--deckies", "-n", help="Number of deckies to deploy (required without --config)", min=1),
interface: Optional[str] = typer.Option(None, "--interface", "-i", help="Host NIC (auto-detected if omitted)"),
subnet: Optional[str] = typer.Option(None, "--subnet", help="LAN subnet CIDR (auto-detected if omitted)"),
ip_start: Optional[str] = typer.Option(None, "--ip-start", help="First decky IP (auto if omitted)"),
services: Optional[str] = typer.Option(None, "--services", help="Comma-separated services, e.g. ssh,smb,rdp"),
randomize_services: bool = typer.Option(False, "--randomize-services", help="Assign random services to each decky"),
distro: Optional[str] = typer.Option(None, "--distro", help="Comma-separated distro slugs, e.g. debian,ubuntu22,rocky9"),
randomize_distros: bool = typer.Option(False, "--randomize-distros", help="Assign a random distro to each decky"),
log_file: Optional[str] = typer.Option(DECNET_INGEST_LOG_FILE, "--log-file", help="Host path for the collector to write RFC 5424 logs (e.g. /var/log/decnet/decnet.log)"),
archetype_name: Optional[str] = typer.Option(None, "--archetype", "-a", help="Machine archetype slug (e.g. linux-server, windows-workstation)"),
mutate_interval: Optional[int] = typer.Option(30, "--mutate-interval", help="Automatically rotate services every N minutes"),
dry_run: bool = typer.Option(False, "--dry-run", help="Generate compose file without starting containers"),
no_cache: bool = typer.Option(False, "--no-cache", help="Force rebuild all images, ignoring Docker layer cache"),
parallel: bool = typer.Option(False, "--parallel", help="Build all images concurrently (enables BuildKit, separates build from up)"),
ipvlan: bool = typer.Option(False, "--ipvlan", help="Use IPvlan L2 instead of MACVLAN (required on WiFi interfaces)"),
config_file: Optional[str] = typer.Option(None, "--config", "-c", help="Path to INI config file"),
api: bool = typer.Option(False, "--api", help="Start the FastAPI backend to ingest and serve logs"),
api_port: int = typer.Option(8000, "--api-port", help="Port for the backend API"),
daemon: bool = typer.Option(False, "--daemon", help="Detach to background as a daemon process"),
) -> None:
"""Deploy deckies to the LAN."""
import os
if daemon:
log.info("deploy daemonizing mode=%s deckies=%s", mode, deckies)
_daemonize()
log.info("deploy command invoked mode=%s deckies=%s dry_run=%s", mode, deckies, dry_run)
if mode not in ("unihost", "swarm"):
console.print("[red]--mode must be 'unihost' or 'swarm'[/]")
raise typer.Exit(1)
# ------------------------------------------------------------------ #
# Config-file path #
# ------------------------------------------------------------------ #
if config_file:
try:
ini = load_ini(config_file)
except FileNotFoundError as e:
console.print(f"[red]{e}[/]")
raise typer.Exit(1)
iface = interface or ini.interface or detect_interface()
subnet_cidr = subnet or ini.subnet
effective_gateway = ini.gateway
if subnet_cidr is None:
subnet_cidr, effective_gateway = detect_subnet(iface)
elif effective_gateway is None:
_, effective_gateway = detect_subnet(iface)
host_ip = get_host_ip(iface)
console.print(f"[dim]Config:[/] {config_file} [dim]Interface:[/] {iface} "
f"[dim]Subnet:[/] {subnet_cidr} [dim]Gateway:[/] {effective_gateway} "
f"[dim]Host IP:[/] {host_ip}")
if ini.custom_services:
from decnet.custom_service import CustomService
from decnet.services.registry import register_custom_service
for cs in ini.custom_services:
register_custom_service(
CustomService(
name=cs.name,
image=cs.image,
exec_cmd=cs.exec_cmd,
ports=cs.ports,
)
)
effective_log_file = log_file
try:
decky_configs = build_deckies_from_ini(
ini, subnet_cidr, effective_gateway, host_ip, randomize_services, cli_mutate_interval=mutate_interval
)
except ValueError as e:
console.print(f"[red]{e}[/]")
raise typer.Exit(1)
# ------------------------------------------------------------------ #
# Classic CLI path #
# ------------------------------------------------------------------ #
else:
if deckies is None:
console.print("[red]--deckies is required when --config is not used.[/]")
raise typer.Exit(1)
services_list = [s.strip() for s in services.split(",")] if services else None
if services_list:
known = set(all_service_names())
unknown = [s for s in services_list if s not in known]
if unknown:
console.print(f"[red]Unknown service(s): {unknown}. Available: {all_service_names()}[/]")
raise typer.Exit(1)
arch: Archetype | None = None
if archetype_name:
try:
arch = get_archetype(archetype_name)
except ValueError as e:
console.print(f"[red]{e}[/]")
raise typer.Exit(1)
if not services_list and not randomize_services and not arch:
console.print("[red]Specify --services, --archetype, or --randomize-services.[/]")
raise typer.Exit(1)
iface = interface or detect_interface()
if subnet is None:
subnet_cidr, effective_gateway = detect_subnet(iface)
else:
subnet_cidr = subnet
_, effective_gateway = detect_subnet(iface)
host_ip = get_host_ip(iface)
console.print(f"[dim]Interface:[/] {iface} [dim]Subnet:[/] {subnet_cidr} "
f"[dim]Gateway:[/] {effective_gateway} [dim]Host IP:[/] {host_ip}")
distros_list = [d.strip() for d in distro.split(",")] if distro else None
if distros_list:
try:
for slug in distros_list:
get_distro(slug)
except ValueError as e:
console.print(f"[red]{e}[/]")
raise typer.Exit(1)
ips = allocate_ips(subnet_cidr, effective_gateway, host_ip, deckies, ip_start)
decky_configs = build_deckies(
deckies, ips, services_list, randomize_services,
distros_explicit=distros_list, randomize_distros=randomize_distros,
archetype=arch, mutate_interval=mutate_interval,
)
effective_log_file = log_file
if api and not effective_log_file:
effective_log_file = os.path.join(os.getcwd(), "decnet.log")
console.print(f"[cyan]API mode enabled: defaulting log-file to {effective_log_file}[/]")
config = DecnetConfig(
mode=mode,
interface=iface,
subnet=subnet_cidr,
gateway=effective_gateway,
deckies=decky_configs,
log_file=effective_log_file,
ipvlan=ipvlan,
mutate_interval=mutate_interval,
)
log.debug("deploy: config built deckies=%d interface=%s subnet=%s", len(config.deckies), config.interface, config.subnet)
from decnet.engine import deploy as _deploy
_deploy(config, dry_run=dry_run, no_cache=no_cache, parallel=parallel)
if dry_run:
log.info("deploy: dry-run complete, no containers started")
else:
log.info("deploy: deployment complete deckies=%d", len(config.deckies))
if mutate_interval is not None and not dry_run:
import subprocess # nosec B404
import sys
console.print(f"[green]Starting DECNET Mutator watcher in the background (interval: {mutate_interval}m)...[/]")
try:
subprocess.Popen( # nosec B603
[sys.executable, "-m", "decnet.cli", "mutate", "--watch"],
stdout=subprocess.DEVNULL,
stderr=subprocess.STDOUT,
start_new_session=True,
)
except (FileNotFoundError, subprocess.SubprocessError):
console.print("[red]Failed to start mutator watcher.[/]")
if effective_log_file and not dry_run and not api:
import subprocess # nosec B404
import sys
from pathlib import Path as _Path
_collector_err = _Path(effective_log_file).with_suffix(".collector.log")
console.print(f"[bold cyan]Starting log collector[/] → {effective_log_file}")
subprocess.Popen( # nosec B603
[sys.executable, "-m", "decnet.cli", "collect", "--log-file", str(effective_log_file)],
stdin=subprocess.DEVNULL,
stdout=open(_collector_err, "a"),
stderr=subprocess.STDOUT,
start_new_session=True,
)
if api and not dry_run:
import subprocess # nosec B404
import sys
console.print(f"[green]Starting DECNET API on port {api_port}...[/]")
_env: dict[str, str] = os.environ.copy()
_env["DECNET_INGEST_LOG_FILE"] = str(effective_log_file or "")
try:
subprocess.Popen( # nosec B603
[sys.executable, "-m", "uvicorn", "decnet.web.api:app", "--host", DECNET_API_HOST, "--port", str(api_port)],
env=_env,
stdout=subprocess.DEVNULL,
stderr=subprocess.STDOUT
)
console.print(f"[dim]API running at http://{DECNET_API_HOST}:{api_port}[/]")
except (FileNotFoundError, subprocess.SubprocessError):
console.print("[red]Failed to start API. Ensure 'uvicorn' is installed in the current environment.[/]")
if effective_log_file and not dry_run:
import subprocess # nosec B404
import sys
console.print("[bold cyan]Starting DECNET-PROBER[/] (auto-discovers attackers from log stream)")
try:
_prober_args = [
sys.executable, "-m", "decnet.cli", "probe",
"--daemon",
"--log-file", str(effective_log_file),
]
subprocess.Popen( # nosec B603
_prober_args,
stdin=subprocess.DEVNULL,
stdout=subprocess.DEVNULL,
stderr=subprocess.STDOUT,
start_new_session=True,
)
except (FileNotFoundError, subprocess.SubprocessError):
console.print("[red]Failed to start DECNET-PROBER.[/]")
if effective_log_file and not dry_run:
import subprocess # nosec B404
import sys
console.print("[bold cyan]Starting DECNET-PROFILER[/] (builds attacker profiles from log stream)")
try:
subprocess.Popen( # nosec B603
[sys.executable, "-m", "decnet.cli", "profiler", "--daemon"],
stdin=subprocess.DEVNULL,
stdout=subprocess.DEVNULL,
stderr=subprocess.STDOUT,
start_new_session=True,
)
except (FileNotFoundError, subprocess.SubprocessError):
console.print("[red]Failed to start DECNET-PROFILER.[/]")
if effective_log_file and not dry_run:
import subprocess # nosec B404
import sys
console.print("[bold cyan]Starting DECNET-SNIFFER[/] (passive network capture)")
try:
subprocess.Popen( # nosec B603
[sys.executable, "-m", "decnet.cli", "sniffer",
"--daemon",
"--log-file", str(effective_log_file)],
stdin=subprocess.DEVNULL,
stdout=subprocess.DEVNULL,
stderr=subprocess.STDOUT,
start_new_session=True,
)
except (FileNotFoundError, subprocess.SubprocessError):
console.print("[red]Failed to start DECNET-SNIFFER.[/]")
def _is_running(match_fn) -> int | None:
"""Return PID of a running DECNET process matching ``match_fn(cmdline)``, or None."""
import psutil
for proc in psutil.process_iter(["pid", "cmdline"]):
try:
cmd = proc.info["cmdline"]
if cmd and match_fn(cmd):
return proc.info["pid"]
except (psutil.NoSuchProcess, psutil.AccessDenied):
continue
return None
# Each entry: (display_name, detection_fn, launch_args_fn)
# launch_args_fn receives log_file and returns the Popen argv list.
def _service_registry(log_file: str) -> list[tuple[str, callable, list[str]]]:
"""Return the microservice registry for health-check and relaunch."""
import sys
_py = sys.executable
return [
(
"Collector",
lambda cmd: "decnet.cli" in cmd and "collect" in cmd,
[_py, "-m", "decnet.cli", "collect", "--daemon", "--log-file", log_file],
),
(
"Mutator",
lambda cmd: "decnet.cli" in cmd and "mutate" in cmd and "--watch" in cmd,
[_py, "-m", "decnet.cli", "mutate", "--daemon", "--watch"],
),
(
"Prober",
lambda cmd: "decnet.cli" in cmd and "probe" in cmd,
[_py, "-m", "decnet.cli", "probe", "--daemon", "--log-file", log_file],
),
(
"Profiler",
lambda cmd: "decnet.cli" in cmd and "profiler" in cmd,
[_py, "-m", "decnet.cli", "profiler", "--daemon"],
),
(
"Sniffer",
lambda cmd: "decnet.cli" in cmd and "sniffer" in cmd,
[_py, "-m", "decnet.cli", "sniffer", "--daemon", "--log-file", log_file],
),
(
"API",
lambda cmd: "uvicorn" in cmd and "decnet.web.api:app" in cmd,
[_py, "-m", "uvicorn", "decnet.web.api:app",
"--host", DECNET_API_HOST, "--port", str(DECNET_API_PORT)],
),
]
@app.command()
def redeploy(
log_file: str = typer.Option(DECNET_INGEST_LOG_FILE, "--log-file", "-f", help="Path to the DECNET log file"),
) -> None:
"""Check running DECNET services and relaunch any that are down."""
import subprocess # nosec B404
log.info("redeploy: checking services")
registry = _service_registry(str(log_file))
table = Table(title="DECNET Services", show_lines=True)
table.add_column("Service", style="bold cyan")
table.add_column("Status")
table.add_column("PID", style="dim")
table.add_column("Action")
relaunched = 0
for name, match_fn, launch_args in registry:
pid = _is_running(match_fn)
if pid is not None:
table.add_row(name, "[green]UP[/]", str(pid), "")
else:
try:
subprocess.Popen( # nosec B603
launch_args,
stdin=subprocess.DEVNULL,
stdout=subprocess.DEVNULL,
stderr=subprocess.STDOUT,
start_new_session=True,
)
table.add_row(name, "[red]DOWN[/]", "", "[green]relaunched[/]")
relaunched += 1
except (FileNotFoundError, subprocess.SubprocessError) as exc:
table.add_row(name, "[red]DOWN[/]", "", f"[red]failed: {exc}[/]")
console.print(table)
if relaunched:
console.print(f"[green]{relaunched} service(s) relaunched.[/]")
else:
console.print("[green]All services running.[/]")
@app.command()
def probe(
log_file: str = typer.Option(DECNET_INGEST_LOG_FILE, "--log-file", "-f", help="Path for RFC 5424 syslog + .json output (reads attackers from .json, writes results to both)"),
interval: int = typer.Option(300, "--interval", "-i", help="Seconds between probe cycles (default: 300)"),
timeout: float = typer.Option(5.0, "--timeout", help="Per-probe TCP timeout in seconds"),
daemon: bool = typer.Option(False, "--daemon", "-d", help="Detach to background (used by deploy, no console output)"),
) -> None:
"""Fingerprint attackers (JARM + HASSH + TCP/IP stack) discovered in the log stream."""
import asyncio
from decnet.prober import prober_worker
if daemon:
log.info("probe daemonizing log_file=%s interval=%d", log_file, interval)
_daemonize()
asyncio.run(prober_worker(log_file, interval=interval, timeout=timeout))
return
else:
log.info("probe command invoked log_file=%s interval=%d", log_file, interval)
console.print(f"[bold cyan]DECNET-PROBER[/] watching {log_file} for attackers (interval: {interval}s)")
console.print("[dim]Press Ctrl+C to stop[/]")
try:
asyncio.run(prober_worker(log_file, interval=interval, timeout=timeout))
except KeyboardInterrupt:
console.print("\n[yellow]DECNET-PROBER stopped.[/]")
@app.command()
def collect(
log_file: str = typer.Option(DECNET_INGEST_LOG_FILE, "--log-file", "-f", help="Path to write RFC 5424 syslog lines and .json records"),
daemon: bool = typer.Option(False, "--daemon", "-d", help="Detach to background as a daemon process"),
) -> None:
"""Stream Docker logs from all running decky service containers to a log file."""
import asyncio
from decnet.collector import log_collector_worker
if daemon:
log.info("collect daemonizing log_file=%s", log_file)
_daemonize()
log.info("collect command invoked log_file=%s", log_file)
console.print(f"[bold cyan]Collector starting[/] → {log_file}")
asyncio.run(log_collector_worker(log_file))
@app.command()
def mutate(
watch: bool = typer.Option(False, "--watch", "-w", help="Run continuously and mutate deckies according to their interval"),
decky_name: Optional[str] = typer.Option(None, "--decky", help="Force mutate a specific decky immediately"),
force_all: bool = typer.Option(False, "--all", help="Force mutate all deckies immediately"),
daemon: bool = typer.Option(False, "--daemon", "-d", help="Detach to background as a daemon process"),
) -> None:
"""Manually trigger or continuously watch for decky mutation."""
import asyncio
from decnet.mutator import mutate_decky, mutate_all, run_watch_loop
from decnet.web.dependencies import repo
if daemon:
log.info("mutate daemonizing watch=%s", watch)
_daemonize()
async def _run() -> None:
await repo.initialize()
if watch:
await run_watch_loop(repo)
elif decky_name:
await mutate_decky(decky_name, repo)
elif force_all:
await mutate_all(force=True, repo=repo)
else:
await mutate_all(force=False, repo=repo)
asyncio.run(_run())
@app.command()
def status() -> None:
"""Show running deckies and their status."""
log.info("status command invoked")
from decnet.engine import status as _status
_status()
registry = _service_registry(str(DECNET_INGEST_LOG_FILE))
svc_table = Table(title="DECNET Services", show_lines=True)
svc_table.add_column("Service", style="bold cyan")
svc_table.add_column("Status")
svc_table.add_column("PID", style="dim")
for name, match_fn, _launch_args in registry:
pid = _is_running(match_fn)
if pid is not None:
svc_table.add_row(name, "[green]UP[/]", str(pid))
else:
svc_table.add_row(name, "[red]DOWN[/]", "")
console.print(svc_table)
@app.command()
def teardown(
all_: bool = typer.Option(False, "--all", help="Tear down all deckies and remove network"),
id_: Optional[str] = typer.Option(None, "--id", help="Tear down a specific decky by name"),
) -> None:
"""Stop and remove deckies."""
if not all_ and not id_:
console.print("[red]Specify --all or --id <name>.[/]")
raise typer.Exit(1)
log.info("teardown command invoked all=%s id=%s", all_, id_)
from decnet.engine import teardown as _teardown
_teardown(decky_id=id_)
log.info("teardown complete all=%s id=%s", all_, id_)
if all_:
_kill_all_services()
@app.command(name="services")
def list_services() -> None:
"""List all registered honeypot service plugins."""
svcs = all_services()
table = Table(title="Available Services", show_lines=True)
table.add_column("Name", style="bold cyan")
table.add_column("Ports")
table.add_column("Image")
for name, svc in sorted(svcs.items()):
table.add_row(name, ", ".join(str(p) for p in svc.ports), svc.default_image)
console.print(table)
@app.command(name="distros")
def list_distros() -> None:
"""List all available OS distro profiles for deckies."""
table = Table(title="Available Distro Profiles", show_lines=True)
table.add_column("Slug", style="bold cyan")
table.add_column("Display Name")
table.add_column("Docker Image", style="dim")
for slug, profile in sorted(all_distros().items()):
table.add_row(slug, profile.display_name, profile.image)
console.print(table)
@app.command(name="correlate")
def correlate(
log_file: Optional[str] = typer.Option(None, "--log-file", "-f", help="Path to DECNET syslog file to analyse"),
min_deckies: int = typer.Option(2, "--min-deckies", "-m", help="Minimum number of distinct deckies an IP must touch to be reported"),
output: str = typer.Option("table", "--output", "-o", help="Output format: table | json | syslog"),
emit_syslog: bool = typer.Option(False, "--emit-syslog", help="Also print traversal events as RFC 5424 lines (for SIEM piping)"),
daemon: bool = typer.Option(False, "--daemon", "-d", help="Detach to background as a daemon process"),
) -> None:
"""Analyse logs for cross-decky traversals and print the attacker movement graph."""
import sys
import json as _json
from pathlib import Path
from decnet.correlation.engine import CorrelationEngine
if daemon:
log.info("correlate daemonizing log_file=%s", log_file)
_daemonize()
engine = CorrelationEngine()
if log_file:
path = Path(log_file)
if not path.exists():
console.print(f"[red]Log file not found: {log_file}[/]")
raise typer.Exit(1)
engine.ingest_file(path)
elif not sys.stdin.isatty():
for line in sys.stdin:
engine.ingest(line)
else:
console.print("[red]Provide --log-file or pipe log data via stdin.[/]")
raise typer.Exit(1)
traversals = engine.traversals(min_deckies)
if output == "json":
console.print_json(_json.dumps(engine.report_json(min_deckies), indent=2))
elif output == "syslog":
for line in engine.traversal_syslog_lines(min_deckies):
typer.echo(line)
else:
if not traversals:
console.print(
f"[yellow]No traversals detected "
f"(min_deckies={min_deckies}, events_indexed={engine.events_indexed}).[/]"
)
else:
console.print(engine.report_table(min_deckies))
console.print(
f"[dim]Parsed {engine.lines_parsed} lines · "
f"indexed {engine.events_indexed} events · "
f"{len(engine.all_attackers())} unique IPs · "
f"[bold]{len(traversals)}[/] traversal(s)[/]"
)
if emit_syslog:
for line in engine.traversal_syslog_lines(min_deckies):
typer.echo(line)
@app.command(name="archetypes")
def list_archetypes() -> None:
"""List all machine archetype profiles."""
table = Table(title="Machine Archetypes", show_lines=True)
table.add_column("Slug", style="bold cyan")
table.add_column("Display Name")
table.add_column("Default Services", style="green")
table.add_column("Description", style="dim")
for slug, arch in sorted(all_archetypes().items()):
table.add_row(
slug,
arch.display_name,
", ".join(arch.services),
arch.description,
)
console.print(table)
@app.command(name="web")
def serve_web(
web_port: int = typer.Option(DECNET_WEB_PORT, "--web-port", help="Port to serve the DECNET Web Dashboard"),
host: str = typer.Option(DECNET_WEB_HOST, "--host", help="Host IP to serve the Web Dashboard"),
api_port: int = typer.Option(DECNET_API_PORT, "--api-port", help="Port the DECNET API is listening on"),
daemon: bool = typer.Option(False, "--daemon", "-d", help="Detach to background as a daemon process"),
) -> None:
"""Serve the DECNET Web Dashboard frontend.
Proxies /api/* requests to the API server so the frontend can use
relative URLs (/api/v1/...) with no CORS configuration required.
"""
import http.client
import http.server
import socketserver
from pathlib import Path
dist_dir = Path(__file__).parent.parent / "decnet_web" / "dist"
if not dist_dir.exists():
console.print(f"[red]Frontend build not found at {dist_dir}. Make sure you run 'npm run build' inside 'decnet_web'.[/]")
raise typer.Exit(1)
if daemon:
log.info("web daemonizing host=%s port=%d api_port=%d", host, web_port, api_port)
_daemonize()
_api_port = api_port
class SPAHTTPRequestHandler(http.server.SimpleHTTPRequestHandler):
def do_GET(self):
if self.path.startswith("/api/"):
self._proxy("GET")
return
path = self.translate_path(self.path)
if not Path(path).exists() or Path(path).is_dir():
self.path = "/index.html"
return super().do_GET()
def do_POST(self):
if self.path.startswith("/api/"):
self._proxy("POST")
return
self.send_error(405)
def do_PUT(self):
if self.path.startswith("/api/"):
self._proxy("PUT")
return
self.send_error(405)
def do_DELETE(self):
if self.path.startswith("/api/"):
self._proxy("DELETE")
return
self.send_error(405)
def _proxy(self, method: str) -> None:
content_length = int(self.headers.get("Content-Length", 0))
body = self.rfile.read(content_length) if content_length else None
forward = {k: v for k, v in self.headers.items()
if k.lower() not in ("host", "connection")}
try:
conn = http.client.HTTPConnection("127.0.0.1", _api_port, timeout=120)
conn.request(method, self.path, body=body, headers=forward)
resp = conn.getresponse()
self.send_response(resp.status)
for key, val in resp.getheaders():
if key.lower() not in ("connection", "transfer-encoding"):
self.send_header(key, val)
self.end_headers()
# Disable socket timeout for SSE streams — they are
# long-lived by design and the 120s timeout would kill them.
content_type = resp.getheader("Content-Type", "")
if "text/event-stream" in content_type:
conn.sock.settimeout(None)
# read1() returns bytes immediately available in the buffer
# without blocking for more. Plain read(4096) waits until
# 4096 bytes accumulate — fatal for SSE where each event
# is only ~100-500 bytes.
_read = getattr(resp, "read1", resp.read)
while True:
chunk = _read(4096)
if not chunk:
break
self.wfile.write(chunk)
self.wfile.flush()
except Exception as exc:
log.warning("web proxy error %s %s: %s", method, self.path, exc)
self.send_error(502, f"API proxy error: {exc}")
finally:
try:
conn.close()
except Exception: # nosec B110 — best-effort conn cleanup
pass
def log_message(self, fmt: str, *args: object) -> None:
log.debug("web %s", fmt % args)
import os
os.chdir(dist_dir)
socketserver.TCPServer.allow_reuse_address = True
with socketserver.ThreadingTCPServer((host, web_port), SPAHTTPRequestHandler) as httpd:
console.print(f"[green]Serving DECNET Web Dashboard on http://{host}:{web_port}[/]")
console.print(f"[dim]Proxying /api/* → http://127.0.0.1:{_api_port}[/]")
try:
httpd.serve_forever()
except KeyboardInterrupt:
console.print("\n[dim]Shutting down dashboard server.[/]")
@app.command(name="profiler")
def profiler_cmd(
interval: int = typer.Option(30, "--interval", "-i", help="Seconds between profile rebuild cycles"),
daemon: bool = typer.Option(False, "--daemon", "-d", help="Detach to background as a daemon process"),
) -> None:
"""Run the attacker profiler as a standalone microservice."""
import asyncio
from decnet.profiler import attacker_profile_worker
from decnet.web.dependencies import repo
if daemon:
log.info("profiler daemonizing interval=%d", interval)
_daemonize()
log.info("profiler starting interval=%d", interval)
console.print(f"[bold cyan]Profiler starting[/] (interval: {interval}s)")
async def _run() -> None:
await repo.initialize()
await attacker_profile_worker(repo, interval=interval)
try:
asyncio.run(_run())
except KeyboardInterrupt:
console.print("\n[yellow]Profiler stopped.[/]")
@app.command(name="sniffer")
def sniffer_cmd(
log_file: str = typer.Option(DECNET_INGEST_LOG_FILE, "--log-file", "-f", help="Path to write captured syslog + JSON records"),
daemon: bool = typer.Option(False, "--daemon", "-d", help="Detach to background as a daemon process"),
) -> None:
"""Run the network sniffer as a standalone microservice."""
import asyncio
from decnet.sniffer import sniffer_worker
if daemon:
log.info("sniffer daemonizing log_file=%s", log_file)
_daemonize()
log.info("sniffer starting log_file=%s", log_file)
console.print(f"[bold cyan]Sniffer starting[/] → {log_file}")
try:
asyncio.run(sniffer_worker(log_file))
except KeyboardInterrupt:
console.print("\n[yellow]Sniffer stopped.[/]")
_DB_RESET_TABLES: tuple[str, ...] = (
# Order matters for DROP TABLE: attacker_behavior FK-references attackers.
"attacker_behavior",
"attackers",
"logs",
"bounty",
"state",
"users",
)
async def _db_reset_mysql_async(dsn: str, mode: str, confirm: bool) -> None:
"""Inspect + (optionally) wipe a MySQL database. Pulled out of the CLI
wrapper so tests can drive it without spawning a Typer runner."""
from urllib.parse import urlparse
from sqlalchemy import text
from sqlalchemy.ext.asyncio import create_async_engine
db_name = urlparse(dsn).path.lstrip("/") or "(default)"
engine = create_async_engine(dsn)
try:
# Collect current row counts per table. Missing tables yield -1.
rows: dict[str, int] = {}
async with engine.connect() as conn:
for tbl in _DB_RESET_TABLES:
try:
result = await conn.execute(text(f"SELECT COUNT(*) FROM `{tbl}`")) # nosec B608
rows[tbl] = result.scalar() or 0
except Exception: # noqa: BLE001 — ProgrammingError for missing table varies by driver
rows[tbl] = -1
summary = Table(title=f"DECNET MySQL reset — database `{db_name}` (mode={mode})")
summary.add_column("Table", style="cyan")
summary.add_column("Rows", justify="right")
for tbl, count in rows.items():
summary.add_row(tbl, "[dim]missing[/]" if count < 0 else f"{count:,}")
console.print(summary)
if not confirm:
console.print(
"[yellow]Dry-run only. Re-run with [bold]--i-know-what-im-doing[/] "
"to actually execute.[/]"
)
return
# Destructive phase. FK checks off so TRUNCATE/DROP works in any order.
async with engine.begin() as conn:
await conn.execute(text("SET FOREIGN_KEY_CHECKS = 0"))
for tbl in _DB_RESET_TABLES:
if rows.get(tbl, -1) < 0:
continue # skip absent tables silently
if mode == "truncate":
await conn.execute(text(f"TRUNCATE TABLE `{tbl}`"))
console.print(f"[green]✓ TRUNCATE {tbl}[/]")
else: # drop-tables
await conn.execute(text(f"DROP TABLE `{tbl}`"))
console.print(f"[green]✓ DROP TABLE {tbl}[/]")
await conn.execute(text("SET FOREIGN_KEY_CHECKS = 1"))
console.print(f"[bold green]Done. Database `{db_name}` reset ({mode}).[/]")
finally:
await engine.dispose()
@app.command(name="db-reset")
def db_reset(
i_know: bool = typer.Option(
False,
"--i-know-what-im-doing",
help="Required to actually execute. Without it, the command runs in dry-run mode.",
),
mode: str = typer.Option(
"truncate",
"--mode",
help="truncate (wipe rows, keep schema) | drop-tables (DROP TABLE for each DECNET table)",
),
url: Optional[str] = typer.Option(
None,
"--url",
help="Override DECNET_DB_URL for this invocation (e.g. when cleanup needs admin creds).",
),
) -> None:
"""Wipe the MySQL database used by the DECNET dashboard.
Destructive. Runs dry by default — pass --i-know-what-im-doing to commit.
Only supported against MySQL; refuses to operate on SQLite.
"""
import asyncio
import os
if mode not in ("truncate", "drop-tables"):
console.print(f"[red]Invalid --mode '{mode}'. Expected: truncate | drop-tables.[/]")
raise typer.Exit(2)
db_type = os.environ.get("DECNET_DB_TYPE", "sqlite").lower()
if db_type != "mysql":
console.print(
f"[red]db-reset is MySQL-only (DECNET_DB_TYPE='{db_type}'). "
f"For SQLite, just delete the decnet.db file.[/]"
)
raise typer.Exit(2)
dsn = url or os.environ.get("DECNET_DB_URL")
if not dsn:
# Fall back to component env vars (DECNET_DB_HOST/PORT/NAME/USER/PASSWORD).
from decnet.web.db.mysql.database import build_mysql_url
try:
dsn = build_mysql_url()
except ValueError as e:
console.print(f"[red]{e}[/]")
raise typer.Exit(2) from e
log.info("db-reset invoked mode=%s confirm=%s", mode, i_know)
try:
asyncio.run(_db_reset_mysql_async(dsn, mode=mode, confirm=i_know))
except Exception as e: # noqa: BLE001
console.print(f"[red]db-reset failed: {e}[/]")
raise typer.Exit(1) from e
if __name__ == '__main__': # pragma: no cover
app()

80
decnet/cli/__init__.py Normal file
View File

@@ -0,0 +1,80 @@
"""
DECNET CLI — entry point for all commands.
Usage:
decnet deploy --mode unihost --deckies 5 --randomize-services
decnet status
decnet teardown [--all | --id decky-01]
decnet services
Layout: each command module exports ``register(app)`` which attaches its
commands to the passed Typer app. ``__init__.py`` builds the root app,
calls every module's ``register`` in order, then runs the master-only
gate. The gate must fire LAST so it sees the fully-populated dispatch
table before filtering.
"""
from __future__ import annotations
import typer
from . import (
agent,
api,
db,
deploy,
forwarder,
inventory,
lifecycle,
listener,
profiler,
sniffer,
swarm,
swarmctl,
updater,
web,
workers,
)
from .gating import _gate_commands_by_mode
from .utils import console as console, log as log
app = typer.Typer(
name="decnet",
help="Deploy a deception network of honeypot deckies on your LAN.",
no_args_is_help=True,
)
# Order matches the old flat layout so `decnet --help` reads the same.
for _mod in (
api, swarmctl, agent, updater, listener, forwarder,
swarm,
deploy, lifecycle, workers, inventory,
web, profiler, sniffer, db,
):
_mod.register(app)
_gate_commands_by_mode(app)
# Backwards-compat re-exports. Tests and third-party tooling import these
# directly from ``decnet.cli``; the refactor must keep them resolvable.
from .db import _db_reset_mysql_async # noqa: E402,F401
from .gating import ( # noqa: E402,F401
MASTER_ONLY_COMMANDS,
MASTER_ONLY_GROUPS,
_agent_mode_active,
_require_master_mode,
)
from .utils import ( # noqa: E402,F401
_daemonize,
_http_request,
_is_running,
_kill_all_services,
_pid_dir,
_service_registry,
_spawn_detached,
_swarmctl_base_url,
)
if __name__ == "__main__": # pragma: no cover
app()

64
decnet/cli/agent.py Normal file
View File

@@ -0,0 +1,64 @@
from __future__ import annotations
import os
import pathlib as _pathlib
import sys as _sys
from typing import Optional
import typer
from . import utils as _utils
from .utils import console, log
def register(app: typer.Typer) -> None:
@app.command()
def agent(
port: int = typer.Option(8765, "--port", help="Port for the worker agent"),
host: str = typer.Option("0.0.0.0", "--host", help="Bind address for the worker agent"), # nosec B104
agent_dir: Optional[str] = typer.Option(None, "--agent-dir", help="Worker cert bundle dir (default: ~/.decnet/agent, expanded under the running user's HOME — set this when running as sudo/root)"),
daemon: bool = typer.Option(False, "--daemon", "-d", help="Detach to background as a daemon process"),
no_forwarder: bool = typer.Option(False, "--no-forwarder", help="Do not auto-spawn the log forwarder alongside the agent"),
) -> None:
"""Run the DECNET SWARM worker agent (requires a cert bundle in ~/.decnet/agent/).
By default, `decnet agent` auto-spawns `decnet forwarder` as a fully-
detached sibling process so worker logs start flowing to the master
without a second manual invocation. The forwarder survives agent
restarts and crashes — if it dies on its own, restart it manually
with `decnet forwarder --daemon …`. Pass --no-forwarder to skip.
"""
from decnet.agent import server as _agent_server
from decnet.env import DECNET_SWARM_MASTER_HOST, DECNET_INGEST_LOG_FILE
from decnet.swarm import pki as _pki
resolved_dir = _pathlib.Path(agent_dir) if agent_dir else _pki.DEFAULT_AGENT_DIR
if daemon:
log.info("agent daemonizing host=%s port=%d", host, port)
_utils._daemonize()
if not no_forwarder and DECNET_SWARM_MASTER_HOST:
fw_argv = [
_sys.executable, "-m", "decnet", "forwarder",
"--master-host", DECNET_SWARM_MASTER_HOST,
"--master-port", str(int(os.environ.get("DECNET_SWARM_SYSLOG_PORT", "6514"))),
"--agent-dir", str(resolved_dir),
"--log-file", str(DECNET_INGEST_LOG_FILE),
"--daemon",
]
try:
pid = _utils._spawn_detached(fw_argv, _utils._pid_dir() / "forwarder.pid")
log.info("agent auto-spawned forwarder pid=%d master=%s", pid, DECNET_SWARM_MASTER_HOST)
console.print(f"[dim]Auto-spawned forwarder (pid {pid}) → {DECNET_SWARM_MASTER_HOST}.[/]")
except Exception as e: # noqa: BLE001
log.warning("agent could not auto-spawn forwarder: %s", e)
console.print(f"[yellow]forwarder auto-spawn skipped: {e}[/]")
elif not no_forwarder:
log.info("agent skipping forwarder auto-spawn (DECNET_SWARM_MASTER_HOST unset)")
log.info("agent command invoked host=%s port=%d dir=%s", host, port, resolved_dir)
console.print(f"[green]Starting DECNET worker agent on {host}:{port} (mTLS)...[/]")
rc = _agent_server.run(host, port, agent_dir=resolved_dir)
if rc != 0:
raise typer.Exit(rc)

53
decnet/cli/api.py Normal file
View File

@@ -0,0 +1,53 @@
from __future__ import annotations
import os
import signal
import subprocess # nosec B404
import sys
import typer
from decnet.env import DECNET_API_HOST, DECNET_API_PORT, DECNET_INGEST_LOG_FILE
from . import utils as _utils
from .gating import _require_master_mode
from .utils import console, log
def register(app: typer.Typer) -> None:
@app.command()
def api(
port: int = typer.Option(DECNET_API_PORT, "--port", help="Port for the backend API"),
host: str = typer.Option(DECNET_API_HOST, "--host", help="Host IP for the backend API"),
log_file: str = typer.Option(DECNET_INGEST_LOG_FILE, "--log-file", help="Path to the DECNET log file to monitor"),
daemon: bool = typer.Option(False, "--daemon", "-d", help="Detach to background as a daemon process"),
workers: int = typer.Option(1, "--workers", "-w", min=1, help="Number of uvicorn worker processes"),
) -> None:
"""Run the DECNET API and Web Dashboard in standalone mode."""
_require_master_mode("api")
if daemon:
log.info("API daemonizing host=%s port=%d workers=%d", host, port, workers)
_utils._daemonize()
log.info("API command invoked host=%s port=%d workers=%d", host, port, workers)
console.print(f"[green]Starting DECNET API on {host}:{port} (workers={workers})...[/]")
_env: dict[str, str] = os.environ.copy()
_env["DECNET_INGEST_LOG_FILE"] = str(log_file)
_cmd = [sys.executable, "-m", "uvicorn", "decnet.web.api:app",
"--host", host, "--port", str(port), "--workers", str(workers)]
try:
proc = subprocess.Popen(_cmd, env=_env, start_new_session=True) # nosec B603 B404
try:
proc.wait()
except KeyboardInterrupt:
try:
os.killpg(proc.pid, signal.SIGTERM)
try:
proc.wait(timeout=10)
except subprocess.TimeoutExpired:
os.killpg(proc.pid, signal.SIGKILL)
proc.wait()
except ProcessLookupError:
pass
except (FileNotFoundError, subprocess.SubprocessError):
console.print("[red]Failed to start API. Ensure 'uvicorn' is installed in the current environment.[/]")

130
decnet/cli/db.py Normal file
View File

@@ -0,0 +1,130 @@
from __future__ import annotations
from typing import Optional
import typer
from rich.table import Table
from .utils import console, log
_DB_RESET_TABLES: tuple[str, ...] = (
# Order matters for DROP TABLE: child FKs first.
# - attacker_behavior FK-references attackers.
# - decky_shards FK-references swarm_hosts.
"attacker_behavior",
"attackers",
"logs",
"bounty",
"state",
"users",
"decky_shards",
"swarm_hosts",
)
async def _db_reset_mysql_async(dsn: str, mode: str, confirm: bool) -> None:
"""Inspect + (optionally) wipe a MySQL database. Pulled out of the CLI
wrapper so tests can drive it without spawning a Typer runner."""
from urllib.parse import urlparse
from sqlalchemy import text
from sqlalchemy.ext.asyncio import create_async_engine
db_name = urlparse(dsn).path.lstrip("/") or "(default)"
engine = create_async_engine(dsn)
try:
rows: dict[str, int] = {}
async with engine.connect() as conn:
for tbl in _DB_RESET_TABLES:
try:
result = await conn.execute(text(f"SELECT COUNT(*) FROM `{tbl}`")) # nosec B608
rows[tbl] = result.scalar() or 0
except Exception: # noqa: BLE001 — ProgrammingError for missing table varies by driver
rows[tbl] = -1
summary = Table(title=f"DECNET MySQL reset — database `{db_name}` (mode={mode})")
summary.add_column("Table", style="cyan")
summary.add_column("Rows", justify="right")
for tbl, count in rows.items():
summary.add_row(tbl, "[dim]missing[/]" if count < 0 else f"{count:,}")
console.print(summary)
if not confirm:
console.print(
"[yellow]Dry-run only. Re-run with [bold]--i-know-what-im-doing[/] "
"to actually execute.[/]"
)
return
async with engine.begin() as conn:
await conn.execute(text("SET FOREIGN_KEY_CHECKS = 0"))
for tbl in _DB_RESET_TABLES:
if rows.get(tbl, -1) < 0:
continue
if mode == "truncate":
await conn.execute(text(f"TRUNCATE TABLE `{tbl}`"))
console.print(f"[green]✓ TRUNCATE {tbl}[/]")
else:
await conn.execute(text(f"DROP TABLE `{tbl}`"))
console.print(f"[green]✓ DROP TABLE {tbl}[/]")
await conn.execute(text("SET FOREIGN_KEY_CHECKS = 1"))
console.print(f"[bold green]Done. Database `{db_name}` reset ({mode}).[/]")
finally:
await engine.dispose()
def register(app: typer.Typer) -> None:
@app.command(name="db-reset")
def db_reset(
i_know: bool = typer.Option(
False,
"--i-know-what-im-doing",
help="Required to actually execute. Without it, the command runs in dry-run mode.",
),
mode: str = typer.Option(
"truncate",
"--mode",
help="truncate (wipe rows, keep schema) | drop-tables (DROP TABLE for each DECNET table)",
),
url: Optional[str] = typer.Option(
None,
"--url",
help="Override DECNET_DB_URL for this invocation (e.g. when cleanup needs admin creds).",
),
) -> None:
"""Wipe the MySQL database used by the DECNET dashboard.
Destructive. Runs dry by default — pass --i-know-what-im-doing to commit.
Only supported against MySQL; refuses to operate on SQLite.
"""
import asyncio
import os
if mode not in ("truncate", "drop-tables"):
console.print(f"[red]Invalid --mode '{mode}'. Expected: truncate | drop-tables.[/]")
raise typer.Exit(2)
db_type = os.environ.get("DECNET_DB_TYPE", "sqlite").lower()
if db_type != "mysql":
console.print(
f"[red]db-reset is MySQL-only (DECNET_DB_TYPE='{db_type}'). "
f"For SQLite, just delete the decnet.db file.[/]"
)
raise typer.Exit(2)
dsn = url or os.environ.get("DECNET_DB_URL")
if not dsn:
from decnet.web.db.mysql.database import build_mysql_url
try:
dsn = build_mysql_url()
except ValueError as e:
console.print(f"[red]{e}[/]")
raise typer.Exit(2) from e
log.info("db-reset invoked mode=%s confirm=%s", mode, i_know)
try:
asyncio.run(_db_reset_mysql_async(dsn, mode=mode, confirm=i_know))
except Exception as e: # noqa: BLE001
console.print(f"[red]db-reset failed: {e}[/]")
raise typer.Exit(1) from e

307
decnet/cli/deploy.py Normal file
View File

@@ -0,0 +1,307 @@
from __future__ import annotations
from typing import Optional
import typer
from rich.table import Table
from decnet.archetypes import Archetype, get_archetype
from decnet.config import DecnetConfig
from decnet.distros import get_distro
from decnet.env import DECNET_API_HOST, DECNET_INGEST_LOG_FILE
from decnet.fleet import all_service_names, build_deckies, build_deckies_from_ini
from decnet.ini_loader import load_ini
from decnet.network import detect_interface, detect_subnet, allocate_ips, get_host_ip
from . import utils as _utils
from .gating import _require_master_mode
from .utils import console, log
def _deploy_swarm(config: "DecnetConfig", *, dry_run: bool, no_cache: bool) -> None:
"""Shard deckies round-robin across enrolled workers and POST to swarmctl."""
base = _utils._swarmctl_base_url(None)
resp = _utils._http_request("GET", base + "/swarm/hosts?host_status=enrolled")
enrolled = resp.json()
resp2 = _utils._http_request("GET", base + "/swarm/hosts?host_status=active")
active = resp2.json()
workers = [*enrolled, *active]
if not workers:
console.print("[red]No enrolled workers — run `decnet swarm enroll ...` first.[/]")
raise typer.Exit(1)
assigned: list = []
for idx, d in enumerate(config.deckies):
target = workers[idx % len(workers)]
assigned.append(d.model_copy(update={"host_uuid": target["uuid"]}))
config = config.model_copy(update={"deckies": assigned})
body = {"config": config.model_dump(mode="json"), "dry_run": dry_run, "no_cache": no_cache}
console.print(f"[cyan]Dispatching {len(config.deckies)} deckies across {len(workers)} worker(s)...[/]")
resp3 = _utils._http_request("POST", base + "/swarm/deploy", json_body=body, timeout=900.0)
results = resp3.json().get("results", [])
table = Table(title="SWARM deploy results")
for col in ("worker", "host_uuid", "ok", "detail"):
table.add_column(col)
any_failed = False
for r in results:
ok = bool(r.get("ok"))
if not ok:
any_failed = True
detail = r.get("detail")
if isinstance(detail, dict):
detail = detail.get("status") or "ok"
table.add_row(
str(r.get("host_name") or ""),
str(r.get("host_uuid") or ""),
"[green]yes[/]" if ok else "[red]no[/]",
str(detail)[:80],
)
console.print(table)
if any_failed:
raise typer.Exit(1)
def register(app: typer.Typer) -> None:
@app.command()
def deploy(
mode: str = typer.Option("unihost", "--mode", "-m", help="Deployment mode: unihost | swarm"),
deckies: Optional[int] = typer.Option(None, "--deckies", "-n", help="Number of deckies to deploy (required without --config)", min=1),
interface: Optional[str] = typer.Option(None, "--interface", "-i", help="Host NIC (auto-detected if omitted)"),
subnet: Optional[str] = typer.Option(None, "--subnet", help="LAN subnet CIDR (auto-detected if omitted)"),
ip_start: Optional[str] = typer.Option(None, "--ip-start", help="First decky IP (auto if omitted)"),
services: Optional[str] = typer.Option(None, "--services", help="Comma-separated services, e.g. ssh,smb,rdp"),
randomize_services: bool = typer.Option(False, "--randomize-services", help="Assign random services to each decky"),
distro: Optional[str] = typer.Option(None, "--distro", help="Comma-separated distro slugs, e.g. debian,ubuntu22,rocky9"),
randomize_distros: bool = typer.Option(False, "--randomize-distros", help="Assign a random distro to each decky"),
log_file: Optional[str] = typer.Option(DECNET_INGEST_LOG_FILE, "--log-file", help="Host path for the collector to write RFC 5424 logs (e.g. /var/log/decnet/decnet.log)"),
archetype_name: Optional[str] = typer.Option(None, "--archetype", "-a", help="Machine archetype slug (e.g. linux-server, windows-workstation)"),
mutate_interval: Optional[int] = typer.Option(30, "--mutate-interval", help="Automatically rotate services every N minutes"),
dry_run: bool = typer.Option(False, "--dry-run", help="Generate compose file without starting containers"),
no_cache: bool = typer.Option(False, "--no-cache", help="Force rebuild all images, ignoring Docker layer cache"),
parallel: bool = typer.Option(False, "--parallel", help="Build all images concurrently (enables BuildKit, separates build from up)"),
ipvlan: bool = typer.Option(False, "--ipvlan", help="Use IPvlan L2 instead of MACVLAN (required on WiFi interfaces)"),
config_file: Optional[str] = typer.Option(None, "--config", "-c", help="Path to INI config file"),
api: bool = typer.Option(False, "--api", help="Start the FastAPI backend to ingest and serve logs"),
api_port: int = typer.Option(8000, "--api-port", help="Port for the backend API"),
daemon: bool = typer.Option(False, "--daemon", help="Detach to background as a daemon process"),
) -> None:
"""Deploy deckies to the LAN."""
import os
import subprocess # nosec B404
import sys
from pathlib import Path as _Path
_require_master_mode("deploy")
if daemon:
log.info("deploy daemonizing mode=%s deckies=%s", mode, deckies)
_utils._daemonize()
log.info("deploy command invoked mode=%s deckies=%s dry_run=%s", mode, deckies, dry_run)
if mode not in ("unihost", "swarm"):
console.print("[red]--mode must be 'unihost' or 'swarm'[/]")
raise typer.Exit(1)
if config_file:
try:
ini = load_ini(config_file)
except FileNotFoundError as e:
console.print(f"[red]{e}[/]")
raise typer.Exit(1)
iface = interface or ini.interface or detect_interface()
subnet_cidr = subnet or ini.subnet
effective_gateway = ini.gateway
if subnet_cidr is None:
subnet_cidr, effective_gateway = detect_subnet(iface)
elif effective_gateway is None:
_, effective_gateway = detect_subnet(iface)
host_ip = get_host_ip(iface)
console.print(f"[dim]Config:[/] {config_file} [dim]Interface:[/] {iface} "
f"[dim]Subnet:[/] {subnet_cidr} [dim]Gateway:[/] {effective_gateway} "
f"[dim]Host IP:[/] {host_ip}")
if ini.custom_services:
from decnet.custom_service import CustomService
from decnet.services.registry import register_custom_service
for cs in ini.custom_services:
register_custom_service(
CustomService(
name=cs.name,
image=cs.image,
exec_cmd=cs.exec_cmd,
ports=cs.ports,
)
)
effective_log_file = log_file
try:
decky_configs = build_deckies_from_ini(
ini, subnet_cidr, effective_gateway, host_ip, randomize_services, cli_mutate_interval=mutate_interval
)
except ValueError as e:
console.print(f"[red]{e}[/]")
raise typer.Exit(1)
else:
if deckies is None:
console.print("[red]--deckies is required when --config is not used.[/]")
raise typer.Exit(1)
services_list = [s.strip() for s in services.split(",")] if services else None
if services_list:
known = set(all_service_names())
unknown = [s for s in services_list if s not in known]
if unknown:
console.print(f"[red]Unknown service(s): {unknown}. Available: {all_service_names()}[/]")
raise typer.Exit(1)
arch: Archetype | None = None
if archetype_name:
try:
arch = get_archetype(archetype_name)
except ValueError as e:
console.print(f"[red]{e}[/]")
raise typer.Exit(1)
if not services_list and not randomize_services and not arch:
console.print("[red]Specify --services, --archetype, or --randomize-services.[/]")
raise typer.Exit(1)
iface = interface or detect_interface()
if subnet is None:
subnet_cidr, effective_gateway = detect_subnet(iface)
else:
subnet_cidr = subnet
_, effective_gateway = detect_subnet(iface)
host_ip = get_host_ip(iface)
console.print(f"[dim]Interface:[/] {iface} [dim]Subnet:[/] {subnet_cidr} "
f"[dim]Gateway:[/] {effective_gateway} [dim]Host IP:[/] {host_ip}")
distros_list = [d.strip() for d in distro.split(",")] if distro else None
if distros_list:
try:
for slug in distros_list:
get_distro(slug)
except ValueError as e:
console.print(f"[red]{e}[/]")
raise typer.Exit(1)
ips = allocate_ips(subnet_cidr, effective_gateway, host_ip, deckies, ip_start)
decky_configs = build_deckies(
deckies, ips, services_list, randomize_services,
distros_explicit=distros_list, randomize_distros=randomize_distros,
archetype=arch, mutate_interval=mutate_interval,
)
effective_log_file = log_file
if api and not effective_log_file:
effective_log_file = os.path.join(os.getcwd(), "decnet.log")
console.print(f"[cyan]API mode enabled: defaulting log-file to {effective_log_file}[/]")
config = DecnetConfig(
mode=mode,
interface=iface,
subnet=subnet_cidr,
gateway=effective_gateway,
deckies=decky_configs,
log_file=effective_log_file,
ipvlan=ipvlan,
mutate_interval=mutate_interval,
)
log.debug("deploy: config built deckies=%d interface=%s subnet=%s", len(config.deckies), config.interface, config.subnet)
if mode == "swarm":
_deploy_swarm(config, dry_run=dry_run, no_cache=no_cache)
if dry_run:
log.info("deploy: swarm dry-run complete, no workers dispatched")
else:
log.info("deploy: swarm deployment complete deckies=%d", len(config.deckies))
return
from decnet.engine import deploy as _deploy
_deploy(config, dry_run=dry_run, no_cache=no_cache, parallel=parallel)
if dry_run:
log.info("deploy: dry-run complete, no containers started")
else:
log.info("deploy: deployment complete deckies=%d", len(config.deckies))
if mutate_interval is not None and not dry_run:
console.print(f"[green]Starting DECNET Mutator watcher in the background (interval: {mutate_interval}m)...[/]")
try:
subprocess.Popen( # nosec B603
[sys.executable, "-m", "decnet.cli", "mutate", "--watch"],
stdout=subprocess.DEVNULL,
stderr=subprocess.STDOUT,
start_new_session=True,
)
except (FileNotFoundError, subprocess.SubprocessError):
console.print("[red]Failed to start mutator watcher.[/]")
if effective_log_file and not dry_run and not api:
_collector_err = _Path(effective_log_file).with_suffix(".collector.log")
console.print(f"[bold cyan]Starting log collector[/] → {effective_log_file}")
subprocess.Popen( # nosec B603
[sys.executable, "-m", "decnet.cli", "collect", "--log-file", str(effective_log_file)],
stdin=subprocess.DEVNULL,
stdout=open(_collector_err, "a"),
stderr=subprocess.STDOUT,
start_new_session=True,
)
if api and not dry_run:
console.print(f"[green]Starting DECNET API on port {api_port}...[/]")
_env: dict[str, str] = os.environ.copy()
_env["DECNET_INGEST_LOG_FILE"] = str(effective_log_file or "")
try:
subprocess.Popen( # nosec B603
[sys.executable, "-m", "uvicorn", "decnet.web.api:app", "--host", DECNET_API_HOST, "--port", str(api_port)],
env=_env,
stdout=subprocess.DEVNULL,
stderr=subprocess.STDOUT
)
console.print(f"[dim]API running at http://{DECNET_API_HOST}:{api_port}[/]")
except (FileNotFoundError, subprocess.SubprocessError):
console.print("[red]Failed to start API. Ensure 'uvicorn' is installed in the current environment.[/]")
if effective_log_file and not dry_run:
console.print("[bold cyan]Starting DECNET-PROBER[/] (auto-discovers attackers from log stream)")
try:
subprocess.Popen( # nosec B603
[sys.executable, "-m", "decnet.cli", "probe", "--daemon", "--log-file", str(effective_log_file)],
stdin=subprocess.DEVNULL,
stdout=subprocess.DEVNULL,
stderr=subprocess.STDOUT,
start_new_session=True,
)
except (FileNotFoundError, subprocess.SubprocessError):
console.print("[red]Failed to start DECNET-PROBER.[/]")
if effective_log_file and not dry_run:
console.print("[bold cyan]Starting DECNET-PROFILER[/] (builds attacker profiles from log stream)")
try:
subprocess.Popen( # nosec B603
[sys.executable, "-m", "decnet.cli", "profiler", "--daemon"],
stdin=subprocess.DEVNULL,
stdout=subprocess.DEVNULL,
stderr=subprocess.STDOUT,
start_new_session=True,
)
except (FileNotFoundError, subprocess.SubprocessError):
console.print("[red]Failed to start DECNET-PROFILER.[/]")
if effective_log_file and not dry_run:
console.print("[bold cyan]Starting DECNET-SNIFFER[/] (passive network capture)")
try:
subprocess.Popen( # nosec B603
[sys.executable, "-m", "decnet.cli", "sniffer", "--daemon", "--log-file", str(effective_log_file)],
stdin=subprocess.DEVNULL,
stdout=subprocess.DEVNULL,
stderr=subprocess.STDOUT,
start_new_session=True,
)
except (FileNotFoundError, subprocess.SubprocessError):
console.print("[red]Failed to start DECNET-SNIFFER.[/]")

74
decnet/cli/forwarder.py Normal file
View File

@@ -0,0 +1,74 @@
from __future__ import annotations
import asyncio
import pathlib
import signal
from typing import Optional
import typer
from decnet.env import DECNET_INGEST_LOG_FILE
from . import utils as _utils
from .utils import console, log
def register(app: typer.Typer) -> None:
@app.command()
def forwarder(
master_host: Optional[str] = typer.Option(None, "--master-host", help="Master listener hostname/IP (default: $DECNET_SWARM_MASTER_HOST)"),
master_port: int = typer.Option(6514, "--master-port", help="Master listener TCP port (RFC 5425 default 6514)"),
log_file: Optional[str] = typer.Option(DECNET_INGEST_LOG_FILE, "--log-file", help="Local RFC 5424 file to tail and forward"),
agent_dir: Optional[str] = typer.Option(None, "--agent-dir", help="Worker cert bundle dir (default: ~/.decnet/agent)"),
state_db: Optional[str] = typer.Option(None, "--state-db", help="Forwarder offset SQLite path (default: <agent_dir>/forwarder.db)"),
poll_interval: float = typer.Option(0.5, "--poll-interval", help="Seconds between log file stat checks"),
daemon: bool = typer.Option(False, "--daemon", "-d", help="Detach to background as a daemon process"),
) -> None:
"""Run the worker-side syslog-over-TLS forwarder (RFC 5425, mTLS to master:6514)."""
from decnet.env import DECNET_SWARM_MASTER_HOST
from decnet.swarm import pki
from decnet.swarm.log_forwarder import ForwarderConfig, run_forwarder
resolved_host = master_host or DECNET_SWARM_MASTER_HOST
if not resolved_host:
console.print("[red]--master-host is required (or set DECNET_SWARM_MASTER_HOST).[/]")
raise typer.Exit(2)
resolved_agent_dir = pathlib.Path(agent_dir) if agent_dir else pki.DEFAULT_AGENT_DIR
if not (resolved_agent_dir / "worker.crt").exists():
console.print(f"[red]No worker cert bundle at {resolved_agent_dir} — enroll from the master first.[/]")
raise typer.Exit(2)
if not log_file:
console.print("[red]--log-file is required.[/]")
raise typer.Exit(2)
cfg = ForwarderConfig(
log_path=pathlib.Path(log_file),
master_host=resolved_host,
master_port=master_port,
agent_dir=resolved_agent_dir,
state_db=pathlib.Path(state_db) if state_db else None,
)
if daemon:
log.info("forwarder daemonizing master=%s:%d log=%s", resolved_host, master_port, log_file)
_utils._daemonize()
log.info("forwarder command invoked master=%s:%d log=%s", resolved_host, master_port, log_file)
console.print(f"[green]Starting DECNET forwarder → {resolved_host}:{master_port} (mTLS)...[/]")
async def _main() -> None:
stop = asyncio.Event()
loop = asyncio.get_running_loop()
for sig in (signal.SIGTERM, signal.SIGINT):
try:
loop.add_signal_handler(sig, stop.set)
except (NotImplementedError, RuntimeError): # pragma: no cover
pass
await run_forwarder(cfg, poll_interval=poll_interval, stop_event=stop)
try:
asyncio.run(_main())
except KeyboardInterrupt:
pass

71
decnet/cli/gating.py Normal file
View File

@@ -0,0 +1,71 @@
"""Role-based CLI gating.
MAINTAINERS: when you add a new Typer command (or add_typer group) that is
master-only, register its name in MASTER_ONLY_COMMANDS / MASTER_ONLY_GROUPS
below. The gate is the only thing that:
(a) hides the command from `decnet --help` on worker hosts, and
(b) prevents a misconfigured worker from invoking master-side logic.
Forgetting to register a new command is a role-boundary bug. Grep for
MASTER_ONLY when touching command registration.
Worker-legitimate commands (NOT in these sets): agent, updater, forwarder,
status, collect, probe, sniffer. Agents run deckies locally and should be
able to inspect them + run the per-host microservices (collector streams
container logs, prober characterizes attackers hitting this host, sniffer
captures traffic). Mutator and Profiler stay master-only: the mutator
orchestrates respawns across the swarm; the profiler rebuilds attacker
profiles against the master DB (no per-host DB exists).
"""
from __future__ import annotations
import os
import typer
from .utils import console
MASTER_ONLY_COMMANDS: frozenset[str] = frozenset({
"api", "swarmctl", "deploy", "redeploy", "teardown",
"mutate", "listener", "profiler",
"services", "distros", "correlate", "archetypes", "web",
"db-reset",
})
MASTER_ONLY_GROUPS: frozenset[str] = frozenset({"swarm"})
def _agent_mode_active() -> bool:
"""True when the host is configured as an agent AND master commands are
disallowed (the default for agents). Workers overriding this explicitly
set DECNET_DISALLOW_MASTER=false to opt into hybrid use."""
mode = os.environ.get("DECNET_MODE", "master").lower()
disallow = os.environ.get("DECNET_DISALLOW_MASTER", "true").lower() == "true"
return mode == "agent" and disallow
def _require_master_mode(command_name: str) -> None:
"""Defence-in-depth: called at the top of every master-only command body.
The registration-time gate in _gate_commands_by_mode() already hides
these commands from Typer's dispatch table, but this check protects
against direct function imports (e.g. from tests or third-party tools)
that would bypass Typer entirely."""
if _agent_mode_active():
console.print(
f"[red]`decnet {command_name}` is a master-only command; this host "
f"is configured as an agent (DECNET_MODE=agent).[/]"
)
raise typer.Exit(1)
def _gate_commands_by_mode(_app: typer.Typer) -> None:
if not _agent_mode_active():
return
_app.registered_commands = [
c for c in _app.registered_commands
if (c.name or c.callback.__name__) not in MASTER_ONLY_COMMANDS
]
_app.registered_groups = [
g for g in _app.registered_groups
if g.name not in MASTER_ONLY_GROUPS
]

52
decnet/cli/inventory.py Normal file
View File

@@ -0,0 +1,52 @@
from __future__ import annotations
import typer
from rich.table import Table
from decnet.archetypes import all_archetypes
from decnet.distros import all_distros
from decnet.services.registry import all_services
from .utils import console
def register(app: typer.Typer) -> None:
@app.command(name="services")
def list_services() -> None:
"""List all registered honeypot service plugins."""
svcs = all_services()
table = Table(title="Available Services", show_lines=True)
table.add_column("Name", style="bold cyan")
table.add_column("Ports")
table.add_column("Image")
for name, svc in sorted(svcs.items()):
table.add_row(name, ", ".join(str(p) for p in svc.ports), svc.default_image)
console.print(table)
@app.command(name="distros")
def list_distros() -> None:
"""List all available OS distro profiles for deckies."""
table = Table(title="Available Distro Profiles", show_lines=True)
table.add_column("Slug", style="bold cyan")
table.add_column("Display Name")
table.add_column("Docker Image", style="dim")
for slug, profile in sorted(all_distros().items()):
table.add_row(slug, profile.display_name, profile.image)
console.print(table)
@app.command(name="archetypes")
def list_archetypes() -> None:
"""List all machine archetype profiles."""
table = Table(title="Machine Archetypes", show_lines=True)
table.add_column("Slug", style="bold cyan")
table.add_column("Display Name")
table.add_column("Default Services", style="green")
table.add_column("Description", style="dim")
for slug, arch in sorted(all_archetypes().items()):
table.add_row(
slug,
arch.display_name,
", ".join(arch.services),
arch.description,
)
console.print(table)

97
decnet/cli/lifecycle.py Normal file
View File

@@ -0,0 +1,97 @@
from __future__ import annotations
import subprocess # nosec B404
from typing import Optional
import typer
from rich.table import Table
from decnet.env import DECNET_INGEST_LOG_FILE
from . import utils as _utils
from .gating import _agent_mode_active, _require_master_mode
from .utils import console, log
def register(app: typer.Typer) -> None:
@app.command()
def redeploy(
log_file: str = typer.Option(DECNET_INGEST_LOG_FILE, "--log-file", "-f", help="Path to the DECNET log file"),
) -> None:
"""Check running DECNET services and relaunch any that are down."""
log.info("redeploy: checking services")
registry = _utils._service_registry(str(log_file))
table = Table(title="DECNET Services", show_lines=True)
table.add_column("Service", style="bold cyan")
table.add_column("Status")
table.add_column("PID", style="dim")
table.add_column("Action")
relaunched = 0
for name, match_fn, launch_args in registry:
pid = _utils._is_running(match_fn)
if pid is not None:
table.add_row(name, "[green]UP[/]", str(pid), "")
else:
try:
subprocess.Popen( # nosec B603
launch_args,
stdin=subprocess.DEVNULL,
stdout=subprocess.DEVNULL,
stderr=subprocess.STDOUT,
start_new_session=True,
)
table.add_row(name, "[red]DOWN[/]", "", "[green]relaunched[/]")
relaunched += 1
except (FileNotFoundError, subprocess.SubprocessError) as exc:
table.add_row(name, "[red]DOWN[/]", "", f"[red]failed: {exc}[/]")
console.print(table)
if relaunched:
console.print(f"[green]{relaunched} service(s) relaunched.[/]")
else:
console.print("[green]All services running.[/]")
@app.command()
def status() -> None:
"""Show running deckies and their status."""
log.info("status command invoked")
from decnet.engine import status as _status
_status()
registry = _utils._service_registry(str(DECNET_INGEST_LOG_FILE))
if _agent_mode_active():
registry = [r for r in registry if r[0] not in {"Mutator", "Profiler", "API"}]
svc_table = Table(title="DECNET Services", show_lines=True)
svc_table.add_column("Service", style="bold cyan")
svc_table.add_column("Status")
svc_table.add_column("PID", style="dim")
for name, match_fn, _launch_args in registry:
pid = _utils._is_running(match_fn)
if pid is not None:
svc_table.add_row(name, "[green]UP[/]", str(pid))
else:
svc_table.add_row(name, "[red]DOWN[/]", "")
console.print(svc_table)
@app.command()
def teardown(
all_: bool = typer.Option(False, "--all", help="Tear down all deckies and remove network"),
id_: Optional[str] = typer.Option(None, "--id", help="Tear down a specific decky by name"),
) -> None:
"""Stop and remove deckies."""
_require_master_mode("teardown")
if not all_ and not id_:
console.print("[red]Specify --all or --id <name>.[/]")
raise typer.Exit(1)
log.info("teardown command invoked all=%s id=%s", all_, id_)
from decnet.engine import teardown as _teardown
_teardown(decky_id=id_)
log.info("teardown complete all=%s id=%s", all_, id_)
if all_:
_utils._kill_all_services()

57
decnet/cli/listener.py Normal file
View File

@@ -0,0 +1,57 @@
from __future__ import annotations
import asyncio
import pathlib
import signal
from typing import Optional
import typer
from . import utils as _utils
from .utils import console, log
def register(app: typer.Typer) -> None:
@app.command()
def listener(
bind_host: str = typer.Option("0.0.0.0", "--host", help="Bind address for the master syslog-TLS listener"), # nosec B104
bind_port: int = typer.Option(6514, "--port", help="Listener TCP port (RFC 5425 default 6514)"),
log_path: Optional[str] = typer.Option(None, "--log-path", help="RFC 5424 forensic sink (default: ./master.log)"),
json_path: Optional[str] = typer.Option(None, "--json-path", help="Parsed-JSON ingest sink (default: ./master.json)"),
ca_dir: Optional[str] = typer.Option(None, "--ca-dir", help="DECNET CA dir (default: ~/.decnet/ca)"),
daemon: bool = typer.Option(False, "--daemon", "-d", help="Detach to background as a daemon process"),
) -> None:
"""Run the master-side syslog-over-TLS listener (RFC 5425, mTLS)."""
from decnet.swarm import pki
from decnet.swarm.log_listener import ListenerConfig, run_listener
resolved_ca_dir = pathlib.Path(ca_dir) if ca_dir else pki.DEFAULT_CA_DIR
resolved_log = pathlib.Path(log_path) if log_path else pathlib.Path("master.log")
resolved_json = pathlib.Path(json_path) if json_path else pathlib.Path("master.json")
cfg = ListenerConfig(
log_path=resolved_log, json_path=resolved_json,
bind_host=bind_host, bind_port=bind_port, ca_dir=resolved_ca_dir,
)
if daemon:
log.info("listener daemonizing host=%s port=%d", bind_host, bind_port)
_utils._daemonize()
log.info("listener command invoked host=%s port=%d", bind_host, bind_port)
console.print(f"[green]Starting DECNET log listener on {bind_host}:{bind_port} (mTLS)...[/]")
async def _main() -> None:
stop = asyncio.Event()
loop = asyncio.get_running_loop()
for sig in (signal.SIGTERM, signal.SIGINT):
try:
loop.add_signal_handler(sig, stop.set)
except (NotImplementedError, RuntimeError): # pragma: no cover
pass
await run_listener(cfg, stop_event=stop)
try:
asyncio.run(_main())
except KeyboardInterrupt:
pass

34
decnet/cli/profiler.py Normal file
View File

@@ -0,0 +1,34 @@
from __future__ import annotations
import typer
from . import utils as _utils
from .utils import console, log
def register(app: typer.Typer) -> None:
@app.command(name="profiler")
def profiler_cmd(
interval: int = typer.Option(30, "--interval", "-i", help="Seconds between profile rebuild cycles"),
daemon: bool = typer.Option(False, "--daemon", "-d", help="Detach to background as a daemon process"),
) -> None:
"""Run the attacker profiler as a standalone microservice."""
import asyncio
from decnet.profiler import attacker_profile_worker
from decnet.web.dependencies import repo
if daemon:
log.info("profiler daemonizing interval=%d", interval)
_utils._daemonize()
log.info("profiler starting interval=%d", interval)
console.print(f"[bold cyan]Profiler starting[/] (interval: {interval}s)")
async def _run() -> None:
await repo.initialize()
await attacker_profile_worker(repo, interval=interval)
try:
asyncio.run(_run())
except KeyboardInterrupt:
console.print("\n[yellow]Profiler stopped.[/]")

31
decnet/cli/sniffer.py Normal file
View File

@@ -0,0 +1,31 @@
from __future__ import annotations
import typer
from decnet.env import DECNET_INGEST_LOG_FILE
from . import utils as _utils
from .utils import console, log
def register(app: typer.Typer) -> None:
@app.command(name="sniffer")
def sniffer_cmd(
log_file: str = typer.Option(DECNET_INGEST_LOG_FILE, "--log-file", "-f", help="Path to write captured syslog + JSON records"),
daemon: bool = typer.Option(False, "--daemon", "-d", help="Detach to background as a daemon process"),
) -> None:
"""Run the network sniffer as a standalone microservice."""
import asyncio
from decnet.sniffer import sniffer_worker
if daemon:
log.info("sniffer daemonizing log_file=%s", log_file)
_utils._daemonize()
log.info("sniffer starting log_file=%s", log_file)
console.print(f"[bold cyan]Sniffer starting[/] → {log_file}")
try:
asyncio.run(sniffer_worker(log_file))
except KeyboardInterrupt:
console.print("\n[yellow]Sniffer stopped.[/]")

346
decnet/cli/swarm.py Normal file
View File

@@ -0,0 +1,346 @@
"""`decnet swarm ...` — master-side operator commands (HTTP to local swarmctl)."""
from __future__ import annotations
from typing import Optional
import typer
from rich.table import Table
from . import utils as _utils
from .utils import console
def register(app: typer.Typer) -> None:
swarm_app = typer.Typer(
name="swarm",
help="Manage swarm workers (enroll, list, decommission). Requires `decnet swarmctl` running.",
no_args_is_help=True,
)
app.add_typer(swarm_app, name="swarm")
@swarm_app.command("enroll")
def swarm_enroll(
name: str = typer.Option(..., "--name", help="Short hostname for the worker (also the cert CN)"),
address: str = typer.Option(..., "--address", help="IP or DNS the master uses to reach the worker"),
agent_port: int = typer.Option(8765, "--agent-port", help="Worker agent TCP port"),
sans: Optional[str] = typer.Option(None, "--sans", help="Comma-separated extra SANs for the worker cert"),
notes: Optional[str] = typer.Option(None, "--notes", help="Free-form operator notes"),
out_dir: Optional[str] = typer.Option(None, "--out-dir", help="Write the bundle (ca.crt/worker.crt/worker.key) to this dir for scp"),
updater: bool = typer.Option(False, "--updater", help="Also issue an updater-identity cert (CN=updater@<name>) for the remote self-updater"),
url: Optional[str] = typer.Option(None, "--url", help="Override swarm controller URL (default: 127.0.0.1:8770)"),
) -> None:
"""Issue a mTLS bundle for a new worker and register it in the swarm."""
import pathlib as _pathlib
body: dict = {"name": name, "address": address, "agent_port": agent_port}
if sans:
body["sans"] = [s.strip() for s in sans.split(",") if s.strip()]
if notes:
body["notes"] = notes
if updater:
body["issue_updater_bundle"] = True
resp = _utils._http_request("POST", _utils._swarmctl_base_url(url) + "/swarm/enroll", json_body=body)
data = resp.json()
console.print(f"[green]Enrolled worker:[/] {data['name']} "
f"[dim]uuid=[/]{data['host_uuid']} "
f"[dim]fingerprint=[/]{data['fingerprint']}")
if data.get("updater"):
console.print(f"[green] + updater identity[/] "
f"[dim]fingerprint=[/]{data['updater']['fingerprint']}")
if out_dir:
target = _pathlib.Path(out_dir).expanduser()
target.mkdir(parents=True, exist_ok=True)
(target / "ca.crt").write_text(data["ca_cert_pem"])
(target / "worker.crt").write_text(data["worker_cert_pem"])
(target / "worker.key").write_text(data["worker_key_pem"])
for leaf in ("worker.key",):
try:
(target / leaf).chmod(0o600)
except OSError:
pass
console.print(f"[cyan]Agent bundle written to[/] {target}")
if data.get("updater"):
upd_target = target.parent / f"{target.name}-updater"
upd_target.mkdir(parents=True, exist_ok=True)
(upd_target / "ca.crt").write_text(data["ca_cert_pem"])
(upd_target / "updater.crt").write_text(data["updater"]["updater_cert_pem"])
(upd_target / "updater.key").write_text(data["updater"]["updater_key_pem"])
try:
(upd_target / "updater.key").chmod(0o600)
except OSError:
pass
console.print(f"[cyan]Updater bundle written to[/] {upd_target}")
console.print("[dim]Ship the agent dir to ~/.decnet/agent/ and the updater dir to ~/.decnet/updater/ on the worker.[/]")
else:
console.print("[dim]Ship this directory to the worker at ~/.decnet/agent/ (or wherever `decnet agent --agent-dir` points).[/]")
else:
console.print("[yellow]No --out-dir given — bundle PEMs are in the JSON response; persist them before leaving this shell.[/]")
@swarm_app.command("list")
def swarm_list(
host_status: Optional[str] = typer.Option(None, "--status", help="Filter by status (enrolled|active|unreachable|decommissioned)"),
url: Optional[str] = typer.Option(None, "--url", help="Override swarm controller URL"),
) -> None:
"""List enrolled workers."""
q = f"?host_status={host_status}" if host_status else ""
resp = _utils._http_request("GET", _utils._swarmctl_base_url(url) + "/swarm/hosts" + q)
rows = resp.json()
if not rows:
console.print("[dim]No workers enrolled.[/]")
return
table = Table(title="DECNET swarm workers")
for col in ("name", "address", "port", "status", "last heartbeat", "enrolled"):
table.add_column(col)
for r in rows:
table.add_row(
r.get("name") or "",
r.get("address") or "",
str(r.get("agent_port") or ""),
r.get("status") or "",
str(r.get("last_heartbeat") or ""),
str(r.get("enrolled_at") or ""),
)
console.print(table)
@swarm_app.command("check")
def swarm_check(
url: Optional[str] = typer.Option(None, "--url", help="Override swarm controller URL"),
json_out: bool = typer.Option(False, "--json", help="Emit JSON instead of a table"),
) -> None:
"""Actively probe every enrolled worker and refresh status + last_heartbeat."""
resp = _utils._http_request("POST", _utils._swarmctl_base_url(url) + "/swarm/check", timeout=60.0)
payload = resp.json()
results = payload.get("results", [])
if json_out:
console.print_json(data=payload)
return
if not results:
console.print("[dim]No workers enrolled.[/]")
return
table = Table(title="DECNET swarm check")
for col in ("name", "address", "reachable", "detail"):
table.add_column(col)
for r in results:
reachable = r.get("reachable")
mark = "[green]yes[/]" if reachable else "[red]no[/]"
detail = r.get("detail")
detail_str = ""
if isinstance(detail, dict):
detail_str = detail.get("status") or ", ".join(f"{k}={v}" for k, v in detail.items())
elif detail is not None:
detail_str = str(detail)
table.add_row(
r.get("name") or "",
r.get("address") or "",
mark,
detail_str,
)
console.print(table)
@swarm_app.command("update")
def swarm_update(
host: Optional[str] = typer.Option(None, "--host", help="Target worker (name or UUID). Omit with --all."),
all_hosts: bool = typer.Option(False, "--all", help="Push to every enrolled worker."),
include_self: bool = typer.Option(False, "--include-self", help="Also push to each updater's /update-self after a successful agent update."),
root: Optional[str] = typer.Option(None, "--root", help="Source tree to tar (default: CWD)."),
exclude: list[str] = typer.Option([], "--exclude", help="Additional exclude glob. Repeatable."),
updater_port: int = typer.Option(8766, "--updater-port", help="Port the workers' updater listens on."),
dry_run: bool = typer.Option(False, "--dry-run", help="Build the tarball and print stats; no network."),
url: Optional[str] = typer.Option(None, "--url", help="Override swarm controller URL."),
) -> None:
"""Push the current working tree to workers' self-updaters (with auto-rollback on failure)."""
import asyncio
import pathlib as _pathlib
from decnet.swarm.tar_tree import tar_working_tree, detect_git_sha
from decnet.swarm.updater_client import UpdaterClient
if not (host or all_hosts):
console.print("[red]Supply --host <name> or --all.[/]")
raise typer.Exit(2)
if host and all_hosts:
console.print("[red]--host and --all are mutually exclusive.[/]")
raise typer.Exit(2)
base = _utils._swarmctl_base_url(url)
resp = _utils._http_request("GET", base + "/swarm/hosts")
rows = resp.json()
if host:
targets = [r for r in rows if r.get("name") == host or r.get("uuid") == host]
if not targets:
console.print(f"[red]No enrolled worker matching '{host}'.[/]")
raise typer.Exit(1)
else:
targets = [r for r in rows if r.get("status") != "decommissioned"]
if not targets:
console.print("[dim]No targets.[/]")
return
tree_root = _pathlib.Path(root) if root else _pathlib.Path.cwd()
sha = detect_git_sha(tree_root)
console.print(f"[dim]Tarring[/] {tree_root} [dim]sha={sha or '(not a git repo)'}[/]")
tarball = tar_working_tree(tree_root, extra_excludes=exclude)
console.print(f"[dim]Tarball size:[/] {len(tarball):,} bytes")
if dry_run:
console.print("[yellow]--dry-run: not pushing.[/]")
for t in targets:
console.print(f" would push to [cyan]{t.get('name')}[/] at {t.get('address')}:{updater_port}")
return
async def _push_one(h: dict) -> dict:
name = h.get("name") or h.get("uuid")
out: dict = {"name": name, "address": h.get("address"), "agent": None, "self": None}
try:
async with UpdaterClient(h, updater_port=updater_port) as u:
r = await u.update(tarball, sha=sha)
out["agent"] = {"status": r.status_code, "body": r.json() if r.content else {}}
if r.status_code == 200 and include_self:
rs = await u.update_self(tarball, sha=sha)
out["self"] = {"status": rs.status_code, "body": rs.json() if rs.content else {}}
except Exception as exc: # noqa: BLE001
out["error"] = f"{type(exc).__name__}: {exc}"
return out
async def _push_all() -> list[dict]:
return await asyncio.gather(*(_push_one(t) for t in targets))
results = asyncio.run(_push_all())
table = Table(title="DECNET swarm update")
for col in ("host", "address", "agent", "self", "detail"):
table.add_column(col)
any_failure = False
for r in results:
agent = r.get("agent") or {}
selff = r.get("self") or {}
err = r.get("error")
if err:
any_failure = True
table.add_row(r["name"], r.get("address") or "", "[red]error[/]", "", err)
continue
a_status = agent.get("status")
if a_status == 200:
agent_cell = "[green]updated[/]"
elif a_status == 409:
agent_cell = "[yellow]rolled-back[/]"
any_failure = True
else:
agent_cell = f"[red]{a_status}[/]"
any_failure = True
if not include_self:
self_cell = ""
elif selff.get("status") == 200 or selff.get("status") is None:
self_cell = "[green]ok[/]" if selff else "[dim]skipped[/]"
else:
self_cell = f"[red]{selff.get('status')}[/]"
detail = ""
body = agent.get("body") or {}
if isinstance(body, dict):
detail = body.get("release", {}).get("sha") or body.get("detail", {}).get("error") or ""
table.add_row(r["name"], r.get("address") or "", agent_cell, self_cell, str(detail)[:80])
console.print(table)
if any_failure:
raise typer.Exit(1)
@swarm_app.command("deckies")
def swarm_deckies(
host: Optional[str] = typer.Option(None, "--host", help="Filter by worker name or UUID"),
state: Optional[str] = typer.Option(None, "--state", help="Filter by shard state (pending|running|failed|torn_down)"),
url: Optional[str] = typer.Option(None, "--url", help="Override swarm controller URL"),
json_out: bool = typer.Option(False, "--json", help="Emit JSON instead of a table"),
) -> None:
"""List deployed deckies across the swarm with their owning worker host."""
base = _utils._swarmctl_base_url(url)
host_uuid: Optional[str] = None
if host:
resp = _utils._http_request("GET", base + "/swarm/hosts")
rows = resp.json()
match = next((r for r in rows if r.get("uuid") == host or r.get("name") == host), None)
if match is None:
console.print(f"[red]No enrolled worker matching '{host}'.[/]")
raise typer.Exit(1)
host_uuid = match["uuid"]
query = []
if host_uuid:
query.append(f"host_uuid={host_uuid}")
if state:
query.append(f"state={state}")
path = "/swarm/deckies" + ("?" + "&".join(query) if query else "")
resp = _utils._http_request("GET", base + path)
rows = resp.json()
if json_out:
console.print_json(data=rows)
return
if not rows:
console.print("[dim]No deckies deployed.[/]")
return
table = Table(title="DECNET swarm deckies")
for col in ("decky", "host", "address", "state", "services"):
table.add_column(col)
for r in rows:
services = ",".join(r.get("services") or []) or ""
state_val = r.get("state") or "pending"
colored = {
"running": f"[green]{state_val}[/]",
"failed": f"[red]{state_val}[/]",
"pending": f"[yellow]{state_val}[/]",
"torn_down": f"[dim]{state_val}[/]",
}.get(state_val, state_val)
table.add_row(
r.get("decky_name") or "",
r.get("host_name") or "<unknown>",
r.get("host_address") or "",
colored,
services,
)
console.print(table)
@swarm_app.command("decommission")
def swarm_decommission(
name: Optional[str] = typer.Option(None, "--name", help="Worker hostname"),
uuid: Optional[str] = typer.Option(None, "--uuid", help="Worker UUID (skip lookup)"),
url: Optional[str] = typer.Option(None, "--url", help="Override swarm controller URL"),
yes: bool = typer.Option(False, "--yes", "-y", help="Skip interactive confirmation"),
) -> None:
"""Remove a worker from the swarm (cascades decky shard rows)."""
if not (name or uuid):
console.print("[red]Supply --name or --uuid.[/]")
raise typer.Exit(2)
base = _utils._swarmctl_base_url(url)
target_uuid = uuid
target_name = name
if target_uuid is None:
resp = _utils._http_request("GET", base + "/swarm/hosts")
rows = resp.json()
match = next((r for r in rows if r.get("name") == name), None)
if match is None:
console.print(f"[red]No enrolled worker named '{name}'.[/]")
raise typer.Exit(1)
target_uuid = match["uuid"]
target_name = match.get("name") or target_name
if not yes:
confirm = typer.confirm(f"Decommission worker {target_name!r} ({target_uuid})?", default=False)
if not confirm:
console.print("[dim]Aborted.[/]")
raise typer.Exit(0)
_utils._http_request("DELETE", f"{base}/swarm/hosts/{target_uuid}")
console.print(f"[green]Decommissioned {target_name or target_uuid}.[/]")

104
decnet/cli/swarmctl.py Normal file
View File

@@ -0,0 +1,104 @@
from __future__ import annotations
import os
import signal
import subprocess # nosec B404
import sys
from typing import Optional
import typer
from . import utils as _utils
from .gating import _require_master_mode
from .utils import console, log
def register(app: typer.Typer) -> None:
@app.command()
def swarmctl(
port: int = typer.Option(8770, "--port", help="Port for the swarm controller"),
host: str = typer.Option("127.0.0.1", "--host", help="Bind address for the swarm controller"),
daemon: bool = typer.Option(False, "--daemon", "-d", help="Detach to background as a daemon process"),
no_listener: bool = typer.Option(False, "--no-listener", help="Do not auto-spawn the syslog-TLS listener alongside swarmctl"),
tls: bool = typer.Option(False, "--tls", help="Serve over HTTPS with mTLS (required for cross-host worker heartbeats)"),
cert: Optional[str] = typer.Option(None, "--cert", help="BYOC: path to TLS server cert (PEM). Auto-issues from the DECNET CA if omitted."),
key: Optional[str] = typer.Option(None, "--key", help="BYOC: path to TLS server private key (PEM)."),
client_ca: Optional[str] = typer.Option(None, "--client-ca", help="CA bundle used to verify worker client certs. Defaults to the DECNET CA."),
) -> None:
"""Run the DECNET SWARM controller (master-side, separate process from `decnet api`).
By default, `decnet swarmctl` auto-spawns `decnet listener` as a fully-
detached sibling process so the master starts accepting forwarder
connections on 6514 without a second manual invocation. The listener
survives swarmctl restarts and crashes — if it dies on its own,
restart it manually with `decnet listener --daemon …`. Pass
--no-listener to skip.
Pass ``--tls`` to serve over HTTPS with mutual-TLS enforcement. By
default the server cert is auto-issued from the DECNET CA under
``~/.decnet/swarmctl/`` so enrolled workers (which already ship that
CA's ``ca.crt``) trust it out of the box. BYOC via ``--cert``/``--key``
if you need a publicly-trusted or externally-managed cert.
"""
_require_master_mode("swarmctl")
if daemon:
log.info("swarmctl daemonizing host=%s port=%d", host, port)
_utils._daemonize()
if not no_listener:
listener_host = os.environ.get("DECNET_LISTENER_HOST", "0.0.0.0") # nosec B104
listener_port = int(os.environ.get("DECNET_SWARM_SYSLOG_PORT", "6514"))
lst_argv = [
sys.executable, "-m", "decnet", "listener",
"--host", listener_host,
"--port", str(listener_port),
"--daemon",
]
try:
pid = _utils._spawn_detached(lst_argv, _utils._pid_dir() / "listener.pid")
log.info("swarmctl auto-spawned listener pid=%d bind=%s:%d",
pid, listener_host, listener_port)
console.print(f"[dim]Auto-spawned listener (pid {pid}) on {listener_host}:{listener_port}.[/]")
except Exception as e: # noqa: BLE001
log.warning("swarmctl could not auto-spawn listener: %s", e)
console.print(f"[yellow]listener auto-spawn skipped: {e}[/]")
log.info("swarmctl command invoked host=%s port=%d tls=%s", host, port, tls)
scheme = "https" if tls else "http"
console.print(f"[green]Starting DECNET SWARM controller on {scheme}://{host}:{port}...[/]")
_cmd = [sys.executable, "-m", "uvicorn", "decnet.web.swarm_api:app",
"--host", host, "--port", str(port)]
if tls:
from decnet.swarm import pki as _pki
if cert and key:
cert_path, key_path = cert, key
elif cert or key:
console.print("[red]--cert and --key must be provided together.[/]")
raise typer.Exit(code=2)
else:
auto_cert, auto_key, _auto_ca = _pki.ensure_swarmctl_cert(host)
cert_path, key_path = str(auto_cert), str(auto_key)
console.print(f"[dim]Auto-issued swarmctl server cert → {cert_path}[/]")
ca_path = client_ca or str(_pki.DEFAULT_CA_DIR / "ca.crt")
_cmd += [
"--ssl-keyfile", key_path,
"--ssl-certfile", cert_path,
"--ssl-ca-certs", ca_path,
"--ssl-cert-reqs", "2",
]
try:
proc = subprocess.Popen(_cmd, start_new_session=True) # nosec B603 B404
try:
proc.wait()
except KeyboardInterrupt:
try:
os.killpg(proc.pid, signal.SIGTERM)
try:
proc.wait(timeout=10)
except subprocess.TimeoutExpired:
os.killpg(proc.pid, signal.SIGKILL)
proc.wait()
except ProcessLookupError:
pass
except (FileNotFoundError, subprocess.SubprocessError):
console.print("[red]Failed to start swarmctl. Ensure 'uvicorn' is installed in the current environment.[/]")

46
decnet/cli/updater.py Normal file
View File

@@ -0,0 +1,46 @@
from __future__ import annotations
import pathlib as _pathlib
from typing import Optional
import typer
from . import utils as _utils
from .utils import console, log
def register(app: typer.Typer) -> None:
@app.command()
def updater(
port: int = typer.Option(8766, "--port", help="Port for the self-updater daemon"),
host: str = typer.Option("0.0.0.0", "--host", help="Bind address for the updater"), # nosec B104
updater_dir: Optional[str] = typer.Option(None, "--updater-dir", help="Updater cert bundle dir (default: ~/.decnet/updater)"),
install_dir: Optional[str] = typer.Option(None, "--install-dir", help="Release install root (default: /opt/decnet)"),
agent_dir: Optional[str] = typer.Option(None, "--agent-dir", help="Worker agent cert bundle (for local /health probes; default: ~/.decnet/agent)"),
daemon: bool = typer.Option(False, "--daemon", "-d", help="Detach to background as a daemon process"),
) -> None:
"""Run the DECNET self-updater (requires a bundle in ~/.decnet/updater/)."""
from decnet.swarm import pki as _pki
from decnet.updater import server as _upd_server
resolved_updater = _pathlib.Path(updater_dir) if updater_dir else _upd_server.DEFAULT_UPDATER_DIR
resolved_install = _pathlib.Path(install_dir) if install_dir else _pathlib.Path("/opt/decnet")
resolved_agent = _pathlib.Path(agent_dir) if agent_dir else _pki.DEFAULT_AGENT_DIR
if daemon:
log.info("updater daemonizing host=%s port=%d", host, port)
_utils._daemonize()
log.info(
"updater command invoked host=%s port=%d updater_dir=%s install_dir=%s",
host, port, resolved_updater, resolved_install,
)
console.print(f"[green]Starting DECNET self-updater on {host}:{port} (mTLS)...[/]")
rc = _upd_server.run(
host, port,
updater_dir=resolved_updater,
install_dir=resolved_install,
agent_dir=resolved_agent,
)
if rc != 0:
raise typer.Exit(rc)

177
decnet/cli/utils.py Normal file
View File

@@ -0,0 +1,177 @@
"""Shared CLI helpers: console, logger, process management, swarm HTTP client.
Submodules reference these as ``from . import utils`` then ``utils.foo(...)``
so tests can patch ``decnet.cli.utils.<name>`` and have every caller see it.
"""
from __future__ import annotations
import os
import signal
import subprocess # nosec B404
import sys
from pathlib import Path
from typing import Optional
import typer
from rich.console import Console
from decnet.logging import get_logger
from decnet.env import DECNET_API_HOST, DECNET_API_PORT, DECNET_INGEST_LOG_FILE
log = get_logger("cli")
console = Console()
def _daemonize() -> None:
"""Fork the current process into a background daemon (Unix double-fork)."""
if os.fork() > 0:
raise SystemExit(0)
os.setsid()
if os.fork() > 0:
raise SystemExit(0)
sys.stdout = open(os.devnull, "w") # noqa: SIM115
sys.stderr = open(os.devnull, "w") # noqa: SIM115
sys.stdin = open(os.devnull, "r") # noqa: SIM115
def _pid_dir() -> Path:
"""Return the writable PID directory.
/opt/decnet when it exists and is writable (production), else
~/.decnet (dev). The directory is created if needed."""
candidates = [Path("/opt/decnet"), Path.home() / ".decnet"]
for path in candidates:
try:
path.mkdir(parents=True, exist_ok=True)
if os.access(path, os.W_OK):
return path
except (PermissionError, OSError):
continue
return Path("/tmp") # nosec B108
def _spawn_detached(argv: list[str], pid_file: Path) -> int:
"""Spawn a DECNET subcommand as a fully-independent sibling process.
The parent does NOT wait() on this child. start_new_session=True puts
the child in its own session so SIGHUP on parent exit doesn't kill it;
stdin/stdout/stderr go to /dev/null so the launching shell can close
without EIO on the child. close_fds=True prevents inherited sockets
from pinning ports we're trying to rebind.
This is deliberately NOT a supervisor — we fire-and-forget. If the
child dies, the operator restarts it manually via its own subcommand.
"""
if pid_file.exists():
try:
existing = int(pid_file.read_text().strip())
os.kill(existing, 0)
return existing
except (ValueError, ProcessLookupError, PermissionError, OSError):
pass # stale pid_file — fall through and spawn
with open(os.devnull, "rb") as dn_in, open(os.devnull, "ab") as dn_out:
proc = subprocess.Popen( # nosec B603
argv,
stdin=dn_in, stdout=dn_out, stderr=dn_out,
start_new_session=True, close_fds=True,
)
pid_file.parent.mkdir(parents=True, exist_ok=True)
pid_file.write_text(f"{proc.pid}\n")
return proc.pid
def _is_running(match_fn) -> int | None:
"""Return PID of a running DECNET process matching ``match_fn(cmdline)``, or None."""
import psutil
for proc in psutil.process_iter(["pid", "cmdline"]):
try:
cmd = proc.info["cmdline"]
if cmd and match_fn(cmd):
return proc.info["pid"]
except (psutil.NoSuchProcess, psutil.AccessDenied):
continue
return None
def _service_registry(log_file: str) -> list[tuple[str, callable, list[str]]]:
"""Return the microservice registry for health-check and relaunch.
On agents these run as systemd units invoking /usr/local/bin/decnet,
which doesn't include "decnet.cli" in its cmdline. On master dev boxes
they're launched via `python -m decnet.cli`. Match either form — cmd
is a list of argv tokens, so substring-check the joined string.
"""
_py = sys.executable
def _matches(sub: str, extras: tuple[str, ...] = ()):
def _check(cmd) -> bool:
joined = " ".join(cmd) if not isinstance(cmd, str) else cmd
if "decnet" not in joined:
return False
if sub not in joined:
return False
return all(e in joined for e in extras)
return _check
return [
("Collector", _matches("collect"),
[_py, "-m", "decnet.cli", "collect", "--daemon", "--log-file", log_file]),
("Mutator", _matches("mutate", ("--watch",)),
[_py, "-m", "decnet.cli", "mutate", "--daemon", "--watch"]),
("Prober", _matches("probe"),
[_py, "-m", "decnet.cli", "probe", "--daemon", "--log-file", log_file]),
("Profiler", _matches("profiler"),
[_py, "-m", "decnet.cli", "profiler", "--daemon"]),
("Sniffer", _matches("sniffer"),
[_py, "-m", "decnet.cli", "sniffer", "--daemon", "--log-file", log_file]),
("API",
lambda cmd: "uvicorn" in cmd and "decnet.web.api:app" in cmd,
[_py, "-m", "uvicorn", "decnet.web.api:app",
"--host", DECNET_API_HOST, "--port", str(DECNET_API_PORT)]),
]
def _kill_all_services() -> None:
"""Find and kill all running DECNET microservice processes."""
registry = _service_registry(str(DECNET_INGEST_LOG_FILE))
killed = 0
for name, match_fn, _launch_args in registry:
pid = _is_running(match_fn)
if pid is not None:
console.print(f"[yellow]Stopping {name} (PID {pid})...[/]")
os.kill(pid, signal.SIGTERM)
killed += 1
if killed:
console.print(f"[green]{killed} background process(es) stopped.[/]")
else:
console.print("[dim]No DECNET services were running.[/]")
_DEFAULT_SWARMCTL_URL = "http://127.0.0.1:8770"
def _swarmctl_base_url(url: Optional[str]) -> str:
return url or os.environ.get("DECNET_SWARMCTL_URL", _DEFAULT_SWARMCTL_URL)
def _http_request(method: str, url: str, *, json_body: Optional[dict] = None, timeout: float = 30.0):
"""Tiny sync wrapper around httpx; avoids leaking async into the CLI."""
import httpx
try:
resp = httpx.request(method, url, json=json_body, timeout=timeout)
except httpx.HTTPError as exc:
console.print(f"[red]Could not reach swarm controller at {url}: {exc}[/]")
console.print("[dim]Is `decnet swarmctl` running?[/]")
raise typer.Exit(2)
if resp.status_code >= 400:
try:
detail = resp.json().get("detail", resp.text)
except Exception: # nosec B110
detail = resp.text
console.print(f"[red]{method} {url} failed: {resp.status_code}{detail}[/]")
raise typer.Exit(1)
return resp

120
decnet/cli/web.py Normal file
View File

@@ -0,0 +1,120 @@
from __future__ import annotations
import typer
from decnet.env import DECNET_API_PORT, DECNET_WEB_HOST, DECNET_WEB_PORT
from . import utils as _utils
from .utils import console, log
def register(app: typer.Typer) -> None:
@app.command(name="web")
def serve_web(
web_port: int = typer.Option(DECNET_WEB_PORT, "--web-port", help="Port to serve the DECNET Web Dashboard"),
host: str = typer.Option(DECNET_WEB_HOST, "--host", help="Host IP to serve the Web Dashboard"),
api_port: int = typer.Option(DECNET_API_PORT, "--api-port", help="Port the DECNET API is listening on"),
daemon: bool = typer.Option(False, "--daemon", "-d", help="Detach to background as a daemon process"),
) -> None:
"""Serve the DECNET Web Dashboard frontend.
Proxies /api/* requests to the API server so the frontend can use
relative URLs (/api/v1/...) with no CORS configuration required.
"""
import http.client
import http.server
import os
import socketserver
from pathlib import Path
dist_dir = Path(__file__).resolve().parent.parent.parent / "decnet_web" / "dist"
if not dist_dir.exists():
console.print(f"[red]Frontend build not found at {dist_dir}. Make sure you run 'npm run build' inside 'decnet_web'.[/]")
raise typer.Exit(1)
if daemon:
log.info("web daemonizing host=%s port=%d api_port=%d", host, web_port, api_port)
_utils._daemonize()
_api_port = api_port
class SPAHTTPRequestHandler(http.server.SimpleHTTPRequestHandler):
def do_GET(self):
if self.path.startswith("/api/"):
self._proxy("GET")
return
path = self.translate_path(self.path)
if not Path(path).exists() or Path(path).is_dir():
self.path = "/index.html"
return super().do_GET()
def do_POST(self):
if self.path.startswith("/api/"):
self._proxy("POST")
return
self.send_error(405)
def do_PUT(self):
if self.path.startswith("/api/"):
self._proxy("PUT")
return
self.send_error(405)
def do_DELETE(self):
if self.path.startswith("/api/"):
self._proxy("DELETE")
return
self.send_error(405)
def _proxy(self, method: str) -> None:
content_length = int(self.headers.get("Content-Length", 0))
body = self.rfile.read(content_length) if content_length else None
forward = {k: v for k, v in self.headers.items()
if k.lower() not in ("host", "connection")}
try:
conn = http.client.HTTPConnection("127.0.0.1", _api_port, timeout=120)
conn.request(method, self.path, body=body, headers=forward)
resp = conn.getresponse()
self.send_response(resp.status)
for key, val in resp.getheaders():
if key.lower() not in ("connection", "transfer-encoding"):
self.send_header(key, val)
self.end_headers()
content_type = resp.getheader("Content-Type", "")
if "text/event-stream" in content_type:
conn.sock.settimeout(None)
_read = getattr(resp, "read1", resp.read)
while True:
chunk = _read(4096)
if not chunk:
break
self.wfile.write(chunk)
self.wfile.flush()
except Exception as exc:
log.warning("web proxy error %s %s: %s", method, self.path, exc)
self.send_error(502, f"API proxy error: {exc}")
finally:
try:
conn.close()
except Exception: # nosec B110 — best-effort conn cleanup
pass
def log_message(self, fmt: str, *args: object) -> None:
log.debug("web %s", fmt % args)
os.chdir(dist_dir)
socketserver.TCPServer.allow_reuse_address = True
with socketserver.ThreadingTCPServer((host, web_port), SPAHTTPRequestHandler) as httpd:
console.print(f"[green]Serving DECNET Web Dashboard on http://{host}:{web_port}[/]")
console.print(f"[dim]Proxying /api/* → http://127.0.0.1:{_api_port}[/]")
try:
httpd.serve_forever()
except KeyboardInterrupt:
console.print("\n[dim]Shutting down dashboard server.[/]")

142
decnet/cli/workers.py Normal file
View File

@@ -0,0 +1,142 @@
from __future__ import annotations
from typing import Optional
import typer
from decnet.env import DECNET_INGEST_LOG_FILE
from . import utils as _utils
from .utils import console, log
def register(app: typer.Typer) -> None:
@app.command()
def probe(
log_file: str = typer.Option(DECNET_INGEST_LOG_FILE, "--log-file", "-f", help="Path for RFC 5424 syslog + .json output (reads attackers from .json, writes results to both)"),
interval: int = typer.Option(300, "--interval", "-i", help="Seconds between probe cycles (default: 300)"),
timeout: float = typer.Option(5.0, "--timeout", help="Per-probe TCP timeout in seconds"),
daemon: bool = typer.Option(False, "--daemon", "-d", help="Detach to background (used by deploy, no console output)"),
) -> None:
"""Fingerprint attackers (JARM + HASSH + TCP/IP stack) discovered in the log stream."""
import asyncio
from decnet.prober import prober_worker
if daemon:
log.info("probe daemonizing log_file=%s interval=%d", log_file, interval)
_utils._daemonize()
asyncio.run(prober_worker(log_file, interval=interval, timeout=timeout))
return
log.info("probe command invoked log_file=%s interval=%d", log_file, interval)
console.print(f"[bold cyan]DECNET-PROBER[/] watching {log_file} for attackers (interval: {interval}s)")
console.print("[dim]Press Ctrl+C to stop[/]")
try:
asyncio.run(prober_worker(log_file, interval=interval, timeout=timeout))
except KeyboardInterrupt:
console.print("\n[yellow]DECNET-PROBER stopped.[/]")
@app.command()
def collect(
log_file: str = typer.Option(DECNET_INGEST_LOG_FILE, "--log-file", "-f", help="Path to write RFC 5424 syslog lines and .json records"),
daemon: bool = typer.Option(False, "--daemon", "-d", help="Detach to background as a daemon process"),
) -> None:
"""Stream Docker logs from all running decky service containers to a log file."""
import asyncio
from decnet.collector import log_collector_worker
if daemon:
log.info("collect daemonizing log_file=%s", log_file)
_utils._daemonize()
log.info("collect command invoked log_file=%s", log_file)
console.print(f"[bold cyan]Collector starting[/] → {log_file}")
asyncio.run(log_collector_worker(log_file))
@app.command()
def mutate(
watch: bool = typer.Option(False, "--watch", "-w", help="Run continuously and mutate deckies according to their interval"),
decky_name: Optional[str] = typer.Option(None, "--decky", help="Force mutate a specific decky immediately"),
force_all: bool = typer.Option(False, "--all", help="Force mutate all deckies immediately"),
daemon: bool = typer.Option(False, "--daemon", "-d", help="Detach to background as a daemon process"),
) -> None:
"""Manually trigger or continuously watch for decky mutation."""
import asyncio
from decnet.mutator import mutate_decky, mutate_all, run_watch_loop
from decnet.web.dependencies import repo
if daemon:
log.info("mutate daemonizing watch=%s", watch)
_utils._daemonize()
async def _run() -> None:
await repo.initialize()
if watch:
await run_watch_loop(repo)
elif decky_name:
await mutate_decky(decky_name, repo)
elif force_all:
await mutate_all(force=True, repo=repo)
else:
await mutate_all(force=False, repo=repo)
asyncio.run(_run())
@app.command(name="correlate")
def correlate(
log_file: Optional[str] = typer.Option(None, "--log-file", "-f", help="Path to DECNET syslog file to analyse"),
min_deckies: int = typer.Option(2, "--min-deckies", "-m", help="Minimum number of distinct deckies an IP must touch to be reported"),
output: str = typer.Option("table", "--output", "-o", help="Output format: table | json | syslog"),
emit_syslog: bool = typer.Option(False, "--emit-syslog", help="Also print traversal events as RFC 5424 lines (for SIEM piping)"),
daemon: bool = typer.Option(False, "--daemon", "-d", help="Detach to background as a daemon process"),
) -> None:
"""Analyse logs for cross-decky traversals and print the attacker movement graph."""
import sys
import json as _json
from pathlib import Path
from decnet.correlation.engine import CorrelationEngine
if daemon:
log.info("correlate daemonizing log_file=%s", log_file)
_utils._daemonize()
engine = CorrelationEngine()
if log_file:
path = Path(log_file)
if not path.exists():
console.print(f"[red]Log file not found: {log_file}[/]")
raise typer.Exit(1)
engine.ingest_file(path)
elif not sys.stdin.isatty():
for line in sys.stdin:
engine.ingest(line)
else:
console.print("[red]Provide --log-file or pipe log data via stdin.[/]")
raise typer.Exit(1)
traversals = engine.traversals(min_deckies)
if output == "json":
console.print_json(_json.dumps(engine.report_json(min_deckies), indent=2))
elif output == "syslog":
for line in engine.traversal_syslog_lines(min_deckies):
typer.echo(line)
else:
if not traversals:
console.print(
f"[yellow]No traversals detected "
f"(min_deckies={min_deckies}, events_indexed={engine.events_indexed}).[/]"
)
else:
console.print(engine.report_table(min_deckies))
console.print(
f"[dim]Parsed {engine.lines_parsed} lines · "
f"indexed {engine.events_indexed} events · "
f"{len(engine.all_attackers())} unique IPs · "
f"[bold]{len(traversals)}[/] traversal(s)[/]"
)
if emit_syslog:
for line in engine.traversal_syslog_lines(min_deckies):
typer.echo(line)

View File

@@ -110,14 +110,23 @@ _RFC5424_RE = re.compile(
r"(\S+) " # 1: TIMESTAMP r"(\S+) " # 1: TIMESTAMP
r"(\S+) " # 2: HOSTNAME (decky name) r"(\S+) " # 2: HOSTNAME (decky name)
r"(\S+) " # 3: APP-NAME (service) r"(\S+) " # 3: APP-NAME (service)
r"- " # PROCID always NILVALUE r"\S+ " # PROCID NILVALUE ("-") for syslog_bridge emitters,
# real PID for native syslog callers like sshd/sudo
# routed through rsyslog. Accept both; we don't consume it.
r"(\S+) " # 4: MSGID (event_type) r"(\S+) " # 4: MSGID (event_type)
r"(.+)$", # 5: SD element + optional MSG r"(.+)$", # 5: SD element + optional MSG
) )
_SD_BLOCK_RE = re.compile(r'\[decnet@55555\s+(.*?)\]', re.DOTALL) _SD_BLOCK_RE = re.compile(r'\[relay@55555\s+(.*?)\]', re.DOTALL)
_PARAM_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"') _PARAM_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')
_IP_FIELDS = ("src_ip", "src", "client_ip", "remote_ip", "remote_addr", "target_ip", "ip") _IP_FIELDS = ("src_ip", "src", "client_ip", "remote_ip", "remote_addr", "target_ip", "ip")
# Free-form `key=value` pairs in the MSG body. Used for lines that bypass the
# syslog_bridge SD format — e.g. the SSH container's PROMPT_COMMAND which
# calls `logger -t bash "CMD uid=0 user=root src=1.2.3.4 pwd=/root cmd=…"`.
# Values run until the next whitespace, so `cmd=…` at end-of-line is preserved
# as one unit; we only care about IP-shaped fields here anyway.
_MSG_KV_RE = re.compile(r'(\w+)=(\S+)')
def parse_rfc5424(line: str) -> Optional[dict[str, Any]]: def parse_rfc5424(line: str) -> Optional[dict[str, Any]]:
""" """
@@ -151,6 +160,19 @@ def parse_rfc5424(line: str) -> Optional[dict[str, Any]]:
attacker_ip = fields[fname] attacker_ip = fields[fname]
break break
# Fallback for plain `logger` callers that don't use SD params (notably
# the SSH container's bash PROMPT_COMMAND: `logger -t bash "CMD … src=IP …"`).
# Scan the MSG body for IP-shaped `key=value` tokens ONLY — don't fold
# them into `fields`, because the frontend's parseEventBody already
# renders kv pairs from the msg and doubling them up produces noisy
# duplicate pills. This keeps attacker attribution working without
# changing the shape of `fields` for non-SD lines.
if attacker_ip == "Unknown" and msg:
for k, v in _MSG_KV_RE.findall(msg):
if k in _IP_FIELDS:
attacker_ip = v
break
try: try:
ts_formatted = datetime.fromisoformat(ts_raw).strftime("%Y-%m-%d %H:%M:%S") ts_formatted = datetime.fromisoformat(ts_raw).strftime("%Y-%m-%d %H:%M:%S")
except ValueError: except ValueError:

View File

@@ -63,7 +63,7 @@ def _configure_logging(dev: bool) -> None:
all microservice daemons — which redirect stderr to /dev/null — still all microservice daemons — which redirect stderr to /dev/null — still
produce readable logs. File handler is skipped under pytest. produce readable logs. File handler is skipped under pytest.
""" """
import logging.handlers as _lh from decnet.logging.inode_aware_handler import InodeAwareRotatingFileHandler
root = logging.getLogger() root = logging.getLogger()
# Guard: if our StreamHandler is already installed, all handlers are set. # Guard: if our StreamHandler is already installed, all handlers are set.
@@ -82,7 +82,7 @@ def _configure_logging(dev: bool) -> None:
_in_pytest = any(k.startswith("PYTEST") for k in os.environ) _in_pytest = any(k.startswith("PYTEST") for k in os.environ)
if not _in_pytest: if not _in_pytest:
_log_path = os.environ.get("DECNET_SYSTEM_LOGS", "decnet.system.log") _log_path = os.environ.get("DECNET_SYSTEM_LOGS", "decnet.system.log")
file_handler = _lh.RotatingFileHandler( file_handler = InodeAwareRotatingFileHandler(
_log_path, _log_path,
mode="a", mode="a",
maxBytes=10 * 1024 * 1024, # 10 MB maxBytes=10 * 1024 * 1024, # 10 MB
@@ -91,6 +91,10 @@ def _configure_logging(dev: bool) -> None:
) )
file_handler.setFormatter(fmt) file_handler.setFormatter(fmt)
root.addHandler(file_handler) root.addHandler(file_handler)
# Drop root ownership when invoked via sudo so non-root follow-up
# commands (e.g. `decnet api` after `sudo decnet deploy`) can append.
from decnet.privdrop import chown_to_invoking_user
chown_to_invoking_user(_log_path)
_dev = os.environ.get("DECNET_DEVELOPER", "").lower() == "true" _dev = os.environ.get("DECNET_DEVELOPER", "").lower() == "true"

90
decnet/config_ini.py Normal file
View File

@@ -0,0 +1,90 @@
"""Parse /etc/decnet/decnet.ini and seed os.environ defaults.
The INI file is a convenience layer on top of the existing DECNET_* env
vars. It never overrides an explicit environment variable (uses
os.environ.setdefault). Call load_ini_config() once, very early, before
any decnet.env import, so env.py picks up the seeded values as if they
had been exported by the shell.
Shape::
[decnet]
mode = agent # or "master"
log-directory = /var/log/decnet
disallow-master = true
[agent]
master-host = 192.168.1.50
master-port = 8770
agent-port = 8765
agent-dir = /home/anti/.decnet/agent
...
[master]
api-host = 0.0.0.0
swarmctl-port = 8770
listener-port = 6514
...
Only the section matching `mode` is loaded. The other section is
ignored silently so an agent host never reads master secrets (and
vice versa). Keys are converted to SCREAMING_SNAKE_CASE and prefixed
with ``DECNET_`` — e.g. ``master-host`` → ``DECNET_MASTER_HOST``.
"""
from __future__ import annotations
import configparser
import os
from pathlib import Path
from typing import Optional
DEFAULT_CONFIG_PATH = Path("/etc/decnet/decnet.ini")
# The [decnet] section keys are role-agnostic and always exported.
_COMMON_KEYS = frozenset({"mode", "disallow-master", "log-directory"})
def _key_to_env(key: str) -> str:
return "DECNET_" + key.replace("-", "_").upper()
def load_ini_config(path: Optional[Path] = None) -> Optional[Path]:
"""Seed os.environ defaults from the DECNET INI file.
Returns the path that was actually loaded (so callers can log it), or
None if no file was read. Missing file is a no-op — callers fall back
to env vars / CLI flags / hardcoded defaults.
Precedence: real os.environ > INI > defaults. Real env vars are never
overwritten because we use setdefault().
"""
if path is None:
override = os.environ.get("DECNET_CONFIG")
path = Path(override) if override else DEFAULT_CONFIG_PATH
if not path.is_file():
return None
parser = configparser.ConfigParser()
parser.read(path)
# [decnet] first — mode/disallow-master/log-directory. These seed the
# mode decision for the section selection below.
if parser.has_section("decnet"):
for key, value in parser.items("decnet"):
os.environ.setdefault(_key_to_env(key), value)
mode = os.environ.get("DECNET_MODE", "master").lower()
if mode not in ("agent", "master"):
raise ValueError(
f"decnet.ini: [decnet] mode must be 'agent' or 'master', got '{mode}'"
)
# Role-specific section.
section = mode
if parser.has_section(section):
for key, value in parser.items(section):
os.environ.setdefault(_key_to_env(key), value)
return path

View File

@@ -6,7 +6,7 @@ the fields needed for cross-decky correlation: attacker IP, decky name,
service, event type, and timestamp. service, event type, and timestamp.
Log format (produced by decnet.logging.syslog_formatter): Log format (produced by decnet.logging.syslog_formatter):
<PRI>1 TIMESTAMP HOSTNAME APP-NAME - MSGID [decnet@55555 k1="v1" k2="v2"] [MSG] <PRI>1 TIMESTAMP HOSTNAME APP-NAME - MSGID [relay@55555 k1="v1" k2="v2"] [MSG]
The attacker IP may appear under several field names depending on service: The attacker IP may appear under several field names depending on service:
src_ip — ftp, smtp, http, most services src_ip — ftp, smtp, http, most services
@@ -31,8 +31,8 @@ _RFC5424_RE = re.compile(
r"(.+)$", # 5: SD element + optional MSG r"(.+)$", # 5: SD element + optional MSG
) )
# Structured data block: [decnet@55555 k="v" ...] # Structured data block: [relay@55555 k="v" ...]
_SD_BLOCK_RE = re.compile(r'\[decnet@55555\s+(.*?)\]', re.DOTALL) _SD_BLOCK_RE = re.compile(r'\[relay@55555\s+(.*?)\]', re.DOTALL)
# Individual param: key="value" (with escaped chars inside value) # Individual param: key="value" (with escaped chars inside value)
_PARAM_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"') _PARAM_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

View File

@@ -31,11 +31,11 @@ from decnet.network import (
log = get_logger("engine") log = get_logger("engine")
console = Console() console = Console()
COMPOSE_FILE = Path("decnet-compose.yml") COMPOSE_FILE = Path("decnet-compose.yml")
_CANONICAL_LOGGING = Path(__file__).parent.parent.parent / "templates" / "decnet_logging.py" _CANONICAL_LOGGING = Path(__file__).parent.parent / "templates" / "syslog_bridge.py"
def _sync_logging_helper(config: DecnetConfig) -> None: def _sync_logging_helper(config: DecnetConfig) -> None:
"""Copy the canonical decnet_logging.py into every active template build context.""" """Copy the canonical syslog_bridge.py into every active template build context."""
from decnet.services.registry import get_service from decnet.services.registry import get_service
seen: set[Path] = set() seen: set[Path] = set()
for decky in config.deckies: for decky in config.deckies:
@@ -47,16 +47,32 @@ def _sync_logging_helper(config: DecnetConfig) -> None:
if ctx is None or ctx in seen: if ctx is None or ctx in seen:
continue continue
seen.add(ctx) seen.add(ctx)
dest = ctx / "decnet_logging.py" dest = ctx / "syslog_bridge.py"
if not dest.exists() or dest.read_bytes() != _CANONICAL_LOGGING.read_bytes(): if not dest.exists() or dest.read_bytes() != _CANONICAL_LOGGING.read_bytes():
shutil.copy2(_CANONICAL_LOGGING, dest) shutil.copy2(_CANONICAL_LOGGING, dest)
def _compose(*args: str, compose_file: Path = COMPOSE_FILE, env: dict | None = None) -> None: def _compose(*args: str, compose_file: Path = COMPOSE_FILE, env: dict | None = None) -> None:
import os import os
cmd = ["docker", "compose", "-f", str(compose_file), *args] # -p decnet pins the compose project name. Without it, docker compose
# derives the project from basename($PWD); when a daemon (systemd) runs
# with WorkingDirectory=/ that basename is empty and compose aborts with
# "project name must not be empty".
cmd = ["docker", "compose", "-p", "decnet", "-f", str(compose_file), *args]
merged = {**os.environ, **(env or {})} merged = {**os.environ, **(env or {})}
subprocess.run(cmd, check=True, env=merged) # nosec B603 result = subprocess.run(cmd, capture_output=True, text=True, env=merged) # nosec B603
if result.stdout:
print(result.stdout, end="")
if result.returncode != 0:
# Docker emits the useful detail ("Address already in use", which IP,
# which port) on stderr. Surface it to the structured log so the
# agent's journal carries it — without this the upstream traceback
# just shows the exit code.
if result.stderr:
log.error("docker compose %s failed: %s", " ".join(args), result.stderr.strip())
raise subprocess.CalledProcessError(
result.returncode, cmd, result.stdout, result.stderr
)
_PERMANENT_ERRORS = ( _PERMANENT_ERRORS = (
@@ -79,7 +95,11 @@ def _compose_with_retry(
"""Run a docker compose command, retrying on transient failures.""" """Run a docker compose command, retrying on transient failures."""
import os import os
last_exc: subprocess.CalledProcessError | None = None last_exc: subprocess.CalledProcessError | None = None
cmd = ["docker", "compose", "-f", str(compose_file), *args] # -p decnet pins the compose project name. Without it, docker compose
# derives the project from basename($PWD); when a daemon (systemd) runs
# with WorkingDirectory=/ that basename is empty and compose aborts with
# "project name must not be empty".
cmd = ["docker", "compose", "-p", "decnet", "-f", str(compose_file), *args]
merged = {**os.environ, **(env or {})} merged = {**os.environ, **(env or {})}
for attempt in range(1, retries + 1): for attempt in range(1, retries + 1):
result = subprocess.run(cmd, capture_output=True, text=True, env=merged) # nosec B603 result = subprocess.run(cmd, capture_output=True, text=True, env=merged) # nosec B603
@@ -106,6 +126,8 @@ def _compose_with_retry(
else: else:
if result.stderr: if result.stderr:
console.print(f"[red]{result.stderr.strip()}[/]") console.print(f"[red]{result.stderr.strip()}[/]")
log.error("docker compose %s failed after %d attempts: %s",
" ".join(args), retries, result.stderr.strip())
raise last_exc raise last_exc
@@ -154,6 +176,15 @@ def deploy(config: DecnetConfig, dry_run: bool = False, no_cache: bool = False,
save_state(config, compose_path) save_state(config, compose_path)
# Pre-up cleanup: a prior half-failed `up` can leave containers still
# holding the IPs/ports this run wants, which surfaces as the recurring
# "Address already in use" from Docker's IPAM. Best-effort — ignore
# failure (e.g. nothing to tear down on a clean host).
try:
_compose("down", "--remove-orphans", compose_file=compose_path)
except subprocess.CalledProcessError:
log.debug("pre-up cleanup: compose down failed (likely nothing to remove)")
build_env = {"DOCKER_BUILDKIT": "1"} if parallel else {} build_env = {"DOCKER_BUILDKIT": "1"} if parallel else {}
console.print("[bold cyan]Building images and starting deckies...[/]") console.print("[bold cyan]Building images and starting deckies...[/]")
@@ -187,10 +218,14 @@ def teardown(decky_id: str | None = None) -> None:
client = docker.from_env() client = docker.from_env()
if decky_id: if decky_id:
svc_names = [f"{decky_id}-{svc}" for svc in [d.services for d in config.deckies if d.name == decky_id]] decky = next((d for d in config.deckies if d.name == decky_id), None)
if not svc_names: if decky is None:
console.print(f"[red]Decky '{decky_id}' not found in current deployment.[/]") console.print(f"[red]Decky '{decky_id}' not found in current deployment.[/]")
return return
svc_names = [f"{decky_id}-{svc}" for svc in decky.services]
if not svc_names:
log.warning("teardown: decky %s has no services to stop", decky_id)
return
_compose("stop", *svc_names, compose_file=compose_path) _compose("stop", *svc_names, compose_file=compose_path)
_compose("rm", "-f", *svc_names, compose_file=compose_path) _compose("rm", "-f", *svc_names, compose_file=compose_path)
else: else:

View File

@@ -6,9 +6,14 @@ from dotenv import load_dotenv
# Calculate absolute path to the project root # Calculate absolute path to the project root
_ROOT: Path = Path(__file__).parent.parent.absolute() _ROOT: Path = Path(__file__).parent.parent.absolute()
# Load .env.local first, then fallback to .env # Load .env.local first, then fallback to .env.
# Also check CWD so deployments that install into site-packages (e.g. the
# self-updater's release slots) can ship a per-host .env.local at the
# process's working directory without having to edit site-packages.
load_dotenv(_ROOT / ".env.local") load_dotenv(_ROOT / ".env.local")
load_dotenv(_ROOT / ".env") load_dotenv(_ROOT / ".env")
load_dotenv(Path.cwd() / ".env.local")
load_dotenv(Path.cwd() / ".env")
def _port(name: str, default: int) -> int: def _port(name: str, default: int) -> int:
@@ -59,12 +64,45 @@ DECNET_SYSTEM_LOGS: str = os.environ.get("DECNET_SYSTEM_LOGS", "decnet.system.lo
# which causes events to be skipped or processed twice. # which causes events to be skipped or processed twice.
DECNET_EMBED_PROFILER: bool = os.environ.get("DECNET_EMBED_PROFILER", "").lower() == "true" DECNET_EMBED_PROFILER: bool = os.environ.get("DECNET_EMBED_PROFILER", "").lower() == "true"
# Set to "true" to embed the MACVLAN sniffer inside the API process.
# Leave unset (default) when the standalone `decnet sniffer --daemon` is
# running (which `decnet deploy` always does). Embedding both produces two
# workers sniffing the same interface — duplicated events and wasted CPU.
DECNET_EMBED_SNIFFER: bool = os.environ.get("DECNET_EMBED_SNIFFER", "").lower() == "true"
# Set to "true" to mount the Pyinstrument ASGI middleware on the FastAPI app.
# Produces per-request HTML flamegraphs under ./profiles/. Off by default so
# production and normal dev runs pay zero profiling overhead.
DECNET_PROFILE_REQUESTS: bool = os.environ.get("DECNET_PROFILE_REQUESTS", "").lower() == "true"
DECNET_PROFILE_DIR: str = os.environ.get("DECNET_PROFILE_DIR", "profiles")
# API Options # API Options
DECNET_API_HOST: str = os.environ.get("DECNET_API_HOST", "127.0.0.1") DECNET_API_HOST: str = os.environ.get("DECNET_API_HOST", "127.0.0.1")
DECNET_API_PORT: int = _port("DECNET_API_PORT", 8000) DECNET_API_PORT: int = _port("DECNET_API_PORT", 8000)
DECNET_JWT_SECRET: str = _require_env("DECNET_JWT_SECRET") # DECNET_JWT_SECRET is resolved lazily via module __getattr__ so that agent /
# updater / swarmctl subcommands (which never touch auth) can start without
# the master's JWT secret being present in the environment.
DECNET_INGEST_LOG_FILE: str | None = os.environ.get("DECNET_INGEST_LOG_FILE", "/var/log/decnet/decnet.log") DECNET_INGEST_LOG_FILE: str | None = os.environ.get("DECNET_INGEST_LOG_FILE", "/var/log/decnet/decnet.log")
# SWARM log pipeline — RFC 5425 syslog-over-TLS between worker forwarders
# and the master listener. Plaintext syslog across hosts is forbidden.
DECNET_SWARM_SYSLOG_PORT: int = _port("DECNET_SWARM_SYSLOG_PORT", 6514)
DECNET_SWARM_MASTER_HOST: str | None = os.environ.get("DECNET_SWARM_MASTER_HOST")
# Worker-side identity + swarmctl locator, seeded by the enroll bundle's
# /etc/decnet/decnet.ini ([agent] host-uuid / master-host / swarmctl-port).
# The agent heartbeat loop uses these to self-identify to the master.
DECNET_HOST_UUID: str | None = os.environ.get("DECNET_HOST_UUID")
DECNET_MASTER_HOST: str | None = os.environ.get("DECNET_MASTER_HOST")
DECNET_SWARMCTL_PORT: int = _port("DECNET_SWARMCTL_PORT", 8770)
# Ingester batching: how many log rows to accumulate per commit, and the
# max wait (ms) before flushing a partial batch. Larger batches reduce
# SQLite write-lock contention; the timeout keeps latency bounded during
# low-traffic periods.
DECNET_BATCH_SIZE: int = int(os.environ.get("DECNET_BATCH_SIZE", "100"))
DECNET_BATCH_MAX_WAIT_MS: int = int(os.environ.get("DECNET_BATCH_MAX_WAIT_MS", "250"))
# Web Dashboard Options # Web Dashboard Options
DECNET_WEB_HOST: str = os.environ.get("DECNET_WEB_HOST", "127.0.0.1") DECNET_WEB_HOST: str = os.environ.get("DECNET_WEB_HOST", "127.0.0.1")
DECNET_WEB_PORT: int = _port("DECNET_WEB_PORT", 8080) DECNET_WEB_PORT: int = _port("DECNET_WEB_PORT", 8080)
@@ -72,6 +110,17 @@ DECNET_ADMIN_USER: str = os.environ.get("DECNET_ADMIN_USER", "admin")
DECNET_ADMIN_PASSWORD: str = os.environ.get("DECNET_ADMIN_PASSWORD", "admin") DECNET_ADMIN_PASSWORD: str = os.environ.get("DECNET_ADMIN_PASSWORD", "admin")
DECNET_DEVELOPER: bool = os.environ.get("DECNET_DEVELOPER", "False").lower() == "true" DECNET_DEVELOPER: bool = os.environ.get("DECNET_DEVELOPER", "False").lower() == "true"
# Host role — seeded by /etc/decnet/decnet.ini or exported directly.
# "master" = the central server (api, web, swarmctl, listener).
# "agent" = a worker node (agent, forwarder, updater). Workers gate their
# Typer CLI to hide master-only commands (see decnet/cli.py).
DECNET_MODE: str = os.environ.get("DECNET_MODE", "master").lower()
# When mode=agent, hide master-only Typer commands. Set to "false" for dual-
# role dev hosts where a single machine plays both sides.
DECNET_DISALLOW_MASTER: bool = (
os.environ.get("DECNET_DISALLOW_MASTER", "true").lower() == "true"
)
# Tracing — set to "true" to enable OpenTelemetry distributed tracing. # Tracing — set to "true" to enable OpenTelemetry distributed tracing.
# Separate from DECNET_DEVELOPER so tracing can be toggled independently. # Separate from DECNET_DEVELOPER so tracing can be toggled independently.
DECNET_DEVELOPER_TRACING: bool = os.environ.get("DECNET_DEVELOPER_TRACING", "").lower() == "true" DECNET_DEVELOPER_TRACING: bool = os.environ.get("DECNET_DEVELOPER_TRACING", "").lower() == "true"
@@ -95,3 +144,10 @@ _web_hostname: str = "localhost" if DECNET_WEB_HOST in _WILDCARD_ADDRS else DECN
_cors_default: str = f"http://{_web_hostname}:{DECNET_WEB_PORT}" _cors_default: str = f"http://{_web_hostname}:{DECNET_WEB_PORT}"
_cors_raw: str = os.environ.get("DECNET_CORS_ORIGINS", _cors_default) _cors_raw: str = os.environ.get("DECNET_CORS_ORIGINS", _cors_default)
DECNET_CORS_ORIGINS: list[str] = [o.strip() for o in _cors_raw.split(",") if o.strip()] DECNET_CORS_ORIGINS: list[str] = [o.strip() for o in _cors_raw.split(",") if o.strip()]
def __getattr__(name: str) -> str:
"""Lazy resolution for secrets only the master web/api process needs."""
if name == "DECNET_JWT_SECRET":
return _require_env("DECNET_JWT_SECRET")
raise AttributeError(f"module 'decnet.env' has no attribute {name!r}")

View File

@@ -13,6 +13,8 @@ import logging.handlers
import os import os
from pathlib import Path from pathlib import Path
from decnet.logging.inode_aware_handler import InodeAwareRotatingFileHandler
from decnet.privdrop import chown_to_invoking_user, chown_tree_to_invoking_user
from decnet.telemetry import traced as _traced from decnet.telemetry import traced as _traced
_LOG_FILE_ENV = "DECNET_LOG_FILE" _LOG_FILE_ENV = "DECNET_LOG_FILE"
@@ -20,7 +22,7 @@ _DEFAULT_LOG_FILE = "/var/log/decnet/decnet.log"
_MAX_BYTES = 10 * 1024 * 1024 # 10 MB _MAX_BYTES = 10 * 1024 * 1024 # 10 MB
_BACKUP_COUNT = 5 _BACKUP_COUNT = 5
_handler: logging.handlers.RotatingFileHandler | None = None _handler: InodeAwareRotatingFileHandler | None = None
_logger: logging.Logger | None = None _logger: logging.Logger | None = None
@@ -31,13 +33,17 @@ def _init_file_handler() -> logging.Logger:
log_path = Path(os.environ.get(_LOG_FILE_ENV, _DEFAULT_LOG_FILE)) log_path = Path(os.environ.get(_LOG_FILE_ENV, _DEFAULT_LOG_FILE))
log_path.parent.mkdir(parents=True, exist_ok=True) log_path.parent.mkdir(parents=True, exist_ok=True)
# When running under sudo, hand the parent dir back to the invoking user
# so a subsequent non-root `decnet api` can also write to it.
chown_tree_to_invoking_user(log_path.parent)
_handler = logging.handlers.RotatingFileHandler( _handler = InodeAwareRotatingFileHandler(
log_path, log_path,
maxBytes=_MAX_BYTES, maxBytes=_MAX_BYTES,
backupCount=_BACKUP_COUNT, backupCount=_BACKUP_COUNT,
encoding="utf-8", encoding="utf-8",
) )
chown_to_invoking_user(log_path)
_handler.setFormatter(logging.Formatter("%(message)s")) _handler.setFormatter(logging.Formatter("%(message)s"))
_logger = logging.getLogger("decnet.syslog") _logger = logging.getLogger("decnet.syslog")

View File

@@ -0,0 +1,60 @@
"""
RotatingFileHandler that detects external deletion or rotation.
Stdlib ``RotatingFileHandler`` holds an open file descriptor for the
lifetime of the handler. If the target file is deleted (``rm``) or
rotated out (``logrotate`` without ``copytruncate``), the handler keeps
writing to the now-orphaned inode until its own size-based rotation
finally triggers — silently losing every line in between.
Stdlib ``WatchedFileHandler`` solves exactly this problem but doesn't
rotate by size. This subclass combines both: before each emit we stat
the configured path and compare its inode/device to the currently open
file; on mismatch we close and reopen.
Cheap: one ``os.stat`` per log record. Matches the pattern used by
``decnet/collector/worker.py:_reopen_if_needed``.
"""
from __future__ import annotations
import logging
import logging.handlers
import os
class InodeAwareRotatingFileHandler(logging.handlers.RotatingFileHandler):
"""RotatingFileHandler that reopens the target on external rotation/deletion."""
def _should_reopen(self) -> bool:
if self.stream is None:
return True
try:
disk_stat = os.stat(self.baseFilename)
except FileNotFoundError:
return True
except OSError:
return False
try:
open_stat = os.fstat(self.stream.fileno())
except OSError:
return True
return (disk_stat.st_ino != open_stat.st_ino
or disk_stat.st_dev != open_stat.st_dev)
def emit(self, record: logging.LogRecord) -> None:
if self._should_reopen():
try:
if self.stream is not None:
self.close()
except Exception: # nosec B110
pass
try:
self.stream = self._open()
except OSError:
# A logging handler MUST NOT crash its caller. If we can't
# reopen (e.g. file is root-owned after `sudo decnet deploy`
# and the current process is non-root), defer to the stdlib
# error path, which just prints a traceback to stderr.
self.handleError(record)
return
super().emit(record)

View File

@@ -5,7 +5,7 @@ Produces fully-compliant syslog messages:
<PRI>1 TIMESTAMP HOSTNAME APP-NAME PROCID MSGID [SD-ELEMENT] MSG <PRI>1 TIMESTAMP HOSTNAME APP-NAME PROCID MSGID [SD-ELEMENT] MSG
Facility: local0 (16) Facility: local0 (16)
PEN for structured data: decnet@55555 PEN for structured data: relay@55555
""" """
from __future__ import annotations from __future__ import annotations
@@ -16,7 +16,7 @@ from typing import Any
FACILITY_LOCAL0 = 16 FACILITY_LOCAL0 = 16
NILVALUE = "-" NILVALUE = "-"
_SD_ID = "decnet@55555" _SD_ID = "relay@55555"
SEVERITY_INFO = 6 SEVERITY_INFO = 6
SEVERITY_WARNING = 4 SEVERITY_WARNING = 4

View File

@@ -99,6 +99,9 @@ class DeckyConfig(BaseModel):
mutate_interval: int | None = None # automatic rotation interval in minutes mutate_interval: int | None = None # automatic rotation interval in minutes
last_mutated: float = 0.0 # timestamp of last mutation last_mutated: float = 0.0 # timestamp of last mutation
last_login_attempt: float = 0.0 # timestamp of most recent interaction last_login_attempt: float = 0.0 # timestamp of most recent interaction
# SWARM: the SwarmHost.uuid that runs this decky. None in unihost mode
# so existing state files deserialize unchanged.
host_uuid: str | None = None
@field_validator("services") @field_validator("services")
@classmethod @classmethod

View File

@@ -126,22 +126,57 @@ def allocate_ips(
# Docker MACVLAN network # Docker MACVLAN network
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
def create_macvlan_network( def _ensure_network(
client: docker.DockerClient, client: docker.DockerClient,
*,
driver: str,
interface: str, interface: str,
subnet: str, subnet: str,
gateway: str, gateway: str,
ip_range: str, ip_range: str,
extra_options: dict | None = None,
) -> None: ) -> None:
"""Create the MACVLAN Docker network. No-op if it already exists.""" """Create the decnet docker network with ``driver``, replacing any
existing = [n.name for n in client.networks.list()] existing network of the same name that was built with a different driver.
if MACVLAN_NETWORK_NAME in existing:
return Why the replace-on-driver-mismatch: macvlan and ipvlan slaves can't
coexist on the same parent interface. If an earlier run left behind a
macvlan-driver network and we're now asked for ipvlan (or vice versa),
short-circuiting on name alone leaves Docker attaching new containers
to the old driver and the host NIC ends up EBUSY on the next port
create. So: when driver disagrees, disconnect everything and DROP it.
"""
options = {"parent": interface}
if extra_options:
options.update(extra_options)
for net in client.networks.list(names=[MACVLAN_NETWORK_NAME]):
if net.attrs.get("Driver") == driver:
# Same driver — but if the IPAM pool drifted (different subnet,
# gateway, or ip-range than this deploy asks for), reusing it
# hands out addresses from the old pool and we race the real LAN.
# Compare and rebuild on mismatch.
pools = (net.attrs.get("IPAM") or {}).get("Config") or []
cur = pools[0] if pools else {}
if (
cur.get("Subnet") == subnet
and cur.get("Gateway") == gateway
and cur.get("IPRange") == ip_range
):
return # right driver AND matching pool, leave it alone
# Driver mismatch OR IPAM drift — tear it down. Disconnect any live
# containers first so `remove()` doesn't refuse with ErrNetworkInUse.
for cid in (net.attrs.get("Containers") or {}):
try:
net.disconnect(cid, force=True)
except docker.errors.APIError:
pass
net.remove()
client.networks.create( client.networks.create(
name=MACVLAN_NETWORK_NAME, name=MACVLAN_NETWORK_NAME,
driver="macvlan", driver=driver,
options={"parent": interface}, options=options,
ipam=docker.types.IPAMConfig( ipam=docker.types.IPAMConfig(
driver="default", driver="default",
pool_configs=[ pool_configs=[
@@ -155,6 +190,21 @@ def create_macvlan_network(
) )
def create_macvlan_network(
client: docker.DockerClient,
interface: str,
subnet: str,
gateway: str,
ip_range: str,
) -> None:
"""Create the MACVLAN Docker network, replacing an ipvlan-driver one of
the same name if necessary (parent-NIC can't host both drivers)."""
_ensure_network(
client, driver="macvlan", interface=interface,
subnet=subnet, gateway=gateway, ip_range=ip_range,
)
def create_ipvlan_network( def create_ipvlan_network(
client: docker.DockerClient, client: docker.DockerClient,
interface: str, interface: str,
@@ -162,25 +212,12 @@ def create_ipvlan_network(
gateway: str, gateway: str,
ip_range: str, ip_range: str,
) -> None: ) -> None:
"""Create an IPvlan L2 Docker network. No-op if it already exists.""" """Create an IPvlan L2 Docker network, replacing a macvlan-driver one of
existing = [n.name for n in client.networks.list()] the same name if necessary (parent-NIC can't host both drivers)."""
if MACVLAN_NETWORK_NAME in existing: _ensure_network(
return client, driver="ipvlan", interface=interface,
subnet=subnet, gateway=gateway, ip_range=ip_range,
client.networks.create( extra_options={"ipvlan_mode": "l2"},
name=MACVLAN_NETWORK_NAME,
driver="ipvlan",
options={"parent": interface, "ipvlan_mode": "l2"},
ipam=docker.types.IPAMConfig(
driver="default",
pool_configs=[
docker.types.IPAMPool(
subnet=subnet,
gateway=gateway,
iprange=ip_range,
)
],
),
) )
@@ -204,10 +241,14 @@ def _require_root() -> None:
def setup_host_macvlan(interface: str, host_macvlan_ip: str, decky_ip_range: str) -> None: def setup_host_macvlan(interface: str, host_macvlan_ip: str, decky_ip_range: str) -> None:
""" """
Create a macvlan interface on the host so the deployer can reach deckies. Create a macvlan interface on the host so the deployer can reach deckies.
Idempotent — skips steps that are already done. Idempotent — skips steps that are already done. Drops a stale ipvlan
host-helper first: the two drivers can share a parent NIC on paper but
leaving the opposite helper in place is just cruft after a driver swap.
""" """
_require_root() _require_root()
_run(["ip", "link", "del", HOST_IPVLAN_IFACE], check=False)
# Check if interface already exists # Check if interface already exists
result = _run(["ip", "link", "show", HOST_MACVLAN_IFACE], check=False) result = _run(["ip", "link", "show", HOST_MACVLAN_IFACE], check=False)
if result.returncode != 0: if result.returncode != 0:
@@ -227,10 +268,14 @@ def teardown_host_macvlan(decky_ip_range: str) -> None:
def setup_host_ipvlan(interface: str, host_ipvlan_ip: str, decky_ip_range: str) -> None: def setup_host_ipvlan(interface: str, host_ipvlan_ip: str, decky_ip_range: str) -> None:
""" """
Create an IPvlan interface on the host so the deployer can reach deckies. Create an IPvlan interface on the host so the deployer can reach deckies.
Idempotent — skips steps that are already done. Idempotent — skips steps that are already done. Drops a stale macvlan
host-helper first so a prior macvlan deploy doesn't leave its slave
dangling on the parent NIC after the driver swap.
""" """
_require_root() _require_root()
_run(["ip", "link", "del", HOST_MACVLAN_IFACE], check=False)
result = _run(["ip", "link", "show", HOST_IPVLAN_IFACE], check=False) result = _run(["ip", "link", "show", HOST_IPVLAN_IFACE], check=False)
if result.returncode != 0: if result.returncode != 0:
_run(["ip", "link", "add", HOST_IPVLAN_IFACE, "link", interface, "type", "ipvlan", "mode", "l2"]) _run(["ip", "link", "add", HOST_IPVLAN_IFACE, "link", interface, "type", "ipvlan", "mode", "l2"])

67
decnet/privdrop.py Normal file
View File

@@ -0,0 +1,67 @@
"""
Helpers for dropping root ownership on files created during privileged
operations (e.g. `sudo decnet deploy` needs root for MACVLAN, but its log
files should be owned by the invoking user so a subsequent non-root
`decnet api` can append to them).
When sudo invokes a process, it sets SUDO_UID / SUDO_GID in the
environment to the original user's IDs. We use those to chown files
back after creation.
"""
from __future__ import annotations
import os
from pathlib import Path
from typing import Optional
def _sudo_ids() -> Optional[tuple[int, int]]:
"""Return (uid, gid) of the sudo-invoking user, or None when the
process was not launched via sudo / the env vars are missing."""
raw_uid = os.environ.get("SUDO_UID")
raw_gid = os.environ.get("SUDO_GID")
if not raw_uid or not raw_gid:
return None
try:
return int(raw_uid), int(raw_gid)
except ValueError:
return None
def chown_to_invoking_user(path: str | os.PathLike[str]) -> None:
"""Best-effort chown of *path* to the sudo-invoking user.
No-op when:
* not running as root (nothing to drop),
* not launched via sudo (no SUDO_UID/SUDO_GID),
* the path does not exist,
* chown fails (logged-only — never raises).
"""
if os.geteuid() != 0:
return
ids = _sudo_ids()
if ids is None:
return
uid, gid = ids
p = Path(path)
if not p.exists():
return
try:
os.chown(p, uid, gid)
except OSError:
# Best-effort; a failed chown is not fatal to logging.
pass
def chown_tree_to_invoking_user(root: str | os.PathLike[str]) -> None:
"""Apply :func:`chown_to_invoking_user` to *root* and every file/dir
beneath it. Used for parent directories that we just created with
``mkdir(parents=True)`` as root."""
if os.geteuid() != 0 or _sudo_ids() is None:
return
root_path = Path(root)
if not root_path.exists():
return
chown_to_invoking_user(root_path)
for entry in root_path.rglob("*"):
chown_to_invoking_user(entry)

View File

@@ -51,7 +51,7 @@ DEFAULT_TCPFP_PORTS: list[int] = [22, 80, 443, 8080, 8443, 445, 3389]
# ─── RFC 5424 formatting (inline, mirrors templates/*/decnet_logging.py) ───── # ─── RFC 5424 formatting (inline, mirrors templates/*/decnet_logging.py) ─────
_FACILITY_LOCAL0 = 16 _FACILITY_LOCAL0 = 16
_SD_ID = "decnet@55555" _SD_ID = "relay@55555"
_SEVERITY_INFO = 6 _SEVERITY_INFO = 6
_SEVERITY_WARNING = 4 _SEVERITY_WARNING = 4
@@ -98,7 +98,7 @@ _RFC5424_RE = re.compile(
r"(\S+) " # 4: MSGID (event_type) r"(\S+) " # 4: MSGID (event_type)
r"(.+)$", # 5: SD + MSG r"(.+)$", # 5: SD + MSG
) )
_SD_BLOCK_RE = re.compile(r'\[decnet@55555\s+(.*?)\]', re.DOTALL) _SD_BLOCK_RE = re.compile(r'\[relay@55555\s+(.*?)\]', re.DOTALL)
_PARAM_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"') _PARAM_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')
_IP_FIELDS = ("src_ip", "src", "client_ip", "remote_ip", "ip", "target_ip") _IP_FIELDS = ("src_ip", "src", "client_ip", "remote_ip", "ip", "target_ip")

View File

@@ -32,4 +32,4 @@ class ConpotService(BaseService):
} }
def dockerfile_context(self): def dockerfile_context(self):
return Path(__file__).parent.parent.parent / "templates" / "conpot" return Path(__file__).parent.parent / "templates" / "conpot"

View File

@@ -1,7 +1,7 @@
from pathlib import Path from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "docker_api" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "docker_api"
class DockerAPIService(BaseService): class DockerAPIService(BaseService):

View File

@@ -2,7 +2,7 @@ from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "elasticsearch" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "elasticsearch"
class ElasticsearchService(BaseService): class ElasticsearchService(BaseService):

View File

@@ -1,7 +1,7 @@
from pathlib import Path from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "ftp" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "ftp"
class FTPService(BaseService): class FTPService(BaseService):

View File

@@ -2,7 +2,7 @@ import json
from pathlib import Path from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "http" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "http"
class HTTPService(BaseService): class HTTPService(BaseService):

View File

@@ -2,7 +2,7 @@ import json
from pathlib import Path from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "https" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "https"
class HTTPSService(BaseService): class HTTPSService(BaseService):

View File

@@ -1,7 +1,7 @@
from pathlib import Path from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "imap" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "imap"
class IMAPService(BaseService): class IMAPService(BaseService):

View File

@@ -1,7 +1,7 @@
from pathlib import Path from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "k8s" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "k8s"
class KubernetesAPIService(BaseService): class KubernetesAPIService(BaseService):

View File

@@ -1,7 +1,7 @@
from pathlib import Path from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "ldap" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "ldap"
class LDAPService(BaseService): class LDAPService(BaseService):

View File

@@ -1,7 +1,7 @@
from pathlib import Path from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "llmnr" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "llmnr"
class LLMNRService(BaseService): class LLMNRService(BaseService):

View File

@@ -1,7 +1,7 @@
from pathlib import Path from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "mongodb" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "mongodb"
class MongoDBService(BaseService): class MongoDBService(BaseService):

View File

@@ -1,7 +1,7 @@
from pathlib import Path from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "mqtt" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "mqtt"
class MQTTService(BaseService): class MQTTService(BaseService):

View File

@@ -1,7 +1,7 @@
from pathlib import Path from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "mssql" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "mssql"
class MSSQLService(BaseService): class MSSQLService(BaseService):

View File

@@ -1,7 +1,7 @@
from pathlib import Path from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "mysql" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "mysql"
class MySQLService(BaseService): class MySQLService(BaseService):

View File

@@ -1,7 +1,7 @@
from pathlib import Path from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "pop3" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "pop3"
class POP3Service(BaseService): class POP3Service(BaseService):

View File

@@ -1,7 +1,7 @@
from pathlib import Path from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "postgres" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "postgres"
class PostgresService(BaseService): class PostgresService(BaseService):

View File

@@ -1,7 +1,7 @@
from pathlib import Path from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "rdp" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "rdp"
class RDPService(BaseService): class RDPService(BaseService):

View File

@@ -1,7 +1,7 @@
from pathlib import Path from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "redis" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "redis"
class RedisService(BaseService): class RedisService(BaseService):

View File

@@ -1,7 +1,7 @@
from pathlib import Path from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "sip" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "sip"
class SIPService(BaseService): class SIPService(BaseService):

View File

@@ -1,7 +1,7 @@
from pathlib import Path from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "smb" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "smb"
class SMBService(BaseService): class SMBService(BaseService):

View File

@@ -2,7 +2,7 @@ from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "smtp" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "smtp"
class SMTPService(BaseService): class SMTPService(BaseService):

View File

@@ -4,7 +4,7 @@ from decnet.services.base import BaseService
# Reuses the same template as the smtp service — only difference is # Reuses the same template as the smtp service — only difference is
# SMTP_OPEN_RELAY=1 in the environment, which enables the open relay persona. # SMTP_OPEN_RELAY=1 in the environment, which enables the open relay persona.
_TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "smtp" _TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "smtp"
class SMTPRelayService(BaseService): class SMTPRelayService(BaseService):

View File

@@ -1,7 +1,7 @@
from pathlib import Path from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "sniffer" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "sniffer"
class SnifferService(BaseService): class SnifferService(BaseService):

View File

@@ -1,7 +1,7 @@
from pathlib import Path from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "snmp" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "snmp"
class SNMPService(BaseService): class SNMPService(BaseService):

View File

@@ -2,7 +2,7 @@ from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "ssh" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "ssh"
class SSHService(BaseService): class SSHService(BaseService):
@@ -32,16 +32,28 @@ class SSHService(BaseService):
cfg = service_cfg or {} cfg = service_cfg or {}
env: dict = { env: dict = {
"SSH_ROOT_PASSWORD": cfg.get("password", "admin"), "SSH_ROOT_PASSWORD": cfg.get("password", "admin"),
# NODE_NAME is the authoritative decky identifier for log
# attribution — matches the host path used for the artifacts
# bind mount below. The container hostname (optionally overridden
# via SSH_HOSTNAME) is cosmetic and may differ to keep the
# decoy looking heterogeneous.
"NODE_NAME": decky_name,
} }
if "hostname" in cfg: if "hostname" in cfg:
env["SSH_HOSTNAME"] = cfg["hostname"] env["SSH_HOSTNAME"] = cfg["hostname"]
# File-catcher quarantine: bind-mount a per-decky host dir so attacker
# drops (scp/sftp/wget) are mirrored out-of-band for forensic analysis.
# The in-container path masquerades as systemd-coredump so `mount`/`df`
# from inside the container looks benign.
quarantine_host = f"/var/lib/decnet/artifacts/{decky_name}/ssh"
return { return {
"build": {"context": str(TEMPLATES_DIR)}, "build": {"context": str(TEMPLATES_DIR)},
"container_name": f"{decky_name}-ssh", "container_name": f"{decky_name}-ssh",
"restart": "unless-stopped", "restart": "unless-stopped",
"cap_add": ["NET_BIND_SERVICE"], "cap_add": ["NET_BIND_SERVICE"],
"environment": env, "environment": env,
"volumes": [f"{quarantine_host}:/var/lib/systemd/coredump:rw"],
} }
def dockerfile_context(self) -> Path: def dockerfile_context(self) -> Path:

View File

@@ -2,7 +2,7 @@ from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "telnet" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "telnet"
class TelnetService(BaseService): class TelnetService(BaseService):

View File

@@ -1,7 +1,7 @@
from pathlib import Path from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "tftp" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "tftp"
class TFTPService(BaseService): class TFTPService(BaseService):

View File

@@ -1,7 +1,7 @@
from pathlib import Path from pathlib import Path
from decnet.services.base import BaseService from decnet.services.base import BaseService
TEMPLATES_DIR = Path(__file__).parent.parent.parent / "templates" / "vnc" TEMPLATES_DIR = Path(__file__).parent.parent / "templates" / "vnc"
class VNCService(BaseService): class VNCService(BaseService):

View File

@@ -16,7 +16,7 @@ from decnet.telemetry import traced as _traced
# ─── Constants (must match templates/sniffer/decnet_logging.py) ────────────── # ─── Constants (must match templates/sniffer/decnet_logging.py) ──────────────
_FACILITY_LOCAL0 = 16 _FACILITY_LOCAL0 = 16
_SD_ID = "decnet@55555" _SD_ID = "relay@55555"
_NILVALUE = "-" _NILVALUE = "-"
SEVERITY_INFO = 6 SEVERITY_INFO = 6

View File

@@ -18,7 +18,7 @@ from concurrent.futures import ThreadPoolExecutor
from pathlib import Path from pathlib import Path
from decnet.logging import get_logger from decnet.logging import get_logger
from decnet.network import HOST_MACVLAN_IFACE from decnet.network import HOST_IPVLAN_IFACE, HOST_MACVLAN_IFACE
from decnet.sniffer.fingerprint import SnifferEngine from decnet.sniffer.fingerprint import SnifferEngine
from decnet.sniffer.syslog import write_event from decnet.sniffer.syslog import write_event
from decnet.telemetry import traced as _traced from decnet.telemetry import traced as _traced
@@ -119,7 +119,23 @@ async def sniffer_worker(log_file: str) -> None:
cleanly. The API continues running regardless of sniffer state. cleanly. The API continues running regardless of sniffer state.
""" """
try: try:
interface = os.environ.get("DECNET_SNIFFER_IFACE", HOST_MACVLAN_IFACE) # Interface selection: explicit env override wins, otherwise probe
# both the MACVLAN and IPvlan host-side names since the driver
# choice is per-deploy (--ipvlan flag).
env_iface = os.environ.get("DECNET_SNIFFER_IFACE")
if env_iface:
interface = env_iface
elif _interface_exists(HOST_MACVLAN_IFACE):
interface = HOST_MACVLAN_IFACE
elif _interface_exists(HOST_IPVLAN_IFACE):
interface = HOST_IPVLAN_IFACE
else:
logger.warning(
"sniffer: neither %s nor %s found — sniffer disabled "
"(fleet may not be deployed yet)",
HOST_MACVLAN_IFACE, HOST_IPVLAN_IFACE,
)
return
if not _interface_exists(interface): if not _interface_exists(interface):
logger.warning( logger.warning(

7
decnet/swarm/__init__.py Normal file
View File

@@ -0,0 +1,7 @@
"""DECNET SWARM — multihost deployment subsystem.
Components:
* ``pki`` — X.509 CA + CSR signing used by all swarm mTLS channels
* ``client`` — master-side HTTP client that talks to remote workers
* ``log_forwarder``— worker-side syslog-over-TLS (RFC 5425) forwarder
"""

200
decnet/swarm/client.py Normal file
View File

@@ -0,0 +1,200 @@
"""Master-side HTTP client that talks to a worker's DECNET agent.
All traffic is mTLS: the master presents a cert issued by its own CA (which
workers trust) and the master validates the worker's cert against the same
CA. In practice the "client cert" the master shows is just another cert
signed by itself — the master is both the CA and the sole control-plane
client.
Usage:
async with AgentClient(host) as agent:
await agent.deploy(config)
status = await agent.status()
The ``host`` is a SwarmHost dict returned by the repository.
"""
from __future__ import annotations
import pathlib
import ssl
from dataclasses import dataclass
from typing import Any, Optional
import httpx
from decnet.config import DecnetConfig
from decnet.logging import get_logger
from decnet.swarm import pki
log = get_logger("swarm.client")
# How long a single HTTP operation can take. Deploy is the long pole —
# docker compose up pulls images, builds contexts, etc. Tune via env in a
# later iteration if the default proves too short.
_TIMEOUT_DEPLOY = httpx.Timeout(connect=10.0, read=600.0, write=30.0, pool=5.0)
_TIMEOUT_CONTROL = httpx.Timeout(connect=5.0, read=15.0, write=5.0, pool=5.0)
@dataclass(frozen=True)
class MasterIdentity:
"""Paths to the master's own mTLS client bundle.
The master uses ONE master-client cert to talk to every worker. It is
signed by the DECNET CA (same CA that signs worker certs). Stored
under ``~/.decnet/ca/master/`` by ``ensure_master_identity``.
"""
key_path: pathlib.Path
cert_path: pathlib.Path
ca_cert_path: pathlib.Path
def ensure_master_identity(
ca_dir: pathlib.Path = pki.DEFAULT_CA_DIR,
) -> MasterIdentity:
"""Create (or load) the master's own client cert.
Called once by the swarm controller on startup and by the CLI before
any master→worker call. Idempotent.
"""
ca = pki.ensure_ca(ca_dir)
master_dir = ca_dir / "master"
bundle = pki.load_worker_bundle(master_dir)
if bundle is None:
issued = pki.issue_worker_cert(ca, "decnet-master", ["127.0.0.1", "decnet-master"])
pki.write_worker_bundle(issued, master_dir)
return MasterIdentity(
key_path=master_dir / "worker.key",
cert_path=master_dir / "worker.crt",
ca_cert_path=master_dir / "ca.crt",
)
class AgentClient:
"""Thin async wrapper around the worker agent's HTTP API."""
def __init__(
self,
host: dict[str, Any] | None = None,
*,
address: Optional[str] = None,
agent_port: Optional[int] = None,
identity: Optional[MasterIdentity] = None,
verify_hostname: bool = False,
):
"""Either pass a SwarmHost dict, or explicit address/port.
``verify_hostname`` stays False by default because the worker's
cert SAN is populated from the operator-supplied address list, not
from modern TLS hostname-verification semantics. The mTLS client
cert + CA pinning are what authenticate the peer.
"""
if host is not None:
self._address = host["address"]
self._port = int(host.get("agent_port") or 8765)
self._host_uuid = host.get("uuid")
self._host_name = host.get("name")
else:
if address is None or agent_port is None:
raise ValueError(
"AgentClient requires either a host dict or address+agent_port"
)
self._address = address
self._port = int(agent_port)
self._host_uuid = None
self._host_name = None
self._identity = identity or ensure_master_identity()
self._verify_hostname = verify_hostname
self._client: Optional[httpx.AsyncClient] = None
# --------------------------------------------------------------- lifecycle
def _build_client(self, timeout: httpx.Timeout) -> httpx.AsyncClient:
# Build the SSL context manually — httpx.create_ssl_context layers on
# purpose/ALPN/default-CA logic that doesn't compose with private-CA
# mTLS in all combinations. A bare SSLContext is predictable.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.load_cert_chain(
str(self._identity.cert_path), str(self._identity.key_path)
)
ctx.load_verify_locations(cafile=str(self._identity.ca_cert_path))
ctx.verify_mode = ssl.CERT_REQUIRED
# Pin by CA + cert chain, not by DNS — workers enroll with arbitrary
# SANs (IPs, hostnames) and we don't want to force operators to keep
# those in sync with whatever URL the master happens to use.
ctx.check_hostname = self._verify_hostname
return httpx.AsyncClient(
base_url=f"https://{self._address}:{self._port}",
verify=ctx,
timeout=timeout,
)
async def __aenter__(self) -> "AgentClient":
self._client = self._build_client(_TIMEOUT_CONTROL)
return self
async def __aexit__(self, *exc: Any) -> None:
if self._client:
await self._client.aclose()
self._client = None
def _require_client(self) -> httpx.AsyncClient:
if self._client is None:
raise RuntimeError("AgentClient used outside `async with` block")
return self._client
# ----------------------------------------------------------------- RPCs
async def health(self) -> dict[str, Any]:
resp = await self._require_client().get("/health")
resp.raise_for_status()
return resp.json()
async def status(self) -> dict[str, Any]:
resp = await self._require_client().get("/status")
resp.raise_for_status()
return resp.json()
async def deploy(
self,
config: DecnetConfig,
*,
dry_run: bool = False,
no_cache: bool = False,
) -> dict[str, Any]:
body = {
"config": config.model_dump(mode="json"),
"dry_run": dry_run,
"no_cache": no_cache,
}
# Swap in a long-deploy timeout for this call only.
old = self._require_client().timeout
self._require_client().timeout = _TIMEOUT_DEPLOY
try:
resp = await self._require_client().post("/deploy", json=body)
finally:
self._require_client().timeout = old
resp.raise_for_status()
return resp.json()
async def teardown(self, decky_id: Optional[str] = None) -> dict[str, Any]:
resp = await self._require_client().post(
"/teardown", json={"decky_id": decky_id}
)
resp.raise_for_status()
return resp.json()
async def self_destruct(self) -> dict[str, Any]:
"""Trigger the worker to stop services and wipe its install."""
resp = await self._require_client().post("/self-destruct")
resp.raise_for_status()
return resp.json()
# -------------------------------------------------------------- diagnostics
def __repr__(self) -> str:
return (
f"AgentClient(name={self._host_name!r}, "
f"address={self._address!r}, port={self._port})"
)

View File

@@ -0,0 +1,293 @@
"""Worker-side syslog-over-TLS forwarder (RFC 5425).
Runs alongside the worker agent. Tails the worker's local RFC 5424 log
file (written by the existing docker-collector) and ships each line to
the master's listener on TCP 6514 using octet-counted framing over mTLS.
Persists the last-forwarded byte offset in a tiny local SQLite so a
master crash never causes loss or duplication.
Design constraints (from the plan, non-negotiable):
* transport MUST be TLS — plaintext syslog is never acceptable between
hosts; only loopback (decky → worker-local collector) may be plaintext;
* mTLS — the listener pins the worker cert against the DECNET CA, so only
enrolled workers can push logs;
* offset persistence MUST be transactional w.r.t. the send — we only
advance the offset after ``writer.drain()`` returns without error.
The forwarder is intentionally a standalone coroutine, not a worker
inside the agent process. That keeps ``decnet agent`` crashes from
losing the log tail, and vice versa.
"""
from __future__ import annotations
import asyncio
import os
import pathlib
import sqlite3
import ssl
from dataclasses import dataclass
from typing import Optional
from decnet.logging import get_logger
from decnet.swarm import pki
log = get_logger("swarm.forwarder")
# RFC 5425 framing: "<octet-count> <syslog-msg>".
# The message itself is a standard RFC 5424 line (no trailing newline).
_FRAME_SEP = b" "
_INITIAL_BACKOFF = 1.0
_MAX_BACKOFF = 30.0
@dataclass(frozen=True)
class ForwarderConfig:
log_path: pathlib.Path # worker's RFC 5424 .log file
master_host: str
master_port: int = 6514
agent_dir: pathlib.Path = pki.DEFAULT_AGENT_DIR
state_db: Optional[pathlib.Path] = None # default: agent_dir / "forwarder.db"
# Max unacked bytes to keep in the local buffer when master is down.
# We bound the lag to avoid unbounded disk growth on catastrophic master
# outage — older lines are surfaced as a warning and dropped by advancing
# the offset.
max_lag_bytes: int = 128 * 1024 * 1024 # 128 MiB
# ------------------------------------------------------------ offset storage
class _OffsetStore:
"""Single-row SQLite offset tracker. Stdlib only — no ORM, no async."""
def __init__(self, db_path: pathlib.Path) -> None:
db_path.parent.mkdir(parents=True, exist_ok=True)
self._conn = sqlite3.connect(str(db_path))
self._conn.execute(
"CREATE TABLE IF NOT EXISTS forwarder_offset ("
" key TEXT PRIMARY KEY, offset INTEGER NOT NULL)"
)
self._conn.commit()
def get(self, key: str = "default") -> int:
row = self._conn.execute(
"SELECT offset FROM forwarder_offset WHERE key=?", (key,)
).fetchone()
return int(row[0]) if row else 0
def set(self, offset: int, key: str = "default") -> None:
self._conn.execute(
"INSERT INTO forwarder_offset(key, offset) VALUES(?, ?) "
"ON CONFLICT(key) DO UPDATE SET offset=excluded.offset",
(key, offset),
)
self._conn.commit()
def close(self) -> None:
self._conn.close()
# ---------------------------------------------------------------- TLS setup
def build_worker_ssl_context(agent_dir: pathlib.Path) -> ssl.SSLContext:
"""Client-side mTLS context for the forwarder.
Worker presents its agent bundle (same cert used for the control-plane
HTTPS listener). The CA is the DECNET CA; we pin by CA, not hostname,
because workers reach masters by operator-supplied address.
"""
bundle = pki.load_worker_bundle(agent_dir)
if bundle is None:
raise RuntimeError(
f"no worker bundle at {agent_dir} — enroll from the master first"
)
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.load_cert_chain(
certfile=str(agent_dir / "worker.crt"),
keyfile=str(agent_dir / "worker.key"),
)
ctx.load_verify_locations(cafile=str(agent_dir / "ca.crt"))
ctx.verify_mode = ssl.CERT_REQUIRED
ctx.check_hostname = False
return ctx
# ----------------------------------------------------------- frame encoding
def encode_frame(line: str) -> bytes:
"""RFC 5425 octet-counted framing: ``"<N> <msg>"``.
``N`` is the byte length of the payload that follows (after the space).
"""
payload = line.rstrip("\n").encode("utf-8", errors="replace")
return f"{len(payload)}".encode("ascii") + _FRAME_SEP + payload
async def read_frame(reader: asyncio.StreamReader) -> Optional[bytes]:
"""Read one octet-counted frame. Returns None on clean EOF."""
# Read the ASCII length up to the first space. Bound the prefix so a
# malicious peer can't force us to buffer unbounded bytes before we know
# it's a valid frame.
prefix = b""
while True:
c = await reader.read(1)
if not c:
return None if not prefix else b""
if c == _FRAME_SEP:
break
if len(prefix) >= 10 or not c.isdigit():
# RFC 5425 caps the length prefix at ~10 digits (< 4 GiB payload).
raise ValueError(f"invalid octet-count prefix: {prefix!r}")
prefix += c
n = int(prefix)
buf = await reader.readexactly(n)
return buf
# ----------------------------------------------------------------- main loop
async def _send_batch(
writer: asyncio.StreamWriter,
offset: int,
lines: list[tuple[int, str]],
store: _OffsetStore,
) -> int:
"""Write every line as a frame, drain, then persist the last offset."""
for _, line in lines:
writer.write(encode_frame(line))
await writer.drain()
last_offset = lines[-1][0]
store.set(last_offset)
return last_offset
async def run_forwarder(
cfg: ForwarderConfig,
*,
poll_interval: float = 0.5,
stop_event: Optional[asyncio.Event] = None,
) -> None:
"""Main forwarder loop. Run as a dedicated task.
Stops when ``stop_event`` is set (used by tests and clean shutdown).
Exceptions trigger exponential backoff but are never fatal — the
forwarder is expected to outlive transient master/network failures.
"""
state_db = cfg.state_db or (cfg.agent_dir / "forwarder.db")
store = _OffsetStore(state_db)
offset = store.get()
backoff = _INITIAL_BACKOFF
log.info(
"forwarder start log=%s master=%s:%d offset=%d",
cfg.log_path, cfg.master_host, cfg.master_port, offset,
)
try:
while stop_event is None or not stop_event.is_set():
try:
ctx = build_worker_ssl_context(cfg.agent_dir)
reader, writer = await asyncio.open_connection(
cfg.master_host, cfg.master_port, ssl=ctx
)
log.info("forwarder connected master=%s:%d", cfg.master_host, cfg.master_port)
backoff = _INITIAL_BACKOFF
try:
offset = await _pump(cfg, store, writer, offset, poll_interval, stop_event)
finally:
writer.close()
try:
await writer.wait_closed()
except Exception: # nosec B110 — socket cleanup is best-effort
pass
# Keep reader alive until here to avoid "reader garbage
# collected" warnings on some Python builds.
del reader
except (OSError, ssl.SSLError, ConnectionError) as exc:
log.warning(
"forwarder disconnected: %s — retrying in %.1fs", exc, backoff
)
try:
await asyncio.wait_for(
_sleep_unless_stopped(backoff, stop_event), timeout=backoff + 1
)
except asyncio.TimeoutError:
pass
backoff = min(_MAX_BACKOFF, backoff * 2)
finally:
store.close()
log.info("forwarder stopped offset=%d", offset)
async def _pump(
cfg: ForwarderConfig,
store: _OffsetStore,
writer: asyncio.StreamWriter,
offset: int,
poll_interval: float,
stop_event: Optional[asyncio.Event],
) -> int:
"""Read new lines since ``offset`` and ship them until disconnect."""
while stop_event is None or not stop_event.is_set():
if not cfg.log_path.exists():
await _sleep_unless_stopped(poll_interval, stop_event)
continue
stat = cfg.log_path.stat()
if stat.st_size < offset:
# truncated/rotated — reset.
log.warning("forwarder log rotated — resetting offset=0")
offset = 0
store.set(0)
if stat.st_size - offset > cfg.max_lag_bytes:
# Catastrophic lag — skip ahead to cap local disk pressure.
skip_to = stat.st_size - cfg.max_lag_bytes
log.warning(
"forwarder lag %d > cap %d — dropping oldest %d bytes",
stat.st_size - offset, cfg.max_lag_bytes, skip_to - offset,
)
offset = skip_to
store.set(offset)
if stat.st_size == offset:
await _sleep_unless_stopped(poll_interval, stop_event)
continue
batch: list[tuple[int, str]] = []
with open(cfg.log_path, "r", encoding="utf-8", errors="replace") as f:
f.seek(offset)
while True:
line = f.readline()
if not line or not line.endswith("\n"):
break
offset_after = f.tell()
batch.append((offset_after, line.rstrip("\n")))
if len(batch) >= 500:
break
if batch:
offset = await _send_batch(writer, offset, batch, store)
return offset
async def _sleep_unless_stopped(
seconds: float, stop_event: Optional[asyncio.Event]
) -> None:
if stop_event is None:
await asyncio.sleep(seconds)
return
try:
await asyncio.wait_for(stop_event.wait(), timeout=seconds)
except asyncio.TimeoutError:
pass
# Re-exported for CLI convenience
DEFAULT_PORT = 6514
def default_master_host() -> Optional[str]:
return os.environ.get("DECNET_SWARM_MASTER_HOST")

View File

@@ -0,0 +1,194 @@
"""Master-side syslog-over-TLS listener (RFC 5425).
Accepts mTLS-authenticated worker connections on TCP 6514, reads
octet-counted frames, parses each as an RFC 5424 line, and appends it to
the master's local ingest log files. The existing log_ingestion_worker
tails those files and inserts records into the master repo — worker
provenance is embedded in the parsed record's ``source_worker`` field.
Design:
* TLS is mandatory. No plaintext fallback. A peer without a CA-signed
cert is rejected at the TLS handshake; nothing gets past the kernel.
* The listener never trusts the syslog HOSTNAME field for provenance —
that's attacker-supplied from the decky. The authoritative source is
the peer cert's CN, which the CA controlled at enrollment.
* Dropped connections are fine — the worker's forwarder holds the
offset and resumes from the same byte on reconnect.
"""
from __future__ import annotations
import asyncio
import json
import pathlib
import ssl
from dataclasses import dataclass
from typing import Optional
from cryptography import x509
from cryptography.hazmat.primitives import serialization
from cryptography.x509.oid import NameOID
from decnet.logging import get_logger
from decnet.swarm import pki
from decnet.swarm.log_forwarder import read_frame
log = get_logger("swarm.listener")
@dataclass(frozen=True)
class ListenerConfig:
log_path: pathlib.Path # master's RFC 5424 .log (forensic sink)
json_path: pathlib.Path # master's .json (ingester tails this)
bind_host: str = "0.0.0.0" # nosec B104 — listener must bind publicly
bind_port: int = 6514
ca_dir: pathlib.Path = pki.DEFAULT_CA_DIR
# --------------------------------------------------------- TLS context
def build_listener_ssl_context(ca_dir: pathlib.Path) -> ssl.SSLContext:
"""Server-side mTLS context: master presents its master cert; clients
must present a cert signed by the DECNET CA."""
master_dir = ca_dir / "master"
ca_cert = master_dir / "ca.crt"
cert = master_dir / "worker.crt" # master re-uses the 'worker' bundle layout
key = master_dir / "worker.key"
for p in (ca_cert, cert, key):
if not p.exists():
raise RuntimeError(
f"master identity missing at {master_dir} — call ensure_master_identity first"
)
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain(certfile=str(cert), keyfile=str(key))
ctx.load_verify_locations(cafile=str(ca_cert))
ctx.verify_mode = ssl.CERT_REQUIRED
return ctx
# ---------------------------------------------------------- helpers
def peer_cn(ssl_object: Optional[ssl.SSLObject]) -> str:
"""Extract the CN from the TLS peer certificate (worker provenance).
Falls back to ``"unknown"`` on any parse error — we refuse to crash on
malformed cert DNs and instead tag the message for later inspection.
"""
if ssl_object is None:
return "unknown"
der = ssl_object.getpeercert(binary_form=True)
if der is None:
return "unknown"
try:
cert = x509.load_der_x509_certificate(der)
attrs = cert.subject.get_attributes_for_oid(NameOID.COMMON_NAME)
return attrs[0].value if attrs else "unknown"
except Exception: # nosec B110 — provenance is best-effort
return "unknown"
def fingerprint_from_ssl(ssl_object: Optional[ssl.SSLObject]) -> Optional[str]:
if ssl_object is None:
return None
der = ssl_object.getpeercert(binary_form=True)
if der is None:
return None
try:
cert = x509.load_der_x509_certificate(der)
return pki.fingerprint(cert.public_bytes(serialization.Encoding.PEM))
except Exception:
return None
# --------------------------------------------------- per-connection handler
async def _handle_connection(
reader: asyncio.StreamReader,
writer: asyncio.StreamWriter,
cfg: ListenerConfig,
) -> None:
ssl_obj = writer.get_extra_info("ssl_object")
cn = peer_cn(ssl_obj)
peer = writer.get_extra_info("peername")
log.info("listener accepted worker=%s peer=%s", cn, peer)
# Lazy import to avoid a circular dep if the collector pulls in logger setup.
from decnet.collector.worker import parse_rfc5424
cfg.log_path.parent.mkdir(parents=True, exist_ok=True)
cfg.json_path.parent.mkdir(parents=True, exist_ok=True)
try:
with open(cfg.log_path, "a", encoding="utf-8") as lf, open(
cfg.json_path, "a", encoding="utf-8"
) as jf:
while True:
try:
frame = await read_frame(reader)
except asyncio.IncompleteReadError:
break
except ValueError as exc:
log.warning("listener bad frame worker=%s err=%s", cn, exc)
break
if frame is None:
break
if not frame:
continue
line = frame.decode("utf-8", errors="replace")
lf.write(line + "\n")
lf.flush()
parsed = parse_rfc5424(line)
if parsed is not None:
parsed["source_worker"] = cn
jf.write(json.dumps(parsed) + "\n")
jf.flush()
else:
log.debug("listener malformed RFC5424 worker=%s snippet=%r", cn, line[:80])
except Exception as exc:
log.warning("listener connection error worker=%s err=%s", cn, exc)
finally:
writer.close()
try:
await writer.wait_closed()
except Exception: # nosec B110 — socket cleanup is best-effort
pass
log.info("listener closed worker=%s", cn)
# ---------------------------------------------------------------- server
async def run_listener(
cfg: ListenerConfig,
*,
stop_event: Optional[asyncio.Event] = None,
) -> None:
ctx = build_listener_ssl_context(cfg.ca_dir)
async def _client_cb(
reader: asyncio.StreamReader, writer: asyncio.StreamWriter
) -> None:
await _handle_connection(reader, writer, cfg)
server = await asyncio.start_server(
_client_cb, host=cfg.bind_host, port=cfg.bind_port, ssl=ctx
)
sockets = server.sockets or ()
log.info(
"listener bound host=%s port=%d sockets=%d",
cfg.bind_host, cfg.bind_port, len(sockets),
)
async with server:
if stop_event is None:
await server.serve_forever()
else:
serve_task = asyncio.create_task(server.serve_forever())
await stop_event.wait()
server.close()
serve_task.cancel()
try:
await serve_task
except (asyncio.CancelledError, Exception): # nosec B110
pass

323
decnet/swarm/pki.py Normal file
View File

@@ -0,0 +1,323 @@
"""DECNET SWARM PKI — self-managed X.509 CA for master↔worker mTLS.
Used by:
* the SWARM controller (master) to issue per-worker server+client certs at
enrollment time,
* the agent (worker) to present its mTLS identity for both the control-plane
HTTPS endpoint and the syslog-over-TLS (RFC 5425) log forwarder,
* the master-side syslog-TLS listener to authenticate inbound workers.
Storage layout (master):
~/.decnet/ca/
ca.key (PEM, 0600 — the CA private key)
ca.crt (PEM — self-signed root)
workers/<worker-name>/
client.crt (issued, signed by CA)
Worker layout (delivered by /enroll response):
~/.decnet/agent/
ca.crt (master's CA — trust anchor)
worker.key (worker's own private key)
worker.crt (signed by master CA — used for both TLS
server auth *and* syslog client auth)
The CA is a hard dependency only in swarm mode; unihost installs never
touch this module.
"""
from __future__ import annotations
import datetime as _dt
import hashlib
import ipaddress
import os
import pathlib
from dataclasses import dataclass
from typing import Optional
from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID
DEFAULT_CA_DIR = pathlib.Path(os.path.expanduser("~/.decnet/ca"))
DEFAULT_AGENT_DIR = pathlib.Path(os.path.expanduser("~/.decnet/agent"))
DEFAULT_SWARMCTL_DIR = pathlib.Path(os.path.expanduser("~/.decnet/swarmctl"))
CA_KEY_BITS = 4096
WORKER_KEY_BITS = 2048
CA_VALIDITY_DAYS = 3650 # 10 years — internal CA
WORKER_VALIDITY_DAYS = 825 # max permitted by modern TLS clients
@dataclass(frozen=True)
class CABundle:
"""The master's CA identity (key is secret, cert is published)."""
key_pem: bytes
cert_pem: bytes
@dataclass(frozen=True)
class IssuedCert:
"""A signed worker certificate + its private key, handed to the worker
exactly once during enrollment.
"""
key_pem: bytes
cert_pem: bytes
ca_cert_pem: bytes
fingerprint_sha256: str # hex, lowercase
# --------------------------------------------------------------------- CA ops
def _pem_private(key: rsa.RSAPrivateKey) -> bytes:
return key.private_bytes(
encoding=serialization.Encoding.PEM,
format=serialization.PrivateFormat.PKCS8,
encryption_algorithm=serialization.NoEncryption(),
)
def _pem_cert(cert: x509.Certificate) -> bytes:
return cert.public_bytes(serialization.Encoding.PEM)
def generate_ca(common_name: str = "DECNET SWARM Root CA") -> CABundle:
"""Generate a fresh self-signed CA. Does not touch disk."""
key = rsa.generate_private_key(public_exponent=65537, key_size=CA_KEY_BITS)
subject = issuer = x509.Name(
[
x509.NameAttribute(NameOID.COMMON_NAME, common_name),
x509.NameAttribute(NameOID.ORGANIZATION_NAME, "DECNET"),
]
)
now = _dt.datetime.now(_dt.timezone.utc)
cert = (
x509.CertificateBuilder()
.subject_name(subject)
.issuer_name(issuer)
.public_key(key.public_key())
.serial_number(x509.random_serial_number())
.not_valid_before(now - _dt.timedelta(minutes=5))
.not_valid_after(now + _dt.timedelta(days=CA_VALIDITY_DAYS))
.add_extension(x509.BasicConstraints(ca=True, path_length=0), critical=True)
.add_extension(
x509.KeyUsage(
digital_signature=True,
content_commitment=False,
key_encipherment=False,
data_encipherment=False,
key_agreement=False,
key_cert_sign=True,
crl_sign=True,
encipher_only=False,
decipher_only=False,
),
critical=True,
)
.sign(private_key=key, algorithm=hashes.SHA256())
)
return CABundle(key_pem=_pem_private(key), cert_pem=_pem_cert(cert))
def save_ca(bundle: CABundle, ca_dir: pathlib.Path = DEFAULT_CA_DIR) -> None:
ca_dir.mkdir(parents=True, exist_ok=True)
# 0700 on the dir, 0600 on the key — defence against casual reads.
os.chmod(ca_dir, 0o700)
key_path = ca_dir / "ca.key"
cert_path = ca_dir / "ca.crt"
key_path.write_bytes(bundle.key_pem)
os.chmod(key_path, 0o600)
cert_path.write_bytes(bundle.cert_pem)
def load_ca(ca_dir: pathlib.Path = DEFAULT_CA_DIR) -> CABundle:
key_pem = (ca_dir / "ca.key").read_bytes()
cert_pem = (ca_dir / "ca.crt").read_bytes()
return CABundle(key_pem=key_pem, cert_pem=cert_pem)
def ensure_ca(ca_dir: pathlib.Path = DEFAULT_CA_DIR) -> CABundle:
"""Load the CA if present, otherwise generate and persist a new one."""
if (ca_dir / "ca.key").exists() and (ca_dir / "ca.crt").exists():
return load_ca(ca_dir)
bundle = generate_ca()
save_ca(bundle, ca_dir)
return bundle
# --------------------------------------------------------------- cert issuance
def _parse_san(value: str) -> x509.GeneralName:
"""Parse a SAN entry as IP if possible, otherwise DNS."""
try:
return x509.IPAddress(ipaddress.ip_address(value))
except ValueError:
return x509.DNSName(value)
def issue_worker_cert(
ca: CABundle,
worker_name: str,
sans: list[str],
validity_days: int = WORKER_VALIDITY_DAYS,
) -> IssuedCert:
"""Sign a freshly-generated worker keypair.
The cert is usable as BOTH a TLS server (agent's HTTPS endpoint) and a
TLS client (syslog-over-TLS upstream to the master) — extended key usage
covers both. ``sans`` should include every address/name the master or
workers will use to reach this worker — typically the worker's IP plus
its hostname.
"""
ca_key = serialization.load_pem_private_key(ca.key_pem, password=None)
ca_cert = x509.load_pem_x509_certificate(ca.cert_pem)
worker_key = rsa.generate_private_key(public_exponent=65537, key_size=WORKER_KEY_BITS)
subject = x509.Name(
[
x509.NameAttribute(NameOID.COMMON_NAME, worker_name),
x509.NameAttribute(NameOID.ORGANIZATION_NAME, "DECNET"),
x509.NameAttribute(NameOID.ORGANIZATIONAL_UNIT_NAME, "swarm-worker"),
]
)
now = _dt.datetime.now(_dt.timezone.utc)
san_entries: list[x509.GeneralName] = [_parse_san(s) for s in sans] if sans else []
# Always include the worker-name as a DNS SAN so cert pinning by CN-as-DNS
# works even when the operator forgets to pass an explicit SAN list.
if not any(
isinstance(e, x509.DNSName) and e.value == worker_name for e in san_entries
):
san_entries.append(x509.DNSName(worker_name))
builder = (
x509.CertificateBuilder()
.subject_name(subject)
.issuer_name(ca_cert.subject)
.public_key(worker_key.public_key())
.serial_number(x509.random_serial_number())
.not_valid_before(now - _dt.timedelta(minutes=5))
.not_valid_after(now + _dt.timedelta(days=validity_days))
.add_extension(x509.BasicConstraints(ca=False, path_length=None), critical=True)
.add_extension(
x509.KeyUsage(
digital_signature=True,
content_commitment=False,
key_encipherment=True,
data_encipherment=False,
key_agreement=False,
key_cert_sign=False,
crl_sign=False,
encipher_only=False,
decipher_only=False,
),
critical=True,
)
.add_extension(
x509.ExtendedKeyUsage(
[
x509.ObjectIdentifier("1.3.6.1.5.5.7.3.1"), # serverAuth
x509.ObjectIdentifier("1.3.6.1.5.5.7.3.2"), # clientAuth
]
),
critical=True,
)
.add_extension(x509.SubjectAlternativeName(san_entries), critical=False)
)
cert = builder.sign(private_key=ca_key, algorithm=hashes.SHA256())
cert_pem = _pem_cert(cert)
fp = hashlib.sha256(
cert.public_bytes(serialization.Encoding.DER)
).hexdigest()
return IssuedCert(
key_pem=_pem_private(worker_key),
cert_pem=cert_pem,
ca_cert_pem=ca.cert_pem,
fingerprint_sha256=fp,
)
def write_worker_bundle(
issued: IssuedCert,
agent_dir: pathlib.Path = DEFAULT_AGENT_DIR,
) -> None:
"""Persist an issued bundle into the worker's agent directory."""
agent_dir.mkdir(parents=True, exist_ok=True)
os.chmod(agent_dir, 0o700)
(agent_dir / "ca.crt").write_bytes(issued.ca_cert_pem)
(agent_dir / "worker.crt").write_bytes(issued.cert_pem)
key_path = agent_dir / "worker.key"
key_path.write_bytes(issued.key_pem)
os.chmod(key_path, 0o600)
def load_worker_bundle(
agent_dir: pathlib.Path = DEFAULT_AGENT_DIR,
) -> Optional[IssuedCert]:
"""Return the worker's bundle if enrolled; ``None`` otherwise."""
ca = agent_dir / "ca.crt"
crt = agent_dir / "worker.crt"
key = agent_dir / "worker.key"
if not (ca.exists() and crt.exists() and key.exists()):
return None
cert_pem = crt.read_bytes()
cert = x509.load_pem_x509_certificate(cert_pem)
fp = hashlib.sha256(
cert.public_bytes(serialization.Encoding.DER)
).hexdigest()
return IssuedCert(
key_pem=key.read_bytes(),
cert_pem=cert_pem,
ca_cert_pem=ca.read_bytes(),
fingerprint_sha256=fp,
)
def ensure_swarmctl_cert(
bind_host: str,
ca_dir: pathlib.Path = DEFAULT_CA_DIR,
swarmctl_dir: pathlib.Path = DEFAULT_SWARMCTL_DIR,
extra_sans: Optional[list[str]] = None,
) -> tuple[pathlib.Path, pathlib.Path, pathlib.Path]:
"""Return (cert_path, key_path, ca_path), auto-issuing if missing.
Uses the existing DECNET CA (ensuring it exists first) so workers
whose bundle already includes ``ca.crt`` can verify the swarmctl
endpoint without additional trust configuration. Self-signed is
intentionally not the default — a cert signed by the same CA the
workers already trust is the friction-free path.
Callers that want BYOC should skip this and pass their own
cert/key paths directly to uvicorn.
"""
swarmctl_dir.mkdir(parents=True, exist_ok=True)
os.chmod(swarmctl_dir, 0o700)
cert_path = swarmctl_dir / "server.crt"
key_path = swarmctl_dir / "server.key"
ca_cert_path = ca_dir / "ca.crt"
if cert_path.exists() and key_path.exists() and ca_cert_path.exists():
return cert_path, key_path, ca_cert_path
ca = ensure_ca(ca_dir)
sans = list({bind_host, "127.0.0.1", "localhost", *(extra_sans or [])})
issued = issue_worker_cert(ca, "swarmctl", sans)
cert_path.write_bytes(issued.cert_pem)
key_path.write_bytes(issued.key_pem)
os.chmod(key_path, 0o600)
# ensure_ca already wrote ca.crt under ca_dir, but save_ca is only
# called on generate — re-mirror it here to guarantee the path exists.
if not ca_cert_path.exists():
ca_cert_path.write_bytes(ca.cert_pem)
return cert_path, key_path, ca_cert_path
def fingerprint(cert_pem: bytes) -> str:
"""SHA-256 hex fingerprint of a cert (DER-encoded)."""
cert = x509.load_pem_x509_certificate(cert_pem)
return hashlib.sha256(cert.public_bytes(serialization.Encoding.DER)).hexdigest()

97
decnet/swarm/tar_tree.py Normal file
View File

@@ -0,0 +1,97 @@
"""Build a gzipped tarball of the master's working tree for pushing to workers.
Always excludes the obvious large / secret / churn paths: ``.venv/``,
``__pycache__/``, ``.git/``, ``wiki-checkout/``, ``*.db*``, ``*.log``. The
caller can supply additional exclude globs.
Deliberately does NOT invoke git — the tree is what the operator has on
disk (staged + unstaged + untracked). That's the whole point; the scp
workflow we're replacing also shipped the live tree.
"""
from __future__ import annotations
import fnmatch
import io
import pathlib
import tarfile
from typing import Iterable, Optional
DEFAULT_EXCLUDES = (
".venv", ".venv/*",
"**/.venv/*",
"__pycache__", "**/__pycache__", "**/__pycache__/*",
".git", ".git/*",
"wiki-checkout", "wiki-checkout/*",
"*.pyc", "*.pyo",
"*.db", "*.db-wal", "*.db-shm",
"*.log",
".pytest_cache", ".pytest_cache/*",
".mypy_cache", ".mypy_cache/*",
".tox", ".tox/*",
"*.egg-info", "*.egg-info/*",
"decnet-state.json",
"master.log", "master.json",
"decnet.db*",
)
def _is_excluded(rel: str, patterns: Iterable[str]) -> bool:
parts = pathlib.PurePosixPath(rel).parts
for pat in patterns:
if fnmatch.fnmatch(rel, pat):
return True
# Also match the pattern against every leading subpath — this is
# what catches nested `.venv/...` without forcing callers to spell
# out every `**/` glob.
for i in range(1, len(parts) + 1):
if fnmatch.fnmatch("/".join(parts[:i]), pat):
return True
return False
def tar_working_tree(
root: pathlib.Path,
extra_excludes: Optional[Iterable[str]] = None,
) -> bytes:
"""Return the gzipped tarball bytes of ``root``.
Entries are added with paths relative to ``root`` (no leading ``/``,
no ``..``). The updater rejects unsafe paths on the receiving side.
"""
patterns = list(DEFAULT_EXCLUDES) + list(extra_excludes or ())
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
for path in sorted(root.rglob("*")):
rel = path.relative_to(root).as_posix()
if _is_excluded(rel, patterns):
continue
if path.is_symlink():
# Symlinks inside a repo tree are rare and often break
# portability; skip them rather than ship dangling links.
continue
if path.is_dir():
continue
tar.add(path, arcname=rel, recursive=False)
return buf.getvalue()
def detect_git_sha(root: pathlib.Path) -> str:
"""Best-effort ``HEAD`` sha. Returns ``""`` if not a git repo."""
head = root / ".git" / "HEAD"
if not head.is_file():
return ""
try:
ref = head.read_text().strip()
except OSError:
return ""
if ref.startswith("ref: "):
ref_path = root / ".git" / ref[5:]
if ref_path.is_file():
try:
return ref_path.read_text().strip()
except OSError:
return ""
return ""
return ref

View File

@@ -0,0 +1,124 @@
"""Master-side HTTP client for the worker's self-updater daemon.
Sibling of ``AgentClient``: same mTLS identity (same DECNET CA, same
master client cert) but targets the updater's port (default 8766) and
speaks the multipart upload protocol the updater's ``/update`` endpoint
expects.
Kept as its own module — not a subclass of ``AgentClient`` — because the
timeouts and failure semantics are genuinely different: pip install +
agent probe can take a minute on a slow VM, and ``/update-self`` drops
the connection on purpose (the updater re-execs itself mid-response).
"""
from __future__ import annotations
import ssl
from typing import Any, Optional
import httpx
from decnet.logging import get_logger
from decnet.swarm.client import MasterIdentity, ensure_master_identity
log = get_logger("swarm.updater_client")
_TIMEOUT_UPDATE = httpx.Timeout(connect=10.0, read=180.0, write=120.0, pool=5.0)
_TIMEOUT_CONTROL = httpx.Timeout(connect=5.0, read=30.0, write=10.0, pool=5.0)
class UpdaterClient:
"""Async client targeting a worker's ``decnet updater`` daemon."""
def __init__(
self,
host: dict[str, Any] | None = None,
*,
address: Optional[str] = None,
updater_port: int = 8766,
identity: Optional[MasterIdentity] = None,
):
if host is not None:
self._address = host["address"]
self._host_name = host.get("name")
else:
if address is None:
raise ValueError("UpdaterClient requires host dict or address")
self._address = address
self._host_name = None
self._port = updater_port
self._identity = identity or ensure_master_identity()
self._client: Optional[httpx.AsyncClient] = None
def _build_client(self, timeout: httpx.Timeout) -> httpx.AsyncClient:
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.load_cert_chain(
str(self._identity.cert_path), str(self._identity.key_path),
)
ctx.load_verify_locations(cafile=str(self._identity.ca_cert_path))
ctx.verify_mode = ssl.CERT_REQUIRED
ctx.check_hostname = False
return httpx.AsyncClient(
base_url=f"https://{self._address}:{self._port}",
verify=ctx,
timeout=timeout,
)
async def __aenter__(self) -> "UpdaterClient":
self._client = self._build_client(_TIMEOUT_CONTROL)
return self
async def __aexit__(self, *exc: Any) -> None:
if self._client:
await self._client.aclose()
self._client = None
def _require(self) -> httpx.AsyncClient:
if self._client is None:
raise RuntimeError("UpdaterClient used outside `async with` block")
return self._client
# --------------------------------------------------------------- RPCs
async def health(self) -> dict[str, Any]:
r = await self._require().get("/health")
r.raise_for_status()
return r.json()
async def releases(self) -> dict[str, Any]:
r = await self._require().get("/releases")
r.raise_for_status()
return r.json()
async def update(self, tarball: bytes, sha: str = "") -> httpx.Response:
"""POST /update. Returns the Response so the caller can distinguish
200 / 409 / 500 — each means something different.
"""
self._require().timeout = _TIMEOUT_UPDATE
try:
r = await self._require().post(
"/update",
files={"tarball": ("tree.tgz", tarball, "application/gzip")},
data={"sha": sha},
)
finally:
self._require().timeout = _TIMEOUT_CONTROL
return r
async def update_self(self, tarball: bytes, sha: str = "") -> httpx.Response:
"""POST /update-self. The updater re-execs itself, so the connection
usually drops mid-response; that's not an error. Callers should then
poll /health until the new SHA appears.
"""
self._require().timeout = _TIMEOUT_UPDATE
try:
r = await self._require().post(
"/update-self",
files={"tarball": ("tree.tgz", tarball, "application/gzip")},
data={"sha": sha, "confirm_self": "true"},
)
finally:
self._require().timeout = _TIMEOUT_CONTROL
return r
async def rollback(self) -> httpx.Response:
return await self._require().post("/rollback")

View File

@@ -11,16 +11,16 @@ RUN find /opt /usr /etc /home -name "*.xml" -exec sed -i 's/port="5020"/port="50
RUN (apt-get update && apt-get install -y --no-install-recommends libcap2-bin 2>/dev/null) || (apk add --no-cache libcap 2>/dev/null) || true RUN (apt-get update && apt-get install -y --no-install-recommends libcap2-bin 2>/dev/null) || (apk add --no-cache libcap 2>/dev/null) || true
RUN find /home/conpot/.local/bin /usr /opt -type f -name 'python*' -exec setcap 'cap_net_bind_service+eip' {} \; 2>/dev/null || true RUN find /home/conpot/.local/bin /usr /opt -type f -name 'python*' -exec setcap 'cap_net_bind_service+eip' {} \; 2>/dev/null || true
# Bridge conpot's own logger into DECNET's RFC 5424 syslog pipeline. # Bridge conpot's own logger into syslog-relay's RFC 5424 syslog pipeline.
# entrypoint.py is self-contained (inlines the formatter) because the # entrypoint.py is self-contained (inlines the formatter) because the
# conpot base image runs Python 3.6, which cannot import the shared # conpot base image runs Python 3.6, which cannot import the shared
# decnet_logging.py (that file uses 3.9+ / 3.10+ type syntax). # syslog_bridge.py (that file uses 3.9+ / 3.10+ type syntax).
COPY entrypoint.py /home/conpot/entrypoint.py COPY entrypoint.py /home/conpot/entrypoint.py
RUN chown conpot:conpot /home/conpot/entrypoint.py \ RUN chown conpot:conpot /home/conpot/entrypoint.py \
&& chmod +x /home/conpot/entrypoint.py && chmod +x /home/conpot/entrypoint.py
# The upstream image already runs as non-root 'conpot'. # The upstream image already runs as non-root 'conpot'.
# We do NOT switch to a 'decnet' user — doing so breaks pkg_resources # We do NOT switch to a 'logrelay' user — doing so breaks pkg_resources
# because conpot's eggs live under /home/conpot/.local and are only on # because conpot's eggs live under /home/conpot/.local and are only on
# the Python path for that user. # the Python path for that user.
USER conpot USER conpot

View File

@@ -3,7 +3,7 @@
Entrypoint wrapper for the Conpot ICS/SCADA honeypot. Entrypoint wrapper for the Conpot ICS/SCADA honeypot.
Launches conpot as a child process and bridges its log output into the Launches conpot as a child process and bridges its log output into the
DECNET structured syslog pipeline. Each line from conpot stdout/stderr syslog-relay structured syslog pipeline. Each line from conpot stdout/stderr
is classified and emitted as an RFC 5424 syslog line so the host-side is classified and emitted as an RFC 5424 syslog line so the host-side
collector can ingest it alongside every other service. collector can ingest it alongside every other service.
@@ -21,7 +21,7 @@ from datetime import datetime, timezone
# ── RFC 5424 inline formatter (Python 3.6-compatible) ───────────────────────── # ── RFC 5424 inline formatter (Python 3.6-compatible) ─────────────────────────
_FACILITY_LOCAL0 = 16 _FACILITY_LOCAL0 = 16
_SD_ID = "decnet@55555" _SD_ID = "relay@55555"
_NILVALUE = "-" _NILVALUE = "-"
SEVERITY_INFO = 6 SEVERITY_INFO = 6

View File

@@ -1,15 +1,15 @@
#!/usr/bin/env python3 #!/usr/bin/env python3
""" """
Shared RFC 5424 syslog helper for DECNET service templates. Shared RFC 5424 syslog helper used by service containers.
Services call syslog_line() to format an RFC 5424 message, then Services call syslog_line() to format an RFC 5424 message, then
write_syslog_file() to emit it to stdout Docker captures it, and the write_syslog_file() to emit it to stdout the container runtime
host-side collector streams it into the log file. captures it, and the host-side collector streams it into the log file.
RFC 5424 structure: RFC 5424 structure:
<PRI>1 TIMESTAMP HOSTNAME APP-NAME PROCID MSGID [SD-ELEMENT] MSG <PRI>1 TIMESTAMP HOSTNAME APP-NAME PROCID MSGID [SD-ELEMENT] MSG
Facility: local0 (16), PEN for SD element ID: decnet@55555 Facility: local0 (16). SD element ID uses PEN 55555.
""" """
from datetime import datetime, timezone from datetime import datetime, timezone
@@ -18,7 +18,7 @@ from typing import Any
# ─── Constants ──────────────────────────────────────────────────────────────── # ─── Constants ────────────────────────────────────────────────────────────────
_FACILITY_LOCAL0 = 16 _FACILITY_LOCAL0 = 16
_SD_ID = "decnet@55555" _SD_ID = "relay@55555"
_NILVALUE = "-" _NILVALUE = "-"
SEVERITY_EMERG = 0 SEVERITY_EMERG = 0
@@ -62,7 +62,7 @@ def syslog_line(
Args: Args:
service: APP-NAME (e.g. "http", "mysql") service: APP-NAME (e.g. "http", "mysql")
hostname: HOSTNAME (decky node name) hostname: HOSTNAME (node name)
event_type: MSGID (e.g. "request", "login_attempt") event_type: MSGID (e.g. "request", "login_attempt")
severity: Syslog severity integer (default: INFO=6) severity: Syslog severity integer (default: INFO=6)
timestamp: UTC datetime; defaults to now timestamp: UTC datetime; defaults to now
@@ -80,10 +80,10 @@ def syslog_line(
def write_syslog_file(line: str) -> None: def write_syslog_file(line: str) -> None:
"""Emit a syslog line to stdout for Docker log capture.""" """Emit a syslog line to stdout for container log capture."""
print(line, flush=True) print(line, flush=True)
def forward_syslog(line: str, log_target: str) -> None: def forward_syslog(line: str, log_target: str) -> None:
"""No-op stub. TCP forwarding is now handled by rsyslog, not by service containers.""" """No-op stub. TCP forwarding is handled by rsyslog, not by service containers."""
pass pass

View File

@@ -7,7 +7,7 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
git authbind \ git authbind \
&& rm -rf /var/lib/apt/lists/* && rm -rf /var/lib/apt/lists/*
RUN useradd -r -s /bin/false -d /opt decnet \ RUN useradd -r -s /bin/false -d /opt logrelay \
&& apt-get update && apt-get install -y --no-install-recommends libcap2-bin \ && apt-get update && apt-get install -y --no-install-recommends libcap2-bin \
&& rm -rf /var/lib/apt/lists/* \ && rm -rf /var/lib/apt/lists/* \
&& (find /usr/bin/ -maxdepth 1 -name 'python3*' -type f -exec setcap 'cap_net_bind_service+eip' {} \; 2>/dev/null || true) && (find /usr/bin/ -maxdepth 1 -name 'python3*' -type f -exec setcap 'cap_net_bind_service+eip' {} \; 2>/dev/null || true)
@@ -18,5 +18,5 @@ RUN chmod +x /entrypoint.sh
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \ HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
CMD kill -0 1 || exit 1 CMD kill -0 1 || exit 1
USER decnet USER logrelay
ENTRYPOINT ["/entrypoint.sh"] ENTRYPOINT ["/entrypoint.sh"]

Some files were not shown because too many files have changed in this diff Show More