Commit Graph

1463 Commits

Author SHA1 Message Date
36353026a6 release: bump to v1.0.0; add PyPI metadata; stop globbing decnet_web into packages
Some checks failed
CI / Test (Standard) (3.11) (push) Has been skipped
CI / Test (Live) (3.11) (push) Has been skipped
CI / Merge testing → main (push) Has been skipped
CI / Lint (ruff) (push) Successful in 15s
CI / Dependency audit (pip-audit) (push) Failing after 22s
CI / SAST (bandit) (push) Successful in 30s
CI / Merge dev → testing (push) Has been skipped
v1.0.0
2026-06-16 18:59:48 -04:00
a5cb20051f test(mazenet): add useMazeApi tests 2026-06-16 18:56:13 -04:00
a9c3f42ef9 feat(mazenet): topology editor updates; refine mutator ops materialisation 2026-06-16 18:55:20 -04:00
c9e4bf4022 feat(clustering): link identities by keystroke-rhythm proximity
Campaign clusterer gains a keystroke edge: when two identities'
kd_digraph_simhash centroids are within KD_HAMMING_MAX bits, a graded
weight (1.0 at identical, fading to 0 at the cutoff) feeds the campaign
graph. Supporting tier (0.6) — a typing match plus temporal overlap
reaches threshold, but typing alone never merges (FP guard against
coarse, noisy terminal timing).

Projects the column through IdentityFeatures + from_identity_row.
2026-06-16 17:09:42 -04:00
869d1eabb7 feat(clustering): roll session digraph SimHashes into identity centroid
The identity clusterer folds an identity's per-session
motor.digraph_simhash observations into one 8-byte bitwise-majority
centroid (denoises per-session jitter) and writes it to
AttackerIdentity.kd_digraph_simhash via update_identity_fingerprints —
the orphaned column is now populated. list_identities_for_clustering
projects it so the campaign clusterer can read it.

Extends the repo abstract + DummyRepo stub/coverage.
2026-06-16 17:05:34 -04:00
66c73ce59d feat(profiler): extract motor.digraph_simhash keystroke biometric
Per-session 64-bit SimHash of inter-keystroke digraph flight times:
walk single-char input events, accumulate flight time per (c1,c2),
bucket the median, Charikar-SimHash the bucketed pairs. Locality-
sensitive so the same typist is Hamming-close across sessions; pastes
and think-pauses break the chain; silent below the sample-size floor.

New shared decnet/util/simhash.py (simhash64/hamming64/bytes helpers).
Registered as a conditional Tier-A primitive (count 37->38); requires
behave-shell>=0.1.2.
2026-06-16 16:59:57 -04:00
372375194c refactor(db): run Alembic at boot, retire ad-hoc _migrate_* helpers
initialize() now delegates to _apply_schema(): real boots run
'alembic upgrade head' (schema owned by the migration history); tests
(DECNET_TESTING=1) keep create_all, which is faster and needs no upgrade
path. MySQL wraps the upgrade in the existing GET_LOCK advisory lock so
concurrent uvicorn workers don't race on DDL.

Deletes the three _migrate_* crimes (attackers-table legacy drop +
GeoIP backfill, TEXT->MEDIUMTEXT widening) — all now handled by the
baseline migration and the _BIG_TEXT model variants. Drops the test
file that only exercised the deleted helpers; adds tests pinning the
alembic-vs-create_all gate and guarding that every model table is in
the migration head.
2026-06-16 16:31:10 -04:00
ef4d67cbef build(db): add Alembic scaffolding + baseline migration
Introduce Alembic at v1. Migrations live inside the package
(decnet/web/db/migrations) so they ship with installs; alembic.ini at the
repo root drives the CLI. env.py is async and dual-backend, selecting the
engine from DECNET_DB_TYPE (mirroring db/factory.py) and reusing the app's
own connection when run programmatically.

The baseline captures all 39 tables. _BIG_TEXT round-trips as
Text().with_variant(MEDIUMTEXT, 'mysql'), so both backends get the right
column type from the migration. kd_digraph_simhash gains a sqlite BLOB
variant: BINARY(8) reflects as NUMERIC on SQLite and would otherwise trip
'alembic check' forever.
2026-06-16 16:30:29 -04:00
4f141c1a54 feat(web): stage live MazeNET edits behind an UPDATE button
Live topology edits fired one mutation per canvas action. That coupled
each edit to an immediate enqueue+apply, which (post-serialization)
raced the SSE refetch and duplicated optimistic placeholders, and gave
the user no chance to assemble a coherent changeset (add a net AND
bridge it) before any of it landed.

Live edits now STAGE: each editor primitive records its op and returns
immediately; the optimistic placeholders callers already draw are the
staged preview. The action button reads UPDATE (n) when live (DEPLOY
when pending) and flushes the batch through the slice-1 submit queue —
sequential, version-cursored, each awaited to a terminal state, stopping
loudly on the first failure with the unapplied remainder kept for retry.
REFRESH becomes DISCARD (n) to drop the batch. SSE refetch is paused
during a commit so per-mutation applied events don't wipe still-staged
placeholders mid-batch; one refetch reconciles at the end.

Also fix _dropArchetype, which bailed without an optimistic node on the
staged path, leaving a decky added to an uncommitted LAN invisible until
UPDATE.
2026-06-16 12:59:57 -04:00
f18bfee746 fix(web): serialize live topology mutations + surface failures loudly
Live MazeNET edits fired their mutations fire-and-forget: each canvas
action enqueued immediately and never awaited the result. Two failures
followed from that:

- expected_version is bumped at ENQUEUE (not at apply), so two ops fired
  back-to-back raced — the second carried a stale version and 409'd.
  Edits only worked when hand-paced (an SSE refetch landed between them).
- A failed mutation degrades the topology, but the only signal was a 4s
  toast, so the user saw DEGRADED with no cause.

useTopologyEditor now routes every live op through a serialized submit
queue: one enqueue in flight at a time (submission order preserved), an
optimistic expected_version cursor advanced per enqueue so back-to-back
ops (e.g. reparent's detach+attach) don't need a refetch between them,
and each mutation awaited to a terminal state. A 'failed' row throws
MutationFailedError, which the page pins as a persistent UPDATE FAILED
banner instead of a vanishing toast.

Slice 1 of the live-edit rework; stage+UPDATE-button batching and louder
backend materialisation reporting to follow.
2026-06-16 12:46:09 -04:00
5505de782f feat(web): wire local-decky teardown to DELETE /deckies/{name}
The Fleet UI only showed TEARDOWN for swarm-pinned deckies (POST
/swarm/hosts/{uuid}/teardown). Local deckies had no delete control though
the API now exposes DELETE /deckies/{name}.

teardown() branches on swarm vs local; the card's two-step arm/CONFIRM
button renders for any admin, keyed td:${host_uuid ?? 'local'}:${name}.
2026-06-16 12:15:48 -04:00
0c10869e26 feat(web): DELETE /deckies/{name} single-decky teardown endpoint
The Fleet module had no delete — neither UI nor API — though the engine
capability existed (engine.teardown(decky_id=...), exposed only via
`decnet teardown --id`). Wire it to HTTP.

DELETE /deckies/{name} (admin-gated, 204). Synchronous: a single decky's
compose stop/rm is quick, so it's awaited off-thread rather than the
202+lifecycle path deploy/mutate use for slow builds. The single-decky
teardown never touches the host macvlan interface, so it needs no extra
CAP_NET_ADMIN.

State consistency: engine.teardown removes the containers and the
fleet_deckies row but leaves the decky in decnet-state.json. Left as is, the
reconciler would see "present in JSON, absent from DB" and re-INSERT the row,
resurrecting the decky. So the handler prunes it from both decnet-state.json
and the DB deployment key after teardown; deleting the last decky clears
state entirely (DecnetConfig.deckies has min_length=1).

Route ordering: the dynamic DELETE /deckies/{decky_name} is registered AFTER
the fixed /deckies/* routes (Starlette matches in registration order), so it
no longer shadows DELETE /deckies/files (file-drop).

Tests cover 401/403/404/422, single-delete pruning, and last-decky clear.
2026-06-16 12:07:10 -04:00
8db593a544 test(api): repair pre-existing rotted tests (SSE ticket flow, password policy)
These had been red since the changes they cover landed — invisible because
the pre-commit gate runs mypy/ruff/bandit/pip-audit but NOT pytest, so failing
tests don't block commits and quietly accumulate.

- SSE stream/events auth migrated from ?token=<jwt> to a single-use ?ticket=
  (commit efb4e49d). Three tests still passed a raw JWT as ?token= and got
  401. Updated to mint a ticket via POST /auth/sse-ticket and pass ?ticket=
  (attacker events, topology events, /stream).
- The user-creation password policy is min_length=12; the RBAC admin-access
  test still used a 10-char password and was rejected. Bumped to a valid one.
2026-06-16 12:06:56 -04:00
9eb2803d04 chore(deps): bump cryptography/python-multipart/starlette for CVEs
pip-audit flagged fixable advisories in the web stack:
- cryptography  -> >=48.0.1  (GHSA-537c-gmf6-5ccf)
- python-multipart -> >=0.0.31 (CVE-2026-53538/53539/53540)
- starlette (transitive via fastapi) -> add direct floor >=1.3.1
  (CVE-2026-48817/48818/54282/54283)

Venv synced to cryptography 49.0.0, python-multipart 0.0.32, starlette
1.3.1; full tests/api/ suite green against the bump. Also drops the stray
browser-use[core] dev dep (the browser-use skill uses a global CLI; the
package is imported nowhere in DECNET).
2026-06-16 12:06:20 -04:00
207494f41e fix(web): duplicated deploy log lines + cancelled auto-close in DeployWizard
Two defects exposed after the deploy-success loop fix (verified live):

1. Duplicated / skipped transcript lines. The placeholder-log interval did
   `setLog(prev => [...prev, msgs[i]])` then `i++`. React 18 auto-batches
   setInterval updaters, so the updater ran after i had advanced and read the
   wrong index — skipping some lines ([NET], [SENSE]) and duplicating others
   ([TLS]). Fixed by capturing `const line = msgs[i]` before scheduling the
   update. A placeholderStartedRef also gates the effect to one run per deploy
   (reset in startDeploy) as defense-in-depth against re-render churn.

2. Wizard never closed on success. The completedRef guard combined with the
   auto-close effect's cleanup was self-defeating: a re-render inside the
   700ms window (e.g. the [OK] terminal-log append) ran the cleanup, clearing
   the pending close, and the guard then blocked rescheduling — so onComplete
   never fired. The timer now lives in a ref cleared only on unmount, so a
   scheduled close always fires exactly once regardless of re-renders.

Adds a regression test that a re-render during the close countdown does not
cancel the close. Verified end-to-end against the live instance: all 8 log
lines render once in order, the wizard auto-closes, and /deckies does not
storm.
2026-06-13 00:52:39 -04:00
2a9d1989b6 fix(web): stop runaway deploy-success loop in DeployWizard
After a successful deploy the wizard hammered GET /deckies and stacked
"DEPLOYED" toasts unbounded (~1.4/s, forever). Two compounding causes:

1. The auto-close effect's dep array includes onComplete, which the parent
   passes as an inline arrow (new reference every render). onComplete calls
   refresh(), re-rendering the parent, producing a new onComplete ref,
   re-running the effect, which reschedules onComplete — a feedback loop.
2. DeployWizard stayed permanently mounted (open only toggled a child
   overlay), so its hooks kept running with lifecycleDone===true after close.

Fix both: a completedRef guard makes the auto-close fire exactly once per
deploy regardless of effect re-runs (reset in startDeploy), and the parent
now mounts the wizard only while open so closing tears down its hooks and
clears the latent "reopen re-completes stale rows" path.

Lifecycle polling itself was never the runaway — it stops cleanly at
terminal status; the bounded ~25-poll build window is expected.

Adds a regression test asserting onComplete fires once across a simulated
re-render storm.
2026-06-13 00:19:39 -04:00
ab1151ee7f fix(fleet): read existing fleet from fleet_deckies, not State["deployment"] (BUG-2)
The web deploy collision-guard read the existing fleet from the DB
State["deployment"] key, while the UI/get_deckies() read decnet-state.json.
A fleet established via CLI/seed lands in neither path the guard consulted,
so existing_deckies was empty, the additive guard ran blind, and the
reconciler tore the running fleet down to the single submitted decky
(BUG-2: silent fleet wipe, HTTP 202, no warning).

Converge both reads on fleet_deckies — the engine-mirrored table written on
every deploy/teardown (CLI and web), which fleet/reconciler.py already
documents as the store the orchestrator, dashboard, and REST API see. Each
row's decky_config column is a full DeckyConfig dump, so it rehydrates
losslessly into the collision-guard input. The handler also commits the
intended fleet to fleet_deckies synchronously so rapid sequential deploys
read a current fleet and the dashboard observes the new shape immediately.

State["deployment"] is retained for now — the mutate handlers and the
mutator engine still coordinate through it; consolidating them is tracked
in development/ADR-001-FLEET-SOURCE-OF-TRUTH.md (open question 7).

Tests seed fleet_deckies directly (also modelling the CLI-seeded scenario)
rather than chaining real deploys through the skipped contract-test path.
2026-06-12 23:52:20 -04:00
408810b3e2 chore(security): pin TLSv1.2 floor on control-plane mTLS clients
Follow-up to V9.1.4 (which covered only the syslog forwarder/listener): set
ctx.minimum_version = TLSVersion.TLSv1_2 on the remaining DECNET-owned mTLS
client contexts — AgentClient (_build_client + _fetch_peer_fingerprint),
UpdaterClient (_build_client + _fetch_peer_fingerprint), and the updater
executor's worker context. Pure hardening, no behavior change for TLS1.2+
peers (confirmed by the existing mTLS round-trip suites).

Deliberately EXCLUDED — hardening these would be counterproductive:
- templates/https/server.py, templates/rdp/server.py: honeypot listeners,
  where looking weak/old is part of the deception.
- prober/tlscert.py: outbound TLS fingerprinting prober, which must speak
  whatever the attacker's target offers.

Added a floor-assertion test (spies httpx.AsyncClient to capture the real
verify= context).
2026-06-12 19:06:50 -04:00
efe4e49de6 fix(web): restore SSE streams via single-use ticket flow
The V3.1.1 backend change moved SSE auth off ?token=<JWT> onto a single-use
?ticket=, but the dashboard was never updated, so every live stream 401'd
('Could not validate credentials'). Add mintSseTicket() (POST /auth/sse-ticket
with the Bearer JWT, returns an opaque 60s single-use ticket) and refactor all
stream consumers to mint a fresh ticket at the top of each connect() — initial
and every reconnect — then open EventSource with ?ticket=. A reused single-use
ticket would 401-loop, so re-mint-per-connect is required.

Covers Dashboard /stream, LiveLogs, and the attacker/identity/campaign/
orchestrator/topology hooks. connect() is now async with an unmount guard
(cancelled flag checked after the await, before opening the stream); on a mint
401 the connect is skipped and the axios logout interceptor takes over.
2026-06-12 19:00:15 -04:00
593492411c feat(web): live password-strength checklist on change-password
The change-password form let the browser submit short passwords the API
then rejected with an opaque 'Schema structural violation' 400. Add a pure
validateNewPassword() util (>=12 chars, <=72 bytes, >=3 of 4 character
classes — constants tweakable) and a live ✓/✗ checklist above the submit
button so the user sees exactly what's missing. Submit is gated on
validity + confirm-match, so the form can no longer reach that 400.

- Fix minLength 8->12 on the Login change-password inputs and the UsersTab
  admin-reset guard (both lagged the API's min_length=12).
- Light-mode: render the checklist box fully white with black text (the
  neon-on-dark styling read as muddy grey); ✓/✗ icons keep a green/red cue.
- Advisory UX only — the API min_length=12 remains the enforcement boundary;
  character-class complexity is not server-enforced.
2026-06-12 18:59:46 -04:00
721122a7ef chore(types): enable warn_return_any and cast all no-any-return sites
Turn on mypy warn_return_any (pyproject) and resolve the 84 resulting
[no-any-return] errors across 43 files with typing.cast() at the return
sites — runtime no-ops that make the declared return type explicit where a
dependency (SQLAlchemy scalar/first/one, httpx .json(), subprocess, docker
SDK) hands back Any. No behavior change: no DTO/table field types altered, no
validation/coercion calls added, every cast reflects the true runtime type.

Locks in return-type strictness so the class of bug where a function silently
widens to Any can't regress. mypy decnet/ clean; adversarially verified
behavior-preserving (84 casts 1:1 with prior returns).

Bump tornado 6.5.5 -> 6.5.7 (CVE-2026-49854, transitive via snakeviz).
2026-06-12 18:21:22 -04:00
337520c7ad fix(security): close INFO ASVS findings — secret echo, TLS floor, mandatory tarball SHA, CORS/Content-Type guards, BUG-17
- V7.1.3: env known-insecure-default error no longer echoes the rejected secret value.
- V9.1.4: syslog-over-TLS forwarder + listener pin minimum_version=TLSv1_2.
- V12.1.2: updater tarball SHA-256 verification is now mandatory and fail-closed —
  /update and /update-self reject a missing digest (400), the executor rejects
  missing/mismatched digests before extract/apply. Every push path supplies it.
- V13.1.4: reject a wildcard '*' in DECNET_CORS_ORIGINS at startup.
- V13.1.5: enforce application/json on JSON write endpoints (415 otherwise),
  exempting multipart upload routes.
- BUG-17: SSE error log records the user uuid, not the resume cursor.

Also completes V2.1.7 consistently: the attacker-injectable PYTEST* env bypass is
replaced with explicit DECNET_TESTING=1 in the three remaining sites
(env.validate_public_binding, config logging, mysql url builder).

Tests added for every fix; unanimous adversarial review (no update-outage risk —
all push paths verified to send the digest).
2026-06-10 13:50:06 -04:00
245975a6dd fix(security): close LOW ASVS findings — env bypass, SSE/deployment authz, CN fail-close, password byte-limit, exception leaks, BUG-12..16
Auth/session (V2.1.7, V4.1.5, V4.1.6, V2.1.4/V2.1.5):
- env secret validation no longer bypassed by attacker-injectable PYTEST* env;
  gated on explicit DECNET_TESTING=1 (set only in conftest).
- must_change_password now enforced on the SSE header-JWT path, not just ticket mint.
- GET /system/deployment-mode requires viewer auth (was leaking role + topology size).
- CreateUser/ResetUser passwords min_length=12; passwords >72 bytes rejected
  explicitly instead of bcrypt silently truncating.

Swarm ingestion (V9.1.3, BUG-16):
- Log listener hard-rejects peers with unparseable/empty cert CN (fail closed,
  ingests nothing) instead of tagging 'unknown'.
- Shutdown handlers no longer swallow real errors (narrowed to CancelledError).

Info leakage (V7.1.2, V14.1.2):
- Exception text sanitized on swarm-update, health, tarpit, realism, file-drop,
  blank-topology endpoints (raw tc/docker stderr, DB/Docker errors logged
  server-side, generic detail returned). pyproject license corrected to AGPL-3.0.

Correctness (BUG-12..16):
- BUG-12 atomic credential upsert (UNIQUE constraint + IntegrityError retry,
  consistent principal_key canonicalization).
- BUG-13 rule-tail watermark uses >= with seen-id dedup (no same-second drop).
- BUG-14 worker wake cleared before wait (no lost wake during tick).
- BUG-15 intel gather tolerates an unexpected provider raise.
- BUG-16 see above.

Already-closed (verified, no change): V2.1.6, V5.1.3, V9.1.2. Accept-risk +
documented: V2.1.8 cache window, V3.1.3 idle timeout. Tests added for every fix;
unanimous adversarial review after two refute-fix rounds.
2026-06-10 13:27:14 -04:00
d80e6aa6d1 fix(security): close MEDIUM ASVS findings — JWT pinning, SSE tickets, SSRF, mTLS pin, rate limits + correctness bugs
Auth (V2.1.1/V3.1.2, V2.1.3, V3.1.1):
- Pin JWT iss/aud/typ at mint and require+verify them at decode; revocation
  (jti denylist + tokens_valid_from) still enforced.
- Change-password now requires min_length=12.
- SSE auth moves off JWT-in-URL to a single-use 60s opaque ticket
  (POST /auth/sse-ticket); raw JWT in query no longer authenticates a stream.
  Removed dead fail-open get_stream_user helper.

Egress (V5.1.1, V9.1.1/V14.1.3):
- Webhook delivery + CRUD reject SSRF destinations (private/loopback/link-local/
  metadata, IPv4-mapped, multi-A-record) via resolved-IP validation, pin to the
  vetted IP, and never auto-follow redirects. Opt-out via DECNET_WEBHOOK_ALLOW_PRIVATE.
- UpdaterClient pins the worker leaf cert SHA-256 against the stored per-host
  fingerprint (fail closed on missing/mismatch); DECNET_VERIFY_HOSTNAME now
  defaults True.

Hardening (V13.1.3, V4.1.4, V13.1.2):
- Rate-limit change-password (5/min), enroll-bundle (10/min), webhook-create
  (20/min), host-delete (20/min) via the existing slowapi limiter.
- Correct false 'global auth middleware' comment; document enroll-bundle proxy
  trust.

Correctness (BUG-7..11):
- BUG-7 unbound bus in finally; BUG-8 apply_ceiling clamps to min(base,ceiling);
  BUG-9 commit before emit; BUG-10 multi-actor rearm for sub-threshold identities;
  BUG-11 normalize naive timestamps to UTC.

Already-closed (no change): V14.1.1, V2.1.2/V3.1.3, V5.1.2. Tests added for
every fix; unanimous adversarial review.
2026-06-10 12:32:15 -04:00
6a8af315fb fix(core): close HIGH ASVS findings V7.1.1 and correctness bugs BUG-1..6
- V7.1.1: /swarm/check no longer returns raw exception text; logs detail
  server-side, returns generic 'probe failed'.
- BUG-1: register EditAction -> SSHDriver so edit ticks no longer crash.
- BUG-2: topology reconcile matches generator-named deckies by
  expected-name membership instead of a hyphen heuristic.
- BUG-3: intel provider lookups acquire the per-provider semaphore so
  declared concurrency bounds are enforced.
- BUG-4: RuleIndex.install evicts a rule from kinds it no longer applies to.
- BUG-5: UnixSocketBus.connect() is lock-guarded with a double-check so
  concurrent first-connects open exactly one socket and reader task.
- BUG-6/V5.1.3: multi-token JSON-field search binds each token to a
  distinct parameter instead of collapsing to the last value.

Regression tests added for every fix, verified red-before/green-after.
V4.1.1c/V12.1.1 (updater master-CN gate) and V12.5.1 (tarball include-list)
confirmed already fixed in prior commits and left untouched.
2026-06-09 23:12:49 -04:00
8d18c59201 fix(swarm): require admin JWT on all swarm operator endpoints
Gate all 8 swarm-controller operator routes (enroll, list/get/decommission
hosts, deploy, teardown, check, list deckies) with the centralized
require_admin RBAC dependency alongside require_operator_cert; mTLS becomes
defense-in-depth instead of the only gate. /heartbeat stays cert-fingerprint
pinned (worker-facing) and /swarm/health stays open (liveness only).

CLI swarm commands now send Authorization: Bearer $DECNET_API_TOKEN with a
401/403 hint covering the must_change_password bootstrap flow.

Bump pyjwt to 2.13.0 and pip to 26.1.2 (pip-audit PYSEC-2026-175/177/178/179,
PYSEC-2026-196); authz suite re-verified on the new pyjwt.

Closes ASVS_L2_AUDIT.md V4.1.1a and V4.1.1b (CRITICAL).
2026-06-09 17:08:10 -04:00
ae16c4437b feat(auth): make access-token TTL configurable, default 4h
Replace the hardcoded 1440-minute (24h) JWT lifetime with
DECNET_JWT_EXP_MINUTES (validated positive int, default 240 = 4h).
Shrinks the passive window of a stolen token; active revocation is
unchanged (immediate->=<10s).
2026-05-30 23:05:05 -04:00
9fc489258b fix(auth): bulk-revoke sessions on password and role change
A stolen JWT used to survive a password reset for its full 24h. Now every
session-invalidating change moves the user's tokens_valid_from cutoff to
'now', so all of that user's prior tokens 401 on next use:

- self change-password, admin reset-password, role change all bump the
  cutoff (delete needs no bump: the row is gone, so the user lookup 401s).
- Cutoff is compared against the token's iat floored to whole seconds, so a
  re-login in the same second as the change isn't caught by its own
  revocation (the cost is a <=1s grey zone on same-second-old tokens).
- Per-user: changing one user never revokes another.
2026-05-30 18:27:53 -04:00
c82897193e feat(auth): logout endpoint revokes the presented token
POST /auth/logout adds the caller's jti to the denylist and drops the
local negative-cache entry, so the token 401s on its very next use.
Single-session semantics: only this token dies, other sessions for the
same user keep working. Reachable for must_change_password users (it
runs the revocation checks but skips the must_change gate via
get_token_claims) so a session can always be ended; an already-revoked
token is rejected.
2026-05-30 18:21:16 -04:00
698ecaa322 feat(auth): jti claim and token-revocation store
Stateless JWTs had no revocation path: a stolen token stayed valid for
its full 24h even after the victim changed their password, and there was
no logout. This lays the foundation for revoking them.

- User.tokens_valid_from: per-user bulk-revocation cutoff (compared against
  the token's iat). RevokedToken(jti PK, exp): single-token denylist, pruned
  opportunistically on insert so it never outgrows live-but-revoked tokens.
- login() now mints a jti; create_access_token already stamps iat/exp.
- repo.revoke_token / is_token_revoked / set_tokens_valid_from (abstract +
  shared sqlmodel impl + DummyRepo coverage stubs).
- Centralized validate path in dependencies.py: every auth dependency now
  resolves the user and fails closed on (1) missing jti (legacy/pre-deploy
  token -> one forced re-login), (2) iat before the cutoff, (3) a denylisted
  jti. Denylist lookups ride a 10s membership cache mirroring the user cache.
- Contract/fuzz harness seeds its fixed-uuid principal under
  DECNET_CONTRACT_TEST so its minted token resolves to a live admin user.
2026-05-30 18:18:41 -04:00
fdb6507c6f fix(env): fail closed on weak DECNET_ADMIN_PASSWORD
DECNET_ADMIN_PASSWORD defaulted to the literal "admin" with no guard, so
a master that never set it seeded an admin/admin account. Resolve it
lazily via __getattr__ -> _require_env (the same pattern as
DECNET_JWT_SECRET): unset or a known-bad default (admin/secret/...) is
rejected, and <12 chars is rejected outside DECNET_DEVELOPER. Only the
master web/api processes that import the DB layer resolve it; workers
never do, and the pytest short-circuit keeps the dev loop unaffected.
The module attribute stays addressable for the admin-seed monkeypatch.
2026-05-30 17:31:00 -04:00
28327a9b4e fix(swarm): ship update tarball from an explicit include-list, never secrets
tar_working_tree walked the whole working tree minus a blocklist that
omitted .env.local, *.key, *.pem, *.crt — so the JWT secret, Fernet key,
admin password, DB creds and TLS private keys fanned out to every worker
on each update push.

Invert to an allowlist (DEFAULT_INCLUDES = pyproject.toml + LICENSE +
README.md + decnet/), the exact surface 'pip install .' needs; decnet/
carries its own package-data. A defensive _HYGIENE_PATTERNS layer drops
secret-/churn-shaped files even if nested under decnet/. extra_excludes
can still narrow but can no longer widen past the allowlist.

Verified against the live repo: the bundle carries the package + metadata
and zero secret/db/log/pyc files, and pip-installs clean from the
extracted tree.
2026-05-30 17:26:23 -04:00
a4193d7022 fix(updater): enforce master CN gate and mandatory tarball checksum
The worker-side updater extracted + pip-installed + re-exec'd any tarball
from any caller holding a CA-signed cert; the documented updater@* CN
gating was never implemented. Now:

- require_master_cert gates /update, /update-self, /rollback, /releases:
  the client cert CN must be decnet-master (the identity UpdaterClient
  presents). A worker/agent cert can no longer push code to a peer.
- sha256 is mandatory on /update and /update-self (400 otherwise), so the
  integrity check always runs before extract/install. UpdaterClient
  already sends it; this just hardens the contract.

The transport peer-identity primitives move to decnet/web/_mtls.py (a
light namespace module) so the minimal updater reuses them without
importing the API router tree; router/swarm/_mtls.py re-exports them and
keeps the operator gate. Closes the updater-RCE critical.
2026-05-30 17:22:12 -04:00
30750d294d fix(swarm): mTLS client-cert authz on the swarm control plane
The swarm controller (port 8770) exposed 9 routes with zero app-layer
auth, and swarmctl --tls defaulted off — anyone able to reach the port
could enroll workers (minting CA-signed certs + private keys), deploy,
or tear down the fleet. Two fail-closed layers:

- require_operator_cert gates every operator route (enroll/deploy/
  teardown/hosts/check/deckies). When mTLS is on, the peer cert's CN
  must be an operator identity (decnet-master/swarmctl); worker and
  updater@* certs are rejected. Plaintext loopback (single-host master)
  is accepted as the local operator — the docker.sock boundary.
- swarmctl refuses to bind a routable interface without --tls, so a
  network-exposed plaintext control plane can never start.

/heartbeat keeps its worker fingerprint pinning. Closes the two ASVS
criticals (control-plane no-auth, unauthenticated cert minting).
2026-05-30 17:16:12 -04:00
e7a686206c refactor(swarm): shared mTLS peer-identity helper
Extract peer-cert extraction from the heartbeat endpoint into
decnet/web/router/swarm/_mtls.py, adding CN parsing alongside the
SHA-256 fingerprint and a require_operator_cert dependency (CN in
{decnet-master, swarmctl}). api_heartbeat delegates to it; behaviour
unchanged. Prerequisite for control-plane and updater authz.
2026-05-30 17:03:13 -04:00
431c86bbe8 docs: fix CLI references and start commands in README 2026-05-27 13:03:45 -04:00
f47dd4b520 docs: update README to reflect current codebase state
Some checks failed
CI / Test (Standard) (3.11) (push) Has been skipped
CI / Test (Live) (3.11) (push) Has been skipped
CI / Merge testing → main (push) Has been skipped
CI / Lint (ruff) (push) Successful in 20s
CI / SAST (bandit) (push) Successful in 32s
CI / Dependency audit (pip-audit) (push) Failing after 1m25s
CI / Merge dev → testing (push) Has been skipped
Rewrites the architecture section for the full current module tree and adds
new sections for the REST API, swarm/agent mode, service bus, attacker
intelligence stack (profiler, clustering, correlation, GeoIP/ASN),
MazeNET topology, canary tokens, and TTP tagging/export. Updates the CLI
reference table, test count (478 → 5050), and Python version constraints.
2026-05-26 00:57:40 -04:00
f2b3393669 chore: relicense to AGPL-3.0-or-later and add SPDX headers
Replaces LICENSE (GPLv3 -> AGPLv3) and prepends
`SPDX-License-Identifier: AGPL-3.0-or-later` to every source file
across decnet/, decnet_web/, tests/, scripts/, and tools/.

Rationale: closes the GPLv3 ASP loophole so any party operating a
modified DECNET as a network service must offer their modified
source. Personal copyright (Samuel Paschuan) + inbound=outbound
contributions make a future unilateral relicense infeasible.

- LICENSE: full AGPL-3.0 text (gnu.org/licenses/agpl-3.0.txt)
- COPYRIGHT: project copyright notice
- tools/add_spdx_headers.py: idempotent header injector
  (shebang- and PEP 263-aware)

Touches 1565 source files (.py, .ts, .tsx, .js, .jsx, .css, .sh).
No behavior change; comments only.
2026-05-22 21:04:16 -04:00
ee10b55cfe fix(engine): per-scope docker compose project names
Every compose invocation used -p decnet so fleet + every topology
lived in one docker compose project. --remove-orphans, run during
fleet pre-up cleanup and on every topology teardown / rollback, then
swept every container in the project not listed in the current compose
file — wiping sibling topologies and the flat fleet along with the
intended target.

Parameterize project on _compose / _compose_with_retry / _compose_ps
(default FLEET_COMPOSE_PROJECT="decnet"). Add _topology_compose_project
that returns decnet-topo-<id8>, and pass it through every topology
compose call site (master deploy_topology + rollback + post-deploy ps,
master teardown_topology, agent apply, agent teardown, all four live
service mutations on topology deckies). Fleet calls keep the default
and are unaffected.

Migration: live containers from before this fix remain in the shared
"decnet" project and need a one-time manual cleanup before they're
reachable to the new topology code paths.
2026-05-22 18:29:33 -04:00
1b90048715 fix(api): /deckies/deploy becomes additive by default
The wizard POSTs only the new decky on each submit. The handler used to
treat every INI as the complete desired fleet (config.deckies = INI) so
the reconciler tore down prior deckies as orphans — deploying a second
Windows workstation silently wiped the first.

Add replace_fleet to DeployIniRequest (default false). Default path
merges new deckies into existing config and rejects name/IP collisions
with 409. replace_fleet=true preserves set-desired-state semantics for
CLI / declarative callers. Lifecycle rows are created only for the
deckies submitted in the current call, so /deckies/lifecycle?ids=...
reflects exactly what this submit deployed.

build_deckies_from_ini gains reserved_ips so additive auto-allocation
skips IPs already held by the existing fleet.
2026-05-22 18:14:50 -04:00
5b13a01ab6 feat(web): deploy wizard polls async lifecycle instead of holding HTTP
- New useLifecyclePolling(ids, intervalMs) hook: polls
  GET /deckies/lifecycle?ids=... every 2s until every row is terminal,
  surfaces transient HTTP failures without giving up.
- DeployWizard: drops the 180s axios timeout and the fake-log-driven
  deployOk flag. After POST 202, sets lifecycle_ids -> the hook drives
  the per-decky pill grid (PENDING / RUNNING / SUCCEEDED / FAILED).
  Real terminal lines stream into the log as rows resolve. Auto-close
  on all-success after 700ms.
- DeckyFleet.css: .lifecycle-grid + .lifecycle-pill in the existing
  fleet vocabulary; running pill pulses, failed pill borders alert.
- Existing 4 wizard render tests still pass; 4 new hook tests cover
  empty ids / single-success / polling-until-terminal / HTTP error.
2026-05-22 17:50:26 -04:00
eacac9aa60 feat(api): GET /deckies/lifecycle + master startup sweep
GET /deckies/lifecycle?ids=<uuid>&ids=<uuid> returns the matching
DeckyLifecycle rows so the wizard can poll instead of holding an HTTP
request open across compose work. require_viewer gating -- read-only.

Startup sweep: on master boot, any pending/running row with
started_at older than 1h flips to failed with
error='master restarted during operation'. Pre-v1 substitute for a
durable task queue: if the master crashes mid-deploy, the wizard sees
FAILED on refresh and the operator retries. Idempotent + cheap; runs
unconditionally including in contract-test mode.
2026-05-22 16:44:17 -04:00
4743c8f733 feat(api): /deckies/deploy and /mutate become 202 fire-and-forget
This is the unblock for the wizard hang. Both endpoints used to run
docker compose synchronously inside the HTTP handler -- on master
(unihost) or via asyncio.gather of worker /deploy POSTs at 600s
timeout each (swarm) -- blocking every other API request.

New flow:
  1. Commit the new config shape to repo state (fast).
  2. Create one DeckyLifecycle row per decky (status=pending).
  3. Spawn asyncio.create_task(run_deploy / run_mutate) -- the
     lifecycle runner drives rows through running -> succeeded|failed
     and emits decky.<name>.lifecycle on the bus.
  4. Return 202 with {lifecycle_ids: [...]}. Wizard polls
     GET /deckies/lifecycle?ids=... (next commit).

mutator/engine.py gains pick_new_services() -- shared between the
async API path and the watch-loop's synchronous mutate_decky().

DeployResponse grows lifecycle_ids[]. The old dispatch_decnet_config
helper still exists for the CLI swarm-deploy command path; it just
isn't called from the API handler anymore.

Test changes: 200 -> 202, drop dispatch_decnet_config mocks (handler
no longer calls it), assert lifecycle_ids in response + committed
state matches expectations.
2026-05-22 16:40:55 -04:00
e5e2bec3aa feat(swarm): heartbeat handler applies lifecycle deltas
HeartbeatRequest grows an optional lifecycle field carrying per-decky
completion records from the worker:
  [{decky_name, operation, status, error?, completed_at?}]

For each delta, the master finds the most-recently-started open
DeckyLifecycle row for (decky_name, operation, host_uuid) and flips
it to terminal with the worker's error text + timestamp. Stale
duplicates (row already sealed or never existed) are logged and
dropped -- not errors.

Each successful pivot also emits decky.<name>.lifecycle on the bus
so the dashboard sees the transition without waiting for its next
poll tick.

This is the master-side completion channel for the worker's 202
fire-and-forget /deploy and /mutate.
2026-05-22 16:33:48 -04:00
d1ca96b2f4 feat(agent): /deploy and /mutate become 202 fire-and-forget
The wizard API used to hang because /deckies/deploy ran docker compose
build && up -d synchronously, holding the request thread for minutes.
The worker side of that pipeline now returns 202 Accepted immediately
and runs the deploy in an asyncio.create_task.

On task completion (success or failure) the worker pushes a one-off
heartbeat carrying a lifecycle delta per decky:
  {decky_name, operation, status: succeeded|failed, error?, completed_at}

Master pivots these onto open DeckyLifecycle rows in the heartbeat
handler (next commit).  The scheduled 30s heartbeat tick is the
fallback if the immediate push drops.

- decnet/agent/app.py: /deploy and /mutate return 202; dry_run mutate
  still validates synchronously and returns 200.
- decnet/agent/executor.py: deploy_async + mutate_async wrap the work
  and push the completion delta.
- decnet/agent/heartbeat.py: push_lifecycle_delta() helper builds a
  one-off body and POSTs with the same mTLS context as the loop.
- decnet/swarm/client.py: revert deploy/mutate to control timeout
  (master no longer holds the HTTP request open for compose work).

Worker state.json gains no lifecycle field -- master DeckyLifecycle is
the source of truth; the master sweep handles crashed-mid-deploy
recovery.
2026-05-22 16:31:23 -04:00
c0ad380020 feat(lifecycle): runner + strategies + bus topic
Add decnet.lifecycle package: pure orchestration layer that the
master API will invoke via asyncio.create_task to drive DeckyLifecycle
rows through pending -> running -> succeeded | failed without
holding an HTTP request open.

Strategy classes per (operation, transport):
- LocalDeployStrategy: master-resident, runs engine.deployer.deploy
  in a thread.
- SwarmDeployStrategy: shards by host_uuid, dispatches via
  AgentClient.deploy; worker drives terminal via heartbeat.
- LocalMutateStrategy: write_compose + compose up.
- SwarmMutateStrategy: AgentClient.mutate; worker drives terminal.

decnet.bus.topics gains decky_lifecycle(name) -> decky.<name>.lifecycle
plus DECKY_LIFECYCLE constant. Payload documented in the wiki
(separate commit). publish_safely keeps bus best-effort.

Nothing is wired to call this yet -- next commits convert worker
/deploy /mutate to 202, then heartbeat delta wiring, then master API.
2026-05-22 16:25:33 -04:00
05c0721a51 feat(db): add DeckyLifecycle table for async deploy/mutate tracking
One row per (decky, operation) attempt. State machine:
pending -> running -> succeeded | failed (+ error text). Rows are
append-only after terminal; retries write a new row.

Sibling of DeckyShard rather than a rework -- DeckyShard tracks
runtime container state observed via heartbeat, this tracks
operation lifecycle. New table, UUID PK.

Adds BaseRepository abstract methods (create_lifecycle,
update_lifecycle, get_lifecycle_by_ids, find_open_lifecycle,
sweep_stale_lifecycle) with SQLModelRepository mixin impl.
Backbone for the upcoming 202-Accepted async API.
2026-05-22 16:20:00 -04:00
ade8bbe30a feat(agent): real worker-side /mutate with master swarm dispatch
- Implement /mutate handler: load_state, update services + last_mutated,
  save_state, write_compose, compose up -d via asyncio.to_thread. 404
  for missing state / unknown decky_id. dry_run short-circuits before
  any side effect.
- Add AgentClient.mutate(decky_id, services, *, dry_run=False) using
  _TIMEOUT_DEPLOY (compose up can pull/build, exceeds control timeout).
- mutator/engine.py: in swarm mode with decky.host_uuid set, resolve
  worker via _resolve_swarm_host and dispatch through AgentClient.mutate
  instead of writing a compose file on master. Master-resident deckies
  (unihost mode, or swarm with host_uuid=None) keep the local path.
2026-05-22 16:14:46 -04:00
418245f9b4 chore(deps): bump starlette to 1.0.1 (PYSEC-2026-161) 2026-05-22 16:14:35 -04:00
8eccb260be feat(dns-service): expose DNS_STATE_PATH config field
Adds state_path ServiceConfigField and passes DNS_STATE_PATH into the
container environment. Operator must mount the parent directory on a
volume for persistence to survive container recreation.
2026-05-21 22:10:43 -04:00