Campaign clusterer gains a keystroke edge: when two identities'
kd_digraph_simhash centroids are within KD_HAMMING_MAX bits, a graded
weight (1.0 at identical, fading to 0 at the cutoff) feeds the campaign
graph. Supporting tier (0.6) — a typing match plus temporal overlap
reaches threshold, but typing alone never merges (FP guard against
coarse, noisy terminal timing).
Projects the column through IdentityFeatures + from_identity_row.
The identity clusterer folds an identity's per-session
motor.digraph_simhash observations into one 8-byte bitwise-majority
centroid (denoises per-session jitter) and writes it to
AttackerIdentity.kd_digraph_simhash via update_identity_fingerprints —
the orphaned column is now populated. list_identities_for_clustering
projects it so the campaign clusterer can read it.
Extends the repo abstract + DummyRepo stub/coverage.
Per-session 64-bit SimHash of inter-keystroke digraph flight times:
walk single-char input events, accumulate flight time per (c1,c2),
bucket the median, Charikar-SimHash the bucketed pairs. Locality-
sensitive so the same typist is Hamming-close across sessions; pastes
and think-pauses break the chain; silent below the sample-size floor.
New shared decnet/util/simhash.py (simhash64/hamming64/bytes helpers).
Registered as a conditional Tier-A primitive (count 37->38); requires
behave-shell>=0.1.2.
initialize() now delegates to _apply_schema(): real boots run
'alembic upgrade head' (schema owned by the migration history); tests
(DECNET_TESTING=1) keep create_all, which is faster and needs no upgrade
path. MySQL wraps the upgrade in the existing GET_LOCK advisory lock so
concurrent uvicorn workers don't race on DDL.
Deletes the three _migrate_* crimes (attackers-table legacy drop +
GeoIP backfill, TEXT->MEDIUMTEXT widening) — all now handled by the
baseline migration and the _BIG_TEXT model variants. Drops the test
file that only exercised the deleted helpers; adds tests pinning the
alembic-vs-create_all gate and guarding that every model table is in
the migration head.
Introduce Alembic at v1. Migrations live inside the package
(decnet/web/db/migrations) so they ship with installs; alembic.ini at the
repo root drives the CLI. env.py is async and dual-backend, selecting the
engine from DECNET_DB_TYPE (mirroring db/factory.py) and reusing the app's
own connection when run programmatically.
The baseline captures all 39 tables. _BIG_TEXT round-trips as
Text().with_variant(MEDIUMTEXT, 'mysql'), so both backends get the right
column type from the migration. kd_digraph_simhash gains a sqlite BLOB
variant: BINARY(8) reflects as NUMERIC on SQLite and would otherwise trip
'alembic check' forever.
Live topology edits fired one mutation per canvas action. That coupled
each edit to an immediate enqueue+apply, which (post-serialization)
raced the SSE refetch and duplicated optimistic placeholders, and gave
the user no chance to assemble a coherent changeset (add a net AND
bridge it) before any of it landed.
Live edits now STAGE: each editor primitive records its op and returns
immediately; the optimistic placeholders callers already draw are the
staged preview. The action button reads UPDATE (n) when live (DEPLOY
when pending) and flushes the batch through the slice-1 submit queue —
sequential, version-cursored, each awaited to a terminal state, stopping
loudly on the first failure with the unapplied remainder kept for retry.
REFRESH becomes DISCARD (n) to drop the batch. SSE refetch is paused
during a commit so per-mutation applied events don't wipe still-staged
placeholders mid-batch; one refetch reconciles at the end.
Also fix _dropArchetype, which bailed without an optimistic node on the
staged path, leaving a decky added to an uncommitted LAN invisible until
UPDATE.
Live MazeNET edits fired their mutations fire-and-forget: each canvas
action enqueued immediately and never awaited the result. Two failures
followed from that:
- expected_version is bumped at ENQUEUE (not at apply), so two ops fired
back-to-back raced — the second carried a stale version and 409'd.
Edits only worked when hand-paced (an SSE refetch landed between them).
- A failed mutation degrades the topology, but the only signal was a 4s
toast, so the user saw DEGRADED with no cause.
useTopologyEditor now routes every live op through a serialized submit
queue: one enqueue in flight at a time (submission order preserved), an
optimistic expected_version cursor advanced per enqueue so back-to-back
ops (e.g. reparent's detach+attach) don't need a refetch between them,
and each mutation awaited to a terminal state. A 'failed' row throws
MutationFailedError, which the page pins as a persistent UPDATE FAILED
banner instead of a vanishing toast.
Slice 1 of the live-edit rework; stage+UPDATE-button batching and louder
backend materialisation reporting to follow.
The Fleet UI only showed TEARDOWN for swarm-pinned deckies (POST
/swarm/hosts/{uuid}/teardown). Local deckies had no delete control though
the API now exposes DELETE /deckies/{name}.
teardown() branches on swarm vs local; the card's two-step arm/CONFIRM
button renders for any admin, keyed td:${host_uuid ?? 'local'}:${name}.
The Fleet module had no delete — neither UI nor API — though the engine
capability existed (engine.teardown(decky_id=...), exposed only via
`decnet teardown --id`). Wire it to HTTP.
DELETE /deckies/{name} (admin-gated, 204). Synchronous: a single decky's
compose stop/rm is quick, so it's awaited off-thread rather than the
202+lifecycle path deploy/mutate use for slow builds. The single-decky
teardown never touches the host macvlan interface, so it needs no extra
CAP_NET_ADMIN.
State consistency: engine.teardown removes the containers and the
fleet_deckies row but leaves the decky in decnet-state.json. Left as is, the
reconciler would see "present in JSON, absent from DB" and re-INSERT the row,
resurrecting the decky. So the handler prunes it from both decnet-state.json
and the DB deployment key after teardown; deleting the last decky clears
state entirely (DecnetConfig.deckies has min_length=1).
Route ordering: the dynamic DELETE /deckies/{decky_name} is registered AFTER
the fixed /deckies/* routes (Starlette matches in registration order), so it
no longer shadows DELETE /deckies/files (file-drop).
Tests cover 401/403/404/422, single-delete pruning, and last-decky clear.
These had been red since the changes they cover landed — invisible because
the pre-commit gate runs mypy/ruff/bandit/pip-audit but NOT pytest, so failing
tests don't block commits and quietly accumulate.
- SSE stream/events auth migrated from ?token=<jwt> to a single-use ?ticket=
(commit efb4e49d). Three tests still passed a raw JWT as ?token= and got
401. Updated to mint a ticket via POST /auth/sse-ticket and pass ?ticket=
(attacker events, topology events, /stream).
- The user-creation password policy is min_length=12; the RBAC admin-access
test still used a 10-char password and was rejected. Bumped to a valid one.
pip-audit flagged fixable advisories in the web stack:
- cryptography -> >=48.0.1 (GHSA-537c-gmf6-5ccf)
- python-multipart -> >=0.0.31 (CVE-2026-53538/53539/53540)
- starlette (transitive via fastapi) -> add direct floor >=1.3.1
(CVE-2026-48817/48818/54282/54283)
Venv synced to cryptography 49.0.0, python-multipart 0.0.32, starlette
1.3.1; full tests/api/ suite green against the bump. Also drops the stray
browser-use[core] dev dep (the browser-use skill uses a global CLI; the
package is imported nowhere in DECNET).
Two defects exposed after the deploy-success loop fix (verified live):
1. Duplicated / skipped transcript lines. The placeholder-log interval did
`setLog(prev => [...prev, msgs[i]])` then `i++`. React 18 auto-batches
setInterval updaters, so the updater ran after i had advanced and read the
wrong index — skipping some lines ([NET], [SENSE]) and duplicating others
([TLS]). Fixed by capturing `const line = msgs[i]` before scheduling the
update. A placeholderStartedRef also gates the effect to one run per deploy
(reset in startDeploy) as defense-in-depth against re-render churn.
2. Wizard never closed on success. The completedRef guard combined with the
auto-close effect's cleanup was self-defeating: a re-render inside the
700ms window (e.g. the [OK] terminal-log append) ran the cleanup, clearing
the pending close, and the guard then blocked rescheduling — so onComplete
never fired. The timer now lives in a ref cleared only on unmount, so a
scheduled close always fires exactly once regardless of re-renders.
Adds a regression test that a re-render during the close countdown does not
cancel the close. Verified end-to-end against the live instance: all 8 log
lines render once in order, the wizard auto-closes, and /deckies does not
storm.
After a successful deploy the wizard hammered GET /deckies and stacked
"DEPLOYED" toasts unbounded (~1.4/s, forever). Two compounding causes:
1. The auto-close effect's dep array includes onComplete, which the parent
passes as an inline arrow (new reference every render). onComplete calls
refresh(), re-rendering the parent, producing a new onComplete ref,
re-running the effect, which reschedules onComplete — a feedback loop.
2. DeployWizard stayed permanently mounted (open only toggled a child
overlay), so its hooks kept running with lifecycleDone===true after close.
Fix both: a completedRef guard makes the auto-close fire exactly once per
deploy regardless of effect re-runs (reset in startDeploy), and the parent
now mounts the wizard only while open so closing tears down its hooks and
clears the latent "reopen re-completes stale rows" path.
Lifecycle polling itself was never the runaway — it stops cleanly at
terminal status; the bounded ~25-poll build window is expected.
Adds a regression test asserting onComplete fires once across a simulated
re-render storm.
The web deploy collision-guard read the existing fleet from the DB
State["deployment"] key, while the UI/get_deckies() read decnet-state.json.
A fleet established via CLI/seed lands in neither path the guard consulted,
so existing_deckies was empty, the additive guard ran blind, and the
reconciler tore the running fleet down to the single submitted decky
(BUG-2: silent fleet wipe, HTTP 202, no warning).
Converge both reads on fleet_deckies — the engine-mirrored table written on
every deploy/teardown (CLI and web), which fleet/reconciler.py already
documents as the store the orchestrator, dashboard, and REST API see. Each
row's decky_config column is a full DeckyConfig dump, so it rehydrates
losslessly into the collision-guard input. The handler also commits the
intended fleet to fleet_deckies synchronously so rapid sequential deploys
read a current fleet and the dashboard observes the new shape immediately.
State["deployment"] is retained for now — the mutate handlers and the
mutator engine still coordinate through it; consolidating them is tracked
in development/ADR-001-FLEET-SOURCE-OF-TRUTH.md (open question 7).
Tests seed fleet_deckies directly (also modelling the CLI-seeded scenario)
rather than chaining real deploys through the skipped contract-test path.
Follow-up to V9.1.4 (which covered only the syslog forwarder/listener): set
ctx.minimum_version = TLSVersion.TLSv1_2 on the remaining DECNET-owned mTLS
client contexts — AgentClient (_build_client + _fetch_peer_fingerprint),
UpdaterClient (_build_client + _fetch_peer_fingerprint), and the updater
executor's worker context. Pure hardening, no behavior change for TLS1.2+
peers (confirmed by the existing mTLS round-trip suites).
Deliberately EXCLUDED — hardening these would be counterproductive:
- templates/https/server.py, templates/rdp/server.py: honeypot listeners,
where looking weak/old is part of the deception.
- prober/tlscert.py: outbound TLS fingerprinting prober, which must speak
whatever the attacker's target offers.
Added a floor-assertion test (spies httpx.AsyncClient to capture the real
verify= context).
The V3.1.1 backend change moved SSE auth off ?token=<JWT> onto a single-use
?ticket=, but the dashboard was never updated, so every live stream 401'd
('Could not validate credentials'). Add mintSseTicket() (POST /auth/sse-ticket
with the Bearer JWT, returns an opaque 60s single-use ticket) and refactor all
stream consumers to mint a fresh ticket at the top of each connect() — initial
and every reconnect — then open EventSource with ?ticket=. A reused single-use
ticket would 401-loop, so re-mint-per-connect is required.
Covers Dashboard /stream, LiveLogs, and the attacker/identity/campaign/
orchestrator/topology hooks. connect() is now async with an unmount guard
(cancelled flag checked after the await, before opening the stream); on a mint
401 the connect is skipped and the axios logout interceptor takes over.
The change-password form let the browser submit short passwords the API
then rejected with an opaque 'Schema structural violation' 400. Add a pure
validateNewPassword() util (>=12 chars, <=72 bytes, >=3 of 4 character
classes — constants tweakable) and a live ✓/✗ checklist above the submit
button so the user sees exactly what's missing. Submit is gated on
validity + confirm-match, so the form can no longer reach that 400.
- Fix minLength 8->12 on the Login change-password inputs and the UsersTab
admin-reset guard (both lagged the API's min_length=12).
- Light-mode: render the checklist box fully white with black text (the
neon-on-dark styling read as muddy grey); ✓/✗ icons keep a green/red cue.
- Advisory UX only — the API min_length=12 remains the enforcement boundary;
character-class complexity is not server-enforced.
Turn on mypy warn_return_any (pyproject) and resolve the 84 resulting
[no-any-return] errors across 43 files with typing.cast() at the return
sites — runtime no-ops that make the declared return type explicit where a
dependency (SQLAlchemy scalar/first/one, httpx .json(), subprocess, docker
SDK) hands back Any. No behavior change: no DTO/table field types altered, no
validation/coercion calls added, every cast reflects the true runtime type.
Locks in return-type strictness so the class of bug where a function silently
widens to Any can't regress. mypy decnet/ clean; adversarially verified
behavior-preserving (84 casts 1:1 with prior returns).
Bump tornado 6.5.5 -> 6.5.7 (CVE-2026-49854, transitive via snakeviz).
- V7.1.3: env known-insecure-default error no longer echoes the rejected secret value.
- V9.1.4: syslog-over-TLS forwarder + listener pin minimum_version=TLSv1_2.
- V12.1.2: updater tarball SHA-256 verification is now mandatory and fail-closed —
/update and /update-self reject a missing digest (400), the executor rejects
missing/mismatched digests before extract/apply. Every push path supplies it.
- V13.1.4: reject a wildcard '*' in DECNET_CORS_ORIGINS at startup.
- V13.1.5: enforce application/json on JSON write endpoints (415 otherwise),
exempting multipart upload routes.
- BUG-17: SSE error log records the user uuid, not the resume cursor.
Also completes V2.1.7 consistently: the attacker-injectable PYTEST* env bypass is
replaced with explicit DECNET_TESTING=1 in the three remaining sites
(env.validate_public_binding, config logging, mysql url builder).
Tests added for every fix; unanimous adversarial review (no update-outage risk —
all push paths verified to send the digest).
Auth (V2.1.1/V3.1.2, V2.1.3, V3.1.1):
- Pin JWT iss/aud/typ at mint and require+verify them at decode; revocation
(jti denylist + tokens_valid_from) still enforced.
- Change-password now requires min_length=12.
- SSE auth moves off JWT-in-URL to a single-use 60s opaque ticket
(POST /auth/sse-ticket); raw JWT in query no longer authenticates a stream.
Removed dead fail-open get_stream_user helper.
Egress (V5.1.1, V9.1.1/V14.1.3):
- Webhook delivery + CRUD reject SSRF destinations (private/loopback/link-local/
metadata, IPv4-mapped, multi-A-record) via resolved-IP validation, pin to the
vetted IP, and never auto-follow redirects. Opt-out via DECNET_WEBHOOK_ALLOW_PRIVATE.
- UpdaterClient pins the worker leaf cert SHA-256 against the stored per-host
fingerprint (fail closed on missing/mismatch); DECNET_VERIFY_HOSTNAME now
defaults True.
Hardening (V13.1.3, V4.1.4, V13.1.2):
- Rate-limit change-password (5/min), enroll-bundle (10/min), webhook-create
(20/min), host-delete (20/min) via the existing slowapi limiter.
- Correct false 'global auth middleware' comment; document enroll-bundle proxy
trust.
Correctness (BUG-7..11):
- BUG-7 unbound bus in finally; BUG-8 apply_ceiling clamps to min(base,ceiling);
BUG-9 commit before emit; BUG-10 multi-actor rearm for sub-threshold identities;
BUG-11 normalize naive timestamps to UTC.
Already-closed (no change): V14.1.1, V2.1.2/V3.1.3, V5.1.2. Tests added for
every fix; unanimous adversarial review.
- V7.1.1: /swarm/check no longer returns raw exception text; logs detail
server-side, returns generic 'probe failed'.
- BUG-1: register EditAction -> SSHDriver so edit ticks no longer crash.
- BUG-2: topology reconcile matches generator-named deckies by
expected-name membership instead of a hyphen heuristic.
- BUG-3: intel provider lookups acquire the per-provider semaphore so
declared concurrency bounds are enforced.
- BUG-4: RuleIndex.install evicts a rule from kinds it no longer applies to.
- BUG-5: UnixSocketBus.connect() is lock-guarded with a double-check so
concurrent first-connects open exactly one socket and reader task.
- BUG-6/V5.1.3: multi-token JSON-field search binds each token to a
distinct parameter instead of collapsing to the last value.
Regression tests added for every fix, verified red-before/green-after.
V4.1.1c/V12.1.1 (updater master-CN gate) and V12.5.1 (tarball include-list)
confirmed already fixed in prior commits and left untouched.
Gate all 8 swarm-controller operator routes (enroll, list/get/decommission
hosts, deploy, teardown, check, list deckies) with the centralized
require_admin RBAC dependency alongside require_operator_cert; mTLS becomes
defense-in-depth instead of the only gate. /heartbeat stays cert-fingerprint
pinned (worker-facing) and /swarm/health stays open (liveness only).
CLI swarm commands now send Authorization: Bearer $DECNET_API_TOKEN with a
401/403 hint covering the must_change_password bootstrap flow.
Bump pyjwt to 2.13.0 and pip to 26.1.2 (pip-audit PYSEC-2026-175/177/178/179,
PYSEC-2026-196); authz suite re-verified on the new pyjwt.
Closes ASVS_L2_AUDIT.md V4.1.1a and V4.1.1b (CRITICAL).
Replace the hardcoded 1440-minute (24h) JWT lifetime with
DECNET_JWT_EXP_MINUTES (validated positive int, default 240 = 4h).
Shrinks the passive window of a stolen token; active revocation is
unchanged (immediate->=<10s).
A stolen JWT used to survive a password reset for its full 24h. Now every
session-invalidating change moves the user's tokens_valid_from cutoff to
'now', so all of that user's prior tokens 401 on next use:
- self change-password, admin reset-password, role change all bump the
cutoff (delete needs no bump: the row is gone, so the user lookup 401s).
- Cutoff is compared against the token's iat floored to whole seconds, so a
re-login in the same second as the change isn't caught by its own
revocation (the cost is a <=1s grey zone on same-second-old tokens).
- Per-user: changing one user never revokes another.
POST /auth/logout adds the caller's jti to the denylist and drops the
local negative-cache entry, so the token 401s on its very next use.
Single-session semantics: only this token dies, other sessions for the
same user keep working. Reachable for must_change_password users (it
runs the revocation checks but skips the must_change gate via
get_token_claims) so a session can always be ended; an already-revoked
token is rejected.
Stateless JWTs had no revocation path: a stolen token stayed valid for
its full 24h even after the victim changed their password, and there was
no logout. This lays the foundation for revoking them.
- User.tokens_valid_from: per-user bulk-revocation cutoff (compared against
the token's iat). RevokedToken(jti PK, exp): single-token denylist, pruned
opportunistically on insert so it never outgrows live-but-revoked tokens.
- login() now mints a jti; create_access_token already stamps iat/exp.
- repo.revoke_token / is_token_revoked / set_tokens_valid_from (abstract +
shared sqlmodel impl + DummyRepo coverage stubs).
- Centralized validate path in dependencies.py: every auth dependency now
resolves the user and fails closed on (1) missing jti (legacy/pre-deploy
token -> one forced re-login), (2) iat before the cutoff, (3) a denylisted
jti. Denylist lookups ride a 10s membership cache mirroring the user cache.
- Contract/fuzz harness seeds its fixed-uuid principal under
DECNET_CONTRACT_TEST so its minted token resolves to a live admin user.
DECNET_ADMIN_PASSWORD defaulted to the literal "admin" with no guard, so
a master that never set it seeded an admin/admin account. Resolve it
lazily via __getattr__ -> _require_env (the same pattern as
DECNET_JWT_SECRET): unset or a known-bad default (admin/secret/...) is
rejected, and <12 chars is rejected outside DECNET_DEVELOPER. Only the
master web/api processes that import the DB layer resolve it; workers
never do, and the pytest short-circuit keeps the dev loop unaffected.
The module attribute stays addressable for the admin-seed monkeypatch.
tar_working_tree walked the whole working tree minus a blocklist that
omitted .env.local, *.key, *.pem, *.crt — so the JWT secret, Fernet key,
admin password, DB creds and TLS private keys fanned out to every worker
on each update push.
Invert to an allowlist (DEFAULT_INCLUDES = pyproject.toml + LICENSE +
README.md + decnet/), the exact surface 'pip install .' needs; decnet/
carries its own package-data. A defensive _HYGIENE_PATTERNS layer drops
secret-/churn-shaped files even if nested under decnet/. extra_excludes
can still narrow but can no longer widen past the allowlist.
Verified against the live repo: the bundle carries the package + metadata
and zero secret/db/log/pyc files, and pip-installs clean from the
extracted tree.
The worker-side updater extracted + pip-installed + re-exec'd any tarball
from any caller holding a CA-signed cert; the documented updater@* CN
gating was never implemented. Now:
- require_master_cert gates /update, /update-self, /rollback, /releases:
the client cert CN must be decnet-master (the identity UpdaterClient
presents). A worker/agent cert can no longer push code to a peer.
- sha256 is mandatory on /update and /update-self (400 otherwise), so the
integrity check always runs before extract/install. UpdaterClient
already sends it; this just hardens the contract.
The transport peer-identity primitives move to decnet/web/_mtls.py (a
light namespace module) so the minimal updater reuses them without
importing the API router tree; router/swarm/_mtls.py re-exports them and
keeps the operator gate. Closes the updater-RCE critical.
The swarm controller (port 8770) exposed 9 routes with zero app-layer
auth, and swarmctl --tls defaulted off — anyone able to reach the port
could enroll workers (minting CA-signed certs + private keys), deploy,
or tear down the fleet. Two fail-closed layers:
- require_operator_cert gates every operator route (enroll/deploy/
teardown/hosts/check/deckies). When mTLS is on, the peer cert's CN
must be an operator identity (decnet-master/swarmctl); worker and
updater@* certs are rejected. Plaintext loopback (single-host master)
is accepted as the local operator — the docker.sock boundary.
- swarmctl refuses to bind a routable interface without --tls, so a
network-exposed plaintext control plane can never start.
/heartbeat keeps its worker fingerprint pinning. Closes the two ASVS
criticals (control-plane no-auth, unauthenticated cert minting).
Extract peer-cert extraction from the heartbeat endpoint into
decnet/web/router/swarm/_mtls.py, adding CN parsing alongside the
SHA-256 fingerprint and a require_operator_cert dependency (CN in
{decnet-master, swarmctl}). api_heartbeat delegates to it; behaviour
unchanged. Prerequisite for control-plane and updater authz.
Rewrites the architecture section for the full current module tree and adds
new sections for the REST API, swarm/agent mode, service bus, attacker
intelligence stack (profiler, clustering, correlation, GeoIP/ASN),
MazeNET topology, canary tokens, and TTP tagging/export. Updates the CLI
reference table, test count (478 → 5050), and Python version constraints.
Replaces LICENSE (GPLv3 -> AGPLv3) and prepends
`SPDX-License-Identifier: AGPL-3.0-or-later` to every source file
across decnet/, decnet_web/, tests/, scripts/, and tools/.
Rationale: closes the GPLv3 ASP loophole so any party operating a
modified DECNET as a network service must offer their modified
source. Personal copyright (Samuel Paschuan) + inbound=outbound
contributions make a future unilateral relicense infeasible.
- LICENSE: full AGPL-3.0 text (gnu.org/licenses/agpl-3.0.txt)
- COPYRIGHT: project copyright notice
- tools/add_spdx_headers.py: idempotent header injector
(shebang- and PEP 263-aware)
Touches 1565 source files (.py, .ts, .tsx, .js, .jsx, .css, .sh).
No behavior change; comments only.
Every compose invocation used -p decnet so fleet + every topology
lived in one docker compose project. --remove-orphans, run during
fleet pre-up cleanup and on every topology teardown / rollback, then
swept every container in the project not listed in the current compose
file — wiping sibling topologies and the flat fleet along with the
intended target.
Parameterize project on _compose / _compose_with_retry / _compose_ps
(default FLEET_COMPOSE_PROJECT="decnet"). Add _topology_compose_project
that returns decnet-topo-<id8>, and pass it through every topology
compose call site (master deploy_topology + rollback + post-deploy ps,
master teardown_topology, agent apply, agent teardown, all four live
service mutations on topology deckies). Fleet calls keep the default
and are unaffected.
Migration: live containers from before this fix remain in the shared
"decnet" project and need a one-time manual cleanup before they're
reachable to the new topology code paths.
The wizard POSTs only the new decky on each submit. The handler used to
treat every INI as the complete desired fleet (config.deckies = INI) so
the reconciler tore down prior deckies as orphans — deploying a second
Windows workstation silently wiped the first.
Add replace_fleet to DeployIniRequest (default false). Default path
merges new deckies into existing config and rejects name/IP collisions
with 409. replace_fleet=true preserves set-desired-state semantics for
CLI / declarative callers. Lifecycle rows are created only for the
deckies submitted in the current call, so /deckies/lifecycle?ids=...
reflects exactly what this submit deployed.
build_deckies_from_ini gains reserved_ips so additive auto-allocation
skips IPs already held by the existing fleet.
- New useLifecyclePolling(ids, intervalMs) hook: polls
GET /deckies/lifecycle?ids=... every 2s until every row is terminal,
surfaces transient HTTP failures without giving up.
- DeployWizard: drops the 180s axios timeout and the fake-log-driven
deployOk flag. After POST 202, sets lifecycle_ids -> the hook drives
the per-decky pill grid (PENDING / RUNNING / SUCCEEDED / FAILED).
Real terminal lines stream into the log as rows resolve. Auto-close
on all-success after 700ms.
- DeckyFleet.css: .lifecycle-grid + .lifecycle-pill in the existing
fleet vocabulary; running pill pulses, failed pill borders alert.
- Existing 4 wizard render tests still pass; 4 new hook tests cover
empty ids / single-success / polling-until-terminal / HTTP error.
GET /deckies/lifecycle?ids=<uuid>&ids=<uuid> returns the matching
DeckyLifecycle rows so the wizard can poll instead of holding an HTTP
request open across compose work. require_viewer gating -- read-only.
Startup sweep: on master boot, any pending/running row with
started_at older than 1h flips to failed with
error='master restarted during operation'. Pre-v1 substitute for a
durable task queue: if the master crashes mid-deploy, the wizard sees
FAILED on refresh and the operator retries. Idempotent + cheap; runs
unconditionally including in contract-test mode.
This is the unblock for the wizard hang. Both endpoints used to run
docker compose synchronously inside the HTTP handler -- on master
(unihost) or via asyncio.gather of worker /deploy POSTs at 600s
timeout each (swarm) -- blocking every other API request.
New flow:
1. Commit the new config shape to repo state (fast).
2. Create one DeckyLifecycle row per decky (status=pending).
3. Spawn asyncio.create_task(run_deploy / run_mutate) -- the
lifecycle runner drives rows through running -> succeeded|failed
and emits decky.<name>.lifecycle on the bus.
4. Return 202 with {lifecycle_ids: [...]}. Wizard polls
GET /deckies/lifecycle?ids=... (next commit).
mutator/engine.py gains pick_new_services() -- shared between the
async API path and the watch-loop's synchronous mutate_decky().
DeployResponse grows lifecycle_ids[]. The old dispatch_decnet_config
helper still exists for the CLI swarm-deploy command path; it just
isn't called from the API handler anymore.
Test changes: 200 -> 202, drop dispatch_decnet_config mocks (handler
no longer calls it), assert lifecycle_ids in response + committed
state matches expectations.
HeartbeatRequest grows an optional lifecycle field carrying per-decky
completion records from the worker:
[{decky_name, operation, status, error?, completed_at?}]
For each delta, the master finds the most-recently-started open
DeckyLifecycle row for (decky_name, operation, host_uuid) and flips
it to terminal with the worker's error text + timestamp. Stale
duplicates (row already sealed or never existed) are logged and
dropped -- not errors.
Each successful pivot also emits decky.<name>.lifecycle on the bus
so the dashboard sees the transition without waiting for its next
poll tick.
This is the master-side completion channel for the worker's 202
fire-and-forget /deploy and /mutate.
The wizard API used to hang because /deckies/deploy ran docker compose
build && up -d synchronously, holding the request thread for minutes.
The worker side of that pipeline now returns 202 Accepted immediately
and runs the deploy in an asyncio.create_task.
On task completion (success or failure) the worker pushes a one-off
heartbeat carrying a lifecycle delta per decky:
{decky_name, operation, status: succeeded|failed, error?, completed_at}
Master pivots these onto open DeckyLifecycle rows in the heartbeat
handler (next commit). The scheduled 30s heartbeat tick is the
fallback if the immediate push drops.
- decnet/agent/app.py: /deploy and /mutate return 202; dry_run mutate
still validates synchronously and returns 200.
- decnet/agent/executor.py: deploy_async + mutate_async wrap the work
and push the completion delta.
- decnet/agent/heartbeat.py: push_lifecycle_delta() helper builds a
one-off body and POSTs with the same mTLS context as the loop.
- decnet/swarm/client.py: revert deploy/mutate to control timeout
(master no longer holds the HTTP request open for compose work).
Worker state.json gains no lifecycle field -- master DeckyLifecycle is
the source of truth; the master sweep handles crashed-mid-deploy
recovery.
Add decnet.lifecycle package: pure orchestration layer that the
master API will invoke via asyncio.create_task to drive DeckyLifecycle
rows through pending -> running -> succeeded | failed without
holding an HTTP request open.
Strategy classes per (operation, transport):
- LocalDeployStrategy: master-resident, runs engine.deployer.deploy
in a thread.
- SwarmDeployStrategy: shards by host_uuid, dispatches via
AgentClient.deploy; worker drives terminal via heartbeat.
- LocalMutateStrategy: write_compose + compose up.
- SwarmMutateStrategy: AgentClient.mutate; worker drives terminal.
decnet.bus.topics gains decky_lifecycle(name) -> decky.<name>.lifecycle
plus DECKY_LIFECYCLE constant. Payload documented in the wiki
(separate commit). publish_safely keeps bus best-effort.
Nothing is wired to call this yet -- next commits convert worker
/deploy /mutate to 202, then heartbeat delta wiring, then master API.
One row per (decky, operation) attempt. State machine:
pending -> running -> succeeded | failed (+ error text). Rows are
append-only after terminal; retries write a new row.
Sibling of DeckyShard rather than a rework -- DeckyShard tracks
runtime container state observed via heartbeat, this tracks
operation lifecycle. New table, UUID PK.
Adds BaseRepository abstract methods (create_lifecycle,
update_lifecycle, get_lifecycle_by_ids, find_open_lifecycle,
sweep_stale_lifecycle) with SQLModelRepository mixin impl.
Backbone for the upcoming 202-Accepted async API.
- Implement /mutate handler: load_state, update services + last_mutated,
save_state, write_compose, compose up -d via asyncio.to_thread. 404
for missing state / unknown decky_id. dry_run short-circuits before
any side effect.
- Add AgentClient.mutate(decky_id, services, *, dry_run=False) using
_TIMEOUT_DEPLOY (compose up can pull/build, exceeds control timeout).
- mutator/engine.py: in swarm mode with decky.host_uuid set, resolve
worker via _resolve_swarm_host and dispatch through AgentClient.mutate
instead of writing a compose file on master. Master-resident deckies
(unihost mode, or swarm with host_uuid=None) keep the local path.