Campaign clusterer gains a keystroke edge: when two identities'
kd_digraph_simhash centroids are within KD_HAMMING_MAX bits, a graded
weight (1.0 at identical, fading to 0 at the cutoff) feeds the campaign
graph. Supporting tier (0.6) — a typing match plus temporal overlap
reaches threshold, but typing alone never merges (a false-positive guard
against coarse, noisy terminal timing).
Projects the column through IdentityFeatures + from_identity_row.
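A minimal sketch of the graded keystroke weight described above. The linear
fade and the KD_HAMMING_MAX value of 12 are assumptions for illustration; the
message only says the weight is 1.0 at identical centroids and 0 at the cutoff.
How the 0.6 supporting tier combines with other edges is not shown.

```python
KD_HAMMING_MAX = 12  # hypothetical cutoff; the real value is not stated here


def hamming64(a: int, b: int) -> int:
    """Number of differing bits between two 64-bit values."""
    return bin((a ^ b) & 0xFFFFFFFFFFFFFFFF).count("1")


def keystroke_edge_weight(a: bytes, b: bytes, cutoff: int = KD_HAMMING_MAX) -> float:
    """1.0 for identical 8-byte centroids, fading linearly to 0.0 at the cutoff."""
    distance = hamming64(int.from_bytes(a, "big"), int.from_bytes(b, "big"))
    if distance >= cutoff:
        return 0.0  # at or beyond the cutoff: no edge
    return 1.0 - distance / cutoff
```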
The identity clusterer folds an identity's per-session
motor.digraph_simhash observations into one 8-byte bitwise-majority
centroid (denoises per-session jitter) and writes it to
AttackerIdentity.kd_digraph_simhash via update_identity_fingerprints —
the orphaned column is now populated. list_identities_for_clustering
projects it so the campaign clusterer can read it.
Extends the repo abstract + DummyRepo stub/coverage.
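The bitwise-majority fold can be sketched roughly as below. The function name
and the tie-breaking rule (ties resolve to 0) are assumptions, not taken from
the identity clusterer itself.

```python
def majority_centroid(observations: list[bytes]) -> bytes:
    """Fold 8-byte SimHash observations into one bitwise-majority centroid."""
    if not observations:
        raise ValueError("need at least one observation")
    ones = [0] * 64
    for obs in observations:
        if len(obs) != 8:
            raise ValueError("each observation must be exactly 8 bytes")
        value = int.from_bytes(obs, "big")
        for bit in range(64):
            if (value >> bit) & 1:
                ones[bit] += 1
    centroid = 0
    for bit, count in enumerate(ones):
        # Strict majority sets the bit; a tie leaves it 0 (assumed rule).
        if count * 2 > len(observations):
            centroid |= 1 << bit
    return centroid.to_bytes(8, "big")
```

A bit set in only a minority of sessions is dropped, which is how the fold
suppresses per-session jitter.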
Per-session 64-bit SimHash of inter-keystroke digraph flight times:
walk single-char input events, accumulate flight time per (c1,c2),
bucket the median, Charikar-SimHash the bucketed pairs. Locality-
sensitive so the same typist is Hamming-close across sessions; pastes
and think-pauses break the chain; silent below the sample-size floor.
New shared decnet/util/simhash.py (simhash64/hamming64/bytes helpers).
Registered as a conditional Tier-A primitive (count 37->38); requires
behave-shell>=0.1.2.
initialize() now delegates to _apply_schema(): real boots run
'alembic upgrade head' (schema owned by the migration history); tests
(DECNET_TESTING=1) keep create_all, which is faster and needs no upgrade
path. MySQL wraps the upgrade in the existing GET_LOCK advisory lock so
concurrent uvicorn workers don't race on DDL.
Deletes the three _migrate_* crimes (attackers-table legacy drop +
GeoIP backfill, TEXT->MEDIUMTEXT widening) — all now handled by the
baseline migration and the _BIG_TEXT model variants. Drops the test
file that only exercised the deleted helpers; adds tests pinning the
alembic-vs-create_all gate and guarding that every model table is in
the migration head.
Introduce Alembic at v1. Migrations live inside the package
(decnet/web/db/migrations) so they ship with installs; alembic.ini at the
repo root drives the CLI. env.py is async and dual-backend, selecting the
engine from DECNET_DB_TYPE (mirroring db/factory.py) and reusing the app's
own connection when run programmatically.
The baseline captures all 39 tables. _BIG_TEXT round-trips as
Text().with_variant(MEDIUMTEXT, 'mysql'), so both backends get the right
column type from the migration. kd_digraph_simhash gains a sqlite BLOB
variant: BINARY(8) reflects as NUMERIC on SQLite and would otherwise trip
'alembic check' forever.
Live topology edits fired one mutation per canvas action. That coupled
each edit to an immediate enqueue+apply, which (post-serialization)
raced the SSE refetch and duplicated optimistic placeholders, and gave
the user no chance to assemble a coherent changeset (add a net AND
bridge it) before any of it landed.
Live edits now STAGE: each editor primitive records its op and returns
immediately; the optimistic placeholders callers already draw are the
staged preview. The action button reads UPDATE (n) when live (DEPLOY
when pending) and flushes the batch through the slice-1 submit queue —
sequential, version-cursored, each awaited to a terminal state, stopping
loudly on the first failure with the unapplied remainder kept for retry.
REFRESH becomes DISCARD (n) to drop the batch. SSE refetch is paused
during a commit so per-mutation applied events don't wipe still-staged
placeholders mid-batch; one refetch reconciles at the end.
Also fix _dropArchetype, which bailed without an optimistic node on the
staged path, leaving a decky added to an uncommitted LAN invisible until
UPDATE.
Live MazeNET edits fired their mutations fire-and-forget: each canvas
action enqueued immediately and never awaited the result. Two failures
followed from that:
- expected_version is bumped at ENQUEUE (not at apply), so two ops fired
back-to-back raced — the second carried a stale version and 409'd.
Edits only worked when hand-paced (an SSE refetch landed between them).
- A failed mutation degrades the topology, but the only signal was a 4s
toast, so the user saw DEGRADED with no cause.
useTopologyEditor now routes every live op through a serialized submit
queue: one enqueue in flight at a time (submission order preserved), an
optimistic expected_version cursor advanced per enqueue so back-to-back
ops (e.g. reparent's detach+attach) don't need a refetch between them,
and each mutation awaited to a terminal state. A 'failed' row throws
MutationFailedError, which the page pins as a persistent UPDATE FAILED
banner instead of a vanishing toast.
Slice 1 of the live-edit rework; stage+UPDATE-button batching and louder
backend materialisation reporting to follow.
The Fleet UI only showed TEARDOWN for swarm-pinned deckies (POST
/swarm/hosts/{uuid}/teardown). Local deckies had no delete control though
the API now exposes DELETE /deckies/{name}.
teardown() branches on swarm vs local; the card's two-step arm/CONFIRM
button renders for any admin, keyed td:${host_uuid ?? 'local'}:${name}.
The Fleet module had no delete — neither UI nor API — though the engine
capability existed (engine.teardown(decky_id=...), exposed only via
`decnet teardown --id`). Wire it to HTTP.
DELETE /deckies/{name} (admin-gated, 204). Synchronous: a single decky's
compose stop/rm is quick, so it's awaited off-thread rather than the
202+lifecycle path deploy/mutate use for slow builds. The single-decky
teardown never touches the host macvlan interface, so it needs no extra
CAP_NET_ADMIN.
State consistency: engine.teardown removes the containers and the
fleet_deckies row but leaves the decky in decnet-state.json. Left as is, the
reconciler would see "present in JSON, absent from DB" and re-INSERT the row,
resurrecting the decky. So the handler prunes it from both decnet-state.json
and the DB deployment key after teardown; deleting the last decky clears
state entirely (DecnetConfig.deckies has min_length=1).
Route ordering: the dynamic DELETE /deckies/{decky_name} is registered AFTER
the fixed /deckies/* routes (Starlette matches in registration order), so it
no longer shadows DELETE /deckies/files (file-drop).
Tests cover 401/403/404/422, single-delete pruning, and last-decky clear.
These tests had been red since the changes they cover landed. Nobody noticed
because the pre-commit gate runs mypy/ruff/bandit/pip-audit but NOT pytest, so
failing tests don't block commits and quietly accumulate.
- SSE stream/events auth migrated from ?token=<jwt> to a single-use ?ticket=
(commit efb4e49d). Three tests still passed a raw JWT as ?token= and got
401. Updated to mint a ticket via POST /auth/sse-ticket and pass ?ticket=
(attacker events, topology events, /stream).
- The user-creation password policy is min_length=12; the RBAC admin-access
test still used a 10-char password and was rejected. Bumped to a valid one.
pip-audit flagged fixable advisories in the web stack:
- cryptography -> >=48.0.1 (GHSA-537c-gmf6-5ccf)
- python-multipart -> >=0.0.31 (CVE-2026-53538/53539/53540)
- starlette (transitive via fastapi) -> add direct floor >=1.3.1
(CVE-2026-48817/48818/54282/54283)
Venv synced to cryptography 49.0.0, python-multipart 0.0.32, starlette
1.3.1; full tests/api/ suite green against the bump. Also drops the stray
browser-use[core] dev dep (the browser-use skill uses a global CLI; the
package is imported nowhere in DECNET).
Two defects exposed after the deploy-success loop fix (verified live):
1. Duplicated / skipped transcript lines. The placeholder-log interval did
`setLog(prev => [...prev, msgs[i]])` then `i++`. React 18 auto-batches
setInterval updaters, so the updater ran after i had advanced and read the
wrong index — skipping some lines ([NET], [SENSE]) and duplicating others
([TLS]). Fixed by capturing `const line = msgs[i]` before scheduling the
update. A placeholderStartedRef also gates the effect to one run per deploy
(reset in startDeploy) as defense-in-depth against re-render churn.
2. Wizard never closed on success. The completedRef guard combined with the
auto-close effect's cleanup was self-defeating: a re-render inside the
700ms window (e.g. the [OK] terminal-log append) ran the cleanup, clearing
the pending close, and the guard then blocked rescheduling — so onComplete
never fired. The timer now lives in a ref cleared only on unmount, so a
scheduled close always fires exactly once regardless of re-renders.
Adds a regression test that a re-render during the close countdown does not
cancel the close. Verified end-to-end against the live instance: all 8 log
lines render once in order, the wizard auto-closes, and /deckies does not
storm.
After a successful deploy the wizard hammered GET /deckies and stacked
"DEPLOYED" toasts unbounded (~1.4/s, forever). Two compounding causes:
1. The auto-close effect's dep array includes onComplete, which the parent
passes as an inline arrow (new reference every render). onComplete calls
refresh(), re-rendering the parent, producing a new onComplete ref,
re-running the effect, which reschedules onComplete — a feedback loop.
2. DeployWizard stayed permanently mounted (open only toggled a child
overlay), so its hooks kept running with lifecycleDone===true after close.
Fix both: a completedRef guard makes the auto-close fire exactly once per
deploy regardless of effect re-runs (reset in startDeploy), and the parent
now mounts the wizard only while open so closing tears down its hooks and
clears the latent "reopen re-completes stale rows" path.
Lifecycle polling itself was never the runaway — it stops cleanly at
terminal status; the bounded ~25-poll build window is expected.
Adds a regression test asserting onComplete fires once across a simulated
re-render storm.
The web deploy collision-guard read the existing fleet from the DB
State["deployment"] key, while the UI/get_deckies() read decnet-state.json.
A fleet established via CLI/seed lands in neither path the guard consulted,
so existing_deckies was empty, the additive guard ran blind, and the
reconciler tore the running fleet down to the single submitted decky
(BUG-2: silent fleet wipe, HTTP 202, no warning).
Converge both reads on fleet_deckies — the engine-mirrored table written on
every deploy/teardown (CLI and web), which fleet/reconciler.py already
documents as the store the orchestrator, dashboard, and REST API see. Each
row's decky_config column is a full DeckyConfig dump, so it rehydrates
losslessly into the collision-guard input. The handler also commits the
intended fleet to fleet_deckies synchronously so rapid sequential deploys
read a current fleet and the dashboard observes the new shape immediately.
State["deployment"] is retained for now — the mutate handlers and the
mutator engine still coordinate through it; consolidating them is tracked
in development/ADR-001-FLEET-SOURCE-OF-TRUTH.md (open question 7).
Tests seed fleet_deckies directly (also modelling the CLI-seeded scenario)
rather than chaining real deploys through the skipped contract-test path.
Follow-up to V9.1.4 (which covered only the syslog forwarder/listener): set
ctx.minimum_version = TLSVersion.TLSv1_2 on the remaining DECNET-owned mTLS
client contexts — AgentClient (_build_client + _fetch_peer_fingerprint),
UpdaterClient (_build_client + _fetch_peer_fingerprint), and the updater
executor's worker context. Pure hardening, no behavior change for TLS1.2+
peers (confirmed by the existing mTLS round-trip suites).
Deliberately EXCLUDED — hardening these would be counterproductive:
- templates/https/server.py, templates/rdp/server.py: honeypot listeners,
where looking weak/old is part of the deception.
- prober/tlscert.py: outbound TLS fingerprinting prober, which must speak
whatever the attacker's target offers.
Added a floor-assertion test (spies httpx.AsyncClient to capture the real
verify= context).
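The floor itself is a one-line setting on the ssl context. A minimal sketch,
with a hypothetical builder name standing in for the _build_client helpers
(client certificate loading is left out):

```python
import ssl


def build_client_context() -> ssl.SSLContext:
    """Client-side context that refuses anything older than TLS 1.2."""
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```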
The V3.1.1 backend change moved SSE auth off ?token=<JWT> onto a single-use
?ticket=, but the dashboard was never updated, so every live stream 401'd
('Could not validate credentials'). Add mintSseTicket() (POST /auth/sse-ticket
with the Bearer JWT, returns an opaque 60s single-use ticket) and refactor all
stream consumers to mint a fresh ticket at the top of each connect() — initial
and every reconnect — then open EventSource with ?ticket=. A reused single-use
ticket would 401-loop, so re-mint-per-connect is required.
Covers Dashboard /stream, LiveLogs, and the attacker/identity/campaign/
orchestrator/topology hooks. connect() is now async with an unmount guard
(cancelled flag checked after the await, before opening the stream); on a mint
401 the connect is skipped and the axios logout interceptor takes over.
The change-password form let the browser submit short passwords the API
then rejected with an opaque 'Schema structural violation' 400. Add a pure
validateNewPassword() util (>=12 chars, <=72 bytes, >=3 of 4 character
classes — constants tweakable) and a live ✓/✗ checklist above the submit
button so the user sees exactly what's missing. Submit is gated on
validity + confirm-match, so the form can no longer reach that 400.
- Fix minLength 8->12 on the Login change-password inputs and the UsersTab
admin-reset guard (both lagged the API's min_length=12).
- Light-mode: render the checklist box fully white with black text (the
neon-on-dark styling read as muddy grey); ✓/✗ icons keep a green/red cue.
- Advisory UX only — the API min_length=12 remains the enforcement boundary;
character-class complexity is not server-enforced.
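The checklist rules, sketched in Python for reference (the real util is a
frontend function). The four classes are assumed to be lowercase, uppercase,
digit and symbol; the message does not name them.

```python
MIN_CHARS = 12
MAX_BYTES = 72
MIN_CLASSES = 3


def validate_new_password(password: str) -> dict[str, bool]:
    """Return one pass/fail flag per checklist row."""
    classes = [
        any(c.islower() for c in password),
        any(c.isupper() for c in password),
        any(c.isdigit() for c in password),
        any(not c.isalnum() for c in password),  # "symbol" (assumed definition)
    ]
    return {
        "min_chars": len(password) >= MIN_CHARS,
        "max_bytes": len(password.encode("utf-8")) <= MAX_BYTES,
        "classes": sum(classes) >= MIN_CLASSES,
    }
```

The byte limit is measured on the UTF-8 encoding, so a password with
multi-byte characters can pass the character count and still fail the byte cap.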
Turn on mypy warn_return_any (pyproject) and resolve the 84 resulting
[no-any-return] errors across 43 files with typing.cast() at the return
sites — runtime no-ops that make the declared return type explicit where a
dependency (SQLAlchemy scalar/first/one, httpx .json(), subprocess, docker
SDK) hands back Any. No behavior change: no DTO/table field types altered, no
validation/coercion calls added, every cast reflects the true runtime type.
Locks in return-type strictness so the class of bug where a function silently
widens to Any can't regress. mypy decnet/ clean; adversarially verified
behavior-preserving (84 casts 1:1 with prior returns).
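An illustrative instance of the pattern, using json.loads (typed as returning
Any) rather than one of the actual call sites:

```python
import json
from typing import Any, cast


def load_settings(raw: str) -> dict[str, Any]:
    # Without the cast, warn_return_any reports [no-any-return] here because
    # json.loads() is annotated as returning Any.
    return cast(dict[str, Any], json.loads(raw))
```

cast() performs no check or conversion at runtime; it only tells the type
checker what the value already is.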
Bump tornado 6.5.5 -> 6.5.7 (CVE-2026-49854, transitive via snakeviz).
- V7.1.3: env known-insecure-default error no longer echoes the rejected secret value.
- V9.1.4: syslog-over-TLS forwarder + listener pin minimum_version=TLSv1_2.
- V12.1.2: updater tarball SHA-256 verification is now mandatory and fail-closed —
/update and /update-self reject a missing digest (400), the executor rejects
missing/mismatched digests before extract/apply. Every push path supplies it.
- V13.1.4: reject a wildcard '*' in DECNET_CORS_ORIGINS at startup.
- V13.1.5: enforce application/json on JSON write endpoints (415 otherwise),
exempting multipart upload routes.
- BUG-17: SSE error log records the user uuid, not the resume cursor.
Also completes V2.1.7 consistently: the attacker-injectable PYTEST* env bypass is
replaced with explicit DECNET_TESTING=1 in the three remaining sites
(env.validate_public_binding, config logging, mysql url builder).
Tests added for every fix; unanimous adversarial review (no update-outage risk —
all push paths verified to send the digest).
Auth (V2.1.1/V3.1.2, V2.1.3, V3.1.1):
- Pin JWT iss/aud/typ at mint and require+verify them at decode; revocation
(jti denylist + tokens_valid_from) still enforced.
- Change-password now requires min_length=12.
- SSE auth moves off JWT-in-URL to a single-use 60s opaque ticket
(POST /auth/sse-ticket); raw JWT in query no longer authenticates a stream.
Removed dead fail-open get_stream_user helper.
Egress (V5.1.1, V9.1.1/V14.1.3):
- Webhook delivery + CRUD reject SSRF destinations (private/loopback/link-local/
metadata, IPv4-mapped, multi-A-record) via resolved-IP validation, pin to the
vetted IP, and never auto-follow redirects. Opt-out via DECNET_WEBHOOK_ALLOW_PRIVATE.
- UpdaterClient pins the worker leaf cert SHA-256 against the stored per-host
fingerprint (fail closed on missing/mismatch); DECNET_VERIFY_HOSTNAME now
defaults True.
Hardening (V13.1.3, V4.1.4, V13.1.2):
- Rate-limit change-password (5/min), enroll-bundle (10/min), webhook-create
(20/min), host-delete (20/min) via the existing slowapi limiter.
- Correct false 'global auth middleware' comment; document enroll-bundle proxy
trust.
Correctness (BUG-7..11):
- BUG-7 unbound bus in finally; BUG-8 apply_ceiling clamps to min(base,ceiling);
BUG-9 commit before emit; BUG-10 multi-actor rearm for sub-threshold identities;
BUG-11 normalize naive timestamps to UTC.
Already-closed (no change): V14.1.1, V2.1.2/V3.1.3, V5.1.2. Tests added for
every fix; unanimous adversarial review.
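A sketch of the resolved-IP validation from the Egress item above, using only
the stdlib ipaddress module. Function names are hypothetical, DNS resolution
and connection pinning are out of scope, and the exact set of rejected ranges
in DECNET may differ.

```python
import ipaddress


def is_forbidden_destination(ip_text: str) -> bool:
    """True for private, loopback, link-local (incl. metadata) and similar ranges."""
    ip = ipaddress.ip_address(ip_text)
    # Unwrap ::ffff:a.b.c.d so an IPv4-mapped address cannot dodge the checks.
    if isinstance(ip, ipaddress.IPv6Address) and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def vet_resolved_ips(resolved: list[str]) -> str:
    """Reject if ANY record is forbidden; return the IP to pin the request to."""
    if not resolved:
        raise ValueError("hostname did not resolve")
    for ip_text in resolved:
        if is_forbidden_destination(ip_text):
            raise ValueError(f"forbidden webhook destination: {ip_text}")
    return resolved[0]
```

Checking every record, not just the first, is what covers the multi-A-record
case: one public answer must not smuggle a private one past the filter.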
- V7.1.1: /swarm/check no longer returns raw exception text; logs detail
server-side, returns generic 'probe failed'.
- BUG-1: register EditAction -> SSHDriver so edit ticks no longer crash.
- BUG-2: topology reconcile matches generator-named deckies by
expected-name membership instead of a hyphen heuristic.
- BUG-3: intel provider lookups acquire the per-provider semaphore so
declared concurrency bounds are enforced.
- BUG-4: RuleIndex.install evicts a rule from kinds it no longer applies to.
- BUG-5: UnixSocketBus.connect() is lock-guarded with a double-check so
concurrent first-connects open exactly one socket and reader task.
- BUG-6/V5.1.3: multi-token JSON-field search binds each token to a
distinct parameter instead of collapsing to the last value.
Regression tests added for every fix, verified red-before/green-after.
V4.1.1c/V12.1.1 (updater master-CN gate) and V12.5.1 (tarball include-list)
confirmed already fixed in prior commits and left untouched.
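The BUG-6/V5.1.3 fix comes down to giving each search token its own bind
parameter. A self-contained sketch against sqlite3; the table, column and
helper names are hypothetical, and AND semantics between tokens is an
assumption.

```python
import sqlite3


def build_token_filter(column: str, tokens: list[str]) -> tuple[str, dict[str, str]]:
    """One LIKE clause and one distinct named parameter per token."""
    clauses: list[str] = []
    params: dict[str, str] = {}
    for index, token in enumerate(tokens):
        key = f"tok_{index}"  # distinct name, so later tokens cannot overwrite earlier ones
        clauses.append(f"{column} LIKE :{key}")
        params[key] = f"%{token}%"
    return " AND ".join(clauses) or "1=1", params


def search(conn: sqlite3.Connection, tokens: list[str]) -> int:
    # The column name is a trusted constant here; only token values are bound.
    where, params = build_token_filter("fields", tokens)
    return conn.execute(f"SELECT COUNT(*) FROM logs WHERE {where}", params).fetchone()[0]
```

Reusing a single parameter name for every clause is the failure mode the fix
removes: all clauses would then compare against whichever token was bound last.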
Gate all 8 swarm-controller operator routes (enroll, list/get/decommission
hosts, deploy, teardown, check, list deckies) with the centralized
require_admin RBAC dependency alongside require_operator_cert; mTLS becomes
defense-in-depth instead of the only gate. /heartbeat stays cert-fingerprint
pinned (worker-facing) and /swarm/health stays open (liveness only).
CLI swarm commands now send Authorization: Bearer $DECNET_API_TOKEN with a
401/403 hint covering the must_change_password bootstrap flow.
Bump pyjwt to 2.13.0 and pip to 26.1.2 (pip-audit PYSEC-2026-175/177/178/179,
PYSEC-2026-196); authz suite re-verified on the new pyjwt.
Closes ASVS_L2_AUDIT.md V4.1.1a and V4.1.1b (CRITICAL).
Replace the hardcoded 1440-minute (24h) JWT lifetime with
DECNET_JWT_EXP_MINUTES (validated positive int, default 240 = 4h).
Shrinks the passive window of a stolen token; active revocation is
unchanged (takes effect immediately, or within 10s at most).
A stolen JWT used to survive a password reset for its full 24h. Now every
session-invalidating change moves the user's tokens_valid_from cutoff to
'now', so all of that user's prior tokens 401 on next use:
- self change-password, admin reset-password, role change all bump the
cutoff (delete needs no bump: the row is gone, so the user lookup 401s).
- Cutoff is compared against the token's iat floored to whole seconds, so a
re-login in the same second as the change isn't caught by its own
revocation (the cost is a <=1s grey zone on same-second-old tokens).
- Per-user: changing one user never revokes another.
POST /auth/logout adds the caller's jti to the denylist and drops the
local negative-cache entry, so the token 401s on its very next use.
Single-session semantics: only this token dies, other sessions for the
same user keep working. Reachable for must_change_password users (it
runs the revocation checks but skips the must_change gate via
get_token_claims) so a session can always be ended; an already-revoked
token is rejected.
Stateless JWTs had no revocation path: a stolen token stayed valid for
its full 24h even after the victim changed their password, and there was
no logout. This lays the foundation for revoking them.
- User.tokens_valid_from: per-user bulk-revocation cutoff (compared against
the token's iat). RevokedToken(jti PK, exp): single-token denylist, pruned
opportunistically on insert so it never outgrows live-but-revoked tokens.
- login() now mints a jti; create_access_token already stamps iat/exp.
- repo.revoke_token / is_token_revoked / set_tokens_valid_from (abstract +
shared sqlmodel impl + DummyRepo coverage stubs).
- Centralized validate path in dependencies.py: every auth dependency now
resolves the user and fails closed on (1) missing jti (legacy/pre-deploy
token -> one forced re-login), (2) iat before the cutoff, (3) a denylisted
jti. Denylist lookups ride a 10s membership cache mirroring the user cache.
- Contract/fuzz harness seeds its fixed-uuid principal under
DECNET_CONTRACT_TEST so its minted token resolves to a live admin user.
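The three fail-closed checks and the prune-on-insert denylist can be sketched
with in-memory stand-ins. Class and function names are hypothetical, the cutoff
is passed as whole epoch seconds for simplicity, and the 10s membership cache
is not modelled.

```python
from typing import Optional


class TokenRejected(Exception):
    """Raised for any token that must 401."""


class RevokedTokens:
    """Single-token denylist keyed by jti, pruned opportunistically on insert."""

    def __init__(self) -> None:
        self._exp_by_jti: dict[str, int] = {}

    def revoke(self, jti: str, exp: int, now: int) -> None:
        # Expired tokens fail signature validation anyway, so their rows can go.
        self._exp_by_jti = {j: e for j, e in self._exp_by_jti.items() if e > now}
        self._exp_by_jti[jti] = exp

    def is_revoked(self, jti: str) -> bool:
        return jti in self._exp_by_jti

    def __len__(self) -> int:
        return len(self._exp_by_jti)


def validate_claims(claims: dict, tokens_valid_from: Optional[int], denylist: RevokedTokens) -> None:
    jti = claims.get("jti")
    if not jti:
        raise TokenRejected("missing jti (legacy token, re-login required)")
    if tokens_valid_from is not None and int(claims["iat"]) < tokens_valid_from:
        raise TokenRejected("issued before the user's revocation cutoff")
    if denylist.is_revoked(jti):
        raise TokenRejected("token was explicitly revoked")
```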
DECNET_ADMIN_PASSWORD defaulted to the literal "admin" with no guard, so
a master that never set it seeded an admin/admin account. Resolve it
lazily via __getattr__ -> _require_env (the same pattern as
DECNET_JWT_SECRET): unset or a known-bad default (admin/secret/...) is
rejected, and <12 chars is rejected outside DECNET_DEVELOPER. Only the
master web/api processes that import the DB layer resolve it; workers
never do, and the pytest short-circuit keeps the dev loop unaffected.
The module attribute stays addressable for the admin-seed monkeypatch.
tar_working_tree walked the whole working tree minus a blocklist that
omitted .env.local, *.key, *.pem, *.crt — so the JWT secret, Fernet key,
admin password, DB creds and TLS private keys fanned out to every worker
on each update push.
Invert to an allowlist (DEFAULT_INCLUDES = pyproject.toml + LICENSE +
README.md + decnet/), the exact surface 'pip install .' needs; decnet/
carries its own package-data. A defensive _HYGIENE_PATTERNS layer drops
secret-/churn-shaped files even if nested under decnet/. extra_excludes
can still narrow but can no longer widen past the allowlist.
Verified against the live repo: the bundle carries the package + metadata
and zero secret/db/log/pyc files, and pip-installs clean from the
extracted tree.
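The allowlist-plus-hygiene decision might look roughly like this.
DEFAULT_INCLUDES mirrors the list above; the contents of _HYGIENE_PATTERNS and
the should_bundle helper are illustrative guesses, not the real implementation.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

DEFAULT_INCLUDES = ("pyproject.toml", "LICENSE", "README.md", "decnet/")
# Illustrative patterns only; the real list is not given in this message.
_HYGIENE_PATTERNS = (".env*", "*.key", "*.pem", "*.crt", "*.pyc", "*.log", "*.db")


def should_bundle(rel_path: str) -> bool:
    """Allowlist first, then drop secret- or churn-shaped files even under decnet/."""
    allowed = any(
        rel_path == inc or (inc.endswith("/") and rel_path.startswith(inc))
        for inc in DEFAULT_INCLUDES
    )
    if not allowed:
        return False
    name = PurePosixPath(rel_path).name
    return not any(fnmatchcase(name, pattern) for pattern in _HYGIENE_PATTERNS)
```

Because the allowlist is checked first, a new secret file dropped at the repo
root is excluded by default rather than only when someone remembers to
blocklist it.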
The worker-side updater extracted + pip-installed + re-exec'd any tarball
from any caller holding a CA-signed cert; the documented updater@* CN
gating was never implemented. Now:
- require_master_cert gates /update, /update-self, /rollback, /releases:
the client cert CN must be decnet-master (the identity UpdaterClient
presents). A worker/agent cert can no longer push code to a peer.
- sha256 is mandatory on /update and /update-self (400 otherwise), so the
integrity check always runs before extract/install. UpdaterClient
already sends it; this just hardens the contract.
The transport peer-identity primitives move to decnet/web/_mtls.py (a
light namespace module) so the minimal updater reuses them without
importing the API router tree; router/swarm/_mtls.py re-exports them and
keeps the operator gate. Closes the updater-RCE critical.
The swarm controller (port 8770) exposed 9 routes with zero app-layer
auth, and swarmctl --tls defaulted off — anyone able to reach the port
could enroll workers (minting CA-signed certs + private keys), deploy,
or tear down the fleet. Two fail-closed layers:
- require_operator_cert gates every operator route (enroll/deploy/
teardown/hosts/check/deckies). When mTLS is on, the peer cert's CN
must be an operator identity (decnet-master/swarmctl); worker and
updater@* certs are rejected. Plaintext loopback (single-host master)
is accepted as the local operator — the docker.sock boundary.
- swarmctl refuses to bind a routable interface without --tls, so a
network-exposed plaintext control plane can never start.
/heartbeat keeps its worker fingerprint pinning. Closes the two ASVS
criticals (control-plane no-auth, unauthenticated cert minting).
Extract peer-cert extraction from the heartbeat endpoint into
decnet/web/router/swarm/_mtls.py, adding CN parsing alongside the
SHA-256 fingerprint and a require_operator_cert dependency (CN in
{decnet-master, swarmctl}). api_heartbeat delegates to it; behaviour
unchanged. Prerequisite for control-plane and updater authz.
Rewrites the architecture section for the full current module tree and adds
new sections for the REST API, swarm/agent mode, service bus, attacker
intelligence stack (profiler, clustering, correlation, GeoIP/ASN),
MazeNET topology, canary tokens, and TTP tagging/export. Updates the CLI
reference table, test count (478 → 5050), and Python version constraints.
Replaces LICENSE (GPLv3 -> AGPLv3) and prepends
`SPDX-License-Identifier: AGPL-3.0-or-later` to every source file
across decnet/, decnet_web/, tests/, scripts/, and tools/.
Rationale: closes the GPLv3 ASP loophole so any party operating a
modified DECNET as a network service must offer their modified
source. Personal copyright (Samuel Paschuan) + inbound=outbound
contributions make a future unilateral relicense infeasible.
- LICENSE: full AGPL-3.0 text (gnu.org/licenses/agpl-3.0.txt)
- COPYRIGHT: project copyright notice
- tools/add_spdx_headers.py: idempotent header injector
(shebang- and PEP 263-aware)
Touches 1565 source files (.py, .ts, .tsx, .js, .jsx, .css, .sh).
No behavior change; comments only.
Every compose invocation used -p decnet so fleet + every topology
lived in one docker compose project. --remove-orphans, run during
fleet pre-up cleanup and on every topology teardown / rollback, then
swept every container in the project not listed in the current compose
file — wiping sibling topologies and the flat fleet along with the
intended target.
Parameterize project on _compose / _compose_with_retry / _compose_ps
(default FLEET_COMPOSE_PROJECT="decnet"). Add _topology_compose_project
that returns decnet-topo-<id8>, and pass it through every topology
compose call site (master deploy_topology + rollback + post-deploy ps,
master teardown_topology, agent apply, agent teardown, all four live
service mutations on topology deckies). Fleet calls keep the default
and are unaffected.
Migration: live containers from before this fix remain in the shared
"decnet" project and need a one-time manual cleanup before they're
reachable by the new topology code paths.
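A sketch of the per-topology project naming. Deriving <id8> from the first
eight hex characters of the topology id is an assumption, and compose_argv is a
hypothetical stand-in for the _compose helpers.

```python
FLEET_COMPOSE_PROJECT = "decnet"


def topology_compose_project(topology_id: str) -> str:
    """decnet-topo-<id8>; <id8> taken from the id's first 8 hex chars (assumed)."""
    id8 = topology_id.replace("-", "")[:8].lower()
    return f"decnet-topo-{id8}"


def compose_argv(args: list[str], project: str = FLEET_COMPOSE_PROJECT) -> list[str]:
    """Build the docker compose command line for a given project."""
    return ["docker", "compose", "-p", project, *args]
```

With distinct projects, `--remove-orphans` on one topology can only see that
topology's containers.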
The wizard POSTs only the new decky on each submit. The handler used to
treat every INI as the complete desired fleet (config.deckies = INI) so
the reconciler tore down prior deckies as orphans — deploying a second
Windows workstation silently wiped the first.
Add replace_fleet to DeployIniRequest (default false). Default path
merges new deckies into existing config and rejects name/IP collisions
with 409. replace_fleet=true preserves set-desired-state semantics for
CLI / declarative callers. Lifecycle rows are created only for the
deckies submitted in the current call, so /deckies/lifecycle?ids=...
reflects exactly what this submit deployed.
build_deckies_from_ini gains reserved_ips so additive auto-allocation
skips IPs already held by the existing fleet.
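The additive-versus-replace behaviour, sketched with plain dicts. The exception
maps to the 409 described above; the function and field names are hypothetical.

```python
class FleetCollision(Exception):
    """Maps to HTTP 409 in the handler."""


def merge_fleet(existing: list[dict], submitted: list[dict], replace_fleet: bool = False) -> list[dict]:
    if replace_fleet:
        # Set-desired-state semantics: the submission IS the fleet.
        return list(submitted)
    names = {d["name"] for d in existing}
    ips = {d["ip"] for d in existing}
    for decky in submitted:
        if decky["name"] in names:
            raise FleetCollision(f"name already in fleet: {decky['name']}")
        if decky["ip"] in ips:
            raise FleetCollision(f"IP already in fleet: {decky['ip']}")
        names.add(decky["name"])
        ips.add(decky["ip"])
    return [*existing, *submitted]
```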
- New useLifecyclePolling(ids, intervalMs) hook: polls
GET /deckies/lifecycle?ids=... every 2s until every row is terminal,
surfaces transient HTTP failures without giving up.
- DeployWizard: drops the 180s axios timeout and the fake-log-driven
deployOk flag. After POST 202, sets lifecycle_ids -> the hook drives
the per-decky pill grid (PENDING / RUNNING / SUCCEEDED / FAILED).
Real terminal lines stream into the log as rows resolve. Auto-close
on all-success after 700ms.
- DeckyFleet.css: .lifecycle-grid + .lifecycle-pill in the existing
fleet vocabulary; running pill pulses, failed pill borders alert.
- Existing 4 wizard render tests still pass; 4 new hook tests cover
empty ids / single-success / polling-until-terminal / HTTP error.
GET /deckies/lifecycle?ids=<uuid>&ids=<uuid> returns the matching
DeckyLifecycle rows so the wizard can poll instead of holding an HTTP
request open across compose work. require_viewer gating -- read-only.
Startup sweep: on master boot, any pending/running row with
started_at older than 1h flips to failed with
error='master restarted during operation'. Pre-v1 substitute for a
durable task queue: if the master crashes mid-deploy, the wizard sees
FAILED on refresh and the operator retries. Idempotent + cheap; runs
unconditionally including in contract-test mode.
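The startup sweep reduces to one pass over open rows. A sketch over in-memory
dicts; the real version would be a repository query, and the function name is
hypothetical.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=1)
OPEN_STATES = ("pending", "running")


def sweep_stale(rows: list[dict], now: datetime) -> int:
    """Flip open rows older than the threshold to failed; return how many changed."""
    swept = 0
    for row in rows:
        if row["status"] in OPEN_STATES and now - row["started_at"] > STALE_AFTER:
            row["status"] = "failed"
            row["error"] = "master restarted during operation"
            swept += 1
    return swept
```

Running it twice changes nothing the second time, since swept rows are no
longer open.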
This change unblocks the wizard hang. Both endpoints used to run
docker compose synchronously inside the HTTP handler -- on master
(unihost) or via asyncio.gather of worker /deploy POSTs at 600s
timeout each (swarm) -- blocking every other API request.
New flow:
1. Commit the new config shape to repo state (fast).
2. Create one DeckyLifecycle row per decky (status=pending).
3. Spawn asyncio.create_task(run_deploy / run_mutate) -- the
lifecycle runner drives rows through running -> succeeded|failed
and emits decky.<name>.lifecycle on the bus.
4. Return 202 with {lifecycle_ids: [...]}. Wizard polls
GET /deckies/lifecycle?ids=... (next commit).
mutator/engine.py gains pick_new_services() -- shared between the
async API path and the watch-loop's synchronous mutate_decky().
DeployResponse grows lifecycle_ids[]. The old dispatch_decnet_config
helper still exists for the CLI swarm-deploy command path; it just
isn't called from the API handler anymore.
Test changes: 200 -> 202, drop dispatch_decnet_config mocks (handler
no longer calls it), assert lifecycle_ids in response + committed
state matches expectations.
HeartbeatRequest grows an optional lifecycle field carrying per-decky
completion records from the worker:
[{decky_name, operation, status, error?, completed_at?}]
For each delta, the master finds the most-recently-started open
DeckyLifecycle row for (decky_name, operation, host_uuid) and flips
it to terminal with the worker's error text + timestamp. Stale
duplicates (row already sealed or never existed) are logged and
dropped -- not errors.
Each successful pivot also emits decky.<name>.lifecycle on the bus
so the dashboard sees the transition without waiting for its next
poll tick.
This is the master-side completion channel for the worker's 202
fire-and-forget /deploy and /mutate.
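Pivoting one delta onto its lifecycle row, sketched over a list of dicts. The
function name is hypothetical; the real lookup goes through the repository.

```python
OPEN_STATES = ("pending", "running")


def apply_lifecycle_delta(rows: list[dict], delta: dict, host_uuid: str) -> bool:
    """Seal the most recently started open row matching the delta.

    Returns False for a stale duplicate (no open row), which the caller logs
    and drops rather than treating as an error.
    """
    candidates = [
        row for row in rows
        if row["decky_name"] == delta["decky_name"]
        and row["operation"] == delta["operation"]
        and row["host_uuid"] == host_uuid
        and row["status"] in OPEN_STATES
    ]
    if not candidates:
        return False
    row = max(candidates, key=lambda r: r["started_at"])
    row["status"] = delta["status"]
    row["error"] = delta.get("error")
    row["completed_at"] = delta.get("completed_at")
    return True
```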
The wizard API used to hang because /deckies/deploy ran docker compose
build && up -d synchronously, holding the request thread for minutes.
The worker side of that pipeline now returns 202 Accepted immediately
and runs the deploy in an asyncio.create_task.
On task completion (success or failure) the worker pushes a one-off
heartbeat carrying a lifecycle delta per decky:
{decky_name, operation, status: succeeded|failed, error?, completed_at}
Master pivots these onto open DeckyLifecycle rows in the heartbeat
handler (next commit). The scheduled 30s heartbeat tick is the
fallback if the immediate push drops.
- decnet/agent/app.py: /deploy and /mutate return 202; dry_run mutate
still validates synchronously and returns 200.
- decnet/agent/executor.py: deploy_async + mutate_async wrap the work
and push the completion delta.
- decnet/agent/heartbeat.py: push_lifecycle_delta() helper builds a
one-off body and POSTs with the same mTLS context as the loop.
- decnet/swarm/client.py: revert deploy/mutate to control timeout
(master no longer holds the HTTP request open for compose work).
Worker state.json gains no lifecycle field -- master DeckyLifecycle is
the source of truth; the master sweep handles crashed-mid-deploy
recovery.
Add decnet.lifecycle package: pure orchestration layer that the
master API will invoke via asyncio.create_task to drive DeckyLifecycle
rows through pending -> running -> succeeded | failed without
holding an HTTP request open.
Strategy classes per (operation, transport):
- LocalDeployStrategy: master-resident, runs engine.deployer.deploy
in a thread.
- SwarmDeployStrategy: shards by host_uuid, dispatches via
AgentClient.deploy; worker drives terminal via heartbeat.
- LocalMutateStrategy: write_compose + compose up.
- SwarmMutateStrategy: AgentClient.mutate; worker drives terminal.
decnet.bus.topics gains decky_lifecycle(name) -> decky.<name>.lifecycle
plus DECKY_LIFECYCLE constant. Payload documented in the wiki
(separate commit). publish_safely keeps bus best-effort.
Nothing is wired to call this yet -- next commits convert worker
/deploy /mutate to 202, then heartbeat delta wiring, then master API.
One row per (decky, operation) attempt. State machine:
pending -> running -> succeeded | failed (+ error text). Rows are
append-only after terminal; retries write a new row.
Sibling of DeckyShard rather than a rework -- DeckyShard tracks
runtime container state observed via heartbeat, this tracks
operation lifecycle. New table, UUID PK.
Adds BaseRepository abstract methods (create_lifecycle,
update_lifecycle, get_lifecycle_by_ids, find_open_lifecycle,
sweep_stale_lifecycle) with SQLModelRepository mixin impl.
Backbone for the upcoming 202-Accepted async API.
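
The state machine above can be sketched as a transition table. LifecycleRow
and retry() are hypothetical illustrations, not the SQLModel table or the
repository methods named in this commit.

```python
from dataclasses import dataclass, field
from typing import Optional
from uuid import uuid4

# Allowed transitions as described above; terminal states have no exits.
TRANSITIONS = {
    "pending": {"running"},
    "running": {"succeeded", "failed"},
    "succeeded": set(),
    "failed": set(),
}


@dataclass
class LifecycleRow:  # hypothetical stand-in for a DeckyLifecycle row
    decky_name: str
    operation: str
    status: str = "pending"
    error: Optional[str] = None
    id: str = field(default_factory=lambda: str(uuid4()))

    def advance(self, new_status: str, error: Optional[str] = None) -> None:
        """Move to new_status, rejecting anything the table does not allow."""
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        self.status = new_status
        self.error = error


def retry(row: LifecycleRow) -> LifecycleRow:
    """A retry never reopens a terminal row; it writes a fresh pending one."""
    return LifecycleRow(row.decky_name, row.operation)
```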
- Implement /mutate handler: load_state, update services + last_mutated,
save_state, write_compose, compose up -d via asyncio.to_thread. 404
for missing state / unknown decky_id. dry_run short-circuits before
any side effect.
- Add AgentClient.mutate(decky_id, services, *, dry_run=False) using
_TIMEOUT_DEPLOY (compose up can pull/build, exceeds control timeout).
- mutator/engine.py: in swarm mode with decky.host_uuid set, resolve
worker via _resolve_swarm_host and dispatch through AgentClient.mutate
instead of writing a compose file on master. Master-resident deckies
(unihost mode, or swarm with host_uuid=None) keep the local path.
Adds state_path ServiceConfigField and passes DNS_STATE_PATH into the
container environment. Operator must mount the parent directory on a
volume for persistence to survive container recreation.
Switch burst deque from monotonic() to time.time() (wall-clock, serializable).
Add DNS_STATE_PATH env var: on startup _load_state() reads {src:[ts,...]} JSON
and prunes entries older than the burst window. _flush_state() writes then renames
atomically; the _state_flusher() coroutine flushes every 5s when dirty. Detection of
the 5th event also triggers an immediate flush. No-op when DNS_STATE_PATH is
unset, so the default deployment is unchanged.
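
A sketch of the load/prune and write-then-rename steps, under assumptions:
the 60s BURST_WINDOW and the function names are illustrative, and the temp
file is assumed to live next to the target so the rename stays on one
filesystem.

```python
import json
import os
import tempfile
import time

BURST_WINDOW = 60.0  # seconds; assumed value for illustration


def flush_state(path, state):
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(state, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_state(path, now=None):
    """Read {src: [ts, ...]} and drop timestamps older than the burst window."""
    now = time.time() if now is None else now
    try:
        with open(path) as fh:
            raw = json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    pruned = {}
    for src, stamps in raw.items():
        fresh = [ts for ts in stamps if now - ts <= BURST_WINDOW]
        if fresh:
            pruned[src] = fresh
    return pruned
```

A reader of the state file sees either the old contents or the new ones,
never a half-written file, which is the point of renaming instead of
writing in place.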
Rename _txt_times -> _tunnel_times. Add TYPE_CNAME=5, TYPE_NULL=10,
TYPE_PRIVATE=65399 constants. Guard burst counter with _TUNNEL_QTYPES
frozenset instead of TYPE_TXT only. Mixed-type queries from one source
now share a single burst window, closing iodine NULL/CNAME downlink
and AAAA-encoded uplink evasion gaps.
_is_tunneling now returns str|None (the detection method) instead of bool.
Two new tunables _QNAME_TOTAL_LEN_THRESHOLD=50 and _QNAME_ENTROPY_THRESHOLD=3.5
catch attackers who split a high-entropy payload across multiple short labels.
tunnel_method field added to tunneling_suspect events for downstream correlation.
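
A sketch of how a split-payload check might work. The two thresholds come
from the message above; the per-label cutoff of 40 and the method strings
"long_label" / "split_payload" are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Optional

LABEL_LEN_THRESHOLD = 40          # assumed per-label cutoff, not from the source
QNAME_TOTAL_LEN_THRESHOLD = 50    # from the commit message
QNAME_ENTROPY_THRESHOLD = 3.5     # from the commit message


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(text).values())


def tunnel_method(qname: str) -> Optional[str]:
    """Return a detection method name, or None when nothing looks suspicious."""
    labels = [label for label in qname.rstrip(".").split(".") if label]
    for label in labels:
        if (len(label) >= LABEL_LEN_THRESHOLD
                and shannon_entropy(label) >= QNAME_ENTROPY_THRESHOLD):
            return "long_label"
    joined = "".join(labels)
    if (len(joined) >= QNAME_TOTAL_LEN_THRESHOLD
            and shannon_entropy(joined) >= QNAME_ENTROPY_THRESHOLD):
        return "split_payload"
    return None
```

Returning the method name rather than a bool lets the caller copy it
straight into the event's tunnel_method field.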
_parse_edns_size only extracted the requestor UDP size; every other field in
the OPT record (DO bit, EDNS version, extended RCODE, all sub-options) was
invisible. Replaced with _parse_opt_record returning a full dict:
udp_size, ext_rcode, version, do_bit, z, options[(code, len, data)]
NSID request (option code 3) is now detected as fingerprint_probe with
probe=edns_nsid and contributes to recon_burst. DO bit, COOKIE (10), and
other options are not escalated; udp_size continues to drive amp_probe.
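
For reference, a sketch of parsing an OPT pseudo-RR per the RFC 6891 wire
layout. It assumes the caller has already located the record and passes
bytes starting at the root owner name; the function name mirrors the one
above but the body is illustrative, not the committed implementation.

```python
import struct

OPT_TYPE = 41
OPT_NSID = 3


def parse_opt_record(rr: bytes) -> dict:
    """Parse one OPT RR starting at its (root) owner name."""
    if len(rr) < 11 or rr[0] != 0:
        raise ValueError("OPT owner name must be root")
    # TYPE, CLASS (= UDP size), then the TTL split into ext-rcode/version/flags.
    rtype, udp_size, ext_rcode, version, flags, rdlen = struct.unpack(
        "!HHBBHH", rr[1:11])
    if rtype != OPT_TYPE:
        raise ValueError("not an OPT record")
    rdata = rr[11:11 + rdlen]
    if len(rdata) != rdlen:
        raise ValueError("truncated RDATA")
    options, pos = [], 0
    while pos < rdlen:
        if pos + 4 > rdlen:
            raise ValueError("truncated option header")
        code, length = struct.unpack("!HH", rdata[pos:pos + 4])
        data = rdata[pos + 4:pos + 4 + length]
        if len(data) != length:
            raise ValueError("truncated option data")
        options.append((code, length, data))
        pos += 4 + length
    return {
        "udp_size": udp_size,
        "ext_rcode": ext_rcode,
        "version": version,
        "do_bit": bool(flags & 0x8000),
        "z": flags & 0x7FFF,
        "options": options,
    }
```

An NSID request is an option with code 3 and empty data, so detection
reduces to checking for any tuple whose first element is OPT_NSID.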
Tools like fpdns send OPCODE=IQUERY/STATUS/NOTIFY/UPDATE or set the reserved
Z bit to fingerprint resolver behaviour. Previously all these were parsed as
standard queries with no signal.
- opcode!=0 → fingerprint_probe probe=opcode_<name>, NOTIMP response;
fired before qdcount check so qdcount=0 UPDATE packets are still caught.
- Z bit set OR (AD+CD without RD) → fingerprint_probe probe=header_flags;
AD alone with RD is ignored to avoid tagging DNSSEC-aware stubs.
- Both variants contribute to recon_burst.
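
The two rules above can be sketched against the 16-bit DNS header flags
word. The bit masks follow the standard header layout; classify_header and
the lowercase opcode names are assumptions for illustration.

```python
from typing import Optional

# Hypothetical name table; the source only says probe=opcode_<name>.
OPCODE_NAMES = {1: "iquery", 2: "status", 4: "notify", 5: "update"}

FLAG_RD = 0x0100
FLAG_Z = 0x0040
FLAG_AD = 0x0020
FLAG_CD = 0x0010


def classify_header(flags: int) -> Optional[str]:
    """Return a probe label for unusual headers, or None for normal queries."""
    opcode = (flags >> 11) & 0xF
    if opcode != 0:
        # Checked first, so it fires regardless of qdcount.
        return "opcode_" + OPCODE_NAMES.get(opcode, str(opcode))
    rd = bool(flags & FLAG_RD)
    z = bool(flags & FLAG_Z)
    ad = bool(flags & FLAG_AD)
    cd = bool(flags & FLAG_CD)
    if z or (ad and cd and not rd):
        return "header_flags"
    return None
```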
qclass=255 in a standard query is unusual enough to be a fingerprinting probe
(fpdns, various scanner scripts). Previously it was logged as a plain query
with qclass=ANY in the event field; now it emits fingerprint_probe with
probe=qclass_any and returns REFUSED — consistent with how we treat other
probe types. Contributes to recon_burst.
The inline probe_map dict inside _handle made tests blind to the probe
catalogue and couldn't be extended without touching the hot path. It is now
module-level _CHAOS_PROBE_MAP. authors.bind. joins the three existing entries
so it gets named correctly instead of carrying the raw qname.
Packets with multiple questions were silently parsed at q0 only; the extra
questions were invisible. Now emits multi_question at severity=5 with the
qdcount and q0 qname, then falls through and answers q0 normally.
Silent drops on <12B packets, qdcount=0, and question-section ValueError gave
fuzzers and scanners a completely dark target. New events malformed_packet,
empty_question_section, and question_parse_error fire at severity=5 so these
probes are visible without counting toward recon_burst.
Adds DNS_FORWARD_BUDGET (default 50) and DNS_FORWARD_WINDOW (default 1.0s)
env vars. _can_forward() maintains a rolling deque of upstream call
timestamps; queries that exceed the budget within the window are answered
with the sinkhole (127.x) instead of being forwarded, making the honeypot
ineligible as a sustained amp vector even when real_recursive is enabled.
Rate limit is global (not per-source) so IP-spoofed amplification floods
hit the ceiling regardless of how many source addresses are rotated.
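
A sketch of a global rolling-window budget like the one described. The
class name and the injectable clock are assumptions added so the behaviour
can be tested deterministically; defaults match the env var defaults above.

```python
import time
from collections import deque


class ForwardBudget:
    """Allow at most `budget` upstream calls per rolling `window` seconds."""

    def __init__(self, budget=50, window=1.0, clock=time.monotonic):
        self.budget = budget
        self.window = window
        self._clock = clock
        self._calls = deque()

    def can_forward(self) -> bool:
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._calls and now - self._calls[0] >= self.window:
            self._calls.popleft()
        if len(self._calls) >= self.budget:
            return False   # caller answers with the sinkhole instead
        self._calls.append(now)
        return True
```

Because the deque is not keyed by source address, rotating spoofed sources
does not buy an attacker any extra forwards.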
When DNS_REAL_RECURSIVE=true and DNS_ZONE_MODE=recursive, out-of-zone
queries are forwarded to DNS_UPSTREAM (default 8.8.8.8:53) via async
UDP. Upstream response is relayed as-is; on timeout or error the
already-computed sinkhole (127.x) is returned instead.
_handle() always runs first so logging, tunneling detection, flood
tracking, and recon-burst aggregation fire on every query regardless
of whether the response ultimately comes from upstream. _dispatch()
overlays forwarding on top of the sync handler.
Protocol handlers (UDP datagram_received, TCP session) are now async
via asyncio.ensure_future / await _dispatch(). Service class exposes
real_recursive (bool) and upstream (string) config fields.
RA=1 + empty answer section is immediately detectable as fake by any
open-resolver scanner. Recursive mode now behaves like open mode
(127.0.0.x sinkhole, deterministic on qname) with RA=1 and AA=0,
matching what a real recursive resolver returns.
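
One way a deterministic-on-qname sinkhole address could be derived. This is
an assumed scheme (SHA-256 of the normalized name, mapped into 1..254), not
necessarily the one the service uses.

```python
import hashlib


def sinkhole_ipv4(qname: str) -> str:
    """Map a query name to a stable 127.0.0.x address with x in 1..254."""
    normalized = qname.rstrip(".").lower().encode("utf-8")
    digest = hashlib.sha256(normalized).digest()
    last_octet = 1 + digest[0] % 254   # skip .0 and .255
    return f"127.0.0.{last_octet}"
```

The same name always yields the same answer, so repeated lookups from a
scanner look consistent instead of random.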
- Add per-src QPS counter (_qps_window) with flood_suspect event at ≥50 qps/10s;
one event per src per 30s cooldown, does not suppress baseline query events.
- Add tracking_evicted telemetry every 100 LRU evictions so IP-rotation evasion
of _txt_times/_qps_window/_recon_window is observable, not silent.
- Shared _track_lru helper consolidates LRU touch + eviction signalling across
all three bounded OrderedDicts.
- Add TYPE_AAAA=28 support: _fake_ipv6() returns deterministic ULA (fd00::/8)
addresses for in-zone names; extra_records parser now accepts and validates
AAAA entries via socket.inet_pton.
- Add per-src recon-burst aggregation (_recon_window): fingerprint_probe +
zone_transfer + amp_probe are tracked per source in a 60s window; recon_burst
fires when ≥2 distinct signal types seen, once per src per 120s cooldown.
- 47 tests passing (19 new across TestAAAARecords, TestFloodDetection, TestReconBurst).
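
A sketch of the recon-burst rule from the list above: two or more distinct
signal types from one source inside 60s, at most one event per source per
120s. The class and method names are hypothetical, and the real code also
bounds its tables with the LRU helper, which this sketch omits.

```python
import time


class ReconBurst:
    def __init__(self, window=60.0, cooldown=120.0, clock=time.monotonic):
        self.window = window
        self.cooldown = cooldown
        self._clock = clock
        self._seen = {}        # src -> {signal_type: last_seen_ts}
        self._last_fired = {}  # src -> ts of the last recon_burst

    def observe(self, src: str, signal: str) -> bool:
        """Record a signal; return True when recon_burst should fire."""
        now = self._clock()
        seen = self._seen.setdefault(src, {})
        seen[signal] = now
        # Forget signal types last seen outside the window.
        for stale in [s for s, ts in seen.items() if now - ts > self.window]:
            del seen[stale]
        if len(seen) < 2:
            return False
        last = self._last_fired.get(src)
        if last is not None and now - last < self.cooldown:
            return False
        self._last_fired[src] = now
        return True
```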
Python asyncio DNS server on UDP+TCP/53 masquerading as BIND 9.x.
Emits five event_type values: query, fingerprint_probe (version.bind /
hostname.bind / id.server CHAOS), zone_transfer (AXFR/IXFR, always
REFUSED), amp_probe (qtype=ANY or EDNS udp_size>1232), and
tunneling_suspect (long high-entropy labels or rapid TXT burst).
Zone persona is generated per-decky from instance_seed (domain name,
SOA serial, NS, A, MX, TXT SPF); overridable via config_schema.
Three zone modes: auth (default), recursive, open (sinkhole).
AttackerData type gets bgp_prefix / rpki_status / rpki_source.
TimelineSection renders prefix inline next to AS number; RPKI status
shows as a green RPKI VALID / red RPKI INVALID badge, or dim
NO ROA for not-found. rpki-status-badge CSS added to Dashboard.css.
Export network block extended with the three new fields.
Import enrich_rpki from decnet.rpki and call it inline after the
ASN lookup. bgp_prefix, rpki_status, rpki_source added to the
record dict that feeds the Attacker upsert. enrich_rpki short-circuits
to (None, None) when asn is None, so private / unannounced IPs
never hit RIPE STAT.
bgp_prefix (max 43 chars, indexed) holds the covering CIDR from
the ASN lookup. rpki_status / rpki_source hold RIPE STAT validation
outcome. All nullable — null means enrichment was skipped or ASN
did not resolve.
RipeStatValidator makes two RIPE STAT calls per uncached IP:
network-info -> announced prefix, rpki-validation -> ROA state.
2-second timeout; any network failure returns status='unknown'.
SQLite cache keyed by IP, 12-hour TTL, pruned on validator init.
Cache avoids per-event HTTP for the high-churn attacker pool —
steady-state cost approaches zero for repeat offenders.
Synthesize the covering CIDR at lookup time from the matched iptoasn
range using ipaddress.summarize_address_range. AsnInfo.prefix is
populated per-query; not persisted in the pickle cache.
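
A sketch of synthesizing the prefix from a matched range. Note that
ipaddress.summarize_address_range can yield several networks when a range
is not CIDR-aligned; picking the one that contains the queried IP is an
assumption about how the lookup chooses, and covering_prefix is a
hypothetical name.

```python
import ipaddress
from typing import Optional


def covering_prefix(range_start: str, range_end: str, ip: str) -> Optional[str]:
    """Return the CIDR from [range_start, range_end] that contains ip."""
    first = ipaddress.ip_address(range_start)
    last = ipaddress.ip_address(range_end)
    addr = ipaddress.ip_address(ip)
    for net in ipaddress.summarize_address_range(first, last):
        if addr in net:
            return str(net)
    return None   # ip lies outside the range
```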
enrich_ip now returns (asn, as_name, bgp_prefix, provider_name).
Profiler worker updated to unpack the 4-tuple and write bgp_prefix
into the attacker record dict.
Four RFC 4443 stimuli (port-unreach, hop-limit-exceeded, unknown-NH,
bad-dest-option) produce a 4-char matrix + sha256 fingerprint for IPv6
attackers. Auto-registers via ActiveProbeMeta at priority=860 (after v4
icmp_error=850, before ipv6_leak=999). IPv4 targets fast-return None.
Sends four crafted stimuli (UDP/closed-port, TTL=1, DF+oversized,
bad IP option) and records which ICMP error classes come back, the
per-error RTT, and the bytes echoed in each ICMP body. Absence is
as informative as a reply — Linux rate-limiting is a fingerprint signal.
Returns None when no packets could be sent (no CAP_NET_RAW), so the
probe is a no-op in non-root test environments. Port-free ActiveProbe
subclass (priority=850), metaclass auto-registered in the registry.
Also fixes three sets of stale tests left over from the TlsCertProbe
migration (4b2759e0):
- test_active_probe_registry: closed name/order sets updated for
tls_certificate and icmp_error
- test_prober_rotation: dead patches on worker.fetch_leaf_cert removed
- test_prober_worker (TestProbeCycleTLSCert): rewritten to test
TlsCertProbe as an independent registry probe, patch target updated
from worker.fetch_leaf_cert to probes.tlscert_probe.fetch_leaf_cert
TLS cert capture was the last prober special-case that bypassed
ActiveProbeMeta. Moves logic into TlsCertProbe (priority=200, runs
after JARM) in probes/tlscert_probe.py; drops _capture_tls_cert,
the probe.probe_name=="jarm" name-check, and the direct
fetch_leaf_cert import from worker.py.
ActiveProbe.run/syslog_fields/publish_payload now accept port=None so
non-port-iterating probes can live in the registry. Ipv6LeakProbe replaces
the hand-rolled _ipv6_leak_phase special case in worker.py; it runs last
via priority=999. _probe_cycle no longer has an ad-hoc phase call.
Fixes three stale test files (test_prober_bus, test_prober_rotation,
test_prober_worker) that had been broken since the 916b21b6 registry refactor.
_route_info() calls _ip_route_get once and returns (on_link, iface);
worker._ipv6_leak_phase now calls it instead of the two separate helpers.
Bare except clauses at _ip_route_get and response parse now log at debug.
Iterates every template with a Dockerfile, builds decnet/<svc>:latest
with DOCKER_BUILDKIT=1. Supports NO_CACHE=1 and FAIL_FAST=0 flags,
mirrors the style of test-all. Updated help target.
FingerprintGroup switch fell through to FpGeneric (raw JSON dump) for all
four new fingerprint_type values the ingester now produces. Add FpJa4h,
FpHttpSettings, FpJa4Quic components and wire them into the dispatcher;
also register their labels and icons in fpTypeLabel/fpTypeIcon.
ingester: wrap bootstrap get_state() in a forever-retry loop. When MySQL came
up after the API process, the failed bootstrap killed the ingestion task
permanently before it ever entered _run_loop. Regression test added.
deps: idna 3.13→3.15 (CVE-2026-45409), twisted 26.4.0rc2→26.4.0
(PYSEC-2026-160), pip 26.1→26.1.1 (CVE-2026-3219 resolved upstream),
behave-core/behave-shell renamed from decnet-behave-* and bumped to 0.1.1.
pre-commit hook updated to reflect current ignore list.
Replace _jarm_phase / _hassh_phase / _tcpfp_phase boilerplate (3×~50
lines of identical port-iteration logic) with a metaclass-registered ABC.
Adding a new port-iterating active probe is now one class + three methods.
- decnet/prober/base.py: ActiveProbeMeta auto-registers subclasses by
probe_name; ActiveProbe ABC enforces run/syslog_fields/publish_payload
with env-driven DECNET_PROBE_PORTS_<NAME> port override.
- decnet/prober/probes/{jarm,hassh,tcpfp}.py: concrete probe classes.
- decnet/prober/worker.py: single _run_probe driver replaces the three
phase functions; _probe_cycle iterates ActiveProbeMeta.all(); drops
the ports=/ssh_ports=/tcpfp_ports= kwargs from prober_worker.
- IPv6 leak and TLS cert capture stay as special cases (different call
shapes; intentionally outside the registry).
- tests/prober/test_active_probe_registry.py: registry contents, sort
order, priority-10 override, ABC contract per probe class.
- tests/prober/test_run_probe_driver.py: dedup, success, None-skip,
exception, rotation, publish paths for _run_probe.
- tests/prober/test_prober_worker.py: updated patch targets and
_probe_cycle call sites; port control via monkeypatch.setattr.
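
A sketch of a metaclass-registered ABC in the spirit of the refactor above.
ActiveProbeMeta, ActiveProbe, probe_name, all() and the three method names
come from the message; the registry internals, the default priority, and
the two demo probes (BannerProbe, EchoProbe) are assumptions.

```python
from abc import ABCMeta, abstractmethod


class ActiveProbeMeta(ABCMeta):
    _registry = {}

    def __new__(mcls, name, bases, namespace, **kwargs):
        cls = super().__new__(mcls, name, bases, namespace, **kwargs)
        probe_name = namespace.get("probe_name")
        if probe_name:                       # the base class has no name
            mcls._registry[probe_name] = cls
        return cls

    @classmethod
    def all(mcls):
        """Registered probe classes, lowest priority number first."""
        return sorted(mcls._registry.values(),
                      key=lambda c: (c.priority, c.probe_name))


class ActiveProbe(metaclass=ActiveProbeMeta):
    probe_name = ""
    priority = 100   # assumed default

    @abstractmethod
    def run(self, ip, port): ...

    @abstractmethod
    def syslog_fields(self, result): ...

    @abstractmethod
    def publish_payload(self, result): ...


class EchoProbe(ActiveProbe):      # hypothetical demo probe
    probe_name = "echo"
    priority = 20

    def run(self, ip, port):
        return {"ip": ip, "port": port}

    def syslog_fields(self, result):
        return dict(result)

    def publish_payload(self, result):
        return {"kind": self.probe_name, **result}


class BannerProbe(EchoProbe):      # hypothetical demo probe
    probe_name = "banner"
    priority = 10
```

A driver loop can then iterate ActiveProbeMeta.all() without knowing any
probe by name, which is what removes the per-probe phase functions.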
- Add "ipv6_leak" to KNOWN_SOURCE_KINDS in ttp/base.py
- Register Ipv6LeakLifter(store) in factory.py get_tagger()
- Subscribe worker to attacker.fingerprinted; route by Event.type
so JARM/HASSH/ipv6_leak share the topic without source_kind collision
- Add bump_attacker_ipv6_leak() to BaseRepository (abstract) +
TTPMixin (implementation): increments ipv6_leak_count, sets last_ipv6_*
denorm fields, appends-with-dedup to AttackerIdentity.ipv6_link_local_iids
- Call bump_attacker_ipv6_leak from _process_event after insert_tags
- Add DummyRepo stub + coverage call in tests/db/test_base_repo.py
Add inline documentation for all known kind= discriminators on the
fingerprinted topic including the new ipv6_leak variant so future
consumers know what fields to expect without reading the prober source.
Ipv6LeakLifter subscribes to source_kind="ipv6_leak" events from both
the passive sniffer and active prober. Emits T1090 (Proxy) under TA0011
(C2) when fe80:: source address is observed — the attacker's VPN only
tunnels IPv4 so their link-local IID leaks their NIC identity.
Rule R0059 sets base confidence 0.85; iid_kind in the evidence carries
the per-observation strength (eui64 = MAC-derived, deterministic;
stable_privacy = RFC 7217; temporary = RFC 4941).
Add ipv6_leak.py with solicit_ipv6_leak() — sends ICMPv6 Echo to
ff02::1 on the attacker's iface and returns fe80:: evidence when a
link-local response arrives. Gated on _is_on_link(): skips when
attacker is behind a router (no L2 adjacency).
Add _ipv6_leak_phase() to worker.py (Phase 4 in _probe_cycle).
Phase runs once per attacker IP per cycle (sentinel at port 0 in
ip_probed["ipv6_leak"]) and publishes kind="ipv6_leak" via publish_fn.
Add list_v6_addrs(iface) to network.py: returns [(addr, scope)] for
all IPv6 addresses on an interface, required for source-routing ICMPv6
from the correct link-local address.
Add _ipv6_iid_classify() to fingerprint EUI-64 vs stable-privacy IIDs
and derive the MAC OUI from EUI-64-encoded link-local addresses.
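
A sketch of the EUI-64 half of that classification. An EUI-64 IID carries
ff:fe in its middle two bytes and the MAC's first byte with the
universal/local bit flipped. Separating stable-privacy from temporary IIDs
needs more than one observation, so this sketch lumps everything else under
a hypothetical "non_eui64" label; classify_iid is likewise an illustrative
name.

```python
import ipaddress


def classify_iid(addr: str) -> dict:
    """Classify the interface identifier of a link-local IPv6 address."""
    ip = ipaddress.IPv6Address(addr.split("%")[0])   # drop any zone id
    if not ip.is_link_local:
        raise ValueError("expected an fe80::/10 address")
    iid = ip.packed[8:]                              # low 64 bits
    if iid[3:5] == b"\xff\xfe":
        # Undo the U/L bit flip to recover the original MAC bytes.
        mac = bytes([iid[0] ^ 0x02]) + iid[1:3] + iid[5:8]
        oui = ":".join(f"{b:02x}" for b in mac[:3])
        return {"iid_kind": "eui64", "mac_oui": oui}
    return {"iid_kind": "non_eui64", "mac_oui": None}
```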
SnifferEngine._on_ipv6_packet() observes fe80::/10 sources destined for
known deckies and emits ipv6_link_local_leak syslog + bus events.
on_packet() now dispatches the IPv6 branch before the v4 TCP path.
BPF default widened from "tcp" to "tcp or ip6" so the sniff loop
captures IPv6 frames without config change.
Attacker gains five denormalized cache fields (ipv6_leak_count,
last_ipv6_leak_at, last_ipv6_link_local, last_ipv6_iid_kind,
last_ipv6_mac_oui) mirroring the rotation_count/last_rotation_at pattern.
AttackerIdentity gains ipv6_link_local_iids (JSON list[dict]) for
EUI-64-derived MAC cluster signals that survive VPN/IP rotation.
No ALTER TABLE helpers — direct SQLModel column additions per pre-v1 policy.
Pins the evidence shape for IPv6 link-local leakage findings. All fields
optional (total=False) so partial observation (passive sniffer vs active
solicitation) fills whatever the vector provides. Lifter lands in a
subsequent commit.
"""Campaign clustering — see development/CAMPAIGN_CLUSTERING.md."""