- New useLifecyclePolling(ids, intervalMs) hook: polls
GET /deckies/lifecycle?ids=... every 2s until every row is terminal,
surfaces transient HTTP failures without giving up.
- DeployWizard: drops the 180s axios timeout and the fake-log-driven
deployOk flag. After POST 202, sets lifecycle_ids -> the hook drives
the per-decky pill grid (PENDING / RUNNING / SUCCEEDED / FAILED).
Real terminal lines stream into the log as rows resolve. Auto-close
on all-success after 700ms.
- DeckyFleet.css: .lifecycle-grid + .lifecycle-pill in the existing
fleet vocabulary; running pill pulses, failed pill borders alert.
- Existing 4 wizard render tests still pass; 4 new hook tests cover
empty ids / single-success / polling-until-terminal / HTTP error.
GET /deckies/lifecycle?ids=<uuid>&ids=<uuid> returns the matching
DeckyLifecycle rows so the wizard can poll instead of holding an HTTP
request open across compose work. require_viewer gating -- read-only.
Startup sweep: on master boot, any pending/running row with
started_at older than 1h flips to failed with
error='master restarted during operation'. Pre-v1 substitute for a
durable task queue: if the master crashes mid-deploy, the wizard sees
FAILED on refresh and the operator retries. Idempotent + cheap; runs
unconditionally including in contract-test mode.
This is the unblock for the wizard hang. Both endpoints used to run
docker compose synchronously inside the HTTP handler -- on master
(unihost) or via asyncio.gather of worker /deploy POSTs at 600s
timeout each (swarm) -- blocking every other API request.
New flow:
1. Commit the new config shape to repo state (fast).
2. Create one DeckyLifecycle row per decky (status=pending).
3. Spawn asyncio.create_task(run_deploy / run_mutate) -- the
lifecycle runner drives rows through running -> succeeded|failed
and emits decky.<name>.lifecycle on the bus.
4. Return 202 with {lifecycle_ids: [...]}. Wizard polls
GET /deckies/lifecycle?ids=... (next commit).
mutator/engine.py gains pick_new_services() -- shared between the
async API path and the watch-loop's synchronous mutate_decky().
DeployResponse grows lifecycle_ids[]. The old dispatch_decnet_config
helper still exists for the CLI swarm-deploy command path; it just
isn't called from the API handler anymore.
Test changes: 200 -> 202, drop dispatch_decnet_config mocks (handler
no longer calls it), assert lifecycle_ids in response + committed
state matches expectations.
HeartbeatRequest grows an optional lifecycle field carrying per-decky
completion records from the worker:
[{decky_name, operation, status, error?, completed_at?}]
For each delta, the master finds the most-recently-started open
DeckyLifecycle row for (decky_name, operation, host_uuid) and flips
it to terminal with the worker's error text + timestamp. Stale
duplicates (row already sealed or never existed) are logged and
dropped -- not errors.
Each successful pivot also emits decky.<name>.lifecycle on the bus
so the dashboard sees the transition without waiting for its next
poll tick.
This is the master-side completion channel for the worker's 202
fire-and-forget /deploy and /mutate.
The wizard API used to hang because /deckies/deploy ran docker compose
build && up -d synchronously, holding the request thread for minutes.
The worker side of that pipeline now returns 202 Accepted immediately
and runs the deploy in an asyncio.create_task.
On task completion (success or failure) the worker pushes a one-off
heartbeat carrying a lifecycle delta per decky:
{decky_name, operation, status: succeeded|failed, error?, completed_at}
Master pivots these onto open DeckyLifecycle rows in the heartbeat
handler (next commit). The scheduled 30s heartbeat tick is the
fallback if the immediate push drops.
- decnet/agent/app.py: /deploy and /mutate return 202; dry_run mutate
still validates synchronously and returns 200.
- decnet/agent/executor.py: deploy_async + mutate_async wrap the work
and push the completion delta.
- decnet/agent/heartbeat.py: push_lifecycle_delta() helper builds a
one-off body and POSTs with the same mTLS context as the loop.
- decnet/swarm/client.py: revert deploy/mutate to control timeout
(master no longer holds the HTTP request open for compose work).
Worker state.json gains no lifecycle field -- master DeckyLifecycle is
the source of truth; the master sweep handles crashed-mid-deploy
recovery.
Add decnet.lifecycle package: pure orchestration layer that the
master API will invoke via asyncio.create_task to drive DeckyLifecycle
rows through pending -> running -> succeeded | failed without
holding an HTTP request open.
Strategy classes per (operation, transport):
- LocalDeployStrategy: master-resident, runs engine.deployer.deploy
in a thread.
- SwarmDeployStrategy: shards by host_uuid, dispatches via
AgentClient.deploy; worker drives terminal via heartbeat.
- LocalMutateStrategy: write_compose + compose up.
- SwarmMutateStrategy: AgentClient.mutate; worker drives terminal.
decnet.bus.topics gains decky_lifecycle(name) -> decky.<name>.lifecycle
plus DECKY_LIFECYCLE constant. Payload documented in the wiki
(separate commit). publish_safely keeps bus best-effort.
Nothing is wired to call this yet -- next commits convert worker
/deploy /mutate to 202, then heartbeat delta wiring, then master API.
One row per (decky, operation) attempt. State machine:
pending -> running -> succeeded | failed (+ error text). Rows are
append-only after terminal; retries write a new row.
Sibling of DeckyShard rather than a rework -- DeckyShard tracks
runtime container state observed via heartbeat, this tracks
operation lifecycle. New table, UUID PK.
Adds BaseRepository abstract methods (create_lifecycle,
update_lifecycle, get_lifecycle_by_ids, find_open_lifecycle,
sweep_stale_lifecycle) with SQLModelRepository mixin impl.
Backbone for the upcoming 202-Accepted async API.
- Implement /mutate handler: load_state, update services + last_mutated,
save_state, write_compose, compose up -d via asyncio.to_thread. 404
for missing state / unknown decky_id. dry_run short-circuits before
any side effect.
- Add AgentClient.mutate(decky_id, services, *, dry_run=False) using
_TIMEOUT_DEPLOY (compose up can pull/build, exceeds control timeout).
- mutator/engine.py: in swarm mode with decky.host_uuid set, resolve
worker via _resolve_swarm_host and dispatch through AgentClient.mutate
instead of writing a compose file on master. Master-resident deckies
(unihost mode, or swarm with host_uuid=None) keep the local path.
Adds state_path ServiceConfigField and passes DNS_STATE_PATH into the
container environment. Operator must mount the parent directory on a
volume for persistence to survive container recreation.
Switch burst deque from monotonic() to time.time() (wall-clock, serializable).
Add DNS_STATE_PATH env var: on startup _load_state() reads {src:[ts,...]} JSON
and prunes entries older than the burst window. _flush_state() write-then-renames
atomically; _state_flusher() coroutine flushes every 5s when dirty. Detection of
the 5th event also triggers an immediate flush. No-op when DNS_STATE_PATH is
unset, so the default deployment is unchanged.
Rename _txt_times -> _tunnel_times. Add TYPE_CNAME=5, TYPE_NULL=10,
TYPE_PRIVATE=65399 constants. Guard burst counter with _TUNNEL_QTYPES
frozenset instead of TYPE_TXT only. Mixed-type queries from one source
now share a single burst window, closing iodine NULL/CNAME downlink
and AAAA-encoded uplink evasion gaps.
_is_tunneling now returns str|None (the detection method) instead of bool.
Two new tunables _QNAME_TOTAL_LEN_THRESHOLD=50 and _QNAME_ENTROPY_THRESHOLD=3.5
catch attackers who split a high-entropy payload across multiple short labels.
tunnel_method field added to tunneling_suspect events for downstream correlation.
_parse_edns_size only extracted the requestor UDP size; every other field in
the OPT record (DO bit, EDNS version, extended RCODE, all sub-options) was
invisible. Replaced with _parse_opt_record returning a full dict:
udp_size, ext_rcode, version, do_bit, z, options[(code, len, data)]
NSID request (option code 3) is now detected as fingerprint_probe with
probe=edns_nsid and contributes to recon_burst. DO bit, COOKIE (10), and
other options are not escalated; udp_size continues to drive amp_probe.
Tools like fpdns send OPCODE=IQUERY/STATUS/NOTIFY/UPDATE or set the reserved
Z bit to fingerprint resolver behaviour. Previously all these were parsed as
standard queries with no signal.
- opcode!=0 → fingerprint_probe probe=opcode_<name>, NOTIMP response;
fired before qdcount check so qdcount=0 UPDATE packets are still caught.
- Z bit set OR (AD+CD without RD) → fingerprint_probe probe=header_flags;
AD alone with RD is ignored to avoid tagging DNSSEC-aware stubs.
- Both variants contribute to recon_burst.
qclass=255 in a standard query is unusual enough to be a fingerprinting probe
(fpdns, various scanner scripts). Previously it was logged as a plain query
with qclass=ANY in the event field; now it emits fingerprint_probe with
probe=qclass_any and returns REFUSED — consistent with how we treat other
probe types. Contributes to recon_burst.
The inline probe_map dict inside _handle made tests blind to the probe
catalogue and couldn't be extended without touching the hot path. It is now
module-level _CHAOS_PROBE_MAP. authors.bind. joins the three existing entries
so it gets named correctly instead of carrying the raw qname.
Packets with multiple questions were silently parsed at q0 only; the extra
questions were invisible. Now emits multi_question at severity=5 with the
qdcount and q0 qname, then falls through and answers q0 normally.
Silent drops on <12B packets, qdcount=0, and question-section ValueError gave
fuzzers and scanners a completely dark target. New events malformed_packet,
empty_question_section, and question_parse_error fire at severity=5 so these
probes are visible without counting toward recon_burst.
Adds DNS_FORWARD_BUDGET (default 50) and DNS_FORWARD_WINDOW (default 1.0s)
env vars. _can_forward() maintains a rolling deque of upstream call
timestamps; queries that exceed the budget within the window are answered
with the sinkhole (127.x) instead of being forwarded, making the honeypot
ineligible as a sustained amp vector even when real_recursive is enabled.
Rate limit is global (not per-source) so IP-spoofed amplification floods
hit the ceiling regardless of how many source addresses are rotated.
When DNS_REAL_RECURSIVE=true and DNS_ZONE_MODE=recursive, out-of-zone
queries are forwarded to DNS_UPSTREAM (default 8.8.8.8:53) via async
UDP. Upstream response is relayed as-is; on timeout or error the
already-computed sinkhole (127.x) is returned instead.
_handle() always runs first so logging, tunneling detection, flood
tracking, and recon-burst aggregation fire on every query regardless
of whether the response ultimately comes from upstream. _dispatch()
overlays forwarding on top of the sync handler.
Protocol handlers (UDP datagram_received, TCP session) are now async
via asyncio.ensure_future / await _dispatch(). Service class exposes
real_recursive (bool) and upstream (string) config fields.
RA=1 + empty answer section is immediately detectable as fake by any
open-resolver scanner. Recursive mode now behaves like open mode
(127.0.0.x sinkhole, deterministic on qname) with RA=1 and AA=0,
matching what a real recursive resolver returns.
- Add per-src QPS counter (_qps_window) with flood_suspect event at ≥50 qps/10s;
one event per src per 30s cooldown, does not suppress baseline query events.
- Add tracking_evicted telemetry every 100 LRU evictions so IP-rotation evasion
of _txt_times/_qps_window/_recon_window is observable, not silent.
- Shared _track_lru helper consolidates LRU touch + eviction signalling across
all three bounded OrderedDicts.
- Add TYPE_AAAA=28 support: _fake_ipv6() returns deterministic ULA (fd::/8)
addresses for in-zone names; extra_records parser now accepts and validates
AAAA entries via socket.inet_pton.
- Add per-src recon-burst aggregation (_recon_window): fingerprint_probe +
zone_transfer + amp_probe are tracked per source in a 60s window; recon_burst
fires when ≥2 distinct signal types seen, once per src per 120s cooldown.
- 47 tests passing (19 new across TestAAAARecords, TestFloodDetection, TestReconBurst).
Python asyncio DNS server on UDP+TCP/53 masquerading as BIND 9.x.
Emits four event_type values: query, fingerprint_probe (version.bind /
hostname.bind / id.server CHAOS), zone_transfer (AXFR/IXFR, always
REFUSED), amp_probe (qtype=ANY or EDNS udp_size>1232), and
tunneling_suspect (long high-entropy labels or rapid TXT burst).
Zone persona is generated per-decky from instance_seed (domain name,
SOA serial, NS, A, MX, TXT SPF); overridable via config_schema.
Three zone modes: auth (default), recursive, open (sinkhole).
AttackerData type gets bgp_prefix / rpki_status / rpki_source.
TimelineSection renders prefix inline next to AS number; RPKI status
shows as a green RPKI VALID / red RPKI INVALID badge, or dim
NO ROA for not-found. rpki-status-badge CSS added to Dashboard.css.
Export network block extended with the three new fields.
Import enrich_rpki from decnet.rpki and call it inline after the
ASN lookup. bgp_prefix, rpki_status, rpki_source added to the
record dict that feeds the Attacker upsert. enrich_rpki short-circuits
to (None, None) when asn is None, so private / unannounced IPs
never hit RIPE STAT.
bgp_prefix (max 43 chars, indexed) holds the covering CIDR from
the ASN lookup. rpki_status / rpki_source hold RIPE STAT validation
outcome. All nullable — null means enrichment was skipped or ASN
did not resolve.
RipeStatValidator makes two RIPE STAT calls per uncached IP:
network-info -> announced prefix, rpki-validation -> ROA state.
2-second timeout; any network failure returns status='unknown'.
SQLite cache keyed by IP, 12-hour TTL, pruned on validator init.
Cache avoids per-event HTTP for the high-churn attacker pool —
steady-state cost approaches zero for repeat offenders.
Synthesize the covering CIDR at lookup time from the matched iptoasn
range using ipaddress.summarize_address_range. AsnInfo.prefix is
populated per-query; not persisted in the pickle cache.
enrich_ip now returns (asn, as_name, bgp_prefix, provider_name).
Profiler worker updated to unpack the 4-tuple and write bgp_prefix
into the attacker record dict.
Four RFC 4443 stimuli (port-unreach, hop-limit-exceeded, unknown-NH,
bad-dest-option) produce a 4-char matrix + sha256 fingerprint for IPv6
attackers. Auto-registers via ActiveProbeMeta at priority=860 (after v4
icmp_error=850, before ipv6_leak=999). IPv4 targets fast-return None.
Sends four crafted stimuli (UDP/closed-port, TTL=1, DF+oversized,
bad IP option) and records which ICMP error classes come back, the
per-error RTT, and the bytes echoed in each ICMP body. Absence is
as informative as a reply — Linux rate-limiting is a fingerprint signal.
Returns None when no packets could be sent (no CAP_NET_RAW), so the
probe is a no-op in non-root test environments. Port-free ActiveProbe
subclass (priority=850), metaclass auto-registered in the registry.
Also fixes three sets of stale tests left over from the TlsCertProbe
migration (4b2759e0):
- test_active_probe_registry: closed name/order sets updated for
tls_certificate and icmp_error
- test_prober_rotation: dead patches on worker.fetch_leaf_cert removed
- test_prober_worker (TestProbeCycleTLSCert): rewritten to test
TlsCertProbe as an independent registry probe, patch target updated
from worker.fetch_leaf_cert to probes.tlscert_probe.fetch_leaf_cert
TLS cert capture was the last prober special-case that bypassed
ActiveProbeMeta. Moves logic into TlsCertProbe (priority=200, runs
after JARM) in probes/tlscert_probe.py; drops _capture_tls_cert,
the probe.probe_name=="jarm" name-check, and the direct
fetch_leaf_cert import from worker.py.
ActiveProbe.run/syslog_fields/publish_payload now accept port=None so
non-port-iterating probes can live in the registry. Ipv6LeakProbe replaces
the hand-rolled _ipv6_leak_phase special case in worker.py; it runs last
via priority=999. _probe_cycle no longer has an ad-hoc phase call.
Fixes three stale test files (test_prober_bus, test_prober_rotation,
test_prober_worker) that were broken since the 916b21b6 registry refactor.
_route_info() calls _ip_route_get once and returns (on_link, iface);
worker._ipv6_leak_phase now calls it instead of the two separate helpers.
Bare except clauses at _ip_route_get and response parse now log at debug.
Iterates every template with a Dockerfile, builds decnet/<svc>:latest
with DOCKER_BUILDKIT=1. Supports NO_CACHE=1 and FAIL_FAST=0 flags,
mirrors the style of test-all. Updated help target.
FingerprintGroup switch fell through to FpGeneric (raw JSON dump) for all
four new fingerprint_type values the ingester now produces. Add FpJa4h,
FpHttpSettings, FpJa4Quic components and wire them into the dispatcher;
also register their labels and icons in fpTypeLabel/fpTypeIcon.
ingester: wrap bootstrap get_state() in forever-retry loop — MySQL coming
up after the API process killed the ingestion task permanently before it
ever entered _run_loop. Regression test added.
deps: idna 3.13→3.15 (CVE-2026-45409), twisted 26.4.0rc2→26.4.0
(PYSEC-2026-160), pip 26.1→26.1.1 (CVE-2026-3219 resolved upstream),
behave-core/behave-shell renamed from decnet-behave-* and bumped to 0.1.1.
pre-commit hook updated to reflect current ignore list.
Replace _jarm_phase / _hassh_phase / _tcpfp_phase boilerplate (3×~50
lines of identical port-iteration logic) with a metaclass-registered ABC.
Adding a new port-iterating active probe is now one class + three methods.
- decnet/prober/base.py: ActiveProbeMeta auto-registers subclasses by
probe_name; ActiveProbe ABC enforces run/syslog_fields/publish_payload
with env-driven DECNET_PROBE_PORTS_<NAME> port override.
- decnet/prober/probes/{jarm,hassh,tcpfp}.py: concrete probe classes.
- decnet/prober/worker.py: single _run_probe driver replaces the three
phase functions; _probe_cycle iterates ActiveProbeMeta.all(); drops
the ports=/ssh_ports=/tcpfp_ports= kwargs from prober_worker.
- IPv6 leak and TLS cert capture stay as special cases (different call
shapes; intentionally outside the registry).
- tests/prober/test_active_probe_registry.py: registry contents, sort
order, priority-10 override, ABC contract per probe class.
- tests/prober/test_run_probe_driver.py: dedup, success, None-skip,
exception, rotation, publish paths for _run_probe.
- tests/prober/test_prober_worker.py: updated patch targets and
_probe_cycle call sites; port control via monkeypatch.setattr.
- Add "ipv6_leak" to KNOWN_SOURCE_KINDS in ttp/base.py
- Register Ipv6LeakLifter(store) in factory.py get_tagger()
- Subscribe worker to attacker.fingerprinted; route by Event.type
so JARM/HASSH/ipv6_leak share the topic without source_kind collision
- Add bump_attacker_ipv6_leak() to BaseRepository (abstract) +
TTPMixin (implementation): increments ipv6_leak_count, sets last_ipv6_*
denorm fields, appends-with-dedup to AttackerIdentity.ipv6_link_local_iids
- Call bump_attacker_ipv6_leak from _process_event after insert_tags
- Add DummyRepo stub + coverage call in tests/db/test_base_repo.py
Add inline documentation for all known kind= discriminators on the
fingerprinted topic including the new ipv6_leak variant so future
consumers know what fields to expect without reading the prober source.
Ipv6LeakLifter subscribes to source_kind="ipv6_leak" events from both
the passive sniffer and active prober. Emits T1090 (Proxy) under TA0011
(C2) when fe80:: source address is observed — the attacker's VPN only
tunnels IPv4 so their link-local IID leaks their NIC identity.
Rule R0059 sets base confidence 0.85; iid_kind in the evidence carries
the per-observation strength (eui64 = MAC-derived, deterministic;
stable_privacy = RFC 7217; temporary = RFC 4941).