Commit Graph

78 Commits

Author SHA1 Message Date
6c4ea706f8 feat(api): canary token CRUD router (/api/v1/canary) + tests
Two sub-routers under /api/v1/canary:

blobs (operator-uploaded artifacts, deduped by sha256):
- POST   /blobs          (multipart upload; admin)
- GET    /blobs          (list with token_count; admin)
- DELETE /blobs/{uuid}   (refcount-aware; 409 when referenced; admin)

tokens (per-decky planted artifacts):
- POST   /tokens                          (generate or instrument + plant; admin)
- GET    /tokens?decky_name=&kind=&state= (filter; viewer)
- GET    /tokens/{uuid}                   (detail; viewer)
- GET    /tokens/{uuid}/preview           (instrumented bytes; admin)
- GET    /tokens/{uuid}/triggers          (paged callback log; viewer)
- DELETE /tokens/{uuid}                   (revoke + bus event; admin)

XOR validation: exactly one of blob_uuid / generator must be set.
Path validation rejects relative/NUL/newlines/.. segments. Every
body-bearing route documents 400 plus 401/403/404 as applicable.

Stdlib MIME sniffer (no python-magic dep) covers PNG/JPEG/GIF/PDF/
HTML/XML/DOCX/XLSX/JSON/YAML/TOML/text/plain; everything else falls
through to passthrough.

Tests run end-to-end through the live FastAPI app (planter docker
exec is patched); 17 cases covering dedup, refcount, lifecycle,
XOR validation, path validation, and 404 paths.
2026-04-27 13:18:00 -04:00
f046634d6e feat(web): Persona Generation page under AUTOMATION
New dashboard surface for editing the global emailgen persona pool —
the JSON file fleet (MACVLAN/IPVLAN) and SWARM-shard mail deckies pull
from.  MazeNET topology personas are out of scope here; they're
configured per-topology in the topology editor.

Backend:
* GET/PUT /api/v1/emailgen/personas — admin-write, viewer-read.  PUT
  validates with the same Pydantic schema the worker uses
  (parse_personas), drops invalid entries with a warning, returns 400
  only when the entire payload fails.  Path is operator-discoverable
  on every response so a CLI-driven backup workflow stays visible.

Frontend:
* PersonaGeneration.tsx + .css — table + add/edit modal with the full
  EmailPersona schema (name, email, role, tone, mannerisms list,
  language, signature, active hours, reply latency, uses_llms_heavily).
  Local edits are batched; explicit "SAVE CHANGES" writes back, with a
  dirty-indicator pill and a "DISCARD" reset.  Email uniqueness is
  enforced client-side so the scheduler never picks the same persona
  as both sender + recipient.
* Sidebar AUTOMATION group gains a "Persona Generation" entry next to
  Orchestrator; route registered at /persona-generation.

The worker reads the same on-disk file the API writes — see
decnet.orchestrator.emailgen.global_pool.  The API resets the
in-process cache on every read/write so the worker picks up dashboard
edits within its next tick rather than waiting on mtime.
2026-04-27 09:55:42 -04:00
818aebadfc feat(web): emailgen events in Orchestrator page
The SSE pipe at /orchestrator/events/stream was already streaming
'orchestrator.email.{decky_uuid}' events (the subscription is for the
'orchestrator.>' wildcard), but the consumer side dropped them on the
floor.  Three fixes to close the loop:

* useOrchestratorStream.ts now registers an 'email' SSE listener — the
  EventSource silently ignores frames whose event name has no listener,
  so missing this entry meant every email frame was dropped before
  reaching the page's onEvent handler.

* /api/v1/orchestrator/events accepts kind=email and dispatches to
  list_orchestrator_emails, adapting rows to the existing wire shape:
  subject -> action, sender_email -> src_decky_uuid, recipient_email
  -> dst_decky_uuid, plus email-specific extras (thread_id, language,
  mail_decky_uuid, message_id, in_reply_to) ride along as top-level
  keys.

* Orchestrator.tsx gains an 'email' tab in the kind filter and a
  branch in the row renderer / inspector that:
   - shows full sender / recipient (no UUID truncation),
   - chips the language code next to the subject,
   - relabels ACTION as SUBJECT in the inspector and surfaces
     thread / in-reply-to / mail-decky details.

The 'all' tab continues to show traffic+file only (today's behavior);
operators see emails by switching to the email tab.  A union view at
the API layer is the obvious follow-up but not necessary for now.
2026-04-26 22:56:48 -04:00
5b5ff54fa2 feat(web): orchestrator events read API + SSE stream
GET /api/v1/orchestrator/events — paginated list with optional
kind=traffic|file filter. GET /api/v1/orchestrator/events/stream —
SSE: snapshot on connect, live forward of orchestrator.> bus events
mapped to 'traffic' / 'file' SSE event names.

Repo gains list_orchestrator_events(limit, offset, kind?, since_ts?),
count_orchestrator_events(kind?), and prune_orchestrator_events
(per_dst_cap=10000) for periodic worker-side trimming.
2026-04-26 19:58:12 -04:00
d531cea536 feat(web): read-only campaigns API + SSE + frontend
API: /api/v1/campaigns (paginated list), /api/v1/campaigns/{uuid}
(soft-merge chain follow), /api/v1/campaigns/{uuid}/identities
(member identities), and /api/v1/campaigns/events (SSE under
campaign.> + JWT-via-?token=, snapshot-on-connect). Mirror of the
identity router; same auth, same shape, same OpenAPI tags pattern.

Frontend: CampaignDetail.tsx page (same visual vocabulary as
IdentityDetail), useCampaignStream hook (mirror of
useIdentityStream), /campaigns/:id route, IdentityDetail's
CAMPAIGN badge becomes clickable and navigates to the campaign.
useIdentityStream now listens for identity.campaign.assigned so
the badge appears live without a manual refresh.
2026-04-26 09:20:17 -04:00
97aa57faed feat(api): SSE stream for identity events at /api/v1/identities/events
Mirrors GET /api/v1/topologies/{id}/events: subscribes to identity.>
on the bus for the duration of the request and forwards each event as
a named SSE frame (formed / observation.linked / merged / unmerged).

The endpoint is broadly scoped (every identity event, not per-uuid)
because both AttackerDetail and IdentityDetail need the same
firehose: AttackerDetail watches for an identity.formed that finally
binds its identity_id; IdentityDetail watches for
observation.linked / merged / unmerged against its current row. A
per-uuid filter would force the client to know its identity before
subscribing, which it doesn't always.

JWT via ?token= (EventSource can't set headers), require_stream_viewer
gate, sse_connection_slot per-user cap, snapshot-on-connect with
the first 50 identities so the client buffer renders without a
separate REST call.

Bus-disabled / unreachable path keeps the connection alive on
keepalives so the client doesn't reconnect-storm; it can re-poll
the REST API on its own timer.
2026-04-26 08:36:17 -04:00
181c792753 feat(api): GET /credential-reuse list + detail endpoints
Read-only routes for the credential-reuse findings produced by the
correlator. Mirrors the /credentials route shape: JWT-gated via
require_viewer, paginated with optional secret_kind /
min_target_count filters, and a 404-on-missing detail route.

No POST/PUT/PATCH (and no body parsing) so no 400 contract is
documented.
2026-04-26 03:40:08 -04:00
4566146d50 feat(api): GET /credentials endpoint
Surfaces the Credential table (deduped attacker auth attempts) via
a new /api/v1/credentials route. Mirrors the Bounty cache pattern
(5s TTL on the unfiltered default page) and reuses the existing
get_credentials / get_total_credentials repo methods + the already
defined CredentialsResponse DTO. Filters: search, service, attacker_ip.
2026-04-25 07:51:20 -04:00
ee176a6f79 Revert "feat(mazenet): per-LAN swarm host pin"
This reverts commit 0d92170a57.
2026-04-25 03:26:19 -04:00
0d92170a57 feat(mazenet): per-LAN swarm host pin
Adds nullable LAN.host_uuid (FK swarm_hosts.uuid). Resolution order
when deploying a LAN: lan.host_uuid → topology.target_host_uuid →
master. A LAN is one Docker bridge so the bridge cannot span hosts;
this pin forces every decky in the LAN onto the named host.

LANCreateRequest / LANUpdateRequest accept host_uuid; both validate
that the host exists, returning 400 on unknown UUIDs. PATCH still
gated by the existing pending-only guard, so reassignment of a live
LAN is not yet possible (deferred to mutator support).

LANRow surfaces the field so the frontend can render per-host badges.
2026-04-25 03:04:23 -04:00
c214cdd7bb fix(api/topology): map duplicate-name IntegrityError to 409
POST /topologies raised a 500 with a raw SQLAlchemy IntegrityError
traceback when the name collided with an existing topology. Catch
the error at the router, verify it's the ix_topologies_name
constraint (so unrelated integrity failures still surface as 500s
with their real traceback), and return 409 with a helpful detail.

Test covers the create-then-duplicate-create flow.
2026-04-24 19:06:37 -04:00
2bcef50ac5 feat(webhooks): circuit breaker auto-disables misbehaving subscriptions
After DECNET_WEBHOOK_CIRCUIT_THRESHOLD (default 5) consecutive failed
deliveries, the worker calls trip_webhook_circuit(uuid, ts) which
flips enabled=False and stamps auto_disabled_at. The worker sets its
reload flag so the next dispatch epoch stops consuming events for the
tripped sub entirely — one dead receiver can't poison the shared
egress pool anymore.

Operator clears the trip via PATCH — setting enabled=True when the
sub was previously disabled clears auto_disabled_at, zeros
consecutive_failures, and clears last_error. Admin-pause → re-enable
hits the same path harmlessly.

Three observable states now distinguishable in the UI:
- Active              enabled=True,  auto_disabled_at=NULL
- Admin-paused        enabled=False, auto_disabled_at=NULL
- Tripped             enabled=False, auto_disabled_at=<ts>

UI surfaces a TRIPPED · <ts> chip on the row (red, alert-styled) and
a "N TRIPPED" count in the page header. Hover tooltip tells the
operator how to reset ("Re-enable via Edit").

record_webhook_failure now returns the new consecutive_failures count
so the worker can compare against the threshold without a second
roundtrip. trip_webhook_circuit is idempotent — re-tripping just
re-stamps auto_disabled_at.

Closes THREAT_MODEL WH-02 and DEBT-037 §1.
2026-04-24 16:24:33 -04:00
638236113d feat(webhooks): non-blocking http:// warning + WH-03 accepted risk
WebhookResponse now carries a `warnings: list[str]` field. When the
subscription's URL starts with http://, an `insecure_url` advisory is
surfaced on every GET/CREATE without blocking the request. HMAC still
detects tampering regardless of transport — only read-confidentiality
is lost over plaintext — and test/dev environments without TLS stay
usable.

Matches the operator-trust posture already established by DA-06
(admin-on-admin protection is out of scope). The alternative — hard
rejection at admin time — was considered and declined; warning-plus-
visibility is the right shape.

THREAT_MODEL WH-03 accepted risk registered; revisit triggers are
multi-admin delegation, a regulated customer, or an operator ticket
asking for a DECNET_WEBHOOK_REQUIRE_HTTPS enforcement knob.
2026-04-24 15:53:30 -04:00
b70845a85d feat(webhooks): subscription CRUD + HMAC-signed delivery client
Introduces the webhook egress foundation — a new WebhookSubscription
table, admin-gated CRUD under /api/v1/webhooks, and the shared
delivery client that both the test-ping route and the upcoming worker
will use. No worker yet; this commit is API + model + client only.

Simple-mode enum (AttackerDetail / DeckyStatus / SystemStatus) expands
to bus-topic patterns at the router layer; storage is always the raw
pattern list. Advanced mode lets admins supply raw NATS-style patterns
directly. Filter-at-subscribe: the worker (next commit) will subscribe
to the union of patterns across enabled subscriptions.

Delivery client handles HMAC-SHA256 signing (X-DECNET-Signature),
retry on 429/5xx/network errors with jittered backoff, no-retry on
4xx. Secrets never leave the server on GET/LIST — only the create
response carries the secret for copy-out.

CRUD routes publish WEBHOOK_SUBSCRIPTIONS_CHANGED on the bus after
every mutation so the (future) worker can hot-reload.

Opens DEBT-037 for the deferred items (circuit breaker, dead-letter,
batch delivery, payload templates, secret-at-rest).
2026-04-24 15:30:05 -04:00
162f7c1194 feat(api/sse): per-user connection cap + viewer-safe invariant
New decnet/web/sse_limits.py provides sse_connection_slot, an async
context manager that counts live SSE connections per user UUID and
raises 429 when a per-user cap is exceeded (default 5, override via
DECNET_SSE_MAX_PER_USER). Wired into both SSE generators as their
first async with, so the cap check fires before any stream data is
yielded.

The cap must sit inside the generator — StreamingResponse returns
before the generator body runs, so a handler-level wrapper would
release the slot immediately. Put prefetch + slot + loop all under
the one async with.

Also documents F6/I (role leakage) as mitigated-by-construction via
handler docstrings: every event type on both streams wraps data
already reachable via viewer-gated REST, so no per-event filter is
needed until a new event family is introduced. The invariant is
written into the handler docstrings so a future PR can't silently
add admin-only events.

Resolves THREAT_MODEL F6/I and F6/D.
2026-04-24 15:01:20 -04:00
e53b580767 test(api): RBAC contract test — viewer JWT on every classified route
New test walks app.routes, classifies each APIRoute as admin/viewer/open
by identity-matching require_admin / require_viewer closures inside the
route's dependency tree, then asserts:
  - admin routes return 403 to a viewer JWT
  - viewer routes return neither 401 nor 403 to a viewer JWT
SSE routes skipped (separate scope under F6). Role hints deliberately
NOT encoded in the OpenAPI spec — classification stays server-side so
/openapi.json can't be used to enumerate admin routes.

Resolves THREAT_MODEL F2/I + F5/E; paired with the existing
test_schemathesis.py::test_auth_enforcement (401-half coverage).
2026-04-24 14:00:12 -04:00
99ccd41bb5 feat(api/artifacts): explicit Content-Disposition + X-Content-Type-Options
Harden the attacker-controlled artifact download path (F7) with explicit
response headers instead of relying on Starlette's defaults (which only
emit attachment for non-ASCII filenames and never set nosniff). Also
resolves the THREAT_MODEL F7 path-traversal row (containment check was
already in _resolve_artifact_path) and the fleet-deploy detail=str(e)
audit (all four sites are admin-gated deliberate validator UX or
structured worker-response fields).
2026-04-24 13:24:34 -04:00
323077b383 fix(web/transcripts): fall back to shard-scan when Log row has no shard_path
sessrec.c emits the session_recorded SD blob with sid/service/src_ip/
duration_s/bytes/truncated — it never emitted shard_path. The web
handler still asked for fields.shard_path, got "", tripped the
sessions-YYYY-MM-DD.jsonl basename regex and returned
400 "invalid shard name" for every legitimate transcript request.

Handler now:
- Fast-paths when fields.shard_path IS present and validates
  (for any future emitter or ingester that backfills it).
- Otherwise enumerates sessions-YYYY-MM-DD.jsonl shards under
  ARTIFACTS_ROOT/{decky}/{service}/transcripts/ (newest first) and
  returns the first one whose per-sid index contains our sid.
- Security invariant preserved: only files whose basename matches the
  _SHARD_BASENAME_RE are ever opened, and they always resolve inside
  ARTIFACTS_ROOT. A forged fields.shard_path is silently ignored.
- Soft-fails OSError/PermissionError on the transcripts dir (decky
  containers often write it with a uid the API can't read) — returns
  404 instead of a 500 traceback.

test_forged_shard_path_blocked updated to match the new semantics:
forgery is ignored, the real shard is served via fallback. The
invariant (no /etc/passwd access) is still asserted by the fact
that status is 200 with data from the test shard.
2026-04-24 01:18:40 -04:00
ef4179ea1f feat(api): opaque 500 handler + error_id correlation for unhandled exceptions
Registers a generic @app.exception_handler(Exception) that catches anything
uncaught in route handlers / dependencies. Prod response is opaque:
{detail: 'Internal Server Error', error_id: <uuid4 hex>}. Dev mode
(DECNET_DEVELOPER=True) adds exception_type and traceback fields so
failures are debuggable without tailing server logs.

The error_id is logged alongside the full traceback server-side, letting
operators correlate a user's 500 report with the exact exception via
`grep <error_id> /var/log/decnet.log`.

FastAPI's own HTTPException routing and the existing
RequestValidationError / ValidationError / RateLimitExceeded handlers
still take precedence — this handler only fires on genuinely-uncaught
exceptions.

Flips threat model F1/I 'traceback / stack trace leakage' from ? to M
and logs a follow-up checklist entry for 4 detail=str(e) sites in the
fleet deploy router (admin-gated, different threat class, separate
audit).
2026-04-23 14:07:32 -04:00
2f4f81e5de feat(api): rate-limit /auth/login + scaffold threat model
Adds slowapi two-bucket rate limit on /auth/login — 10 attempts per
5 minutes per-IP AND per-username, tripping either → 429. Per-IP
catches botnets hitting one account; per-username catches distributed
credential stuffing against one account. In-memory storage: dashboard
API is single-process, Redis is disproportionate for v1.

X-Forwarded-For is deliberately NOT trusted (spoofable); reverse-proxy
deployments get one shared bucket per proxy IP. Logged in the threat
model as accepted risk DA-08, to be revisited when a verified-proxy
config lands.

Also scaffolds development/THREAT_MODEL.md with STRIDE-per-element
methodology, system-context DFD, and Dashboard↔API as the first fully
worked component (7 sub-flows, ~50 threat entries). F1 Authn ships
with 3 threats mitigated: rate limit (new), uniform 401 (verified
already in place), bcrypt length clamp (verified already in place via
Pydantic max_length=72).
2026-04-23 13:25:28 -04:00
c50448995b feat(smtp): capture full messages + attachments to disk
SMTP template now writes each accepted DATA body as a .eml file into a
bind-mounted per-decky quarantine dir and emits a `message_stored` log
with sha256, size, decoded headers, and an attachment manifest
(filename + sha256 + size + content-type). Attachment hashing uses the
*decoded* payload so operators can match against VT / MalwareBazaar
directly. Body accumulator is capped at SMTP_MAX_BODY_BYTES (default
10 MB, matching the EHLO SIZE advert) so a streaming client can't OOM
the container.

The existing /api/v1/artifacts/{decky}/{stored_as} endpoint now takes
an optional ?service= query param (defaults to ssh for back-compat)
and can serve .eml files out of the smtp subdir. Forensic metadata
rides the normal log pipeline, same as SSH file_captured.
2026-04-22 22:17:50 -04:00
13ea916943 feat(workers): add start + start-all endpoints (systemd supervisor)
POST /api/v1/workers/{name}/start — 202 on acceptance, 404 unknown
worker, 503 if the unit file is not installed, 502 if systemctl
returns non-zero (stderr snippet in detail, full stack logged).
Admin only.

POST /api/v1/workers/start-all — best-effort: walks the worker list
in dependency order (bus → api → data-plane), skips already-active
and uninstalled units, aggregates outcomes into
{started, already_running, failed[]}. Returns 200 even on partial
failure; the caller reads the three lists.

Both endpoints delegate to the systemd_control helper, so the attack
surface for "what gets executed" is locked to `decnet-<validated-name>
.service` at two layers (router KNOWN_WORKERS + helper regex).
2026-04-22 14:12:29 -04:00
0fbb07c2ec feat(workers): bus-backed Workers panel (registry, control, installed flag)
Ships the backend half of Config → Workers:

* Worker registry aggregates `system.*.health` + `system.bus.health`
  heartbeats into a last-seen dict; OK / STALE / UNKNOWN tiers drop
  out of a 90s window (3× the 30s heartbeat interval).
* `GET /api/v1/workers` returns the snapshot plus `bus_connected`
  (so the UI can explain "all UNKNOWN" when the bus socket is down)
  and a per-row `installed` flag populated from
  `systemctl list-unit-files decnet-*.service` (cached 30s).
* `POST /api/v1/workers/{name}/stop` publishes a stop intent on
  `system.<name>.control`; workers listen via the shared control
  listener in `bus/publish.py`.
* Heartbeat + control listener wired into collector / profiler /
  sniffer / prober / mutator worker loops. API self-heartbeats too
  so the panel always has one ground-truth row.
* Topic helper `system_control(name)` + tests covering builder
  validation, control listener shutdown path, and the API surface
  (auth gating, bus-connected field, unknown-name 404).

Adds `StartFailure` / `StartAllResponse` models in anticipation of
the upcoming start endpoints (DEBT-034).
2026-04-22 14:10:39 -04:00
6725197d58 test(web): transcripts API + attacker-transcripts router coverage
Paging, truncation surfacing, admin gate, path traversal, sid-regex and
decky-mismatch rejection for /transcripts; mirror coverage for
/attackers/{uuid}/transcripts. Flips the Session Recording box in the
roadmap (sessrec pty relay now shipping end-to-end).
2026-04-21 23:11:40 -04:00
1968f6e741 test(mutator,web): cover bus publishes, bus-wake, and SSE events route
- tests/topology/test_mutator.py: reconcile_topologies publishes
  applying+applied on success, applying+failed+status on failure; and
  stays safe when bus=None. _wake_on_enqueue sets its asyncio.Event
  on every matching enqueue event.
- tests/api/topology/test_mutations.py: POST /mutations publishes
  mutation.enqueued after a successful DB write, via a FakeBus
  injected in place of the app-wide bus singleton.
- tests/api/topology/test_events_stream.py: SSE route returns 401
  unauthenticated, 404 for unknown topologies, and (driving the
  async generator directly) emits a snapshot on connect plus
  forwards a published mutation.applied as an `event: mutation.applied`
  SSE frame.
2026-04-21 14:39:12 -04:00
12e18b75db feat(swarm): expose needs_resync on TopologySummary + upsert record_error
Two small observability follow-ups to the phase-1 agent/topology wiring:

TopologySummary now carries needs_resync so operators can see the
heartbeat's resync flag via the topology list/detail API without
dropping into the DB.

TopologyStore.record_error becomes an upsert — when a docker/compose
failure fires during the first materialise (put() never reached), we
still land a marker row so GET /topology/state surfaces the error and
the next heartbeat carries an empty applied_version_hash. That empty
hash is what master's heartbeat check relies on to flag the topology
for resync instead of assuming the apply succeeded.
2026-04-21 01:41:30 -04:00
5a0cf5d7c8 feat(topology): add target_host_uuid to pin topologies to swarm agents
Adds the `target_host_uuid` FK on `Topology` plus wiring through the
two create endpoints (`POST /topologies`, `POST /topologies/blank`).
Validates the mode/host pair: `mode='agent'` now requires a known,
routable host; `mode='unihost'` must leave the field unset.
Surfaced on `TopologySummary` so list/detail responses expose it.
Purely additive at the schema level — existing unihost flows unchanged
(field defaults to `NULL`).

Step 1 of the agent <-> topology integration.
2026-04-21 01:19:45 -04:00
d06b04221f feat(api/topology): live mutation queue endpoints (POST/GET /mutations) 2026-04-20 19:38:55 -04:00
ff0b2efbb0 feat(api/topology): pending-only child CRUD for LANs, deckies, edges 2026-04-20 19:37:16 -04:00
999113e3c3 feat(api/topology): POST/DELETE/deploy endpoints for MazeNET topologies 2026-04-20 19:34:35 -04:00
f182c98ffa feat(api): phase 3 step 2 — topology read endpoints (list/get/status/catalog)
GET /api/v1/topologies — paginated list with status filter. Extends
repo.list_topologies() to accept limit/offset and adds count_topologies()
for the total envelope field.

GET /api/v1/topologies/{id} — hydrated TopologyDetail; 404 if missing.
GET /api/v1/topologies/{id}/status-events — audit trail, limit-capped.

Catalog helpers for the phase-4 canvas UI:
* GET /topologies/services — full service catalog.
* GET /topologies/next-subnet?base=172.20 — wraps SubnetAllocator against
  reserved_subnets across non-torn-down topologies.
* GET /topologies/{id}/lans/{lan_id}/next-ip — IPAllocator pre-seeded
  with existing decky IPs in that LAN.

All read routes are viewer-or-admin. Sub-routers are included in an
order that keeps literal catalog paths (/services, /next-subnet) from
being shadowed by the /{topology_id} trie branch.
2026-04-20 18:25:33 -04:00
2379b2aeda feat(api): phase 3 step 1 — topology request/response models + router skeleton
Add Pydantic DTOs in decnet/web/db/models.py covering every phase-3
endpoint shape: TopologyGenerateRequest, TopologySummary/Detail, child
create/update requests, MutationEnqueueRequest (Literal op guard),
MutationRow with JSON-payload decoder, validation/version/not-editable
error envelopes, and the three catalog responses.

Create decnet/web/router/topology/ as an import-safe package exporting
topology_router (prefix /topologies) — sub-routers land step-by-step in
subsequent commits. Mount under the main api router alongside swarm_mgmt.

tests/api/topology/test_models.py pins repo-dict ↔ DTO parity so future
repo-row drift breaks the contract test before the endpoints.
2026-04-20 18:16:30 -04:00
47f2ca8d5f added(tests): schemathesis contract fuzzing at the agent and swarmctl level
Some checks failed
CI / Lint (ruff) (push) Successful in 17s
CI / SAST (bandit) (push) Failing after 19s
CI / Dependency audit (pip-audit) (push) Failing after 38s
CI / Test (Standard) (3.11) (push) Has been skipped
CI / Test (Standard) (3.12) (push) Has been skipped
CI / Test (Live) (3.11) (push) Has been skipped
CI / Test (Fuzz) (3.11) (push) Has been skipped
CI / Merge dev → testing (push) Has been skipped
CI / Prepare Merge to Main (push) Has been skipped
CI / Finalize Merge to Main (push) Has been skipped
2026-04-20 01:27:39 -04:00
262a84ca53 refactor(cli): split decnet/cli.py monolith into decnet/cli/ package
The 1,878-line cli.py held every Typer command plus process/HTTP helpers
and mode-gating logic. Split into one module per command using a
register(app) pattern so submodules never import app at module scope,
eliminating circular-import risk.

- utils.py: process helpers, _http_request, _kill_all_services, console, log
- gating.py: MASTER_ONLY_* sets, _require_master_mode, _gate_commands_by_mode
- deploy.py: deploy + _deploy_swarm (tightly coupled)
- lifecycle.py: status, teardown, redeploy
- workers.py: probe, collect, mutate, correlate
- inventory.py, swarm.py, db.py, and one file per remaining command

__init__.py calls register(app) on each module then runs the mode gate
last, and re-exports the private symbols tests patch against
(_db_reset_mysql_async, _kill_all_services, _require_master_mode, etc.).

Test patches retargeted to the submodule where each name now resolves.
Enroll-bundle tarball test updated to assert decnet/cli/__init__.py.

No behavioral change.
2026-04-19 22:42:52 -04:00
9d68bb45c7 feat(web): async teardowns — 202 + background task, UI allows parallel queue
Teardowns were synchronous all the way through: POST blocked on the
worker's docker-compose-down cycle (seconds to minutes), the frontend
locked tearingDown to a single string so only one button could be armed
at a time, and operators couldn't queue a second teardown until the
first returned. On a flaky worker that meant staring at a spinner for
the whole RTT.

Backend: POST /swarm/hosts/{uuid}/teardown returns 202 the instant the
request is validated. Affected shards flip to state='tearing_down'
synchronously before the response so the UI reflects progress
immediately, then the actual AgentClient call + DB cleanup run in an
asyncio.create_task (tracked in a module-level set to survive GC and
to be drainable by tests). On failure the shard flips to
'teardown_failed' with the error recorded — nothing is re-raised,
since there's no caller to catch it.

Frontend: swap tearingDown / decommissioning from 'string | null' to
'Set<string>'. Each button tracks its own in-flight state; the poll
loop picks up the final shard state from the backend. Multiple
teardowns can now be queued without blocking each other.
2026-04-19 20:30:56 -04:00
07ec4bc269 fix(fleet): INI fully replaces prior decky state on redeploy
Submitting an INI with a single [decky1] was silently redeploying the
deckies from the *previous* deploy too. POST /deckies/deploy merged the
new INI into the stored DecnetConfig by name, so a 1-decky INI on top of
a prior 3-decky run still pushed 3 deckies to the worker. Those stale
decky2/decky3 kept their old IPs, collided on the parent NIC, and the
agent failed with 'Address already in use' — the deploy the operator
never asked for.

The INI is the source of truth for which deckies exist this deploy.
Full replace: config.deckies = list(new_decky_configs). Operators who
want to add more deckies should list them all in the INI.

Update the deploy-limit test to reflect the new replace semantics, and
add a regression test asserting prior state is dropped.
2026-04-19 20:24:29 -04:00
5dad1bb315 feat(swarm): remote teardown API + UI (per-decky and per-host)
Agents already exposed POST /teardown; the master was missing the plumbing
to reach it. Add:

- POST /api/v1/swarm/hosts/{uuid}/teardown — admin-gated. Body
  {decky_id: str|null}: null tears the whole host, a value tears one decky.
  On worker failure the master returns 502 and leaves DB shards intact so
  master and agent stay aligned.
- BaseRepository.delete_decky_shard(name) + sqlmodel impl for per-decky
  cleanup after a single-decky teardown.
- SwarmHosts page: "Teardown all" button (keeps host enrolled).
- SwarmDeckies page: per-row "Teardown" button.

Also exclude setuptools' build/ staging dir from the enrollment tarball —
`pip install -e` on the master generates build/lib/decnet_web/node_modules
and the bundle walker was leaking it to agents. Align pyproject's bandit
exclude with the git-hook invocation so both skip decnet/templates/.
2026-04-19 19:39:28 -04:00
2bef3edb72 feat(swarm): unbundle master-only code from agent tarball + sync systemd units on update
Agents now ship with collector/prober/sniffer as systemd services; mutator,
profiler, web, and API stay master-only (profiler rebuilds attacker profiles
against the master DB — no per-host DB exists). Expand _EXCLUDES to drop the
full decnet/web, decnet/mutator, decnet/profiler, and decnet_web trees from
the enrollment bundle.

Updater now calls _heal_path_symlink + _sync_systemd_units after rotation so
fleets pick up new unit files and /usr/local/bin/decnet tracks the shared venv
without a manual reinstall. daemon-reload runs once per update when any unit
changed.

Fix _service_registry matchers to accept systemd-style /usr/local/bin/decnet
cmdlines (psutil returns a list — join to string before substring-checking)
so agent-mode `decnet status` reports collector/prober/sniffer correctly.
2026-04-19 19:19:17 -04:00
6d7877c679 feat(swarm): per-host microservices as systemd units, mutator off agents
Previously `decnet status` on an agent showed every microservice as DOWN
because deploy's auto-spawn was unihost-scoped and the agent CLI gate
hid the per-host commands. Now:

  - collect, probe, profiler, sniffer drop out of MASTER_ONLY_COMMANDS
    (they run per-host; master-side work stays master-gated).
  - mutate stays master-only (it orchestrates swarm-wide respawns).
  - decnet/mutator/ excluded from agent tarballs — never invoked there.
  - decnet/web exclusion tightened: ship db/ + auth.py + dependencies.py
    (profiler needs the repo singleton), drop api.py, swarm_api.py,
    ingester.py, router/, templates/.
  - Four new systemd unit templates (decnet-collector/prober/profiler/
    sniffer) shipped in every enrollment tarball.
  - enroll_bootstrap.sh enables + starts all four alongside agent and
    forwarder at install time.
  - updater restarts the aux units on code push so they pick up the new
    release (best-effort — legacy enrollments without the units won't
    fail the update).
  - status table hides Mutator + API rows in agent mode.
2026-04-19 18:58:48 -04:00
ee9ade4cd5 feat(enroll): strip master API and frontend from agent tarball
Agents never run the FastAPI master app (decnet/web/) or serve the React
frontend (decnet_web/) — they run decnet.agent, decnet.updater, and
decnet.forwarder, none of which import decnet.web. Shipping the master
tree bloats every enrollment payload and needlessly widens the worker's
attack surface.

Excluded paths are unreachable on the worker (all cli.py imports of
decnet.web are inside master-only command bodies that the agent-mode
gate strips). Tests assert neither tree leaks into the tarball.
2026-04-19 18:47:03 -04:00
a0a241f65d feat(enroll): decnet-updater now runs under systemd, not a --daemon fork
Bootstrap used to end with `decnet updater --daemon` which forks and
detaches — invisible to systemctl, no auto-restart, dies on reboot.
Ships a decnet-updater.service template matching the pattern of the
other units (Restart=on-failure, log to /var/log/decnet/decnet.updater.log,
certs from /etc/decnet/updater, install tree at /opt/decnet), bundles
it alongside agent/forwarder/engine units, and the installer now
`systemctl enable --now`s it when --with-updater is set.
2026-04-19 18:19:24 -04:00
5df995fda1 feat(enroll): opt-in IPvlan per-agent for Wi-Fi-bridged VMs
Wi-Fi APs bind one MAC per associated station, so VirtualBox/VMware
guests bridged over Wi-Fi rotate the VM's DHCP lease when Docker's
macvlan starts emitting container-MAC frames through the vNIC. Adds a
`use_ipvlan` toggle on the Agent Enrollment tab (mirrors the updater
daemon checkbox): flips the flag on SwarmHost, bakes `ipvlan=true` into
the agent's decnet.ini, and `_worker_config` forces ipvlan=True on the
per-host shard at dispatch. Safe no-op on wired/bare-metal agents.
2026-04-19 17:57:45 -04:00
6d7567b6bb fix(fleet): reset stale host_uuid on carried-over deckies before dispatch
Deckies merged in from a prior deployment's saved state kept their
original host_uuid — which dispatch_decnet_config then 404'd on if that
host had since been decommissioned or re-enrolled at a different uuid.
Before round-robin assignment, drop any host_uuid that isn't in the live
swarm_hosts set so orphaned entries get reassigned instead of exploding
with 'unknown host_uuid'.
2026-04-19 06:27:34 -04:00
79db999030 feat(fleet): auto-swarm deploy — shard across enrolled workers when master
POST /deckies/deploy now branches on DECNET_MODE + enrolled host presence:
when the caller is a master with at least one reachable swarm host, round-
robin host_uuids are assigned over new deckies and the config is dispatched
via AgentClient. Falls back to local docker-compose otherwise.

Extracts the dispatch loop from api_deploy_swarm into dispatch_decnet_config
so both endpoints share the same shard/dispatch/persist path. Adds
GET /system/deployment-mode for the UI to show 'will shard across N hosts'
vs 'will deploy locally' before the operator clicks deploy.
2026-04-19 06:09:08 -04:00
899ea559d9 feat(enroll): systemd units for agent/forwarder/engine + log-directory INI key
Rename log-file-path -> log-directory (maps to DECNET_LOG_DIRECTORY). Bundle
now ships three systemd units rendered with agent_name/master_host and installs
them into /etc/systemd/system/. Bootstrap replaces direct 'decnet X --daemon'
calls with systemctl enable --now. Each unit pins DECNET_SYSTEM_LOGS so agent,
forwarder, and deckies logs land at decnet.{agent,forwarder}.log and decnet.log
under /var/log/decnet.
2026-04-19 05:46:08 -04:00
ff4c993617 refactor(swarm-mgmt): backfill host address from agent's .tgz source IP 2026-04-19 05:20:29 -04:00
e32fdf9cbf feat(swarm-mgmt): agent_host + updater opt-in; prevent duplicate forwarder spawn 2026-04-19 05:12:55 -04:00
95ae175e1b fix(swarm-mgmt): exclude .env from bundle, chmod +x decnet, mkdir log 2026-04-19 04:58:55 -04:00
b4df9ea0a1 fix(swarm-mgmt): bundle URLs target master_host, not dashboard base_url 2026-04-19 04:52:20 -04:00
c6f7de30d2 feat(swarm-mgmt): agent enrollment bundle flow + admin swarm endpoints 2026-04-19 04:25:57 -04:00