Commit Graph

232 Commits

Author SHA1 Message Date
e2d6f857b5 refactor(swarm): move router DTOs into decnet/web/db/models.py
_schemas.py was a local exception to the codebase convention. The rest
of the app keeps all API request/response DTOs in decnet/web/db/models.py
alongside UserResponse, DeployIniRequest, etc. — the swarm endpoints now
follow the same convention (SwarmEnrollRequest, SwarmHostView, etc).
Deletes decnet/web/router/swarm/_schemas.py.
2026-04-18 19:28:15 -04:00
811136e600 refactor(swarm): one file per endpoint, matching existing router layout
Splits the three grouped router files into eight api_<verb>_<resource>.py
modules under decnet/web/router/swarm/ to match the convention used by
router/fleet/ and router/config/. Shared request/response models live in
_schemas.py. Keeps each endpoint easy to locate and modify without
stepping on siblings.
2026-04-18 19:23:06 -04:00
63b0a58527 feat(swarm): master-side SWARM controller (swarmctl) + agent CLI
Adds decnet/web/swarm_api.py as an independent FastAPI app with routers
for host enrollment, deployment dispatch (sharding DecnetConfig across
enrolled workers via AgentClient), and active health probing. Runs as
its own uvicorn subprocess via 'decnet swarmctl', mirroring the isolation
pattern used by 'decnet api'. Also wires up 'decnet agent' CLI entry for
the worker side.

29 tests added under tests/swarm/test_swarm_api.py cover enrollment
(including bundle generation + duplicate rejection), host CRUD, sharding
correctness, non-swarm-mode rejection, teardown, and health probes with
a stubbed AgentClient.
2026-04-18 19:18:33 -04:00
cd0057c129 feat(swarm): DeckyConfig.host_uuid + fix agent log/status field refs
- decnet.models.DeckyConfig grows an optional 'host_uuid' (the SwarmHost
  that runs this decky). Defaults to None so legacy unihost state files
  deserialize unchanged.
- decnet.agent.executor: replace non-existent config.name references
  with config.mode / config.interface in logs and status payload.
- tests/swarm/test_state_schema.py covers legacy-dict roundtrip, field
  default, and swarm-mode assignments.
2026-04-18 19:10:25 -04:00
0c77cdab32 feat(swarm): master AgentClient — mTLS httpx wrapper around worker API
decnet.swarm.client exposes:
- MasterIdentity / ensure_master_identity(): the master's own CA-signed
  client bundle, issued once into ~/.decnet/ca/master/.
- AgentClient: async-context httpx wrapper that talks to a worker agent
  over mTLS. health/status/deploy/teardown methods mirror the agent API.

SSL context is built from a bare ssl.SSLContext(PROTOCOL_TLS_CLIENT)
instead of httpx.create_ssl_context — the latter layers on default-CA
and purpose logic that broke private-CA mTLS. Server cert is pinned by
CA + chain, not DNS (workers enroll with arbitrary SANs).

tests/swarm/test_client_agent_roundtrip.py spins uvicorn in-process
with real certs on disk and verifies:
- A CA-signed master client passes health + status calls.
- An impostor whose cert comes from a different CA cannot connect.
2026-04-18 19:08:36 -04:00
8257bcc031 feat(swarm): worker agent + fix pre-existing base_repo coverage test
Worker agent (decnet.agent):
- mTLS FastAPI service exposing /deploy, /teardown, /status, /health,
  /mutate. uvicorn enforces CERT_REQUIRED with the DECNET CA pinned.
- executor.py offloads the blocking deployer onto asyncio.to_thread so
  the event loop stays responsive.
- server.py refuses to start without an enrolled bundle in
  ~/.decnet/agent/ — unauthenticated agents are not a supported mode.
- docs/openapi disabled on the agent — narrow attack surface.

tests/test_base_repo.py: DummyRepo was missing get_attacker_artifacts
(pre-existing abstractmethod) and so could not be instantiated. Added
the stub + coverage for the new swarm CRUD surface on BaseRepository.
2026-04-18 07:15:53 -04:00
d3b90679c5 feat(swarm): PKI module — self-managed CA for master/worker mTLS
decnet.swarm.pki provides:
- generate_ca() / ensure_ca() — self-signed root, PKCS8 PEM, 4096-bit.
- issue_worker_cert() — per-worker keypair + cert signed by the CA with
  serverAuth + clientAuth EKU so the same identity backs the agent's
  HTTPS endpoint AND the syslog-over-TLS upstream.
- write_worker_bundle() / load_worker_bundle() — persist with 0600 on
  private keys.
- fingerprint() — SHA-256 DER hex for master-side pinning.

tests/swarm/test_pki.py covers:
- CA idempotency on disk.
- Signed chain validates against CA subject.
- SAN population (DNS + IP).
- Bundle roundtrip with 0600 key perms.
- End-to-end mTLS handshake between two CA-issued peers.
- Cross-CA client rejection (handshake fails).
2026-04-18 07:09:58 -04:00
6657d3e097 feat(swarm): add SwarmHost and DeckyShard tables + repo CRUD
Introduces the master-side persistence layer for swarm mode:
- SwarmHost: enrolled worker metadata, cert fingerprint, heartbeat.
- DeckyShard: per-decky host assignment, state, last error.
Repo methods are added as default-raising on BaseRepository so unihost
deployments are untouched; SQLModelRepository implements them (shared
between the sqlite and mysql subclasses per the existing pattern).
2026-04-18 07:09:29 -04:00
293da364a6 chores: fix linting 2026-04-18 06:46:10 -04:00
2bf886e18e feat(sniffer): probe ipvlan host iface when macvlan is absent
The host-side sniffer interface depends on the deploy's driver choice
(--ipvlan flag). Instead of hardcoding HOST_MACVLAN_IFACE, probe both
names and pick whichever exists; warn and disable cleanly if neither
is present. Explicit DECNET_SNIFFER_IFACE still wins.
2026-04-18 05:37:20 -04:00
8bdc5b98c9 feat(collector): parse real PROCID and extract IPs from logger kv pairs
- Relaxed RFC 5424 regex to accept either NILVALUE or a numeric PROCID;
  sshd / sudo go through rsyslog with their real PID, while
  syslog_bridge emitters keep using '-'.
- Added a fallback pass that scans the MSG body for IP-shaped
  key=value tokens. This rescues attacker attribution for plain logger
  callers like the SSH PROMPT_COMMAND shim, which emits
  'CMD … src=IP …' without SD-element params.
2026-04-18 05:37:08 -04:00
41fd496128 feat(web): attacker artifacts endpoint + UI drawer
Adds the server-side wiring and frontend UI to surface files captured
by the SSH honeypot for a given attacker.

- New repository method get_attacker_artifacts (abstract + SQLModel
  impl) that joins the attacker's IP to `file_captured` log rows.
- New route GET /attackers/{uuid}/artifacts.
- New router /artifacts/{decky}/{service}/{stored_as} that streams a
  quarantined file back to an authenticated viewer.
- AttackerDetail grows an ArtifactDrawer panel with per-file metadata
  (sha256, size, orig_path) and a download action.
- ssh service fragment now sets NODE_NAME=decky_name so logs and the
  host-side artifacts bind-mount share the same decky identifier.
2026-04-18 05:36:48 -04:00
8dd4c78b33 refactor: strip DECNET tokens from container-visible surface
Rename the container-side logging module decnet_logging → syslog_bridge
(canonical at templates/syslog_bridge.py, synced into each template by
the deployer). Drop the stale per-template copies; setuptools find was
picking them up anyway. Swap useradd/USER/chown "decnet" for "logrelay"
so no obvious token appears in the rendered container image.

Apply the same cloaking pattern to the telnet template that SSH got:
syslog pipe moves to /run/systemd/journal/syslog-relay and the relay
is cat'd via exec -a "systemd-journal-fwd". rsyslog.d conf rename
99-decnet.conf → 50-journal-forward.conf. SSH capture script:
/var/decnet/captured → /var/lib/systemd/coredump (real systemd path),
logger tag decnet-capture → systemd-journal. Compose volume updated
to match the new in-container quarantine path.

SD element ID shifts decnet@55555 → relay@55555; synced across
collector, parser, sniffer, prober, formatter, tests, and docs so the
host-side pipeline still matches what containers emit.
2026-04-17 22:57:53 -04:00
a773dddd5c feat(ssh): capture attacker-dropped files with session attribution
inotifywait watches writable paths in the SSH decky and mirrors any
file close_write/moved_to into a per-decky host-mounted quarantine dir.
Each artifact carries a .meta.json with attacker attribution resolved
by walking the writer PID's PPid chain to the sshd session leader,
then cross-referencing ss and utmp for source IP/user/login time.
Also emits an RFC 5424 syslog line per capture for SIEM correlation.
2026-04-17 22:20:05 -04:00
fb69a06ab3 fix(db): detach session cleanup onto fresh task on cancellation
Previous attempt (shield + sync invalidate fallback) didn't work
because shield only protects against cancellation from *other* tasks.
When the caller task itself is cancelled mid-query, its next await
re-raises CancelledError as soon as the shielded coroutine yields —
rollback inside session.close() never completes, the aiomysql
connection is orphaned, and the pool logs 'non-checked-in connection'
when GC finally reaches it.

Hand exception-path cleanup to loop.create_task() so the new task
isn't subject to the caller's pending cancellation. close() (and the
invalidate() fallback for a dead connection) runs to completion.
Success path is unchanged — still awaits close() inline so callers
see commit visibility and pool release before proceeding.
2026-04-17 21:13:43 -04:00
1446f6da94 fix(db): invalidate pool connection when cancelled close fails
Under high-concurrency MySQL load, uvicorn cancels request tasks when
clients disconnect.  If cancellation lands mid-query, session.close()
tries to ROLLBACK on a connection that aiomysql has already marked as
closed — raising InterfaceError("Cancelled during execution") and
leaving the connection checked-out until GC, which the pool then
warns about as a 'non-checked-in connection'.

The old fallback tried sync.rollback() + sync.close(), but those still
go through the async driver and fail the same way on a dead connection.
Replace them with session.sync_session.invalidate(), which just flips
the pool's internal record — no I/O, so it can't be cancelled — and
tells the pool to drop the connection immediately instead of waiting
for garbage collection.
2026-04-17 21:04:04 -04:00
e967aaabfb perf: cache get_user_by_username on the login hot path
Locust @task(2) hammers /auth/login in steady state on top of the
on_start burst. After caching the uuid-keyed user lookup and every
other read endpoint, login alone accounted for 47% of total
_execute at 500c/u — pure DB queueing on SELECT users WHERE
username=?.

5s TTL, positive hits only (misses bypass so a freshly-created
user can log in immediately). Password verify still runs against
the cached hash, so security is unchanged — the only staleness
window is: a changed password accepts the old password for up to
5s until invalidate_user_cache fires (it's called on every write).
2026-04-17 20:36:39 -04:00
255c2e5eb7 perf: cache auth user-lookup and admin list_users
The per-request SELECT users WHERE uuid=? in require_role was the
hidden tax behind every authed endpoint — it kept _execute at ~60%
across the profile even after the page caches landed. Even /health
(with its DB and Docker probes cached) was still 52% _execute from
this one query.

- dependencies.py: 10s TTL cache on get_user_by_uuid, well below JWT
  expiry. invalidate_user_cache(uuid) is called on password change,
  role change, and user delete.
- api_get_config.py: 5s TTL cache on the admin branch's list_users()
  (previously fetched every /config call). Invalidated on user
  create/update/delete.
- api_change_pass.py + api_manage_users.py: invalidation hooks on
  all user-mutating endpoints.
2026-04-17 19:56:39 -04:00
2dd86fb3bb perf: cache /bounty, /logs/histogram, /deckies; bump /config TTL to 5s
Round-2 follow-up: profile at 500c/u showed _execute still dominating
the uncached read endpoints (/bounty 76%, /logs/histogram 73%,
/deckies 56%). Same router-level TTL pattern as /stats — 5s window,
asyncio.Lock to collapse concurrent calls into one DB hit.

- /bounty: cache default unfiltered page (limit=50, offset=0,
  bounty_type=None, search=None). Filtered requests bypass.
- /logs/histogram: cache default (interval_minutes=15, no filters).
  Filtered / non-default interval requests bypass.
- /deckies: cache full response (endpoint takes no params).
- /config: bump _STATE_TTL from 1.0 to 5.0 — admin writes are rare,
  1s was too short for bursts to coalesce at high concurrency.
2026-04-17 19:30:11 -04:00
3106d03135 perf(db): default pool_pre_ping=false for SQLite
SQLite is a local file — a SELECT 1 per session checkout is pure
overhead. Env var DECNET_DB_POOL_PRE_PING stays for anyone running
on a network-mounted volume. MySQL backend keeps its current default.
2026-04-17 19:11:07 -04:00
3cc5ba36e8 fix(cli): keep FileNotFoundError handling on decnet api
Popen moved inside the try so a missing uvicorn falls through to the
existing error message instead of crashing the CLI. test_cli was still
patching the old subprocess.run entrypoint; switched both api command
tests to patch subprocess.Popen / os.killpg to match the current path.
2026-04-17 19:09:15 -04:00
6301504c0e perf(api): TTL-cache /stats + unfiltered pagination counts
Every /stats call ran SELECT count(*) FROM logs + SELECT count(DISTINCT
attacker_ip) FROM logs; every /logs and /attackers call ran an
unfiltered count for the paginator. At 500 concurrent users these
serialize through aiosqlite's worker threads and dominate wall time.

Cache at the router layer (repo stays dialect-agnostic):
  - /stats response: 5s TTL
  - /logs total (only when no filters): 2s TTL
  - /attackers total (only when no filters): 2s TTL

Filtered paths bypass the cache. Pattern reused from api_get_config
and api_get_health (asyncio.Lock + time.monotonic window + lazy lock).
2026-04-17 19:09:15 -04:00
de4b64d857 perf(auth): avoid duplicate user lookup in require_role
require_role._check previously chained from get_current_user, which
already loaded the user — then looked it up again. Inline the decode +
single user fetch + must_change_password + role check so every
authenticated request costs one SELECT users WHERE uuid=? instead of
two.
2026-04-17 17:48:42 -04:00
b5d7bf818f feat(health): 3-tier status (healthy / degraded / unhealthy)
Only database, docker, and ingestion_worker now count as critical
(→ 503 unhealthy). attacker/sniffer/collector failures drop overall
status to degraded (still 200) so the dashboard doesn't panic when a
non-essential worker isn't running.
2026-04-17 17:48:42 -04:00
a10aee282f perf(ingester): batch log writes into bulk commits
The ingester now accumulates up to DECNET_BATCH_SIZE rows (default 100)
or DECNET_BATCH_MAX_WAIT_MS (default 250ms) before flushing through
repo.add_logs — one transaction, one COMMIT per batch instead of per
row. Under attacker traffic this collapses N commits into ⌈N/100⌉ and
takes most of the SQLite writer-lock contention off the hot path.

Flush semantics are cancel-safe: _position only advances after a batch
commits successfully, and the flush helper bails without touching the
DB if the enclosing task is being cancelled (lifespan teardown).
Un-flushed lines stay in the file and are re-read on next startup.

Tests updated to assert on add_logs (bulk) instead of the per-row
add_log that the ingester no longer uses, plus a new test that 250
lines flush in ≤5 calls.
2026-04-17 16:37:34 -04:00
11b9e85874 feat(db): bulk add_logs for one-commit ingestion batches
Adds BaseRepository.add_logs (default: loops add_log for backwards
compatibility) and a real single-session/single-commit implementation
on SQLModelRepository. Introduces DECNET_BATCH_SIZE (default 100) and
DECNET_BATCH_MAX_WAIT_MS (default 250) so the ingester can flush on
either a size or a time bound when it adopts the new method.

Ingester wiring is deferred to a later pass — the single-log path was
deadlocking tests when flushed during lifespan teardown, so this change
ships the DB primitive alone.
2026-04-17 16:23:09 -04:00
45039bd621 fix(cache): lazy-init TTL cache locks to survive event-loop turnover
A module-level asyncio.Lock binds to the loop it was first awaited on.
Under pytest-anyio (and xdist) each test spins up a new loop; any later
test that hit /health or /config would wait on a lock owned by a dead
loop and the whole worker would hang.

Create the lock on first use and drop it in the test-reset helpers so a
fresh loop always gets a fresh lock.
2026-04-17 16:23:00 -04:00
4ea1c2ff4f fix(health): move Docker client+ping off the event loop
Under CPU saturation the sync docker.from_env()/ping() calls could miss
their socket timeout, cache _docker_healthy=False, and return 503 for
the full 5s TTL window. Both calls now run on a thread so the event
loop keeps serving other requests while Docker is being probed.
2026-04-17 15:43:51 -04:00
bb8d782e42 fix(cli): kill uvicorn worker tree on Ctrl+C
With --workers > 1, SIGINT from the terminal raced uvicorn's supervisor:
some workers got signaled directly, the supervisor respawned them, and
the result behaved like a forkbomb. Start uvicorn in its own session and
signal the whole process group (SIGTERM → 10s grace → SIGKILL) when we
catch KeyboardInterrupt.
2026-04-17 15:32:08 -04:00
342916ca63 feat(cli): expose --workers on decnet api
Forwards straight to uvicorn's --workers. Default stays at 1 so the
single-worker efficiency direction is preserved; raising it is available
for threat-actor load scenarios where the honeypot needs to soak real
attack traffic without queueing on one event loop.
2026-04-17 15:22:45 -04:00
32340bea0d perf: migrate hot-path JSON serialization to orjson
stdlib json was FastAPI's default. Every response body, every SSE frame,
and every add_log/state/payload write paid the stdlib encode cost.

- pyproject.toml: add orjson>=3.10 as a core dep.
- decnet/web/api.py: default_response_class=ORJSONResponse on the
  FastAPI app, so every endpoint return goes through orjson without
  touching call sites. Explicit JSONResponse sites in the validation
  exception handlers migrated to ORJSONResponse for consistency.
- health endpoint's explicit JSONResponse → ORJSONResponse.
- SSE stream (api_stream_events.py): 6 json.dumps call sites →
  orjson.dumps(...).decode() — the per-event frames that fire on every
  sse tick.
- sqlmodel_repo.py: encode sites on the log-insert path switched to
  orjson (fields, payload, state value). Parser sites (json.loads)
  left as-is for now — not on the measured hot path.
2026-04-17 15:07:28 -04:00
f1e14280c0 perf: 1s TTL cache for /health DB probe and /config state reads
Locust hit /health and /config on every @task(3), so each request was
firing repo.get_total_logs() and two repo.get_state() calls against
aiosqlite — filling the driver queue for data that changes on the order
of seconds, not milliseconds.

Both caches follow the shape already used by the existing Docker cache:
- asyncio.Lock with double-checked TTL so concurrent callers collapse
  into one DB hit per 1s window.
- _reset_* helpers called from tests/api/conftest.py::setup_db so the
  module-level cache can't leak across tests.

tests/test_health_config_cache.py asserts 50 concurrent callers
produce exactly 1 repo call, and the cache expires after TTL.
2026-04-17 15:05:18 -04:00
931f33fb06 perf: cache Docker daemon ping in /health (5s TTL)
Creating a new docker.from_env() client per /health request opened a
fresh unix-socket connection each time. Under load that's wasteful and
hammers dockerd.

Keep a module-level client + last-check timestamp; actually ping every
5 seconds, return cached state in between. Reset helper provided for
tests.
2026-04-17 15:01:53 -04:00
467511e997 db: switch MySQL driver to asyncmy, env-tune pool, serialize DDL
- aiomysql → asyncmy on both sides of the URL/import (faster, maintained).
- Pool sizing now reads DECNET_DB_POOL_SIZE / MAX_OVERFLOW / RECYCLE /
  PRE_PING for both SQLite and MySQL engines so stress runs can bump
  without code edits.
- MySQL initialize() now wraps schema DDL in a GET_LOCK advisory lock so
  concurrent uvicorn workers racing create_all() don't hit 'Table was
  skipped since its definition is being modified by concurrent DDL'.
- sqlite & mysql repo get_log_histogram use the shared _session() helper
  instead of session_factory() for consistency with the rest of the repo.
- SSE stream_events docstring updated to asyncmy.
2026-04-17 15:01:49 -04:00
3945e72e11 perf: run bcrypt on a thread so it doesn't block the event loop
verify_password / get_password_hash are CPU-bound and take ~250ms each
at rounds=12. Called directly from async endpoints, they stall every
other coroutine for that window — the single biggest single-worker
bottleneck on the login path.

Adds averify_password / ahash_password that wrap the sync versions in
asyncio.to_thread. Sync versions stay put because _ensure_admin_user and
tests still use them.

5 call sites updated: login, change-password, create-user, reset-password.
tests/test_auth_async.py asserts parallel averify runs concurrently (~1x
of a single verify, not 2x).
2026-04-17 14:52:22 -04:00
bd406090a7 fix: re-seed admin password when still unfinalized (must_change_password=True)
_ensure_admin_user was strict insert-if-missing: once a stale hash landed
in decnet.db (e.g. from a deploy that used a different DECNET_ADMIN_PASSWORD),
login silently 401'd because changing the env var later had no effect.

Now on startup: if the admin still has must_change_password=True (they
never finalized their own password), re-sync the hash from the current
env var. Once the admin sets a real password, we leave it alone.

Found via locustfile.py login storm — see tests/test_admin_seed.py.

Note: this commit also bundles uncommitted pool-management work already
present in sqlmodel_repo.py from prior sessions.
2026-04-17 14:49:13 -04:00
cb12e7c475 fix: logging handler must not crash its caller on reopen failure
When decnet.system.log is root-owned (e.g. created by a pre-fix 'sudo
decnet deploy') and a subsequent non-root process tries to log, the
InodeAwareRotatingFileHandler raised PermissionError out of emit(),
which propagated up through logger.debug/info and killed the collector's
log stream loop ('log stream ended ... reason=[Errno 13]').

Now matches stdlib behaviour: wrap _open() in try/except OSError and
defer to handleError() on failure. Adds a regression test.

Also: scripts/profile/view.sh 'pyinstrument' keyword was matching
memray-flamegraph-*.html files. Exclude the memray-* prefix.
2026-04-17 14:01:36 -04:00
bf4afac70f fix: RotatingFileHandler reopens on external deletion/rotation
Mirrors the inode-check fix from 935a9a5 (collector worker) for the
stdlib-handler-based log paths. Both decnet.system.log (config.py) and
decnet.log (logging/file_handler.py) now use a subclass that stats the
target path before each emit and reopens on inode/device mismatch —
matching the behavior of stdlib WatchedFileHandler while preserving
size-based rotation.

Previously: rm decnet.system.log → handler kept writing to the orphaned
inode until maxBytes triggered; all lines between were lost.
2026-04-17 13:42:15 -04:00
4b15b7eb35 fix: chown log files to sudo-invoking user so non-root API can append
'sudo decnet deploy' needs root for MACVLAN, but the log files it creates
(decnet.log and decnet.system.log) end up owned by root. A subsequent
non-root 'decnet api' then crashes on PermissionError appending to them.

New decnet.privdrop helper reads SUDO_UID/SUDO_GID and chowns files/dirs
back to the invoking user. Best-effort: no-op when not root, not under
sudo, path missing, or chown fails. Applied at both log-file creation
sites (config.py system log, logging/file_handler.py syslog file).
2026-04-17 13:39:09 -04:00
140d2fbaad fix: gate embedded sniffer behind DECNET_EMBED_SNIFFER (default off)
The API's lifespan unconditionally spawned a MACVLAN sniffer task, which
duplicated the standalone 'decnet sniffer --daemon' process that
'decnet deploy' always starts — causing two workers to sniff the same
interface, double events, and wasted CPU.

Mirror the existing DECNET_EMBED_PROFILER pattern: sniffer is OFF by
default, opt in explicitly. Static regression tests guard against
accidental removal of the gate.
2026-04-17 13:35:43 -04:00
319c1dbb61 added: profiling toolchain (py-spy, pyinstrument, pytest-benchmark, memray, snakeviz)
New `profile` optional-deps group, opt-in Pyinstrument ASGI middleware
gated by DECNET_PROFILE_REQUESTS, bench marker + tests/perf/ micro-benchmarks
for repository hot paths, and scripts/profile/ helpers for py-spy/cProfile/memray.
2026-04-17 13:13:00 -04:00
9b59f8672e chores: cleanup; added: viteconfig 2026-04-16 02:09:30 -04:00
89099b903d fix: resolve schemathesis and live test failures
- Add 403 response to all RBAC-gated endpoints (schemathesis UndefinedStatusCode)
- Add 400 response to all endpoints accepting JSON bodies (malformed input)
- Add required 'title' field to schemathesis.toml for schemathesis 4.15+
- Add xdist_group markers to live tests with module-scoped fixtures to
  prevent xdist from distributing them across workers (fixture isolation)
2026-04-16 01:39:04 -04:00
29578d9d99 fix: resolve all ruff and bandit lint/security issues
- Remove unused Optional import (F401) in telemetry.py
- Move imports above module-level code (E402) in web/db/models.py
- Default API/web hosts to 127.0.0.1 instead of 0.0.0.0 (B104)
- Add usedforsecurity=False to MD5 calls in JA3/HASSH fingerprinting (B324)
- Annotate intentional try/except/pass blocks with nosec (B110)
- Remove stale nosec comments that no longer suppress anything
2026-04-16 01:04:57 -04:00
70d8ffc607 feat: complete OTEL tracing across all services with pipeline bridge and docs
Extends tracing to every remaining module: all 23 API route handlers,
correlation engine, sniffer (fingerprint/p0f/syslog), prober (jarm/hassh/tcpfp),
profiler behavioral analysis, logging subsystem, engine, and mutator.

Bridges the ingester→SSE trace gap by persisting trace_id/span_id columns on
the logs table and creating OTEL span links in the SSE endpoint. Adds log-trace
correlation via _TraceContextFilter injecting otel_trace_id into Python LogRecords.

Includes development/docs/TRACING.md with full span reference (76 spans),
pipeline propagation architecture, quick start guide, and troubleshooting.
2026-04-16 00:58:08 -04:00
04db13afae feat: cross-stage trace propagation and granular per-event spans
Collector now creates a span per event and injects W3C trace context
into JSON records. Ingester extracts that context and creates child
spans, connecting the full event journey: collector -> ingester ->
db.add_log + extract_bounty -> db.add_bounty.

Profiler now creates per-IP spans inside update_profiles with rich
attributes (event_count, is_traversal, bounty_count, command_count).

Traces in Jaeger now show the complete execution map from capture
through ingestion and profiling.
2026-04-15 23:52:13 -04:00
d1a88e75bd fix: dynamic TracedRepository proxy + disable tracing in test suite
Replace brittle explicit method-by-method proxy with __getattr__-based
dynamic proxy that forwards all args/kwargs to the inner repo. Fixes
TypeError on get_logs_after_id() where concrete repo accepts extra
kwargs beyond the ABC signature.

Pin DECNET_DEVELOPER_TRACING=false in conftest.py so .env.local
settings don't leak into the test suite.
2026-04-15 23:46:46 -04:00
65ddb0b359 feat: add OpenTelemetry distributed tracing across all DECNET services
Gated by DECNET_DEVELOPER_TRACING env var (default off, zero overhead).
When enabled, traces flow through FastAPI routes, background workers
(collector, ingester, profiler, sniffer, prober), engine/mutator
operations, and all DB calls via TracedRepository proxy.

Includes Jaeger docker-compose for local dev and 18 unit tests.
2026-04-15 23:23:13 -04:00
b437bc8eec fix: use unbuffered reads in proxy for SSE streaming
resp.read(4096) blocks until 4096 bytes accumulate, which stalls SSE
events (~100-500 bytes each) in the proxy buffer indefinitely. Switch
to read1() which returns bytes immediately available without waiting
for more. Also disable the 120s socket timeout for SSE connections.
2026-04-15 23:03:03 -04:00
a1ca5d699b fix: use dedicated thread pools for collector and sniffer workers
The collector spawned one permanent thread per Docker container via
asyncio.to_thread(), saturating the default asyncio executor. This
starved short-lived to_thread(load_state) calls in get_deckies() and
get_stats_summary(), causing the SSE stream and deckies endpoints to
hang indefinitely while other DB-only endpoints worked fine.

Give the collector and sniffer their own ThreadPoolExecutor so they
never compete with the default pool.
2026-04-15 22:57:03 -04:00