Commit Graph

395 Commits

Author SHA1 Message Date
45039bd621 fix(cache): lazy-init TTL cache locks to survive event-loop turnover
A module-level asyncio.Lock binds to the loop it was first awaited on.
Under pytest-anyio (and xdist) each test spins up a new loop; any later
test that hit /health or /config would wait on a lock owned by a dead
loop and the whole worker would hang.

Create the lock on first use and drop it in the test-reset helpers so a
fresh loop always gets a fresh lock.
2026-04-17 16:23:00 -04:00
4ea1c2ff4f fix(health): move Docker client+ping off the event loop
Under CPU saturation the sync docker.from_env()/ping() calls could miss
their socket timeout, cache _docker_healthy=False, and return 503 for
the full 5s TTL window. Both calls now run on a thread so the event
loop keeps serving other requests while Docker is being probed.
2026-04-17 15:43:51 -04:00
bb8d782e42 fix(cli): kill uvicorn worker tree on Ctrl+C
With --workers > 1, SIGINT from the terminal raced uvicorn's supervisor:
some workers got signaled directly, the supervisor respawned them, and
the result behaved like a fork bomb. Start uvicorn in its own session and
signal the whole process group (SIGTERM → 10s grace → SIGKILL) when we
catch KeyboardInterrupt.
2026-04-17 15:32:08 -04:00
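
A rough sketch of the session-plus-process-group shutdown described above, using only the standard library on POSIX. The helper names are hypothetical, and the real CLI may structure this differently.

```python
import os
import signal
import subprocess
import sys
from typing import List


def run_in_own_session(cmd: List[str]) -> subprocess.Popen:
    # start_new_session=True calls setsid() in the child, so the child leads
    # a new process group and terminal SIGINT no longer reaches it directly.
    return subprocess.Popen(cmd, start_new_session=True)


def stop_process_group(proc: subprocess.Popen, grace: float = 10.0) -> int:
    """SIGTERM the whole group, wait up to `grace` seconds, then SIGKILL."""
    try:
        # The child is the group leader, so its pid is also the group id.
        os.killpg(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        return proc.wait()  # group already gone
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
        return proc.wait()
```

In a CLI entry point, `stop_process_group(proc)` would typically be called from an `except KeyboardInterrupt:` block wrapped around `proc.wait()`.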
342916ca63 feat(cli): expose --workers on decnet api
Forwards straight to uvicorn's --workers. Default stays at 1 so the
focus on single-worker efficiency is preserved; raising it is available
for threat-actor load scenarios where the honeypot needs to soak real
attack traffic without queueing on one event loop.
2026-04-17 15:22:45 -04:00
d3f4bbb62b perf(locust): skip change-password in on_start when not required
Previously every user did login → change-pass → re-login in on_start
regardless of whether the server actually required a password change.
With bcrypt at ~250ms/call that's 3 bcrypt-bound requests per user.
At 2500 users the on_start queue was ~10k bcrypt ops — users never
escaped warmup, so @task endpoints never fired.

Login already returns must_change_password; only run the change-pass
+ re-login dance when the server says we have to. Cuts on_start from
3 requests to 1 for every user after the first DB initialization.
2026-04-17 15:15:59 -04:00
32340bea0d perf: migrate hot-path JSON serialization to orjson
stdlib json was FastAPI's default. Every response body, every SSE frame,
and every add_log/state/payload write paid the stdlib encode cost.

- pyproject.toml: add orjson>=3.10 as a core dep.
- decnet/web/api.py: default_response_class=ORJSONResponse on the
  FastAPI app, so every endpoint return goes through orjson without
  touching call sites. Explicit JSONResponse sites in the validation
  exception handlers migrated to ORJSONResponse for consistency.
- health endpoint's explicit JSONResponse → ORJSONResponse.
- SSE stream (api_stream_events.py): 6 json.dumps call sites →
  orjson.dumps(...).decode() — the per-event frames that fire on every
  SSE tick.
- sqlmodel_repo.py: encode sites on the log-insert path switched to
  orjson (fields, payload, state value). Parser sites (json.loads)
  left as-is for now — not on the measured hot path.
2026-04-17 15:07:28 -04:00
f1e14280c0 perf: 1s TTL cache for /health DB probe and /config state reads
Locust hit /health and /config on every @task(3), so each request was
firing repo.get_total_logs() and two repo.get_state() calls against
aiosqlite — filling the driver queue for data that changes on the order
of seconds, not milliseconds.

Both caches follow the shape already used by the existing Docker cache:
- asyncio.Lock with double-checked TTL so concurrent callers collapse
  into one DB hit per 1s window.
- _reset_* helpers called from tests/api/conftest.py::setup_db so the
  module-level cache can't leak across tests.

tests/test_health_config_cache.py asserts 50 concurrent callers
produce exactly 1 repo call, and the cache expires after TTL.
2026-04-17 15:05:18 -04:00
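
The double-checked TTL shape mentioned above might look roughly like this. It is a sketch under assumptions: the `TTLCache` class and its `loader` callback are hypothetical, and the lock is created lazily here for simplicity.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Optional


class TTLCache:
    """Collapse concurrent callers into one loader call per TTL window."""

    def __init__(self, loader: Callable[[], Awaitable[Any]], ttl: float = 1.0) -> None:
        self._loader = loader
        self._ttl = ttl
        self._value: Any = None
        self._expires_at = 0.0
        self._lock: Optional[asyncio.Lock] = None

    def _fresh(self) -> bool:
        return time.monotonic() < self._expires_at

    async def get(self) -> Any:
        # First check: fast path without taking the lock.
        if self._fresh():
            return self._value
        if self._lock is None:
            self._lock = asyncio.Lock()
        async with self._lock:
            # Second check: another caller may have refreshed while we waited.
            if self._fresh():
                return self._value
            self._value = await self._loader()
            self._expires_at = time.monotonic() + self._ttl
            return self._value

    def reset(self) -> None:
        """Test helper: forget the cached value and the lock."""
        self._value = None
        self._expires_at = 0.0
        self._lock = None
```

The second freshness check inside the lock is what keeps the waiting callers from each issuing their own DB read once the first one finishes.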
931f33fb06 perf: cache Docker daemon ping in /health (5s TTL)
Creating a new docker.from_env() client per /health request opened a
fresh unix-socket connection each time. Under load that's wasteful and
hammers dockerd.

Keep a module-level client + last-check timestamp; actually ping every
5 seconds, return cached state in between. Reset helper provided for
tests.
2026-04-17 15:01:53 -04:00
467511e997 db: switch MySQL driver to asyncmy, env-tune pool, serialize DDL
- aiomysql → asyncmy on both sides of the URL/import (faster, maintained).
- Pool sizing now reads DECNET_DB_POOL_SIZE / MAX_OVERFLOW / RECYCLE /
  PRE_PING for both SQLite and MySQL engines so stress runs can bump
  without code edits.
- MySQL initialize() now wraps schema DDL in a GET_LOCK advisory lock so
  concurrent uvicorn workers racing create_all() don't hit 'Table was
  skipped since its definition is being modified by concurrent DDL'.
- sqlite & mysql repo get_log_histogram use the shared _session() helper
  instead of session_factory() for consistency with the rest of the repo.
- SSE stream_events docstring updated to asyncmy.
2026-04-17 15:01:49 -04:00
3945e72e11 perf: run bcrypt on a thread so it doesn't block the event loop
verify_password / get_password_hash are CPU-bound and take ~250ms each
at rounds=12. Called directly from async endpoints, they stall every
other coroutine for that window, which made them the largest single-worker
bottleneck on the login path.

Adds averify_password / ahash_password that wrap the sync versions in
asyncio.to_thread. Sync versions stay put because _ensure_admin_user and
tests still use them.

5 call sites updated across login, change-password, create-user, and reset-password.
tests/test_auth_async.py asserts that parallel averify calls run concurrently
(about 1x the time of a single verify, not 2x).
2026-04-17 14:52:22 -04:00
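
The wrapper pattern above can be sketched as follows. To stay self-contained this uses stdlib PBKDF2 as a stand-in for bcrypt, so the function signatures (an explicit `salt` argument, bytes hashes) are assumptions and differ from the project's real helpers.

```python
import asyncio
import hashlib
import hmac


def get_password_hash(password: str, salt: bytes) -> bytes:
    # Stand-in for the CPU-bound bcrypt call (assumption: PBKDF2 here).
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    return hmac.compare_digest(get_password_hash(password, salt), expected)


async def ahash_password(password: str, salt: bytes) -> bytes:
    # Runs the sync hash on a worker thread so the event loop stays free.
    return await asyncio.to_thread(get_password_hash, password, salt)


async def averify_password(password: str, salt: bytes, expected: bytes) -> bool:
    return await asyncio.to_thread(verify_password, password, salt, expected)
```

The sync functions remain importable for callers that are not async, which matches the commit's note that they stay in place.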
bd406090a7 fix: re-seed admin password when still unfinalized (must_change_password=True)
_ensure_admin_user was strict insert-if-missing: once a stale hash landed
in decnet.db (e.g. from a deploy that used a different DECNET_ADMIN_PASSWORD),
login silently 401'd because changing the env var later had no effect.

Now on startup: if the admin still has must_change_password=True (they
never finalized their own password), re-sync the hash from the current
env var. Once the admin sets a real password, we leave it alone.

Found via locustfile.py login storm — see tests/test_admin_seed.py.

Note: this commit also bundles uncommitted pool-management work already
present in sqlmodel_repo.py from prior sessions.
2026-04-17 14:49:13 -04:00
e22d057e68 added: scripts/profile/aggregate_requests.py — roll up pyinstrument request profiles
Parses every HTML in profiles/, reattributes [self]/[await] synthetic
leaves to their parent function, and reports per-endpoint wall-time
(mean/p50/p95/max) plus top hot functions by cumulative self-time.

Makes post-locust profile dirs actually readable — otherwise they're
just a pile of hundred-plus HTML files.
2026-04-17 14:48:59 -04:00
cb12e7c475 fix: logging handler must not crash its caller on reopen failure
When decnet.system.log is root-owned (e.g. created by a pre-fix 'sudo
decnet deploy') and a subsequent non-root process tries to log, the
InodeAwareRotatingFileHandler raised PermissionError out of emit(),
which propagated up through logger.debug/info and killed the collector's
log stream loop ('log stream ended ... reason=[Errno 13]').

Now matches stdlib behaviour: wrap _open() in try/except OSError and
defer to handleError() on failure. Adds a regression test.

Also: scripts/profile/view.sh 'pyinstrument' keyword was matching
memray-flamegraph-*.html files. Exclude the memray-* prefix.
2026-04-17 14:01:36 -04:00
c29ca977fd added: scripts/profile/classify_usage.py — classify memray usage_over_time.csv
Reads the memray usage CSV and emits a verdict based on tail-drop-from-
peak: CLIMB-AND-DROP, MOSTLY-RELEASED, or SUSTAINED-AT-PEAK. Deliberately
ignores net-growth-vs-baseline since any active workload grows vs. a cold
interpreter — that metric is misleading as a leak signal.
2026-04-17 13:54:37 -04:00
bf4afac70f fix: RotatingFileHandler reopens on external deletion/rotation
Mirrors the inode-check fix from 935a9a5 (collector worker) for the
stdlib-handler-based log paths. Both decnet.system.log (config.py) and
decnet.log (logging/file_handler.py) now use a subclass that stats the
target path before each emit and reopens on inode/device mismatch —
matching the behavior of stdlib WatchedFileHandler while preserving
size-based rotation.

Previously: rm decnet.system.log → handler kept writing to the orphaned
inode until maxBytes triggered; all lines between were lost.
2026-04-17 13:42:15 -04:00
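
A minimal sketch of a handler that stats the path before each emit and reopens on inode/device mismatch. The class body is an illustration built on the stdlib `RotatingFileHandler`, not the project's implementation; the `OSError` guard defers to `handleError()` so a failed reopen does not raise into the caller.

```python
import logging
import os
from logging.handlers import RotatingFileHandler


class InodeAwareRotatingFileHandler(RotatingFileHandler):
    """Reopen the log file if it was deleted or replaced externally."""

    def _is_stale(self) -> bool:
        if self.stream is None:
            return False  # the parent class opens the stream lazily
        try:
            on_disk = os.stat(self.baseFilename)
        except FileNotFoundError:
            return True  # path was removed; our handle points at an orphan
        held = os.fstat(self.stream.fileno())
        return (on_disk.st_ino, on_disk.st_dev) != (held.st_ino, held.st_dev)

    def emit(self, record: logging.LogRecord) -> None:
        try:
            if self._is_stale():
                stale, self.stream = self.stream, None
                stale.close()
                self.stream = self._open()
        except OSError:
            # Never let a reopen failure propagate out of logger.info().
            self.handleError(record)
            return
        super().emit(record)
```

Size-based rotation is untouched because `super().emit()` still runs the usual rollover check.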
4b15b7eb35 fix: chown log files to sudo-invoking user so non-root API can append
'sudo decnet deploy' needs root for MACVLAN, but the log files it creates
(decnet.log and decnet.system.log) end up owned by root. A subsequent
non-root 'decnet api' then crashes on PermissionError appending to them.

New decnet.privdrop helper reads SUDO_UID/SUDO_GID and chowns files/dirs
back to the invoking user. Best-effort: no-op when not root, not under
sudo, path missing, or chown fails. Applied at both log-file creation
sites (config.py system log, logging/file_handler.py syslog file).
2026-04-17 13:39:09 -04:00
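
The best-effort chown described above could be sketched like this. The function name and the injectable `env` parameter are hypothetical (the latter is only there to make the sketch easy to exercise).

```python
import os
from typing import Mapping, Optional


def chown_to_invoking_user(path: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Give `path` back to the user who ran sudo. Returns True on success."""
    env = os.environ if env is None else env
    if os.geteuid() != 0:
        return False  # not root: nothing to hand back
    uid, gid = env.get("SUDO_UID"), env.get("SUDO_GID")
    if uid is None or gid is None:
        return False  # not running under sudo
    try:
        os.chown(path, int(uid), int(gid))
    except (OSError, ValueError):
        return False  # missing path, bad ids, or chown refused
    return True
```

Every failure path returns `False` instead of raising, which is the "best-effort" property the commit calls out.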
140d2fbaad fix: gate embedded sniffer behind DECNET_EMBED_SNIFFER (default off)
The API's lifespan unconditionally spawned a MACVLAN sniffer task, which
duplicated the standalone 'decnet sniffer --daemon' process that
'decnet deploy' always starts — causing two workers to sniff the same
interface, double events, and wasted CPU.

Mirror the existing DECNET_EMBED_PROFILER pattern: sniffer is OFF by
default, opt in explicitly. Static regression tests guard against
accidental removal of the gate.
2026-04-17 13:35:43 -04:00
064c8760b6 fix: memray run needs --trace-python-allocators for frame attribution
Without it, memray stats reports 'Total number of frames seen: 0' and flamegraphs
render empty or C-only. Also added --follow-fork so uvicorn workers spawned
as child processes are tracked.
2026-04-17 13:24:55 -04:00
6572c5cbaf added: scripts/profile/view.sh — auto-pick newest artifact and open viewer
Dispatches by extension: .prof -> snakeviz, memray .bin -> memray flamegraph
(overridable via VIEW=table|tree|stats|summary|leaks), .svg/.html -> xdg-open.
Positional arg can be a file path or a type keyword (cprofile, memray, pyspy,
pyinstrument).
2026-04-17 13:20:05 -04:00
ba448bae13 docs: py-spy 0.4.1 lacks Python 3.14 support; wrapper aborts early
Root cause of 'No python processes found in process <pid>': py-spy needs
per-release ABI knowledge and 0.4.1 (latest PyPI) predates 3.14. Wrapper
now detects the interpreter and points users at pyinstrument/memray/cProfile.
2026-04-17 13:17:23 -04:00
1a18377b0a fix: mysql url builder tests expect asyncmy, not aiomysql
The builder in decnet/web/db/mysql/database.py emits 'mysql+asyncmy://' URLs
(asyncmy is the declared dep in pyproject.toml). Tests were stale from a
prior aiomysql era.
2026-04-17 13:13:36 -04:00
319c1dbb61 added: profiling toolchain (py-spy, pyinstrument, pytest-benchmark, memray, snakeviz)
New `profile` optional-deps group, opt-in Pyinstrument ASGI middleware
gated by DECNET_PROFILE_REQUESTS, bench marker + tests/perf/ micro-benchmarks
for repository hot paths, and scripts/profile/ helpers for py-spy/cProfile/memray.
2026-04-17 13:13:00 -04:00
c1d8102253 modified: DEVELOPMENT roadmap. one step closer to v1 2026-04-16 11:39:07 -04:00
49f3002c94 added: docs; modified: .gitignore
Some checks failed
CI / Lint (ruff) (push) Successful in 18s
CI / SAST (bandit) (push) Successful in 19s
CI / Dependency audit (pip-audit) (push) Successful in 40s
CI / Test (Standard) (3.11) (push) Successful in 2m38s
CI / Test (Standard) (3.12) (push) Successful in 2m56s
CI / Test (Live) (3.11) (push) Failing after 1m3s
CI / Test (Fuzz) (3.11) (push) Has been skipped
CI / Merge dev → testing (push) Has been skipped
CI / Prepare Merge to Main (push) Has been skipped
CI / Finalize Merge to Main (push) Has been skipped
2026-04-16 02:10:38 -04:00
9b59f8672e chores: cleanup; added: viteconfig 2026-04-16 02:09:30 -04:00
296979003d fix: pytest -m live works without extra flags
Root cause: test_schemathesis.py mutates decnet.web.auth.SECRET_KEY at
module-level import time, poisoning JWT verification for all other tests
in the same process — even when fuzz tests are deselected.

- Add pytest_ignore_collect hook in tests/api/conftest.py to skip
  collecting test_schemathesis.py unless -m fuzz is selected
- Add --dist loadscope to addopts so xdist groups by module (protects
  module-scoped fixtures in live tests)
- Remove now-unnecessary xdist_group markers from live test classes
2026-04-16 01:55:38 -04:00
89099b903d fix: resolve schemathesis and live test failures
- Add 403 response to all RBAC-gated endpoints (schemathesis UndefinedStatusCode)
- Add 400 response to all endpoints accepting JSON bodies (malformed input)
- Add required 'title' field to schemathesis.toml for schemathesis 4.15+
- Add xdist_group markers to live tests with module-scoped fixtures to
  prevent xdist from distributing them across workers (fixture isolation)
2026-04-16 01:39:04 -04:00
29578d9d99 fix: resolve all ruff and bandit lint/security issues
- Remove unused Optional import (F401) in telemetry.py
- Move imports above module-level code (E402) in web/db/models.py
- Default API/web hosts to 127.0.0.1 instead of 0.0.0.0 (B104)
- Add usedforsecurity=False to MD5 calls in JA3/HASSH fingerprinting (B324)
- Annotate intentional try/except/pass blocks with nosec (B110)
- Remove stale nosec comments that no longer suppress anything
2026-04-16 01:04:57 -04:00
70d8ffc607 feat: complete OTEL tracing across all services with pipeline bridge and docs
Extends tracing to every remaining module: all 23 API route handlers,
correlation engine, sniffer (fingerprint/p0f/syslog), prober (jarm/hassh/tcpfp),
profiler behavioral analysis, logging subsystem, engine, and mutator.

Bridges the ingester→SSE trace gap by persisting trace_id/span_id columns on
the logs table and creating OTEL span links in the SSE endpoint. Adds log-trace
correlation via _TraceContextFilter injecting otel_trace_id into Python LogRecords.

Includes development/docs/TRACING.md with full span reference (76 spans),
pipeline propagation architecture, quick start guide, and troubleshooting.
2026-04-16 00:58:08 -04:00
04db13afae feat: cross-stage trace propagation and granular per-event spans
Collector now creates a span per event and injects W3C trace context
into JSON records. Ingester extracts that context and creates child
spans, connecting the full event journey: collector -> ingester ->
db.add_log + extract_bounty -> db.add_bounty.

Profiler now creates per-IP spans inside update_profiles with rich
attributes (event_count, is_traversal, bounty_count, command_count).

Traces in Jaeger now show the complete execution map from capture
through ingestion and profiling.
2026-04-15 23:52:13 -04:00
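
To illustrate how W3C trace context can travel inside a JSON record, here is a hand-rolled sketch of the `traceparent` format (`version-traceid-parentid-flags`). It deliberately avoids the OpenTelemetry SDK; the record key `traceparent` and the helper names are assumptions, and the project presumably uses the SDK's propagator instead.

```python
import json
import re
import secrets
from typing import Optional, Tuple

# version 00, 32-hex trace id, 16-hex parent span id, 2-hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(record: dict, trace_id: str, span_id: str, sampled: bool = True) -> dict:
    """Collector side: attach the current span's identity to the record."""
    flags = "01" if sampled else "00"
    return {**record, "traceparent": f"00-{trace_id}-{span_id}-{flags}"}


def extract(record: dict) -> Optional[Tuple[str, str]]:
    """Ingester side: recover (trace_id, parent_span_id), or None if absent."""
    match = _TRACEPARENT.match(record.get("traceparent", ""))
    return (match.group(1), match.group(2)) if match else None
```

The ingester would then start its own span with the extracted trace id and use the extracted span id as the parent.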
d1a88e75bd fix: dynamic TracedRepository proxy + disable tracing in test suite
Replace brittle explicit method-by-method proxy with __getattr__-based
dynamic proxy that forwards all args/kwargs to the inner repo. Fixes
TypeError on get_logs_after_id() where concrete repo accepts extra
kwargs beyond the ABC signature.

Pin DECNET_DEVELOPER_TRACING=false in conftest.py so .env.local
settings don't leak into the test suite.
2026-04-15 23:46:46 -04:00
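
A `__getattr__`-based proxy of the kind described above might look like the sketch below. The `on_call` hook is a hypothetical stand-in for opening a span; the real proxy's tracing logic is not shown in the log.

```python
import functools
from typing import Any, Callable


class TracedRepository:
    """Forward every attribute to the inner repo, wrapping callables."""

    def __init__(self, inner: Any, on_call: Callable[[str], None]) -> None:
        self._inner = inner
        self._on_call = on_call

    def __getattr__(self, name: str) -> Any:
        # Only reached for names not found on the proxy itself.
        attr = getattr(self._inner, name)
        if not callable(attr):
            return attr

        @functools.wraps(attr)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            self._on_call(name)  # stand-in for starting a span
            # All args/kwargs pass through untouched, so extra keyword
            # arguments on the concrete repo keep working.
            return attr(*args, **kwargs)

        return wrapper
```

Because the wrapper returns whatever the inner method returns, coroutine methods still hand back an awaitable to the caller.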
65ddb0b359 feat: add OpenTelemetry distributed tracing across all DECNET services
Gated by DECNET_DEVELOPER_TRACING env var (default off, zero overhead).
When enabled, traces flow through FastAPI routes, background workers
(collector, ingester, profiler, sniffer, prober), engine/mutator
operations, and all DB calls via TracedRepository proxy.

Includes Jaeger docker-compose for local dev and 18 unit tests.
2026-04-15 23:23:13 -04:00
b437bc8eec fix: use unbuffered reads in proxy for SSE streaming
resp.read(4096) blocks until 4096 bytes accumulate, which stalls SSE
events (~100-500 bytes each) in the proxy buffer indefinitely. Switch
to read1() which returns bytes immediately available without waiting
for more. Also disable the 120s socket timeout for SSE connections.
2026-04-15 23:03:03 -04:00
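
The difference can be seen with a plain buffered socket reader. This sketch uses a local `socket.socketpair()` as a stand-in for the upstream connection; the proxy's actual response object is not shown in the log, though `http.client.HTTPResponse` also offers `read1()`.

```python
import socket


def demo_read1() -> bytes:
    upstream, proxy_side = socket.socketpair()
    try:
        # One small SSE frame, far below 4096 bytes.
        upstream.sendall(b'data: {"id": 1}\n\n')
        reader = proxy_side.makefile("rb")
        try:
            # read1() returns what a single underlying read yields,
            # instead of waiting for the full 4096 bytes.
            return reader.read1(4096)
        finally:
            reader.close()
    finally:
        upstream.close()
        proxy_side.close()
```

With `reader.read(4096)` in place of `read1(4096)`, the same call would block until more data arrived or the sender closed the connection.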
a1ca5d699b fix: use dedicated thread pools for collector and sniffer workers
The collector spawned one permanent thread per Docker container via
asyncio.to_thread(), saturating the default asyncio executor. This
starved short-lived to_thread(load_state) calls in get_deckies() and
get_stats_summary(), causing the SSE stream and deckies endpoints to
hang indefinitely while other DB-only endpoints worked fine.

Give the collector and sniffer their own ThreadPoolExecutor so they
never compete with the default pool.
2026-04-15 22:57:03 -04:00
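
A sketch of the separation described above: long-lived workers go to their own `ThreadPoolExecutor`, while short calls keep using the default pool through `asyncio.to_thread`. The function names `stream_container_logs` and `load_state` are simplified stand-ins.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


def stream_container_logs(stop: threading.Event) -> None:
    # Stand-in for a permanent per-container log reader.
    stop.wait()


def load_state() -> dict:
    # Stand-in for a short blocking read.
    return {"deckies": []}


async def main() -> dict:
    loop = asyncio.get_running_loop()
    stop = threading.Event()
    with ThreadPoolExecutor(max_workers=4, thread_name_prefix="collector") as pool:
        # Long-lived workers occupy the dedicated pool only.
        streams = [
            loop.run_in_executor(pool, stream_container_logs, stop)
            for _ in range(4)
        ]
        try:
            # The default executor is untouched, so this returns promptly.
            state = await asyncio.to_thread(load_state)
        finally:
            stop.set()
        await asyncio.gather(*streams)
    return state
```

However many containers are being streamed, the default pool's threads stay available for the short `to_thread` calls.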
e9d151734d feat: deduplicate bounties on (bounty_type, attacker_ip, payload)
Before inserting a bounty, check whether an identical row already exists.
Silently drops duplicates to prevent DB saturation from aggressive scanners.
2026-04-15 18:02:52 -04:00
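
The check-before-insert described above, sketched against an in-memory SQLite table. The schema and the `add_bounty` signature are assumptions for illustration; only the three-column key comes from the commit subject.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # Hypothetical minimal schema.
    conn.execute(
        "CREATE TABLE bounties ("
        "id INTEGER PRIMARY KEY, bounty_type TEXT, attacker_ip TEXT, payload TEXT)"
    )
    return conn


def add_bounty(conn: sqlite3.Connection, bounty_type: str,
               attacker_ip: str, payload: str) -> bool:
    """Insert unless an identical row exists. Returns True if inserted."""
    key = (bounty_type, attacker_ip, payload)
    exists = conn.execute(
        "SELECT 1 FROM bounties "
        "WHERE bounty_type = ? AND attacker_ip = ? AND payload = ? LIMIT 1",
        key,
    ).fetchone()
    if exists is not None:
        return False  # duplicate: drop it
    conn.execute(
        "INSERT INTO bounties (bounty_type, attacker_ip, payload) VALUES (?, ?, ?)",
        key,
    )
    conn.commit()
    return True
```

A check followed by an insert can still race under concurrent writers; a UNIQUE constraint on the three columns would be the stricter alternative.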
0ab97d0ade docs: document decnet domain models and fleet transformation 2026-04-15 18:01:27 -04:00
60de16be84 docs: document decnet collector worker 2026-04-15 17:56:24 -04:00
82ec7f3117 fix: gate embedded profiler behind DECNET_EMBED_PROFILER to prevent dual-instance cursor conflict
decnet deploy spawns a standalone profiler daemon AND api.py was also starting
attacker_profile_worker as an asyncio task inside the web server. Both instances
shared the same attacker_worker_cursor key in the state table, causing a race
where one instance could skip events already claimed by the other or overwrite
the cursor mid-batch.

Default is now OFF (embedded profiler disabled). The standalone daemon started
by decnet deploy is the single authoritative instance. Set DECNET_EMBED_PROFILER=true
only when running decnet api in isolation without a full deploy.
2026-04-15 17:49:18 -04:00
11d749f13d fix: wire prober tcpfp_fingerprint events into sniffer_rollup for OS/hop detection
The active prober emits tcpfp_fingerprint events with TTL, window, MSS etc.
from the attacker's SYN-ACK. These were invisible to the behavioral profiler
for two reasons:

1. target_ip (prober's field name for attacker IP) was not in _IP_FIELDS in
   collector/worker.py or correlation/parser.py, so the profiler re-parsed
   raw_lines and got attacker_ip=None, never attributing prober events to
   the attacker profile.

2. sniffer_rollup only handled tcp_syn_fingerprint (passive sniffer) and
   ignored tcpfp_fingerprint (active prober). Prober events use different
   field names: window_size/window_scale/sack_ok vs window/wscale/has_sack.

Changes:
- Add target_ip to _IP_FIELDS in collector and parser
- Add _PROBER_TCPFP_EVENT and _INITIAL_TTL table to behavioral.py
- sniffer_rollup now processes tcpfp_fingerprint: maps field names, derives
  OS from TTL via _os_from_ttl, computes hop_distance = initial_ttl - observed
- Expand prober DEFAULT_TCPFP_PORTS to [22,80,443,8080,8443,445,3389] for
  better SYN-ACK coverage on attacker machines
- Add 4 tests covering prober OS detection, hop distance, and field mapping
2026-04-15 17:36:40 -04:00
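
The TTL-based derivation can be sketched as below. The table contents are an assumption based on commonly cited default initial TTLs (64, 128, 255), and `os_and_hops` is a hypothetical helper; the project's `_INITIAL_TTL` and `_os_from_ttl` may differ.

```python
from typing import Optional, Tuple

# Assumed mapping of initial TTL to OS family.
_INITIAL_TTL = {64: "linux", 128: "windows", 255: "embedded"}


def os_and_hops(observed_ttl: int) -> Optional[Tuple[str, int]]:
    """Guess the OS family and hop distance from an observed TTL."""
    if not 0 < observed_ttl <= 255:
        return None
    # Pick the smallest initial TTL the packet could have started from.
    for initial in sorted(_INITIAL_TTL):
        if observed_ttl <= initial:
            # hop_distance = initial_ttl - observed
            return _INITIAL_TTL[initial], initial - observed_ttl
    return None
```

For example, an observed TTL of 57 maps to the 64 bucket, giving a hop distance of 7.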
a4798946c1 fix: add remote_addr to IP field lookup so http/https/k8s events are attributed correctly
Templates for http, https, k8s, and docker_api log the client IP as
remote_addr (Flask's request.remote_addr) instead of src_ip. The collector
and correlation parser only checked src_ip/src/client_ip/remote_ip/ip, so
every request event from those services was stored with attacker_ip="Unknown"
and never associated with any attacker profile.

Adding remote_addr to _IP_FIELDS in both collector/worker.py and
correlation/parser.py fixes attribution. The profiler cursor was also reset
to 0 so the worker performs a cold rebuild and re-ingests existing events with
the corrected field mapping.
2026-04-15 17:23:33 -04:00
d869eb3d23 docs: document decnet engine orchestrator 2026-04-15 17:13:13 -04:00
89887ec6fd fix: serialize HTTP headers as JSON so tool detection and bounty extraction work
templates/decnet_logging.py calls str(v) on all SD-PARAM values, turning a
headers dict into Python repr ('{'User-Agent': ...}') rather than JSON.
detect_tools_from_headers() called json.loads() on that string and silently
swallowed the error, returning [] for every HTTP event. Same bug prevented
the ingester from extracting User-Agent bounty fingerprints.

- templates/http/server.py: wrap headers dict in json.dumps() before passing
  to syslog_line so the value is a valid JSON string in the syslog record
- behavioral.py: add ast.literal_eval fallback for existing DB rows that were
  stored with the old Python repr format
- ingester.py: parse headers as JSON string in _extract_bounty so User-Agent
  fingerprints are stored correctly going forward
- tests: add test_json_string_headers and test_python_repr_headers_fallback
  to exercise both formats in detect_tools_from_headers
2026-04-15 17:03:52 -04:00
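
The two-format parsing described above (JSON first, Python repr as a fallback for old rows) can be sketched as follows. `parse_headers` is a hypothetical helper name; the project does this inside `detect_tools_from_headers`.

```python
import ast
import json


def parse_headers(raw: str) -> dict:
    """Parse a stored headers value, accepting JSON or legacy Python repr."""
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        try:
            # Legacy rows hold str(dict), e.g. "{'User-Agent': 'curl/8.0'}".
            # literal_eval only evaluates literals, never arbitrary code.
            parsed = ast.literal_eval(raw)
        except (ValueError, SyntaxError, TypeError):
            return {}
    return parsed if isinstance(parsed, dict) else {}
```

Returning an empty dict for unparseable input keeps the caller's behaviour the same as before for genuinely malformed values.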
02e73a19d5 fix: promote TCP-fingerprinted nmap to tool_guesses (detects -sC sans HTTP) 2026-04-15 16:44:45 -04:00
b3efd646f6 feat: replace tool attribution stat with dedicated DETECTED TOOLS block 2026-04-15 16:37:54 -04:00
2ec64ef2ef fix: rename BEHAVIOR label to ATTACK PATTERN for clarity 2026-04-15 16:36:19 -04:00
e67624452e feat: centralize microservice logging to DECNET_SYSTEM_LOGS (default: decnet.system.log) 2026-04-15 16:23:28 -04:00
e05b632e56 feat: update AttackerDetail UI for new behavior classes and multi-tool badges 2026-04-15 15:49:03 -04:00
c8f05df4d9 feat: overhaul behavioral profiler — multi-tool detection, improved classification, TTL OS fallback 2026-04-15 15:47:02 -04:00
935a9a58d2 fix: reopen collector log handles after deletion or log rotation
Replaces the single persistent open() with inode-based reopen logic.
If decnet.log or decnet.json is deleted or renamed by logrotate, the
next write detects the stale inode, closes the old handle, and creates
a fresh file — preventing silent data loss to orphaned inodes.
2026-04-15 14:04:54 -04:00
63efe6c7ba fix: persist ingester position and profiler cursor across restarts
- Ingester now loads byte-offset from DB on startup (key: ingest_worker_position)
  and saves it after each batch — prevents full re-read on every API restart
- On file truncation/rotation the saved offset is reset to 0
- Profiler worker now loads last_log_id from DB on startup — every restart
  becomes an incremental update instead of a full cold rebuild
- Updated all affected tests to mock get_state/set_state; added new tests
  covering position restore, set_state call, truncation reset, and cursor
  restore/cold-start paths
2026-04-15 13:58:12 -04:00
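
The offset handling described above (resume from a saved byte offset, reset to 0 on truncation) might look roughly like this. `read_new_bytes` is a hypothetical helper; in the project the offset is persisted via get_state/set_state rather than returned.

```python
import os
from typing import Tuple


def read_new_bytes(path: str, saved_offset: int) -> Tuple[bytes, int]:
    """Return (new data, next offset) for a log file read incrementally."""
    size = os.path.getsize(path)
    if size < saved_offset:
        # File was truncated or rotated: start again from the beginning.
        saved_offset = 0
    with open(path, "rb") as fh:
        fh.seek(saved_offset)
        data = fh.read()
    return data, saved_offset + len(data)
```

The caller would store the returned offset after each batch so that a restart continues from where the last batch ended.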