Commit Graph

18 Commits

Author SHA1 Message Date
0fbb07c2ec feat(workers): bus-backed Workers panel (registry, control, installed flag)
Ships the backend half of Config → Workers:

* Worker registry aggregates `system.*.health` + `system.bus.health`
  heartbeats into a last-seen dict; OK / STALE / UNKNOWN tiers drop
  out of a 90s window (3× the 30s heartbeat interval).
* `GET /api/v1/workers` returns the snapshot plus `bus_connected`
  (so the UI can explain "all UNKNOWN" when the bus socket is down)
  and a per-row `installed` flag populated from
  `systemctl list-unit-files decnet-*.service` (cached 30s).
* `POST /api/v1/workers/{name}/stop` publishes a stop intent on
  `system.<name>.control`; workers listen via the shared control
  listener in `bus/publish.py`.
* Heartbeat + control listener wired into collector / profiler /
  sniffer / prober / mutator worker loops. API self-heartbeats too
  so the panel always has one ground-truth row.
* Topic helper `system_control(name)` + tests covering builder
  validation, control listener shutdown path, and the API surface
  (auth gating, bus-connected field, unknown-name 404).

Adds `StartFailure` / `StartAllResponse` models in anticipation of
the upcoming start endpoints (DEBT-034).
2026-04-22 14:10:39 -04:00
a448dbe283 feat(collector): publish system.log per ingested event (DEBT-031 worker 5)
log_collector_worker connects the bus at startup, builds a thread-safe
system.log publisher, and hands it to each container-stream thread
through _stream_container's new publish_fn parameter.  Publishing fires
right after the JSON record is written — same rate-limiter path, no
extra parsing, compact payload (decky/service/event_type/attacker_ip/
timestamp) so subscribers can redraw without re-reading the DB.

Bus stays optional: if get_bus() fails or DECNET_BUS_ENABLED=false the
factory returns a no-op publisher and the stream thread calls it
unconditionally.  Hook failures are logged and never abort the thread.
2026-04-21 16:57:21 -04:00
0cdcfe2653 feat(agent/collector): topology-label discovery and master-authoritative supersede
Legacy fleet deckies live in decnet-state.json; MazeNET topology
containers don't. Tag them at compose-time with
decnet.topology.service=true and let the collector match on that label.
Spin up the agent's log collector on the first successful /topology/apply
(not in the lifespan — that would break the no-docker-on-boot invariant)
and tear it down with the app. Land log lines in DECNET_AGENT_LOG_FILE,
separate from master-side DECNET_INGEST_LOG_FILE, so a dev box running
both roles can't forward its own ingest back to itself.

When master pushes a topology that differs from whatever is pinned
locally, teardown the predecessor and accept the new one. Refusing with
409 left the agent stranded after partial deploys. record_error now
persists the hydrated blob so a later teardown can still walk the LAN
list — otherwise a half-failed apply strands containers + bridges with
no breadcrumb back to them.
2026-04-21 10:23:10 -04:00
8bdc5b98c9 feat(collector): parse real PROCID and extract IPs from logger kv pairs
- Relaxed RFC 5424 regex to accept either NILVALUE or a numeric PROCID;
  sshd / sudo go through rsyslog with their real PID, while
  syslog_bridge emitters keep using '-'.
- Added a fallback pass that scans the MSG body for IP-shaped
  key=value tokens. This rescues attacker attribution for plain logger
  callers like the SSH PROMPT_COMMAND shim, which emits
  'CMD … src=IP …' without SD-element params.
2026-04-18 05:37:08 -04:00
8dd4c78b33 refactor: strip DECNET tokens from container-visible surface
Rename the container-side logging module decnet_logging → syslog_bridge
(canonical at templates/syslog_bridge.py, synced into each template by
the deployer). Drop the stale per-template copies; setuptools find was
picking them up anyway. Swap useradd/USER/chown "decnet" for "logrelay"
so no obvious token appears in the rendered container image.

Apply the same cloaking pattern to the telnet template that SSH got:
syslog pipe moves to /run/systemd/journal/syslog-relay and the relay
is cat'd via exec -a "systemd-journal-fwd". rsyslog.d conf rename
99-decnet.conf → 50-journal-forward.conf. SSH capture script:
/var/decnet/captured → /var/lib/systemd/coredump (real systemd path),
logger tag decnet-capture → systemd-journal. Compose volume updated
to match the new in-container quarantine path.

SD element ID shifts decnet@55555 → relay@55555; synced across
collector, parser, sniffer, prober, formatter, tests, and docs so the
host-side pipeline still matches what containers emit.
2026-04-17 22:57:53 -04:00
29578d9d99 fix: resolve all ruff and bandit lint/security issues
- Remove unused Optional import (F401) in telemetry.py
- Move imports above module-level code (E402) in web/db/models.py
- Default API/web hosts to 127.0.0.1 instead of 0.0.0.0 (B104)
- Add usedforsecurity=False to MD5 calls in JA3/HASSH fingerprinting (B324)
- Annotate intentional try/except/pass blocks with nosec (B110)
- Remove stale nosec comments that no longer suppress anything
2026-04-16 01:04:57 -04:00
04db13afae feat: cross-stage trace propagation and granular per-event spans
Collector now creates a span per event and injects W3C trace context
into JSON records. Ingester extracts that context and creates child
spans, connecting the full event journey: collector -> ingester ->
db.add_log + extract_bounty -> db.add_bounty.

Profiler now creates per-IP spans inside update_profiles with rich
attributes (event_count, is_traversal, bounty_count, command_count).

Traces in Jaeger now show the complete execution map from capture
through ingestion and profiling.
2026-04-15 23:52:13 -04:00
65ddb0b359 feat: add OpenTelemetry distributed tracing across all DECNET services
Gated by DECNET_DEVELOPER_TRACING env var (default off, zero overhead).
When enabled, traces flow through FastAPI routes, background workers
(collector, ingester, profiler, sniffer, prober), engine/mutator
operations, and all DB calls via TracedRepository proxy.

Includes Jaeger docker-compose for local dev and 18 unit tests.
2026-04-15 23:23:13 -04:00
a1ca5d699b fix: use dedicated thread pools for collector and sniffer workers
The collector spawned one permanent thread per Docker container via
asyncio.to_thread(), saturating the default asyncio executor. This
starved short-lived to_thread(load_state) calls in get_deckies() and
get_stats_summary(), causing the SSE stream and deckies endpoints to
hang indefinitely while other DB-only endpoints worked fine.

Give the collector and sniffer their own ThreadPoolExecutor so they
never compete with the default pool.
2026-04-15 22:57:03 -04:00
11d749f13d fix: wire prober tcpfp_fingerprint events into sniffer_rollup for OS/hop detection
The active prober emits tcpfp_fingerprint events with TTL, window, MSS etc.
from the attacker's SYN-ACK. These were invisible to the behavioral profiler
for two reasons:

1. target_ip (prober's field name for attacker IP) was not in _IP_FIELDS in
   collector/worker.py or correlation/parser.py, so the profiler re-parsed
   raw_lines and got attacker_ip=None, never attributing prober events to
   the attacker profile.

2. sniffer_rollup only handled tcp_syn_fingerprint (passive sniffer) and
   ignored tcpfp_fingerprint (active prober). Prober events use different
   field names: window_size/window_scale/sack_ok vs window/wscale/has_sack.

Changes:
- Add target_ip to _IP_FIELDS in collector and parser
- Add _PROBER_TCPFP_EVENT and _INITIAL_TTL table to behavioral.py
- sniffer_rollup now processes tcpfp_fingerprint: maps field names, derives
  OS from TTL via _os_from_ttl, computes hop_distance = initial_ttl - observed
- Expand prober DEFAULT_TCPFP_PORTS to [22,80,443,8080,8443,445,3389] for
  better SYN-ACK coverage on attacker machines
- Add 4 tests covering prober OS detection, hop distance, and field mapping
2026-04-15 17:36:40 -04:00
a4798946c1 fix: add remote_addr to IP field lookup so http/https/k8s events are attributed correctly
Templates for http, https, k8s, and docker_api log the client IP as
remote_addr (Flask's request.remote_addr) instead of src_ip. The collector
and correlation parser only checked src_ip/src/client_ip/remote_ip/ip, so
every request event from those services was stored with attacker_ip="Unknown"
and never associated with any attacker profile.

Adding remote_addr to _IP_FIELDS in both collector/worker.py and
correlation/parser.py fixes attribution. The profiler cursor was also reset
to 0 so the worker performs a cold rebuild and re-ingests existing events with
the corrected field mapping.
2026-04-15 17:23:33 -04:00
935a9a58d2 fix: reopen collector log handles after deletion or log rotation
Replaces the single persistent open() with inode-based reopen logic.
If decnet.log or decnet.json is deleted or renamed by logrotate, the
next write detects the stale inode, closes the old handle, and creates
a fresh file — preventing silent data loss to orphaned inodes.
2026-04-15 14:04:54 -04:00
f6cb90ee66 perf: rate-limit connect/disconnect events in collector to spare ingester
Connection-lifecycle events (connect, disconnect, accept, close) fire once
per TCP connection. During a portscan or credential-stuffing run this
firehoses the SQLite ingester with tiny WAL writes and starves all reads
until the queue drains.

The collector now deduplicates these events by
(attacker_ip, decky, service, event_type) over a 1-second window before
writing to the .json ingestion stream. The raw .log file is untouched, so
rsyslog/SIEM still see every event for forensic fidelity.

Tunable via DECNET_COLLECTOR_RL_WINDOW_SEC and DECNET_COLLECTOR_RL_EVENT_TYPES.
2026-04-15 12:04:04 -04:00
df3f04c10e revert: undo service badge filter, parser normalization, and SSH relay
Reverts commits 8c249f6, a6c7cfd, 7ff5703. The SSH log relay approach
requires container redeployment and doesn't retroactively fix existing
attacker profiles. Rolling back to reassess the approach.
2026-04-14 02:14:46 -04:00
7ff5703250 feat: SSH log relay emits proper DECNET syslog for sshd events
New log_relay.py replaces raw 'cat' on the rsyslog pipe. Intercepts
sshd and bash lines and re-emits them as structured RFC 5424 events:
login_success, session_opened, disconnect, connection_closed, command.
Parsers updated to accept non-nil PROCID (sshd uses PID).
2026-04-14 02:07:35 -04:00
a6c7cfdf66 fix: normalize SSH bash CMD lines to service=ssh, event_type=command
The SSH honeypot logs commands via PROMPT_COMMAND logger as:
  <14>1 ... bash - - -  CMD uid=0 pwd=/root cmd=ls
These lines had service=bash and event_type=-, so the attacker worker
never recognized them as commands. Both the collector and correlation
parsers now detect the CMD pattern and normalize to service=ssh,
event_type=command, with uid/pwd/command in fields.
2026-04-14 01:54:36 -04:00
035499f255 feat: add component-aware RFC 5424 application logging system
- Modify Rfc5424Formatter to read decnet_component from LogRecord
  and use it as RFC 5424 APP-NAME field (falls back to 'decnet')
- Add get_logger(component) factory in decnet/logging/__init__.py
  with _ComponentFilter that injects decnet_component on each record
- Wire all five layers to their component tag:
    cli -> 'cli', engine -> 'engine', api -> 'api' (api.py, ingester,
    routers), mutator -> 'mutator', collector -> 'collector'
- Add structured INFO/DEBUG/WARNING/ERROR log calls throughout each
  layer per the defined vocabulary; DEBUG calls are suppressed unless
  DECNET_DEVELOPER=true
- Add tests/test_logging.py covering factory, filter, formatter
  component-awareness, fallback behaviour, and level gating
2026-04-13 07:39:01 -04:00
c384a3103a refactor: separate engine, collector, mutator, and fleet into independent subpackages
- decnet/engine/ — container lifecycle (deploy, teardown, status); _kill_api removed
- decnet/collector/ — Docker log streaming (moved from web/collector.py)
- decnet/mutator/ — mutation engine (no longer imports from cli or duplicates deployer code)
- decnet/fleet.py — shared decky-building logic extracted from cli.py

Cross-contamination eliminated:
- web router no longer imports from decnet.cli
- mutator no longer imports from decnet.cli
- cli no longer imports from decnet.web
- _kill_api() moved to cli (process management, not engine concern)
- _compose_with_retry duplicate removed from mutator
2026-04-12 00:26:22 -04:00