Native sshd and pam_unix lines route through rsyslog without the
relay@55555 SD wrapper and without key=value pairs, so attacker_ip
fell through to "Unknown". Add a prose-IP fallback to both parsers:
anchored patterns (from/rhost/client/src) win first so we never pick
the local listener in "Connection from X port Y on Z port 22", with
a bare-IPv4 scan as the last resort.
Hit live on first VPS deploy: a window between the initial
client.containers.list() snapshot and the client.events() start-event
stream let topology service containers slip through, requiring an
operator restart for them to be picked up.
Two fixes:
* `_watch_events` now wraps the events() call in a retry loop with
exponential backoff (1s -> 30s cap). A docker.errors.APIError, daemon
reload, or SDK stream-decode hiccup used to make the executor task
return cleanly, leaving the collector "running" with no event
subscription. Future container starts were silently dropped until
the unit was restarted.
* New `_reconcile_loop` async task ticks every
DECNET_COLLECTOR_RECONCILE_S (default 30s), re-scans
client.containers.list(), and calls _spawn for any service container
not already in `active`. Belt to the event watcher's suspenders:
even if a start event is dropped during a reconnect window, the
reconciler picks it up within one cycle. Also prunes finished
futures from `active` so the dict's bounded by current container
count rather than agent lifetime churn.
The events watcher's start-event filter previously called
_load_service_container_names(), which reads decnet-state.json on
every event. decnet deploy writes that state file out-of-band
with docker compose up, so a container's start event could
arrive before the state was committed — the watcher then dropped
the event silently and never tailed the container's stdout. The
visible symptom was an empty Credentials view (and Logs/Bounty)
after a fresh deploy until the collector was manually restarted.
Fix: stamp decnet.fleet.{service,decky,service_name} labels on
every fleet service container at compose-time, and let the
collector recognize either the fleet or topology label without
touching the state file. The state-file name match remains as a
fallback for legacy containers that predate the new labels.
Ships the backend half of Config → Workers:
* Worker registry aggregates `system.*.health` + `system.bus.health`
heartbeats into a last-seen dict; OK / STALE / UNKNOWN tiers drop
out of a 90s window (3× the 30s heartbeat interval).
* `GET /api/v1/workers` returns the snapshot plus `bus_connected`
(so the UI can explain "all UNKNOWN" when the bus socket is down)
and a per-row `installed` flag populated from
`systemctl list-unit-files decnet-*.service` (cached 30s).
* `POST /api/v1/workers/{name}/stop` publishes a stop intent on
`system.<name>.control`; workers listen via the shared control
listener in `bus/publish.py`.
* Heartbeat + control listener wired into collector / profiler /
sniffer / prober / mutator worker loops. API self-heartbeats too
so the panel always has one ground-truth row.
* Topic helper `system_control(name)` + tests covering builder
validation, control listener shutdown path, and the API surface
(auth gating, bus-connected field, unknown-name 404).
Adds `StartFailure` / `StartAllResponse` models in anticipation of
the upcoming start endpoints (DEBT-034).
log_collector_worker connects the bus at startup, builds a thread-safe
system.log publisher, and hands it to each container-stream thread
through _stream_container's new publish_fn parameter. Publishing fires
right after the JSON record is written — same rate-limiter path, no
extra parsing, compact payload (decky/service/event_type/attacker_ip/
timestamp) so subscribers can redraw without re-reading the DB.
Bus stays optional: if get_bus() fails or DECNET_BUS_ENABLED=false the
factory returns a no-op publisher and the stream thread calls it
unconditionally. Hook failures are logged and never abort the thread.
Legacy fleet deckies live in decnet-state.json; MazeNET topology
containers don't. Tag them at compose-time with
decnet.topology.service=true and let the collector match on that label.
Spin up the agent's log collector on the first successful /topology/apply
(not in the lifespan — that would break the no-docker-on-boot invariant)
and tear it down with the app. Land log lines in DECNET_AGENT_LOG_FILE,
separate from master-side DECNET_INGEST_LOG_FILE, so a dev box running
both roles can't forward its own ingest back to itself.
When master pushes a topology that differs from whatever is pinned
locally, teardown the predecessor and accept the new one. Refusing with
409 left the agent stranded after partial deploys. record_error now
persists the hydrated blob so a later teardown can still walk the LAN
list — otherwise a half-failed apply strands containers + bridges with
no breadcrumb back to them.
- Relaxed RFC 5424 regex to accept either NILVALUE or a numeric PROCID;
sshd / sudo go through rsyslog with their real PID, while
syslog_bridge emitters keep using '-'.
- Added a fallback pass that scans the MSG body for IP-shaped
key=value tokens. This rescues attacker attribution for plain logger
callers like the SSH PROMPT_COMMAND shim, which emits
'CMD … src=IP …' without SD-element params.
Rename the container-side logging module decnet_logging → syslog_bridge
(canonical at templates/syslog_bridge.py, synced into each template by
the deployer). Drop the stale per-template copies; setuptools find was
picking them up anyway. Swap useradd/USER/chown "decnet" for "logrelay"
so no obvious token appears in the rendered container image.
Apply the same cloaking pattern to the telnet template that SSH got:
syslog pipe moves to /run/systemd/journal/syslog-relay and the relay
is cat'd via exec -a "systemd-journal-fwd". rsyslog.d conf rename
99-decnet.conf → 50-journal-forward.conf. SSH capture script:
/var/decnet/captured → /var/lib/systemd/coredump (real systemd path),
logger tag decnet-capture → systemd-journal. Compose volume updated
to match the new in-container quarantine path.
SD element ID shifts decnet@55555 → relay@55555; synced across
collector, parser, sniffer, prober, formatter, tests, and docs so the
host-side pipeline still matches what containers emit.
Collector now creates a span per event and injects W3C trace context
into JSON records. Ingester extracts that context and creates child
spans, connecting the full event journey: collector -> ingester ->
db.add_log + extract_bounty -> db.add_bounty.
Profiler now creates per-IP spans inside update_profiles with rich
attributes (event_count, is_traversal, bounty_count, command_count).
Traces in Jaeger now show the complete execution map from capture
through ingestion and profiling.
Gated by DECNET_DEVELOPER_TRACING env var (default off, zero overhead).
When enabled, traces flow through FastAPI routes, background workers
(collector, ingester, profiler, sniffer, prober), engine/mutator
operations, and all DB calls via TracedRepository proxy.
Includes Jaeger docker-compose for local dev and 18 unit tests.
The collector spawned one permanent thread per Docker container via
asyncio.to_thread(), saturating the default asyncio executor. This
starved short-lived to_thread(load_state) calls in get_deckies() and
get_stats_summary(), causing the SSE stream and deckies endpoints to
hang indefinitely while other DB-only endpoints worked fine.
Give the collector and sniffer their own ThreadPoolExecutor so they
never compete with the default pool.
The active prober emits tcpfp_fingerprint events with TTL, window, MSS etc.
from the attacker's SYN-ACK. These were invisible to the behavioral profiler
for two reasons:
1. target_ip (prober's field name for attacker IP) was not in _IP_FIELDS in
collector/worker.py or correlation/parser.py, so the profiler re-parsed
raw_lines and got attacker_ip=None, never attributing prober events to
the attacker profile.
2. sniffer_rollup only handled tcp_syn_fingerprint (passive sniffer) and
ignored tcpfp_fingerprint (active prober). Prober events use different
field names: window_size/window_scale/sack_ok vs window/wscale/has_sack.
Changes:
- Add target_ip to _IP_FIELDS in collector and parser
- Add _PROBER_TCPFP_EVENT and _INITIAL_TTL table to behavioral.py
- sniffer_rollup now processes tcpfp_fingerprint: maps field names, derives
OS from TTL via _os_from_ttl, computes hop_distance = initial_ttl - observed
- Expand prober DEFAULT_TCPFP_PORTS to [22,80,443,8080,8443,445,3389] for
better SYN-ACK coverage on attacker machines
- Add 4 tests covering prober OS detection, hop distance, and field mapping
Templates for http, https, k8s, and docker_api log the client IP as
remote_addr (Flask's request.remote_addr) instead of src_ip. The collector
and correlation parser only checked src_ip/src/client_ip/remote_ip/ip, so
every request event from those services was stored with attacker_ip="Unknown"
and never associated with any attacker profile.
Adding remote_addr to _IP_FIELDS in both collector/worker.py and
correlation/parser.py fixes attribution. The profiler cursor was also reset
to 0 so the worker performs a cold rebuild and re-ingests existing events with
the corrected field mapping.
Replaces the single persistent open() with inode-based reopen logic.
If decnet.log or decnet.json is deleted or renamed by logrotate, the
next write detects the stale inode, closes the old handle, and creates
a fresh file — preventing silent data loss to orphaned inodes.
Connection-lifecycle events (connect, disconnect, accept, close) fire once
per TCP connection. During a portscan or credential-stuffing run this
firehoses the SQLite ingester with tiny WAL writes and starves all reads
until the queue drains.
The collector now deduplicates these events by
(attacker_ip, decky, service, event_type) over a 1-second window before
writing to the .json ingestion stream. The raw .log file is untouched, so
rsyslog/SIEM still see every event for forensic fidelity.
Tunable via DECNET_COLLECTOR_RL_WINDOW_SEC and DECNET_COLLECTOR_RL_EVENT_TYPES.
Reverts commits 8c249f6, a6c7cfd, 7ff5703. The SSH log relay approach
requires container redeployment and doesn't retroactively fix existing
attacker profiles. Rolling back to reassess the approach.
New log_relay.py replaces raw 'cat' on the rsyslog pipe. Intercepts
sshd and bash lines and re-emits them as structured RFC 5424 events:
login_success, session_opened, disconnect, connection_closed, command.
Parsers updated to accept non-nil PROCID (sshd uses PID).
The SSH honeypot logs commands via PROMPT_COMMAND logger as:
<14>1 ... bash - - - CMD uid=0 pwd=/root cmd=ls
These lines had service=bash and event_type=-, so the attacker worker
never recognized them as commands. Both the collector and correlation
parsers now detect the CMD pattern and normalize to service=ssh,
event_type=command, with uid/pwd/command in fields.
- Modify Rfc5424Formatter to read decnet_component from LogRecord
and use it as RFC 5424 APP-NAME field (falls back to 'decnet')
- Add get_logger(component) factory in decnet/logging/__init__.py
with _ComponentFilter that injects decnet_component on each record
- Wire all five layers to their component tag:
cli -> 'cli', engine -> 'engine', api -> 'api' (api.py, ingester,
routers), mutator -> 'mutator', collector -> 'collector'
- Add structured INFO/DEBUG/WARNING/ERROR log calls throughout each
layer per the defined vocabulary; DEBUG calls are suppressed unless
DECNET_DEVELOPER=true
- Add tests/test_logging.py covering factory, filter, formatter
component-awareness, fallback behaviour, and level gating