DECNET/development/docs/TRACING.md
anti 70d8ffc607 feat: complete OTEL tracing across all services with pipeline bridge and docs
Extends tracing to every remaining module: all 23 API route handlers,
correlation engine, sniffer (fingerprint/p0f/syslog), prober (jarm/hassh/tcpfp),
profiler behavioral analysis, logging subsystem, engine, and mutator.

Bridges the ingester→SSE trace gap by persisting trace_id/span_id columns on
the logs table and creating OTEL span links in the SSE endpoint. Adds log-trace
correlation via _TraceContextFilter injecting otel_trace_id into Python LogRecords.

Includes development/docs/TRACING.md with full span reference (76 spans),
pipeline propagation architecture, quick start guide, and troubleshooting.
2026-04-16 00:58:08 -04:00


Distributed Tracing

DECNET supports OpenTelemetry (OTEL) distributed tracing across all of its services. Tracing is gated by the DECNET_DEVELOPER_TRACING environment variable and is off by default. When it is disabled there is zero overhead: no OTEL imports occur, @traced returns the original unwrapped function, and no middleware is installed.

Quick Start

```bash
# 1. Start Jaeger (OTLP receiver on :4317, UI on :16686)
docker compose -f development/docker-compose.otel.yml up -d

# 2. Run DECNET with tracing enabled
DECNET_DEVELOPER_TRACING=true decnet web

# 3. Open the Jaeger UI (service name is "decnet")
open http://localhost:16686
```

| Variable | Default | Purpose |
|---|---|---|
| `DECNET_DEVELOPER_TRACING` | `false` | Enable/disable all tracing |
| `DECNET_OTEL_ENDPOINT` | `http://localhost:4317` | OTLP gRPC exporter target |

Architecture

The core module is decnet/telemetry.py. All tracing flows through it.

| Export | Purpose |
|---|---|
| `setup_tracing(app)` | Init TracerProvider, instrument FastAPI, enable log-trace correlation |
| `shutdown_tracing()` | Flush and shut down the TracerProvider |
| `get_tracer(component)` | Return an OTEL Tracer or `_NoOpTracer` when disabled |
| `@traced(name)` | Decorator wrapping sync/async functions in spans (no-op when disabled) |
| `wrap_repository(repo)` | Dynamic `__getattr__` proxy adding `db.*` spans to every async method |
| `inject_context(record)` | Embed W3C trace context into a JSON record under `_trace` |
| `extract_context(record)` | Recover trace context from `_trace` and remove it from the record |
| `start_span_with_context(tracer, name, ctx)` | Start a span as child of an extracted context |

TracerProvider config: Resource(service.name=decnet, service.version=0.2.0), BatchSpanProcessor, OTLP gRPC exporter.
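
The configuration above might be assembled roughly as follows. This is a sketch under assumptions, not DECNET's actual setup_tracing(): the names `resolve_endpoint` and `build_provider` are hypothetical, and `insecure=True` assumes a plaintext local collector such as the Quick Start Jaeger. The OTEL imports sit inside the function so that importing the module alone pulls in nothing.

```python
import os

DEFAULT_ENDPOINT = "http://localhost:4317"


def resolve_endpoint() -> str:
    """Return the OTLP gRPC target, falling back to the documented default."""
    return os.environ.get("DECNET_OTEL_ENDPOINT", DEFAULT_ENDPOINT)


def build_provider():
    """Hypothetical helper: build and register a TracerProvider."""
    # Imported lazily so that importing this module does not load the OTEL SDK.
    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    resource = Resource.create(
        {"service.name": "decnet", "service.version": "0.2.0"}
    )
    provider = TracerProvider(resource=resource)
    # insecure=True is an assumption: plaintext gRPC to a local collector.
    exporter = OTLPSpanExporter(endpoint=resolve_endpoint(), insecure=True)
    # BatchSpanProcessor queues finished spans and exports them in batches.
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
    return provider
```

Calling `provider.shutdown()` on the returned object flushes any queued spans, which is presumably what shutdown_tracing() relies on.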

When disabled: _NoOpTracer and _NoOpSpan stubs are returned. No OTEL SDK packages are imported. The @traced decorator returns the original function object at decoration time.
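
A minimal sketch of how a decorator can be gated at decoration time is shown below. It is illustrative only and is not the code in decnet/telemetry.py: `tracing_enabled` is a hypothetical helper, and treating only the string "true" (case-insensitive) as enabled is an assumption.

```python
import functools
import inspect
import os


def tracing_enabled(environ=None) -> bool:
    """Assumed rule: only the value "true" (any case) enables tracing."""
    env = os.environ if environ is None else environ
    return env.get("DECNET_DEVELOPER_TRACING", "false").strip().lower() == "true"


def traced(name: str):
    """Wrap a sync or async function in a span, or do nothing when disabled."""

    def decorator(func):
        if not tracing_enabled():
            # Disabled: hand back the very same function object, no wrapper.
            return func

        # Enabled: import OTEL only now, so the disabled path never loads it.
        from opentelemetry import trace

        tracer = trace.get_tracer("decnet")

        if inspect.iscoroutinefunction(func):

            @functools.wraps(func)
            async def async_wrapper(*args, **kwargs):
                with tracer.start_as_current_span(name):
                    return await func(*args, **kwargs)

            return async_wrapper

        @functools.wraps(func)
        def sync_wrapper(*args, **kwargs):
            with tracer.start_as_current_span(name):
                return func(*args, **kwargs)

        return sync_wrapper

    return decorator


def ping() -> str:
    """Tiny demo function used to show the disabled path."""
    return "pong"
```

With the variable unset, `traced("demo.ping")(ping) is ping` holds, so decorated functions carry no extra call frame when tracing is off.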

Pipeline Trace Propagation

The DECNET data pipeline is decoupled through JSON files and the database, which normally breaks trace continuity. Four mechanisms bridge the gaps:

  1. Collector → JSON: inject_context() embeds W3C traceparent/tracestate into each JSON log record under a _trace key.
  2. JSON → Ingester: extract_context() recovers the parent context. The ingester creates ingester.process_record as a child span, preserving the collector→ingester parent-child relationship.
  3. Ingester → DB: Before calling repo.add_log(), the ingester attaches the current span's trace_id and span_id to the record so that they are persisted as columns on the logs table.
  4. DB → SSE: The SSE endpoint reads trace_id/span_id from log rows and creates OTEL span links (FOLLOWS_FROM) on sse.emit_logs, connecting the read path back to the original ingestion traces.
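
Steps 1 and 2 can be illustrated with a dependency-free round trip of the W3C traceparent format (`00-<32 hex trace id>-<16 hex span id>-<2 hex flags>`). The functions `inject_traceparent` and `extract_traceparent` are hypothetical stand-ins with different signatures from DECNET's inject_context() and extract_context(), which presumably delegate to the OTEL propagators instead of formatting the header by hand.

```python
import json
from typing import Optional


def inject_traceparent(record: dict, trace_id: int, span_id: int,
                       sampled: bool = True) -> dict:
    """Embed a W3C traceparent string under the record's _trace key."""
    flags = "01" if sampled else "00"
    record["_trace"] = {
        "traceparent": f"00-{trace_id:032x}-{span_id:016x}-{flags}"
    }
    return record


def extract_traceparent(record: dict) -> Optional[dict]:
    """Remove _trace from the record and return the parsed parent context."""
    carrier = record.pop("_trace", None)
    if not isinstance(carrier, dict):
        return None
    parts = carrier.get("traceparent", "").split("-")
    if len(parts) != 4:
        return None
    _version, trace_hex, span_hex, flags = parts
    if len(trace_hex) != 32 or len(span_hex) != 16:
        return None
    try:
        return {
            "trace_id": int(trace_hex, 16),
            "span_id": int(span_hex, 16),
            "sampled": bool(int(flags, 16) & 0x01),
        }
    except ValueError:
        return None


# Collector side: serialise one log record with its trace context.
line = json.dumps(inject_traceparent({"msg": "login"}, 0xABC, 0x1))

# Ingester side: parse the line, recover the parent, and get a clean record.
parsed = json.loads(line)
parent = extract_traceparent(parsed)
```

After extraction `parsed` no longer contains `_trace`, so the bookkeeping key does not leak into the stored log row.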

Log-trace correlation: _TraceContextFilter (installed by enable_trace_context()) injects otel_trace_id and otel_span_id into Python LogRecord objects, bridging structured logs with trace context.
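
A filter of this kind could look like the sketch below. It is not DECNET's _TraceContextFilter: the class takes a hypothetical `get_ids` callable so the example runs without OTEL installed, whereas a real implementation would presumably read the IDs from the current span's context. The `install` helper and the format string are also illustrative.

```python
import logging
from typing import Callable, Tuple


class TraceContextFilter(logging.Filter):
    """Attach otel_trace_id / otel_span_id attributes to every LogRecord."""

    def __init__(self, get_ids: Callable[[], Tuple[int, int]]):
        super().__init__()
        # get_ids is a hypothetical hook returning (trace_id, span_id) as ints.
        self._get_ids = get_ids

    def filter(self, record: logging.LogRecord) -> bool:
        trace_id, span_id = self._get_ids()
        # Zero-padded lowercase hex, the same form the W3C headers use.
        record.otel_trace_id = f"{trace_id:032x}"
        record.otel_span_id = f"{span_id:016x}"
        return True  # never drop the record, only annotate it


def install(logger: logging.Logger, get_ids) -> logging.Handler:
    """Add a stream handler whose output includes the injected IDs."""
    handler = logging.StreamHandler()
    handler.addFilter(TraceContextFilter(get_ids))
    handler.setFormatter(logging.Formatter(
        "%(levelname)s trace=%(otel_trace_id)s span=%(otel_span_id)s %(message)s"
    ))
    logger.addHandler(handler)
    return handler
```

Attaching the filter to the handler rather than to a single logger means records propagated from child loggers are annotated as well.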

Span Reference

API Endpoints (23 spans)

| Span | Endpoint |
|---|---|
| `api.login` | `POST /auth/login` |
| `api.change_password` | `POST /auth/change-password` |
| `api.get_logs` | `GET /logs` |
| `api.get_logs_histogram` | `GET /logs/histogram` |
| `api.get_bounties` | `GET /bounty` |
| `api.get_attackers` | `GET /attackers` |
| `api.get_attacker_detail` | `GET /attackers/{uuid}` |
| `api.get_attacker_commands` | `GET /attackers/{uuid}/commands` |
| `api.get_stats` | `GET /stats` |
| `api.get_deckies` | `GET /fleet/deckies` |
| `api.deploy_deckies` | `POST /fleet/deploy` |
| `api.mutate_decky` | `POST /fleet/mutate/{decky_id}` |
| `api.update_mutate_interval` | `POST /fleet/mutate-interval/{decky_id}` |
| `api.get_config` | `GET /config` |
| `api.update_deployment_limit` | `PUT /config/deployment-limit` |
| `api.update_global_mutation_interval` | `PUT /config/global-mutation-interval` |
| `api.create_user` | `POST /config/users` |
| `api.delete_user` | `DELETE /config/users/{uuid}` |
| `api.update_user_role` | `PUT /config/users/{uuid}/role` |
| `api.reset_user_password` | `PUT /config/users/{uuid}/password` |
| `api.reinit` | `POST /config/reinit` |
| `api.get_health` | `GET /health` |
| `api.stream_events` | `GET /stream` |

DB Layer (dynamic)

Every async method on BaseRepository is automatically wrapped by TracedRepository as db.<method_name> (e.g. db.add_log, db.get_attackers, db.upsert_attacker).
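
The proxy idea can be sketched as follows. This is an illustration of the `__getattr__` technique, not DECNET's TracedRepository: `span_factory`, `recording_span`, and `FakeRepo` are hypothetical, and the span is simulated by a context manager that records names so the example runs without OTEL.

```python
import asyncio
import contextlib
import functools
import inspect


class TracedRepository:
    """Forward attribute access to a repository, wrapping async methods."""

    def __init__(self, repo, span_factory):
        self._repo = repo
        # span_factory(name) must return a context manager (hypothetical hook).
        self._span_factory = span_factory

    def __getattr__(self, name):
        # Only called for attributes not found on the proxy itself.
        attr = getattr(self._repo, name)
        if not inspect.iscoroutinefunction(attr):
            return attr  # sync methods and plain attributes pass through

        @functools.wraps(attr)
        async def wrapper(*args, **kwargs):
            with self._span_factory(f"db.{name}"):
                return await attr(*args, **kwargs)

        return wrapper


recorded = []


@contextlib.contextmanager
def recording_span(name):
    """Stand-in for tracer.start_as_current_span: just remembers the name."""
    recorded.append(name)
    yield


class FakeRepo:
    """Tiny stand-in repository with one async and one sync method."""

    async def add_log(self, row):
        return row["id"]

    def table_name(self):
        return "logs"
```

For example, `asyncio.run(TracedRepository(FakeRepo(), recording_span).add_log({"id": 7}))` returns the row id and records a single `db.add_log` entry, while `table_name()` is forwarded untouched. Because wrapping happens on attribute lookup, methods added to the repository later are traced without further changes.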

Collector

| Span | Type |
|---|---|
| `collector.stream_container` | `@traced` |
| `collector.event` | inline |

Ingester

| Span | Type |
|---|---|
| `ingester.process_record` | inline (with parent context) |
| `ingester.extract_bounty` | `@traced` |

Profiler

| Span | Type |
|---|---|
| `profiler.incremental_update` | `@traced` |
| `profiler.update_profiles` | `@traced` |
| `profiler.process_ip` | inline |
| `profiler.timing_stats` | `@traced` |
| `profiler.classify_behavior` | `@traced` |
| `profiler.detect_tools_from_headers` | `@traced` |
| `profiler.phase_sequence` | `@traced` |
| `profiler.sniffer_rollup` | `@traced` |
| `profiler.build_behavior_record` | `@traced` |
| `profiler.behavior_summary` | inline |

Sniffer

| Span | Type |
|---|---|
| `sniffer.worker` | `@traced` |
| `sniffer.sniff_loop` | `@traced` |
| `sniffer.tcp_syn_fingerprint` | inline |
| `sniffer.tls_client_hello` | inline |
| `sniffer.tls_server_hello` | inline |
| `sniffer.tls_certificate` | inline |
| `sniffer.parse_client_hello` | `@traced` |
| `sniffer.parse_server_hello` | `@traced` |
| `sniffer.parse_certificate` | `@traced` |
| `sniffer.ja3` | `@traced` |
| `sniffer.ja3s` | `@traced` |
| `sniffer.ja4` | `@traced` |
| `sniffer.ja4s` | `@traced` |
| `sniffer.session_resumption_info` | `@traced` |
| `sniffer.p0f_guess_os` | `@traced` |
| `sniffer.write_event` | `@traced` |

Prober

| Span | Type |
|---|---|
| `prober.worker` | `@traced` |
| `prober.discover_attackers` | `@traced` |
| `prober.probe_cycle` | `@traced` |
| `prober.jarm_phase` | `@traced` |
| `prober.hassh_phase` | `@traced` |
| `prober.tcpfp_phase` | `@traced` |
| `prober.jarm_hash` | `@traced` |
| `prober.jarm_send_probe` | `@traced` |
| `prober.hassh_server` | `@traced` |
| `prober.hassh_ssh_connect` | `@traced` |
| `prober.tcp_fingerprint` | `@traced` |
| `prober.tcpfp_send_syn` | `@traced` |

Engine

| Span | Type |
|---|---|
| `engine.deploy` | `@traced` |
| `engine.teardown` | `@traced` |
| `engine.compose_with_retry` | `@traced` |
engine.compose_with_retry @traced

Mutator

| Span | Type |
|---|---|
| `mutator.mutate_decky` | `@traced` |
| `mutator.mutate_all` | `@traced` |
| `mutator.watch_loop` | `@traced` |

Correlation

| Span | Type |
|---|---|
| `correlation.ingest_file` | `@traced` |
| `correlation.ingest_file.summary` | inline |
| `correlation.traversals` | `@traced` |
| `correlation.report_json` | `@traced` |
| `correlation.traversal_syslog_lines` | `@traced` |

Logging

| Span | Type |
|---|---|
| `logging.init_file_handler` | `@traced` |
| `logging.probe_log_target` | `@traced` |

SSE

| Span | Type |
|---|---|
| `sse.emit_logs` | inline (with span links to ingestion traces) |
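
One way the links on sse.emit_logs could be built from log rows is sketched below. Several things here are assumptions rather than documented behavior: that rows are dicts, that trace_id and span_id are stored as hex strings, and that duplicates are collapsed. The helper names are hypothetical; `Link`, `SpanContext`, and `TraceFlags` are the standard OTEL API types.

```python
from typing import Iterable, List, Optional, Tuple


def row_ids(row: dict) -> Optional[Tuple[int, int]]:
    """Parse hex trace_id/span_id columns; None when a row has no usable IDs."""
    trace_hex, span_hex = row.get("trace_id"), row.get("span_id")
    if not trace_hex or not span_hex:
        return None
    try:
        return int(trace_hex, 16), int(span_hex, 16)
    except (TypeError, ValueError):
        return None


def unique_ids(rows: Iterable[dict]) -> List[Tuple[int, int]]:
    """Collect distinct (trace_id, span_id) pairs, keeping first-seen order."""
    seen: List[Tuple[int, int]] = []
    for row in rows:
        ids = row_ids(row)
        if ids is not None and ids not in seen:
            seen.append(ids)
    return seen


def build_links(rows: Iterable[dict]):
    """Turn rows into OTEL span links pointing at the ingestion spans."""
    # Imported lazily so the helpers above work without OTEL installed.
    from opentelemetry.trace import Link, SpanContext, TraceFlags

    return [
        Link(SpanContext(
            trace_id=trace_id,
            span_id=span_id,
            is_remote=True,  # the linked span was created by another process
            trace_flags=TraceFlags(TraceFlags.SAMPLED),
        ))
        for trace_id, span_id in unique_ids(rows)
    ]
```

The resulting list can be passed as the `links` argument of `tracer.start_as_current_span("sse.emit_logs", links=...)`. Rows written before the columns existed simply contribute no link.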

Adding New Traces

```python
from decnet.telemetry import traced as _traced, get_tracer as _get_tracer

# Decorator (preferred for entire functions)
@_traced("component.operation")
async def my_function():
    ...

# Inline (for sub-sections within a function)
with _get_tracer("component").start_as_current_span("component.sub_op") as span:
    span.set_attribute("key", "value")
    ...
```

Naming convention: component.operation (e.g. prober.jarm_hash, profiler.timing_stats).

Troubleshooting

| Symptom | Check |
|---|---|
| No traces in Jaeger | Is `DECNET_DEVELOPER_TRACING=true` set? Is Jaeger running and listening on port 4317? |
| `ImportError` on OTEL packages | Run `pip install -e ".[dev]"` (OTEL is in optional deps) |
| Partial traces (ingester orphaned) | Verify that the `_trace` key is present in JSON log file records |
| SSE spans have no links | Confirm that the `trace_id`/`span_id` columns exist in the `logs` table |
| Performance concern | BatchSpanProcessor adds ~1ms per span; zero overhead when disabled |
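
For the orphaned-ingester case, a quick check such as the following can show whether records carry the `_trace` key. It is a sketch that assumes the log file holds one JSON object per line; `count_trace_keys` is a hypothetical helper, not part of DECNET.

```python
import json
from typing import Iterable, Tuple


def count_trace_keys(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (records with _trace, records without) for JSON-lines input."""
    with_trace = without_trace = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if isinstance(record, dict) and "_trace" in record:
            with_trace += 1
        else:
            without_trace += 1
    return with_trace, without_trace
```

Typical use is `with open(path, encoding="utf-8") as fh: print(count_trace_keys(fh))`, where `path` is whichever JSON log file the collector writes. A result of zero records with `_trace` suggests the collector ran with tracing disabled.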