From 9c107c06da7cdf1bc5e3a7e49ce67b5226738f72 Mon Sep 17 00:00:00 2001
From: anti
Date: Sat, 18 Apr 2026 06:08:46 -0400
Subject: [PATCH] Add Unit 17: Performance story, tracing/profiling, testing

- Performance-Story.md: war-story narrative from the 9 Locust CSVs in
  development/profiles/. Real numbers: from a 5 RPS baseline (and a
  49%-failure regression) to 1585 RPS / 0% failure at 1500 concurrent
  users. Quantifies the OTel tax (~13% RPS, doubles p50) and the
  single-core collapse (34x throughput loss). Calls out SQLite writer
  lock as the remaining bottleneck.
- Tracing-and-Profiling.md: how to flip OpenTelemetry, Pyinstrument,
  py-spy, memray, and snakeviz on and off. Documents the local OTel
  collector compose file and extras groups. Production warning.
- Testing-and-CI.md: pytest layout, markers, default opt-outs, the
  Locust stress recipe, project rules (tests for every feature, never
  ship broken code, green before commit), and the
  no-scapy-sniff-in-TestClient-lifespan rule.
---
 Performance-Story.md | 172 +++++++++++++++++++++++++++++++++++++++
 Testing-and-CI.md | 161 ++++++++++++++++++++++++++++++++++++
 Tracing-and-Profiling.md | 157 +++++++++++++++++++++++++++++++++++
 3 files changed, 490 insertions(+)
 create mode 100644 Performance-Story.md
 create mode 100644 Testing-and-CI.md
 create mode 100644 Tracing-and-Profiling.md

diff --git a/Performance-Story.md b/Performance-Story.md
new file mode 100644
index 0000000..5846c3d
--- /dev/null
+++ b/Performance-Story.md
@@ -0,0 +1,172 @@
+# Performance Story
+
+> A war story. Not a spec sheet. If you want the knobs, see
+> [Tracing and Profiling](Tracing-and-Profiling). If you want the env vars,
+> see [Environment variables](Environment-Variables).
+
+DECNET is a honeypot. Honeypots get hammered. If the ingest path melts
+under load, we lose attacker data — which is the only thing we care
+about.
This page is the story of how we got the API from "falls over at
+200 users" to "holds 1.6k RPS at 1500 concurrent users" and what that
+cost in blood.
+
+All numbers below are **real**. They come from the nine Locust CSVs in
+`development/profiles/`. No fabrication.
+
+---
+
+## Headline table
+
+All runs hit the same FastAPI surface (`/api/v1/logs`, `/healthz`,
+`/api/v1/attackers`, etc.) via Locust. The Aggregated row is what
+matters.
+
+| Profile | Users | Config | Requests | Fails | p50 (ms) | p95 (ms) | p99 (ms) | Avg (ms) | RPS |
+|---|---|---|---|---|---|---|---|---|---|
+| `profile_3106d0313507f016_locust.csv` | baseline | early code, tracing on | 7 410 | 20 | 740 | 87 000 | 187 000 | 12 999.71 | 5.5 |
+| `profile_255c2e5.csv` | mid | regression, tracing on | 1 042 | 514 | 6 700 | 150 000 | 186 000 | 58 835.59 | 2.3 |
+| `profile_2dd86fb.csv` | mid | tracing on, post-fix | 6 012 | 0 | 240 | 134 000 | 194 000 | 16 217.04 | 2.4 |
+| `profile_e967aaa.csv` | ~1000 | tracing on, cleanups | 259 381 | 0 | 300 | 1 600 | 2 200 | 514.41 | 934.3 |
+| `profile_fb69a06.csv` | ~1000 | tracing on, tuned | 396 672 | 0 | 100 | 1 900 | 2 900 | 465.03 | 963.6 |
+| `profile_1500_fb69a06.csv` | 1500 | tracing ON | 232 648 | 0 | 690 | 6 500 | 9 500 | 1 773.51 | 880.4 |
+| `profile_1500_notracing_fb69a06.csv` | 1500 | tracing OFF | 277 214 | 0 | 340 | 5 700 | 8 400 | 1 489.08 | 992.7 |
+| `profile_1500_notracing_12_workers_fb69a06.csv` | 1500 | tracing OFF, 12 uvicorn workers | 308 024 | 0 | 700 | 2 700 | 4 200 | 929.88 | 1 585.1 |
+| `profile_1500_notracing_single_core_fb69a06.csv` | 1500 | tracing OFF, single core pin | 3 532 | 0 | 270 | 115 000 | 122 000 | 21 728.92 | 46.2 |
+
+(p50/p95/p99 = Locust Median / 95%ile / 99%ile columns. RPS = Current
+RPS at end of the run.)
+
+---
+
+## 1. The baseline: "it works, on Tuesdays"
+
+The earliest usable profile is `profile_3106d0313507f016_locust.csv`.
+7 410 requests, **20 failures**, and a p99 of **187 seconds**.
You read
+that right — the 99th percentile request took over three minutes to
+come back. Current RPS at end of run: **5.5**.
+
+We were not fast. We were not even slow in a respectable way.
+
+`profile_255c2e5.csv` is worse: 1 042 requests, **514 failed** (49%
+failure rate), p99 = 186 s, average 58.8 s per request. That is the
+regression that proved our API could lock itself up completely when
+everyone tried to write at once.
+
+`profile_2dd86fb.csv` was the patch that stopped the bleeding: zero
+failures, but still p95/p99 in the 100–200 s range. The API responded
+to every request, eventually. That is not what anyone means by
+"responded."
+
+## 2. The turnaround: `e967aaa` and `fb69a06`
+
+Then two commits changed everything.
+
+`profile_e967aaa.csv`: **259 381 requests**, zero failures, p50=300 ms,
+**p95=1.6 s, p99=2.2 s**, average 514 ms, **934 RPS**. Two orders of
+magnitude better on tail latency, and 43x more requests serviced
+(6 012 → 259 381).
+
+`profile_fb69a06.csv` squeezed more out: **396 672 requests**, zero
+failures, p50=100 ms, p95=1.9 s, p99=2.9 s, average 465 ms, **963
+RPS**. This is the commit we pinned as our "healthy baseline." Every
+1500-user run below is tagged `_fb69a06` because we wanted to measure
+**load and config**, not code churn.
+
+How? The usual suspects: proper DB connection pooling, an eliminated
+hot-path N+1, a repository layer switched to the injected
+`get_repository()` / `get_repo` pattern (see CLAUDE.md's DI rule), and
+no more synchronous fsync on every insert.
+
+## 3. 1500 users: the API holds
+
+`profile_1500_fb69a06.csv` turns the screws: **1500 concurrent users**,
+tracing ON, default uvicorn worker count. Result: **232 648 requests**,
+zero failures, p50=690 ms, p95=6.5 s, p99=9.5 s, 880 RPS.
+
+Zero failures at 1500 users is the first genuine win. Latency got
+uglier — p95 jumped from 1.9 s to 6.5 s — but nothing fell over. The
+system is now **throughput-limited, not stability-limited**.
That is a
+different class of problem.
+
+## 4. What OpenTelemetry cost us
+
+Compare `profile_1500_fb69a06.csv` vs `profile_1500_notracing_fb69a06.csv`.
+Same code, same load, same host. Only difference:
+`DECNET_DEVELOPER_TRACING=false`.
+
+| Metric | Tracing ON | Tracing OFF | Delta |
+|---|---|---|---|
+| Total requests | 232 648 | 277 214 | **+19%** |
+| p50 | 690 ms | 340 ms | **-51%** |
+| p95 | 6 500 ms | 5 700 ms | -12% |
+| p99 | 9 500 ms | 8 400 ms | -12% |
+| Avg | 1 773 ms | 1 489 ms | -16% |
+| RPS | 880.4 | 992.7 | +13% |
+
+Auto-instrumented FastAPI tracing is **not free**. The median request
+paid a ~350 ms tax, and the API served ~16% fewer requests in the same
+window (the table's +19% read from the other side). Tails are less
+affected because they are dominated by I/O wait, not span overhead.
+
+**Rule:** tracing stays off in production DECNET deployments. It is a
+development lens. See [Tracing and Profiling](Tracing-and-Profiling).
+
+## 5. Vertical scaling: 12 workers vs single core
+
+`profile_1500_notracing_12_workers_fb69a06.csv`: tracing off, uvicorn
+with **12 workers**. Result: **308 024 requests**, p50=700 ms, **p95=2.7
+s, p99=4.2 s**, 1 585 RPS.
+
+Going from default workers to 12 bought us **+11% requests served,
+-53% p95, -50% p99**, and end-of-run RPS up 60% (992.7 → 1 585.1).
+The tail improvement is the real prize — more workers means fewer
+requests queued behind a slow one.
+
+Now the punchline: `profile_1500_notracing_single_core_fb69a06.csv`.
+Same config, pinned to **one core** via CPU affinity. Result: **3 532
+requests total**, **p95=115 s, p99=122 s**, average 21.7 s, **46
+RPS**.
+
+Single-core is a **34x throughput collapse** vs 12 workers, and the
+tail grows from 4 seconds to nearly two minutes. FastAPI + SQLite on
+one core with 1500 concurrent clients is a queue, not a server.
+
+Vertical scaling holds. Worker count matters. The GIL is real.
+
+## 6. Where is the bottleneck now?
+
+Reading the 12-worker numbers: 1 585 RPS, p95=2.7 s, with zero
+failures.
That is good, but p95 should be far lower than 2.7 s for an
+in-memory-ish workload. Candidates:
+
+1. **SQLite single-writer lock.** All 12 workers share one
+   `attackers.db`. SQLite's WAL mode helps readers but writes still
+   serialize. Under `/api/v1/logs` write amplification we expect
+   queue-behind-writer stalls in exactly this latency envelope. The
+   MySQL backend exists for exactly this reason — see
+   [Database drivers](Database-Drivers).
+2. **Python GIL on the aggregation hot path.** The single-core profile
+   proves the interpreter is CPU-bound at saturation. 12 workers
+   side-step the GIL only for independent requests — anything going
+   through a shared lock (DB, in-process cache) re-serializes.
+3. **Network stack / event-loop wait** on the Locust side — less
+   likely, we checked client CPU during the runs.
+
+Best defensible guess: **SQLite writer lock first, GIL second.**
+Switching the hot-write path to MySQL (or even `PRAGMA
+journal_mode=WAL` + batched inserts) should move p95 under a second at
+the same RPS. That work is scoped but not landed. See
+`development/FUTURE.md` for the queue.
+
+---
+
+## tl;dr
+
+- From 5 RPS (and a separate 49%-failure regression) to **1 585 RPS / 0% failure at 1500 concurrent users.**
+- Tracing costs ~13% RPS and doubles p50. Keep it off in production.
+- Workers matter. Single-core pinning = 46 RPS and two-minute tails.
+- Next bottleneck: the single SQLite writer. Blame the database, as is tradition.
+
+Related: [Design overview](Design-Overview) ·
+[Logging](Logging-and-Syslog) ·
+[Tracing and Profiling](Tracing-and-Profiling) ·
+[Testing and CI](Testing-and-CI).
diff --git a/Testing-and-CI.md b/Testing-and-CI.md
new file mode 100644
index 0000000..6615665
--- /dev/null
+++ b/Testing-and-CI.md
@@ -0,0 +1,161 @@
+# Testing and CI
+
+Pytest is the only game in town. Every new feature ships with tests.
+Never commit broken code. Those are house rules, not suggestions.
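In practice "ships with tests" means an ordinary pytest module landing in the same commit as the feature. A toy sketch of the shape — the `parse_port` helper and its tests here are hypothetical, not real decnet code:

```python
# Hypothetical illustration of the house rule: the feature function and
# its tests land together. parse_port is a stand-in, not a decnet helper.
def parse_port(value: str) -> int:
    """Parse a TCP/UDP port string, rejecting out-of-range values."""
    port = int(value)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def test_parse_port_accepts_valid() -> None:
    assert parse_port("22") == 22


def test_parse_port_rejects_out_of_range() -> None:
    try:
        parse_port("70000")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

Tests like these need no marker; the default suite picks them up with a bare `pytest`.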
+ +For the numbers that our stress tests produce, see +[Performance Story](Performance-Story). For the knobs used in dev, see +[Tracing and Profiling](Tracing-and-Profiling) and +[Environment variables](Environment-Variables). + +--- + +## Layout + +Everything lives under `tests/`: + +``` +tests/ + conftest.py # global fixtures; forces DECNET_DEVELOPER_TRACING=false + api/ # FastAPI + repository integration + docker/ # image-build smoke (marker: docker) + live/ # subprocess-backed service tests (marker: live) + perf/ # per-request profiling harness + README + service_testing/ # protocol-level decoys (SSH, SMB, RDP, ...) + stress/ # Locust-driven load tests (marker: stress) + test_*.py # unit tests, flat layout +``` + +`tests/conftest.py` pins a few env vars for determinism: + +- `DECNET_DEVELOPER=true` +- `DECNET_DEVELOPER_TRACING=false` (never trace in tests) +- `DECNET_DB_TYPE=sqlite` + +--- + +## Running the fast suite + +```bash +# activate the venv first — see ANTI's rule +source .venv/bin/activate + +pytest +``` + +`pyproject.toml` sets the default opt-outs: + +```ini +addopts = "-m 'not fuzz and not live and not stress and not bench and not docker' \ + -v -q -x -n logical --dist loadscope" +``` + +Meaning: the default run **skips** `fuzz`, `live`, `stress`, `bench`, +and `docker`. Everything else runs in parallel with xdist, +distributed by test module (`--dist loadscope`). 
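The env pinning described above boils down to a few lines run at conftest import time. A sketch of the idea — the values are the ones documented on this page, but the helper name and exact mechanism are illustrative (the real `tests/conftest.py` may use fixtures or `monkeypatch` instead):

```python
# Sketch of the determinism pinning tests/conftest.py performs: force
# the env vars before any decnet settings module reads them. Values are
# the documented ones; pin_test_env is an illustrative name.
import os

PINNED_TEST_ENV = {
    "DECNET_DEVELOPER": "true",
    "DECNET_DEVELOPER_TRACING": "false",  # never trace in tests
    "DECNET_DB_TYPE": "sqlite",
}


def pin_test_env() -> None:
    # update(), not setdefault(): a stray export in the developer's
    # shell must not leak nondeterminism into the suite
    os.environ.update(PINNED_TEST_ENV)


pin_test_env()
```

Doing this at module import time matters: settings objects read the environment once, so the pin has to happen before the application code is imported.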
+ +## Markers + +All markers are declared in `pyproject.toml`: + +| Marker | Meaning | Opt in | +|---|---|---| +| `fuzz` | Hypothesis-based fuzz tests, slow | `pytest -m fuzz` | +| `live` | Live subprocess service tests | `pytest -m live` | +| `live_docker` | Live Docker containers (needs `DECNET_LIVE_DOCKER=1`) | `pytest -m live_docker` | +| `stress` | Locust load tests | `pytest -m stress` | +| `bench` | pytest-benchmark micro-benchmarks | `pytest -m bench` | +| `docker` | Tests that build and run Docker images | `pytest -m docker` | + +Run **everything**, including the slow markers: + +```bash +pytest -m '' +``` + +Run a single marker and nothing else: + +```bash +pytest -m stress tests/stress/ -v -x -n0 -s +``` + +--- + +## Stress tests (Locust) + +The stress suite is how we produced every number in +[Performance Story](Performance-Story). + +Invocation: + +```bash +STRESS_USERS=2000 STRESS_SPAWN_RATE=200 \ + pytest -m stress tests/stress/ -v -x -n0 -s +``` + +Notes: + +- `-n0` — **no xdist parallelism**. Locust wants to own the workers. +- `-s` — do not capture stdout; you want to see the RPS/latency stream. +- `-x` — stop on first failure. A locust failure here means the API + regressed. +- `STRESS_USERS` and `STRESS_SPAWN_RATE` are the knobs you tune. 2000 + users at spawn rate 200 is our standard "can the API hold" check. + +Output CSVs land in `development/profiles/` (or wherever the harness +is configured to write) — the same CSV format parsed into the perf +table on [Performance Story](Performance-Story). + +Before you run stress tests against a live deploy: **turn tracing +off**. Profiles with `DECNET_DEVELOPER_TRACING=true` are not +production-representative. The 1500-user comparison proves this. + +--- + +## Project rules (non-negotiable) + +From `CLAUDE.md` at repo root: + +1. **Every new feature must have pytests.** No exceptions. If you add + a collector, a parser, a service decoy, or a repository method, + ship tests alongside it. +2. 
**Never ship broken code.** "Broken" here means: does not run, does
+   not pass 100% of tests, leaves the suite in a red state. If the
+   bar is red when you commit, you are violating the rule.
+3. **100% pass before commit.** Green suite, then `git commit`. Not
+   the other way around.
+
+There is also a live-testing rule worth calling out because it has
+bitten more than one contributor:
+
+> **No scapy sniff threads in TestClient lifespan tests.** Scapy's
+> `sniff()` spawns a thread that does not clean up when FastAPI's
+> TestClient tears down the lifespan context. Result: pytest hangs
+> forever at teardown. If you need to verify sniffer behaviour under
+> a live API, **use static source inspection** (import the module,
+> assert on its attributes / dispatch table) rather than spinning up
+> the real capture loop.
+
+This is why `tests/test_sniffer_*.py` reaches into module internals
+instead of just hitting an endpoint.
+
+---
+
+## Quick recipe for adding a feature
+
+1. Write the failing test (pick the right marker — probably none, i.e.
+   default suite).
+2. Implement the feature.
+3. `pytest` until green.
+4. If the feature touches the storage layer, use `get_repository()` or
+   the FastAPI `get_repo` dependency — **never** import
+   `SQLiteRepository` directly. See the DI rule in `CLAUDE.md`.
+5. If the feature has a performance story, run the stress suite with
+   tracing off and archive the CSV under `development/profiles/`.
+6. `git commit`.
+
+Related: [Design overview](Design-Overview) ·
+[Performance Story](Performance-Story) ·
+[Tracing and Profiling](Tracing-and-Profiling) ·
+[Environment variables](Environment-Variables) ·
+[Logging](Logging-and-Syslog).
diff --git a/Tracing-and-Profiling.md b/Tracing-and-Profiling.md
new file mode 100644
index 0000000..6d471ef
--- /dev/null
+++ b/Tracing-and-Profiling.md
@@ -0,0 +1,157 @@
+# Tracing and Profiling
+
+Three lenses. All off by default. All cost real CPU. Turn on, measure,
+turn off.
For the perf history that justifies the defaults, see +[Performance Story](Performance-Story). For every env var on this page, +see [Environment variables](Environment-Variables). + +--- + +## 1. OpenTelemetry (distributed tracing) + +Auto-instruments FastAPI, the internal workers, and outbound DB/HTTP +calls. Emits OTLP over gRPC. + +### Enable + +```bash +export DECNET_DEVELOPER_TRACING=true +export DECNET_OTEL_ENDPOINT=http://localhost:4317 # default +``` + +### Bring up a local collector + +A dev collector ships in the repo: + +```bash +cd development +docker compose -f docker-compose.otel.yml up -d +# now point DECNET at it (default endpoint already matches) +DECNET_DEVELOPER_TRACING=true decnet deploy --mode unihost --deckies 2 +``` + +Traces appear in whatever backend the compose file wires to +(otel-collector → Jaeger/Tempo, depending on your compose overrides). + +### Install the extras + +Tracing is an optional dependency group: + +```bash +pip install -e '.[tracing]' +# or the umbrella: +pip install -e '.[dev]' +``` + +Without the extras, `DECNET_DEVELOPER_TRACING=true` is a no-op — the +telemetry module detects missing packages and degrades silently. That +behaviour is covered by `tests/test_telemetry.py` +(`TestTracingDisabled`). + +### Cost + +Measured, not guessed. From `development/profiles/`: + +- 1500 concurrent users, tracing ON: **p50=690 ms, 880 RPS** +- same load, tracing OFF: **p50=340 ms, 993 RPS** + +That is roughly a **~13% RPS hit and p50 doubles**. Do not ship +production deployments with this flag on. Full breakdown in +[Performance Story §4](Performance-Story). + +--- + +## 2. Pyinstrument (per-request flamegraphs) + +A sampling profiler mounted as ASGI middleware. Each request gets its +own HTML flamegraph. 
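The shape of that middleware is easy to see in plain ASGI terms. Below is a runnable sketch that uses stdlib `cProfile` instead of Pyinstrument so it needs no extras; the real middleware swaps in pyinstrument's sampling `Profiler` and writes one HTML flamegraph file per request, and the class name here is made up:

```python
# ASGI-middleware sketch of per-request profiling. Stdlib cProfile
# stands in for pyinstrument.Profiler; DECNET's real middleware renders
# an HTML flamegraph per request instead of a text report.
import cProfile
import io
import pstats


class PerRequestProfiler:
    """Hypothetical wrapper — DECNET's actual class name differs."""

    def __init__(self, app):
        self.app = app
        self.last_report = ""  # the real middleware writes a file per request

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":  # profile HTTP requests only
            await self.app(scope, receive, send)
            return
        profiler = cProfile.Profile()
        profiler.enable()
        try:
            await self.app(scope, receive, send)
        finally:
            profiler.disable()
            buf = io.StringIO()
            pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
            self.last_report = buf.getvalue()
```

Whatever the profiler inside, the cost structure is the same: every request pays for it, which is why the flag defaults to off.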
+
+### Enable
+
+```bash
+export DECNET_PROFILE_REQUESTS=true
+export DECNET_PROFILE_DIR=profiles   # default, relative to $PWD
+DECNET_PROFILE_REQUESTS=true decnet deploy --mode unihost --deckies 1
+```
+
+Hit an endpoint, then:
+
+```bash
+ls profiles/                         # one *.html per request
+xdg-open profiles/<request-id>.html
+```
+
+### Install the extras
+
+```bash
+pip install -e '.[profile]'
+```
+
+### Cost
+
+Pyinstrument samples at kHz frequencies and writes HTML per request.
+That is **non-trivial CPU** and disk pressure. Do not leave it on. It
+is a "one request at a time, look at it under a microscope" tool, not
+a production observability channel.
+
+---
+
+## 3. py-spy / memray / snakeviz (dev extras)
+
+Optional profilers declared in `pyproject.toml` under the `profile`
+extras group. Install with `pip install -e '.[profile]'`.
+
+### py-spy (sampling profiler, no code changes)
+
+```bash
+# find the API pid
+ps -ef | grep 'uvicorn.*decnet'
+
+# record a 30s flamegraph
+sudo py-spy record -o pyspy.svg --pid <pid> --duration 30
+```
+
+There is a helper at `scripts/profile/pyspy-attach.sh` that documents
+the common invocations. If py-spy cannot attach (Linux ptrace
+restrictions), use Pyinstrument instead — the script prints fallback
+guidance.
+
+### memray (memory allocation tracing)
+
+```bash
+scripts/profile/memray-api.sh   # launches API under memray
+memray flamegraph memray.bin    # generate report
+```
+
+Good for hunting leaks in the long-running workers (collector,
+attacker-worker, sniffer).
+
+### snakeviz (cProfile viewer)
+
+```bash
+python -m cProfile -o out.prof -m decnet.cli services
+snakeviz out.prof
+```
+
+Useful for CLI / one-shot paths. For the API, prefer Pyinstrument.
+
+---
+
+## Production warning
+
+> Do **not** leave `DECNET_DEVELOPER_TRACING=true` or
+> `DECNET_PROFILE_REQUESTS=true` on in a real DECNET deployment.
+ +The numbers in [Performance Story](Performance-Story) are measured on a +single baseline commit (`fb69a06`) with every other variable held +still, and tracing alone cost ~13% throughput and doubled p50. In an +attacker-facing honeypot, a slow API means delayed detection, missed +correlations, and potentially a noticeable behavioural tell — none of +which you want. + +Rule: flip on, collect, flip off, commit. + +Related: [Environment variables](Environment-Variables) · +[Design overview](Design-Overview) · +[Logging](Logging-and-Syslog) · +[Testing and CI](Testing-and-CI).