Add Unit 17: Performance story, tracing/profiling, testing

- Performance-Story.md: war-story narrative from the 9 Locust CSVs
  in development/profiles/. Real numbers: from a 5 RPS baseline and
  a 49%-failure regression to 1585 RPS / 0% failure at 1500
  concurrent users.
  Quantifies the OTel tax (~13% RPS, doubles p50) and the single-
  core collapse (34x throughput loss). Calls out SQLite writer lock
  as the remaining bottleneck.
- Tracing-and-Profiling.md: how to flip OpenTelemetry, Pyinstrument,
  py-spy, memray, and snakeviz on and off. Documents the local OTel
  collector compose file and extras groups. Production warning.
- Testing-and-CI.md: pytest layout, markers, default opt-outs, the
  Locust stress recipe, project rules (tests for every feature,
  never ship broken code, green before commit), and the no-scapy-
  sniff-in-TestClient-lifespan rule.
2026-04-18 06:08:46 -04:00
parent 9de6ea3dff
commit 9c107c06da
3 changed files with 490 additions and 0 deletions

Performance-Story.md Normal file

@@ -0,0 +1,172 @@
# Performance Story
> A war story. Not a spec sheet. If you want the knobs, see
> [Tracing and Profiling](Tracing-and-Profiling). If you want the env vars,
> see [Environment variables](Environment-Variables).
DECNET is a honeypot. Honeypots get hammered. If the ingest path melts
under load, we lose attacker data — which is the only thing we care
about. This page is the story of how we got the API from "falls over at
200 users" to "holds 1.6k RPS at 1500 concurrent users" and what that
cost in blood.
All numbers below are **real**. They come from the nine Locust CSVs in
`development/profiles/`. No fabrication.
---
## Headline table
All runs hit the same FastAPI surface (`/api/v1/logs`, `/healthz`,
`/api/v1/attackers`, etc.) via Locust. The Aggregated row is what
matters.
| Profile | Users / phase | Config | Requests | Fails | p50 (ms) | p95 (ms) | p99 (ms) | Avg (ms) | RPS |
|---|---|---|---|---|---|---|---|---|---|
| `profile_3106d0313507f016_locust.csv` | baseline | early code, tracing on | 7 410 | 20 | 740 | 87 000 | 187 000 | 12 999.71 | 5.5 |
| `profile_255c2e5.csv` | mid | regression, tracing on | 1 042 | 514 | 6 700 | 150 000 | 186 000 | 58 835.59 | 2.3 |
| `profile_2dd86fb.csv` | mid | tracing on, post-fix | 6 012 | 0 | 240 | 134 000 | 194 000 | 16 217.04 | 2.4 |
| `profile_e967aaa.csv` | ~1000 | tracing on, cleanups | 259 381 | 0 | 300 | 1 600 | 2 200 | 514.41 | 934.3 |
| `profile_fb69a06.csv` | ~1000 | tracing on, tuned | 396 672 | 0 | 100 | 1 900 | 2 900 | 465.03 | 963.6 |
| `profile_1500_fb69a06.csv` | 1500 | tracing ON | 232 648 | 0 | 690 | 6 500 | 9 500 | 1 773.51 | 880.4 |
| `profile_1500_notracing_fb69a06.csv` | 1500 | tracing OFF | 277 214 | 0 | 340 | 5 700 | 8 400 | 1 489.08 | 992.7 |
| `profile_1500_notracing_12_workers_fb69a06.csv` | 1500 | tracing OFF, 12 uvicorn workers | 308 024 | 0 | 700 | 2 700 | 4 200 | 929.88 | 1 585.1 |
| `profile_1500_notracing_single_core_fb69a06.csv` | 1500 | tracing OFF, single core pin | 3 532 | 0 | 270 | 115 000 | 122 000 | 21 728.92 | 46.2 |
(p50/p95/p99 = Locust Median / 95%ile / 99%ile columns. RPS = Current
RPS at end of the run.)
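The headline rows can be re-extracted from the raw CSVs with a few lines of stdlib Python. This is a sketch, not project code: column names match recent Locust `*_stats.csv` releases and can drift between versions.

```python
import csv
import io

def aggregated_row(stats_csv: str) -> dict:
    """Return the Aggregated row of a Locust *_stats.csv as a dict.

    Column names ("Name", "Request Count", "95%", ...) match recent
    Locust releases; older versions differ slightly.
    """
    for row in csv.DictReader(io.StringIO(stats_csv)):
        if row.get("Name") == "Aggregated":
            return row
    raise ValueError("no Aggregated row found")

# Usage sketch against one of the nine profile CSVs:
# row = aggregated_row(open("development/profiles/profile_fb69a06.csv").read())
# print(row["Request Count"], row["95%"])
```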
---
## 1. The baseline: "it works, on Tuesdays"
The earliest usable profile is `profile_3106d0313507f016_locust.csv`.
7 410 requests, **20 failures**, and a p99 of **187 seconds**. You read
that right — the 99th percentile request took over three minutes to
come back. Current RPS at end of run: **5.5**.
We were not fast. We were not even slow in a respectable way.
`profile_255c2e5.csv` is worse: 1 042 requests, **514 failed** (49%
failure rate), p99 = 186 s, average 58.8 s per request. That is the
regression that proved our API could lock itself up completely when
everyone tried to write at once.
`profile_2dd86fb.csv` was the patch that stopped the bleeding: zero
failures, but still p95/p99 in the 100–200 s range. The API responded
to every request, eventually. That is not what anyone means by
"responded."
## 2. The turnaround: `e967aaa` and `fb69a06`
Then two commits changed everything.
`profile_e967aaa.csv`: **259 381 requests**, zero failures, p50=300 ms,
**p95=1.6 s, p99=2.2 s**, average 514 ms, **934 RPS**. Two orders of
magnitude better on tail latency, and roughly thirty-five times the
requests serviced.
`profile_fb69a06.csv` squeezed more out: **396 672 requests**, zero
failures, p50=100 ms, p95=1.9 s, p99=2.9 s, average 465 ms, **963
RPS**. This is the commit we pinned as our "healthy baseline." Every
1500-user run below is tagged `_fb69a06` because we wanted to measure
**load and config**, not code churn.
How? The usual suspects: proper DB connection pooling, eliminated a
hot-path N+1, switched the repository layer to the injected
`get_repository()` / `get_repo` pattern (see CLAUDE.md's DI rule), and
stopped synchronously fsync'ing on every insert.
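The injected-repository shape is simple enough to sketch. The function name `get_repository()` comes from the text above; everything inside it here is an assumption, a stand-in for the real pooled implementation, not DECNET's actual code.

```python
from functools import lru_cache
from typing import Protocol

class Repository(Protocol):
    def insert_log(self, entry: dict) -> None: ...

class SQLiteRepository:
    """Stand-in: the real class wraps a pooled connection."""
    def __init__(self) -> None:
        self.rows: list[dict] = []
    def insert_log(self, entry: dict) -> None:
        self.rows.append(entry)

@lru_cache(maxsize=1)
def get_repository() -> Repository:
    # One shared instance per process; callers never construct
    # SQLiteRepository directly (the CLAUDE.md DI rule).
    return SQLiteRepository()
```

The payoff of routing every caller through one factory: swapping the SQLite backend for MySQL touches a single function, and tests can clear the cache to inject a fake.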
## 3. 1500 users: the API holds
`profile_1500_fb69a06.csv` turns the screws: **1500 concurrent users**,
tracing ON, default uvicorn worker count. Result: **232 648 requests**,
zero failures, p50=690 ms, p95=6.5 s, p99=9.5 s, 880 RPS.
Zero failures at 1500 users is the first genuine win. Latency got
uglier — p95 jumped from 1.9 s to 6.5 s — but nothing fell over. The
system is now **throughput-limited, not stability-limited**. That is a
different class of problem.
## 4. What OpenTelemetry cost us
Compare `profile_1500_fb69a06.csv` vs `profile_1500_notracing_fb69a06.csv`.
Same code, same load, same host. Only difference:
`DECNET_DEVELOPER_TRACING=false`.
| Metric | Tracing ON | Tracing OFF | Delta |
|---|---|---|---|
| Total requests | 232 648 | 277 214 | **+19%** |
| p50 | 690 ms | 340 ms | **-51%** |
| p95 | 6 500 ms | 5 700 ms | -12% |
| p99 | 9 500 ms | 8 400 ms | -12% |
| Avg | 1 773 ms | 1 489 ms | -16% |
| RPS | 880.4 | 992.7 | +13% |
Auto-instrumented FastAPI tracing is **not free**. The median request
paid a ~350 ms tax and the API served ~16% fewer requests in the same
window. Tails are less affected because they are dominated by
I/O wait, not span overhead.
**Rule:** tracing stays off in production DECNET deployments. It is a
development lens. See [Tracing and Profiling](Tracing-and-Profiling).
## 5. Vertical scaling: 12 workers vs single core
`profile_1500_notracing_12_workers_fb69a06.csv`: tracing off, uvicorn
with **12 workers**. Result: **308 024 requests**, p50=700 ms, **p95=2.7
s, p99=4.2 s**, 1 585 RPS.
Going from default workers to 12 bought us: **+11% requests served
(end-of-run RPS up 60%), -53% p95, -50% p99**. The tail improvement is
the real prize — more workers means fewer requests queued behind a
slow one.
Now the punchline: `profile_1500_notracing_single_core_fb69a06.csv`.
Same config, pinned to **one core** via CPU affinity. Result: **3 532
requests total**, **p95=115 s, p99=122 s**, average 21.7 s, **46
RPS**.
Single-core is a **34x throughput collapse** vs 12-workers, and the
tail grows from 4 seconds to nearly two minutes. FastAPI + SQLite on
one core with 1500 concurrent clients is a queue, not a server.
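Little's law (L = λW) backs up the "queue, not a server" claim: with 1500 users in the system and 46 RPS of throughput, the implied mean time in system is in the tens of seconds, the same ballpark as the measured 21.7 s average (Locust think time and ramp-up account for the gap).

```python
# Little's law: L = lambda * W  =>  W = L / lambda
users = 1500        # concurrent Locust users in the system (L)
rps = 46.2          # end-of-run throughput (lambda)
wait_s = users / rps
print(f"{wait_s:.1f}")  # ~32.5 s implied mean time in system
```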
Vertical scaling holds, but only through worker processes. The GIL is
real.
## 6. Where is the bottleneck now?
Reading the 12-worker numbers: 1 585 RPS, p95=2.7 s, with zero
failures. That is good, but p95 should be far lower than 2.7 s for an
in-memory-ish workload. Candidates:
1. **SQLite single-writer lock.** All 12 workers share one
`attackers.db`. SQLite's WAL mode helps readers but writes still
serialize. Under `/api/v1/logs` write amplification we expect
queue-behind-writer stalls in exactly this latency envelope. The
MySQL backend exists for exactly this reason — see
[Database drivers](Database-Drivers).
2. **Python GIL on the aggregation hot path.** The single-core profile
proves the interpreter is CPU-bound at saturation. 12 workers side-
step the GIL only for independent requests — anything going through
a shared lock (DB, in-process cache) re-serializes.
3. **Network stack / event-loop wait** on the Locust side — less
   likely; we checked client CPU during the runs.
Best defensible guess: **SQLite writer lock first, GIL second.**
Switching the hot-write path to MySQL (or even `PRAGMA
journal_mode=WAL` + batched inserts) should move p95 under a second at
the same RPS. That work is scoped but not landed. See
`development/FUTURE.md` for the queue.
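The batched-insert half of that fix is sketchable with stdlib `sqlite3`. The win is one transaction (one fsync) per batch instead of one per request; the schema and names below are illustrative, not DECNET's real tables.

```python
import sqlite3

def make_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while the single writer works.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE TABLE IF NOT EXISTS logs (src TEXT, payload TEXT)")
    return conn

def insert_batch(conn: sqlite3.Connection, rows: list) -> None:
    # One transaction, one fsync, N rows -- instead of N transactions.
    with conn:
        conn.executemany("INSERT INTO logs VALUES (?, ?)", rows)
```

A worker would buffer incoming `/api/v1/logs` writes for a few hundred milliseconds, then flush one batch, trading a small ingest delay for far fewer writer-lock acquisitions.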
---
## tl;dr
- From 5 RPS (and a 49%-failure regression) to **1 585 RPS / 0% failure at 1500 concurrent users.**
- Tracing costs ~13% RPS and doubles p50. Keep it off in production.
- Workers matter. Single-core pinning = 46 RPS and two-minute tails.
- Next bottleneck: the single SQLite writer. Blame the database, as is tradition.
Related: [Design overview](Design-Overview) ·
[Logging](Logging-and-Syslog) ·
[Tracing and Profiling](Tracing-and-Profiling) ·
[Testing and CI](Testing-and-CI).

Testing-and-CI.md Normal file

@@ -0,0 +1,161 @@
# Testing and CI
Pytest is the only game in town. Every new feature ships with tests.
Never commit broken code. Those are house rules, not suggestions.
For the numbers that our stress tests produce, see
[Performance Story](Performance-Story). For the knobs used in dev, see
[Tracing and Profiling](Tracing-and-Profiling) and
[Environment variables](Environment-Variables).
---
## Layout
Everything lives under `tests/`:
```
tests/
conftest.py # global fixtures; forces DECNET_DEVELOPER_TRACING=false
api/ # FastAPI + repository integration
docker/ # image-build smoke (marker: docker)
live/ # subprocess-backed service tests (marker: live)
perf/ # per-request profiling harness + README
service_testing/ # protocol-level decoys (SSH, SMB, RDP, ...)
stress/ # Locust-driven load tests (marker: stress)
test_*.py # unit tests, flat layout
```
`tests/conftest.py` pins a few env vars for determinism:
- `DECNET_DEVELOPER=true`
- `DECNET_DEVELOPER_TRACING=false` (never trace in tests)
- `DECNET_DB_TYPE=sqlite`
---
## Running the fast suite
```bash
# activate the venv first — see ANTI's rule
source .venv/bin/activate
pytest
```
`pyproject.toml` sets the default opt-outs:
```ini
addopts = "-m 'not fuzz and not live and not stress and not bench and not docker' \
-v -q -x -n logical --dist loadscope"
```
Meaning: the default run **skips** `fuzz`, `live`, `stress`, `bench`,
and `docker`. Everything else runs in parallel with xdist,
distributed by test module (`--dist loadscope`).
## Markers
All markers are declared in `pyproject.toml`:
| Marker | Meaning | Opt in |
|---|---|---|
| `fuzz` | Hypothesis-based fuzz tests, slow | `pytest -m fuzz` |
| `live` | Live subprocess service tests | `pytest -m live` |
| `live_docker` | Live Docker containers (needs `DECNET_LIVE_DOCKER=1`) | `pytest -m live_docker` |
| `stress` | Locust load tests | `pytest -m stress` |
| `bench` | pytest-benchmark micro-benchmarks | `pytest -m bench` |
| `docker` | Tests that build and run Docker images | `pytest -m docker` |
Run **everything**, including the slow markers:
```bash
pytest -m ''
```
Run a single marker and nothing else:
```bash
pytest -m stress tests/stress/ -v -x -n0 -s
```
---
## Stress tests (Locust)
The stress suite is how we produced every number in
[Performance Story](Performance-Story).
Invocation:
```bash
STRESS_USERS=2000 STRESS_SPAWN_RATE=200 \
pytest -m stress tests/stress/ -v -x -n0 -s
```
Notes:
- `-n0` — **no xdist parallelism**. Locust wants to own the workers.
- `-s` — do not capture stdout; you want to see the RPS/latency stream.
- `-x` — stop on first failure. A locust failure here means the API
regressed.
- `STRESS_USERS` and `STRESS_SPAWN_RATE` are the knobs you tune. 2000
users at spawn rate 200 is our standard "can the API hold" check.
Output CSVs land in `development/profiles/` (or wherever the harness
is configured to write) — the same CSV format parsed into the perf
table on [Performance Story](Performance-Story).
Before you run stress tests against a live deploy: **turn tracing
off**. Profiles with `DECNET_DEVELOPER_TRACING=true` are not
production-representative. The 1500-user comparison proves this.
---
## Project rules (non-negotiable)
From `CLAUDE.md` at repo root:
1. **Every new feature must have pytests.** No exceptions. If you add
a collector, a parser, a service decoy, or a repository method,
ship tests alongside it.
2. **Never pass broken code.** "Broken" here means: does not run, does
not pass 100% of tests, leaves the suite in a red state. If the
bar is red when you commit, you are violating the rule.
3. **100% pass before commit.** Green suite, then `git commit`. Not
the other way around.
There is also a live rule worth calling out because it has bitten
more than one contributor:
> **No scapy sniff threads in TestClient lifespan tests.** Scapy's
> `sniff()` spawns a thread that does not clean up when FastAPI's
> TestClient tears down the lifespan context. Result: pytest hangs
> forever at teardown. If you need to verify sniffer behaviour under
> a live API, **use static source inspection** (import the module,
> assert on its attributes / dispatch table) rather than spinning up
> the real capture loop.
This is why `tests/test_sniffer_*.py` reaches into module internals
instead of just hitting an endpoint.
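A minimal version of the static-inspection approach, assuming the sniffer exposes a module-level dispatch table. The attribute name `DISPATCH` is hypothetical; substitute whatever the real module defines.

```python
import importlib

def load_dispatch(module_name: str, attr: str = "DISPATCH") -> dict:
    """Import the sniffer module and read its dispatch table without
    ever starting the capture loop (no sniff(), no thread)."""
    mod = importlib.import_module(module_name)
    table = getattr(mod, attr)
    assert isinstance(table, dict) and table, "dispatch table missing or empty"
    return table

# In a test (module and key names assumed):
# assert "ssh" in load_dispatch("decnet.sniffer")
```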
---
## Quick recipe for adding a feature
1. Write the failing test (pick the right marker — probably none, i.e.
default suite).
2. Implement the feature.
3. `pytest` until green.
4. If the feature touches the storage layer, use `get_repository()` or
the FastAPI `get_repo` dependency — **never** import
`SQLiteRepository` directly. See the DI rule in `CLAUDE.md`.
5. If the feature has a performance story, run the stress suite with
tracing off and archive the CSV under `development/profiles/`.
6. `git commit`.
Related: [Design overview](Design-Overview) ·
[Performance Story](Performance-Story) ·
[Tracing and Profiling](Tracing-and-Profiling) ·
[Environment variables](Environment-Variables) ·
[Logging](Logging-and-Syslog).

Tracing-and-Profiling.md Normal file

@@ -0,0 +1,157 @@
# Tracing and Profiling
Three lenses. All off by default. All cost real CPU. Turn on, measure,
turn off. For the perf history that justifies the defaults, see
[Performance Story](Performance-Story). For every env var on this page,
see [Environment variables](Environment-Variables).
---
## 1. OpenTelemetry (distributed tracing)
Auto-instruments FastAPI, the internal workers, and outbound DB/HTTP
calls. Emits OTLP over gRPC.
### Enable
```bash
export DECNET_DEVELOPER_TRACING=true
export DECNET_OTEL_ENDPOINT=http://localhost:4317 # default
```
### Bring up a local collector
A dev collector ships in the repo:
```bash
cd development
docker compose -f docker-compose.otel.yml up -d
# now point DECNET at it (default endpoint already matches)
DECNET_DEVELOPER_TRACING=true decnet deploy --mode unihost --deckies 2
```
Traces appear in whatever backend the compose file wires to
(otel-collector → Jaeger/Tempo, depending on your compose overrides).
### Install the extras
Tracing is an optional dependency group:
```bash
pip install -e '.[tracing]'
# or the umbrella:
pip install -e '.[dev]'
```
Without the extras, `DECNET_DEVELOPER_TRACING=true` is a no-op — the
telemetry module detects missing packages and degrades silently. That
behaviour is covered by `tests/test_telemetry.py`
(`TestTracingDisabled`).
### Cost
Measured, not guessed. From `development/profiles/`:
- 1500 concurrent users, tracing ON: **p50=690 ms, 880 RPS**
- same load, tracing OFF: **p50=340 ms, 993 RPS**
That is roughly a **13% RPS hit**, and **p50 doubles**. Do not ship
production deployments with this flag on. Full breakdown in
[Performance Story §4](Performance-Story).
---
## 2. Pyinstrument (per-request flamegraphs)
A sampling profiler mounted as ASGI middleware. Each request gets its
own HTML flamegraph.
### Enable
```bash
export DECNET_PROFILE_REQUESTS=true
export DECNET_PROFILE_DIR=profiles # default, relative to $PWD
DECNET_PROFILE_REQUESTS=true decnet deploy --mode unihost --deckies 1
```
Hit an endpoint, then:
```bash
ls profiles/ # one *.html per request
xdg-open profiles/<most-recent>.html
```
### Install the extras
```bash
pip install -e '.[profile]'
```
### Cost
Pyinstrument samples at kHz frequencies and writes HTML per request.
That is **non-trivial CPU** and disk pressure. Do not leave it on. It
is a "one request at a time, look at it under a microscope" tool, not
a production observability channel.
---
## 3. py-spy / memray / snakeviz (dev extras)
Optional profilers declared in `pyproject.toml` under the `profile`
extras group. Install with `pip install -e '.[profile]'`.
### py-spy (sampling profiler, no code changes)
```bash
# find the API pid
ps -ef | grep 'uvicorn.*decnet'
# record a 30s flamegraph
sudo py-spy record -o pyspy.svg --pid <pid> --duration 30
```
There is a helper at `scripts/profile/pyspy-attach.sh` that documents
the common invocations. If py-spy cannot attach (Linux ptrace
restrictions), use Pyinstrument instead — the script prints fallback
guidance.
### memray (memory allocation tracing)
```bash
scripts/profile/memray-api.sh # launches API under memray
memray flamegraph memray.bin # generate report
```
Good for hunting leaks in the long-running workers (collector,
attacker-worker, sniffer).
### snakeviz (cProfile viewer)
```bash
python -m cProfile -o out.prof -m decnet.cli services
snakeviz out.prof
```
Useful for CLI / one-shot paths. For the API, prefer Pyinstrument.
---
## Production warning
> Do **not** leave `DECNET_DEVELOPER_TRACING=true` or
> `DECNET_PROFILE_REQUESTS=true` on in a real DECNET deployment.
The numbers in [Performance Story](Performance-Story) are measured on a
single baseline commit (`fb69a06`) with every other variable held
still, and tracing alone cost ~13% throughput and doubled p50. In an
attacker-facing honeypot, a slow API means delayed detection, missed
correlations, and potentially a noticeable behavioural tell — none of
which you want.
Rule: flip on, collect, flip off, commit.
Related: [Environment variables](Environment-Variables) ·
[Design overview](Design-Overview) ·
[Logging](Logging-and-Syslog) ·
[Testing and CI](Testing-and-CI).