From 9c107c06da7cdf1bc5e3a7e49ce67b5226738f72 Mon Sep 17 00:00:00 2001
From: anti
Date: Sat, 18 Apr 2026 06:08:46 -0400
Subject: [PATCH] Add Unit 17: Performance story, tracing/profiling, testing

- Performance-Story.md: war-story narrative from the 9 Locust CSVs in
  development/profiles/. Real numbers: from a 5 RPS baseline (and a
  49%-failure regression) to 1585 RPS / 0% failure at 1500 concurrent
  users. Quantifies the OTel tax (~13% RPS, doubles p50) and the
  single-core collapse (34x throughput loss). Calls out SQLite writer
  lock as the remaining bottleneck.
- Tracing-and-Profiling.md: how to flip OpenTelemetry, Pyinstrument,
  py-spy, memray, and snakeviz on and off. Documents the local OTel
  collector compose file and extras groups. Production warning.
- Testing-and-CI.md: pytest layout, markers, default opt-outs, the
  Locust stress recipe, project rules (tests for every feature, never
  ship broken code, green before commit), and the
  no-scapy-sniff-in-TestClient-lifespan rule.
---
 Performance-Story.md | 172 +++++++++++++++++++++++++++++++++++++++
 Testing-and-CI.md | 161 ++++++++++++++++++++++++++++++++++++
 Tracing-and-Profiling.md | 157 +++++++++++++++++++++++++++++++++++
 3 files changed, 490 insertions(+)
 create mode 100644 Performance-Story.md
 create mode 100644 Testing-and-CI.md
 create mode 100644 Tracing-and-Profiling.md

diff --git a/Performance-Story.md b/Performance-Story.md
new file mode 100644
index 0000000..5846c3d
--- /dev/null
+++ b/Performance-Story.md
@@ -0,0 +1,172 @@
+# Performance Story
+
+> A war story. Not a spec sheet. If you want the knobs, see
+> [Tracing and Profiling](Tracing-and-Profiling). If you want the env vars,
+> see [Environment variables](Environment-Variables).
+
+DECNET is a honeypot. Honeypots get hammered. If the ingest path melts
+under load, we lose attacker data — which is the only thing we care
+about.
This page is the story of how we got the API from "falls over at
+200 users" to "holds 1.6k RPS at 1500 concurrent users" and what that
+cost in blood.
+
+All numbers below are **real**. They come from the nine Locust CSVs in
+`development/profiles/`. No fabrication.
+
+---
+
+## Headline table
+
+All runs hit the same FastAPI surface (`/api/v1/logs`, `/healthz`,
+`/api/v1/attackers`, etc.) via Locust. The Aggregated row is what
+matters.
+
+| Profile | Users | Config | Requests | Fails | p50 (ms) | p95 (ms) | p99 (ms) | Avg (ms) | RPS |
+|---|---|---|---|---|---|---|---|---|---|
+| `profile_3106d0313507f016_locust.csv` | baseline | early code, tracing on | 7 410 | 20 | 740 | 87 000 | 187 000 | 12 999.71 | 5.5 |
+| `profile_255c2e5.csv` | mid | regression, tracing on | 1 042 | 514 | 6 700 | 150 000 | 186 000 | 58 835.59 | 2.3 |
+| `profile_2dd86fb.csv` | mid | tracing on, post-fix | 6 012 | 0 | 240 | 134 000 | 194 000 | 16 217.04 | 2.4 |
+| `profile_e967aaa.csv` | ~1000 | tracing on, cleanups | 259 381 | 0 | 300 | 1 600 | 2 200 | 514.41 | 934.3 |
+| `profile_fb69a06.csv` | ~1000 | tracing on, tuned | 396 672 | 0 | 100 | 1 900 | 2 900 | 465.03 | 963.6 |
+| `profile_1500_fb69a06.csv` | 1500 | tracing ON | 232 648 | 0 | 690 | 6 500 | 9 500 | 1 773.51 | 880.4 |
+| `profile_1500_notracing_fb69a06.csv` | 1500 | tracing OFF | 277 214 | 0 | 340 | 5 700 | 8 400 | 1 489.08 | 992.7 |
+| `profile_1500_notracing_12_workers_fb69a06.csv` | 1500 | tracing OFF, 12 uvicorn workers | 308 024 | 0 | 700 | 2 700 | 4 200 | 929.88 | 1 585.1 |
+| `profile_1500_notracing_single_core_fb69a06.csv` | 1500 | tracing OFF, single core pin | 3 532 | 0 | 270 | 115 000 | 122 000 | 21 728.92 | 46.2 |
+
+(p50/p95/p99 = Locust Median / 95%ile / 99%ile columns. RPS = Current
+RPS at end of the run.)
+
+---
+
+## 1. The baseline: "it works, on Tuesdays"
+
+The earliest usable profile is `profile_3106d0313507f016_locust.csv`.
+7 410 requests, **20 failures**, and a p99 of **187 seconds**.
You read
+that right — the 99th percentile request took over three minutes to
+come back. Current RPS at end of run: **5.5**.
+
+We were not fast. We were not even slow in a respectable way.
+
+`profile_255c2e5.csv` is worse: 1 042 requests, **514 failed** (49%
+failure rate), p99 = 186 s, average 58.8 s per request. That is the
+regression that proved our API could lock itself up completely when
+everyone tried to write at once.
+
+`profile_2dd86fb.csv` was the patch that stopped the bleeding: zero
+failures, but still p95/p99 in the 100–200 s range. The API responded
+to every request, eventually. That is not what anyone means by
+"responded."
+
+## 2. The turnaround: `e967aaa` and `fb69a06`
+
+Then two commits changed everything.
+
+`profile_e967aaa.csv`: **259 381 requests**, zero failures, p50=300 ms,
+**p95=1.6 s, p99=2.2 s**, average 514 ms, **934 RPS**. Two orders of
+magnitude better on tail latency, and 43x more requests serviced
+(6 012 → 259 381).
+
+`profile_fb69a06.csv` squeezed more out: **396 672 requests**, zero
+failures, p50=100 ms, p95=1.9 s, p99=2.9 s, average 465 ms, **963
+RPS**. This is the commit we pinned as our "healthy baseline." Every
+1500-user run below is tagged `_fb69a06` because we wanted to measure
+**load and config**, not code churn.
+
+How? The usual suspects: proper DB connection pooling, an eliminated
+hot-path N+1, a repository layer switched to the injected
+`get_repository()` / `get_repo` pattern (see CLAUDE.md's DI rule), and
+no more synchronous fsync on every insert.
+
+## 3. 1500 users: the API holds
+
+`profile_1500_fb69a06.csv` turns the screws: **1500 concurrent users**,
+tracing ON, default uvicorn worker count. Result: **232 648 requests**,
+zero failures, p50=690 ms, p95=6.5 s, p99=9.5 s, 880 RPS.
+
+Zero failures at 1500 users is the first genuine win. Latency got
+uglier — p95 jumped from 1.9 s to 6.5 s — but nothing fell over. The
+system is now **throughput-limited, not stability-limited**.
That is a
+different class of problem.
+
+## 4. What OpenTelemetry cost us
+
+Compare `profile_1500_fb69a06.csv` vs `profile_1500_notracing_fb69a06.csv`.
+Same code, same load, same host. Only difference:
+`DECNET_DEVELOPER_TRACING=false`.
+
+| Metric | Tracing ON | Tracing OFF | Delta |
+|---|---|---|---|
+| Total requests | 232 648 | 277 214 | **+19%** |
+| p50 | 690 ms | 340 ms | **-51%** |
+| p95 | 6 500 ms | 5 700 ms | -12% |
+| p99 | 9 500 ms | 8 400 ms | -12% |
+| Avg | 1 773 ms | 1 489 ms | -16% |
+| RPS | 880.4 | 992.7 | +13% |
+
+Auto-instrumented FastAPI tracing is **not free**. The median request
+paid a ~350 ms tax, and the API served ~16% fewer requests in the same
+window (the table's +19% read from the other side). Tails are less
+affected because they are dominated by I/O wait, not span overhead.
+
+**Rule:** tracing stays off in production DECNET deployments. It is a
+development lens. See [Tracing and Profiling](Tracing-and-Profiling).
+
+## 5. Vertical scaling: 12 workers vs single core
+
+`profile_1500_notracing_12_workers_fb69a06.csv`: tracing off, uvicorn
+with **12 workers**. Result: **308 024 requests**, p50=700 ms, **p95=2.7
+s, p99=4.2 s**, 1 585 RPS.
+
+Going from default workers to 12 bought us **+11% requests served,
+-53% p95, -50% p99**, and end-of-run RPS up 60% (992.7 → 1 585.1).
+The tail improvement is the real prize — more workers means fewer
+requests queued behind a slow one.
+
+Now the punchline: `profile_1500_notracing_single_core_fb69a06.csv`.
+Same config, pinned to **one core** via CPU affinity. Result: **3 532
+requests total**, **p95=115 s, p99=122 s**, average 21.7 s, **46
+RPS**.
+
+Single-core is a **34x throughput collapse** vs 12 workers, and the
+tail grows from 4 seconds to nearly two minutes. FastAPI + SQLite on
+one core with 1500 concurrent clients is a queue, not a server.
+
+Vertical scaling holds. Worker count matters. The GIL is real.
+
+## 6. Where is the bottleneck now?
+
+Reading the 12-worker numbers: 1 585 RPS, p95=2.7 s, with zero
+failures.
That is good, but p95 should be far lower than 2.7 s for an
+in-memory-ish workload. Candidates:
+
+1. **SQLite single-writer lock.** All 12 workers share one
+   `attackers.db`. SQLite's WAL mode helps readers but writes still
+   serialize. Under `/api/v1/logs` write amplification we expect
+   queue-behind-writer stalls in exactly this latency envelope. The
+   MySQL backend exists for exactly this reason — see
+   [Database drivers](Database-Drivers).
+2. **Python GIL on the aggregation hot path.** The single-core profile
+   proves the interpreter is CPU-bound at saturation. 12 workers
+   side-step the GIL only for independent requests — anything going
+   through a shared lock (DB, in-process cache) re-serializes.
+3. **Network stack / event-loop wait** on the Locust side — less
+   likely, we checked client CPU during the runs.
+
+Best defensible guess: **SQLite writer lock first, GIL second.**
+Switching the hot-write path to MySQL (or even `PRAGMA
+journal_mode=WAL` + batched inserts) should move p95 under a second at
+the same RPS. That work is scoped but not landed. See
+`development/FUTURE.md` for the queue.
+
+---
+
+## tl;dr
+
+- From 5 RPS (and a separate 49%-failure regression) to **1 585 RPS / 0% failure at 1500 concurrent users.**
+- Tracing costs ~13% RPS and doubles p50. Keep it off in production.
+- Workers matter. Single-core pinning = 46 RPS and two-minute tails.
+- Next bottleneck: the single SQLite writer. Blame the database, as is tradition.
+
+Related: [Design overview](Design-Overview) ·
+[Logging](Logging-and-Syslog) ·
+[Tracing and Profiling](Tracing-and-Profiling) ·
+[Testing and CI](Testing-and-CI).
diff --git a/Testing-and-CI.md b/Testing-and-CI.md
new file mode 100644
index 0000000..6615665
--- /dev/null
+++ b/Testing-and-CI.md
@@ -0,0 +1,161 @@
+# Testing and CI
+
+Pytest is the only game in town. Every new feature ships with tests.
+Never commit broken code. Those are house rules, not suggestions.
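In practice "ships with tests" means an ordinary pytest module landing in the same commit as the feature. A toy sketch of the shape — the `parse_port` helper and its tests here are hypothetical, not real decnet code:

```python
# Hypothetical illustration of the house rule: the feature function and
# its tests land together. parse_port is a stand-in, not a decnet helper.
def parse_port(value: str) -> int:
    """Parse a TCP/UDP port string, rejecting out-of-range values."""
    port = int(value)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def test_parse_port_accepts_valid() -> None:
    assert parse_port("22") == 22


def test_parse_port_rejects_out_of_range() -> None:
    try:
        parse_port("70000")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```

Tests like these need no marker; the default suite picks them up with a bare `pytest`.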
+ +For the numbers that our stress tests produce, see +[Performance Story](Performance-Story). For the knobs used in dev, see +[Tracing and Profiling](Tracing-and-Profiling) and +[Environment variables](Environment-Variables). + +--- + +## Layout + +Everything lives under `tests/`: + +``` +tests/ + conftest.py # global fixtures; forces DECNET_DEVELOPER_TRACING=false + api/ # FastAPI + repository integration + docker/ # image-build smoke (marker: docker) + live/ # subprocess-backed service tests (marker: live) + perf/ # per-request profiling harness + README + service_testing/ # protocol-level decoys (SSH, SMB, RDP, ...) + stress/ # Locust-driven load tests (marker: stress) + test_*.py # unit tests, flat layout +``` + +`tests/conftest.py` pins a few env vars for determinism: + +- `DECNET_DEVELOPER=true` +- `DECNET_DEVELOPER_TRACING=false` (never trace in tests) +- `DECNET_DB_TYPE=sqlite` + +--- + +## Running the fast suite + +```bash +# activate the venv first — see ANTI's rule +source .venv/bin/activate + +pytest +``` + +`pyproject.toml` sets the default opt-outs: + +```ini +addopts = "-m 'not fuzz and not live and not stress and not bench and not docker' \ + -v -q -x -n logical --dist loadscope" +``` + +Meaning: the default run **skips** `fuzz`, `live`, `stress`, `bench`, +and `docker`. Everything else runs in parallel with xdist, +distributed by test module (`--dist loadscope`). 
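The env pinning described above boils down to a few lines run at conftest import time. A sketch of the idea — the values are the ones documented on this page, but the helper name and exact mechanism are illustrative (the real `tests/conftest.py` may use fixtures or `monkeypatch` instead):

```python
# Sketch of the determinism pinning tests/conftest.py performs: force
# the env vars before any decnet settings module reads them. Values are
# the documented ones; pin_test_env is an illustrative name.
import os

PINNED_TEST_ENV = {
    "DECNET_DEVELOPER": "true",
    "DECNET_DEVELOPER_TRACING": "false",  # never trace in tests
    "DECNET_DB_TYPE": "sqlite",
}


def pin_test_env() -> None:
    # update(), not setdefault(): a stray export in the developer's
    # shell must not leak nondeterminism into the suite
    os.environ.update(PINNED_TEST_ENV)


pin_test_env()
```

Doing this at module import time matters: settings objects read the environment once, so the pin has to happen before the application code is imported.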
+ +## Markers + +All markers are declared in `pyproject.toml`: + +| Marker | Meaning | Opt in | +|---|---|---| +| `fuzz` | Hypothesis-based fuzz tests, slow | `pytest -m fuzz` | +| `live` | Live subprocess service tests | `pytest -m live` | +| `live_docker` | Live Docker containers (needs `DECNET_LIVE_DOCKER=1`) | `pytest -m live_docker` | +| `stress` | Locust load tests | `pytest -m stress` | +| `bench` | pytest-benchmark micro-benchmarks | `pytest -m bench` | +| `docker` | Tests that build and run Docker images | `pytest -m docker` | + +Run **everything**, including the slow markers: + +```bash +pytest -m '' +``` + +Run a single marker and nothing else: + +```bash +pytest -m stress tests/stress/ -v -x -n0 -s +``` + +--- + +## Stress tests (Locust) + +The stress suite is how we produced every number in +[Performance Story](Performance-Story). + +Invocation: + +```bash +STRESS_USERS=2000 STRESS_SPAWN_RATE=200 \ + pytest -m stress tests/stress/ -v -x -n0 -s +``` + +Notes: + +- `-n0` — **no xdist parallelism**. Locust wants to own the workers. +- `-s` — do not capture stdout; you want to see the RPS/latency stream. +- `-x` — stop on first failure. A locust failure here means the API + regressed. +- `STRESS_USERS` and `STRESS_SPAWN_RATE` are the knobs you tune. 2000 + users at spawn rate 200 is our standard "can the API hold" check. + +Output CSVs land in `development/profiles/` (or wherever the harness +is configured to write) — the same CSV format parsed into the perf +table on [Performance Story](Performance-Story). + +Before you run stress tests against a live deploy: **turn tracing +off**. Profiles with `DECNET_DEVELOPER_TRACING=true` are not +production-representative. The 1500-user comparison proves this. + +--- + +## Project rules (non-negotiable) + +From `CLAUDE.md` at repo root: + +1. **Every new feature must have pytests.** No exceptions. If you add + a collector, a parser, a service decoy, or a repository method, + ship tests alongside it. +2. 
**Never ship broken code.** "Broken" here means: does not run, does
+   not pass 100% of tests, leaves the suite in a red state. If the
+   bar is red when you commit, you are violating the rule.
+3. **100% pass before commit.** Green suite, then `git commit`. Not
+   the other way around.
+
+There is also a live-testing rule worth calling out because it has
+bitten more than one contributor:
+
+> **No scapy sniff threads in TestClient lifespan tests.** Scapy's
+> `sniff()` spawns a thread that does not clean up when FastAPI's
+> TestClient tears down the lifespan context. Result: pytest hangs
+> forever at teardown. If you need to verify sniffer behaviour under
+> a live API, **use static source inspection** (import the module,
+> assert on its attributes / dispatch table) rather than spinning up
+> the real capture loop.
+
+This is why `tests/test_sniffer_*.py` reaches into module internals
+instead of just hitting an endpoint.
+
+---
+
+## Quick recipe for adding a feature
+
+1. Write the failing test (pick the right marker — probably none, i.e.
+   default suite).
+2. Implement the feature.
+3. `pytest` until green.
+4. If the feature touches the storage layer, use `get_repository()` or
+   the FastAPI `get_repo` dependency — **never** import
+   `SQLiteRepository` directly. See the DI rule in `CLAUDE.md`.
+5. If the feature has a performance story, run the stress suite with
+   tracing off and archive the CSV under `development/profiles/`.
+6. `git commit`.
+
+Related: [Design overview](Design-Overview) ·
+[Performance Story](Performance-Story) ·
+[Tracing and Profiling](Tracing-and-Profiling) ·
+[Environment variables](Environment-Variables) ·
+[Logging](Logging-and-Syslog).
diff --git a/Tracing-and-Profiling.md b/Tracing-and-Profiling.md
new file mode 100644
index 0000000..6d471ef
--- /dev/null
+++ b/Tracing-and-Profiling.md
@@ -0,0 +1,157 @@
+# Tracing and Profiling
+
+Three lenses. All off by default. All cost real CPU. Turn on, measure,
+turn off.
For the perf history that justifies the defaults, see +[Performance Story](Performance-Story). For every env var on this page, +see [Environment variables](Environment-Variables). + +--- + +## 1. OpenTelemetry (distributed tracing) + +Auto-instruments FastAPI, the internal workers, and outbound DB/HTTP +calls. Emits OTLP over gRPC. + +### Enable + +```bash +export DECNET_DEVELOPER_TRACING=true +export DECNET_OTEL_ENDPOINT=http://localhost:4317 # default +``` + +### Bring up a local collector + +A dev collector ships in the repo: + +```bash +cd development +docker compose -f docker-compose.otel.yml up -d +# now point DECNET at it (default endpoint already matches) +DECNET_DEVELOPER_TRACING=true decnet deploy --mode unihost --deckies 2 +``` + +Traces appear in whatever backend the compose file wires to +(otel-collector → Jaeger/Tempo, depending on your compose overrides). + +### Install the extras + +Tracing is an optional dependency group: + +```bash +pip install -e '.[tracing]' +# or the umbrella: +pip install -e '.[dev]' +``` + +Without the extras, `DECNET_DEVELOPER_TRACING=true` is a no-op — the +telemetry module detects missing packages and degrades silently. That +behaviour is covered by `tests/test_telemetry.py` +(`TestTracingDisabled`). + +### Cost + +Measured, not guessed. From `development/profiles/`: + +- 1500 concurrent users, tracing ON: **p50=690 ms, 880 RPS** +- same load, tracing OFF: **p50=340 ms, 993 RPS** + +That is roughly a **~13% RPS hit and p50 doubles**. Do not ship +production deployments with this flag on. Full breakdown in +[Performance Story §4](Performance-Story). + +--- + +## 2. Pyinstrument (per-request flamegraphs) + +A sampling profiler mounted as ASGI middleware. Each request gets its +own HTML flamegraph. 
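The shape of that middleware is easy to see in plain ASGI terms. Below is a runnable sketch that uses stdlib `cProfile` instead of Pyinstrument so it needs no extras; the real middleware swaps in pyinstrument's sampling `Profiler` and writes one HTML flamegraph file per request, and the class name here is made up:

```python
# ASGI-middleware sketch of per-request profiling. Stdlib cProfile
# stands in for pyinstrument.Profiler; DECNET's real middleware renders
# an HTML flamegraph per request instead of a text report.
import cProfile
import io
import pstats


class PerRequestProfiler:
    """Hypothetical wrapper — DECNET's actual class name differs."""

    def __init__(self, app):
        self.app = app
        self.last_report = ""  # the real middleware writes a file per request

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":  # profile HTTP requests only
            await self.app(scope, receive, send)
            return
        profiler = cProfile.Profile()
        profiler.enable()
        try:
            await self.app(scope, receive, send)
        finally:
            profiler.disable()
            buf = io.StringIO()
            pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
            self.last_report = buf.getvalue()
```

Whatever the profiler inside, the cost structure is the same: every request pays for it, which is why the flag defaults to off.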
+
+### Enable
+
+```bash
+export DECNET_PROFILE_REQUESTS=true
+export DECNET_PROFILE_DIR=profiles   # default, relative to $PWD
+DECNET_PROFILE_REQUESTS=true decnet deploy --mode unihost --deckies 1
+```
+
+Hit an endpoint, then:
+
+```bash
+ls profiles/                         # one *.html per request
+xdg-open profiles/<request-id>.html
+```
+
+### Install the extras
+
+```bash
+pip install -e '.[profile]'
+```
+
+### Cost
+
+Pyinstrument samples at kHz frequencies and writes HTML per request.
+That is **non-trivial CPU** and disk pressure. Do not leave it on. It
+is a "one request at a time, look at it under a microscope" tool, not
+a production observability channel.
+
+---
+
+## 3. py-spy / memray / snakeviz (dev extras)
+
+Optional profilers declared in `pyproject.toml` under the `profile`
+extras group. Install with `pip install -e '.[profile]'`.
+
+### py-spy (sampling profiler, no code changes)
+
+```bash
+# find the API pid
+ps -ef | grep 'uvicorn.*decnet'
+
+# record a 30s flamegraph
+sudo py-spy record -o pyspy.svg --pid <pid> --duration 30
+```
+
+There is a helper at `scripts/profile/pyspy-attach.sh` that documents
+the common invocations. If py-spy cannot attach (Linux ptrace
+restrictions), use Pyinstrument instead — the script prints fallback
+guidance.
+
+### memray (memory allocation tracing)
+
+```bash
+scripts/profile/memray-api.sh   # launches API under memray
+memray flamegraph memray.bin    # generate report
+```
+
+Good for hunting leaks in the long-running workers (collector,
+attacker-worker, sniffer).
+
+### snakeviz (cProfile viewer)
+
+```bash
+python -m cProfile -o out.prof -m decnet.cli services
+snakeviz out.prof
+```
+
+Useful for CLI / one-shot paths. For the API, prefer Pyinstrument.
+
+---
+
+## Production warning
+
+> Do **not** leave `DECNET_DEVELOPER_TRACING=true` or
+> `DECNET_PROFILE_REQUESTS=true` on in a real DECNET deployment.
+ +The numbers in [Performance Story](Performance-Story) are measured on a +single baseline commit (`fb69a06`) with every other variable held +still, and tracing alone cost ~13% throughput and doubled p50. In an +attacker-facing honeypot, a slow API means delayed detection, missed +correlations, and potentially a noticeable behavioural tell — none of +which you want. + +Rule: flip on, collect, flip off, commit. + +Related: [Environment variables](Environment-Variables) · +[Design overview](Design-Overview) · +[Logging](Logging-and-Syslog) · +[Testing and CI](Testing-and-CI).