Add Unit 17: Performance story, tracing/profiling, testing

- Performance-Story.md: war-story narrative from the 9 Locust CSVs
  in development/profiles/. Real numbers: from a 5 RPS baseline and
  a 49%-failure regression to 1585 RPS / 0% failure at 1500
  concurrent users.
  Quantifies the OTel tax (~13% RPS, doubles p50) and the single-
  core collapse (34x throughput loss). Calls out SQLite writer lock
  as the remaining bottleneck.
- Tracing-and-Profiling.md: how to flip OpenTelemetry, Pyinstrument,
  py-spy, memray, and snakeviz on and off. Documents the local OTel
  collector compose file and extras groups. Production warning.
- Testing-and-CI.md: pytest layout, markers, default opt-outs, the
  Locust stress recipe, project rules (tests for every feature,
  never ship broken code, green before commit), and the no-scapy-
  sniff-in-TestClient-lifespan rule.
2026-04-18 06:08:46 -04:00
parent 9de6ea3dff
commit 9c107c06da
3 changed files with 490 additions and 0 deletions

Performance-Story.md Normal file

@@ -0,0 +1,172 @@
# Performance Story
> A war story. Not a spec sheet. If you want the knobs, see
> [Tracing and Profiling](Tracing-and-Profiling). If you want the env vars,
> see [Environment variables](Environment-Variables).
DECNET is a honeypot. Honeypots get hammered. If the ingest path melts
under load, we lose attacker data — which is the only thing we care
about. This page is the story of how we got the API from "falls over at
200 users" to "holds 1.6k RPS at 1500 concurrent users" and what that
cost in blood.
All numbers below are **real**. They come from the nine Locust CSVs in
`development/profiles/`. No fabrication.
---
## Headline table
All runs hit the same FastAPI surface (`/api/v1/logs`, `/healthz`,
`/api/v1/attackers`, etc.) via Locust. The Aggregated row is what
matters.
| Profile | Users / phase | Config | Requests | Fails | p50 (ms) | p95 (ms) | p99 (ms) | Avg (ms) | RPS |
|---|---|---|---|---|---|---|---|---|---|
| `profile_3106d0313507f016_locust.csv` | baseline | early code, tracing on | 7 410 | 20 | 740 | 87 000 | 187 000 | 12 999.71 | 5.5 |
| `profile_255c2e5.csv` | mid | regression, tracing on | 1 042 | 514 | 6 700 | 150 000 | 186 000 | 58 835.59 | 2.3 |
| `profile_2dd86fb.csv` | mid | tracing on, post-fix | 6 012 | 0 | 240 | 134 000 | 194 000 | 16 217.04 | 2.4 |
| `profile_e967aaa.csv` | ~1000 | tracing on, cleanups | 259 381 | 0 | 300 | 1 600 | 2 200 | 514.41 | 934.3 |
| `profile_fb69a06.csv` | ~1000 | tracing on, tuned | 396 672 | 0 | 100 | 1 900 | 2 900 | 465.03 | 963.6 |
| `profile_1500_fb69a06.csv` | 1500 | tracing ON | 232 648 | 0 | 690 | 6 500 | 9 500 | 1 773.51 | 880.4 |
| `profile_1500_notracing_fb69a06.csv` | 1500 | tracing OFF | 277 214 | 0 | 340 | 5 700 | 8 400 | 1 489.08 | 992.7 |
| `profile_1500_notracing_12_workers_fb69a06.csv` | 1500 | tracing OFF, 12 uvicorn workers | 308 024 | 0 | 700 | 2 700 | 4 200 | 929.88 | 1 585.1 |
| `profile_1500_notracing_single_core_fb69a06.csv` | 1500 | tracing OFF, single core pin | 3 532 | 0 | 270 | 115 000 | 122 000 | 21 728.92 | 46.2 |
(p50/p95/p99 = Locust Median / 95%ile / 99%ile columns. RPS = Current
RPS at end of the run.)
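The headline rows can be re-extracted from the raw CSVs with a few lines of stdlib Python. This is a sketch, not project code: column names match recent Locust `*_stats.csv` releases and can drift between versions.

```python
import csv
import io

def aggregated_row(stats_csv: str) -> dict:
    """Return the Aggregated row of a Locust *_stats.csv as a dict.

    Column names ("Name", "Request Count", "95%", ...) match recent
    Locust releases; older versions differ slightly.
    """
    for row in csv.DictReader(io.StringIO(stats_csv)):
        if row.get("Name") == "Aggregated":
            return row
    raise ValueError("no Aggregated row found")

# Usage sketch against one of the nine profile CSVs:
# row = aggregated_row(open("development/profiles/profile_fb69a06.csv").read())
# print(row["Request Count"], row["95%"])
```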
---
## 1. The baseline: "it works, on Tuesdays"
The earliest usable profile is `profile_3106d0313507f016_locust.csv`.
7 410 requests, **20 failures**, and a p99 of **187 seconds**. You read
that right — the 99th percentile request took over three minutes to
come back. Current RPS at end of run: **5.5**.
We were not fast. We were not even slow in a respectable way.
`profile_255c2e5.csv` is worse: 1 042 requests, **514 failed** (49%
failure rate), p99 = 186 s, average 58.8 s per request. That is the
regression that proved our API could lock itself up completely when
everyone tried to write at once.
`profile_2dd86fb.csv` was the patch that stopped the bleeding: zero
failures, but still p95/p99 in the 100–200 s range. The API responded
to every request, eventually. That is not what anyone means by
"responded."
## 2. The turnaround: `e967aaa` and `fb69a06`
Then two commits changed everything.
`profile_e967aaa.csv`: **259 381 requests**, zero failures, p50=300 ms,
**p95=1.6 s, p99=2.2 s**, average 514 ms, **934 RPS**. Two orders of
magnitude better on tail latency, and roughly thirty-five times the
requests serviced.
`profile_fb69a06.csv` squeezed more out: **396 672 requests**, zero
failures, p50=100 ms, p95=1.9 s, p99=2.9 s, average 465 ms, **963
RPS**. This is the commit we pinned as our "healthy baseline." Every
1500-user run below is tagged `_fb69a06` because we wanted to measure
**load and config**, not code churn.
How? The usual suspects: proper DB connection pooling, eliminated a
hot-path N+1, switched the repository layer to the injected
`get_repository()` / `get_repo` pattern (see CLAUDE.md's DI rule), and
stopped synchronously fsync'ing on every insert.
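The injected-repository shape is simple enough to sketch. The function name `get_repository()` comes from the text above; everything inside it here is an assumption, a stand-in for the real pooled implementation, not DECNET's actual code.

```python
from functools import lru_cache
from typing import Protocol

class Repository(Protocol):
    def insert_log(self, entry: dict) -> None: ...

class SQLiteRepository:
    """Stand-in: the real class wraps a pooled connection."""
    def __init__(self) -> None:
        self.rows: list[dict] = []
    def insert_log(self, entry: dict) -> None:
        self.rows.append(entry)

@lru_cache(maxsize=1)
def get_repository() -> Repository:
    # One shared instance per process; callers never construct
    # SQLiteRepository directly (the CLAUDE.md DI rule).
    return SQLiteRepository()
```

The payoff of routing every caller through one factory: swapping the SQLite backend for MySQL touches a single function, and tests can clear the cache to inject a fake.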
## 3. 1500 users: the API holds
`profile_1500_fb69a06.csv` turns the screws: **1500 concurrent users**,
tracing ON, default uvicorn worker count. Result: **232 648 requests**,
zero failures, p50=690 ms, p95=6.5 s, p99=9.5 s, 880 RPS.
Zero failures at 1500 users is the first genuine win. Latency got
uglier — p95 jumped from 1.9 s to 6.5 s — but nothing fell over. The
system is now **throughput-limited, not stability-limited**. That is a
different class of problem.
## 4. What OpenTelemetry cost us
Compare `profile_1500_fb69a06.csv` vs `profile_1500_notracing_fb69a06.csv`.
Same code, same load, same host. Only difference:
`DECNET_DEVELOPER_TRACING=false`.
| Metric | Tracing ON | Tracing OFF | Delta |
|---|---|---|---|
| Total requests | 232 648 | 277 214 | **+19%** |
| p50 | 690 ms | 340 ms | **-51%** |
| p95 | 6 500 ms | 5 700 ms | -12% |
| p99 | 9 500 ms | 8 400 ms | -12% |
| Avg | 1 773 ms | 1 489 ms | -16% |
| RPS | 880.4 | 992.7 | +13% |
Auto-instrumented FastAPI tracing is **not free**. The median request
paid a ~350 ms tax and the API served ~16% fewer requests in the same
window. Tails are less affected because they are dominated by
I/O wait, not span overhead.
**Rule:** tracing stays off in production DECNET deployments. It is a
development lens. See [Tracing and Profiling](Tracing-and-Profiling).
## 5. Vertical scaling: 12 workers vs single core
`profile_1500_notracing_12_workers_fb69a06.csv`: tracing off, uvicorn
with **12 workers**. Result: **308 024 requests**, p50=700 ms, **p95=2.7
s, p99=4.2 s**, 1 585 RPS.
Going from default workers to 12 bought us: **+11% requests served
(end-of-run RPS up 60%), -53% p95, -50% p99**. The tail improvement is
the real prize — more workers means fewer requests queued behind a
slow one.
Now the punchline: `profile_1500_notracing_single_core_fb69a06.csv`.
Same config, pinned to **one core** via CPU affinity. Result: **3 532
requests total**, **p95=115 s, p99=122 s**, average 21.7 s, **46
RPS**.
Single-core is a **34x throughput collapse** vs 12-workers, and the
tail grows from 4 seconds to nearly two minutes. FastAPI + SQLite on
one core with 1500 concurrent clients is a queue, not a server.
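Little's law (L = λW) backs up the "queue, not a server" claim: with 1500 users in the system and 46 RPS of throughput, the implied mean time in system is in the tens of seconds, the same ballpark as the measured 21.7 s average (Locust think time and ramp-up account for the gap).

```python
# Little's law: L = lambda * W  =>  W = L / lambda
users = 1500        # concurrent Locust users in the system (L)
rps = 46.2          # end-of-run throughput (lambda)
wait_s = users / rps
print(f"{wait_s:.1f}")  # ~32.5 s implied mean time in system
```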
Vertical scaling holds, but only through worker processes. The GIL is
real.
## 6. Where is the bottleneck now?
Reading the 12-worker numbers: 1 585 RPS, p95=2.7 s, with zero
failures. That is good, but p95 should be far lower than 2.7 s for an
in-memory-ish workload. Candidates:
1. **SQLite single-writer lock.** All 12 workers share one
`attackers.db`. SQLite's WAL mode helps readers but writes still
serialize. Under `/api/v1/logs` write amplification we expect
queue-behind-writer stalls in exactly this latency envelope. The
MySQL backend exists for exactly this reason — see
[Database drivers](Database-Drivers).
2. **Python GIL on the aggregation hot path.** The single-core profile
proves the interpreter is CPU-bound at saturation. 12 workers side-
step the GIL only for independent requests — anything going through
a shared lock (DB, in-process cache) re-serializes.
3. **Network stack / event-loop wait** on the Locust side — less
   likely; we checked client CPU during the runs.
Best defensible guess: **SQLite writer lock first, GIL second.**
Switching the hot-write path to MySQL (or even `PRAGMA
journal_mode=WAL` + batched inserts) should move p95 under a second at
the same RPS. That work is scoped but not landed. See
`development/FUTURE.md` for the queue.
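The batched-insert half of that fix is sketchable with stdlib `sqlite3`. The win is one transaction (one fsync) per batch instead of one per request; the schema and names below are illustrative, not DECNET's real tables.

```python
import sqlite3

def make_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while the single writer works.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE TABLE IF NOT EXISTS logs (src TEXT, payload TEXT)")
    return conn

def insert_batch(conn: sqlite3.Connection, rows: list) -> None:
    # One transaction, one fsync, N rows -- instead of N transactions.
    with conn:
        conn.executemany("INSERT INTO logs VALUES (?, ?)", rows)
```

A worker would buffer incoming `/api/v1/logs` writes for a few hundred milliseconds, then flush one batch, trading a small ingest delay for far fewer writer-lock acquisitions.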
---
## tl;dr
- From 5 RPS (and a 49%-failure regression) to **1 585 RPS / 0% failure at 1500 concurrent users.**
- Tracing costs ~13% RPS and doubles p50. Keep it off in production.
- Workers matter. Single-core pinning = 46 RPS and two-minute tails.
- Next bottleneck: the single SQLite writer. Blame the database, as is tradition.
Related: [Design overview](Design-Overview) ·
[Logging](Logging-and-Syslog) ·
[Tracing and Profiling](Tracing-and-Profiling) ·
[Testing and CI](Testing-and-CI).

Testing-and-CI.md Normal file

@@ -0,0 +1,161 @@
# Testing and CI
Pytest is the only game in town. Every new feature ships with tests.
Never commit broken code. Those are house rules, not suggestions.
For the numbers that our stress tests produce, see
[Performance Story](Performance-Story). For the knobs used in dev, see
[Tracing and Profiling](Tracing-and-Profiling) and
[Environment variables](Environment-Variables).
---
## Layout
Everything lives under `tests/`:
```
tests/
conftest.py # global fixtures; forces DECNET_DEVELOPER_TRACING=false
api/ # FastAPI + repository integration
docker/ # image-build smoke (marker: docker)
live/ # subprocess-backed service tests (marker: live)
perf/ # per-request profiling harness + README
service_testing/ # protocol-level decoys (SSH, SMB, RDP, ...)
stress/ # Locust-driven load tests (marker: stress)
test_*.py # unit tests, flat layout
```
`tests/conftest.py` pins a few env vars for determinism:
- `DECNET_DEVELOPER=true`
- `DECNET_DEVELOPER_TRACING=false` (never trace in tests)
- `DECNET_DB_TYPE=sqlite`
---
## Running the fast suite
```bash
# activate the venv first — see ANTI's rule
source .venv/bin/activate
pytest
```
`pyproject.toml` sets the default opt-outs:
```ini
addopts = "-m 'not fuzz and not live and not stress and not bench and not docker' \
-v -q -x -n logical --dist loadscope"
```
Meaning: the default run **skips** `fuzz`, `live`, `stress`, `bench`,
and `docker`. Everything else runs in parallel with xdist,
distributed by test module (`--dist loadscope`).
## Markers
All markers are declared in `pyproject.toml`:
| Marker | Meaning | Opt in |
|---|---|---|
| `fuzz` | Hypothesis-based fuzz tests, slow | `pytest -m fuzz` |
| `live` | Live subprocess service tests | `pytest -m live` |
| `live_docker` | Live Docker containers (needs `DECNET_LIVE_DOCKER=1`) | `pytest -m live_docker` |
| `stress` | Locust load tests | `pytest -m stress` |
| `bench` | pytest-benchmark micro-benchmarks | `pytest -m bench` |
| `docker` | Tests that build and run Docker images | `pytest -m docker` |
Run **everything**, including the slow markers:
```bash
pytest -m ''
```
Run a single marker and nothing else:
```bash
pytest -m stress tests/stress/ -v -x -n0 -s
```
---
## Stress tests (Locust)
The stress suite is how we produced every number in
[Performance Story](Performance-Story).
Invocation:
```bash
STRESS_USERS=2000 STRESS_SPAWN_RATE=200 \
pytest -m stress tests/stress/ -v -x -n0 -s
```
Notes:
- `-n0` — **no xdist parallelism**. Locust wants to own the workers.
- `-s` — do not capture stdout; you want to see the RPS/latency stream.
- `-x` — stop on first failure. A locust failure here means the API
regressed.
- `STRESS_USERS` and `STRESS_SPAWN_RATE` are the knobs you tune. 2000
users at spawn rate 200 is our standard "can the API hold" check.
Output CSVs land in `development/profiles/` (or wherever the harness
is configured to write) — the same CSV format parsed into the perf
table on [Performance Story](Performance-Story).
Before you run stress tests against a live deploy: **turn tracing
off**. Profiles with `DECNET_DEVELOPER_TRACING=true` are not
production-representative. The 1500-user comparison proves this.
---
## Project rules (non-negotiable)
From `CLAUDE.md` at repo root:
1. **Every new feature must have pytests.** No exceptions. If you add
a collector, a parser, a service decoy, or a repository method,
ship tests alongside it.
2. **Never pass broken code.** "Broken" here means: does not run, does
not pass 100% of tests, leaves the suite in a red state. If the
bar is red when you commit, you are violating the rule.
3. **100% pass before commit.** Green suite, then `git commit`. Not
the other way around.
There is also a live rule worth calling out because it has bitten
more than one contributor:
> **No scapy sniff threads in TestClient lifespan tests.** Scapy's
> `sniff()` spawns a thread that does not clean up when FastAPI's
> TestClient tears down the lifespan context. Result: pytest hangs
> forever at teardown. If you need to verify sniffer behaviour under
> a live API, **use static source inspection** (import the module,
> assert on its attributes / dispatch table) rather than spinning up
> the real capture loop.
This is why `tests/test_sniffer_*.py` reaches into module internals
instead of just hitting an endpoint.
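A minimal version of the static-inspection approach, assuming the sniffer exposes a module-level dispatch table. The attribute name `DISPATCH` is hypothetical; substitute whatever the real module defines.

```python
import importlib

def load_dispatch(module_name: str, attr: str = "DISPATCH") -> dict:
    """Import the sniffer module and read its dispatch table without
    ever starting the capture loop (no sniff(), no thread)."""
    mod = importlib.import_module(module_name)
    table = getattr(mod, attr)
    assert isinstance(table, dict) and table, "dispatch table missing or empty"
    return table

# In a test (module and key names assumed):
# assert "ssh" in load_dispatch("decnet.sniffer")
```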
---
## Quick recipe for adding a feature
1. Write the failing test (pick the right marker — probably none, i.e.
default suite).
2. Implement the feature.
3. `pytest` until green.
4. If the feature touches the storage layer, use `get_repository()` or
the FastAPI `get_repo` dependency — **never** import
`SQLiteRepository` directly. See the DI rule in `CLAUDE.md`.
5. If the feature has a performance story, run the stress suite with
tracing off and archive the CSV under `development/profiles/`.
6. `git commit`.
Related: [Design overview](Design-Overview) ·
[Performance Story](Performance-Story) ·
[Tracing and Profiling](Tracing-and-Profiling) ·
[Environment variables](Environment-Variables) ·
[Logging](Logging-and-Syslog).

Tracing-and-Profiling.md Normal file

@@ -0,0 +1,157 @@
# Tracing and Profiling
Three lenses. All off by default. All cost real CPU. Turn on, measure,
turn off. For the perf history that justifies the defaults, see
[Performance Story](Performance-Story). For every env var on this page,
see [Environment variables](Environment-Variables).
---
## 1. OpenTelemetry (distributed tracing)
Auto-instruments FastAPI, the internal workers, and outbound DB/HTTP
calls. Emits OTLP over gRPC.
### Enable
```bash
export DECNET_DEVELOPER_TRACING=true
export DECNET_OTEL_ENDPOINT=http://localhost:4317 # default
```
### Bring up a local collector
A dev collector ships in the repo:
```bash
cd development
docker compose -f docker-compose.otel.yml up -d
# now point DECNET at it (default endpoint already matches)
DECNET_DEVELOPER_TRACING=true decnet deploy --mode unihost --deckies 2
```
Traces appear in whatever backend the compose file wires to
(otel-collector → Jaeger/Tempo, depending on your compose overrides).
### Install the extras
Tracing is an optional dependency group:
```bash
pip install -e '.[tracing]'
# or the umbrella:
pip install -e '.[dev]'
```
Without the extras, `DECNET_DEVELOPER_TRACING=true` is a no-op — the
telemetry module detects missing packages and degrades silently. That
behaviour is covered by `tests/test_telemetry.py`
(`TestTracingDisabled`).
### Cost
Measured, not guessed. From `development/profiles/`:
- 1500 concurrent users, tracing ON: **p50=690 ms, 880 RPS**
- same load, tracing OFF: **p50=340 ms, 993 RPS**
That is roughly a **13% RPS hit**, and **p50 doubles**. Do not ship
production deployments with this flag on. Full breakdown in
[Performance Story §4](Performance-Story).
---
## 2. Pyinstrument (per-request flamegraphs)
A sampling profiler mounted as ASGI middleware. Each request gets its
own HTML flamegraph.
### Enable
```bash
export DECNET_PROFILE_REQUESTS=true
export DECNET_PROFILE_DIR=profiles # default, relative to $PWD
DECNET_PROFILE_REQUESTS=true decnet deploy --mode unihost --deckies 1
```
Hit an endpoint, then:
```bash
ls profiles/ # one *.html per request
xdg-open profiles/<most-recent>.html
```
### Install the extras
```bash
pip install -e '.[profile]'
```
### Cost
Pyinstrument samples at kHz frequencies and writes HTML per request.
That is **non-trivial CPU** and disk pressure. Do not leave it on. It
is a "one request at a time, look at it under a microscope" tool, not
a production observability channel.
---
## 3. py-spy / memray / snakeviz (dev extras)
Optional profilers declared in `pyproject.toml` under the `profile`
extras group. Install with `pip install -e '.[profile]'`.
### py-spy (sampling profiler, no code changes)
```bash
# find the API pid
ps -ef | grep 'uvicorn.*decnet'
# record a 30s flamegraph
sudo py-spy record -o pyspy.svg --pid <pid> --duration 30
```
There is a helper at `scripts/profile/pyspy-attach.sh` that documents
the common invocations. If py-spy cannot attach (Linux ptrace
restrictions), use Pyinstrument instead — the script prints fallback
guidance.
### memray (memory allocation tracing)
```bash
scripts/profile/memray-api.sh # launches API under memray
memray flamegraph memray.bin # generate report
```
Good for hunting leaks in the long-running workers (collector,
attacker-worker, sniffer).
### snakeviz (cProfile viewer)
```bash
python -m cProfile -o out.prof -m decnet.cli services
snakeviz out.prof
```
Useful for CLI / one-shot paths. For the API, prefer Pyinstrument.
---
## Production warning
> Do **not** leave `DECNET_DEVELOPER_TRACING=true` or
> `DECNET_PROFILE_REQUESTS=true` on in a real DECNET deployment.
The numbers in [Performance Story](Performance-Story) are measured on a
single baseline commit (`fb69a06`) with every other variable held
still, and tracing alone cost ~13% throughput and doubled p50. In an
attacker-facing honeypot, a slow API means delayed detection, missed
correlations, and potentially a noticeable behavioural tell — none of
which you want.
Rule: flip on, collect, flip off, commit.
Related: [Environment variables](Environment-Variables) ·
[Design overview](Design-Overview) ·
[Logging](Logging-and-Syslog) ·
[Testing and CI](Testing-and-CI).