Add Unit 17: Performance story, tracing/profiling, testing
- `Performance-Story.md`: war-story narrative from the 9 Locust CSVs in `development/profiles/`. Real numbers: baseline 5 RPS with 49% failure to 1585 RPS / 0% failure at 1500 concurrent users. Quantifies the OTel tax (~13% RPS, doubles p50) and the single-core collapse (34x throughput loss). Calls out the SQLite writer lock as the remaining bottleneck.
- `Tracing-and-Profiling.md`: how to flip OpenTelemetry, Pyinstrument, py-spy, memray, and snakeviz on and off. Documents the local OTel collector compose file and extras groups. Production warning.
- `Testing-and-CI.md`: pytest layout, markers, default opt-outs, the Locust stress recipe, project rules (tests for every feature, never ship broken code, green before commit), and the no-scapy-sniff-in-TestClient-lifespan rule.
# Performance Story

> A war story. Not a spec sheet. If you want the knobs, see
> [Tracing and Profiling](Tracing-and-Profiling). If you want the env vars,
> see [Environment variables](Environment-Variables).

DECNET is a honeypot. Honeypots get hammered. If the ingest path melts
under load, we lose attacker data — which is the only thing we care
about. This page is the story of how we got the API from "falls over at
200 users" to "holds 1.6k RPS at 1500 concurrent users" and what that
cost in blood.

All numbers below are **real**. They come from the nine Locust CSVs in
`development/profiles/`. No fabrication.

---

## Headline table

All runs hit the same FastAPI surface (`/api/v1/logs`, `/healthz`,
`/api/v1/attackers`, etc.) via Locust. The Aggregated row is what
matters.

| Profile | Users | Config | Requests | Fails | p50 (ms) | p95 (ms) | p99 (ms) | Avg (ms) | RPS |
|---|---|---|---|---|---|---|---|---|---|
| `profile_3106d0313507f016_locust.csv` | baseline | early code, tracing on | 7 410 | 20 | 740 | 87 000 | 187 000 | 12 999.71 | 5.5 |
| `profile_255c2e5.csv` | mid | regression, tracing on | 1 042 | 514 | 6 700 | 150 000 | 186 000 | 58 835.59 | 2.3 |
| `profile_2dd86fb.csv` | mid | tracing on, post-fix | 6 012 | 0 | 240 | 134 000 | 194 000 | 16 217.04 | 2.4 |
| `profile_e967aaa.csv` | ~1000 | tracing on, cleanups | 259 381 | 0 | 300 | 1 600 | 2 200 | 514.41 | 934.3 |
| `profile_fb69a06.csv` | ~1000 | tracing on, tuned | 396 672 | 0 | 100 | 1 900 | 2 900 | 465.03 | 963.6 |
| `profile_1500_fb69a06.csv` | 1500 | tracing ON | 232 648 | 0 | 690 | 6 500 | 9 500 | 1 773.51 | 880.4 |
| `profile_1500_notracing_fb69a06.csv` | 1500 | tracing OFF | 277 214 | 0 | 340 | 5 700 | 8 400 | 1 489.08 | 992.7 |
| `profile_1500_notracing_12_workers_fb69a06.csv` | 1500 | tracing OFF, 12 uvicorn workers | 308 024 | 0 | 700 | 2 700 | 4 200 | 929.88 | 1 585.1 |
| `profile_1500_notracing_single_core_fb69a06.csv` | 1500 | tracing OFF, single core pin | 3 532 | 0 | 270 | 115 000 | 122 000 | 21 728.92 | 46.2 |

(p50/p95/p99 = Locust Median / 95%ile / 99%ile columns. RPS = Current
RPS at end of the run.)
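
The table rows can be pulled straight out of a Locust stats CSV with
nothing but the stdlib. A minimal sketch; the exact column set varies by
Locust version, and the header and rows below are illustrative samples,
not one of the real profiles:

```python
import csv
import io

# Illustrative stand-in for a Locust --csv stats file; column names
# here are a simplified subset, not the full Locust header.
SAMPLE = """\
Type,Name,Request Count,Failure Count,Median Response Time,95%,99%,Average Response Time,Requests/s
POST,/api/v1/logs,200000,0,310,1700,2300,520.1,940.2
,Aggregated,232648,0,690,6500,9500,1773.51,880.4
"""

def aggregated_row(csv_text: str) -> dict:
    """Return the Aggregated row (the one the table reports) as a dict."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Name"] == "Aggregated":
            return row
    raise ValueError("no Aggregated row found")

agg = aggregated_row(SAMPLE)
print(agg["Requests/s"], agg["Median Response Time"])  # → 880.4 690
```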

---

## 1. The baseline: "it works, on Tuesdays"

The earliest usable profile is `profile_3106d0313507f016_locust.csv`.
7 410 requests, **20 failures**, and a p99 of **187 seconds**. You read
that right — the 99th percentile request took over three minutes to
come back. Current RPS at end of run: **5.5**.

We were not fast. We were not even slow in a respectable way.

`profile_255c2e5.csv` is worse: 1 042 requests, **514 failed** (49%
failure rate), p99 = 186 s, average 58.8 s per request. That is the
regression that proved our API could lock itself up completely when
everyone tried to write at once.

`profile_2dd86fb.csv` was the patch that stopped the bleeding: zero
failures, but still p95/p99 in the 100–200 s range. The API responded
to every request, eventually. That is not what anyone means by
"responded."

## 2. The turnaround: `e967aaa` and `fb69a06`

Then two commits changed everything.

`profile_e967aaa.csv`: **259 381 requests**, zero failures, p50=300 ms,
**p95=1.6 s, p99=2.2 s**, average 514 ms, **934 RPS**. Two orders of
magnitude better on tail latency, roughly 40x more requests serviced.

`profile_fb69a06.csv` squeezed more out: **396 672 requests**, zero
failures, p50=100 ms, p95=1.9 s, p99=2.9 s, average 465 ms, **963
RPS**. This is the commit we pinned as our "healthy baseline." Every
1500-user run below is tagged `_fb69a06` because we wanted to measure
**load and config**, not code churn.

How? The usual suspects: proper DB connection pooling, eliminating a
hot-path N+1, switching the repository layer to the injected
`get_repository()` / `get_repo` pattern (see CLAUDE.md's DI rule), and
no longer synchronously fsync'ing on every insert.

## 3. 1500 users: the API holds

`profile_1500_fb69a06.csv` turns the screws: **1500 concurrent users**,
tracing ON, default uvicorn worker count. Result: **232 648 requests**,
zero failures, p50=690 ms, p95=6.5 s, p99=9.5 s, 880 RPS.

Zero failures at 1500 users is the first genuine win. Latency got
uglier — p95 jumped from 1.9 s to 6.5 s — but nothing fell over. The
system is now **throughput-limited, not stability-limited**. That is a
different class of problem.

## 4. What OpenTelemetry cost us

Compare `profile_1500_fb69a06.csv` vs `profile_1500_notracing_fb69a06.csv`.
Same code, same load, same host. Only difference:
`DECNET_DEVELOPER_TRACING=false`.

| Metric | Tracing ON | Tracing OFF | Delta |
|---|---|---|---|
| Total requests | 232 648 | 277 214 | **+19%** |
| p50 | 690 ms | 340 ms | **-51%** |
| p95 | 6 500 ms | 5 700 ms | -12% |
| p99 | 9 500 ms | 8 400 ms | -12% |
| Avg | 1 773 ms | 1 489 ms | -16% |
| RPS | 880.4 | 992.7 | +13% |
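
The Delta column is not taken on faith; every percentage can be
recomputed from the raw ON/OFF numbers with a few lines of stdlib
Python:

```python
# Raw numbers from the two 1500-user profiles (avg rounded to ms).
on  = {"requests": 232_648, "p50": 690, "p95": 6_500, "p99": 9_500, "avg": 1_773, "rps": 880.4}
off = {"requests": 277_214, "p50": 340, "p95": 5_700, "p99": 8_400, "avg": 1_489, "rps": 992.7}

def pct(a: float, b: float) -> int:
    """Rounded percentage change going from tracing ON to tracing OFF."""
    return round((b - a) / a * 100)

deltas = {k: pct(on[k], off[k]) for k in on}
print(deltas)
# {'requests': 19, 'p50': -51, 'p95': -12, 'p99': -12, 'avg': -16, 'rps': 13}
```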

Auto-instrumented FastAPI tracing is **not free**. The median request
paid a ~350 ms tax and the API served ~16% fewer requests in the same
window. Tails are less affected because they are dominated by
I/O wait, not span overhead.

**Rule:** tracing stays off in production DECNET deployments. It is a
development lens. See [Tracing and Profiling](Tracing-and-Profiling).

## 5. Vertical scaling: 12 workers vs single core

`profile_1500_notracing_12_workers_fb69a06.csv`: tracing off, uvicorn
with **12 workers**. Result: **308 024 requests**, p50=700 ms, **p95=2.7
s, p99=4.2 s**, 1 585 RPS.

Going from default workers to 12 bought us: **+11% total requests, +60%
end-of-run RPS, -53% p95, -50% p99**. The tail improvement is the real
prize — more workers means fewer requests queued behind a slow one.

Now the punchline: `profile_1500_notracing_single_core_fb69a06.csv`.
Same config, pinned to **one core** via CPU affinity. Result: **3 532
requests total**, **p95=115 s, p99=122 s**, average 21.7 s, **46
RPS**.

Single-core is a **34x throughput collapse** vs 12 workers, and the
tail grows from 4 seconds to nearly two minutes. FastAPI + SQLite on
one core with 1500 concurrent clients is a queue, not a server.

Vertical scaling holds. Worker count matters. The GIL is real.
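
The profiles do not record the exact pinning command, so the snippet
below is a guess at methodology, not the harness's real code. On Linux
the shell equivalent is `taskset -c 0 <command>`; from Python the same
mechanism is `os.sched_setaffinity`, which does not exist on e.g. macOS,
hence the guard:

```python
import os

def pin_to_core(core: int = 0) -> bool:
    """Restrict the current process to one core; False if unsupported."""
    if not hasattr(os, "sched_setaffinity"):
        return False  # non-Linux: no affinity API in the os module
    try:
        os.sched_setaffinity(0, {core})  # pid 0 = this process
    except OSError:
        return False  # e.g. the core is outside our allowed cpuset
    return os.sched_getaffinity(0) == {core}

print(pin_to_core(0))
```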

## 6. Where is the bottleneck now?

Reading the 12-worker numbers: 1 585 RPS, p95=2.7 s, with zero
failures. That is good, but p95 should be far lower than 2.7 s for an
in-memory-ish workload. Candidates:

1. **SQLite single-writer lock.** All 12 workers share one
   `attackers.db`. SQLite's WAL mode helps readers, but writes still
   serialize. Under `/api/v1/logs` write amplification we expect
   queue-behind-writer stalls in exactly this latency envelope. The
   MySQL backend exists for exactly this reason — see
   [Database drivers](Database-Drivers).
2. **Python GIL on the aggregation hot path.** The single-core profile
   proves the interpreter is CPU-bound at saturation. 12 workers
   side-step the GIL only for independent requests — anything going
   through a shared lock (DB, in-process cache) re-serializes.
3. **Network stack / event-loop wait** on the Locust side — less
   likely; we checked client CPU during the runs.

Best defensible guess: **SQLite writer lock first, GIL second.**
Switching the hot-write path to MySQL (or even `PRAGMA
journal_mode=WAL` + batched inserts) should move p95 under a second at
the same RPS. That work is scoped but not landed. See
`development/FUTURE.md` for the queue.
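
The proposed mitigation is sketchable in a few lines. This is a shape,
not DECNET's real schema or repository code; the table and column names
are illustrative. The point is that 1000 rows cost one transaction (and
one writer-lock acquisition) instead of 1000:

```python
import os
import sqlite3
import tempfile

# WAL cannot be enabled on :memory:, so use a throwaway file DB.
path = os.path.join(tempfile.mkdtemp(), "attackers.db")
conn = sqlite3.connect(path)
conn.execute("PRAGMA journal_mode=WAL")  # readers stop blocking on the writer
conn.execute("CREATE TABLE logs (src_ip TEXT, payload TEXT)")

batch = [("203.0.113.7", f"probe-{i}") for i in range(1000)]
with conn:  # one commit for the whole batch
    conn.executemany("INSERT INTO logs VALUES (?, ?)", batch)

n = conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
print(n)  # → 1000
conn.close()
```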

---

## tl;dr

- From 5 RPS / 49% failure to **1 585 RPS / 0% failure at 1500 concurrent users.**
- Tracing costs ~13% RPS and doubles p50. Keep it off in production.
- Workers matter. Single-core pinning = 46 RPS and two-minute tails.
- Next bottleneck: the single SQLite writer. Blame the database, as is tradition.

Related: [Design overview](Design-Overview) ·
[Logging](Logging-and-Syslog) ·
[Tracing and Profiling](Tracing-and-Profiling) ·
[Testing and CI](Testing-and-CI).

# Testing and CI

Pytest is the only game in town. Every new feature ships with tests.
Never commit broken code. Those are house rules, not suggestions.

For the numbers that our stress tests produce, see
[Performance Story](Performance-Story). For the knobs used in dev, see
[Tracing and Profiling](Tracing-and-Profiling) and
[Environment variables](Environment-Variables).

---

## Layout

Everything lives under `tests/`:

```
tests/
  conftest.py        # global fixtures; forces DECNET_DEVELOPER_TRACING=false
  api/               # FastAPI + repository integration
  docker/            # image-build smoke (marker: docker)
  live/              # subprocess-backed service tests (marker: live)
  perf/              # per-request profiling harness + README
  service_testing/   # protocol-level decoys (SSH, SMB, RDP, ...)
  stress/            # Locust-driven load tests (marker: stress)
  test_*.py          # unit tests, flat layout
```

`tests/conftest.py` pins a few env vars for determinism:

- `DECNET_DEVELOPER=true`
- `DECNET_DEVELOPER_TRACING=false` (never trace in tests)
- `DECNET_DB_TYPE=sqlite`
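
The gist of that pinning looks like the following. This is a sketch;
the repo's actual conftest may use fixtures or a different shape, but
the idea is that the env is fixed before any app module is imported, so
every xdist worker sees the same deterministic config:

```python
import os

# Pin the environment at import time, before the app is loaded.
os.environ["DECNET_DEVELOPER"] = "true"
os.environ["DECNET_DEVELOPER_TRACING"] = "false"  # never trace in tests
os.environ["DECNET_DB_TYPE"] = "sqlite"

print(os.environ["DECNET_DEVELOPER_TRACING"])  # → false
```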

---

## Running the fast suite

```bash
# activate the venv first — see ANTI's rule
source .venv/bin/activate

pytest
```

`pyproject.toml` sets the default opt-outs:

```toml
[tool.pytest.ini_options]
addopts = "-m 'not fuzz and not live and not stress and not bench and not docker' -v -q -x -n logical --dist loadscope"
```

Meaning: the default run **skips** `fuzz`, `live`, `stress`, `bench`,
and `docker`. Everything else runs in parallel with xdist,
distributed by test module (`--dist loadscope`).

## Markers

All markers are declared in `pyproject.toml`:

| Marker | Meaning | Opt in |
|---|---|---|
| `fuzz` | Hypothesis-based fuzz tests, slow | `pytest -m fuzz` |
| `live` | Live subprocess service tests | `pytest -m live` |
| `live_docker` | Live Docker containers (needs `DECNET_LIVE_DOCKER=1`) | `pytest -m live_docker` |
| `stress` | Locust load tests | `pytest -m stress` |
| `bench` | pytest-benchmark micro-benchmarks | `pytest -m bench` |
| `docker` | Tests that build and run Docker images | `pytest -m docker` |

Run **everything**, including the slow markers:

```bash
pytest -m ''
```

Run a single marker and nothing else:

```bash
pytest -m stress tests/stress/ -v -x -n0 -s
```

---

## Stress tests (Locust)

The stress suite is how we produced every number in
[Performance Story](Performance-Story).

Invocation:

```bash
STRESS_USERS=2000 STRESS_SPAWN_RATE=200 \
  pytest -m stress tests/stress/ -v -x -n0 -s
```

Notes:

- `-n0` — **no xdist parallelism**. Locust wants to own the workers.
- `-s` — do not capture stdout; you want to see the RPS/latency stream.
- `-x` — stop on first failure. A Locust failure here means the API
  regressed.
- `STRESS_USERS` and `STRESS_SPAWN_RATE` are the knobs you tune. 2000
  users at spawn rate 200 is our standard "can the API hold" check.

Output CSVs land in `development/profiles/` (or wherever the harness
is configured to write) — the same CSV format parsed into the perf
table on [Performance Story](Performance-Story).

Before you run stress tests against a live deploy: **turn tracing
off**. Profiles with `DECNET_DEVELOPER_TRACING=true` are not
production-representative. The 1500-user comparison proves this.

---

## Project rules (non-negotiable)

From `CLAUDE.md` at repo root:

1. **Every new feature must have pytests.** No exceptions. If you add
   a collector, a parser, a service decoy, or a repository method,
   ship tests alongside it.
2. **Never pass broken code.** "Broken" here means: does not run, does
   not pass 100% of tests, leaves the suite in a red state. If the
   bar is red when you commit, you are violating the rule.
3. **100% pass before commit.** Green suite, then `git commit`. Not
   the other way around.

There is also a live rule worth calling out because it has bitten
more than one contributor:

> **No scapy sniff threads in TestClient lifespan tests.** Scapy's
> `sniff()` spawns a thread that does not clean up when FastAPI's
> TestClient tears down the lifespan context. Result: pytest hangs
> forever at teardown. If you need to verify sniffer behaviour under
> a live API, **use static source inspection** (import the module,
> assert on its attributes / dispatch table) rather than spinning up
> the real capture loop.

This is why `tests/test_sniffer_*.py` reaches into module internals
instead of just hitting an endpoint.
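
What "static source inspection" looks like in practice: import the
module and assert on its structure without ever starting the capture
loop. The module below is a self-contained stand-in, not DECNET's real
sniffer; its names (`PROTOCOL_HANDLERS`, `start_sniffing`) are
hypothetical:

```python
import importlib.util
import pathlib
import tempfile

# Write a stand-in "sniffer" module to disk so the import machinery
# is exercised the same way a real test would exercise it.
src = '''
PROTOCOL_HANDLERS = {"ssh": "handle_ssh", "smb": "handle_smb"}

def start_sniffing():
    raise RuntimeError("never call this inside a TestClient lifespan test")
'''
path = pathlib.Path(tempfile.mkdtemp()) / "fake_sniffer.py"
path.write_text(src)

spec = importlib.util.spec_from_file_location("fake_sniffer", path)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)

# Assertions touch attributes only; no thread, no socket, no teardown hang.
assert "ssh" in mod.PROTOCOL_HANDLERS
assert callable(mod.start_sniffing)
print(sorted(mod.PROTOCOL_HANDLERS))  # → ['smb', 'ssh']
```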

---

## Quick recipe for adding a feature

1. Write the failing test (pick the right marker — probably none, i.e.
   default suite).
2. Implement the feature.
3. `pytest` until green.
4. If the feature touches the storage layer, use `get_repository()` or
   the FastAPI `get_repo` dependency — **never** import
   `SQLiteRepository` directly. See the DI rule in `CLAUDE.md`.
5. If the feature has a performance story, run the stress suite with
   tracing off and archive the CSV under `development/profiles/`.
6. `git commit`.

Related: [Design overview](Design-Overview) ·
[Performance Story](Performance-Story) ·
[Tracing and Profiling](Tracing-and-Profiling) ·
[Environment variables](Environment-Variables) ·
[Logging](Logging-and-Syslog).

# Tracing and Profiling

Three lenses. All off by default. All cost real CPU. Turn on, measure,
turn off. For the perf history that justifies the defaults, see
[Performance Story](Performance-Story). For every env var on this page,
see [Environment variables](Environment-Variables).

---

## 1. OpenTelemetry (distributed tracing)

Auto-instruments FastAPI, the internal workers, and outbound DB/HTTP
calls. Emits OTLP over gRPC.

### Enable

```bash
export DECNET_DEVELOPER_TRACING=true
export DECNET_OTEL_ENDPOINT=http://localhost:4317   # default
```

### Bring up a local collector

A dev collector ships in the repo:

```bash
cd development
docker compose -f docker-compose.otel.yml up -d
# now point DECNET at it (default endpoint already matches)
DECNET_DEVELOPER_TRACING=true decnet deploy --mode unihost --deckies 2
```

Traces appear in whatever backend the compose file wires to
(otel-collector → Jaeger/Tempo, depending on your compose overrides).

### Install the extras

Tracing is an optional dependency group:

```bash
pip install -e '.[tracing]'
# or the umbrella:
pip install -e '.[dev]'
```

Without the extras, `DECNET_DEVELOPER_TRACING=true` is a no-op — the
telemetry module detects missing packages and degrades silently. That
behaviour is covered by `tests/test_telemetry.py`
(`TestTracingDisabled`).
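
The degrade-silently pattern, sketched from the outside (the telemetry
module's real internals may differ): a failed import of the tracing
extras turns the flag into a no-op instead of crashing the app. The
`setup_tracing` name is illustrative:

```python
# Gate tracing on both the flag and the optional dependency.
try:
    from opentelemetry import trace  # only present with the [tracing] extras
    TRACING_AVAILABLE = True
except ImportError:
    trace = None
    TRACING_AVAILABLE = False

def setup_tracing(enabled: bool) -> bool:
    """Return True only if tracing was both requested and possible."""
    if not enabled or not TRACING_AVAILABLE:
        return False
    # real setup would configure the OTLP exporter against
    # DECNET_OTEL_ENDPOINT here
    return True

print(setup_tracing(enabled=True))
```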

### Cost

Measured, not guessed. From `development/profiles/`:

- 1500 concurrent users, tracing ON: **p50=690 ms, 880 RPS**
- same load, tracing OFF: **p50=340 ms, 993 RPS**

That is roughly a **~13% RPS hit, and p50 doubles**. Do not ship
production deployments with this flag on. Full breakdown in
[Performance Story §4](Performance-Story).

---

## 2. Pyinstrument (per-request flamegraphs)

A sampling profiler mounted as ASGI middleware. Each request gets its
own HTML flamegraph.

### Enable

```bash
export DECNET_PROFILE_REQUESTS=true
export DECNET_PROFILE_DIR=profiles   # default, relative to $PWD
DECNET_PROFILE_REQUESTS=true decnet deploy --mode unihost --deckies 1
```

Hit an endpoint, then:

```bash
ls profiles/                      # one *.html per request
xdg-open profiles/<most-recent>.html
```

### Install the extras

```bash
pip install -e '.[profile]'
```

### Cost

Pyinstrument samples at kHz frequencies and writes HTML per request.
That is **non-trivial CPU** and disk pressure. Do not leave it on. It
is a "one request at a time, look at it under a microscope" tool, not
a production observability channel.

---

## 3. py-spy / memray / snakeviz (dev extras)

Optional profilers declared in `pyproject.toml` under the `profile`
extras group. Install with `pip install -e '.[profile]'`.

### py-spy (sampling profiler, no code changes)

```bash
# find the API pid
ps -ef | grep 'uvicorn.*decnet'

# record a 30s flamegraph
sudo py-spy record -o pyspy.svg --pid <pid> --duration 30
```

There is a helper at `scripts/profile/pyspy-attach.sh` that documents
the common invocations. If py-spy cannot attach (Linux ptrace
restrictions), use Pyinstrument instead — the script prints fallback
guidance.

### memray (memory allocation tracing)

```bash
scripts/profile/memray-api.sh     # launches API under memray
memray flamegraph memray.bin      # generate report
```

Good for hunting leaks in the long-running workers (collector,
attacker-worker, sniffer).

### snakeviz (cProfile viewer)

```bash
python -m cProfile -o out.prof -m decnet.cli services
snakeviz out.prof
```

Useful for CLI / one-shot paths. For the API, prefer Pyinstrument.

---

## Production warning

> Do **not** leave `DECNET_DEVELOPER_TRACING=true` or
> `DECNET_PROFILE_REQUESTS=true` on in a real DECNET deployment.

The numbers in [Performance Story](Performance-Story) are measured on a
single baseline commit (`fb69a06`) with every other variable held
still, and tracing alone cost ~13% throughput and doubled p50. In an
attacker-facing honeypot, a slow API means delayed detection, missed
correlations, and potentially a noticeable behavioural tell — none of
which you want.

Rule: flip on, collect, flip off, commit.

Related: [Environment variables](Environment-Variables) ·
[Design overview](Design-Overview) ·
[Logging](Logging-and-Syslog) ·
[Testing and CI](Testing-and-CI).