
Performance Story

A war story. Not a spec sheet. If you want the knobs, see Tracing and Profiling. If you want the env vars, see Environment variables.

DECNET is a honeypot. Honeypots get hammered. If the ingest path melts under load, we lose attacker data — which is the only thing we care about. This page is the story of how we got the API from "falls over at 200 users" to "holds 3.3k RPS at 1500 concurrent users" and what that cost in blood.

All numbers below are real. They come from the nine Locust CSVs in development/profiles/. No fabrication.


Headline table

All runs hit the same FastAPI surface (/api/v1/logs, /healthz, /api/v1/attackers, etc.) via Locust. The Aggregated row is what matters.

| Profile | Users | Config | Requests | Fails | p50 (ms) | p95 (ms) | p99 (ms) | Avg (ms) | RPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| profile_3106d0313507f016_locust.csv | baseline | early code, tracing on | 7 410 | 20 | 740 | 87 000 | 187 000 | 12 999.71 | 5.5 |
| profile_255c2e5.csv | mid | regression, tracing on | 1 042 | 514 | 6 700 | 150 000 | 186 000 | 58 835.59 | 2.3 |
| profile_2dd86fb.csv | mid | tracing on, post-fix | 6 012 | 0 | 240 | 134 000 | 194 000 | 16 217.04 | 2.4 |
| profile_e967aaa.csv | ~1000 | tracing on, cleanups | 259 381 | 0 | 300 | 1 600 | 2 200 | 514.41 | 934.3 |
| profile_fb69a06.csv | ~1000 | tracing on, tuned | 396 672 | 0 | 100 | 1 900 | 2 900 | 465.03 | 963.6 |
| profile_1500_fb69a06.csv | 1500 | tracing ON | 232 648 | 0 | 690 | 6 500 | 9 500 | 1 773.51 | 880.4 |
| profile_1500_notracing_fb69a06.csv | 1500 | tracing OFF | 277 214 | 0 | 340 | 5 700 | 8 400 | 1 489.08 | 992.7 |
| profile_1500_notracing_12_workers_fb69a06.csv | 1500 | tracing OFF, 12 uvicorn workers | 308 024 | 0 | 700 | 2 700 | 4 200 | 929.88 | 1 585.1 |
| profile_1500_notracing_single_core_fb69a06.csv | 1500 | tracing OFF, single core pin | 3 532 | 0 | 270 | 115 000 | 122 000 | 21 728.92 | 46.2 |

(p50/p95/p99 = Locust Median / 95%ile / 99%ile columns. RPS = Current RPS at end of the run.)
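For flavor, this is roughly the shape of a Locustfile that exercises that surface. It is a sketch, not the real harness: the task weights and the log payload fields here are invented, only the endpoints are real.

```python
# Hypothetical Locustfile hitting the same FastAPI surface as the profiles above.
# Task weights and the /api/v1/logs payload shape are assumptions, not DECNET's.
from locust import HttpUser, between, task


class HoneypotUser(HttpUser):
    wait_time = between(0.1, 1.0)  # think time between requests

    @task(10)
    def post_log(self):
        # Write-heavy ingest path; field names are placeholders.
        self.client.post("/api/v1/logs", json={
            "source_ip": "203.0.113.7",
            "payload": "GET /wp-login.php",
        })

    @task(3)
    def list_attackers(self):
        self.client.get("/api/v1/attackers")

    @task(1)
    def healthz(self):
        self.client.get("/healthz")
```

The 1500-user runs correspond to an invocation along the lines of locust -f locustfile.py --users 1500 --csv profile_1500 pointed at the API host.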


1. The baseline: "it works, on Tuesdays"

The earliest usable profile is profile_3106d0313507f016_locust.csv. 7 410 requests, 20 failures, and a p99 of 187 seconds. You read that right — the 99th percentile request took over three minutes to come back. Current RPS at end of run: 5.5.

We were not fast. We were not even slow in a respectable way.

profile_255c2e5.csv is worse: 1 042 requests, 514 failed (49% failure rate), p99 = 186 s, average 58.8 s per request. That is the regression that proved our API could lock itself up completely when everyone tried to write at once.

profile_2dd86fb.csv was the patch that stopped the bleeding: zero failures, but still p95/p99 in the 100–200 s range. The API responded to every request, eventually. That is not what anyone means by "responded."

2. The turnaround: e967aaa and fb69a06

Then two commits changed everything.

profile_e967aaa.csv: 259 381 requests, zero failures, p50=300 ms, p95=1.6 s, p99=2.2 s, average 514 ms, 934 RPS. Two orders of magnitude better on tail latency, and roughly 35× the requests serviced in the baseline run.

profile_fb69a06.csv squeezed more out: 396 672 requests, zero failures, p50=100 ms, p95=1.9 s, p99=2.9 s, average 465 ms, 963 RPS. This is the commit we pinned as our "healthy baseline." Every 1500-user run below is tagged _fb69a06 because we wanted to measure load and config, not code churn.

How? The usual suspects: proper DB connection pooling, eliminating a hot-path N+1, switching the repository layer to the injected get_repository() / get_repo pattern (see CLAUDE.md's DI rule), and no longer synchronously fsync'ing on every insert.
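For the curious, the route-level shape of that injected-repository pattern looks roughly like the sketch below. Apart from get_repository(), every name (LogIn, LogRepository, insert_log) is an invented stand-in, not DECNET's actual models or signatures.

```python
# Illustrative sketch of the injected-repository pattern on the ingest route.
# LogIn, LogRepository, and insert_log are assumed names, not DECNET's real ones.
from fastapi import APIRouter, Depends, FastAPI
from pydantic import BaseModel

app = FastAPI()
router = APIRouter()


class LogIn(BaseModel):
    source_ip: str
    payload: str


class LogRepository:
    """Stand-in for a repository bound to a pooled DB connection."""

    async def insert_log(self, log: LogIn) -> dict:
        # The real implementation writes through the shared pool; no per-insert fsync.
        return {"status": "queued"}


def get_repository() -> LogRepository:
    # Dependency hook: swap the backing store here (SQLite, MySQL, a fake in tests).
    return LogRepository()


@router.post("/api/v1/logs")
async def ingest_log(log: LogIn, repo: LogRepository = Depends(get_repository)):
    return await repo.insert_log(log)


app.include_router(router)
```

The point of the hook is that the request handler never opens its own connection, so pooling and commit policy live in one place instead of being re-decided per route.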

3. 1500 users: the API holds

profile_1500_fb69a06.csv turns the screws: 1500 concurrent users, tracing ON, default uvicorn worker count. Result: 232 648 requests, zero failures, p50=690 ms, p95=6.5 s, p99=9.5 s, 880 RPS.

Zero failures at 1500 users is the first genuine win. Latency got uglier — p95 jumped from 1.9 s to 6.5 s — but nothing fell over. The system is now throughput-limited, not stability-limited. That is a different class of problem.

4. What OpenTelemetry cost us

Compare profile_1500_fb69a06.csv vs profile_1500_notracing_fb69a06.csv. Same code, same load, same host. Only difference: DECNET_DEVELOPER_TRACING=false.

| Metric | Tracing ON | Tracing OFF | Delta |
| --- | --- | --- | --- |
| Total requests | 232 648 | 277 214 | +19% |
| p50 | 690 ms | 340 ms | -51% |
| p95 | 6 500 ms | 5 700 ms | -12% |
| p99 | 9 500 ms | 8 400 ms | -12% |
| Avg | 1 773 ms | 1 489 ms | -16% |
| RPS | 880.4 | 992.7 | +13% |

Auto-instrumented FastAPI tracing is not free. The median request paid a ~350 ms tax, and the tracing-off run got through 19% more requests in the same window. Tails are less affected because they are dominated by I/O wait, not span overhead.

Rule: tracing stays off in production DECNET deployments. It is a development lens. See Tracing and Profiling.
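The gate itself is cheap. A minimal sketch of how that flag can control auto-instrumentation, assuming the stock opentelemetry-instrumentation-fastapi package; the module layout and the "false" default are illustrative, only the env var name comes from the runs above.

```python
# Sketch: gate OpenTelemetry auto-instrumentation on the same env var the
# profiles toggled. Module layout and defaults are assumptions.
import os

from fastapi import FastAPI


def maybe_enable_tracing(app: FastAPI) -> None:
    # Only pay the span-per-request overhead when explicitly asked for.
    if os.getenv("DECNET_DEVELOPER_TRACING", "false").lower() != "true":
        return
    from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

    FastAPIInstrumentor.instrument_app(app)


app = FastAPI()
maybe_enable_tracing(app)
```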

5. Vertical scaling: 12 workers vs single core

profile_1500_notracing_12_workers_fb69a06.csv: tracing off, uvicorn with 12 workers. Result: 308 024 requests, p50=700 ms, p95=2.7 s, p99=4.2 s, 1 585 RPS.

Going from default workers to 12 bought us: +11% requests served, +60% end-of-run RPS, -53% p95, -50% p99. The tail improvement is the real prize — more workers means fewer requests queued behind a slow one.
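For reference, that run boils down to a worker count on the uvicorn invocation. A sketch, with the import string as a placeholder for wherever DECNET's app object actually lives; the CLI equivalent is uvicorn <module:app> --workers 12.

```python
# One way to run the 12-worker configuration; "app.main:app" is a placeholder,
# not DECNET's actual module path.
import uvicorn

if __name__ == "__main__":
    uvicorn.run(
        "app.main:app",   # must be an import string when workers > 1
        host="0.0.0.0",
        port=8000,
        workers=12,       # one process per worker; each gets its own GIL
    )
```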

Now the punchline: profile_1500_notracing_single_core_fb69a06.csv. Same config, pinned to one core via CPU affinity. Result: 3 532 requests total, p95=115 s, p99=122 s, average 21.7 s, 46 RPS.

Single-core is a 34x throughput collapse vs 12-workers, and the tail grows from 4 seconds to nearly two minutes. FastAPI + SQLite on one core with 1500 concurrent clients is a queue, not a server.

Vertical scaling holds. Horizontal workers matter. The GIL is real.

6. Where is the bottleneck now?

Reading the 12-worker numbers: 1 585 RPS, p95=2.7 s, with zero failures. That is good, but p95 should be far lower than 2.7 s for an in-memory-ish workload. Candidates:

  1. SQLite single-writer lock. All 12 workers share one attackers.db. SQLite's WAL mode helps readers but writes still serialize. Under /api/v1/logs write amplification we expect queue-behind-writer stalls in exactly this latency envelope. The MySQL backend exists for exactly this reason — see Database drivers.
  2. Python GIL on the aggregation hot path. The single-core profile proves the interpreter is CPU-bound at saturation. 12 workers sidestep the GIL only for independent requests — anything going through a shared lock (DB, in-process cache) re-serializes.
  3. Network stack / event-loop wait on Locust side — less likely, we checked client CPU during the runs.

Best defensible guess: SQLite writer lock first, GIL second. Switching the hot-write path to MySQL (or even PRAGMA journal_mode=WAL + batched inserts) should move p95 under a second at the same RPS. That work is scoped but not landed. See development/FUTURE.md for the queue.
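The cheap half of that fix would look something like the sketch below. Table and column names are placeholders, and the real version belongs inside the repository layer rather than at module scope.

```python
# Rough sketch of the "WAL + batched inserts" option; table/column names are
# placeholders, not DECNET's schema.
import sqlite3

conn = sqlite3.connect("attackers.db")
conn.execute("PRAGMA journal_mode=WAL")    # readers no longer block on the writer
conn.execute("PRAGMA synchronous=NORMAL")  # fewer fsyncs per transaction in WAL mode


def flush_batch(rows: list[tuple[str, str]]) -> None:
    # One transaction per batch instead of one per request: the writer lock is
    # taken once for N inserts, which is where the p95 savings would come from.
    with conn:
        conn.executemany(
            "INSERT INTO logs (source_ip, payload) VALUES (?, ?)",
            rows,
        )
```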


tl;dr

  • From 5 RPS (and a 49%-failure regression) to 1 585 RPS / 0% failure at 1500 concurrent users.
  • Tracing costs ~13% RPS and doubles p50. Keep it off in production.
  • Workers matter. Single-core pinning = 46 RPS and two-minute tails.
  • Next bottleneck: the single SQLite writer. Blame the database, as is tradition.

Related: Design overview · Logging · Tracing and Profiling · Testing and CI.