From fc43909221476a1da3c018324bf1659d60a4319e Mon Sep 17 00:00:00 2001
From: anti <samuel@securejump.cl>
Date: Wed, 17 Jun 2026 16:40:19 -0400
Subject: [PATCH] =?UTF-8?q?docs(1.1):=20consolidation=20design=20=E2=80=94?=
 =?UTF-8?q?=20supervise=20by=20failure=20domain?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

HOW to consolidate: supervision-loop primitive (not TaskGroup, whose
all-or-nothing cancel breaks isolation); group by failure domain +
resource profile keeping per-group cgroup limits; every worker remains
config-extractable. Recommend process-groups now (~18->~9 units),
evaluate prefork+gc.freeze CoW on 3.14 as the higher-ceiling follow-on.
---
 development/RELEASE-1.1.md | 86 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 86 insertions(+)

diff --git a/development/RELEASE-1.1.md b/development/RELEASE-1.1.md
index d0103ebe..7b8ef589 100644
--- a/development/RELEASE-1.1.md
+++ b/development/RELEASE-1.1.md
@@ -75,6 +75,92 @@ the ORM into every worker. No production code imports `generate` from the packag
   restart + `MemoryMax`. **Medium.** Verify on the live fleet before adopting; keep the
   individual units as the fallback. Do C2–C4 first; C5 only if RAM still bites.
 
+## C4/C5 Consolidation design — HOW, not just which
+
+### The governing principle
+**Consolidate by failure domain, keep every worker independently extractable.**
+A worker's coroutine must not know whether it runs solo or hosted. "Hosted vs standalone"
+is a *deploy-time config decision*, never a code fork. That single rule makes consolidation
+reversible per-worker: if a co-located worker misbehaves, you pull it back to its own unit
+by editing a config list — no code change, no redeploy of others.
+
+### Two traps that kill the naive version
+
+1. **`asyncio.TaskGroup` is the WRONG primitive.** Its semantics are all-or-nothing: if one
+   task raises, the group cancels every sibling and propagates. That is the *opposite* of
+   worker isolation. A bug in `webhook` would cancel `collector`. We need independent
+   **supervision loops** — each worker wrapped in restart/backoff — gathered with
+   `return_exceptions=True`, NOT a bare TaskGroup/gather.
+
+2. **Consolidation silently discards systemd features we rely on.** Per-worker `Restart=`,
+   `MemoryMax=` (cgroup), journal tagging, `After=`/`Requires=` ordering. The supervisor must
+   *replace* the parts we used. `Restart=` → the in-process supervision loop below.
+   `MemoryMax=` → survives as a **per-group** cgroup limit on the group's systemd unit (you
+   lose per-*worker* granularity — that's the real cost, priced in below).
+
+### The supervision primitive (the one reusable bit — ~12 lines, no framework)
+```python
+async def supervise(name, run, *, max_backoff=30):
+    backoff = 1
+    while not _shutdown.is_set():
+        try:
+            await run()                       # the worker's own coroutine
+        except asyncio.CancelledError:
+            raise
+        except Exception:
+            log.exception("worker %s crashed; restart in %ds", name, backoff)
+            await asyncio.sleep(backoff); backoff = min(backoff * 2, max_backoff)
+        else:
+            break
+# host: await asyncio.gather(*(supervise(n, r) for n, r in group), return_exceptions=True)
+```
+This IS systemd `Restart=on-failure` with exponential backoff, in-process. Shutdown reuses
+the existing `system.{worker}.control` bus topic.
+
+### The decision axis: RAM ⟷ isolation (three coherent points)
+
+| Design | RAM win | Isolation kept | Verdict |
+|---|---|---|---|
+| **A. Single supervisor** (whole herd, 1 proc) | max (−600 MB) | crash-isolation only (via loop); shared OOM, no per-worker cgroup | too blunt — one leak starves all |
+| **B. Process groups by failure domain** ⭐ | ~−500 MB (4 floors vs ~13) | crash + group-level cgroup + reversible per-worker | **recommended start** |
+| **C. Prefork master** (import once, `gc.freeze()`, fork children) | potentially max, real-process isolation | full per-process isolation via CoW-shared floor | **the big-win follow-on, gated on a 3.14 CoW measurement** |
+
+### Recommended path: B now, measure C later
+
+**Stage 1 — build the primitive + ONE group.** Ship `decnet supervise --group <name>` reading
+a config list of `{worker: run-callable}`. Prove it on the safest group first.
+
+**Stage 2 — group by failure domain + resource profile** (not by convenience):
+
+| Group (1 systemd unit each) | Workers | Why they belong together |
+|---|---|---|
+| `supervise-io` | `forwarder`, `listener`, `mutate`, `webhook`† | pure IO, DB-light, rarely crash |
+| `supervise-batch` | `reconciler`, `enrich`, `orchestrator`, `canary` | periodic DB batch, similar churn |
+| `supervise-scapy` | `collect`, `probe`, `sniffer` | share the 76 MB scapy import once; tolerate blocking threads |
+| `supervise-cpu` | `clusterer`, `campaign-clusterer`, `attribution`, `reuse-correlate` | bursty/reactive CPU; GIL OK while idle, offload heavy kernels to a shared `ProcessPoolExecutor` only if contention shows |
+
+**Stay separate, no exceptions:** `bus` (broker), `api`/`web` (multiprocess by design),
+`profiler` (353 MB) + `ttp` (308 MB) — big resident state + sustained CPU, co-location just
+serializes them under the GIL.
+
+† `webhook` makes external HTTP calls → hang/crash risk. It only joins `supervise-io` once it
+has hard per-request timeouts; otherwise it stays standalone. Exactly the kind of call the
+"reversible per-worker" rule exists for.
+
+Net: ~18 units → **~9** (bus, api, web, profiler, ttp, + 4 supervise groups). ~13 floors → 4.
+
+**Stage 3 — evaluate prefork (C).** Only if Stage 2's savings aren't enough. On Python 3.14,
+immortal objects (PEP 683) + `gc.freeze()` before `fork()` keep module/code pages out of
+refcount-dirtying, so CoW can share much of the 86 MB floor across *real* child processes —
+full isolation AND the RAM win. But CoW decay is workload-dependent: **measure actual shared
+RSS on 3.14 before committing.** If it shares well, prefork supersedes the groups; if refcounts
+dirty the pages anyway, we keep B and stop.
+
+### Why this order
+B is incremental, reversible, and keeps the ops model you know — it de-risks the supervision
+pattern on one group before betting the fleet. C is the higher ceiling but rests on an
+empirical CoW question we haven't answered yet. Build the primitive once; it serves both.
+
 ## Projected (revised)
 
 - C2–C3 (import floor): only the 2-3 DB-less workers shed the ORM. **~100 MB.** Cheap hygiene.