Files
DECNET/development/RELEASE-1.1.md
anti 3df9770cec docs(1.1): batch group VERIFIED LIVE — 509MB -> 129MB (-380MB / 75%)
Controlled unit swap on mothership: single PID hosts all 4 workers,
shared repo + 30s reconcile pass confirmed working, no crashes. Live
delta beat the floor estimate. Batch group complete.
2026-06-17 17:28:18 -04:00

12 KiB
Raw Blame History

DECNET 1.1 — RAM / Process-Footprint Release

Predecessor: v1.0.0. Theme: cut the fleet resident set from 2.57 GB → target ~1.0 GB by paying the import floor once per process-group instead of once per worker.

Analysis & measurements: see improvements.md.

⚠️ Strategy correction (after C2 measurement)

Initial assumption: lazy imports would drop idle workers below the 86 MB floor → ~1.3 GB. Measurement disproved this. All 25 CLI command modules transitively pull the SQLModel ORM, and the 38 MB table chain is a hub: importing any one table loads the whole registry. Most workers legitimately touch the DB, so they can't shed the ORM. Lazy registration only helps the 2-3 genuinely DB-less workers (forwarder, listener, maybe bus).

Therefore consolidation is the PRIMARY lever, not the fallback. Paying the 86 MB floor once for the idle herd (instead of ~8×) is the reliable ~600 MB win. C2/C3 stay as cheap hygiene and to let the DB-less workers actually skip the ORM, but they are no longer the headline.

Why

18 long-running workers, ~2.57 GB resident. ~1.5 GB of that is the same 86 MB import floor paid 18×, not workload. The floor is import decnet.cli pulling the entire SQLModel ORM + all 26 model tables into every worker, even ones that never touch the DB.

Root cause (pinned)

cli/__init__.py:22  from . import (... topology ...)
 → cli/topology.py:12        from decnet.topology.config import TopologyConfig
   → decnet/topology/__init__.py:10   from decnet.topology.generator import generate   ← TRIGGER
     → generator → allocator → repository → web.db.models.topology → all 26 tables (~38 MB)

The topology/__init__.py eager re-export of generate is the single thread that drags the ORM into every worker. No production code imports generate from the package surface (only tests, and they import the compose submodule) — safe to make lazy.

Commit plan (incremental, one concern each)

  • C1 — docs. improvements.md (analysis) + this release plan.
  • C2 — lazy topology re-export. PEP 562 __getattr__ in topology/__init__.py so generate loads on access, not on package import. Public API unchanged. Guard test added. Finding: correct, but ORM pull is pervasive (see correction above) — C2 alone does not move the floor.
  • C3 — DB-less worker lazy registration (reduced scope). Confirm forwarder / listener (and audit bus) never touch models at runtime; make ONLY their command modules import-clean so those specific processes skip the 38 MB ORM. Skip the DB-touching majority — they can't shed it. Test: those modules don't pull decnet.web.db.models.
  • C4 — extract idle-herd coroutines. Hoist the inline _run() closures (webhook, canary, listener, forwarder, mutate, enrich) into reusable async def run(bus, cfg) in their packages, so they're hostable by a supervisor. CLI commands become thin asyncio.run(run(...)) wrappers. No behaviour change.
  • C5 — decnet supervise primitive + batch group. decnet/supervisor.py (per-worker restart loops, NOT TaskGroup) + decnet supervise batch hosting reconcile/enrich/orchestrate/mutate with one shared repo + deploy/ decnet-supervise-batch.service.j2 (Conflicts= the units it replaces). Unit-tested. Live verification PENDING — must be a controlled unit swap (stop the 4 units, start supervise-batch), never a parallel run: mutate/orchestrate are stateful and double-running mutates deckies / doubles synthetic traffic. Measure RSS before/after. Remaining groups (cpu, scapy) and webhook-timeout prerequisite not yet built.
  • C6 — merge scapy workers. Optional. collect/probe/sniffer share the 76 MB scapy import once instead of 3×.

Scope boundaries (no creep)

  • In: import-floor reduction (C2C3), idle-herd consolidation (C4C5), scapy merge (C6).
  • Out: bus (broker — stays alone), api/web (already multiprocess), profiler/ttp (heavy resident state + real CPU — stay separate). Not touching DB schema, bus wire format, or worker logic — only where code is imported and which process hosts it.

Risk ladder

  • C2C3: import-site only, reversible, covered by an import-floor test. Low.
  • C4: pure extraction, behaviour-preserving, existing tests guard it. Low.
  • C5: introduces shared fate — one crash/OOM takes the herd; loses per-worker systemd restart + MemoryMax. Medium. Verify on the live fleet before adopting; keep the individual units as the fallback. Do C2C4 first; C5 only if RAM still bites.

C4/C5 Consolidation design — HOW, not just which

The governing principle

Consolidate by failure domain, keep every worker independently extractable. A worker's coroutine must not know whether it runs solo or hosted. "Hosted vs standalone" is a deploy-time config decision, never a code fork. That single rule makes consolidation reversible per-worker: if a co-located worker misbehaves, you pull it back to its own unit by editing a config list — no code change, no redeploy of others.

Two traps that kill the naive version

  1. asyncio.TaskGroup is the WRONG primitive. Its semantics are all-or-nothing: if one task raises, the group cancels every sibling and propagates. That is the opposite of worker isolation. A bug in webhook would cancel collector. We need independent supervision loops — each worker wrapped in restart/backoff — gathered with return_exceptions=True, NOT a bare TaskGroup/gather.

  2. Consolidation silently discards systemd features we rely on. Per-worker Restart=, MemoryMax= (cgroup), journal tagging, After=/Requires= ordering. The supervisor must replace the parts we used. Restart= → the in-process supervision loop below. MemoryMax= → survives as a per-group cgroup limit on the group's systemd unit (you lose per-worker granularity — that's the real cost, priced in below).

The supervision primitive (the one reusable bit — ~12 lines, no framework)

async def supervise(name, run, *, max_backoff=30):
    backoff = 1
    while not _shutdown.is_set():
        try:
            await run()                       # the worker's own coroutine
        except asyncio.CancelledError:
            raise
        except Exception:
            log.exception("worker %s crashed; restart in %ds", name, backoff)
            await asyncio.sleep(backoff); backoff = min(backoff * 2, max_backoff)
        else:
            break
# host: await asyncio.gather(*(supervise(n, r) for n, r in group), return_exceptions=True)

This IS systemd Restart=on-failure with exponential backoff, in-process. Shutdown reuses the existing system.{worker}.control bus topic.

The decision axis: RAM ⟷ isolation (three coherent points)

Design RAM win Isolation kept Verdict
A. Single supervisor (whole herd, 1 proc) max (600 MB) crash-isolation only (via loop); shared OOM, no per-worker cgroup too blunt — one leak starves all
B. Process groups by failure domain ~500 MB (4 floors vs ~13) crash + group-level cgroup + reversible per-worker recommended start
C. Prefork master (import once, gc.freeze(), fork children) potentially max, real-process isolation full per-process isolation via CoW-shared floor the big-win follow-on, gated on a 3.14 CoW measurement

Stage 1 — build the primitive + ONE group. Ship decnet supervise --group <name> reading a config list of {worker: run-callable}. Prove it on the safest group first.

Group by failure domain AND co-residency (corrected against the live ps):

forwarder/listener are role-split swarm singletons (forwarder=agent-side, listener=master-side) — never co-resident with the herd, one per role, so consolidating them saves nothing. They drop out of the grouping. The actually co-resident master herd is what matters.

Bonus discovered during extraction: the batch workers all share the decnet.web.dependencies.repo singleton → a group initializes it once and they share one DB connection pool, not N. Savings beyond the import floor.

Group (1 systemd unit each) Workers Why they belong together
supervise-batch Stage 1 reconcile, enrich, orchestrate, mutate periodic/event DB loops, all share repo; low crash risk
supervise-cpu clusterer, campaign-clusterer, attribution, reuse-correlate bursty/reactive CPU; GIL OK while idle, offload heavy kernels to a shared ProcessPoolExecutor only if contention shows
supervise-scapy collect, probe (+sniffer where present) share the 76 MB scapy import once; tolerate blocking threads

Stay separate, no exceptions: bus (broker), api/web (multiprocess by design), profiler (353 MB) + ttp (308 MB) — big resident state + sustained CPU, co-location just serializes them under the GIL. Deferred standalone: webhook (external HTTP → needs hard timeouts before co-location), canary (self-manages its own repo; revisit).

Net on the master: idle/CPU/scapy herd (~10 procs) → 3 group procs.

Stage 3 — evaluate prefork (C). Only if Stage 2's savings aren't enough. On Python 3.14, immortal objects (PEP 683) + gc.freeze() before fork() keep module/code pages out of refcount-dirtying, so CoW can share much of the 86 MB floor across real child processes — full isolation AND the RAM win. But CoW decay is workload-dependent: measure actual shared RSS on 3.14 before committing. If it shares well, prefork supersedes the groups; if refcounts dirty the pages anyway, we keep B and stop.

Why this order

B is incremental, reversible, and keeps the ops model you know — it de-risks the supervision pattern on one group before betting the fleet. C is the higher ceiling but rests on an empirical CoW question we haven't answered yet. Build the primitive once; it serves both.

Measured — batch group (2026-06-17, side-effect-free floor measurement)

The 4 live batch workers vs the consolidated startup floor (imports + repo pool + bus client, measured WITHOUT running a mutation tick to avoid racing the live fleet):

RSS
enrich + reconcile + mutate + orchestrate (4 procs) 509 MB
consolidated startup floor (1 proc) 118.5 MB
of which DB pool 15.7 MB (1× vs 4× → ~47 MB saved alone)

Delta ≈ 350 MB (~70%) on this group. 118.5 MB is a lower bound: the live consolidated process adds each worker's working set on top (idle workers, so modest — estimate ~140170 MB live). The exact live number needs the controlled unit swap.

VERIFIED LIVE (2026-06-17, controlled unit swap on mothership)

Swapped the 4 individual units → decnet-supervise-batch.service. Single PID hosts all four (supervisor: hosting 4 workers: reconcile, enrich, orchestrate, mutate); reconciler's 30s pass and the shared aiosqlite repo confirmed working; no crashes/restarts.

RSS
4 separate processes (before) 509 MB
decnet supervise batch (after, live) 129 MB
saved 380 MB (75%)

Beat the estimate — one process, one DB pool, four workers. Rollback = disable the supervisor, re-enable the 4 units (Conflicts= prevents accidental double-run).

Projected (rest of release)

  • Batch group (measured): 350 MB, confirmed by floor measurement, swap pending.
  • C-cpu group (4 clusterers): similar shape, ~300 MB expected.
  • C-scapy (collect+probe): share the 76 MB scapy once, ~150 MB.
  • Total: 2.57 GB → ~1.3 GB realistic (consolidation only; webhook/canary stay out). Prefork (C) could push further but is gated on the 3.14 CoW measurement.