docs(remote-updates): document shared venv, self-update argv mechanics, new symptom rows

- Install layout now shows the shared `/opt/decnet/venv/` (not a per-slot `.venv/`) and the persistent `/opt/decnet/.env.local` convention.
- "What the updater does" reflects: full-dep bootstrap on first venv use, then `--no-deps`; `/proc`-scan fallback when `agent.pid` is absent.
- The `--include-self` section spells out the argv reconstruction: why `sys.argv` is wrong inside the app process and which env vars the new argv is rebuilt from.
- The symptom table picks up three rows from the recent fixes (`--ssl-keyfile` crash, missing `typer` after the first self-update, old agent surviving the restart when it was started manually).

# Remote Updates

DECNET ships a **self-updater** daemon that runs on every worker alongside
`decnet agent`. It lets the master push a new working tree to the worker
(tarball over mTLS), install it, restart the agent, health-probe it, and
**auto-rollback** to the previous release if the new one is unhealthy —
all without `scp`, `sshpass`, or any human SSH session.

This page covers architecture, enrollment, the command surface, and the
failure modes you'll actually hit in practice. If you just want to ship
code, jump to [Pushing an update](#pushing-an-update).

## Why a separate daemon

The naive design puts the agent in charge of its own updates, and that
design bricks itself the first time you push broken code — the daemon
you'd use to roll the fix back is the daemon you just broke. The updater
dodges that paradox by being a completely separate process with its own
venv and its own mTLS identity. A normal update does not touch the
updater, so the updater is *always* a known-good rescuer.

A second, explicit endpoint, `POST /update-self`, handles updater
upgrades. It has no auto-rollback and you must opt in — the contract is
"you have chosen to push to the thing that rescues you; don't break it."

## Architecture

```
┌────────── MASTER ──────────┐      ┌──────────── WORKER ────────────┐
│ decnet swarm update ...    │      │                                │
│   tars working tree        │      │ decnet updater :8766 ◀───────┐ │
│   POST /update ──mTLS──▶───┼──────┼─▶ snapshots, installs, probes│ │
│                            │      │    restarts agent via exec   │ │
│                            │      │                              │ │
│                            │      │ decnet agent :8765 ◀─────────┘ │
│                            │      │   (managed by updater)         │
└────────────────────────────┘      └────────────────────────────────┘
```

Two daemons on each worker, each with a distinct cert (both signed by
the same DECNET CA that already backs [SWARM Mode](SWARM-Mode)). The
certificate CN distinguishes the two identities:

| Identity | CN example | Used for |
|---|---|---|
| Agent | `worker-01` | `/deploy`, `/teardown`, `/status`, `/health` on port 8765 |
| Updater | `updater@worker-01` | `/update`, `/update-self`, `/rollback`, `/releases` on port 8766 |
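
The dispatch on CN can be sketched in a few lines. This is illustrative, not the real `decnet` code: it assumes the dict shape returned by Python's `ssl.SSLSocket.getpeercert()`, and both helper names are hypothetical.

```python
def cn_from_peercert(peercert: dict) -> str:
    """Pull the commonName out of an ssl.getpeercert()-style dict."""
    # 'subject' is a tuple of RDNs, each a tuple of (key, value) pairs.
    for rdn in peercert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    raise ValueError("client certificate has no CN")


def role_for_cn(cn: str) -> str:
    """Map a certificate CN to the daemon identity it represents."""
    return "updater" if cn.startswith("updater@") else "agent"
```

So `CN=updater@worker-01` maps to the updater identity, and a bare hostname CN to the agent.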

## Install layout on the worker

The updater owns the release directory:

```
/opt/decnet/                  (default; override with --install-dir)
  current -> releases/active  (atomic symlink; flip == promotion)
  venv/                       shared venv — agent + updater run from here
  releases/
    active/                   source tree of the live release
    prev/                     the last good source snapshot
    active.new/               staging (only exists mid-update)
  updater/                    updater's own tree + venv + releases
                              — NEVER touched by a normal /update
  agent.pid                   PID of the agent process we spawned
  agent.spawn.log             stdout/stderr of the most recent spawn
  .env.local                  per-host overrides (JWT secret, DB URL, …)

~/.decnet/
  agent/                      worker.key, worker.crt, ca.crt
  updater/                    updater.key, updater.crt, ca.crt  (CN=updater@<host>)
```

The venv is **shared** across releases (not per-slot). An update swaps the
source-tree symlink; pip reinstalls the `decnet` package into the same
venv with `--force-reinstall --no-deps`, so the slow work is the fresh
tarball unpack, not a full dep rebuild. On the very first update into a
brand-new venv the full dep tree is installed once — subsequent updates
are near-no-op if dependencies haven't changed.
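
That bootstrap-vs-`--no-deps` decision can be sketched as follows. The helper name and the `first_use` flag are illustrative assumptions, not the updater's real API:

```python
from pathlib import Path


def pip_install_argv(venv: Path, tree: Path, first_use: bool) -> list[str]:
    """Build the pip invocation for an update.

    On the venv's first use the full dependency tree is installed;
    afterwards --no-deps replaces only the decnet package itself.
    """
    argv = [str(venv / "bin" / "pip"), "install", "--force-reinstall"]
    if not first_use:
        argv.append("--no-deps")
    return argv + [str(tree)]
```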

The updater loads `.env.local` from its working directory, so the worker
can carry a persistent per-host `.env.local` (JWT secret, DB URL, log
paths) without editing site-packages. The updater spawns the agent with
`cwd=/opt/decnet/` so the agent picks up the same file.
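
A minimal sketch of that convention: simple `KEY=value` lines overlaid on the inherited environment. The parsing details and the example key names are assumptions, not a spec of the real loader:

```python
import os


def load_env_local(text: str) -> dict[str, str]:
    """Parse KEY=value lines; blanks and '#' comments are ignored."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def spawn_env(env_local: dict[str, str]) -> dict[str, str]:
    """Overlay the per-host overrides on the inherited environment."""
    return {**os.environ, **env_local}
```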

## Enrollment

Enrolling a host for remote updates is a single extra flag on the
existing [`decnet swarm enroll`](SWARM-Mode#decnet-swarm-enroll):

```bash
decnet swarm enroll \
  --host 192.168.1.23 --address 192.168.1.23 \
  --sans 192.168.1.23 \
  --updater \
  --out-dir ./enroll-bundle
```

The controller now issues **two** certs signed by the same CA:

- `./enroll-bundle/{ca.crt, worker.crt, worker.key}` — goes to
  `~/.decnet/agent/` on the worker (same as before).
- `./enroll-bundle-updater/{ca.crt, updater.crt, updater.key}` — goes to
  `~/.decnet/updater/` on the worker.

Ship both directories to the worker once (this is the last `scp` you'll do
for this host), then on the worker:

```bash
sudo install -d -m 0700 ~/.decnet/agent ~/.decnet/updater
# ...scp the two bundles into place...
sudo decnet agent --daemon --agent-dir ~/.decnet/agent
sudo decnet updater --daemon --updater-dir ~/.decnet/updater \
  --install-dir /opt/decnet
```

From this point on, the master can push code without touching SSH.

### Without `--updater`

If you forgot `--updater` at enrollment time, decommission and re-enroll
the host — that's the cleanest path. The alternative is calling the
enrollment endpoint manually with `issue_updater_bundle=true` for an
already-enrolled host; that is currently a v2 concern.

## Pushing an update

From the master (your dev box), make your changes, commit if you want
(the tarball is the working tree: staged + unstaged + untracked), then:

```bash
# Push to one worker
decnet swarm update --host worker-01

# Push to every non-decommissioned worker
decnet swarm update --all

# Also ship the updater itself (explicit; no auto-rollback)
decnet swarm update --all --include-self

# Inspect what would ship — no network
decnet swarm update --all --dry-run
```
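
Packing the working tree could look roughly like this. The function name and the skip list are assumptions for illustration, not the real `decnet swarm update` implementation:

```python
import io
import tarfile
from pathlib import Path

# Directories that should never ship in a release tarball (assumed list).
SKIP = {".git", ".venv", "venv", "__pycache__"}


def pack_working_tree(root: str) -> bytes:
    """Gzip-tar every file under root — staged, unstaged, and untracked
    alike — with VCS and venv directories excluded."""
    rootp = Path(root)
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for path in sorted(rootp.rglob("*")):
            rel = path.relative_to(rootp)
            if SKIP & set(rel.parts):
                continue
            tar.add(path, arcname=str(rel), recursive=False)
    return buf.getvalue()
```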

Output is a table per host:

```
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ host       ┃ address      ┃ agent   ┃ self ┃ detail             ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ worker-01  │ 192.168.1.23 │ updated │ —    │ 4b7e1a9...         │
│ worker-02  │ 192.168.1.24 │ rolled- │ —    │ probe failed;...   │
│            │              │ back    │      │                    │
└────────────┴──────────────┴─────────┴──────┴────────────────────┘
```

- **`updated` (green)** — new release is live and the agent answered `/health`.
- **`rolled-back` (yellow, exit 1)** — the new release failed its post-deploy
  probe; the updater already swapped the symlink back and restarted the
  agent against the previous release. The worker is functional; the
  attempted update sits in `releases/prev/` on the worker for forensics.
- **`error` (red, exit 1)** — transport or install failure before the
  rotation even happened; no state change on the worker.

## What the updater actually does

For each `/update` request:

1. **Extract** the tarball into `/opt/decnet/releases/active.new/`. Paths
   with `..` or a leading `/` are rejected.
2. **Install**: `pip install --force-reinstall [--no-deps] <active.new>`
   into the **shared** `/opt/decnet/venv/`. The first time this runs (the
   venv doesn't exist yet) the updater bootstraps it with the full dep
   tree; every subsequent update uses `--no-deps` so only the `decnet`
   package is replaced. Non-zero exit → abort, return **500** with pip's
   stderr, no rotation.
3. **Rotate**: `prev/` (if present) is removed, `active/` → `prev/`,
   `active.new/` → `active/`. The `current` symlink is flipped atomically
   via `rename(2)`.
4. **Restart the agent**: SIGTERM the PID in `agent.pid`, wait up to 10 s,
   then SIGKILL if it is still alive. If `agent.pid` is missing (the agent
   was started manually, not spawned by the updater), the updater scans
   `/proc` for any `decnet agent` process and SIGTERMs those instead, so
   the restart is reliable regardless of how the agent was originally
   launched. A new agent is then spawned through the shared venv's
   `decnet` entry point with `cwd=/opt/decnet/`.
5. **Probe**: GET `https://127.0.0.1:8765/health` over mTLS, up to 10
   times with 1 s backoff.
6. **On probe success** → return **200** with the new release manifest.
7. **On probe failure** → swap `active/` ↔ `prev/` back, restart the
   agent again, re-probe, and return **409** with both probe transcripts
   and `rolled_back: true`. The master CLI translates that into a yellow
   "rolled-back" row and exit code 1.
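
The `/proc` fallback in step 4 can be sketched like this — a simplified stand-in for the real stop logic, with the cmdline-matching heuristic an assumption:

```python
import os
from pathlib import Path


def find_agent_pids() -> list[int]:
    """Find PIDs whose cmdline looks like 'decnet agent ...'."""
    pids = []
    proc = Path("/proc")
    if not proc.exists():  # non-Linux host: nothing to scan
        return pids
    for entry in proc.iterdir():
        if not entry.name.isdigit() or int(entry.name) == os.getpid():
            continue
        try:
            raw = (entry / "cmdline").read_bytes()
        except OSError:
            continue  # process exited mid-scan
        # /proc/<pid>/cmdline is NUL-separated argv
        words = [w.decode(errors="replace") for w in raw.split(b"\0") if w]
        if any(words[i].endswith("decnet") and words[i + 1] == "agent"
               for i in range(len(words) - 1)):
            pids.append(int(entry.name))
    return pids
```

Each returned PID would then get the same SIGTERM-wait-SIGKILL treatment as the `agent.pid` path.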

All of this runs inside the single POST handler. The update is atomic
from the outside: either the worker is on the new release *and* the
agent is healthy, or it's on the previous release (possibly because the
new one failed) with a healthy agent.
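
The promotion in step 3 hinges on `rename(2)` being atomic. A minimal POSIX sketch of the symlink flip (the helper name is illustrative):

```python
import os


def flip_current(link: str, target: str) -> None:
    """Atomically repoint `link` at `target`.

    A new symlink is created under a temporary name, then rename(2)
    moves it over the old one; readers see either the old or the new
    target, never a missing link.
    """
    tmp = link + ".tmp"
    os.symlink(target, tmp)
    os.replace(tmp, link)  # rename(2): atomic on POSIX filesystems
```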

## `--include-self`

Same protocol, but it targets `/opt/decnet/updater/`, which has its own
release slots and its own venv (`/opt/decnet/updater/venv/`). The
updater never touches this directory during a normal `/update`, which
is the whole point: a broken `/update` can't brick the thing that rolls
it back.

**Mechanics:**

1. The master POSTs the tarball to `POST /update-self` with
   `confirm_self=true`.
2. The updater extracts into `/opt/decnet/updater/releases/active.new/`
   and runs `pip install --force-reinstall <slot>` against
   `/opt/decnet/updater/venv/`. The first self-update bootstraps this
   venv with the full dep tree (typer, fastapi, uvicorn, …) before
   installing `decnet`.
3. On success, the updater rotates its own `active`/`prev` slots.
4. It then `os.execv`s into the newly installed binary with a cleanly
   reconstructed argv. The argv is **not** `sys.argv[1:]` — inside the
   running process, `sys.argv` is the uvicorn subprocess invocation
   (`--ssl-keyfile …`), which the `decnet updater` CLI does not
   understand. Instead the updater rebuilds `decnet updater --host …
   --port … --updater-dir … --install-dir … --agent-dir …` from env
   vars that `decnet.updater.server.run` stashes at startup
   (`DECNET_UPDATER_HOST`, `DECNET_UPDATER_PORT`,
   `DECNET_UPDATER_BUNDLE_DIR`, `DECNET_UPDATER_INSTALL_DIR`,
   `DECNET_UPDATER_AGENT_DIR`).
5. The TCP connection drops *mid-response*. That is normal: the master
   waits up to 30 s for the updater's `/health` to come back with the
   new SHA and treats that as success. There is no auto-rollback — if
   the new updater can't import, the old one is gone and you'll need
   SSH to recover.
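
The argv reconstruction in step 4 can be sketched like this. The env-var names come from the list above; the helper itself and the flag mapping are illustrative:

```python
import os

# Env vars stashed at startup, mapped back to their CLI flags.
_FLAG_FOR_ENV = [
    ("DECNET_UPDATER_HOST", "--host"),
    ("DECNET_UPDATER_PORT", "--port"),
    ("DECNET_UPDATER_BUNDLE_DIR", "--updater-dir"),
    ("DECNET_UPDATER_INSTALL_DIR", "--install-dir"),
    ("DECNET_UPDATER_AGENT_DIR", "--agent-dir"),
]


def rebuild_updater_argv(binary: str, env: dict) -> list[str]:
    """CLI argv for os.execv, rebuilt from the stashed env vars.

    sys.argv inside the running process is uvicorn's invocation
    (--ssl-keyfile ...), which the decnet CLI would reject.
    """
    argv = [binary, "updater"]
    for var, flag in _FLAG_FOR_ENV:
        if env.get(var):
            argv += [flag, env[var]]
    return argv


# The re-exec itself would then be:
#   os.execv(binary, rebuild_updater_argv(binary, dict(os.environ)))
```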

**Ordering:** agent first, updater second. A broken agent push should
not lock you into shipping the updater through that broken agent's
host.

**Use sparingly.** A bad self-update is the one case you *will* need
`scp` for — the wiki's promise of "no more scp" has one asterisk on it,
and this is it.

## Manual rollback

If you want to roll back without pushing new code (say you notice a
regression after the probe has already passed):

```bash
# No CLI for this yet in v1; hit the endpoint directly.
curl --cert ~/.decnet/ca/master/worker.crt \
     --key ~/.decnet/ca/master/worker.key \
     --cacert ~/.decnet/ca/master/ca.crt \
     -X POST https://<worker-ip>:8766/rollback
```

Returns 404 if there is no `prev/` slot, which is only the case on the
worker's very first release — a fresh install has nothing to roll back
to. A CLI wrapper (`decnet swarm rollback`) is planned for v2.

## Symptom table

| Symptom | Likely cause | Fix |
|---|---|---|
| `curl: (35) error:...peer certificate required` on 8766 | The updater is up, but mTLS rejected the client cert — wrong bundle. | Use the `~/.decnet/ca/master/` bundle, not the worker bundle. |
| `swarm update` hangs for >2 min on one host | pip install is slow on a very underpowered worker. | Bump `_TIMEOUT_UPDATE` in `decnet/swarm/updater_client.py` (temporary) or enroll more resources. |
| All hosts return `error: ConnectTimeout` | The updater isn't running on any worker. | On each worker: `sudo decnet updater --daemon --updater-dir ...`. |
| `rolled-back` on every push | The agent now imports something the worker doesn't have; the probe hits `/health` and gets a 500. | Read the `detail` field — it contains the agent's traceback. Fix and push again. |
| `rolled-back` only on workers with a different OS | The Compose/Buildx/Python version on that worker differs from the master's. | See [SWARM Mode prerequisites](SWARM-Mode#prerequisites). The updater does not install OS packages. |
| After `--include-self`, `/health` on 8766 never returns | The new updater failed at import time: `execv` succeeded but Python died. | SSH in; look at the updater's journalctl / stderr; revert `/opt/decnet/updater/` to the previous tree manually. |
| After `--include-self`, updater logs `No such option: --ssl-keyfile` and dies | Pre-fix bug: the updater re-exec'd with `sys.argv[1:]` (uvicorn's argv) instead of the CLI argv. | Fixed in commit `40d3e86`. Make sure the updater is running code that reconstructs argv from env — if not, SSH in and run `sudo decnet updater --host 0.0.0.0 --port 8766 --updater-dir ... --install-dir ...` manually. |
| After `--include-self`, updater dies with `ModuleNotFoundError: No module named 'typer'` | Pre-fix bug: a freshly bootstrapped updater venv installed `decnet` with `--no-deps`. | Fixed in commit `40d3e86`. On an already-broken host: `sudo rm -rf /opt/decnet/updater/{venv,releases,current}` and restart the updater from `/opt/decnet/venv/` — the next `--include-self` bootstraps a complete venv. |
| Agent keeps serving old code after `/update` returns 200 | The agent was started by hand (no `agent.pid`), so the pre-fix `_stop_agent` had nothing to kill. | Fixed in commit `ebeaf08`: `_stop_agent` now falls back to a `/proc` scan for any `decnet agent` process. On stale hosts, restart the agent once. |
| Master: `FileNotFoundError: .../master/worker.key` | The master identity was never materialized (no `swarm enroll` has run yet on this install). | Run `decnet swarm list` once — it materializes the master identity as a side effect of `ensure_master_identity()`. |

## Out of scope (v1)

- **Dependency changes.** The updater `pip install`s the new tree, so
  adding a dep *usually* just works. But if the new version of a dep
  fails to resolve against the worker's Python / index / lockfile, the
  rollback path catches it. It is not the updater's job to fix package
  layer drift — use `scp` and deploy manually once, then carry on.
- **OS package installs.** Upgrades to Docker / Compose / buildx are
  still manual on the worker. See [SWARM Mode prerequisites](SWARM-Mode#prerequisites).
- **Systemd unit changes.** Shipping a new unit file requires the old
  one to already work; the updater offers no rescue for that kind of
  change.
- **Schema migrations** on the worker's tiny SQLite (if any) — manual.
- **Canary / A/B rollouts.** `--all` is all-at-once. If you want a
  staged rollout, push to one host, observe, then push to the rest.
- **Signed release artifacts.** mTLS already authenticates the master;
  a detached signature is a v2 concern.