From e07128a72aa4a262fb7e412fdbbaf4faa5e10156 Mon Sep 17 00:00:00 2001 From: anti Date: Sun, 19 Apr 2026 00:13:51 -0400 Subject: [PATCH] docs(remote-updates): document shared venv, self-update argv mechanics, new symptom rows - Install layout now shows shared /opt/decnet/venv/ (not per-slot .venv/) and the persistent /opt/decnet/.env.local convention. - 'What the updater does' reflects: full-dep bootstrap on first venv use then --no-deps; /proc-scan fallback when agent.pid is absent. - --include-self section spells out the argv reconstruction: why sys.argv is wrong inside the app process and which env vars the new argv is rebuilt from. - Symptom table picks up three rows from the recent fixes (--ssl-keyfile crash, missing typer after first self-update, old agent surviving the restart when it was started manually). --- Remote-Updates.md | 287 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 287 insertions(+) create mode 100644 Remote-Updates.md diff --git a/Remote-Updates.md b/Remote-Updates.md new file mode 100644 index 0000000..4128fad --- /dev/null +++ b/Remote-Updates.md @@ -0,0 +1,287 @@ +# Remote Updates + +DECNET ships a **self-updater** daemon that runs on every worker alongside +`decnet agent`. It lets the master push a new working tree to the worker +(tarball over mTLS), install it, restart the agent, health-probe it, and +**auto-rollback** to the previous release if the new one is unhealthy — +all without `scp`, `sshpass`, or any human SSH session. + +This page covers architecture, enrollment, the command surface, and the +failure modes you'll actually hit in practice. If you just want to ship +code, jump to [Pushing an update](#pushing-an-update). + +## Why a separate daemon + +The naive design puts the agent in charge of its own updates. That +immediately bricks itself the first time you push broken code — the +daemon you'd use to roll the fix back is the daemon you just broke. 
The
+updater dodges that paradox by being a completely separate process with
+its own venv and its own mTLS identity. A normal update does not touch
+the updater, so the updater is *always* a known-good rescuer.
+
+A second explicit endpoint, `POST /update-self`, handles updater
+upgrades. It has no auto-rollback and you must opt in — the contract is
+"you have chosen to push to the thing that rescues you; don't break it."
+
+## Architecture
+
+```
+┌────────── MASTER ──────────┐        ┌──────────── WORKER ────────────┐
+│ decnet swarm update ...    │        │                                │
+│   tars working tree        │        │ decnet updater :8766  ◀──────┐ │
+│   POST /update ──mTLS──▶───┼────────┼─▶ snapshots, installs, probes│ │
+│                            │        │ restarts agent via exec      │ │
+│                            │        │                              │ │
+│                            │        │ decnet agent :8765    ◀──────┘ │
+│                            │        │ (managed by updater)           │
+└────────────────────────────┘        └────────────────────────────────┘
+```
+
+Two daemons on each worker, each with a distinct cert (both signed by
+the same DECNET CA that already backs [SWARM Mode](SWARM-Mode)).
+Certificate CN distinguishes identities:
+
+| Identity | CN example | Used for |
+|---|---|---|
+| Agent | `worker-01` | `/deploy`, `/teardown`, `/status`, `/health` on port 8765 |
+| Updater | `updater@worker-01` | `/update`, `/update-self`, `/rollback`, `/releases` on port 8766 |
+
+## Install layout on the worker
+
+The updater owns the release directory:
+
+```
+/opt/decnet/                    (default; override with --install-dir)
+  current -> releases/active    (atomic symlink; flip == promotion)
+  venv/                         shared venv — agent + updater run from here
+  releases/
+    active/                     source tree of the live release
+    prev/                       the last good source snapshot
+    active.new/                 staging (only exists mid-update)
+  updater/                      updater's own tree + venv + releases
+                                — NEVER touched by a normal /update
+  agent.pid                     PID of the agent process we spawned
+  agent.spawn.log               stdout/stderr of the most recent spawn
+  .env.local                    per-host overrides (JWT secret, DB URL, …)
+~/.decnet/
+  agent/                        worker.key, worker.crt, ca.crt
+  updater/                      updater.key, updater.crt, ca.crt (CN=updater@<host>)
+```
+
+The venv is **shared** across releases (not per-slot). An update swaps the
+source-tree symlink; pip reinstalls the `decnet` package into the same
+venv with `--force-reinstall --no-deps`, so the slow work is the fresh
+tarball unpack, not a full dep rebuild. On the very first update into a
+brand-new venv the full dep tree is installed once — subsequent updates
+are near-no-op if dependencies haven't changed.
+
+The updater loads `.env.local` from its working directory, so the worker
+can carry a persistent per-host `.env.local` (JWT secret, DB URL, log
+paths) without editing site-packages. The updater spawns the agent with
+`cwd=/opt/decnet/` so the agent picks up the same file.
+ +## Enrollment + +Enrolling a host for remote updates is a single extra flag on the +existing [`decnet swarm enroll`](SWARM-Mode#decnet-swarm-enroll): + +```bash +decnet swarm enroll \ + --host 192.168.1.23 --address 192.168.1.23 \ + --sans 192.168.1.23 \ + --updater \ + --out-dir ./enroll-bundle +``` + +The controller now issues **two** certs signed by the same CA: + +- `./enroll-bundle/{ca.crt, worker.crt, worker.key}` — goes to + `~/.decnet/agent/` on the worker (same as before). +- `./enroll-bundle-updater/{ca.crt, updater.crt, updater.key}` — + goes to `~/.decnet/updater/` on the worker. + +Ship both directories to the worker once (this is the last scp you'll do +for this host), then on the worker: + +```bash +sudo install -d -m 0700 ~/.decnet/agent ~/.decnet/updater +# ...scp the two bundles into place... +sudo decnet agent --daemon --agent-dir ~/.decnet/agent +sudo decnet updater --daemon --updater-dir ~/.decnet/updater \ + --install-dir /opt/decnet +``` + +From this point on the master can push code without touching SSH. + +### Without `--updater` + +If you forgot `--updater` at enrollment time, decommission and re-enroll +the host — that's the cleanest path. The alternative is running the +enrollment endpoint manually with `issue_updater_bundle=true` for an +already-enrolled host; this is currently a v2 concern. 
+
+## Pushing an update
+
+From the master (your dev box), make your changes, commit if you want
+(the tarball is the working tree, staged + unstaged + untracked), then:
+
+```bash
+# Push to one worker
+decnet swarm update --host worker-01
+
+# Push to every non-decommissioned worker
+decnet swarm update --all
+
+# Also ship the updater itself (explicit; no auto-rollback)
+decnet swarm update --all --include-self
+
+# Inspect what would ship — no network
+decnet swarm update --all --dry-run
+```
+
+Output is a table per host:
+
+```
+┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
+┃ host       ┃ address      ┃ agent   ┃ self ┃ detail             ┃
+┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
+│ worker-01  │ 192.168.1.23 │ updated │ —    │ 4b7e1a9...         │
+│ worker-02  │ 192.168.1.24 │ rolled- │ —    │ probe failed;...   │
+│            │              │ back    │      │                    │
+└────────────┴──────────────┴─────────┴──────┴────────────────────┘
+```
+
+- **`updated` (green)** — new release live, agent answered `/health`.
+- **`rolled-back` (yellow, exit 1)** — new release failed its post-deploy
+  probe; the updater already swapped the symlink back and restarted the
+  agent against the previous release. The worker is functional; the
+  attempted update is in `releases/prev/` on the worker for forensics.
+- **`error` (red, exit 1)** — transport or install failure before the
+  rotation even happened; no state change on the worker.
+
+## What the updater actually does
+
+For each `/update` request:
+
+1. **Extract** the tarball into `/opt/decnet/releases/active.new/`. Paths
+   with `..` or leading `/` are rejected.
+2. **Install**: `pip install --force-reinstall [--no-deps] <tree>`
+   into the **shared** `/opt/decnet/venv/`. The first time this runs (venv
+   doesn't exist yet) the updater bootstraps it with the full dep tree;
+   every subsequent update uses `--no-deps` so only the `decnet` package
+   is replaced. Non-zero exit → abort, return **500** with pip stderr, no
+   rotation.
+3. 
**Rotate**: `prev/` (if present) is removed, `active/` → `prev/`,
+   `active.new/` → `active/`. The `current` symlink is flipped atomically
+   via `rename(2)`.
+4. **Restart agent**: SIGTERM the PID in `agent.pid`, wait up to 10 s,
+   then SIGKILL if still alive. If `agent.pid` is missing (agent was
+   started manually, not spawned by the updater), the updater scans
+   `/proc` for any `decnet agent` process and SIGTERMs those instead —
+   so restart is reliable regardless of how the agent was originally
+   launched. Spawn a new agent through the shared venv's `decnet` entry
+   point with `cwd=/opt/decnet/`.
+5. **Probe**: GET `https://127.0.0.1:8765/health` over mTLS up to 10
+   times with 1 s backoff.
+6. **On probe success** → return **200** with the new release manifest.
+7. **On probe failure** → swap `active/` ↔ `prev/` back, restart the
+   agent again, re-probe, return **409** with both probe transcripts and
+   `rolled_back: true`. The master CLI translates that to a yellow
+   "rolled-back" row and exit code 1.
+
+All of this runs inside the single POST handler. The update is atomic
+from the outside: either the worker is on the new release *and* the
+agent is healthy, or it's on the previous release (possibly because the
+new one failed) with a healthy agent.
+
+## `--include-self`
+
+Same protocol but targets `/opt/decnet/updater/`, which has its own
+release slots and its own venv (`/opt/decnet/updater/venv/`). The
+updater never touches this directory during a normal `/update`, which
+is the whole point: a broken `/update` can't brick the thing that rolls
+it back.
+
+**Mechanics:**
+
+1. The master POSTs the tarball to `POST /update-self` with
+   `confirm_self=true`.
+2. The updater extracts into `/opt/decnet/updater/releases/active.new/`
+   and runs `pip install --force-reinstall <tree>` against
+   `/opt/decnet/updater/venv/`. The first self-update bootstraps this
+   venv with the full dep tree (typer, fastapi, uvicorn, …) before
+   installing `decnet`.
+3. 
On success, rotate the updater's own `active`/`prev` slots.
+4. `os.execv` into the newly installed binary with a cleanly
+   reconstructed argv. The argv is **not** `sys.argv[1:]` — inside the
+   running process `sys.argv` is the uvicorn subprocess invocation
+   (`--ssl-keyfile …`), which the `decnet updater` CLI does not
+   understand. Instead the updater rebuilds `decnet updater --host …
+   --port … --updater-dir … --install-dir … --agent-dir …` from env
+   vars that `decnet.updater.server.run` stashes at startup
+   (`DECNET_UPDATER_HOST`, `DECNET_UPDATER_PORT`,
+   `DECNET_UPDATER_BUNDLE_DIR`, `DECNET_UPDATER_INSTALL_DIR`,
+   `DECNET_UPDATER_AGENT_DIR`).
+5. The TCP connection drops *mid-response*. That is normal: the master
+   waits up to 30 s for the updater's `/health` to come back with the
+   new SHA and treats that as success. No auto-rollback — if the new
+   updater can't import, the old one is gone and you'll need SSH to
+   recover.
+
+**Ordering:** agent first, updater second. A broken agent push should
+not lock you into shipping the updater through that broken agent's
+host.
+
+**Use sparingly.** A bad self-update is the one case you *will* need
+`scp` for — the wiki's promise of "no more scp" has one asterisk on it,
+and this is it.
+
+## Manual rollback
+
+If you want to roll back without pushing new code (you notice a
+regression after the probe already passed):
+
+```bash
+# No CLI yet for this in v1; hit the endpoint directly.
+curl --cert ~/.decnet/ca/master/worker.crt \
+     --key ~/.decnet/ca/master/worker.key \
+     --cacert ~/.decnet/ca/master/ca.crt \
+     -X POST https://<worker>:8766/rollback
+```
+
+Returns 404 if there's no `prev/` slot (which is only the case on the
+worker's very first release — a fresh install has nothing to roll back
+to). A CLI wrapper (`decnet swarm rollback`) is planned for v2.
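The argv reconstruction described under `--include-self` above is the subtle part, so here it is concretely. The env var names are the ones the page lists; `rebuild_updater_argv` itself is an illustrative sketch, not the shipped function:

```python
def rebuild_updater_argv(env) -> list[str]:
    """Rebuild the CLI argv the updater was originally started with."""
    return [
        "decnet", "updater",
        "--host", env["DECNET_UPDATER_HOST"],
        "--port", env["DECNET_UPDATER_PORT"],
        "--updater-dir", env["DECNET_UPDATER_BUNDLE_DIR"],
        "--install-dir", env["DECNET_UPDATER_INSTALL_DIR"],
        "--agent-dir", env["DECNET_UPDATER_AGENT_DIR"],
    ]

# The re-exec itself would then look roughly like (not executed here):
#   os.execv(new_decnet_path, rebuild_updater_argv(os.environ))
# whereas sys.argv at this point holds uvicorn's flags (--ssl-keyfile ...)
# and would crash the new process's CLI parser.
```

Stashing the values at startup and reading them back at exec time is what makes the re-exec independent of how uvicorn mangled the process's own argv.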
+ +## Symptom table + +| Symptom | Likely cause | Fix | +|---|---|---| +| `curl: (35) error:...peer certificate required` on 8766 | Updater is up but mTLS rejected the client cert. Using the wrong bundle. | Use `~/.decnet/ca/master/` bundle, not the worker bundle. | +| `swarm update` hangs for >2 min on one host | pip install is slow on a very underpowered worker. | Bump `_TIMEOUT_UPDATE` in `decnet/swarm/updater_client.py` (temporary) or enroll more resources. | +| All hosts return `error: ConnectTimeout` | Updater isn't running on any worker. | On each worker: `sudo decnet updater --daemon --updater-dir ...`. | +| `rolled-back` on every push | The agent is now importing something the worker doesn't have. Probe hits `/health` and gets 500. | Read `detail` field — it contains the agent's traceback. Fix and push again. | +| `rolled-back` only on workers with a different OS | The Compose/Buildx/Python version on that worker differs from the master's. | [SWARM Mode prerequisites](SWARM-Mode#prerequisites). The updater does not install OS packages. | +| After `--include-self`, `/health` on 8766 never returns | The new updater failed at import time. `execv` succeeded but Python died. | SSH in; look at the updater's journalctl / stderr; revert `~/.decnet/updater/` to the previous tree manually. | +| After `--include-self`, updater logs `No such option: --ssl-keyfile` and dies | Pre-fix bug: updater re-exec'd with `sys.argv[1:]` (uvicorn's argv) instead of the CLI argv. | Fixed in commit `40d3e86`. Make sure the updater is running code that reconstructs argv from env — if not, SSH in and `sudo decnet updater --host 0.0.0.0 --port 8766 --updater-dir ... --install-dir ...` manually. | +| After `--include-self`, updater dies with `ModuleNotFoundError: No module named 'typer'` | Pre-fix bug: a freshly bootstrapped updater venv installed `decnet` with `--no-deps`. | Fixed in commit `40d3e86`. 
On an already-broken host: `sudo rm -rf /opt/decnet/updater/{venv,releases,current}` and restart the updater from `/opt/decnet/venv/` — the next `--include-self` bootstraps a complete venv. | +| Agent keeps serving old code after `/update` returns 200 | Agent was started by hand (no `agent.pid`), pre-fix `_stop_agent` had nothing to kill. | Fixed in commit `ebeaf08`: `_stop_agent` now falls back to a `/proc` scan for any `decnet agent` process. On stale hosts, restart the agent once. | +| Master: `FileNotFoundError: .../master/worker.key` | Master identity was never materialized (no `swarm enroll` has run yet on this install). | `decnet swarm list` once — it materializes the master identity as a side effect of `ensure_master_identity()`. | + +## Out of scope (v1) + +- **Dependency changes.** The updater `pip install`s the new tree, so + adding a dep *usually* just works. But if the new version of a dep + fails to resolve against the worker's Python / index / lockfile, the + rollback path catches it. It is not the updater's job to fix package + layer drift — use `scp` and deploy manually once, then carry on. +- **OS package installs.** Upgrades to Docker / Compose / buildx are + still manual on the worker. See [SWARM Mode prerequisites](SWARM-Mode#prerequisites). +- **Systemd unit changes.** Shipping a new unit file requires the old + one to already work; you won't get rescue for that kind of change via + the updater. +- **Schema migrations** on the worker's tiny SQLite (if any) — manual. +- **Canary / A/B rollouts.** `--all` is all-at-once. If you want staged + rollout, push to one host, observe, then the rest. +- **Signed release artifacts.** mTLS already authenticates the master; a + detached signature is a v2 concern.