docs(remote-updates): document shared venv, self-update argv mechanics, new symptom rows

- Install layout now shows the shared `/opt/decnet/venv/` (not a per-slot `.venv/`) and the persistent `/opt/decnet/.env.local` convention.
- "What the updater does" reflects: full-dep bootstrap on first venv use, then `--no-deps`; `/proc`-scan fallback when `agent.pid` is absent.
- The `--include-self` section spells out the argv reconstruction: why `sys.argv` is wrong inside the app process and which env vars the new argv is rebuilt from.
- The symptom table picks up three rows from the recent fixes (`--ssl-keyfile` crash, missing `typer` after the first self-update, old agent surviving the restart when it was started manually).

# Remote Updates

DECNET ships a **self-updater** daemon that runs on every worker alongside
`decnet agent`. It lets the master push a new working tree to the worker
(tarball over mTLS), install it, restart the agent, health-probe it, and
**auto-rollback** to the previous release if the new one is unhealthy —
all without `scp`, `sshpass`, or any human SSH session.

This page covers architecture, enrollment, the command surface, and the
failure modes you'll actually hit in practice. If you just want to ship
code, jump to [Pushing an update](#pushing-an-update).

## Why a separate daemon

The naive design puts the agent in charge of its own updates, and that
design bricks itself the first time you push broken code — the daemon
you'd use to roll the fix back is the daemon you just broke. The updater
dodges that paradox by being a completely separate process with its own
venv and its own mTLS identity. A normal update does not touch the
updater, so the updater is *always* a known-good rescuer.

A second, explicit endpoint, `POST /update-self`, handles updater
upgrades. It has no auto-rollback and you must opt in — the contract is
"you have chosen to push to the thing that rescues you; don't break it."

## Architecture

```
┌────────── MASTER ──────────┐      ┌──────────── WORKER ────────────┐
│ decnet swarm update ...    │      │                                │
│   tars working tree        │      │ decnet updater :8766 ◀───────┐ │
│   POST /update ──mTLS──▶───┼──────┼─▶ snapshots, installs, probes│ │
│                            │      │    restarts agent via exec   │ │
│                            │      │                              │ │
│                            │      │ decnet agent :8765 ◀─────────┘ │
│                            │      │   (managed by updater)         │
└────────────────────────────┘      └────────────────────────────────┘
```

Two daemons on each worker, each with a distinct cert (both signed by
the same DECNET CA that already backs [SWARM Mode](SWARM-Mode)). The
certificate CN distinguishes the two identities:

| Identity | CN example | Used for |
|---|---|---|
| Agent | `worker-01` | `/deploy`, `/teardown`, `/status`, `/health` on port 8765 |
| Updater | `updater@worker-01` | `/update`, `/update-self`, `/rollback`, `/releases` on port 8766 |
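
The dispatch on CN can be sketched in a few lines. This is illustrative, not the real `decnet` code: it assumes the dict shape returned by Python's `ssl.SSLSocket.getpeercert()`, and both helper names are hypothetical.

```python
def cn_from_peercert(peercert: dict) -> str:
    """Pull the commonName out of an ssl.getpeercert()-style dict."""
    # 'subject' is a tuple of RDNs, each a tuple of (key, value) pairs.
    for rdn in peercert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    raise ValueError("client certificate has no CN")


def role_for_cn(cn: str) -> str:
    """Map a certificate CN to the daemon identity it represents."""
    return "updater" if cn.startswith("updater@") else "agent"
```

So `CN=updater@worker-01` maps to the updater identity, and a bare hostname CN to the agent.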

## Install layout on the worker

The updater owns the release directory:

```
/opt/decnet/                  (default; override with --install-dir)
  current -> releases/active  (atomic symlink; flip == promotion)
  venv/                       shared venv — agent + updater run from here
  releases/
    active/                   source tree of the live release
    prev/                     the last good source snapshot
    active.new/               staging (only exists mid-update)
  updater/                    updater's own tree + venv + releases
                              — NEVER touched by a normal /update
  agent.pid                   PID of the agent process we spawned
  agent.spawn.log             stdout/stderr of the most recent spawn
  .env.local                  per-host overrides (JWT secret, DB URL, …)

~/.decnet/
  agent/                      worker.key, worker.crt, ca.crt
  updater/                    updater.key, updater.crt, ca.crt  (CN=updater@<host>)
```

The venv is **shared** across releases (not per-slot). An update swaps the
source-tree symlink; pip reinstalls the `decnet` package into the same
venv with `--force-reinstall --no-deps`, so the slow work is the fresh
tarball unpack, not a full dep rebuild. On the very first update into a
brand-new venv the full dep tree is installed once — subsequent updates
are near-no-op if dependencies haven't changed.
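
That bootstrap-vs-`--no-deps` decision can be sketched as follows. The helper name and the `first_use` flag are illustrative assumptions, not the updater's real API:

```python
from pathlib import Path


def pip_install_argv(venv: Path, tree: Path, first_use: bool) -> list[str]:
    """Build the pip invocation for an update.

    On the venv's first use the full dependency tree is installed;
    afterwards --no-deps replaces only the decnet package itself.
    """
    argv = [str(venv / "bin" / "pip"), "install", "--force-reinstall"]
    if not first_use:
        argv.append("--no-deps")
    return argv + [str(tree)]
```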

The updater loads `.env.local` from its working directory, so the worker
can carry a persistent per-host `.env.local` (JWT secret, DB URL, log
paths) without editing site-packages. The updater spawns the agent with
`cwd=/opt/decnet/` so the agent picks up the same file.
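
A minimal sketch of that convention: simple `KEY=value` lines overlaid on the inherited environment. The parsing details and the example key names are assumptions, not a spec of the real loader:

```python
import os


def load_env_local(text: str) -> dict[str, str]:
    """Parse KEY=value lines; blanks and '#' comments are ignored."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def spawn_env(env_local: dict[str, str]) -> dict[str, str]:
    """Overlay the per-host overrides on the inherited environment."""
    return {**os.environ, **env_local}
```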

## Enrollment

Enrolling a host for remote updates is a single extra flag on the
existing [`decnet swarm enroll`](SWARM-Mode#decnet-swarm-enroll):

```bash
decnet swarm enroll \
  --host 192.168.1.23 --address 192.168.1.23 \
  --sans 192.168.1.23 \
  --updater \
  --out-dir ./enroll-bundle
```

The controller now issues **two** certs signed by the same CA:

- `./enroll-bundle/{ca.crt, worker.crt, worker.key}` — goes to
  `~/.decnet/agent/` on the worker (same as before).
- `./enroll-bundle-updater/{ca.crt, updater.crt, updater.key}` — goes to
  `~/.decnet/updater/` on the worker.

Ship both directories to the worker once (this is the last `scp` you'll do
for this host), then on the worker:

```bash
sudo install -d -m 0700 ~/.decnet/agent ~/.decnet/updater
# ...scp the two bundles into place...
sudo decnet agent --daemon --agent-dir ~/.decnet/agent
sudo decnet updater --daemon --updater-dir ~/.decnet/updater \
  --install-dir /opt/decnet
```

From this point on, the master can push code without touching SSH.

### Without `--updater`

If you forgot `--updater` at enrollment time, decommission and re-enroll
the host — that's the cleanest path. The alternative is calling the
enrollment endpoint manually with `issue_updater_bundle=true` for an
already-enrolled host; that is currently a v2 concern.

## Pushing an update

From the master (your dev box), make your changes, commit if you want
(the tarball is the working tree: staged + unstaged + untracked), then:

```bash
# Push to one worker
decnet swarm update --host worker-01

# Push to every non-decommissioned worker
decnet swarm update --all

# Also ship the updater itself (explicit; no auto-rollback)
decnet swarm update --all --include-self

# Inspect what would ship — no network
decnet swarm update --all --dry-run
```
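
Packing the working tree could look roughly like this. The function name and the skip list are assumptions for illustration, not the real `decnet swarm update` implementation:

```python
import io
import tarfile
from pathlib import Path

# Directories that should never ship in a release tarball (assumed list).
SKIP = {".git", ".venv", "venv", "__pycache__"}


def pack_working_tree(root: str) -> bytes:
    """Gzip-tar every file under root — staged, unstaged, and untracked
    alike — with VCS and venv directories excluded."""
    rootp = Path(root)
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for path in sorted(rootp.rglob("*")):
            rel = path.relative_to(rootp)
            if SKIP & set(rel.parts):
                continue
            tar.add(path, arcname=str(rel), recursive=False)
    return buf.getvalue()
```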

Output is a table per host:

```
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ host       ┃ address      ┃ agent   ┃ self ┃ detail             ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ worker-01  │ 192.168.1.23 │ updated │ —    │ 4b7e1a9...         │
│ worker-02  │ 192.168.1.24 │ rolled- │ —    │ probe failed;...   │
│            │              │ back    │      │                    │
└────────────┴──────────────┴─────────┴──────┴────────────────────┘
```

- **`updated` (green)** — new release is live and the agent answered `/health`.
- **`rolled-back` (yellow, exit 1)** — the new release failed its post-deploy
  probe; the updater already swapped the symlink back and restarted the
  agent against the previous release. The worker is functional; the
  attempted update sits in `releases/prev/` on the worker for forensics.
- **`error` (red, exit 1)** — transport or install failure before the
  rotation even happened; no state change on the worker.

## What the updater actually does

For each `/update` request:

1. **Extract** the tarball into `/opt/decnet/releases/active.new/`. Paths
   with `..` or a leading `/` are rejected.
2. **Install**: `pip install --force-reinstall [--no-deps] <active.new>`
   into the **shared** `/opt/decnet/venv/`. The first time this runs (the
   venv doesn't exist yet) the updater bootstraps it with the full dep
   tree; every subsequent update uses `--no-deps` so only the `decnet`
   package is replaced. Non-zero exit → abort, return **500** with pip's
   stderr, no rotation.
3. **Rotate**: `prev/` (if present) is removed, `active/` → `prev/`,
   `active.new/` → `active/`. The `current` symlink is flipped atomically
   via `rename(2)`.
4. **Restart the agent**: SIGTERM the PID in `agent.pid`, wait up to 10 s,
   then SIGKILL if it is still alive. If `agent.pid` is missing (the agent
   was started manually, not spawned by the updater), the updater scans
   `/proc` for any `decnet agent` process and SIGTERMs those instead, so
   the restart is reliable regardless of how the agent was originally
   launched. A new agent is then spawned through the shared venv's
   `decnet` entry point with `cwd=/opt/decnet/`.
5. **Probe**: GET `https://127.0.0.1:8765/health` over mTLS, up to 10
   times with 1 s backoff.
6. **On probe success** → return **200** with the new release manifest.
7. **On probe failure** → swap `active/` ↔ `prev/` back, restart the
   agent again, re-probe, and return **409** with both probe transcripts
   and `rolled_back: true`. The master CLI translates that into a yellow
   "rolled-back" row and exit code 1.
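
The `/proc` fallback in step 4 can be sketched like this — a simplified stand-in for the real stop logic, with the cmdline-matching heuristic an assumption:

```python
import os
from pathlib import Path


def find_agent_pids() -> list[int]:
    """Find PIDs whose cmdline looks like 'decnet agent ...'."""
    pids = []
    proc = Path("/proc")
    if not proc.exists():  # non-Linux host: nothing to scan
        return pids
    for entry in proc.iterdir():
        if not entry.name.isdigit() or int(entry.name) == os.getpid():
            continue
        try:
            raw = (entry / "cmdline").read_bytes()
        except OSError:
            continue  # process exited mid-scan
        # /proc/<pid>/cmdline is NUL-separated argv
        words = [w.decode(errors="replace") for w in raw.split(b"\0") if w]
        if any(words[i].endswith("decnet") and words[i + 1] == "agent"
               for i in range(len(words) - 1)):
            pids.append(int(entry.name))
    return pids
```

Each returned PID would then get the same SIGTERM-wait-SIGKILL treatment as the `agent.pid` path.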

All of this runs inside the single POST handler. The update is atomic
from the outside: either the worker is on the new release *and* the
agent is healthy, or it's on the previous release (possibly because the
new one failed) with a healthy agent.
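
The promotion in step 3 hinges on `rename(2)` being atomic. A minimal POSIX sketch of the symlink flip (the helper name is illustrative):

```python
import os


def flip_current(link: str, target: str) -> None:
    """Atomically repoint `link` at `target`.

    A new symlink is created under a temporary name, then rename(2)
    moves it over the old one; readers see either the old or the new
    target, never a missing link.
    """
    tmp = link + ".tmp"
    os.symlink(target, tmp)
    os.replace(tmp, link)  # rename(2): atomic on POSIX filesystems
```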

## `--include-self`

Same protocol, but it targets `/opt/decnet/updater/`, which has its own
release slots and its own venv (`/opt/decnet/updater/venv/`). The
updater never touches this directory during a normal `/update`, which
is the whole point: a broken `/update` can't brick the thing that rolls
it back.

**Mechanics:**

1. The master POSTs the tarball to `POST /update-self` with
   `confirm_self=true`.
2. The updater extracts into `/opt/decnet/updater/releases/active.new/`
   and runs `pip install --force-reinstall <slot>` against
   `/opt/decnet/updater/venv/`. The first self-update bootstraps this
   venv with the full dep tree (typer, fastapi, uvicorn, …) before
   installing `decnet`.
3. On success, the updater rotates its own `active`/`prev` slots.
4. It then `os.execv`s into the newly installed binary with a cleanly
   reconstructed argv. The argv is **not** `sys.argv[1:]` — inside the
   running process, `sys.argv` is the uvicorn subprocess invocation
   (`--ssl-keyfile …`), which the `decnet updater` CLI does not
   understand. Instead the updater rebuilds `decnet updater --host …
   --port … --updater-dir … --install-dir … --agent-dir …` from env
   vars that `decnet.updater.server.run` stashes at startup
   (`DECNET_UPDATER_HOST`, `DECNET_UPDATER_PORT`,
   `DECNET_UPDATER_BUNDLE_DIR`, `DECNET_UPDATER_INSTALL_DIR`,
   `DECNET_UPDATER_AGENT_DIR`).
5. The TCP connection drops *mid-response*. That is normal: the master
   waits up to 30 s for the updater's `/health` to come back with the
   new SHA and treats that as success. There is no auto-rollback — if
   the new updater can't import, the old one is gone and you'll need
   SSH to recover.
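
The argv reconstruction in step 4 can be sketched like this. The env-var names come from the list above; the helper itself and the flag mapping are illustrative:

```python
import os

# Env vars stashed at startup, mapped back to their CLI flags.
_FLAG_FOR_ENV = [
    ("DECNET_UPDATER_HOST", "--host"),
    ("DECNET_UPDATER_PORT", "--port"),
    ("DECNET_UPDATER_BUNDLE_DIR", "--updater-dir"),
    ("DECNET_UPDATER_INSTALL_DIR", "--install-dir"),
    ("DECNET_UPDATER_AGENT_DIR", "--agent-dir"),
]


def rebuild_updater_argv(binary: str, env: dict) -> list[str]:
    """CLI argv for os.execv, rebuilt from the stashed env vars.

    sys.argv inside the running process is uvicorn's invocation
    (--ssl-keyfile ...), which the decnet CLI would reject.
    """
    argv = [binary, "updater"]
    for var, flag in _FLAG_FOR_ENV:
        if env.get(var):
            argv += [flag, env[var]]
    return argv


# The re-exec itself would then be:
#   os.execv(binary, rebuild_updater_argv(binary, dict(os.environ)))
```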

**Ordering:** agent first, updater second. A broken agent push should
not lock you into shipping the updater through that broken agent's
host.

**Use sparingly.** A bad self-update is the one case you *will* need
`scp` for — the wiki's promise of "no more scp" has one asterisk on it,
and this is it.

## Manual rollback

If you want to roll back without pushing new code (say you notice a
regression after the probe has already passed):

```bash
# No CLI for this yet in v1; hit the endpoint directly.
curl --cert ~/.decnet/ca/master/worker.crt \
     --key ~/.decnet/ca/master/worker.key \
     --cacert ~/.decnet/ca/master/ca.crt \
     -X POST https://<worker-ip>:8766/rollback
```

Returns 404 if there is no `prev/` slot, which is only the case on the
worker's very first release — a fresh install has nothing to roll back
to. A CLI wrapper (`decnet swarm rollback`) is planned for v2.

## Symptom table

| Symptom | Likely cause | Fix |
|---|---|---|
| `curl: (35) error:...peer certificate required` on 8766 | The updater is up, but mTLS rejected the client cert — wrong bundle. | Use the `~/.decnet/ca/master/` bundle, not the worker bundle. |
| `swarm update` hangs for >2 min on one host | pip install is slow on a very underpowered worker. | Bump `_TIMEOUT_UPDATE` in `decnet/swarm/updater_client.py` (temporary) or enroll more resources. |
| All hosts return `error: ConnectTimeout` | The updater isn't running on any worker. | On each worker: `sudo decnet updater --daemon --updater-dir ...`. |
| `rolled-back` on every push | The agent now imports something the worker doesn't have; the probe hits `/health` and gets a 500. | Read the `detail` field — it contains the agent's traceback. Fix and push again. |
| `rolled-back` only on workers with a different OS | The Compose/Buildx/Python version on that worker differs from the master's. | See [SWARM Mode prerequisites](SWARM-Mode#prerequisites). The updater does not install OS packages. |
| After `--include-self`, `/health` on 8766 never returns | The new updater failed at import time: `execv` succeeded but Python died. | SSH in; look at the updater's journalctl / stderr; revert `/opt/decnet/updater/` to the previous tree manually. |
| After `--include-self`, updater logs `No such option: --ssl-keyfile` and dies | Pre-fix bug: the updater re-exec'd with `sys.argv[1:]` (uvicorn's argv) instead of the CLI argv. | Fixed in commit `40d3e86`. Make sure the updater is running code that reconstructs argv from env — if not, SSH in and run `sudo decnet updater --host 0.0.0.0 --port 8766 --updater-dir ... --install-dir ...` manually. |
| After `--include-self`, updater dies with `ModuleNotFoundError: No module named 'typer'` | Pre-fix bug: a freshly bootstrapped updater venv installed `decnet` with `--no-deps`. | Fixed in commit `40d3e86`. On an already-broken host: `sudo rm -rf /opt/decnet/updater/{venv,releases,current}` and restart the updater from `/opt/decnet/venv/` — the next `--include-self` bootstraps a complete venv. |
| Agent keeps serving old code after `/update` returns 200 | The agent was started by hand (no `agent.pid`), so the pre-fix `_stop_agent` had nothing to kill. | Fixed in commit `ebeaf08`: `_stop_agent` now falls back to a `/proc` scan for any `decnet agent` process. On stale hosts, restart the agent once. |
| Master: `FileNotFoundError: .../master/worker.key` | The master identity was never materialized (no `swarm enroll` has run yet on this install). | Run `decnet swarm list` once — it materializes the master identity as a side effect of `ensure_master_identity()`. |

## Out of scope (v1)

- **Dependency changes.** The updater `pip install`s the new tree, so
  adding a dep *usually* just works. But if the new version of a dep
  fails to resolve against the worker's Python / index / lockfile, the
  rollback path catches it. It is not the updater's job to fix package
  layer drift — use `scp` and deploy manually once, then carry on.
- **OS package installs.** Upgrades to Docker / Compose / buildx are
  still manual on the worker. See [SWARM Mode prerequisites](SWARM-Mode#prerequisites).
- **Systemd unit changes.** Shipping a new unit file requires the old
  one to already work; the updater offers no rescue for that kind of
  change.
- **Schema migrations** on the worker's tiny SQLite (if any) — manual.
- **Canary / A/B rollouts.** `--all` is all-at-once. If you want a
  staged rollout, push to one host, observe, then push to the rest.
- **Signed release artifacts.** mTLS already authenticates the master;
  a detached signature is a v2 concern.