Remote Updates
anti edited this page 2026-04-19 00:13:51 -04:00

DECNET ships a self-updater daemon that runs on every worker alongside the decnet agent. It lets the master push a new working tree to the worker (tarball over mTLS), install it, restart the agent, health-probe it, and auto-rollback to the previous release if the new one is unhealthy — all without scp, sshpass, or any human SSH session.

This page covers architecture, enrollment, the command surface, and the failure modes you'll actually hit in practice. If you just want to ship code, jump to Pushing an update.

Why a separate daemon

The naive design puts the agent in charge of its own updates. That design bricks the worker the first time you push broken code — the daemon you'd use to roll the fix back is the daemon you just broke. The updater dodges that paradox by being a completely separate process with its own venv and its own mTLS identity. A normal update does not touch the updater, so the updater is always a known-good rescuer.

A second explicit endpoint, POST /update-self, handles updater upgrades. It has no auto-rollback and you must opt in — the contract is "you have chosen to push to the thing that rescues you; don't break it."

Architecture

┌────────── MASTER ──────────┐        ┌──────────── WORKER ────────────┐
│ decnet swarm update ...    │        │                                 │
│   tars working tree        │        │ decnet updater   :8766 ◀──────┐ │
│   POST /update  ──mTLS──▶──┼────────┼─▶ snapshots, installs, probes │ │
│                            │        │   restarts agent via exec     │ │
│                            │        │                               │ │
│                            │        │ decnet agent     :8765 ◀──────┘ │
│                            │        │   (managed by updater)          │
└────────────────────────────┘        └─────────────────────────────────┘

Two daemons run on each worker, each with a distinct cert (both signed by the same DECNET CA that already backs SWARM Mode). The certificate CN distinguishes the two identities:

Identity   CN example          Used for
Agent      worker-01           /deploy, /teardown, /status, /health on port 8765
Updater    updater@worker-01   /update, /update-self, /rollback, /releases on port 8766

Install layout on the worker

The updater owns the release directory:

/opt/decnet/                         (default; override with --install-dir)
  current -> releases/active          (atomic symlink; flip == promotion)
  venv/                               shared venv — agent + updater run from here
  releases/
    active/                           source tree of the live release
    prev/                             the last good source snapshot
    active.new/                       staging (only exists mid-update)
  updater/                            updater's own tree + venv + releases
                                      — NEVER touched by a normal /update
  agent.pid                           PID of the agent process we spawned
  agent.spawn.log                     stdout/stderr of the most recent spawn
  .env.local                          per-host overrides (JWT secret, DB URL, …)
~/.decnet/
  agent/         worker.key, worker.crt, ca.crt
  updater/       updater.key, updater.crt, ca.crt   (CN=updater@<host>)

The venv is shared across releases (not per-slot). An update swaps the source-tree symlink; pip reinstalls the decnet package into the same venv with --force-reinstall --no-deps, so the slow work is the fresh tarball unpack, not a full dep rebuild. On the very first update into a brand-new venv the full dep tree is installed once — subsequent updates are near-no-op if dependencies haven't changed.
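The bootstrap-vs-steady-state pip invocation can be sketched as a small helper. This is illustrative only — `build_pip_command` is a hypothetical name, not the real decnet internals:

```python
from pathlib import Path

def build_pip_command(venv: Path, tree: Path, bootstrap: bool) -> list:
    """Build the pip invocation described above: --force-reinstall always,
    --no-deps on every run after the venv has been bootstrapped once."""
    cmd = [str(venv / "bin" / "pip"), "install", "--force-reinstall"]
    if not bootstrap:
        cmd.append("--no-deps")  # only the decnet package is replaced
    cmd.append(str(tree))
    return cmd

# First update into a brand-new venv: full dep tree is resolved once
print(build_pip_command(Path("/opt/decnet/venv"),
                        Path("/opt/decnet/releases/active.new"), bootstrap=True))
# Every subsequent update: near-no-op if dependencies haven't changed
print(build_pip_command(Path("/opt/decnet/venv"),
                        Path("/opt/decnet/releases/active.new"), bootstrap=False))
```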

The updater loads .env.local from its working directory, so the worker can carry a persistent per-host .env.local (JWT secret, DB URL, log paths) without editing site-packages. The updater spawns the agent with cwd=/opt/decnet/ so the agent picks up the same file.
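A minimal sketch of that load order, assuming simple KEY=VALUE lines with '#' comments (the helper name `load_env_local` is hypothetical):

```python
import os
from pathlib import Path

def load_env_local(workdir: str) -> dict:
    """Fold a per-host .env.local (KEY=VALUE lines, '#' comments) into
    os.environ, mirroring the behaviour described above."""
    overrides = {}
    env_file = Path(workdir) / ".env.local"
    if not env_file.exists():
        return overrides
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, malformed lines
        key, _, value = line.partition("=")
        overrides[key.strip()] = value.strip()
    os.environ.update(overrides)
    return overrides
```

Because the updater spawns the agent with cwd=/opt/decnet/, both processes resolve the same file.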

Enrollment

Enrolling a host for remote updates is a single extra flag on the existing decnet swarm enroll:

decnet swarm enroll \
  --host 192.168.1.23 --address 192.168.1.23 \
  --sans 192.168.1.23 \
  --updater \
  --out-dir ./enroll-bundle

The controller now issues two certs signed by the same CA:

  • ./enroll-bundle/{ca.crt, worker.crt, worker.key} — goes to ~/.decnet/agent/ on the worker (same as before).
  • ./enroll-bundle-updater/{ca.crt, updater.crt, updater.key} — goes to ~/.decnet/updater/ on the worker.

Ship both directories to the worker once (this is the last scp you'll do for this host), then on the worker:

sudo install -d -m 0700 ~/.decnet/agent ~/.decnet/updater
# ...scp the two bundles into place...
sudo decnet agent --daemon --agent-dir ~/.decnet/agent
sudo decnet updater --daemon --updater-dir ~/.decnet/updater \
    --install-dir /opt/decnet

From this point on the master can push code without touching SSH.

Without --updater

If you forgot --updater at enrollment time, decommission and re-enroll the host — that's the cleanest path. The alternative is running the enrollment endpoint manually with issue_updater_bundle=true for an already-enrolled host; this is currently a v2 concern.

Pushing an update

From the master (your dev box), make your changes, commit if you want (the tarball is the working tree, staged + unstaged + untracked), then:

# Push to one worker
decnet swarm update --host worker-01

# Push to every non-decommissioned worker
decnet swarm update --all

# Also ship the updater itself (explicit; no auto-rollback)
decnet swarm update --all --include-self

# Inspect what would ship — no network
decnet swarm update --all --dry-run

Output is a table per host:

┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ host       ┃ address      ┃ agent   ┃ self ┃ detail             ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ worker-01  │ 192.168.1.23 │ updated │ —    │ 4b7e1a9...         │
│ worker-02  │ 192.168.1.24 │ rolled- │ —    │ probe failed;...   │
│            │              │ back    │      │                    │
└────────────┴──────────────┴─────────┴──────┴────────────────────┘
  • updated (green) — new release live, agent answered /health.
  • rolled-back (yellow, exit 1) — new release failed its post-deploy probe; the updater already swapped the symlink back and restarted the agent against the previous release. The worker is functional; the attempted update is in releases/prev/ on the worker for forensics.
  • error (red, exit 1) — transport or install failure before the rotation even happened; no state change on the worker.
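The row/exit-code mapping above can be sketched as follows (the function name `classify` is illustrative, not the CLI's actual code):

```python
def classify(status_code: int, rolled_back: bool) -> tuple:
    """Map an updater response to the CLI row label and exit code,
    per the three outcomes listed above."""
    if status_code == 200:
        return ("updated", 0)        # green: new release live, agent healthy
    if status_code == 409 and rolled_back:
        return ("rolled-back", 1)    # yellow: probe failed, prior release restored
    return ("error", 1)              # red: failure before rotation; no state change
```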

What the updater actually does

For each /update request:

  1. Extract the tarball into /opt/decnet/releases/active.new/. Paths with .. or leading / are rejected.
  2. Install: pip install --force-reinstall [--no-deps] <active.new> into the shared /opt/decnet/venv/. The first time this runs (venv doesn't exist yet) the updater bootstraps it with the full dep tree; every subsequent update uses --no-deps so only the decnet package is replaced. Non-zero exit → abort, return 500 with pip stderr, no rotation.
  3. Rotate: prev/ (if present) is removed, active/ becomes prev/, active.new/ becomes active/. The current symlink is flipped atomically via rename(2).
  4. Restart agent: SIGTERM the PID in agent.pid, wait up to 10 s, then SIGKILL if still alive. If agent.pid is missing (agent was started manually, not spawned by the updater), the updater scans /proc for any decnet agent process and SIGTERMs those instead — so restart is reliable regardless of how the agent was originally launched. Spawn a new agent through the shared venv's decnet entry point with cwd=/opt/decnet/.
  5. Probe: GET https://127.0.0.1:8765/health over mTLS up to 10 times with 1 s backoff.
  6. On probe success → return 200 with the new release manifest.
  7. On probe failure → swap active/ and prev/ back, restart the agent again, re-probe, and return 409 with both probe transcripts and rolled_back: true. The master CLI translates that to a yellow "rolled-back" row and exit code 1.

All of this runs inside the single POST handler. The update is atomic from the outside: either the worker is on the new release and the agent is healthy, or it's on the previous release (possibly because the new one failed) with a healthy agent.
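Step 3's slot rotation and atomic symlink flip can be sketched like this. Directory names follow the layout above; the function name `rotate_and_promote` is illustrative:

```python
import os
import shutil
from pathlib import Path

def rotate_and_promote(install_dir: str) -> None:
    """Sketch of the rotate step: active/ -> prev/, active.new/ -> active/,
    then flip `current` atomically by renaming a fresh symlink over it."""
    root = Path(install_dir)
    rel = root / "releases"
    prev, active, staged = rel / "prev", rel / "active", rel / "active.new"
    if prev.exists():
        shutil.rmtree(prev)        # drop the old rollback slot
    if active.exists():
        active.rename(prev)        # the live release becomes the rollback slot
    staged.rename(active)          # the staged tree goes live
    # os.replace() is rename(2) under the hood: readers never see a gap
    tmp = root / "current.tmp"
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    tmp.symlink_to(rel / "active")
    os.replace(tmp, root / "current")
```

Rolling back (step 7) is the same dance in reverse: swap active/ and prev/ and flip the symlink again.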

--include-self

Same protocol but targets /opt/decnet/updater/, which has its own release slots and its own venv (/opt/decnet/updater/venv/). The updater never touches this directory during a normal /update, which is the whole point: a broken /update can't brick the thing that rolls it back.

Mechanics:

  1. The master POSTs the tarball to POST /update-self with confirm_self=true.
  2. The updater extracts into /opt/decnet/updater/releases/active.new/ and runs pip install --force-reinstall <slot> against /opt/decnet/updater/venv/. The first self-update bootstraps this venv with the full dep tree (typer, fastapi, uvicorn, …) before installing decnet.
  3. On success, rotate the updater's own active/prev slots.
  4. os.execv into the newly installed binary with a cleanly reconstructed argv. The argv is not sys.argv[1:] — inside the running process sys.argv is the uvicorn subprocess invocation (--ssl-keyfile …), which the decnet updater CLI does not understand. Instead the updater rebuilds decnet updater --host … --port … --updater-dir … --install-dir … --agent-dir … from env vars that decnet.updater.server.run stashes at startup (DECNET_UPDATER_HOST, DECNET_UPDATER_PORT, DECNET_UPDATER_BUNDLE_DIR, DECNET_UPDATER_INSTALL_DIR, DECNET_UPDATER_AGENT_DIR).
  5. The TCP connection drops mid-response. That is normal: the master waits up to 30 s for the updater's /health to come back with the new SHA and treats that as success. No auto-rollback — if the new updater can't import, the old one is gone and you'll need SSH to recover.
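The argv reconstruction in step 4 can be sketched as follows. The env var names are the ones the page lists; the function name `rebuild_argv` is illustrative:

```python
# Env vars stashed by decnet.updater.server.run at startup (per this page).
_ENV_TO_FLAG = {
    "DECNET_UPDATER_HOST": "--host",
    "DECNET_UPDATER_PORT": "--port",
    "DECNET_UPDATER_BUNDLE_DIR": "--updater-dir",
    "DECNET_UPDATER_INSTALL_DIR": "--install-dir",
    "DECNET_UPDATER_AGENT_DIR": "--agent-dir",
}

def rebuild_argv(entrypoint: str, env: dict) -> list:
    """Rebuild a clean `decnet updater ...` argv from the stashed env vars
    instead of sys.argv, which holds the uvicorn invocation."""
    argv = [entrypoint, "updater"]
    for var, flag in _ENV_TO_FLAG.items():
        if var in env:
            argv += [flag, env[var]]
    return argv

# os.execv(argv[0], argv) would then replace the running process in place.
```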

Ordering: agent first, updater second. A broken agent push should not lock you into shipping the updater through that broken agent's host.

Use sparingly. A bad self-update is the one case you will need scp for — the wiki's promise of "no more scp" has one asterisk on it, and this is it.

Manual rollback

If you want to roll back without pushing new code (you notice a regression after the probe already passed):

# No CLI yet for this in v1; hit the endpoint directly.
curl --cert ~/.decnet/ca/master/worker.crt \
     --key  ~/.decnet/ca/master/worker.key \
     --cacert ~/.decnet/ca/master/ca.crt \
     -X POST https://<worker-ip>:8766/rollback

Returns 404 if there's no prev/ slot (which is only the case on the worker's very first release — a fresh install has nothing to roll back to). A CLI wrapper (decnet swarm rollback) is planned for v2.

Symptom table

Symptom: curl: (35) error:...peer certificate required on 8766
  Likely cause: Updater is up but mTLS rejected the client cert; you're using the wrong bundle.
  Fix: Use the ~/.decnet/ca/master/ bundle, not the worker bundle.

Symptom: swarm update hangs for >2 min on one host
  Likely cause: pip install is slow on a very underpowered worker.
  Fix: Bump _TIMEOUT_UPDATE in decnet/swarm/updater_client.py (temporary) or enroll more resources.

Symptom: All hosts return error: ConnectTimeout
  Likely cause: Updater isn't running on any worker.
  Fix: On each worker: sudo decnet updater --daemon --updater-dir ....

Symptom: rolled-back on every push
  Likely cause: The agent is now importing something the worker doesn't have; the probe hits /health and gets 500.
  Fix: Read the detail field — it contains the agent's traceback. Fix and push again.

Symptom: rolled-back only on workers with a different OS
  Likely cause: The Compose/Buildx/Python version on that worker differs from the master's.
  Fix: See SWARM Mode prerequisites. The updater does not install OS packages.

Symptom: After --include-self, /health on 8766 never returns
  Likely cause: The new updater failed at import time; execv succeeded but Python died.
  Fix: SSH in; look at the updater's journalctl / stderr; revert ~/.decnet/updater/ to the previous tree manually.

Symptom: After --include-self, updater logs No such option: --ssl-keyfile and dies
  Likely cause: Pre-fix bug: the updater re-exec'd with sys.argv[1:] (uvicorn's argv) instead of the CLI argv. Fixed in commit 40d3e86.
  Fix: Make sure the updater is running code that reconstructs argv from env — if not, SSH in and run sudo decnet updater --host 0.0.0.0 --port 8766 --updater-dir ... --install-dir ... manually.

Symptom: After --include-self, updater dies with ModuleNotFoundError: No module named 'typer'
  Likely cause: Pre-fix bug: a freshly bootstrapped updater venv installed decnet with --no-deps. Fixed in commit 40d3e86.
  Fix: On an already-broken host: sudo rm -rf /opt/decnet/updater/{venv,releases,current} and restart the updater from /opt/decnet/venv/ — the next --include-self bootstraps a complete venv.

Symptom: Agent keeps serving old code after /update returns 200
  Likely cause: Agent was started by hand (no agent.pid), and pre-fix _stop_agent had nothing to kill. Fixed in commit ebeaf08: _stop_agent now falls back to a /proc scan for any decnet agent process.
  Fix: On stale hosts, restart the agent once.

Symptom: Master: FileNotFoundError: .../master/worker.key
  Likely cause: Master identity was never materialized (no swarm enroll has run yet on this install).
  Fix: Run decnet swarm list once — it materializes the master identity as a side effect of ensure_master_identity().

Out of scope (v1)

  • Dependency changes. The updater pip installs the new tree, so adding a dep usually just works. But if the new version of a dep fails to resolve against the worker's Python / index / lockfile, the rollback path catches it. It is not the updater's job to fix package layer drift — use scp and deploy manually once, then carry on.
  • OS package installs. Upgrades to Docker / Compose / buildx are still manual on the worker. See SWARM Mode prerequisites.
  • Systemd unit changes. Shipping a new unit file requires the old one to already work; you won't get rescue for that kind of change via the updater.
  • Schema migrations on the worker's tiny SQLite (if any) — manual.
  • Canary / A/B rollouts. --all is all-at-once. If you want staged rollout, push to one host, observe, then the rest.
  • Signed release artifacts. mTLS already authenticates the master; a detached signature is a v2 concern.