Hit live on first VPS deploy: a window between the initial
client.containers.list() snapshot and the client.events() start-event
stream let topology service containers slip through, requiring an
operator restart for them to be picked up.
Two fixes:
* `_watch_events` now wraps the events() call in a retry loop with
exponential backoff (1s -> 30s cap). A docker.errors.APIError, daemon
reload, or SDK stream-decode hiccup used to make the executor task
return cleanly, leaving the collector "running" with no event
subscription. Future container starts were silently dropped until
the unit was restarted.
* New `_reconcile_loop` async task ticks every
DECNET_COLLECTOR_RECONCILE_S (default 30s), re-scans
client.containers.list(), and calls _spawn for any service container
not already in `active`. Belt to the event watcher's suspenders:
even if a start event is dropped during a reconnect window, the
reconciler picks it up within one cycle. Also prunes finished
futures from `active` so the dict's bounded by current container
count rather than agent lifetime churn.