DECNET/decnet/fleet/reconciler_worker.py
anti f775223a83 feat(fleet): reconciler converges JSON ↔ DB ↔ docker
Adds decnet.fleet.reconciler — a pure async function plus a long-lived
worker — that periodically reconciles the three sources of truth on a
DECNET host:

  1. decnet-state.json (CLI-canonical fleet record)
  2. fleet_deckies table (DB mirror, written by engine.deployer)
  3. docker inspect (actual per-container runtime state)

Drift handling:
  * JSON has X, DB doesn't       → INSERT (deploy ran with DB offline)
  * DB has X (this host), JSON doesn't → DELETE (teardown ran with DB offline)
  * Both have X, docker disagrees → flip state to running/failed/degraded
  * Docker socket unreachable    → leave existing state alone (don't
                                    torch every row to torn_down)
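
The drift rules above can be read as set arithmetic over the three sources. The following is a minimal sketch of that reading, not the actual decnet.fleet.reconciler code: `DriftPlan`, `plan_drift`, and the plain-dict inputs are hypothetical names chosen for illustration, and `db_states` is assumed to be already filtered to the local host.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class DriftPlan:
    """Hypothetical container for the actions one reconcile pass would take."""
    to_insert: Set[str] = field(default_factory=set)       # in JSON, missing from DB
    to_delete: Set[str] = field(default_factory=set)       # in DB (this host), missing from JSON
    to_update: Dict[str, str] = field(default_factory=dict)  # name -> observed state


def plan_drift(
    json_names: Set[str],
    db_states: Dict[str, str],
    docker_states: Optional[Dict[str, str]],
) -> DriftPlan:
    """Compute drift between JSON, DB rows for this host, and docker.

    ``docker_states`` is None when the docker socket is unreachable.
    """
    plan = DriftPlan()
    plan.to_insert = set(json_names) - set(db_states)
    plan.to_delete = set(db_states) - set(json_names)

    if docker_states is None:
        # Socket unreachable: leave every stored state alone.
        return plan

    for name in set(json_names) & set(db_states):
        observed = docker_states.get(name)
        if observed is not None and observed != db_states[name]:
            plan.to_update[name] = observed
    return plan


plan = plan_drift(
    json_names={"web-1", "web-2"},
    db_states={"web-2": "running", "web-3": "running"},
    docker_states={"web-2": "failed"},
)
print(plan)
```

Passing `docker_states=None` yields a plan with inserts and deletes only, which matches the "leave existing state alone" rule.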

Cross-host safety: deletions are scoped to host_uuid for the local host;
a master that runs both a local fleet and swarm workers will never
clobber a peer's slice.
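
To illustrate the scoping rule, here is a small sketch using an in-memory SQLite table. The table name `fleet_deckies` and the `host_uuid` key come from this commit message; the two-column schema, the `delete_missing` helper, and the sample host values are assumptions made for the example, and the real repository layer may look quite different.

```python
import sqlite3


def delete_missing(conn: sqlite3.Connection, host_uuid: str, json_names: set) -> int:
    """Delete rows for ``host_uuid`` whose name is absent from the JSON record.

    Hypothetical helper: both the SELECT and the DELETE are filtered by
    host_uuid, so rows owned by other hosts are never touched.
    """
    rows = conn.execute(
        "SELECT name FROM fleet_deckies WHERE host_uuid = ?", (host_uuid,)
    ).fetchall()
    stale = [name for (name,) in rows if name not in json_names]
    conn.executemany(
        "DELETE FROM fleet_deckies WHERE host_uuid = ? AND name = ?",
        [(host_uuid, name) for name in stale],
    )
    return len(stale)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fleet_deckies (name TEXT, host_uuid TEXT)")
conn.executemany(
    "INSERT INTO fleet_deckies VALUES (?, ?)",
    [("web-1", "local"), ("web-2", "local"), ("web-1", "peer-a")],
)

# Local JSON only knows about web-1, so only the local web-2 row is stale.
deleted = delete_missing(conn, "local", {"web-1"})
remaining = sorted(conn.execute("SELECT name, host_uuid FROM fleet_deckies").fetchall())
print(deleted, remaining)
```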

CLI:
  decnet reconcile --once            # one-shot, prints counts
  decnet reconcile [--interval N]    # long-lived worker, mirrors
                                     # orchestrator's lifecycle (control
                                     # listener + heartbeat + tick loop)
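
As a rough picture of how the two invocations differ, the sketch below parses the same flags with `argparse`. It is an assumption that the real CLI is structured this way (it may use a different framework entirely); only the `reconcile` subcommand and the `--once` and `--interval` flags are taken from this commit, and the default of 30 mirrors the worker's `interval` default.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the reconcile subcommand."""
    parser = argparse.ArgumentParser(prog="decnet")
    sub = parser.add_subparsers(dest="command", required=True)
    rec = sub.add_parser("reconcile", help="converge JSON, DB and docker state")
    rec.add_argument(
        "--once", action="store_true",
        help="run a single pass, print counts, and exit",
    )
    rec.add_argument(
        "--interval", type=int, default=30,
        help="seconds between passes in long-lived mode",
    )
    return parser


one_shot = build_parser().parse_args(["reconcile", "--once"])
worker = build_parser().parse_args(["reconcile", "--interval", "5"])
print(one_shot, worker)
```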

Promotes decnet/fleet.py → decnet/fleet/ package so the reconciler can
live alongside it without name collision (build_deckies_from_ini and
all_service_names re-exported unchanged via __init__.py).

14 new tests cover state aggregation rules, all four drift directions,
host_uuid scoping, docker-unreachable safety, and worker shutdown via
the bus control event.
2026-04-26 21:14:48 -04:00


"""Long-lived periodic reconciler worker.

Modeled on :mod:`decnet.orchestrator.worker`: same control listener, same
heartbeat helper, same shutdown semantics. One tick = one
:func:`reconcile_once` pass.

Default interval is short (30s) because reconciliation is cheap when
nothing has drifted (three reads, no writes), and a short cadence keeps
the dashboard's view of crashed containers fresh.
"""
from __future__ import annotations

import asyncio
import contextlib

from decnet.bus.factory import get_bus
from decnet.bus.publish import (
    run_control_listener,
    run_health_heartbeat,
)
from decnet.fleet.reconciler import reconcile_once
from decnet.logging import get_logger
from decnet.web.db.models import LOCAL_HOST_SENTINEL
from decnet.web.db.repository import BaseRepository

logger = get_logger("fleet.reconciler")


async def fleet_reconciler_worker(
    repo: BaseRepository,
    *,
    interval: int = 30,
    host_uuid: str = LOCAL_HOST_SENTINEL,
) -> None:
    """Periodically converge JSON ↔ DB ↔ docker for the local host.

    Honours the bus control topic (``system.reconciler.control``) for
    graceful shutdown; same lifecycle contract as every other DECNET
    worker.
    """
    logger.info("fleet reconciler started interval=%ds host=%s", interval, host_uuid)

    # The bus is optional: if it cannot be reached, keep reconciling
    # without publishing.
    bus = None
    try:
        bus = get_bus(client_name="reconciler")
        await bus.connect()
    except Exception as exc:  # noqa: BLE001
        logger.warning(
            "reconciler: bus unavailable, continuing without publish: %s", exc,
        )
        bus = None

    shutdown = asyncio.Event()
    heartbeat_task = asyncio.create_task(run_health_heartbeat(bus, "reconciler"))
    control_task = asyncio.create_task(
        run_control_listener(bus, "reconciler", shutdown),
    )

    try:
        while not shutdown.is_set():
            # Sleep for one interval, but wake immediately on shutdown.
            try:
                await asyncio.wait_for(shutdown.wait(), timeout=interval)
            except asyncio.TimeoutError:
                pass  # normal tick
            if shutdown.is_set():
                break
            try:
                counts = await reconcile_once(repo, host_uuid=host_uuid)
                # Log only when the pass actually changed something.
                if any(counts.values()):
                    logger.info(
                        "reconcile inserted=%d deleted=%d state_updated=%d",
                        counts["inserted"], counts["deleted"],
                        counts["state_updated"],
                    )
            except Exception as exc:  # noqa: BLE001
                # A failed tick must not kill the worker; retry next interval.
                logger.error("reconcile tick failed: %s", exc)
    finally:
        # Stop the helper tasks and close the bus on every exit path.
        for t in (heartbeat_task, control_task):
            t.cancel()
            with contextlib.suppress(Exception, asyncio.CancelledError):
                await t
        if bus is not None:
            with contextlib.suppress(Exception):
                await bus.close()
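
The tick loop above relies on one idiom: waiting on the shutdown event with a timeout, so that a timeout means "run a tick" and a set event means "stop now". The standalone sketch below isolates that idiom so it can be run without any DECNET modules. `tick_loop`, `demo`, and `on_tick` are hypothetical names; the stand-in callback replaces `reconcile_once`, and the event is set from inside the callback rather than by a bus control listener.

```python
import asyncio
from typing import Awaitable, Callable, List


async def tick_loop(
    shutdown: asyncio.Event,
    interval: float,
    on_tick: Callable[[], Awaitable[None]],
) -> None:
    """Run ``on_tick`` once per interval until ``shutdown`` is set."""
    while not shutdown.is_set():
        try:
            # Returns early if shutdown is set; otherwise times out after
            # ``interval`` seconds, which is the signal to run a tick.
            await asyncio.wait_for(shutdown.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass
        if shutdown.is_set():
            break
        await on_tick()


async def demo() -> List[int]:
    shutdown = asyncio.Event()
    ticks: List[int] = []

    async def on_tick() -> None:
        ticks.append(len(ticks) + 1)
        if len(ticks) == 3:
            shutdown.set()  # stands in for the bus control event

    await tick_loop(shutdown, 0.01, on_tick)
    return ticks


ticks = asyncio.run(demo())
print(ticks)
```

Because the wait happens before the work, the first pass runs one full interval after startup, and a shutdown request never has to wait out the remainder of a sleep.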