DECNET/tests/swarm/test_swarm_api.py at 9d68bb45c70d9b80d8405e323e38241faf5cc90c

Files

anti df18cb44cc fix(swarm): don't paint healthy deckies as failed when a shard-sibling fails

docker compose up is partial-success-friendly — a build failure on one
service doesn't roll back the others. But the master was catching the
agent's 500 and tagging every decky in the shard as 'failed' with the
same error message. From the UI that looked like all three deckies died
even though two were live on the worker.

On dispatch exception, probe the agent's /status to learn which deckies
actually have running containers, and upsert per-decky state accordingly.
Only fall back to marking the whole shard failed if the status probe
itself is unreachable.

Enhance agent.executor.status() to include a 'runtime' map keyed by
decky name with per-service container state, so the master has something
concrete to consult.

2026-04-19 20:11:08 -04:00

15 KiB

Raw Blame History

View Raw

15 KiB Raw Blame History

15 KiB

Raw Blame History