Commit Graph

2 Commits

Author SHA1 Message Date
86b9decf80 fix(engine): detect wedged buildx + surface recovery hint on deploy
When Docker's buildx leaks bind-mounts from a failed build it starts
reporting 'read-only file system' on its own activity file, even
though nothing is actually read-only. The user's host had 20+
leaked mounts before we noticed — each retry compounds the leak.

_compose_with_retry now:
 * Pre-flight counts /var/lib/docker/tmp/buildkit-mount* entries in
   /proc/self/mounts; if >= 10 and the command is a build, refuses
   to start and returns a clean recovery recipe instead of retrying.
 * On mid-build failures that match the wedge signature
   ('failed to update builder last activity time' or the activity-dir
   path in stderr), short-circuits the retry loop with the same
   recipe. The first occurrence no longer needs a pre-flight; the
   pre-flight catches repeat attempts.

Recipe points at 'docker buildx prune -af && sudo systemctl restart
docker', which is what actually clears the leaked mounts.

Tests cover all three paths: wedge preflight blocks builds, non-build
commands (down/stop) ignore the preflight, mid-build signature
detection kills the retry loop. A new autouse fixture stubs the
wedge-detector to 0 so dev-host state doesn't poison the mocked
subprocess tests.

Wiki companion commit adds Troubleshooting → 'Buildx leaked mounts'.
2026-04-24 19:25:45 -04:00
ea95a009df refactor(tests): move flat tests/*.py into per-subsystem subfolders
Groups every flat test_*.py under the module it exercises, matching the
existing tests/{profiler,sniffer,prober,collector,correlation,cli,web,
topology,swarm,bus,updater,api,docker,geoip,...} layout. New folders:
services/, fleet/, config/, logging/, db/ (+ db/mysql/), telemetry/,
mutator/, core/.

Path-dependent __file__ references bumped an extra .parent in three
files that moved one level deeper:
- tests/sniffer/test_sniffer_ja3.py   (template path)
- tests/services/test_ssh_capture_emit.py (template path)
- tests/cli/test_mode_gating.py  (REPO root)
- tests/web/test_env_lazy_jwt.py (repo var)

Also drops two SQLite runtime artifacts (test_decnet.db-{shm,wal}) that
were leaking into the repo from a previous test run.

Fixes two test_service_isolation cases that patched asyncio.sleep (no
longer on the profiler main-loop hot path — same pre-existing bug I
fixed earlier in test_attacker_worker.py) by patching asyncio.wait_for
and passing interval=0.
2026-04-23 21:34:25 -04:00