Commit Graph

6 Commits

Author SHA1 Message Date
646aeeca40 feat(deployer): mirror fleet deploy/teardown into fleet_deckies table
CLI deploy now writes both surfaces: decnet-state.json (existing,
canonical for offline / no-API hosts) and the new fleet_deckies DB
table (visible to orchestrator, web dashboard, REST API).

Best-effort: a DB outage logs a warning and returns. The JSON file
remains the source of truth for `decnet status`, `decnet teardown`,
sniffer, and collector — operators on a CLI-only host keep working.

_run_async helper bridges sync deploy() into the async repository.
Always uses a fresh thread because the API handler at
web.router.fleet.api_deploy_deckies invokes deploy() from inside a
FastAPI event loop, which would otherwise break asyncio.run.

Verified end-to-end against MySQL: deploy mirror inserts rows, union
view (list_running_deckies) returns them with source="fleet",
teardown mirror removes them. Works from both sync (CLI) and async
(API handler) call sites.
2026-04-26 21:05:50 -04:00
f8ef0a5cf1 fix(deploy): redirect DOCKER_CONFIG out of $HOME so ProtectHome doesn't kill builds
The api unit's ProtectHome=read-only made the user's HOME read-only
inside the unit's namespace. docker compose --build then tried to
write ~/.docker/buildx/activity/* and got EROFS — which we'd been
misdiagnosing as a buildx wedge for the last few iterations.

Real fix: set DOCKER_CONFIG and BUILDX_CONFIG in the unit's
Environment= to a path inside ReadWritePaths. Hardening stays on,
docker CLI writes to install_dir/.docker instead of /home/<user>/.docker.

The wedge classifier now detects this case (count==0 + /home/ in
the stderr path) and emits a recipe pointing at the env-var fix
instead of the driver-rebuild path. Test added.

Wiki gets the new branch first since it's the most common cause
on systemd-managed installs.
2026-04-24 22:07:13 -04:00
257624e6a7 fix(engine/buildx): recipe used reserved 'default' builder name
'docker buildx create --name default' errors with 'default is a
reserved name and cannot be used to identify builder instance'.
The bundled builder always exists under that name; the recipe
should switch to it (buildx use default), not try to recreate it.

For the count==0 driver-rebuild branch, the new builder needs a
non-reserved name — using 'decnet-builder' as the example.
2026-04-24 22:02:20 -04:00
40a31d8bc7 fix(engine/buildx): branch recovery recipe on leaked-mount count
The hint was one-size-fits-all and pointed at prune+restart even
when zero mounts were leaked — a false positive caused by matching
any stderr containing the activity-dir path.

Two changes:

1. Tighten the wedge classifier. Both the buildx-specific phrase
   ('failed to update builder last activity time') AND the EROFS
   marker ('read-only file system') must appear in stderr. Either
   alone is now treated as a normal transient error and retried.

2. Branch the recipe on _count_leaked_buildkit_mounts():
   * count > 0 → unmount loop + daemon stop + umount -l
     (prune+restart alone doesn't evict held mounts)
   * count == 0 → rebuild the buildx driver (rm builder state,
     buildx create --use, inspect --bootstrap)

Original compose stderr is now preserved in the hint as
'Original error: ...' so the user sees both the recipe and what
compose actually said.

Tests cover both branches plus a negative case (unrelated EROFS).
2026-04-24 21:58:09 -04:00
86b9decf80 fix(engine): detect wedged buildx + surface recovery hint on deploy
When Docker's buildx leaks bind-mounts from a failed build it starts
reporting 'read-only file system' on its own activity file, even
though nothing is actually read-only. The user's host had 20+
leaked mounts before we noticed — each retry compounds the leak.

_compose_with_retry now:
 * Pre-flight counts /var/lib/docker/tmp/buildkit-mount* entries in
   /proc/self/mounts; if >= 10 and the command is a build, refuses
   to start and returns a clean recovery recipe instead of retrying.
 * On mid-build failures that match the wedge signature
   ('failed to update builder last activity time' or the activity-dir
   path in stderr), short-circuits the retry loop with the same
   recipe. The first occurrence no longer needs a pre-flight; the
   pre-flight catches repeat attempts.

Recipe points at 'docker buildx prune -af && sudo systemctl restart
docker', which is what actually clears the leaked mounts.

Tests cover all three paths: wedge preflight blocks builds, non-build
commands (down/stop) ignore the preflight, mid-build signature
detection kills the retry loop. A new autouse fixture stubs the
wedge-detector to 0 so dev-host state doesn't poison the mocked
subprocess tests.

Wiki companion commit adds Troubleshooting → 'Buildx leaked mounts'.
2026-04-24 19:25:45 -04:00
ea95a009df refactor(tests): move flat tests/*.py into per-subsystem subfolders
Groups every flat test_*.py under the module it exercises, matching the
existing tests/{profiler,sniffer,prober,collector,correlation,cli,web,
topology,swarm,bus,updater,api,docker,geoip,...} layout. New folders:
services/, fleet/, config/, logging/, db/ (+ db/mysql/), telemetry/,
mutator/, core/.

Path-dependent __file__ references bumped an extra .parent in three
files that moved one level deeper:
- tests/sniffer/test_sniffer_ja3.py   (template path)
- tests/services/test_ssh_capture_emit.py (template path)
- tests/cli/test_mode_gating.py  (REPO root)
- tests/web/test_env_lazy_jwt.py (repo var)

Also drops two SQLite runtime artifacts (test_decnet.db-{shm,wal}) that
were leaking into the repo from a previous test run.

Fixes two test_service_isolation cases that patched asyncio.sleep (no
longer on the profiler main-loop hot path — same pre-existing bug I
fixed earlier in test_attacker_worker.py) by patching asyncio.wait_for
and passing interval=0.
2026-04-23 21:34:25 -04:00