docs(troubleshooting): branch buildx recovery on leaked-mount count
prune + restart alone doesn't evict mounts that are already held; the full recipe is stop, pkill, umount, start. Add a separate recipe for the count == 0 case (driver state corruption, no mounts to clean) and call out the strict signature DECNET uses to classify the wedge.
@@ -78,23 +78,43 @@ mount | grep -c '/var/lib/docker/tmp/buildkit-mount'
|
|||||||
|
|
||||||
Anything past single digits is pathological. We've seen hosts sitting on hundreds after a few botched mass-scale topologies.
|
||||||
|
|
||||||
**Fix.**
|
**Fix — leaked mounts present (count > 0).**
|
||||||
|
|
||||||
```bash
|
`docker buildx prune -af && sudo systemctl restart docker` is **not enough**: leaked mounts often outlive the daemon because zombie `buildkitd` / `containerd-shim` processes still hold them. Full recipe:
|
||||||
docker buildx prune -af
|
|
||||||
sudo systemctl restart docker
|
|
||||||
```
|
|
||||||
|
|
||||||
Restart is the operative step — it drops every leaked mount. `prune -af` also discards the build cache so the next deploy rebuilds from scratch; skip it if you want the cache preserved.
|
|
||||||
|
|
||||||
If the activity dir itself is corrupted (rare):
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
|
sudo systemctl stop docker.socket docker.service
|
||||||
|
sudo pkill -9 -f buildkitd
|
||||||
|
sudo pkill -9 -f containerd-shim
|
||||||
|
for m in $(mount | awk '$3 ~ /buildkit-mount/ {print $3}'); do
|
||||||
|
sudo umount -l "$m"
|
||||||
|
done
|
||||||
rm -rf ~/.docker/buildx/activity
|
rm -rf ~/.docker/buildx/activity
|
||||||
docker buildx create --use
|
sudo systemctl start docker
|
||||||
|
docker buildx create --use --name default
|
||||||
|
docker buildx inspect --bootstrap
|
||||||
```
|
```
|
||||||
|
|
||||||
**How DECNET handles it.** The engine's `_compose_with_retry` counts leaked buildkit mounts before every build and refuses to start if the count crosses 10 — you get the recovery recipe in the error payload instead of a cryptic EROFS surfaced three retries deep. Mid-build failures that match the known wedge signature also short-circuit the retry loop with the same hint.
|
The lazy `umount -l` step, which detaches a mount even while a process still holds it, is the one most recipes online miss.
|
||||||
|
|
||||||
|
**Fix — driver corruption (count == 0).**
|
||||||
|
|
||||||
|
If `mount | grep -c buildkit-mount` already prints 0 and you still hit the wedge, the buildx driver state itself is inconsistent. Rebuild it:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker buildx rm default 2>/dev/null
|
||||||
|
rm -rf ~/.docker/buildx/activity ~/.docker/buildx/instances/default
|
||||||
|
docker buildx create --use --name default
|
||||||
|
docker buildx inspect --bootstrap
|
||||||
|
```
|
||||||
|
|
||||||
|
**How DECNET handles it.** The engine's `_compose_with_retry`:
|
||||||
|
|
||||||
|
* Pre-flights leaked mounts before every build; if the count crosses 10, refuses to start and emits the leaked-mount recipe.
|
||||||
|
* Catches the wedge signature mid-build (`failed to update builder last activity time` + `read-only file system`) and short-circuits the retry loop, branching the recipe on whether mounts are 0 or >0.
|
||||||
|
* Preserves the original compose stderr in the error so you can see what actually broke alongside the recipe.
|
||||||
|
|
||||||
|
Unrelated `read-only file system` errors (e.g. a config file mount) are NOT classified as a wedge — both sentinel phrases must match.
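
The classification rules above can be sketched in shell. This is an illustration of the logic as described, not DECNET's actual implementation: the function names `count_leaked_mounts`, `is_buildx_wedge`, and `pick_recipe` are hypothetical, and the exact, case-sensitive matching of the two sentinel phrases is an assumption.

```bash
# Illustrative sketch only; these function names are hypothetical, not DECNET's API.

# Count leaked buildkit mounts in `mount`-style output read from stdin.
# grep -c prints 0 but exits 1 when nothing matches, so the status is neutralised.
count_leaked_mounts() {
  grep -c '/var/lib/docker/tmp/buildkit-mount' || true
}

# Succeed only when BOTH sentinel phrases appear in the captured stderr ($1).
# A lone "read-only file system" (e.g. from a config file mount) does not match.
is_buildx_wedge() {
  printf '%s\n' "$1" | grep -qF 'failed to update builder last activity time' &&
    printf '%s\n' "$1" | grep -qF 'read-only file system'
}

# Choose which recovery recipe to show for a given leaked-mount count ($1).
pick_recipe() {
  if [ "$1" -gt 0 ]; then
    echo 'leaked-mounts'
  else
    echo 'driver-corruption'
  fi
}
```

On a live host the count would come from `mount | count_leaked_mounts`. Its result could feed both the pre-flight threshold check and `pick_recipe` once `is_buildx_wedge` has matched the compose stderr.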
|
||||||
|
|
||||||
---