fix(security): close LOW ASVS findings — env bypass, SSE/deployment authz, CN fail-close, password byte-limit, exception leaks, BUG-12..16

Auth/session (V2.1.7, V4.1.5, V4.1.6, V2.1.4/V2.1.5):
- env secret validation no longer bypassed by attacker-injectable PYTEST* env;
  gated on explicit DECNET_TESTING=1 (set only in conftest).
- must_change_password now enforced on the SSE header-JWT path, not just ticket mint.
- GET /system/deployment-mode requires viewer auth (was leaking role + topology size).
- CreateUser/ResetUser passwords min_length=12; passwords >72 bytes rejected
  explicitly instead of bcrypt silently truncating.

Swarm ingestion (V9.1.3, BUG-16):
- Log listener hard-rejects peers with unparseable/empty cert CN (fail closed,
  ingests nothing) instead of tagging 'unknown'.
- Shutdown handlers no longer swallow real errors (narrowed to CancelledError).

Info leakage (V7.1.2, V14.1.2):
- Exception text sanitized on swarm-update, health, tarpit, realism, file-drop,
  blank-topology endpoints (raw tc/docker stderr, DB/Docker errors logged
  server-side, generic detail returned). pyproject license corrected to AGPL-3.0.

Correctness (BUG-12..16):
- BUG-12 atomic credential upsert (UNIQUE constraint + IntegrityError retry,
  consistent principal_key canonicalization).
- BUG-13 rule-tail watermark uses >= with seen-id dedup (no same-second drop).
- BUG-14 worker wake cleared before wait (no lost wake during tick).
- BUG-15 intel gather tolerates an unexpected provider raise.
- BUG-16 see above.

Already-closed (verified, no change): V2.1.6, V5.1.3, V9.1.2. Accept-risk +
documented: V2.1.8 cache window, V3.1.3 idle timeout. Tests added for every fix;
unanimous adversarial review after two refute-fix rounds.
This commit is contained in:
2026-06-10 13:27:14 -04:00
parent d80e6aa6d1
commit 245975a6dd
40 changed files with 1629 additions and 72 deletions

View File

@@ -181,3 +181,117 @@ async def test_clusterer_registered_in_cli():
"""`decnet clusterer` is registered as a master-only command."""
from decnet.cli.gating import MASTER_ONLY_COMMANDS
assert "clusterer" in MASTER_ONLY_COMMANDS
@pytest.mark.anyio
async def test_wake_during_tick_is_not_lost(repo):
"""BUG-14 regression: wake.clear() must run BEFORE wake.wait(), not after.
The worker loop pattern:
Fixed (after fix): tick → clear → wait ← current code
Buggy (before fix): tick → wait → clear
The race in the buggy pattern: a wake.set() could arrive from a _wake_on
background task between wait() returning and clear() executing. In asyncio
the task switch requires an ``await``; the original code had
``await _publish_result`` between wait() and clear(), providing a real
window. The fix closes this by moving clear() to run immediately after
tick (before wait), so there is no window between wait() returning and the
next clear().
**Structural test (red-before / green-after, deterministic):**
We intercept the internal ``wake`` event's ``clear()`` and ``wait()``
methods to record their invocation order, then assert that every
``clear`` call is immediately followed by a ``wait`` (never preceded by
one). Reverting the worker to the buggy ``wait → clear`` order produces
``("wait", "clear")`` consecutive pairs, which the assertion catches
deterministically without relying on wall-clock timing races.
"""
import unittest.mock as _mock
captured_wake: list[asyncio.Event] = []
call_log: list[str] = []
_orig_event_cls = asyncio.Event
class _LoggingEvent(_orig_event_cls): # type: ignore[misc]
"""Subclass captures the first event created (the wake event) and
logs clear()/wait() calls so we can verify their relative order."""
def __init__(self) -> None:
super().__init__()
if not captured_wake:
captured_wake.append(self)
def clear(self) -> None:
if captured_wake and self is captured_wake[0]:
call_log.append("clear")
super().clear()
async def wait(self) -> bool: # type: ignore[override]
if captured_wake and self is captured_wake[0]:
call_log.append("wait")
return await super().wait()
class _SimpleTicker(Clusterer):
name = "bug14_ticker"
async def tick(self, _repo) -> ClusterResult:
return ClusterResult()
shutdown = asyncio.Event()
with _mock.patch("decnet.clustering.worker.asyncio.Event", _LoggingEvent):
task = asyncio.create_task(
run_clusterer_loop(
repo,
poll_interval_secs=0.1,
clusterer=_SimpleTicker(),
shutdown=shutdown,
)
)
# Run for long enough to accumulate several clear/wait pairs.
await asyncio.sleep(0.5)
shutdown.set()
await asyncio.wait_for(task, timeout=2.0)
# Must have enough events to be meaningful.
assert len(call_log) >= 4, (
f"BUG-14: too few clear/wait calls recorded ({call_log!r}); "
"loop may not have run or event capture failed"
)
# Fixed invariant: every loop iteration runs clear() BEFORE wait().
# The resulting call_log for the fixed code is: clear, wait, clear, wait, ...
# For the buggy code (wait → clear) the log would be: wait, clear, wait, clear, ...
#
# We verify TWO conditions that together guarantee the fixed order:
#
# 1. The log starts with "clear" — the very first thing after tick is a clear.
# Buggy code starts with "wait" (wait ran first in the original loop).
#
# 2. Within each (clear, wait) pair at positions (2k, 2k+1), clear always
# precedes wait. We check that no "clear" appears at an ODD index (which
# would mean a clear followed another clear, or a wait appeared before
# the next clear).
assert call_log[0] == "clear", (
f"BUG-14 regression: first wake call was {call_log[0]!r}, expected 'clear'. "
f"Full log: {call_log!r}. The worker is using the buggy 'wait-then-clear' "
"order — wake.clear() must execute BEFORE wake.wait() each iteration."
)
# Verify the alternating pattern holds: indices 0,2,4,... should be "clear"
# and indices 1,3,5,... should be "wait".
for idx, call in enumerate(call_log):
if idx % 2 == 0:
assert call == "clear", (
f"BUG-14 regression: expected 'clear' at position {idx} but got "
f"{call!r} in log {call_log!r}. This indicates the loop is NOT "
"using the fixed 'clear → wait' order within each iteration."
)
else:
assert call == "wait", (
f"BUG-14 regression: expected 'wait' at position {idx} but got "
f"{call!r} in log {call_log!r}. This indicates the loop is NOT "
"using the fixed 'clear → wait' order within each iteration."
)