fix(profiler/behave_shell): tolerate non-UTF-8 bytes in shard reads
Real-world bug surfaced on the first live decky run: sessrec.c's json_escape (decnet/templates/_shared/sessrec/sessrec.c:111-141) only escapes bytes < 0x20 + DEL — bytes >= 0x80 pass through raw. An attacker pasting Latin-1 / GB18030 / any non-UTF-8 8-bit text yields a shard line that chokes Python's default UTF-8 text-mode read with 'utf-8 codec can't decode byte 0xac'. Three changes: 1. _events_for_sid now opens with errors='surrogateescape', preserving byte fidelity through the JSON parse. Surrogate-half chars correctly fail isascii() / isalpha() so the typed-letter histograms filter them out automatically. Tightening sessrec.c to escape >= 0x80 is filed for v0.2 — that's the proper forensic-data fix; the surrogateescape read makes the engine robust meanwhile. 2. Regression test (test_handler_tolerates_non_utf8_bytes_in_shard) builds a shard with raw 0xAC bytes inside a JSON 'data' string and asserts the handler still persists observations. 3. Collector's _emit_session now logs at WARNING (was DEBUG) when find_shard_with_sid returns None, citing the three usual causes (ARTIFACTS_ROOT perms, _SERVICE_RE whitelist, sessrec/collector race). Surfaces the silent-skip class of bug in seconds instead of hours — the first live run hid a perm mismatch (User=anti without SupplementaryGroups=decnet) for an entire session window before the symptom was traced upstream.
This commit is contained in:
@@ -53,9 +53,17 @@ def _events_for_sid(shard: Path, sid: str) -> list[AsciinemaEvent]:
|
||||
Mirrors the loader pattern in
|
||||
``tests/profiler/behave_shell/test_calibration_grid.py``: skip
|
||||
headers / non-matching sids / unparseable lines silently.
|
||||
|
||||
``errors="surrogateescape"`` because sessrec.c's json_escape only
|
||||
escapes bytes < 0x20 + DEL — bytes >= 0x80 pass through raw, so
|
||||
a real shard with Latin-1 / GB18030 / arbitrary 8-bit attacker
|
||||
paste content is NOT valid UTF-8. surrogateescape preserves byte
|
||||
fidelity through the JSON read; downstream isalpha() / isascii()
|
||||
correctly filter the surrogate-half chars out of the typed-letter
|
||||
histograms. Filed for v0.2: tighten sessrec.c to escape >= 0x80.
|
||||
"""
|
||||
events: list[AsciinemaEvent] = []
|
||||
with shard.open() as f:
|
||||
with shard.open(encoding="utf-8", errors="surrogateescape") as f:
|
||||
for line in f:
|
||||
try:
|
||||
rec = json.loads(line)
|
||||
|
||||
Reference in New Issue
Block a user