diff --git a/development/BEHAVE-EXTRACTOR.md b/development/BEHAVE-EXTRACTOR.md index 8fa2afde..91d8ef03 100644 --- a/development/BEHAVE-EXTRACTOR.md +++ b/development/BEHAVE-EXTRACTOR.md @@ -1058,6 +1058,85 @@ primitives, and the `team_coordinated` value of `operational.multi_actor_indicators`) remains the attribution engine's job — never the extractor's. +## Phase H step log (extractor-slice) + +Phase H ships in two slices: an **extractor slice** (H.1, H.2, and a +`0.1.0-pre` version marker) closing the engine itself, and an +**integration slice** (H.3 live smoke + H.4 worker wiring + H.5 +proper v0 tag) that rides `BEHAVE-INTEGRATION.md` Phase 4. This log +covers the extractor slice. + +### H.1 — Registry-coverage test + +`tests/profiler/behave_shell/test_registry_coverage.py` walks +`PRIMITIVE_REGISTRY` and asserts every Tier-A primitive has a slot +in the calibration grid (hard or conditional). Tier B +(8 cross-session primitives) is excluded by an explicit allow-list; +Tier C (`toolchain.*`) is excluded by prefix. Three checks ride: +forward (every Tier-A covered), reverse (no extractor-set drift from +the registry), and a `len(tier_a) == 37` invariant. CI now fails +before a registry addition can ship without a feature function. + +### H.2 — Calibration grid full sweep (2026-05-02 corpus) + +Shards at `/home/anti/Tools/BEHAVE/prototype_extractors/shell/` — +five classes, 15 sessions total. Per-class observation counts: + +| Class | Sessions | Observations | Distinct primitives | +|---|---|---|---| +| HUMAN | 1 | 34 | 34 | +| YOU-sim | 2 | 59 | 34 | +| LW-sim | 5 | 136 | 34 | +| CLAUDE-FF | 3 | 84 | 34 | +| CLAUDE-CL | 4 | 111 | 34 | + +**One real-shard regression surfaced and fixed:** +`environmental.keyboard_layout` was on the per-shard hard gate but +the calibration corpus maxes at ~90 typed letters per session — well +below `LAYOUT_MIN_TYPED_LETTERS=200`. Most input on these +SSH-recon shards is *pasted*, not typed. The honest fix per the +per-phase rule "v0 ships when honest, not when convenient" is to +move `environmental.keyboard_layout` from `PHASE_ABCDEFG_PRIMITIVES` +to `PHASE_F_CONDITIONAL_PRIMITIVES`, alongside `environmental.locale`. +The 200-letter floor stays — the keyboard-layout signal genuinely +needs richer typed text than this corpus has, and tuning the +threshold to pass would corrupt the signal. + +**Three Tier-A primitives never fire across the 2026-05-02 corpus, +all conditional and all expected:** + +* `cognitive.error_resilience.frustration_typing` — needs ≥ 2 + errored commands plus successful baseline (Phase D conditional). +* `environmental.locale` — needs an `env` / `locale` / `printenv` + dump in the output stream (Phase F conditional). +* `environmental.keyboard_layout` — needs ≥ 200 typed letters + per session (Phase F conditional after H.2). + +The hard gate (28 primitives) fires on every shard with commands; +the discrimination smoke-check passes too. No threshold re-tunes +needed for D / F / G — the corpus surfaced the keyboard_layout shape +mismatch only. + +**Calibration-grid binding update.** `PHASE_ABCDEFG_PRIMITIVES` +shrinks 28 → 27 (keyboard_layout moves to conditional); +`PHASE_F_CONDITIONAL_PRIMITIVES` grows 1 → 2. + +### H.5-pre — Extractor version marker + +`decnet/profiler/behave_shell/__init__.py` exports +`__version__ = "0.1.0-pre"`. The `-pre` suffix is honest: the +**extractor** is feature-complete (37/37 Tier-A primitives emit, all +green tests, calibration grid honest), but the *engine package* — +worker wiring, observations-table writes, AttackerDetail panel — +still rides the integration track. The actual `0.1.0` tag bumps the +suffix off only after `BEHAVE-INTEGRATION.md` Phase 4 lands. + +**Tier-A corpus delta (final extractor-slice):** all 37 of 37 +primitives have a feature function; **27** ride the per-shard hard +gate (28 pre-H.2; keyboard_layout moved out); **10** ride conditional +sets. Phase H integration slice (H.3 + H.4 + proper H.5 tag) is its +own plan. + --- **Owner:** ANTI. diff --git a/tests/profiler/behave_shell/test_calibration_grid.py b/tests/profiler/behave_shell/test_calibration_grid.py index bf965451..f435255b 100644 --- a/tests/profiler/behave_shell/test_calibration_grid.py +++ b/tests/profiler/behave_shell/test_calibration_grid.py @@ -59,10 +59,10 @@ PHASE_ABCDEFG_PRIMITIVES: frozenset[str] = frozenset({ "temporal.escalation_pattern", "temporal.lifecycle_markers.landing_ritual", # Phase F — environmental.* output-stream block + carry-over E.4 - # (locale is conditional, see PHASE_F_CONDITIONAL_PRIMITIVES) + # (locale and keyboard_layout are conditional — see + # PHASE_F_CONDITIONAL_PRIMITIVES) "environmental.shell_type", "environmental.terminal_multiplexer", - "environmental.keyboard_layout", "environmental.numpad_usage", "temporal.lifecycle_markers.exit_behavior", # Phase G — operational.* + emotional_valence.* (hard subset) @@ -85,12 +85,18 @@ PHASE_D_CONDITIONAL_PRIMITIVES: frozenset[str] = frozenset({ "cognitive.error_resilience.fallback_to_man", }) -# Phase F primitives conditional on shard content. ``environmental.locale`` -# fires only when the shard's output contains an env / locale dump -# (LANG=, LC_ALL=, LC_CTYPE=). It's tracked here, not in the per-shard -# hard gate. +# Phase F primitives conditional on shard content. +# * ``environmental.locale`` fires only when the shard's output contains +# an env / locale dump (LANG=, LC_ALL=, LC_CTYPE=). +# * ``environmental.keyboard_layout`` requires LAYOUT_MIN_TYPED_LETTERS +# (200) typed letters per session — short SSH-recon shards (the +# 2026-05-02 calibration corpus) max out around 90 typed letters +# per session because most input is pasted rather than typed. +# v0 keeps the 200-floor honesty rather than tuning to pass; longer- +# text corpora will surface it. PHASE_F_CONDITIONAL_PRIMITIVES: frozenset[str] = frozenset({ "environmental.locale", + "environmental.keyboard_layout", }) # Phase G primitives that ride sample-size floors and may legitimately