test(profiler/behave_shell): five-class calibration grid lockdown

BEHAVE-EXTRACTOR.md Phase A Step 9: the gate. Runs the pure engine against each of the five 2026-05-02 calibration shards and pins the contract that all subsequent Phase B-G PRs must keep green: every Phase A primitive (motor.input_modality, motor.paste_burst_rate, cognitive.inter_command_latency_class, cognitive.command_branch_diversity, cognitive.feedback_loop_engagement, cognitive.inter_command_consistency) fires at least once per shard.

* tests/profiler/behave_shell/test_calibration_grid.py is parametrized over (shard_file, class_label) for HUMAN / YOU-sim / LW-sim / CLAUDE-FF / CLAUDE-CL. It skips entirely when BEHAVE_CALIBRATION_DIR is unset (CI provides the path; local dev doesn't have to).
* Plus a discrimination-smoke check: at least one primitive produces different majority values across present classes. This catches the "constant-output regression" failure mode where the engine quietly degenerates to a stub.

Calibration tweak: BRANCH_DIVERSITY_LINEAR_MIN dropped from 0.80 to 0.70 to align with the prototype's empirical anchors (CLAUDE-CL ≈ 0.55-0.60 adaptive; YOU-sim / CLAUDE-FF scripted recon ≈ 0.75+ linear). Test for the middle band re-pinned at the new boundary.

Per-class value pinning (e.g. HUMAN must emit inter_command_consistency=bimodal) is intentionally NOT a hard gate yet: v0.1 thresholds put real human sessions in "variable", and true bimodal detection (Hartigan dip / two-peak) is registry-flagged for v0.2. Tighter pinning lands as the corpus grows.
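
For readers who have not opened test_calibration_grid.py, the discrimination-smoke idea can be sketched roughly as follows. This is an illustrative sketch only, not the code in this commit: the helper names `majority_value` and `discriminates`, and the nested-dict input shape, are assumptions made for the example.

```python
from collections import Counter


def majority_value(values):
    """Return the most common value in a non-empty list of observation values."""
    return Counter(values).most_common(1)[0][0]


def discriminates(per_class):
    """Return True if at least one primitive separates the classes.

    `per_class` (hypothetical shape) maps class_label -> {primitive: [values]}.
    A primitive "separates" when its majority value is not the same for
    every class that emitted it.
    """
    primitives = set()
    for observations in per_class.values():
        primitives.update(observations)

    for name in primitives:
        majorities = {
            majority_value(observations[name])
            for observations in per_class.values()
            if observations.get(name)  # skip classes where it never fired
        }
        if len(majorities) > 1:
            return True
    return False
```

An engine that has degenerated to a stub emits the same value everywhere, so every primitive has a single majority across classes and the check fails.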
2026-05-03 08:00:50 -04:00
parent 842b7de950
commit 640294f3dc
3 changed files with 162 additions and 13 deletions
--- a/tests/profiler/behave_shell/test_cognitive_command_branch_diversity.py
+++ b/tests/profiler/behave_shell/test_cognitive_command_branch_diversity.py
@@ -45,11 +45,10 @@ def test_repeated_first_tokens_emit_adaptive_branching() -> None:
    assert obs.value == "adaptive_branching"


-def test_middle_band_biases_to_adaptive() -> None:
-    # 7 commands, 5 unique → ratio ≈ 0.71 — between 0.60 and 0.80.
-    # The doc instructs us to bias to adaptive in the ambiguous middle.
-    tokens = ["a", "b", "c", "d", "e", "a", "b"]
-    out = list(extract_session(_commands(tokens), sid="bd-mid"))
+def test_just_below_linear_threshold_emits_adaptive() -> None:
+    # 7 commands, 4 unique → ratio ≈ 0.57 — below the 0.70 linear floor.
+    tokens = ["a", "b", "c", "d", "a", "b", "c"]
+    out = list(extract_session(_commands(tokens), sid="bd-just-adaptive"))
    obs = _of(out, "cognitive.command_branch_diversity")
    assert obs.value == "adaptive_branching"
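
The arithmetic behind the re-pinned fixture can be sketched as below. This is a minimal sketch under assumptions, not the engine's implementation: the ratio is taken to be unique first tokens over total commands (as the test comments suggest), the comparison against the floor is assumed to be inclusive, and the `"linear"` label and both function names are hypothetical. Only `adaptive_branching` and the 0.70 value come from this commit.

```python
BRANCH_DIVERSITY_LINEAR_MIN = 0.70  # value stated in the commit message


def branch_diversity_ratio(first_tokens):
    """Unique first tokens divided by total commands (assumed definition)."""
    if not first_tokens:
        raise ValueError("need at least one command")
    return len(set(first_tokens)) / len(first_tokens)


def classify_branch_diversity(first_tokens):
    """Map the ratio to a label; "linear" is a placeholder label."""
    ratio = branch_diversity_ratio(first_tokens)
    if ratio >= BRANCH_DIVERSITY_LINEAR_MIN:  # inclusive bound is an assumption
        return "linear"
    return "adaptive_branching"


# New fixture: 4 unique out of 7, ratio 4/7, below the 0.70 floor.
new_fixture = ["a", "b", "c", "d", "a", "b", "c"]
# Old fixture: 5 unique out of 7, ratio 5/7, which now sits above the floor.
old_fixture = ["a", "b", "c", "d", "e", "a", "b"]
```

Under these assumptions the old fixture (5/7) would no longer classify as adaptive once the floor moves to 0.70, which is consistent with the fixture being replaced by one at 4/7.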