test(profiler/behave_shell): five-class calibration grid lockdown
BEHAVE-EXTRACTOR.md Phase A Step 9 — the gate. Runs the pure engine against each of the five 2026-05-02 calibration shards and pins the contract that all subsequent Phase B-G PRs must keep green: every Phase A primitive (motor.input_modality, motor.paste_burst_rate, cognitive.inter_command_latency_class, cognitive.command_branch_diversity, cognitive.feedback_loop_engagement, cognitive.inter_command_consistency) fires at least once per shard. * tests/profiler/behave_shell/test_calibration_grid.py parametrized over (shard_file, class_label) for HUMAN / YOU-sim / LW-sim / CLAUDE-FF / CLAUDE-CL. Skips entirely when BEHAVE_CALIBRATION_DIR is unset (CI provides the path; local dev doesn't have to). * Plus a discrimination-smoke check: at least one primitive produces different majority values across present classes — catches the "constant-output regression" failure mode where the engine quietly degenerates to a stub. Calibration tweak: BRANCH_DIVERSITY_LINEAR_MIN dropped from 0.80 to 0.70 to align with the prototype's empirical anchors (CLAUDE-CL ≈ 0.55-0.60 adaptive; YOU-sim / CLAUDE-FF scripted recon ≈ 0.75+ linear). Test for the middle band re-pinned at the new boundary. Per-class value pinning (e.g. HUMAN must emit inter_command_consistency=bimodal) is intentionally NOT a hard gate yet — v0.1 thresholds put real human sessions in "variable", and true bimodal detection (Hartigan dip / two-peak) is registry-flagged for v0.2. Tighter pinning lands as the corpus grows.
This commit is contained in:
@@ -54,14 +54,11 @@ INTER_CMD_LLM_HEAVYWEIGHT_MAX: float = 30.00
|
||||
MIN_COMMANDS_FOR_FULL_CONFIDENCE: int = 5
|
||||
|
||||
# ── cognitive.command_branch_diversity (Step 6) ─────────────────────────────
|
||||
# unique_first_tokens / total_commands ratio. Empirical (CLAUDE-FF vs
|
||||
# CLAUDE-CL on 2026-05-02): fire-and-forget runs ~10 distinct tools (ratio
|
||||
# near 1.0) → linear_playbook; closed-loop runs ~5-6 tools with the same
|
||||
# tool re-invoked → adaptive_branching.
|
||||
BRANCH_DIVERSITY_LINEAR_MIN: float = 0.80 # >= → linear_playbook
|
||||
BRANCH_DIVERSITY_ADAPTIVE_MAX: float = 0.60 # <= → adaptive_branching
|
||||
# Between is the ambiguous middle band — bias toward adaptive (the
|
||||
# operator is reusing tools).
|
||||
# unique_first_tokens / total_commands ratio. Prototype's empirical
|
||||
# split (sessions-2026-05-02-* corpus): CLAUDE-CL chasing one finding
|
||||
# ≈ 0.55-0.60 (adaptive), HUMAN exploring filesystem ≈ 0.65-0.70
|
||||
# (adaptive), YOU-sim / CLAUDE-FF scripted recon ≈ 0.75+ (linear).
|
||||
BRANCH_DIVERSITY_LINEAR_MIN: float = 0.70 # >= → linear_playbook
|
||||
|
||||
# ── cognitive.feedback_loop_engagement (Step 7) ─────────────────────────────
|
||||
# Pearson r threshold for "the operator's pause grew with the volume of
|
||||
|
||||
Reference in New Issue
Block a user