test(profiler/behave_shell): H.2 calibration grid full sweep

Run the five-class calibration grid (HUMAN / YOU-sim / LW-sim /
CLAUDE-FF / CLAUDE-CL) against the 2026-05-02 shards.

* Hard gate green for 27 primitives across all 5 shards.
* environmental.keyboard_layout moved from hard gate to
  PHASE_F_CONDITIONAL_PRIMITIVES — short SSH-recon corpus maxes at
  ~90 typed letters per session, well below the LAYOUT_MIN_TYPED_LETTERS
  (200) floor. The 200-floor stays per the per-phase "v0 ships when
  honest" rule; longer-text corpora will surface the layout signal.
* Three primitives never fire on the 2026-05-02 corpus, all already
  conditional and all expected:
  - cognitive.error_resilience.frustration_typing
  - environmental.locale
  - environmental.keyboard_layout

No D / F / G threshold re-tunes needed; only the keyboard_layout
binding-set move. Phase H step log appended to BEHAVE-EXTRACTOR.md
with per-class observation counts.
This commit is contained in:
2026-05-08 18:33:51 -04:00
parent ac04751c18
commit 9ebaca410a
2 changed files with 91 additions and 6 deletions

View File

@@ -1058,6 +1058,85 @@ primitives, and the `team_coordinated` value of
`operational.multi_actor_indicators`) remains the attribution `operational.multi_actor_indicators`) remains the attribution
engine's job — never the extractor's. engine's job — never the extractor's.
## Phase H step log (extractor-slice)
Phase H ships in two slices: an **extractor slice** (H.1, H.2, and a
`0.1.0-pre` version marker) closing the engine itself, and an
**integration slice** (H.3 live smoke + H.4 worker wiring + H.5
proper v0 tag) that rides `BEHAVE-INTEGRATION.md` Phase 4. This log
covers the extractor slice.
### H.1 — Registry-coverage test
`tests/profiler/behave_shell/test_registry_coverage.py` walks
`PRIMITIVE_REGISTRY` and asserts every Tier-A primitive has a slot
in the calibration grid (hard or conditional). Tier B
(8 cross-session primitives) is excluded by an explicit allow-list;
Tier C (`toolchain.*`) is excluded by prefix. Three checks ride:
forward (every Tier-A covered), reverse (no extractor-set drift from
the registry), and a `len(tier_a) == 37` invariant. CI now fails
before a registry addition can ship without a feature function.
### H.2 — Calibration grid full sweep (2026-05-02 corpus)
Shards at `/home/anti/Tools/BEHAVE/prototype_extractors/shell/`
five classes, 15 sessions total. Per-class observation counts:
| Class | Sessions | Observations | Distinct primitives |
|---|---|---|---|
| HUMAN | 1 | 34 | 34 |
| YOU-sim | 2 | 59 | 34 |
| LW-sim | 5 | 136 | 34 |
| CLAUDE-FF | 3 | 84 | 34 |
| CLAUDE-CL | 4 | 111 | 34 |
**One real-shard regression surfaced and fixed:**
`environmental.keyboard_layout` was on the per-shard hard gate but
the calibration corpus maxes at ~90 typed letters per session — well
below `LAYOUT_MIN_TYPED_LETTERS=200`. Most input on these
SSH-recon shards is *pasted*, not typed. The honest fix per the
per-phase rule "v0 ships when honest, not when convenient" is to
move `environmental.keyboard_layout` from `PHASE_ABCDEFG_PRIMITIVES`
to `PHASE_F_CONDITIONAL_PRIMITIVES`, alongside `environmental.locale`.
The 200-letter floor stays — the keyboard-layout signal genuinely
needs richer typed text than this corpus has, and tuning the
threshold to pass would corrupt the signal.
**Three Tier-A primitives never fire across the 2026-05-02 corpus,
all conditional and all expected:**
* `cognitive.error_resilience.frustration_typing` — needs ≥ 2
errored commands plus successful baseline (Phase D conditional).
* `environmental.locale` — needs an `env` / `locale` / `printenv`
dump in the output stream (Phase F conditional).
* `environmental.keyboard_layout` — needs ≥ 200 typed letters
per session (Phase F conditional after H.2).
The hard gate (28 primitives) fires on every shard with commands;
the discrimination smoke-check passes too. No threshold re-tunes
needed for D / F / G — the corpus surfaced the keyboard_layout shape
mismatch only.
**Calibration-grid binding update.** `PHASE_ABCDEFG_PRIMITIVES`
shrinks 28 → 27 (keyboard_layout moves to conditional);
`PHASE_F_CONDITIONAL_PRIMITIVES` grows 1 → 2.
### H.5-pre — Extractor version marker
`decnet/profiler/behave_shell/__init__.py` exports
`__version__ = "0.1.0-pre"`. The `-pre` suffix is honest: the
**extractor** is feature-complete (37/37 Tier-A primitives emit, all
green tests, calibration grid honest), but the *engine package*
worker wiring, observations-table writes, AttackerDetail panel —
still rides the integration track. The actual `0.1.0` tag bumps the
suffix off only after `BEHAVE-INTEGRATION.md` Phase 4 lands.
**Tier-A corpus delta (final extractor-slice):** all 37 of 37
primitives have a feature function; **27** ride the per-shard hard
gate (28 pre-H.2; keyboard_layout moved out); **10** ride conditional
sets. Phase H integration slice (H.3 + H.4 + proper H.5 tag) is its
own plan.
--- ---
**Owner:** ANTI. **Owner:** ANTI.

View File

@@ -59,10 +59,10 @@ PHASE_ABCDEFG_PRIMITIVES: frozenset[str] = frozenset({
"temporal.escalation_pattern", "temporal.escalation_pattern",
"temporal.lifecycle_markers.landing_ritual", "temporal.lifecycle_markers.landing_ritual",
# Phase F — environmental.* output-stream block + carry-over E.4 # Phase F — environmental.* output-stream block + carry-over E.4
# (locale is conditional, see PHASE_F_CONDITIONAL_PRIMITIVES) # (locale and keyboard_layout are conditional see
# PHASE_F_CONDITIONAL_PRIMITIVES)
"environmental.shell_type", "environmental.shell_type",
"environmental.terminal_multiplexer", "environmental.terminal_multiplexer",
"environmental.keyboard_layout",
"environmental.numpad_usage", "environmental.numpad_usage",
"temporal.lifecycle_markers.exit_behavior", "temporal.lifecycle_markers.exit_behavior",
# Phase G — operational.* + emotional_valence.* (hard subset) # Phase G — operational.* + emotional_valence.* (hard subset)
@@ -85,12 +85,18 @@ PHASE_D_CONDITIONAL_PRIMITIVES: frozenset[str] = frozenset({
"cognitive.error_resilience.fallback_to_man", "cognitive.error_resilience.fallback_to_man",
}) })
# Phase F primitives conditional on shard content. ``environmental.locale`` # Phase F primitives conditional on shard content.
# fires only when the shard's output contains an env / locale dump # * ``environmental.locale`` fires only when the shard's output contains
# (LANG=, LC_ALL=, LC_CTYPE=). It's tracked here, not in the per-shard # an env / locale dump (LANG=, LC_ALL=, LC_CTYPE=).
# hard gate. # * ``environmental.keyboard_layout`` requires LAYOUT_MIN_TYPED_LETTERS
# (200) typed letters per session — short SSH-recon shards (the
# 2026-05-02 calibration corpus) max out around 90 typed letters
# per session because most input is pasted rather than typed.
# v0 keeps the 200-floor honesty rather than tuning to pass; longer-
# text corpora will surface it.
PHASE_F_CONDITIONAL_PRIMITIVES: frozenset[str] = frozenset({ PHASE_F_CONDITIONAL_PRIMITIVES: frozenset[str] = frozenset({
"environmental.locale", "environmental.locale",
"environmental.keyboard_layout",
}) })
# Phase G primitives that ride sample-size floors and may legitimately # Phase G primitives that ride sample-size floors and may legitimately