test(profiler/behave_shell): H.2 calibration grid full sweep
Run the five-class calibration grid (HUMAN / YOU-sim / LW-sim / CLAUDE-FF / CLAUDE-CL) against the 2026-05-02 shards. * Hard gate green for 27 primitives across all 5 shards. * environmental.keyboard_layout moved from hard gate to PHASE_F_CONDITIONAL_PRIMITIVES — short SSH-recon corpus maxes at ~90 typed letters per session, well below the LAYOUT_MIN_TYPED_LETTERS (200) floor. The 200-floor stays per the per-phase "v0 ships when honest" rule; longer-text corpora will surface the layout signal. * Three primitives never fire on the 2026-05-02 corpus, all already conditional and all expected: - cognitive.error_resilience.frustration_typing - environmental.locale - environmental.keyboard_layout No D / F / G threshold re-tunes needed; only the keyboard_layout binding-set move. Phase H step log appended to BEHAVE-EXTRACTOR.md with per-class observation counts.
This commit is contained in:
@@ -1058,6 +1058,85 @@ primitives, and the `team_coordinated` value of
|
||||
`operational.multi_actor_indicators`) remains the attribution
|
||||
engine's job — never the extractor's.
|
||||
|
||||
## Phase H step log (extractor-slice)
|
||||
|
||||
Phase H ships in two slices: an **extractor slice** (H.1, H.2, and a
|
||||
`0.1.0-pre` version marker) closing the engine itself, and an
|
||||
**integration slice** (H.3 live smoke + H.4 worker wiring + H.5
|
||||
proper v0 tag) that rides `BEHAVE-INTEGRATION.md` Phase 4. This log
|
||||
covers the extractor slice.
|
||||
|
||||
### H.1 — Registry-coverage test
|
||||
|
||||
`tests/profiler/behave_shell/test_registry_coverage.py` walks
|
||||
`PRIMITIVE_REGISTRY` and asserts every Tier-A primitive has a slot
|
||||
in the calibration grid (hard or conditional). Tier B
|
||||
(8 cross-session primitives) is excluded by an explicit allow-list;
|
||||
Tier C (`toolchain.*`) is excluded by prefix. Three checks ride:
|
||||
forward (every Tier-A covered), reverse (no extractor-set drift from
|
||||
the registry), and a `len(tier_a) == 37` invariant. CI now fails
|
||||
before a registry addition can ship without a feature function.
|
||||
|
||||
### H.2 — Calibration grid full sweep (2026-05-02 corpus)
|
||||
|
||||
Shards at `/home/anti/Tools/BEHAVE/prototype_extractors/shell/` —
|
||||
five classes, 15 sessions total. Per-class observation counts:
|
||||
|
||||
| Class | Sessions | Observations | Distinct primitives |
|
||||
|---|---|---|---|
|
||||
| HUMAN | 1 | 34 | 34 |
|
||||
| YOU-sim | 2 | 59 | 34 |
|
||||
| LW-sim | 5 | 136 | 34 |
|
||||
| CLAUDE-FF | 3 | 84 | 34 |
|
||||
| CLAUDE-CL | 4 | 111 | 34 |
|
||||
|
||||
**One real-shard regression surfaced and fixed:**
|
||||
`environmental.keyboard_layout` was on the per-shard hard gate but
|
||||
the calibration corpus maxes at ~90 typed letters per session — well
|
||||
below `LAYOUT_MIN_TYPED_LETTERS=200`. Most input on these
|
||||
SSH-recon shards is *pasted*, not typed. The honest fix per the
|
||||
per-phase rule "v0 ships when honest, not when convenient" is to
|
||||
move `environmental.keyboard_layout` from `PHASE_ABCDEFG_PRIMITIVES`
|
||||
to `PHASE_F_CONDITIONAL_PRIMITIVES`, alongside `environmental.locale`.
|
||||
The 200-letter floor stays — the keyboard-layout signal genuinely
|
||||
needs richer typed text than this corpus has, and tuning the
|
||||
threshold to pass would corrupt the signal.
|
||||
|
||||
**Three Tier-A primitives never fire across the 2026-05-02 corpus,
|
||||
all conditional and all expected:**
|
||||
|
||||
* `cognitive.error_resilience.frustration_typing` — needs ≥ 2
|
||||
errored commands plus successful baseline (Phase D conditional).
|
||||
* `environmental.locale` — needs an `env` / `locale` / `printenv`
|
||||
dump in the output stream (Phase F conditional).
|
||||
* `environmental.keyboard_layout` — needs ≥ 200 typed letters
|
||||
per session (Phase F conditional after H.2).
|
||||
|
||||
The hard gate (28 primitives) fires on every shard with commands;
|
||||
the discrimination smoke-check passes too. No threshold re-tunes
|
||||
needed for D / F / G — the corpus surfaced the keyboard_layout shape
|
||||
mismatch only.
|
||||
|
||||
**Calibration-grid binding update.** `PHASE_ABCDEFG_PRIMITIVES`
|
||||
shrinks 28 → 27 (keyboard_layout moves to conditional);
|
||||
`PHASE_F_CONDITIONAL_PRIMITIVES` grows 1 → 2.
|
||||
|
||||
### H.5-pre — Extractor version marker
|
||||
|
||||
`decnet/profiler/behave_shell/__init__.py` exports
|
||||
`__version__ = "0.1.0-pre"`. The `-pre` suffix is honest: the
|
||||
**extractor** is feature-complete (37/37 Tier-A primitives emit, all
|
||||
green tests, calibration grid honest), but the *engine package* —
|
||||
worker wiring, observations-table writes, AttackerDetail panel —
|
||||
still rides the integration track. The actual `0.1.0` tag bumps the
|
||||
suffix off only after `BEHAVE-INTEGRATION.md` Phase 4 lands.
|
||||
|
||||
**Tier-A corpus delta (final extractor-slice):** all 37 of 37
|
||||
primitives have a feature function; **27** ride the per-shard hard
|
||||
gate (28 pre-H.2; keyboard_layout moved out); **10** ride conditional
|
||||
sets. Phase H integration slice (H.3 + H.4 + proper H.5 tag) is its
|
||||
own plan.
|
||||
|
||||
---
|
||||
|
||||
**Owner:** ANTI.
|
||||
|
||||
@@ -59,10 +59,10 @@ PHASE_ABCDEFG_PRIMITIVES: frozenset[str] = frozenset({
|
||||
"temporal.escalation_pattern",
|
||||
"temporal.lifecycle_markers.landing_ritual",
|
||||
# Phase F — environmental.* output-stream block + carry-over E.4
|
||||
# (locale is conditional, see PHASE_F_CONDITIONAL_PRIMITIVES)
|
||||
# (locale and keyboard_layout are conditional — see
|
||||
# PHASE_F_CONDITIONAL_PRIMITIVES)
|
||||
"environmental.shell_type",
|
||||
"environmental.terminal_multiplexer",
|
||||
"environmental.keyboard_layout",
|
||||
"environmental.numpad_usage",
|
||||
"temporal.lifecycle_markers.exit_behavior",
|
||||
# Phase G — operational.* + emotional_valence.* (hard subset)
|
||||
@@ -85,12 +85,18 @@ PHASE_D_CONDITIONAL_PRIMITIVES: frozenset[str] = frozenset({
|
||||
"cognitive.error_resilience.fallback_to_man",
|
||||
})
|
||||
|
||||
# Phase F primitives conditional on shard content. ``environmental.locale``
|
||||
# fires only when the shard's output contains an env / locale dump
|
||||
# (LANG=, LC_ALL=, LC_CTYPE=). It's tracked here, not in the per-shard
|
||||
# hard gate.
|
||||
# Phase F primitives conditional on shard content.
|
||||
# * ``environmental.locale`` fires only when the shard's output contains
|
||||
# an env / locale dump (LANG=, LC_ALL=, LC_CTYPE=).
|
||||
# * ``environmental.keyboard_layout`` requires LAYOUT_MIN_TYPED_LETTERS
|
||||
# (200) typed letters per session — short SSH-recon shards (the
|
||||
# 2026-05-02 calibration corpus) max out around 90 typed letters
|
||||
# per session because most input is pasted rather than typed.
|
||||
# v0 keeps the 200-floor honesty rather than tuning to pass; longer-
|
||||
# text corpora will surface it.
|
||||
PHASE_F_CONDITIONAL_PRIMITIVES: frozenset[str] = frozenset({
|
||||
"environmental.locale",
|
||||
"environmental.keyboard_layout",
|
||||
})
|
||||
|
||||
# Phase G primitives that ride sample-size floors and may legitimately
|
||||
|
||||
Reference in New Issue
Block a user