test(profiler/behave_shell): H.2 calibration grid full sweep
Run the five-class calibration grid (HUMAN / YOU-sim / LW-sim / CLAUDE-FF / CLAUDE-CL) against the 2026-05-02 shards. * Hard gate green for 27 primitives across all 5 shards. * environmental.keyboard_layout moved from hard gate to PHASE_F_CONDITIONAL_PRIMITIVES — short SSH-recon corpus maxes at ~90 typed letters per session, well below the LAYOUT_MIN_TYPED_LETTERS (200) floor. The 200-floor stays per the per-phase "v0 ships when honest" rule; longer-text corpora will surface the layout signal. * Three primitives never fire on the 2026-05-02 corpus, all already conditional and all expected: - cognitive.error_resilience.frustration_typing - environmental.locale - environmental.keyboard_layout No D / F / G threshold re-tunes needed; only the keyboard_layout binding-set move. Phase H step log appended to BEHAVE-EXTRACTOR.md with per-class observation counts.
This commit is contained in:
@@ -1058,6 +1058,85 @@ primitives, and the `team_coordinated` value of
|
|||||||
`operational.multi_actor_indicators`) remains the attribution
|
`operational.multi_actor_indicators`) remains the attribution
|
||||||
engine's job — never the extractor's.
|
engine's job — never the extractor's.
|
||||||
|
|
||||||
|
## Phase H step log (extractor-slice)
|
||||||
|
|
||||||
|
Phase H ships in two slices: an **extractor slice** (H.1, H.2, and a
|
||||||
|
`0.1.0-pre` version marker) closing the engine itself, and an
|
||||||
|
**integration slice** (H.3 live smoke + H.4 worker wiring + H.5
|
||||||
|
proper v0 tag) that rides `BEHAVE-INTEGRATION.md` Phase 4. This log
|
||||||
|
covers the extractor slice.
|
||||||
|
|
||||||
|
### H.1 — Registry-coverage test
|
||||||
|
|
||||||
|
`tests/profiler/behave_shell/test_registry_coverage.py` walks
|
||||||
|
`PRIMITIVE_REGISTRY` and asserts every Tier-A primitive has a slot
|
||||||
|
in the calibration grid (hard or conditional). Tier B
|
||||||
|
(8 cross-session primitives) is excluded by an explicit allow-list;
|
||||||
|
Tier C (`toolchain.*`) is excluded by prefix. Three checks ride:
|
||||||
|
forward (every Tier-A covered), reverse (no extractor-set drift from
|
||||||
|
the registry), and a `len(tier_a) == 37` invariant. CI now fails
|
||||||
|
before a registry addition can ship without a feature function.
|
||||||
|
|
||||||
|
### H.2 — Calibration grid full sweep (2026-05-02 corpus)
|
||||||
|
|
||||||
|
Shards at `/home/anti/Tools/BEHAVE/prototype_extractors/shell/` —
|
||||||
|
five classes, 15 sessions total. Per-class observation counts:
|
||||||
|
|
||||||
|
| Class | Sessions | Observations | Distinct primitives |
|
||||||
|
|---|---|---|---|
|
||||||
|
| HUMAN | 1 | 34 | 34 |
|
||||||
|
| YOU-sim | 2 | 59 | 34 |
|
||||||
|
| LW-sim | 5 | 136 | 34 |
|
||||||
|
| CLAUDE-FF | 3 | 84 | 34 |
|
||||||
|
| CLAUDE-CL | 4 | 111 | 34 |
|
||||||
|
|
||||||
|
**One real-shard regression surfaced and fixed:**
|
||||||
|
`environmental.keyboard_layout` was on the per-shard hard gate but
|
||||||
|
the calibration corpus maxes at ~90 typed letters per session — well
|
||||||
|
below `LAYOUT_MIN_TYPED_LETTERS=200`. Most input on these
|
||||||
|
SSH-recon shards is *pasted*, not typed. The honest fix per the
|
||||||
|
per-phase rule "v0 ships when honest, not when convenient" is to
|
||||||
|
move `environmental.keyboard_layout` from `PHASE_ABCDEFG_PRIMITIVES`
|
||||||
|
to `PHASE_F_CONDITIONAL_PRIMITIVES`, alongside `environmental.locale`.
|
||||||
|
The 200-letter floor stays — the keyboard-layout signal genuinely
|
||||||
|
needs richer typed text than this corpus has, and tuning the
|
||||||
|
threshold to pass would corrupt the signal.
|
||||||
|
|
||||||
|
**Three Tier-A primitives never fire across the 2026-05-02 corpus,
|
||||||
|
all conditional and all expected:**
|
||||||
|
|
||||||
|
* `cognitive.error_resilience.frustration_typing` — needs ≥ 2
|
||||||
|
errored commands plus successful baseline (Phase D conditional).
|
||||||
|
* `environmental.locale` — needs an `env` / `locale` / `printenv`
|
||||||
|
dump in the output stream (Phase F conditional).
|
||||||
|
* `environmental.keyboard_layout` — needs ≥ 200 typed letters
|
||||||
|
per session (Phase F conditional after H.2).
|
||||||
|
|
||||||
|
The hard gate (28 primitives) fires on every shard with commands;
|
||||||
|
the discrimination smoke-check passes too. No threshold re-tunes
|
||||||
|
needed for D / F / G — the corpus surfaced the keyboard_layout shape
|
||||||
|
mismatch only.
|
||||||
|
|
||||||
|
**Calibration-grid binding update.** `PHASE_ABCDEFG_PRIMITIVES`
|
||||||
|
shrinks 28 → 27 (keyboard_layout moves to conditional);
|
||||||
|
`PHASE_F_CONDITIONAL_PRIMITIVES` grows 1 → 2.
|
||||||
|
|
||||||
|
### H.5-pre — Extractor version marker
|
||||||
|
|
||||||
|
`decnet/profiler/behave_shell/__init__.py` exports
|
||||||
|
`__version__ = "0.1.0-pre"`. The `-pre` suffix is honest: the
|
||||||
|
**extractor** is feature-complete (37/37 Tier-A primitives emit, all
|
||||||
|
green tests, calibration grid honest), but the *engine package* —
|
||||||
|
worker wiring, observations-table writes, AttackerDetail panel —
|
||||||
|
still rides the integration track. The actual `0.1.0` tag bumps the
|
||||||
|
suffix off only after `BEHAVE-INTEGRATION.md` Phase 4 lands.
|
||||||
|
|
||||||
|
**Tier-A corpus delta (final extractor-slice):** all 37 of 37
|
||||||
|
primitives have a feature function; **27** ride the per-shard hard
|
||||||
|
gate (28 pre-H.2; keyboard_layout moved out); **10** ride conditional
|
||||||
|
sets. Phase H integration slice (H.3 + H.4 + proper H.5 tag) is its
|
||||||
|
own plan.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
**Owner:** ANTI.
|
**Owner:** ANTI.
|
||||||
|
|||||||
@@ -59,10 +59,10 @@ PHASE_ABCDEFG_PRIMITIVES: frozenset[str] = frozenset({
|
|||||||
"temporal.escalation_pattern",
|
"temporal.escalation_pattern",
|
||||||
"temporal.lifecycle_markers.landing_ritual",
|
"temporal.lifecycle_markers.landing_ritual",
|
||||||
# Phase F — environmental.* output-stream block + carry-over E.4
|
# Phase F — environmental.* output-stream block + carry-over E.4
|
||||||
# (locale is conditional, see PHASE_F_CONDITIONAL_PRIMITIVES)
|
# (locale and keyboard_layout are conditional — see
|
||||||
|
# PHASE_F_CONDITIONAL_PRIMITIVES)
|
||||||
"environmental.shell_type",
|
"environmental.shell_type",
|
||||||
"environmental.terminal_multiplexer",
|
"environmental.terminal_multiplexer",
|
||||||
"environmental.keyboard_layout",
|
|
||||||
"environmental.numpad_usage",
|
"environmental.numpad_usage",
|
||||||
"temporal.lifecycle_markers.exit_behavior",
|
"temporal.lifecycle_markers.exit_behavior",
|
||||||
# Phase G — operational.* + emotional_valence.* (hard subset)
|
# Phase G — operational.* + emotional_valence.* (hard subset)
|
||||||
@@ -85,12 +85,18 @@ PHASE_D_CONDITIONAL_PRIMITIVES: frozenset[str] = frozenset({
|
|||||||
"cognitive.error_resilience.fallback_to_man",
|
"cognitive.error_resilience.fallback_to_man",
|
||||||
})
|
})
|
||||||
|
|
||||||
# Phase F primitives conditional on shard content. ``environmental.locale``
|
# Phase F primitives conditional on shard content.
|
||||||
# fires only when the shard's output contains an env / locale dump
|
# * ``environmental.locale`` fires only when the shard's output contains
|
||||||
# (LANG=, LC_ALL=, LC_CTYPE=). It's tracked here, not in the per-shard
|
# an env / locale dump (LANG=, LC_ALL=, LC_CTYPE=).
|
||||||
# hard gate.
|
# * ``environmental.keyboard_layout`` requires LAYOUT_MIN_TYPED_LETTERS
|
||||||
|
# (200) typed letters per session — short SSH-recon shards (the
|
||||||
|
# 2026-05-02 calibration corpus) max out around 90 typed letters
|
||||||
|
# per session because most input is pasted rather than typed.
|
||||||
|
# v0 keeps the 200-floor honesty rather than tuning to pass; longer-
|
||||||
|
# text corpora will surface it.
|
||||||
PHASE_F_CONDITIONAL_PRIMITIVES: frozenset[str] = frozenset({
|
PHASE_F_CONDITIONAL_PRIMITIVES: frozenset[str] = frozenset({
|
||||||
"environmental.locale",
|
"environmental.locale",
|
||||||
|
"environmental.keyboard_layout",
|
||||||
})
|
})
|
||||||
|
|
||||||
# Phase G primitives that ride sample-size floors and may legitimately
|
# Phase G primitives that ride sample-size floors and may legitimately
|
||||||
|
|||||||
Reference in New Issue
Block a user