test(profiler/behave_shell): Phase 6 smoke harness + live-decky runbook
Two-half deliverable per BEHAVE-INTEGRATION.md §587-594:

* scripts/behave_shell/replay_calibration.py — Python helper that drives the production handler against one asciinema shard, mints a temp SQLite repo + an Attacker per session, captures bus emissions in-process. Exits non-zero on zero-observation sessions.
* scripts/behave_shell/smoke.sh — bash entry that replays all five 2026-05-02 calibration shards (HUMAN / YOU-sim / LW-sim / CLAUDE-FF / CLAUDE-CL). Auto-activates .311 venv, forces DECNET_DB_TYPE=sqlite, prints per-class summary. Suitable for CI.
* scripts/behave_shell/README.md — runbook covering both halves. Pins the manual live-decky procedure (one SSH session per class against a deployed smoke-decky, expected dominant primitives table, SQL verification query, AttackerDetail panel check, pass criteria).
* BEHAVE-INTEGRATION.md — Phase 6 completion log appended with current corpus results table (15 sessions, 424 observations across the five classes) and a note that the v0 tag (drop -pre) is gated on the manual live-decky round-trip and lands as a separate commit.

Live-decky run is intentionally NOT scripted — the integration doc calls for manual SSH sessions per class so an operator confirms the bus / collector / disk-reach plumbing under real PTY conditions.
110
scripts/behave_shell/README.md
Normal file
@@ -0,0 +1,110 @@
# BEHAVE-SHELL — Phase 6 smoke

Two halves:

1. **Offline replay** — `smoke.sh` replays the five 2026-05-02
   calibration shards through the production handler. Exercises the
   engine + storage layer end-to-end without a live PTY. Suitable for
   CI.
2. **Live decky round-trip** — manual procedure below. Confirms the
   bus / collector / disk-reach plumbing on a real session.

## 1. Offline replay

```sh
$ scripts/behave_shell/smoke.sh                          # auto-discovers ../BEHAVE/prototype_extractors/shell
$ scripts/behave_shell/smoke.sh /path/to/calibration/dir # explicit dir
```

Expected output (15 sessions across 5 classes, 424 total observations
on the current corpus):

```
[HUMAN] sessions=1 observations=34 distinct_primitives=34
[YOU-sim] sessions=2 observations=59 distinct_primitives=34
[LW-sim] sessions=5 observations=136 distinct_primitives=34
[CLAUDE-FF] sessions=3 observations=84 distinct_primitives=34
[CLAUDE-CL] sessions=4 observations=111 distinct_primitives=34
smoke: OK — all classes emit observations end-to-end
```

Exit codes: `0` full pass, `1` any class regressed, `2` argument /
IO error.
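
For CI jobs that post-process the summary, the per-class lines can be parsed and totalled. The following is a minimal sketch, assuming the line shape shown in the expected output above; `parse_summary`, `SUMMARY_RE` and `EXIT_MEANINGS` are illustrative names and are not part of `smoke.sh`.

```python
import re

# Line shape assumed from the expected output above; extra spacing is tolerated.
SUMMARY_RE = re.compile(
    r"^\[(?P<cls>[^\]]+)\]\s+sessions=(?P<sessions>\d+)\s+"
    r"observations=(?P<observations>\d+)\s+distinct_primitives=(?P<distinct>\d+)$"
)

# Meanings taken from the exit-code line above.
EXIT_MEANINGS = {0: "full pass", 1: "a class regressed", 2: "argument / IO error"}


def parse_summary(text):
    """Map each class name to its integer counters; non-summary lines are skipped."""
    summary = {}
    for line in text.splitlines():
        match = SUMMARY_RE.match(line.strip())
        if match:
            summary[match["cls"]] = {
                key: int(match[key])
                for key in ("sessions", "observations", "distinct")
            }
    return summary


SAMPLE = """\
[HUMAN] sessions=1 observations=34 distinct_primitives=34
[YOU-sim] sessions=2 observations=59 distinct_primitives=34
[LW-sim] sessions=5 observations=136 distinct_primitives=34
[CLAUDE-FF] sessions=3 observations=84 distinct_primitives=34
[CLAUDE-CL] sessions=4 observations=111 distinct_primitives=34
"""

summary = parse_summary(SAMPLE)
total_sessions = sum(row["sessions"] for row in summary.values())
total_observations = sum(row["observations"] for row in summary.values())
print(total_sessions, total_observations)  # 15 424
print(EXIT_MEANINGS.get(1, "unknown"))     # a class regressed
```

The totals match the 15 sessions and 424 observations quoted for the current corpus, so a CI step could compare them against a stored baseline.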

The replay drives `decnet.profiler.behave_shell._handler.handle_session_ended`
directly against a temp SQLite DB seeded with one Attacker per
session. Bus emission is captured by an in-process publisher; no
real bus is required.
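
The capture pattern can be illustrated in isolation. This is a rough sketch only: `CapturingPublisher`, `stand_in_handler` and the two-table schema are hypothetical placeholders, and the real handler's signature and the real DECNET schema are not shown in this README.

```python
import sqlite3
import tempfile
from pathlib import Path


class CapturingPublisher:
    """Hypothetical bus stand-in: records (topic, payload) pairs in memory."""

    def __init__(self):
        self.events = []

    def publish(self, topic, payload):
        self.events.append((topic, payload))


def stand_in_handler(conn, publisher, attacker_id, observations):
    """Placeholder for the real handler: persist each row, then publish it."""
    for primitive, value in observations:
        conn.execute(
            "INSERT INTO observations (attacker_id, primitive, value) VALUES (?, ?, ?)",
            (attacker_id, primitive, value),
        )
        publisher.publish(
            f"attacker.observation.{primitive}",
            {"attacker_id": attacker_id, "value": value},
        )
    conn.commit()


publisher = CapturingPublisher()
with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(Path(tmp) / "replay.db")
    # Illustrative schema only; the real tables are an unknown here.
    conn.execute("CREATE TABLE attackers (id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE observations (attacker_id TEXT, primitive TEXT, value TEXT)"
    )
    conn.execute("INSERT INTO attackers (id) VALUES (?)", ("attacker-1",))
    stand_in_handler(
        conn,
        publisher,
        "attacker-1",
        [
            ("motor.input_modality", "typed"),
            ("cognitive.feedback_loop_engagement", "closed_loop"),
        ],
    )
    row_count = conn.execute("SELECT COUNT(*) FROM observations").fetchone()[0]
    conn.close()

topics = [topic for topic, _ in publisher.events]
print(row_count, topics)
```

Because the publisher only appends to a list, a replay can assert on both the persisted rows and the emitted topics without any bus process running.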

## 2. Live decky round-trip (manual)

End-to-end confirmation. Run **once** before tagging v0 and **after**
any change to the bus / collector / disk-reach layer.

### Setup

1. Init a fresh DECNET host (see `decnet init`).
2. `decnet bus` worker is up (systemd unit
   `decnet-bus.service` or `scripts/bus/smoke.sh`).
3. `decnet-profiler.service` is up — it owns the
   `attacker.session.ended` subscription and the BEHAVE-SHELL handler.
4. `decnet-collector.service` is up — it publishes
   `attacker.session.ended` from `session_recorded` log events.
5. Web API is up; you have a viewer JWT in your browser localStorage.
6. Deploy a single `ssh` decky:
   ```sh
   $ decnet decky deploy --service ssh --decky smoke-decky
   ```
   The decky's sessrec wrapper appends to
   `/var/lib/decnet/artifacts/smoke-decky/ssh/transcripts/sessions-<UTC-DAY>.jsonl`.

### Run one session per calibration class

For each class, SSH into the decky and reproduce the canonical
workload. Log out via the documented exit path so the
`session_recorded` event fires. The collector aggregates the session
and publishes `attacker.session.ended`; the profiler worker
disk-reaches the shard, runs `extract_session()`, persists rows,
publishes one `attacker.observation.<primitive>` per emission.

| Class | Workload sketch | Expected dominant primitives |
|---|---|---|
| HUMAN | Type each command live; correct typos; pause to read output. | `motor.input_modality=typed`, `cognitive.feedback_loop_engagement=closed_loop` |
| YOU-sim | Paste short pre-canned commands at typing speed; minimal repeats. | `motor.input_modality=pasted`, `motor.paste_burst_rate=occasional`, `cognitive.command_branch_diversity=linear_playbook` |
| LW-sim | Paste a recon sweep generated by a small LLM; ~2-8s between pastes. | `cognitive.inter_command_latency_class=llm_lightweight` |
| CLAUDE-FF | Paste outputs from a fire-and-forget reasoning agent; ~8-30s gaps. | `cognitive.inter_command_latency_class=llm_heavyweight`, `cognitive.feedback_loop_engagement=fire_and_forget` |
| CLAUDE-CL | Drive a closed-loop plan-execute-observe agent; >30s pauses on long output. | `cognitive.inter_command_latency_class=long`, `cognitive.feedback_loop_engagement=closed_loop` |
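
The expected-primitives column can be encoded as data so an operator does not have to compare values by eye. This is a sketch under the assumption that observed rows have already been collapsed to a `{primitive: value}` dict per session; `EXPECTED` mirrors the table above and `mismatches` is an illustrative helper, not an existing function.

```python
# Expected dominant primitives per class, copied from the table above.
EXPECTED = {
    "HUMAN": {
        "motor.input_modality": "typed",
        "cognitive.feedback_loop_engagement": "closed_loop",
    },
    "YOU-sim": {
        "motor.input_modality": "pasted",
        "motor.paste_burst_rate": "occasional",
        "cognitive.command_branch_diversity": "linear_playbook",
    },
    "LW-sim": {
        "cognitive.inter_command_latency_class": "llm_lightweight",
    },
    "CLAUDE-FF": {
        "cognitive.inter_command_latency_class": "llm_heavyweight",
        "cognitive.feedback_loop_engagement": "fire_and_forget",
    },
    "CLAUDE-CL": {
        "cognitive.inter_command_latency_class": "long",
        "cognitive.feedback_loop_engagement": "closed_loop",
    },
}


def mismatches(cls, observed):
    """Return {primitive: (expected, actual)} for values that differ or are absent."""
    return {
        primitive: (want, observed.get(primitive))
        for primitive, want in EXPECTED[cls].items()
        if observed.get(primitive) != want
    }


# A CLAUDE-FF session whose feedback-loop value came out wrong.
observed = {
    "cognitive.inter_command_latency_class": "llm_heavyweight",
    "cognitive.feedback_loop_engagement": "closed_loop",
}
print(mismatches("CLAUDE-FF", observed))
# {'cognitive.feedback_loop_engagement': ('fire_and_forget', 'closed_loop')}
```

An empty result means every expected primitive for that class was present with the expected value; extra observed primitives are ignored.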

### Verify

For each class, after disconnecting:

1. **DB row landing** — within ~30s
   (the profiler tick interval), `observations` carries one row per
   primitive for the new attacker:
   ```sh
   $ sqlite3 /var/lib/decnet/decnet.db \
       "SELECT primitive, value, confidence FROM observations \
        WHERE evidence_ref LIKE 'shard:smoke-decky/%' ORDER BY ts DESC LIMIT 40;"
   ```
2. **Bus events** — tail the bus worker log; you should see one
   `attacker.observation.<primitive>` per emitted row, plus the
   originating `attacker.session.ended`.
3. **AttackerDetail panel** — open
   `/attackers/<uuid>` in the browser. The Behavioural primitives
   section should hydrate from the REST snapshot and live-update
   each time you replay the session
   (the SSE route forwards the new emissions in real time).
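
The DB check in step 1 can also be run from Python with the standard `sqlite3` module. The sketch below uses an in-memory database with a made-up table definition that contains only the columns named in the query above; the real schema, and the part of `evidence_ref` after the decky name, are assumptions.

```python
import sqlite3


def recent_observations(conn, decky, limit=40):
    """Run the verification query for one decky, newest rows first."""
    return conn.execute(
        "SELECT primitive, value, confidence FROM observations "
        "WHERE evidence_ref LIKE ? ORDER BY ts DESC LIMIT ?",
        (f"shard:{decky}/%", limit),
    ).fetchall()


conn = sqlite3.connect(":memory:")
# Illustrative table: only the columns referenced by the query are modelled.
conn.execute(
    "CREATE TABLE observations "
    "(primitive TEXT, value TEXT, confidence REAL, evidence_ref TEXT, ts INTEGER)"
)
conn.executemany(
    "INSERT INTO observations VALUES (?, ?, ?, ?, ?)",
    [
        ("motor.input_modality", "typed", 0.9, "shard:smoke-decky/ssh/a", 1),
        ("cognitive.feedback_loop_engagement", "closed_loop", 0.8,
         "shard:smoke-decky/ssh/a", 2),
        ("motor.input_modality", "pasted", 0.7, "shard:other-decky/ssh/b", 3),
    ],
)

rows = recent_observations(conn, "smoke-decky")
distinct = {primitive for primitive, _value, _confidence in rows}
print(len(rows), len(distinct))  # 2 2
conn.close()
```

Pointing `sqlite3.connect` at `/var/lib/decnet/decnet.db` instead of `:memory:` would run the same query against the live database.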

### Pass criteria

* All 5 classes produce ≥ 27 distinct primitives in
  `observations` (the per-shard hard gate from
  `tests/profiler/behave_shell/test_calibration_grid.py`).
* The four day-one priority primitives appear in the panel and carry
  the expected values per class (table above).
* No collector / profiler / web errors in the journal during the
  round-trip.
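
The first criterion reduces to a distinct count per class. A minimal sketch, assuming the primitive names for each class have already been pulled from `observations`; `classes_below_gate` and the synthetic `primitive_N` names are illustrative, and only the threshold of 27 comes from the criteria above.

```python
MIN_DISTINCT = 27  # per-shard gate quoted in the pass criteria above


def classes_below_gate(primitives_by_class, minimum=MIN_DISTINCT):
    """Return {class: distinct_count} for every class under the gate."""
    counts = {cls: len(set(prims)) for cls, prims in primitives_by_class.items()}
    return {cls: count for cls, count in counts.items() if count < minimum}


observed = {
    # Synthetic names; real primitive names come from the observations table.
    "HUMAN": [f"primitive_{i}" for i in range(34)],
    "LW-sim": [f"primitive_{i}" for i in range(20)] * 2,  # duplicates do not count
}
failing = classes_below_gate(observed)
print(failing)  # {'LW-sim': 20}
```

An empty dict means every class cleared the gate; repeated rows for the same primitive do not inflate the count.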

If any class regresses, roll back the last commit and run the offline
replay (`smoke.sh`) to localise the fault; it exercises the same
handler without transport noise.