DECNET/scripts/behave_shell/README.md
anti 69c8cfd2b9 test(profiler/behave_shell): Phase 6 smoke harness + live-decky runbook
Two-half deliverable per BEHAVE-INTEGRATION.md §587-594:

* scripts/behave_shell/replay_calibration.py — Python helper that
  drives the production handler against one asciinema shard, mints
  a temp SQLite repo + an Attacker per session, captures bus
  emissions in-process. Exits non-zero on zero-observation sessions.

* scripts/behave_shell/smoke.sh — bash entry that replays all five
  2026-05-02 calibration shards (HUMAN / YOU-sim / LW-sim /
  CLAUDE-FF / CLAUDE-CL). Auto-activates .311 venv, forces
  DECNET_DB_TYPE=sqlite, prints per-class summary. Suitable for CI.

* scripts/behave_shell/README.md — runbook covering both halves.
  Pins the manual live-decky procedure (one SSH session per class
  against a deployed smoke-decky, expected dominant primitives table,
  SQL verification query, AttackerDetail panel check, pass criteria).

* BEHAVE-INTEGRATION.md — Phase 6 completion log appended with
  current corpus results table (15 sessions, 424 observations across
  the five classes) and a note that the v0 tag (drop -pre) is gated
  on the manual live-decky round-trip and lands as a separate
  commit.

Live-decky run is intentionally NOT scripted — the integration doc
calls for manual SSH sessions per class so an operator confirms the
bus / collector / disk-reach plumbing under real PTY conditions.
2026-05-08 21:42:11 -04:00


# BEHAVE-SHELL — Phase 6 smoke
Two halves:
1. **Offline replay**: `smoke.sh` replays the five 2026-05-02
calibration shards through the production handler. Exercises the
engine + storage layer end-to-end without a live PTY. Suitable for
CI.
2. **Live decky round-trip**: manual procedure below. Confirms the
bus / collector / disk-reach plumbing on a real session.
## 1. Offline replay
```sh
$ scripts/behave_shell/smoke.sh # auto-discovers ../BEHAVE/prototype_extractors/shell
$ scripts/behave_shell/smoke.sh /path/to/calibration/dir # explicit dir
```
Expected output (15 sessions across 5 classes, 424 total observations
on the current corpus):
```
[HUMAN] sessions=1 observations=34 distinct_primitives=34
[YOU-sim] sessions=2 observations=59 distinct_primitives=34
[LW-sim] sessions=5 observations=136 distinct_primitives=34
[CLAUDE-FF] sessions=3 observations=84 distinct_primitives=34
[CLAUDE-CL] sessions=4 observations=111 distinct_primitives=34
smoke: OK — all classes emit observations end-to-end
```
Exit codes: `0` full pass, `1` any class regressed, `2` argument /
IO error.
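
When wiring the replay into CI it can help to assert on the summary lines as well as the exit code. The following is a minimal sketch that assumes the line shape shown in the expected output above; `parse_summary` and `totals` are hypothetical helper names, not part of the repository.

```python
import re

# Line shape copied from the expected smoke.sh output; anything else is skipped.
SUMMARY_RE = re.compile(
    r"^\[(?P<cls>[^\]]+)\]\s+sessions=(?P<sessions>\d+)\s+"
    r"observations=(?P<observations>\d+)\s+"
    r"distinct_primitives=(?P<distinct>\d+)\s*$"
)


def parse_summary(text):
    """Return {class_name: (sessions, observations, distinct_primitives)}."""
    rows = {}
    for line in text.splitlines():
        m = SUMMARY_RE.match(line.strip())
        if m:
            rows[m["cls"]] = (
                int(m["sessions"]),
                int(m["observations"]),
                int(m["distinct"]),
            )
    return rows


def totals(rows):
    """Return (total_sessions, total_observations) across all classes."""
    return (
        sum(r[0] for r in rows.values()),
        sum(r[1] for r in rows.values()),
    )


# Per-class lines from the expected output on the current corpus.
EXPECTED_OUTPUT = """\
[HUMAN] sessions=1 observations=34 distinct_primitives=34
[YOU-sim] sessions=2 observations=59 distinct_primitives=34
[LW-sim] sessions=5 observations=136 distinct_primitives=34
[CLAUDE-FF] sessions=3 observations=84 distinct_primitives=34
[CLAUDE-CL] sessions=4 observations=111 distinct_primitives=34
"""
```

On the current corpus, `totals(parse_summary(EXPECTED_OUTPUT))` should agree with the 15 sessions and 424 observations quoted above; a mismatch after a corpus change means the README figures need refreshing.
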
The replay drives `decnet.profiler.behave_shell._handler.handle_session_ended`
directly against a temp SQLite DB seeded with one Attacker per
session. Bus emission is captured by an in-process publisher; no
real bus is required.
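
The in-process capture idea can be illustrated roughly as below. This is a sketch under stated assumptions: `CapturingPublisher`, its `publish(topic, payload)` method, and `emit_observations` are hypothetical names chosen for illustration and do not reflect the real handler or bus interface. Only the topic pattern `attacker.observation.<primitive>` is taken from this runbook.

```python
from dataclasses import dataclass, field


@dataclass
class CapturingPublisher:
    """Hypothetical stand-in for a bus publisher: records events in memory."""

    events: list = field(default_factory=list)

    def publish(self, topic, payload):
        # Nothing leaves the process; the event is only appended to a list.
        self.events.append((topic, payload))

    def topics(self, prefix=""):
        """Return captured topics, optionally filtered by prefix."""
        return [t for t, _ in self.events if t.startswith(prefix)]


def emit_observations(publisher, observations):
    """Hypothetical emitter: one attacker.observation.<primitive> per row."""
    for obs in observations:
        publisher.publish(f"attacker.observation.{obs['primitive']}", obs)


def session_is_empty(publisher):
    """True when the session produced no observation events at all."""
    return not publisher.topics("attacker.observation.")
```

A replay helper built this way can exit non-zero whenever `session_is_empty` is true for a session, which matches the zero-observation failure mode the replay is meant to catch.
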
## 2. Live decky round-trip (manual)
End-to-end confirmation. Run **once** before tagging v0 and **after**
any change to the bus / collector / disk-reach layer.
### Setup
1. Init a fresh DECNET host (see `decnet init`).
2. `decnet bus` worker is up (systemd unit
`decnet-bus.service` or `scripts/bus/smoke.sh`).
3. `decnet-profiler.service` is up — it owns the
`attacker.session.ended` subscription and the BEHAVE-SHELL handler.
4. `decnet-collector.service` is up — it publishes
`attacker.session.ended` from `session_recorded` log events.
5. Web API is up; you have a viewer JWT in your browser localStorage.
6. Deploy a single `ssh` decky:
```sh
$ decnet decky deploy --service ssh --decky smoke-decky
```
The decky's sessrec wrapper appends to
`/var/lib/decnet/artifacts/smoke-decky/ssh/transcripts/sessions-<UTC-DAY>.jsonl`.
### Run one session per calibration class
For each class, SSH into the decky and reproduce the canonical
workload. Log out via the documented exit path so the
`session_recorded` event fires. The collector aggregates the session
and publishes `attacker.session.ended`; the profiler worker
disk-reaches the shard, runs `extract_session()`, persists rows, and
publishes one `attacker.observation.<primitive>` per emission.
| Class | Workload sketch | Expected dominant primitives |
|---|---|---|
| HUMAN | Type each command live; correct typos; pause to read output. | `motor.input_modality=typed`, `cognitive.feedback_loop_engagement=closed_loop` |
| YOU-sim | Paste short pre-canned commands at typing speed; minimal repeats. | `motor.input_modality=pasted`, `motor.paste_burst_rate=occasional`, `cognitive.command_branch_diversity=linear_playbook` |
| LW-sim | Paste a recon sweep generated by a small LLM; ~2-8s between pastes. | `cognitive.inter_command_latency_class=llm_lightweight` |
| CLAUDE-FF | Paste outputs from a fire-and-forget reasoning agent; ~8-30s gaps. | `cognitive.inter_command_latency_class=llm_heavyweight`, `cognitive.feedback_loop_engagement=fire_and_forget` |
| CLAUDE-CL | Drive a closed-loop plan-execute-observe agent; >30s pauses on long output. | `cognitive.inter_command_latency_class=long`, `cognitive.feedback_loop_engagement=closed_loop` |
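
The table can be turned into a quick check once the observed primitives for a session are in hand. The sketch below transcribes the expected values from the table; `mismatches` and the `{primitive: value}` input shape are assumptions made for illustration.

```python
# Expected dominant primitives per class, transcribed from the table above.
EXPECTED = {
    "HUMAN": {
        "motor.input_modality": "typed",
        "cognitive.feedback_loop_engagement": "closed_loop",
    },
    "YOU-sim": {
        "motor.input_modality": "pasted",
        "motor.paste_burst_rate": "occasional",
        "cognitive.command_branch_diversity": "linear_playbook",
    },
    "LW-sim": {
        "cognitive.inter_command_latency_class": "llm_lightweight",
    },
    "CLAUDE-FF": {
        "cognitive.inter_command_latency_class": "llm_heavyweight",
        "cognitive.feedback_loop_engagement": "fire_and_forget",
    },
    "CLAUDE-CL": {
        "cognitive.inter_command_latency_class": "long",
        "cognitive.feedback_loop_engagement": "closed_loop",
    },
}


def mismatches(cls, observed):
    """Compare observed {primitive: value} against the table for one class.

    Returns {primitive: (expected, got)} for every expected primitive that is
    missing or carries a different value; an empty dict means the class passed.
    """
    return {
        primitive: (want, observed.get(primitive))
        for primitive, want in EXPECTED[cls].items()
        if observed.get(primitive) != want
    }
```

Primitives outside the table are ignored on purpose: the table lists only the dominant ones, so extra rows are not treated as failures.
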
### Verify
For each class, after disconnecting:
1. **DB row landing** — within ~30s
(the profiler tick interval), `observations` carries one row per
primitive for the new attacker:
```sh
$ sqlite3 /var/lib/decnet/decnet.db \
"SELECT primitive, value, confidence FROM observations \
WHERE evidence_ref LIKE 'shard:smoke-decky/%' ORDER BY ts DESC LIMIT 40;"
```
2. **Bus events** — tail the bus worker log; you should see one
`attacker.observation.<primitive>` per emitted row, plus the
originating `attacker.session.ended`.
3. **AttackerDetail panel** — open
`/attackers/<uuid>` in the browser. The Behavioural primitives
section should hydrate from the REST snapshot and live-update
each time you replay the session
(the SSE route forwards the new emissions in real time).
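
The DB check in step 1 can also be run from Python with the standard `sqlite3` module. This is a sketch only: the in-memory table in `demo_db` assumes just the columns the query above touches (`primitive`, `value`, `confidence`, `evidence_ref`, `ts`), the real schema may differ, and the function names are hypothetical.

```python
import sqlite3


def latest_rows(conn, decky, limit=40):
    """Same query as the sqlite3 CLI call above, parameterised by decky name."""
    return conn.execute(
        "SELECT primitive, value, confidence FROM observations "
        "WHERE evidence_ref LIKE ? ORDER BY ts DESC LIMIT ?",
        (f"shard:{decky}/%", limit),
    ).fetchall()


def distinct_primitives(conn, decky):
    """Count distinct primitives whose evidence points at the decky's shards."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT primitive) FROM observations "
        "WHERE evidence_ref LIKE ?",
        (f"shard:{decky}/%",),
    ).fetchone()
    return row[0]


def demo_db():
    """In-memory table with an ASSUMED schema and made-up rows, for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE observations ("
        "primitive TEXT, value TEXT, confidence REAL, evidence_ref TEXT, ts REAL)"
    )
    conn.executemany(
        "INSERT INTO observations VALUES (?, ?, ?, ?, ?)",
        [
            ("motor.input_modality", "typed", 0.9, "shard:smoke-decky/a", 1.0),
            ("cognitive.feedback_loop_engagement", "closed_loop", 0.8,
             "shard:smoke-decky/a", 2.0),
            ("motor.input_modality", "typed", 0.9, "shard:smoke-decky/b", 3.0),
            ("motor.input_modality", "pasted", 0.7, "shard:other-decky/a", 4.0),
        ],
    )
    return conn
```

Against a live host, pass `sqlite3.connect("/var/lib/decnet/decnet.db")` in place of `demo_db()`.
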
### Pass criteria
* All 5 classes produce ≥ 27 distinct primitives in
`observations` (the per-shard hard gate from
`tests/profiler/behave_shell/test_calibration_grid.py`).
* The four day-one priority primitives appear in the panel and carry
the expected values per class (table above).
* No collector / profiler / web errors in the journal during the
round-trip.
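
The first criterion reduces to a small gate over per-class distinct-primitive counts. A minimal sketch, assuming the counts have already been collected into a dict; `failing_classes` is a hypothetical helper name.

```python
MIN_DISTINCT = 27  # per-shard hard gate quoted in the pass criteria
CLASSES = ("HUMAN", "YOU-sim", "LW-sim", "CLAUDE-FF", "CLAUDE-CL")


def failing_classes(distinct_by_class, required=CLASSES):
    """Return the sorted list of classes below the gate; missing classes fail."""
    return sorted(
        cls for cls in required
        if distinct_by_class.get(cls, 0) < MIN_DISTINCT
    )
```

An empty list means the first criterion holds; the remaining criteria (panel values and a clean journal) still need a manual look.
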
If any class regresses: roll back the last commit and run the offline
replay (`smoke.sh`) to localise the fault. It drives the same handler
with no transport noise.