Two-half deliverable per BEHAVE-INTEGRATION.md §587-594:

* scripts/behave_shell/replay_calibration.py: Python helper that drives the production handler against one asciinema shard, mints a temp SQLite repo + an Attacker per session, captures bus emissions in-process. Exits non-zero on zero-observation sessions.
* scripts/behave_shell/smoke.sh: bash entry that replays all five 2026-05-02 calibration shards (HUMAN / YOU-sim / LW-sim / CLAUDE-FF / CLAUDE-CL). Auto-activates .311 venv, forces DECNET_DB_TYPE=sqlite, prints per-class summary. Suitable for CI.
* scripts/behave_shell/README.md: runbook covering both halves. Pins the manual live-decky procedure (one SSH session per class against a deployed smoke-decky, expected dominant primitives table, SQL verification query, AttackerDetail panel check, pass criteria).
* BEHAVE-INTEGRATION.md: Phase 6 completion log appended with current corpus results table (15 sessions, 424 observations across the five classes) and a note that the v0 tag (drop -pre) is gated on the manual live-decky round-trip and lands as a separate commit.

Live-decky run is intentionally NOT scripted: the integration doc calls for manual SSH sessions per class so an operator confirms the bus / collector / disk-reach plumbing under real PTY conditions.

# BEHAVE-SHELL — Phase 6 smoke

Two halves:

1. **Offline replay**: `smoke.sh` replays the five 2026-05-02
   calibration shards through the production handler. Exercises the
   engine + storage layer end-to-end without a live PTY. Suitable for
   CI.
2. **Live decky round-trip**: manual procedure below. Confirms the
   bus / collector / disk-reach plumbing on a real session.

## 1. Offline replay

```sh
$ scripts/behave_shell/smoke.sh                            # auto-discovers ../BEHAVE/prototype_extractors/shell
$ scripts/behave_shell/smoke.sh /path/to/calibration/dir   # explicit dir
```

Expected output (15 sessions across 5 classes, 424 total observations
on the current corpus):

```
[HUMAN] sessions=1 observations=34 distinct_primitives=34
[YOU-sim] sessions=2 observations=59 distinct_primitives=34
[LW-sim] sessions=5 observations=136 distinct_primitives=34
[CLAUDE-FF] sessions=3 observations=84 distinct_primitives=34
[CLAUDE-CL] sessions=4 observations=111 distinct_primitives=34
smoke: OK — all classes emit observations end-to-end
```
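
A CI wrapper may want to check the summary totals as well as the exit code. The following is a minimal sketch, assuming the per-class line format shown above is stable; `ClassSummary` and `parse_summary` are hypothetical names and are not part of `smoke.sh`.

```python
import re
from typing import NamedTuple

# Matches lines such as "[HUMAN] sessions=1 observations=34 distinct_primitives=34".
# ASSUMPTION: the format is copied from the sample output and may change.
_LINE = re.compile(
    r"^\[(?P<cls>[^\]]+)\]\s+sessions=(?P<sessions>\d+)\s+"
    r"observations=(?P<observations>\d+)\s+distinct_primitives=(?P<distinct>\d+)$"
)


class ClassSummary(NamedTuple):
    cls: str
    sessions: int
    observations: int
    distinct_primitives: int


def parse_summary(text: str) -> list[ClassSummary]:
    """Return one ClassSummary per per-class line; any other line is ignored."""
    rows = []
    for line in text.splitlines():
        m = _LINE.match(line.strip())
        if m:
            rows.append(
                ClassSummary(
                    m["cls"],
                    int(m["sessions"]),
                    int(m["observations"]),
                    int(m["distinct"]),
                )
            )
    return rows


# The five per-class lines from the expected output above.
SAMPLE = """\
[HUMAN] sessions=1 observations=34 distinct_primitives=34
[YOU-sim] sessions=2 observations=59 distinct_primitives=34
[LW-sim] sessions=5 observations=136 distinct_primitives=34
[CLAUDE-FF] sessions=3 observations=84 distinct_primitives=34
[CLAUDE-CL] sessions=4 observations=111 distinct_primitives=34
"""

rows = parse_summary(SAMPLE)
total_sessions = sum(r.sessions for r in rows)
total_observations = sum(r.observations for r in rows)
print(total_sessions, total_observations)  # 15 424
```

Parsing the text is a secondary check; the exit code remains the primary signal.
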

Exit codes: `0` full pass, `1` any class regressed, `2` argument /
IO error.

The replay drives `decnet.profiler.behave_shell._handler.handle_session_ended`
directly against a temp SQLite DB seeded with one Attacker per
session. Bus emission is captured by an in-process publisher; no
real bus is required.

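
The in-process capture described above can be pictured as a publisher that appends events to a list instead of talking to a broker. This is a sketch only: `CapturingPublisher`, its `publish(topic, payload)` signature, and `replay_exit_code` are hypothetical and do not reflect the real `decnet` interfaces. The topic prefix `attacker.observation.` is taken from this runbook, and appending the primitive name verbatim is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any

OBSERVATION_PREFIX = "attacker.observation."  # topic prefix used in this runbook


@dataclass
class CapturingPublisher:
    """Hypothetical stand-in for a bus publisher: records events in memory."""

    events: list[tuple[str, dict[str, Any]]] = field(default_factory=list)

    def publish(self, topic: str, payload: dict[str, Any]) -> None:
        # No network I/O: the event is only appended to the in-memory list.
        self.events.append((topic, payload))

    def observation_topics(self) -> list[str]:
        """Return the topics of captured observation events, in publish order."""
        return [t for t, _ in self.events if t.startswith(OBSERVATION_PREFIX)]


def replay_exit_code(publisher: CapturingPublisher) -> int:
    """Return 1 when a replayed session produced no observations, else 0.

    ASSUMPTION: the real helper only promises a non-zero exit; 1 is illustrative.
    """
    return 0 if publisher.observation_topics() else 1


# Usage: pretend a handler emitted two observations for one session.
pub = CapturingPublisher()
pub.publish("attacker.observation.motor.input_modality", {"value": "typed"})
pub.publish(
    "attacker.observation.cognitive.feedback_loop_engagement",
    {"value": "closed_loop"},
)
```

Keeping the capture in memory lets a test assert on emissions without starting a bus worker.
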
## 2. Live decky round-trip (manual)

End-to-end confirmation. Run **once** before tagging v0 and **after**
any change to the bus / collector / disk-reach layer.

### Setup

1. Initialise a fresh DECNET host (see `decnet init`).
2. Confirm the `decnet bus` worker is up (systemd unit
   `decnet-bus.service` or `scripts/bus/smoke.sh`).
3. Confirm `decnet-profiler.service` is up; it owns the
   `attacker.session.ended` subscription and the BEHAVE-SHELL handler.
4. Confirm `decnet-collector.service` is up; it publishes
   `attacker.session.ended` from `session_recorded` log events.
5. Confirm the web API is up and that your browser's localStorage
   holds a viewer JWT.
6. Deploy a single `ssh` decky:
   ```sh
   $ decnet decky deploy --service ssh --decky smoke-decky
   ```
   The decky's sessrec wrapper appends to
   `/var/lib/decnet/artifacts/smoke-decky/ssh/transcripts/sessions-<UTC-DAY>.jsonl`.

### Run one session per calibration class

For each class, SSH into the decky and reproduce the canonical
workload. Log out via the documented exit path so the
`session_recorded` event fires. The collector aggregates the session
and publishes `attacker.session.ended`; the profiler worker
disk-reaches the shard, runs `extract_session()`, persists rows, and
publishes one `attacker.observation.<primitive>` per emission.

| Class | Workload sketch | Expected dominant primitives |
|---|---|---|
| HUMAN | Type each command live; correct typos; pause to read output. | `motor.input_modality=typed`, `cognitive.feedback_loop_engagement=closed_loop` |
| YOU-sim | Paste short pre-canned commands at typing speed; minimal repeats. | `motor.input_modality=pasted`, `motor.paste_burst_rate=occasional`, `cognitive.command_branch_diversity=linear_playbook` |
| LW-sim | Paste a recon sweep generated by a small LLM; ~2-8s between pastes. | `cognitive.inter_command_latency_class=llm_lightweight` |
| CLAUDE-FF | Paste outputs from a fire-and-forget reasoning agent; ~8-30s gaps. | `cognitive.inter_command_latency_class=llm_heavyweight`, `cognitive.feedback_loop_engagement=fire_and_forget` |
| CLAUDE-CL | Drive a closed-loop plan-execute-observe agent; >30s pauses on long output. | `cognitive.inter_command_latency_class=long`, `cognitive.feedback_loop_engagement=closed_loop` |

### Verify

For each class, after disconnecting:

1. **DB row landing**: within ~30s (the profiler tick interval),
   `observations` carries one row per primitive for the new attacker:
   ```sh
   $ sqlite3 /var/lib/decnet/decnet.db \
       "SELECT primitive, value, confidence FROM observations \
        WHERE evidence_ref LIKE 'shard:smoke-decky/%' ORDER BY ts DESC LIMIT 40;"
   ```
2. **Bus events**: tail the bus worker log; you should see one
   `attacker.observation.<primitive>` per emitted row, plus the
   originating `attacker.session.ended`.
3. **AttackerDetail panel**: open `/attackers/<uuid>` in the browser.
   The Behavioural primitives section should hydrate from the REST
   snapshot and live-update each time you replay the session
   (the SSE route forwards the new emissions in real time).

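
The DB check in step 1 can also be run from Python, which makes it easy to compare the distinct-primitive count against the 27-primitive gate listed under Pass criteria. This is a sketch under stated assumptions: the demo schema contains only the columns named in the query above, the `evidence_ref` suffixes and timestamps are made up, and `distinct_primitives` is a hypothetical helper.

```python
import sqlite3

MIN_DISTINCT = 27  # per-shard hard gate quoted under "Pass criteria"


def distinct_primitives(conn: sqlite3.Connection, decky: str) -> set[str]:
    """Return the distinct primitive names whose evidence_ref points at `decky`."""
    # Same LIKE filter as the sqlite3 one-liner, passed as a bound parameter.
    # Note: "_" and "%" inside a decky name would act as LIKE wildcards.
    pattern = f"shard:{decky}/%"
    cur = conn.execute(
        "SELECT DISTINCT primitive FROM observations WHERE evidence_ref LIKE ?",
        (pattern,),
    )
    return {row[0] for row in cur}


# Demo on a throwaway in-memory DB. ASSUMPTION: this schema is illustrative
# and is not the real decnet.db schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observations "
    "(primitive TEXT, value TEXT, confidence REAL, evidence_ref TEXT, ts TEXT)"
)
conn.executemany(
    "INSERT INTO observations VALUES (?, ?, ?, ?, ?)",
    [
        ("motor.input_modality", "typed", 0.9, "shard:smoke-decky/example-1", "t1"),
        ("motor.input_modality", "typed", 0.8, "shard:smoke-decky/example-2", "t2"),
        ("cognitive.feedback_loop_engagement", "closed_loop", 0.7,
         "shard:smoke-decky/example-1", "t3"),
        ("motor.input_modality", "pasted", 0.9, "shard:other-decky/example-1", "t4"),
    ],
)

found = distinct_primitives(conn, "smoke-decky")
gate_ok = len(found) >= MIN_DISTINCT  # False here: the demo holds only two primitives
```

Against a real database, pass the path used in the `sqlite3` command to `sqlite3.connect` instead of `":memory:"`.
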
### Pass criteria

* All 5 classes produce ≥ 27 distinct primitives in
  `observations` (the per-shard hard gate from
  `tests/profiler/behave_shell/test_calibration_grid.py`).
* The four day-one priority primitives appear in the panel and carry
  the expected values per class (table above).
* No collector / profiler / web errors in the journal during the
  round-trip.

If any class regresses, roll back the last commit and run the offline
replay (`smoke.sh`) to localise the fault: it exercises the same
handler without transport noise.