anti 69c8cfd2b9 test(profiler/behave_shell): Phase 6 smoke harness + live-decky runbook
Two-half deliverable per BEHAVE-INTEGRATION.md §587-594:

* scripts/behave_shell/replay_calibration.py — Python helper that
  drives the production handler against one asciinema shard, mints
  a temp SQLite repo + an Attacker per session, captures bus
  emissions in-process. Exits non-zero on zero-observation sessions.

* scripts/behave_shell/smoke.sh — bash entry that replays all five
  2026-05-02 calibration shards (HUMAN / YOU-sim / LW-sim /
  CLAUDE-FF / CLAUDE-CL). Auto-activates .311 venv, forces
  DECNET_DB_TYPE=sqlite, prints per-class summary. Suitable for CI.

* scripts/behave_shell/README.md — runbook covering both halves.
  Pins the manual live-decky procedure (one SSH session per class
  against a deployed smoke-decky, expected dominant primitives table,
  SQL verification query, AttackerDetail panel check, pass criteria).

* BEHAVE-INTEGRATION.md — Phase 6 completion log appended with
  current corpus results table (15 sessions, 424 observations across
  the five classes) and a note that the v0 tag (drop -pre) is gated
  on the manual live-decky round-trip and lands as a separate
  commit.

Live-decky run is intentionally NOT scripted — the integration doc
calls for manual SSH sessions per class so an operator confirms the
bus / collector / disk-reach plumbing under real PTY conditions.
2026-05-08 21:42:11 -04:00


BEHAVE-SHELL — Phase 6 smoke

Two halves:

  1. Offline replay: smoke.sh replays the five 2026-05-02 calibration shards through the production handler. Exercises the engine + storage layer end-to-end without a live PTY. Suitable for CI.
  2. Live decky round-trip: manual procedure below. Confirms the bus / collector / disk-reach plumbing on a real session.

1. Offline replay

$ scripts/behave_shell/smoke.sh                             # auto-discovers ../BEHAVE/prototype_extractors/shell
$ scripts/behave_shell/smoke.sh /path/to/calibration/dir    # explicit dir

Expected output (15 sessions across 5 classes, 424 total observations on the current corpus):

[HUMAN]      sessions=1 observations=34 distinct_primitives=34
[YOU-sim]    sessions=2 observations=59 distinct_primitives=34
[LW-sim]     sessions=5 observations=136 distinct_primitives=34
[CLAUDE-FF]  sessions=3 observations=84 distinct_primitives=34
[CLAUDE-CL]  sessions=4 observations=111 distinct_primitives=34
smoke: OK — all classes emit observations end-to-end

Exit codes: 0 full pass, 1 any class regressed, 2 argument / IO error.
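
A CI job may want to check the summary lines as well as the exit code. The following is a minimal sketch, assuming the output keeps the exact `[CLASS] sessions=N observations=N distinct_primitives=N` shape shown above; `parse_summary` and `totals` are hypothetical helper names, not part of smoke.sh.

```python
import re

# Matches lines such as "[LW-sim]     sessions=5 observations=136 distinct_primitives=34".
# The pattern is an assumption derived from the sample output, not from smoke.sh itself.
SUMMARY_LINE = re.compile(
    r"^\[(?P<cls>[^\]]+)\]\s+sessions=(?P<sessions>\d+)\s+"
    r"observations=(?P<observations>\d+)\s+distinct_primitives=(?P<distinct>\d+)$"
)


def parse_summary(text):
    """Return {class_name: {"sessions": int, "observations": int, "distinct": int}}."""
    summary = {}
    for line in text.splitlines():
        match = SUMMARY_LINE.match(line.strip())
        if match:
            summary[match["cls"]] = {
                key: int(match[key]) for key in ("sessions", "observations", "distinct")
            }
    return summary


def totals(summary):
    """Return (total_sessions, total_observations) across all classes."""
    return (
        sum(row["sessions"] for row in summary.values()),
        sum(row["observations"] for row in summary.values()),
    )
```

Fed the sample output above, `totals(parse_summary(text))` should give the 15 sessions and 424 observations quoted for the current corpus.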

The replay drives decnet.profiler.behave_shell._handler.handle_session_ended directly against a temp SQLite DB seeded with one Attacker per session. Bus emission is captured by an in-process publisher; no real bus is required.
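
The replay pattern described above (temp SQLite DB, one seeded Attacker, in-process capture of bus emissions) can be sketched roughly as follows. Everything here is illustrative: `CapturingPublisher`, `fake_handler`, and the two-table schema are hypothetical stand-ins, and the real signature of `handle_session_ended` is not shown in this runbook.

```python
import sqlite3
import tempfile
from pathlib import Path


class CapturingPublisher:
    """Hypothetical in-process publisher: records events instead of sending them."""

    def __init__(self):
        self.events = []

    def publish(self, topic, payload):
        self.events.append((topic, payload))


def fake_handler(conn, publisher, attacker_id, observations):
    """Placeholder for the production handler: persist rows, publish one event per row."""
    for primitive, value in observations:
        conn.execute(
            "INSERT INTO observations (attacker_id, primitive, value) VALUES (?, ?, ?)",
            (attacker_id, primitive, value),
        )
        publisher.publish(
            f"attacker.observation.{primitive}",
            {"attacker_id": attacker_id, "value": value},
        )
    conn.commit()


def replay_one(observations, attacker_id="attacker-1"):
    """Run the placeholder handler against a throwaway DB; return (row_count, events)."""
    publisher = CapturingPublisher()
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(Path(tmp) / "replay.db")
        try:
            # Assumed minimal schema, just enough for the sketch.
            conn.execute("CREATE TABLE attackers (id TEXT PRIMARY KEY)")
            conn.execute(
                "CREATE TABLE observations (attacker_id TEXT, primitive TEXT, value TEXT)"
            )
            conn.execute("INSERT INTO attackers VALUES (?)", (attacker_id,))
            fake_handler(conn, publisher, attacker_id, observations)
            rows = conn.execute("SELECT COUNT(*) FROM observations").fetchone()[0]
        finally:
            conn.close()
    return rows, publisher.events
```

A session that yields zero rows or zero captured events would correspond to the non-zero exit the helper uses for zero-observation sessions.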

2. Live decky round-trip (manual)

End-to-end confirmation. Run once before tagging v0 and after any change to the bus / collector / disk-reach layer.

Setup

  1. Init a fresh DECNET host (see decnet init).
  2. decnet bus worker is up (systemd unit decnet-bus.service or scripts/bus/smoke.sh).
  3. decnet-profiler.service is up — it owns the attacker.session.ended subscription and the BEHAVE-SHELL handler.
  4. decnet-collector.service is up — it publishes attacker.session.ended from session_recorded log events.
  5. Web API is up; you have a viewer JWT in your browser localStorage.
  6. Deploy a single ssh decky:
    $ decnet decky deploy --service ssh --decky smoke-decky
    
    The decky's sessrec wrapper appends to /var/lib/decnet/artifacts/smoke-decky/ssh/transcripts/sessions-<UTC-DAY>.jsonl.
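
Before running any sessions it can help to confirm that the transcript shard is actually being appended to. A minimal sketch, assuming one record per non-blank line (as the .jsonl extension suggests); `latest_shard` and `count_records` are hypothetical helpers, and the glob avoids guessing the exact `<UTC-DAY>` format.

```python
from pathlib import Path

# Directory taken from the setup step above.
TRANSCRIPT_DIR = Path("/var/lib/decnet/artifacts/smoke-decky/ssh/transcripts")


def latest_shard(directory=TRANSCRIPT_DIR):
    """Return the most recently modified sessions-*.jsonl file, or None if absent."""
    shards = sorted(
        Path(directory).glob("sessions-*.jsonl"),
        key=lambda path: path.stat().st_mtime,
    )
    return shards[-1] if shards else None


def count_records(shard):
    """Count non-blank lines, assuming one record per line."""
    with open(shard, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())
```

Calling `count_records(latest_shard())` before and after an SSH session should show the count growing if the sessrec wrapper is writing.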

Run one session per calibration class

For each class, SSH into the decky and reproduce the canonical workload. Log out via the documented exit path so the session_recorded event fires. The collector aggregates the session and publishes attacker.session.ended; the profiler worker then disk-reaches the shard, runs extract_session(), persists the rows, and publishes one attacker.observation.<primitive> per emission.

| Class | Workload sketch | Expected dominant primitives |
|---|---|---|
| HUMAN | Type each command live; correct typos; pause to read output. | motor.input_modality=typed, cognitive.feedback_loop_engagement=closed_loop |
| YOU-sim | Paste short pre-canned commands at typing speed; minimal repeats. | motor.input_modality=pasted, motor.paste_burst_rate=occasional, cognitive.command_branch_diversity=linear_playbook |
| LW-sim | Paste a recon sweep generated by a small LLM; ~2-8s between pastes. | cognitive.inter_command_latency_class=llm_lightweight |
| CLAUDE-FF | Paste outputs from a fire-and-forget reasoning agent; ~8-30s gaps. | cognitive.inter_command_latency_class=llm_heavyweight, cognitive.feedback_loop_engagement=fire_and_forget |
| CLAUDE-CL | Drive a closed-loop plan-execute-observe agent; >30s pauses on long output. | cognitive.inter_command_latency_class=long, cognitive.feedback_loop_engagement=closed_loop |
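
The expectations in the table above can be written down as data so that a session's observed primitives are compared mechanically. This is a sketch only: `EXPECTED` transcribes the table, while `mismatches` and the `{primitive: value}` input shape are assumptions made for illustration.

```python
# Expected dominant primitives per calibration class, transcribed from the table.
EXPECTED = {
    "HUMAN": {
        "motor.input_modality": "typed",
        "cognitive.feedback_loop_engagement": "closed_loop",
    },
    "YOU-sim": {
        "motor.input_modality": "pasted",
        "motor.paste_burst_rate": "occasional",
        "cognitive.command_branch_diversity": "linear_playbook",
    },
    "LW-sim": {
        "cognitive.inter_command_latency_class": "llm_lightweight",
    },
    "CLAUDE-FF": {
        "cognitive.inter_command_latency_class": "llm_heavyweight",
        "cognitive.feedback_loop_engagement": "fire_and_forget",
    },
    "CLAUDE-CL": {
        "cognitive.inter_command_latency_class": "long",
        "cognitive.feedback_loop_engagement": "closed_loop",
    },
}


def mismatches(cls, observed):
    """Return {primitive: (expected, observed_or_None)} for every unmet expectation.

    `observed` is assumed to be a {primitive: value} dict for one session.
    """
    return {
        primitive: (want, observed.get(primitive))
        for primitive, want in EXPECTED[cls].items()
        if observed.get(primitive) != want
    }
```

An empty result means the session carries every expected value for its class; primitives not listed in the table are ignored.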

Verify

For each class, after disconnecting:

  1. DB row landing — within ~30s (the profiler tick interval), observations carries one row per primitive for the new attacker:
    $ sqlite3 /var/lib/decnet/decnet.db \
        "SELECT primitive, value, confidence FROM observations \
         WHERE evidence_ref LIKE 'shard:smoke-decky/%' ORDER BY ts DESC LIMIT 40;"
    
  2. Bus events — tail the bus worker log; you should see one attacker.observation.<primitive> per emitted row, plus the originating attacker.session.ended.
  3. AttackerDetail panel — open /attackers/<uuid> in the browser. The Behavioural primitives section should hydrate from the REST snapshot and live-update each time you replay the session (the SSE route forwards the new emissions in real time).
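
The DB check in step 1 can also be scripted. The sketch below counts distinct primitives for rows whose evidence_ref starts with the smoke-decky prefix, using only the columns named in the query above; `distinct_primitives` and `MIN_DISTINCT` are hypothetical names. Like the sqlite3 query, it does not separate individual sessions or classes.

```python
import sqlite3

# Per-shard hard gate quoted in the pass criteria of this runbook.
MIN_DISTINCT = 27


def distinct_primitives(conn, evidence_prefix="shard:smoke-decky/"):
    """Count distinct primitives among rows whose evidence_ref starts with the prefix.

    Note: '%' and '_' inside the prefix would act as LIKE wildcards; the
    default prefix contains neither.
    """
    row = conn.execute(
        "SELECT COUNT(DISTINCT primitive) FROM observations WHERE evidence_ref LIKE ?",
        (evidence_prefix + "%",),
    ).fetchone()
    return row[0]


def meets_gate(conn, evidence_prefix="shard:smoke-decky/"):
    """True when the distinct-primitive count reaches the assumed gate."""
    return distinct_primitives(conn, evidence_prefix) >= MIN_DISTINCT
```

For example, `meets_gate(sqlite3.connect("/var/lib/decnet/decnet.db"))` would run the check against the DB path used in step 1.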

Pass criteria

  • All 5 classes produce ≥ 27 distinct primitives in observations (the per-shard hard gate from tests/profiler/behave_shell/test_calibration_grid.py).
  • The four day-one priority primitives appear in the panel and carry the expected values per class (table above).
  • No collector / profiler / web errors in the journal during the round-trip.

If any class regresses, roll back the last commit and run the offline replay (smoke.sh) to localise the fault: it exercises the same handler without transport noise.