test(profiler/behave_shell): Phase 6 smoke harness + live-decky runbook

Two-half deliverable per BEHAVE-INTEGRATION.md §587-594:

* scripts/behave_shell/replay_calibration.py — Python helper that
  drives the production handler against one asciinema shard, mints
  a temp SQLite repo + an Attacker per session, captures bus
  emissions in-process. Exits non-zero on zero-observation sessions.

* scripts/behave_shell/smoke.sh — bash entry that replays all five
  2026-05-02 calibration shards (HUMAN / YOU-sim / LW-sim /
  CLAUDE-FF / CLAUDE-CL). Auto-activates .311 venv, forces
  DECNET_DB_TYPE=sqlite, prints per-class summary. Suitable for CI.

* scripts/behave_shell/README.md — runbook covering both halves.
  Pins the manual live-decky procedure (one SSH session per class
  against a deployed smoke-decky, expected dominant primitives table,
  SQL verification query, AttackerDetail panel check, pass criteria).

* BEHAVE-INTEGRATION.md — Phase 6 completion log appended with
  current corpus results table (15 sessions, 424 observations across
  the five classes) and a note that the v0 tag (drop -pre) is gated
  on the manual live-decky round-trip and lands as a separate
  commit.

Live-decky run is intentionally NOT scripted — the integration doc
calls for manual SSH sessions per class so an operator confirms the
bus / collector / disk-reach plumbing under real PTY conditions.
2026-05-08 21:42:11 -04:00
parent b3ff80d74e
commit 69c8cfd2b9
4 changed files with 427 additions and 0 deletions


@@ -676,5 +676,57 @@ this integration; explicitly listed under "Out of scope" above.
---
## Phase 6 completion log
Phase 6 ships the smoke harness in `scripts/behave_shell/`.
* **`replay_calibration.py`** — Python tool that drives the
production handler
(`decnet.profiler.behave_shell._handler.handle_session_ended`)
against one calibration shard. Mints a temp SQLite repo + an
Attacker row per session, captures bus emissions in-process, and
asserts every session in the shard produces ≥ 1 observation. Exits
non-zero on any session that emits zero observations.
* **`smoke.sh`** — bash entry point. Auto-discovers the calibration
dir at `../BEHAVE/prototype_extractors/shell`, replays all five
classes, and prints a per-class summary. Auto-activates the `.311`
venv if present; forces `DECNET_DB_TYPE=sqlite` so the smoke
doesn't depend on a running MySQL.
* **`README.md`** — runbook covering both halves of Phase 6:
the offline replay (CI-suitable) and the **manual live decky
round-trip** (one SSH session per calibration class against a
deployed `smoke-decky`, with expected dominant primitives, the SQL
verification query, and the AttackerDetail panel check).
**Offline replay results on the 2026-05-02 corpus:**

| Class      | Sessions | Observations | Distinct primitives |
|------------|---------:|-------------:|--------------------:|
| HUMAN      |        1 |           34 |                  34 |
| YOU-sim    |        2 |           59 |                  34 |
| LW-sim     |        5 |          136 |                  34 |
| CLAUDE-FF  |        3 |           84 |                  34 |
| CLAUDE-CL  |        4 |          111 |                  34 |
| **Total**  |   **15** |      **424** |                   — |

`smoke: OK — all classes emit observations end-to-end`. Every session
in every class persists 27+ Tier-A primitives (the per-shard hard
gate) plus the conditional sets where present.

**The live-decky run is intentionally NOT scripted** — Phase 6 of
the integration doc explicitly calls for a manual SSH session per
class so the operator confirms the bus / collector / disk-reach
plumbing under real PTY conditions. The README pins the procedure +
pass criteria + per-class expected primitives.
**v0 tag gating.** The proper `0.1.0` tag (drop `-pre` from
`decnet/profiler/behave_shell/__init__.py`) lands once the operator
has run the manual round-trip across all five classes and confirmed
the AttackerDetail panel hydrates + live-updates as expected. That
commit is intentionally separate from this Phase 6 commit so the
v0-tag commit message can cite the live-smoke walkthrough.
---
**Owner:** ANTI.
**Implementation gate:** this doc reviewed → Phase 1 starts.


@@ -0,0 +1,110 @@
# BEHAVE-SHELL — Phase 6 smoke
Two halves:
1. **Offline replay** — `smoke.sh` replays the five 2026-05-02
calibration shards through the production handler. Exercises the
engine + storage layer end-to-end without a live PTY. Suitable for
CI.
2. **Live decky round-trip** — manual procedure below. Confirms the
bus / collector / disk-reach plumbing on a real session.
## 1. Offline replay
```sh
$ scripts/behave_shell/smoke.sh # auto-discovers ../BEHAVE/prototype_extractors/shell
$ scripts/behave_shell/smoke.sh /path/to/calibration/dir # explicit dir
```
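Expected output (15 sessions across 5 classes, 424 total observations
on the current corpus). Each class line also carries a `bus_events=`
count and is followed by a `[<label>] top: ...` sample line; both are
omitted here: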
```
[HUMAN] sessions=1 observations=34 distinct_primitives=34
[YOU-sim] sessions=2 observations=59 distinct_primitives=34
[LW-sim] sessions=5 observations=136 distinct_primitives=34
[CLAUDE-FF] sessions=3 observations=84 distinct_primitives=34
[CLAUDE-CL] sessions=4 observations=111 distinct_primitives=34
smoke: OK — all classes emit observations end-to-end
```
Exit codes: `0` full pass, `1` any class regressed, `2` argument /
IO error.
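
For CI dashboards it can be handy to total the summary lines. The sketch below is not part of the harness: it assumes only the `[LABEL] sessions=... observations=...` line shape shown above, and `parse_summary` is a hypothetical helper name.

```python
import re

# Summary lines as printed in the expected output above.
SAMPLE = """\
[HUMAN] sessions=1 observations=34 distinct_primitives=34
[YOU-sim] sessions=2 observations=59 distinct_primitives=34
[LW-sim] sessions=5 observations=136 distinct_primitives=34
[CLAUDE-FF] sessions=3 observations=84 distinct_primitives=34
[CLAUDE-CL] sessions=4 observations=111 distinct_primitives=34
"""

# Only lines whose payload starts with "sessions=" are per-class summaries;
# other bracketed lines (for example the "top:" sample) are ignored.
LINE_RE = re.compile(r"^\[(?P<label>[^\]]+)\]\s+(?P<fields>sessions=.*)$")


def parse_summary(text):
    """Map each class label to its integer key=value fields."""
    out = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        fields = {}
        for token in m.group("fields").split():
            key, _, value = token.partition("=")
            if value.isdigit():
                fields[key] = int(value)
        out[m.group("label")] = fields
    return out


summary = parse_summary(SAMPLE)
total_sessions = sum(f["sessions"] for f in summary.values())
total_observations = sum(f["observations"] for f in summary.values())
print(total_sessions, total_observations)
```

On the corpus figures quoted above, the totals come out to 15 sessions and 424 observations.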
The replay drives `decnet.profiler.behave_shell._handler.handle_session_ended`
directly against a temp SQLite DB seeded with one Attacker per
session. Bus emission is captured by an in-process publisher; no
real bus is required.
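
The in-process capture amounts to passing a list-appending callable where the worker would pass the real bus publisher. A minimal sketch with a stand-in handler: `fake_handler`, its payload fields, and the `"observation"` event-type string are hypothetical and only mimic the call shape `handler(repo, payload, publish)` used by `replay_calibration.py`; the `(topic, payload, event_type)` publisher signature is taken from that script.

```python
import asyncio
from typing import Any, Callable

Publish = Callable[[str, dict, str], None]


async def fake_handler(repo: Any, payload: dict, publish: Publish) -> int:
    """Hypothetical stand-in for the production handler: emits two observations."""
    primitives = ["motor.input_modality", "cognitive.feedback_loop_engagement"]
    for primitive in primitives:
        publish(
            f"attacker.observation.{primitive}",
            {"session_id": payload["session_id"], "primitive": primitive},
            "observation",  # assumed event-type string
        )
    return len(primitives)


async def main():
    captured = []

    def publish(topic: str, payload: dict, event_type: str) -> None:
        # No broker involved: every emission just lands in a local list.
        captured.append((topic, payload, event_type))

    count = await fake_handler(None, {"session_id": "demo-sid"}, publish)
    return count, captured


count, captured = asyncio.run(main())
print(count, [topic for topic, _, _ in captured])
```

Because the publisher is a plain callable, a failure in this mode points at the handler or storage layer rather than at transport.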
## 2. Live decky round-trip (manual)
End-to-end confirmation. Run **once** before tagging v0 and **after**
any change to the bus / collector / disk-reach layer.
### Setup
1. Init a fresh DECNET host (see `decnet init`).
2. `decnet bus` worker is up (systemd unit
`decnet-bus.service` or `scripts/bus/smoke.sh`).
3. `decnet-profiler.service` is up — it owns the
`attacker.session.ended` subscription and the BEHAVE-SHELL handler.
4. `decnet-collector.service` is up — it publishes
`attacker.session.ended` from `session_recorded` log events.
5. Web API is up; you have a viewer JWT in your browser localStorage.
6. Deploy a single `ssh` decky:
   ```sh
   $ decnet decky deploy --service ssh --decky smoke-decky
   ```
   The decky's sessrec wrapper appends to
   `/var/lib/decnet/artifacts/smoke-decky/ssh/transcripts/sessions-<UTC-DAY>.jsonl`.
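
To find the shard a session lands in, the `<UTC-DAY>` placeholder can be filled from the current UTC date. The sketch assumes the placeholder expands to `YYYY-MM-DD`, as the calibration shard names (`sessions-2026-05-02.jsonl`) suggest; `transcript_shard` is a hypothetical helper, not a DECNET API.

```python
from datetime import date, datetime, timezone
from pathlib import PurePosixPath

ARTIFACT_ROOT = PurePosixPath("/var/lib/decnet/artifacts")


def transcript_shard(decky: str, service: str, day: date) -> PurePosixPath:
    """Build the per-day transcript path for one decky/service pair."""
    name = f"sessions-{day.isoformat()}.jsonl"  # assumed YYYY-MM-DD expansion
    return ARTIFACT_ROOT / decky / service / "transcripts" / name


# Today's shard, using the UTC calendar day rather than local time.
today_utc = datetime.now(timezone.utc).date()
print(transcript_shard("smoke-decky", "ssh", today_utc))

# Fixed date, matching the calibration corpus naming.
fixed = transcript_shard("smoke-decky", "ssh", date(2026, 5, 2))
print(fixed)
```

Using the UTC day matters around midnight: a session recorded late in the local evening may land in the next day's shard.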
### Run one session per calibration class
For each class, SSH into the decky and reproduce the canonical
workload. Log out via the documented exit path so the
`session_recorded` event fires. The collector aggregates the session
and publishes `attacker.session.ended`; the profiler worker
disk-reaches the shard, runs `extract_session()`, persists rows,
publishes one `attacker.observation.<primitive>` per emission.

| Class | Workload sketch | Expected dominant primitives |
|---|---|---|
| HUMAN | Type each command live; correct typos; pause to read output. | `motor.input_modality=typed`, `cognitive.feedback_loop_engagement=closed_loop` |
| YOU-sim | Paste short pre-canned commands at typing speed; minimal repeats. | `motor.input_modality=pasted`, `motor.paste_burst_rate=occasional`, `cognitive.command_branch_diversity=linear_playbook` |
| LW-sim | Paste a recon sweep generated by a small LLM; ~2-8s between pastes. | `cognitive.inter_command_latency_class=llm_lightweight` |
| CLAUDE-FF | Paste outputs from a fire-and-forget reasoning agent; ~8-30s gaps. | `cognitive.inter_command_latency_class=llm_heavyweight`, `cognitive.feedback_loop_engagement=fire_and_forget` |
| CLAUDE-CL | Drive a closed-loop plan-execute-observe agent; >30s pauses on long output. | `cognitive.inter_command_latency_class=long`, `cognitive.feedback_loop_engagement=closed_loop` |
### Verify
For each class, after disconnecting:
1. **DB row landing** — within ~30s
   (the profiler tick interval), `observations` carries one row per
   primitive for the new attacker:
   ```sh
   $ sqlite3 /var/lib/decnet/decnet.db \
       "SELECT primitive, value, confidence FROM observations \
        WHERE evidence_ref LIKE 'shard:smoke-decky/%' ORDER BY ts DESC LIMIT 40;"
   ```
2. **Bus events** — tail the bus worker log; you should see one
`attacker.observation.<primitive>` per emitted row, plus the
originating `attacker.session.ended`.
3. **AttackerDetail panel** — open
`/attackers/<uuid>` in the browser. The Behavioural primitives
section should hydrate from the REST snapshot and live-update
each time you replay the session
(the SSE route forwards the new emissions in real time).
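
When reading the bus log, note that primitive names themselves contain dots, so a topic has to be split on its first two dots only. A small sketch of that parsing, mirroring the `topic.split(".", 2)` call in `replay_calibration.py`; `primitive_from_topic` is a hypothetical name:

```python
from typing import Optional

PREFIX = "attacker.observation."


def primitive_from_topic(topic: str) -> Optional[str]:
    """Return the primitive name for an observation topic, else None."""
    if not topic.startswith(PREFIX):
        return None
    # maxsplit=2 keeps dotted primitive names such as "motor.input_modality" whole.
    return topic.split(".", 2)[2]


topics = [
    "attacker.session.ended",
    "attacker.observation.motor.input_modality",
    "attacker.observation.cognitive.feedback_loop_engagement",
]
parsed = [primitive_from_topic(t) for t in topics]
print(parsed)
```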
### Pass criteria
* All 5 classes produce ≥ 27 distinct primitives in
`observations` (the per-shard hard gate from
`tests/profiler/behave_shell/test_calibration_grid.py`).
* The four day-one priority primitives appear in the panel and carry
the expected values per class (table above).
* No collector / profiler / web errors in the journal during the
round-trip.
If any class regresses: roll back the last commit and run the offline
replay (`smoke.sh`) to localise the fault: same handler, no transport noise.
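
The first pass criterion can also be checked mechanically. The sketch below runs against an in-memory SQLite table. The column names come from the verification query above, but the table definition, the synthetic rows, the `sid-a` / `sid-b` suffixes after the `shard:smoke-decky/` prefix, and the helper name are assumptions for illustration, not the real DECNET schema.

```python
import sqlite3

MIN_DISTINCT = 27  # per-shard hard gate quoted in the pass criteria


def distinct_primitives(conn, ref_prefix):
    """Count distinct primitives per evidence_ref under the given prefix."""
    rows = conn.execute(
        "SELECT evidence_ref, COUNT(DISTINCT primitive) FROM observations "
        "WHERE evidence_ref LIKE ? GROUP BY evidence_ref",
        (ref_prefix + "%",),
    )
    return dict(rows.fetchall())


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observations "
    "(primitive TEXT, value TEXT, confidence REAL, evidence_ref TEXT, ts TEXT)"
)
# Synthetic rows: one session with 30 primitives, one with only 5.
for ref, n in (("shard:smoke-decky/sid-a", 30), ("shard:smoke-decky/sid-b", 5)):
    conn.executemany(
        "INSERT INTO observations VALUES (?, 'x', 1.0, ?, '2026-05-02T00:00:00Z')",
        [(f"demo.primitive_{i}", ref) for i in range(n)],
    )

counts = distinct_primitives(conn, "shard:smoke-decky/")
failing = sorted(ref for ref, n in counts.items() if n < MIN_DISTINCT)
print(counts, failing)
```

Any `evidence_ref` listed in `failing` falls under the gate and is a candidate for the offline replay described above.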


@@ -0,0 +1,169 @@
"""Replay one calibration-corpus shard through the BEHAVE-SHELL handler.

Phase 6 smoke helper. Drives the production handler
(``decnet.profiler.behave_shell._handler.handle_session_ended``)
against an asciinema shard from
``BEHAVE/prototype_extractors/shell/`` *without* a live decky.
Mints a temp SQLite repo, an Attacker row, and an
``attacker.session.ended``-shape payload, then calls the handler
exactly the way the worker does.

This is **not** a substitute for the manual decky run described in
``scripts/behave_shell/README.md`` — the integration doc's Phase 6
calls for a real PTY round-trip. This helper exercises the handler +
storage layer end-to-end without the worker loop, so a failure here
points at the engine and not at the bus / collector / disk-reach
plumbing.

Usage::

    python scripts/behave_shell/replay_calibration.py \\
        --shard /path/to/sessions-2026-05-02.jsonl \\
        --label HUMAN

Exit codes:
    0  every session in the shard produced ≥ 1 observation
    1  zero observations produced for at least one session
    2  argument / IO error
"""
from __future__ import annotations

import argparse
import asyncio
import collections
import json
import sys
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path
from typing import Any

from decnet.profiler.behave_shell._handler import handle_session_ended
from decnet.web.db.factory import get_repository


def _sids_in_shard(shard: Path) -> list[str]:
    sids: list[str] = []
    seen: set[str] = set()
    with shard.open() as f:
        for line in f:
            try:
                rec = json.loads(line)
            except (ValueError, json.JSONDecodeError):
                continue
            if not isinstance(rec, dict):
                continue
            sid = rec.get("sid")
            if not isinstance(sid, str) or sid in seen:
                continue
            seen.add(sid)
            sids.append(sid)
    return sids


async def _seed_attacker(repo: Any, ip: str) -> str:
    return await repo.upsert_attacker({
        "ip": ip,
        "first_seen": datetime.now(timezone.utc),
        "last_seen": datetime.now(timezone.utc),
        "event_count": 1,
        "service_count": 1,
        "decky_count": 1,
        "services": "[\"ssh\"]",
        "deckies": "[\"smoke-decky\"]",
        "traversal_path": None,
        "is_traversal": False,
        "bounty_count": 0,
        "credential_count": 0,
        "fingerprints": "[]",
        "commands": "[]",
        "country_code": None,
        "country_source": None,
        "asn": None,
        "as_name": None,
        "asn_source": None,
        "updated_at": datetime.now(timezone.utc),
    })


def _payload_for(shard: Path, sid: str, ip: str) -> dict[str, Any]:
    return {
        "session_id": sid,
        "attacker_uuid": None,
        "attacker_ip": ip,
        "decky_id": "smoke-decky",
        "service": "ssh",
        "ended_at": datetime.now(timezone.utc).isoformat(),
        "duration_s": 0.0,
        "commands": [],
        "shard_path": str(shard),
    }


async def _replay(shard: Path, label: str) -> int:
    sids = _sids_in_shard(shard)
    if not sids:
        print(f"[{label}] FAIL — no sids found in shard", file=sys.stderr)
        return 1

    with tempfile.TemporaryDirectory(prefix="behave-smoke.") as tmp:
        db_path = Path(tmp) / "smoke.db"
        repo = get_repository(db_path=str(db_path))
        await repo.initialize()

        bus_events: list[tuple[str, dict[str, Any], str]] = []

        def _publish(topic: str, payload: dict[str, Any], event_type: str) -> None:
            bus_events.append((topic, payload, event_type))

        per_sid_counts: dict[str, int] = {}
        for sid in sids:
            # Synthetic per-session IP. str hashes are salted per process,
            # so the address differs between runs; harmless for a temp DB.
            ip = f"10.{abs(hash(sid)) % 256}.{abs(hash(sid + label)) % 256}.5"
            await _seed_attacker(repo, ip)
            n = await handle_session_ended(repo, _payload_for(shard, sid, ip), _publish)
            per_sid_counts[sid] = n

    # Tally primitives from the captured bus events (not from the DB).
    all_primitives: collections.Counter[str] = collections.Counter()
    for topic, payload, _etype in bus_events:
        primitive = payload.get("primitive") or topic.split(".", 2)[2]
        all_primitives[primitive] += 1

    total_obs = sum(per_sid_counts.values())
    empty_sids = [sid for sid, n in per_sid_counts.items() if n == 0]
    print(f"[{label}] sessions={len(sids)} observations={total_obs} "
          f"distinct_primitives={len(all_primitives)} bus_events={len(bus_events)}")
    if empty_sids:
        print(f"[{label}] FAIL — {len(empty_sids)}/{len(sids)} sessions emitted "
              f"zero observations", file=sys.stderr)
        for sid in empty_sids[:3]:
            print(f"[{label}] empty sid={sid}", file=sys.stderr)
        return 1

    # One-line top-5 primitive sample for visual sanity.
    top = ", ".join(
        f"{p}={c}" for p, c in all_primitives.most_common(5)
    )
    print(f"[{label}] top: {top}")
    return 0


async def _main() -> int:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--shard", required=True, type=Path,
                        help="Path to a sessions-YYYY-MM-DD.jsonl shard")
    parser.add_argument("--label", required=True,
                        help="Calibration class label (HUMAN / YOU-sim / "
                             "LW-sim / CLAUDE-FF / CLAUDE-CL)")
    args = parser.parse_args()
    if not args.shard.is_file():
        print(f"shard not a file: {args.shard}", file=sys.stderr)
        return 2
    return await _replay(args.shard, args.label)


if __name__ == "__main__":
    sys.exit(asyncio.run(_main()))

scripts/behave_shell/smoke.sh Executable file

@@ -0,0 +1,96 @@
#!/usr/bin/env bash
### Usage: scripts/behave_shell/smoke.sh [BEHAVE_CALIBRATION_DIR]
#
# BEHAVE-INTEGRATION Phase 6 — offline replay smoke test.
#
# Runs the production handler
# (`decnet.profiler.behave_shell._handler.handle_session_ended`) against
# each of the five 2026-05-02 calibration shards, asserts every session
# in every shard produces ≥ 1 observation, and prints a per-class
# summary.
#
# This is the **offline** half of Phase 6. The **live-decky** half is
# documented in `scripts/behave_shell/README.md` — that one needs a
# real PTY round-trip and stays manual.
#
# Argument:
# $1 Optional path to the directory holding
# sessions-2026-05-02-*.jsonl. Defaults to
# ../BEHAVE/prototype_extractors/shell relative to this repo.
#
# Exits 0 on full pass, 1 on any class regression, 2 on bad input.
set -euo pipefail
HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "${HERE}/../.." && pwd)"
DEFAULT_DIR="${REPO_ROOT}/../BEHAVE/prototype_extractors/shell"
CALIB_DIR="${1:-${DEFAULT_DIR}}"
if [[ ! -d "${CALIB_DIR}" ]]; then
    echo "smoke: FAIL — calibration dir not found: ${CALIB_DIR}" >&2
    echo "smoke: pass it as \$1 or symlink it next to DECNET/" >&2
    exit 2
fi
# Auto-activate the project venv so the script works whether or not
# the caller already sourced it (mirrors the .311 convention from the
# pre-commit hook).
if [[ -d "${REPO_ROOT}/.311" ]]; then
    # shellcheck disable=SC1091
    source "${REPO_ROOT}/.311/bin/activate"
fi
# Force sqlite so the smoke doesn't depend on a running mysql.
export DECNET_DB_TYPE="sqlite"
# Suppress the verbose decnet logger so the per-class summary lines
# stay readable. ANTI's developer log has DEBUG enabled via env; mute
# at the smoke entrypoint.
export DECNET_LOG_LEVEL="${DECNET_LOG_LEVEL:-WARNING}"
unset DECNET_DEVELOPER_MODE 2>/dev/null || true
declare -a SHARDS=(
    "sessions-2026-05-02.jsonl|HUMAN"
    "sessions-2026-05-02-with-llm.jsonl|YOU-sim"
    "sessions-2026-05-02-new.jsonl|LW-sim"
    "sessions-2026-05-02-with-claude.jsonl|CLAUDE-FF"
    "sessions-2026-05-02-closed-loop.jsonl|CLAUDE-CL"
)
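LOGDIR="$(mktemp -d -t behave-smoke.XXXXXX)"
# Remove the logs only on a clean run. On failure they are kept so the
# "(full log: ...)" pointers printed by the loop below stay valid.
trap 'if [[ "${failed:-0}" -eq 0 ]]; then rm -rf "${LOGDIR}"; fi' EXIT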
echo "smoke: replaying ${#SHARDS[@]} calibration classes from ${CALIB_DIR}"
echo "smoke: per-class logs in ${LOGDIR}"
echo
failed=0
for entry in "${SHARDS[@]}"; do
    fn="${entry%%|*}"
    label="${entry##*|}"
    shard="${CALIB_DIR}/${fn}"
    if [[ ! -f "${shard}" ]]; then
        echo "[${label}] SKIP — shard not present: ${shard}" >&2
        continue
    fi
    log="${LOGDIR}/${label}.log"
    set +e
    python "${HERE}/replay_calibration.py" \
        --shard "${shard}" --label "${label}" >"${log}" 2>&1
    rc=$?
    set -e
    # Surface the summary lines (everything starting with '['). stdout
    # and stderr are both captured in the log, so the tool's
    # '[label] FAIL' lines surface here too; other noise stays in the log.
    grep -E '^\[' "${log}" || true
    if [[ "${rc}" -ne 0 ]]; then
        failed=$((failed + 1))
        echo "[${label}] (full log: ${log})" >&2
    fi
done
echo
if [[ "${failed}" -gt 0 ]]; then
    echo "smoke: FAIL — ${failed} class(es) regressed" >&2
    exit 1
fi
echo "smoke: OK — all classes emit observations end-to-end"