feat(realism): synthetic_files table + planner wiring + scheduler swap

Stage 3 of the realism migration. Replaces orchestrator/scheduler.py's
hardcoded _FILE_TEMPLATES/_USERS (3 templates emitting epoch-suffixed
filenames like notes-1777315854.txt with identical bodies per
template) with a persona-driven realism engine.

New surface:

- SyntheticFile SQLModel (synthetic_files table, UNIQUE on
  decky_uuid+path) — per-(decky, path) state for the future
  edit-in-place flow. Pre-v1, no _migrate_* helper.
- BaseRepository methods: record_synthetic_file,
  update_synthetic_file, list_synthetic_files,
  pick_random_synthetic_file_for_edit (used by stage 3b).
- realism/naming.py: per-content-class filename templates,
  persona-conditioned. /var/log/cron.log + logrotate skeleton for
  system-class; /home/<persona>/TODO.md, scratch.md, etc. for
  user-class. Anti-regression test pins "no 8+ digit decimals in
  basenames" (the realism failure today).
- realism/bodies.py: deterministic body templates per content_class.
  TODO body uses checkbox markdown, script body has a shebang, cron
  body matches syslog cron shape ("CRON[PID]: (user) CMD (...)").
- realism/planner.py: pick(deckies, now, *, rand=None) returns a
  Plan, or None when no (decky, persona) pair is eligible.
  Diurnal-gated, weighted user/system content split (70/30 user
  bias). Create-only in stage 3; edit branch lands in stage 3b.

Scheduler split:

- scheduler.pick is now traffic-only (sync).
- scheduler.pick_file is async, takes a repo, resolves personas
  (Topology.email_personas for topology-source deckies; global
  realism.personas_pool otherwise), and maps Plan -> FileAction.
- FileAction gains persona/content_class/mtime fields.

Worker:

- _one_tick rolls 50/50 between traffic and file each tick. After a
  successful FileAction plant, _record_synthetic_file persists or
  patches the synthetic_files row (catching the unique-constraint
  collision on re-plant of the same path).
- SSHDriver._run_file passes action.mtime through to plant_file so
  files don't all stamp at wall-clock-now.
2026-04-27 16:22:07 -04:00
parent 636c057cc5
commit cb1872c52f
15 changed files with 1541 additions and 105 deletions
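
The persist-or-patch step described under "Worker" can be sketched with the standard-library sqlite3 module. This is only an illustration of the pattern: the real code goes through SQLModel and BaseRepository, and the helper name `record_or_patch` and the `content_class`/`mtime` columns are assumptions, not taken from the commit.

```python
import sqlite3


def record_or_patch(conn, decky_uuid, path, content_class, mtime):
    """Insert a synthetic_files row, or patch it if (decky_uuid, path) exists."""
    try:
        conn.execute(
            "INSERT INTO synthetic_files (decky_uuid, path, content_class, mtime) "
            "VALUES (?, ?, ?, ?)",
            (decky_uuid, path, content_class, mtime),
        )
        return "created"
    except sqlite3.IntegrityError:
        # UNIQUE(decky_uuid, path) fired: the same path was re-planted
        # on the same decky, so update the existing row instead.
        conn.execute(
            "UPDATE synthetic_files SET content_class = ?, mtime = ? "
            "WHERE decky_uuid = ? AND path = ?",
            (content_class, mtime, decky_uuid, path),
        )
        return "patched"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE synthetic_files ("
    " decky_uuid TEXT NOT NULL, path TEXT NOT NULL,"
    " content_class TEXT NOT NULL, mtime REAL NOT NULL,"
    " UNIQUE (decky_uuid, path))"
)
first = record_or_patch(conn, "d-1", "/home/alice/TODO.md", "todo", 1.0)
second = record_or_patch(conn, "d-1", "/home/alice/TODO.md", "todo", 2.0)
rows = conn.execute("SELECT mtime FROM synthetic_files").fetchall()
```

The second call hits the unique constraint and falls through to the update, so the table still holds one row per (decky, path).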

decnet/realism/bodies.py (new file, 233 lines)

@@ -0,0 +1,233 @@
"""Per-content-class body generators (deterministic templates).

Stage 3 of the realism migration ships deterministic per-class
templates: varied enough that two notes on the same decky aren't
identical, formulaic enough that system-class files (cron logs,
journal entries) look like cron actually wrote them.

Stage 6 wires LLM enrichment for user-classes; the templates here
remain the fallback path so the orchestrator tick never blocks on
Ollama.

Randomness: every namer/body takes a :class:`~random.SystemRandom`
(re-exported by :mod:`secrets`). ``SystemRandom`` reads OS entropy
and ignores seeding, so a test that needs reproducible output has to
inject a seedable stand-in such as :class:`random.Random`; the
orchestrator passes a fresh RNG per tick so production picks are
unpredictable.

The factory mirrors :mod:`decnet.realism.naming`: caller passes a
:class:`~decnet.realism.taxonomy.ContentClass`; we return the body
generator registered for it. Email + canary classes raise;
those bodies come from the email driver and canary cultivator
respectively, not from realism.bodies.
"""
from __future__ import annotations
import secrets
from datetime import datetime, timezone
from typing import Callable, Optional
from decnet.realism.taxonomy import ContentClass
# ── User-class body generators ─────────────────────────────────────────────
_NOTE_TEMPLATES: tuple[str, ...] = (
"follow up with the team on this",
"remember to ping the on-call",
"ask about the staging migration timeline",
"double-check the runbook before next shift",
"todo: rotate keys; check on backup task",
"meeting notes from yesterday — copy onto wiki when free",
"this is broken in prod; talk to ops monday",
"draft response to the auditor — keep it short",
)
def _body_note(persona: str, rng: secrets.SystemRandom) -> str:
    n = rng.randint(2, 5)
    lines = rng.sample(_NOTE_TEMPLATES, k=min(n, len(_NOTE_TEMPLATES)))
    return "\n".join(lines) + "\n"


_TODO_VERBS: tuple[str, ...] = (
    "rotate keys",
    "review pr",
    "clean up logs",
    "update docs",
    "follow up on ticket",
    "test backup restore",
    "deploy to staging",
    "ack auditor email",
    "patch CVE backlog",
)


def _body_todo(persona: str, rng: secrets.SystemRandom) -> str:
    n = rng.randint(3, 7)
    items = rng.sample(_TODO_VERBS, k=min(n, len(_TODO_VERBS)))
    # Roughly a third pre-checked, so it looks like a list that has
    # been touched at least once.
    out = []
    for item in items:
        marker = "[x]" if rng.random() < 0.33 else "[ ]"
        out.append(f"- {marker} {item}")
    return "\n".join(out) + "\n"
_DRAFT_PARAGRAPHS: tuple[str, ...] = (
"Hi team,\n\nQuick update on the project. We're tracking ahead of schedule "
"on the migration but the staging soak revealed a regression in the "
"auth path. I'll have a fix in by end of week.\n\nThanks,\n",
"Hi,\n\nFollowing up on yesterday's meeting. Action items below:\n\n"
"- Engineering owns the deployment plan\n"
"- Ops will draft the runbook update\n"
"- We sync again Friday\n\n",
"All,\n\nProposal attached. Key points:\n\n"
"1. We are not changing the data model in this release\n"
"2. The new endpoint is opt-in via feature flag\n"
"3. Rollback path is one config flip\n\n"
"Feedback by EOD?\n\n",
)
def _body_draft(persona: str, rng: secrets.SystemRandom) -> str:
    return rng.choice(_DRAFT_PARAGRAPHS)
_SCRIPT_TEMPLATES: tuple[str, ...] = (
"#!/usr/bin/env bash\nset -euo pipefail\n\n"
"BACKUP_DIR=/var/backups\n"
"STAMP=$(date +%Y%m%d-%H%M)\n"
"echo \"backup start $STAMP\"\n"
"tar czf \"$BACKUP_DIR/db-$STAMP.tar.gz\" /var/lib/mysql\n"
"echo \"backup done\"\n",
"#!/usr/bin/env bash\nset -e\n\n"
"# clean up old logs\n"
"find /var/log -name '*.log.*.gz' -mtime +30 -delete\n",
"#!/usr/bin/env python3\n\"\"\"Quick fix for the reporting job.\"\"\"\n"
"import sys\n\n"
"def main():\n print('todo: real fix here')\n\n"
"if __name__ == '__main__':\n sys.exit(main())\n",
)
def _body_script(persona: str, rng: secrets.SystemRandom) -> str:
    return rng.choice(_SCRIPT_TEMPLATES)
# ── System-class body generators ───────────────────────────────────────────
_CRON_COMMANDS: tuple[str, ...] = (
"(root) CMD (run-parts /etc/cron.daily)",
"(root) CMD (run-parts /etc/cron.hourly)",
"(www-data) CMD (cd /var/www && /usr/bin/php artisan schedule:run)",
"(backup) CMD (/usr/local/bin/backup.sh)",
"(root) CMD (test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ))",
)
def _body_log_cron(persona: str, rng: secrets.SystemRandom) -> str:
    n = rng.randint(8, 24)
    base = datetime.now(timezone.utc)
    lines = []
    for i in range(n):
        hour = (base.hour - i) % 24
        minute = rng.randint(0, 59)
        pid = rng.randint(1000, 99999)
        cmd = rng.choice(_CRON_COMMANDS)
        # Syslog-style "Apr 27 09:13:44 host CRON[1234]: ..." cron line shape.
        date_s = base.strftime("%b %d")
        lines.append(
            f"{date_s} {hour:02d}:{minute:02d}:{rng.randint(0, 59):02d} "
            f"hostname CRON[{pid}]: {cmd}"
        )
    return "\n".join(lines) + "\n"
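
Note on `_body_log_cron`: it counts hours backwards from the current hour while stamping every line with today's date, so the lines come out newest-first and any hour that wraps past midnight still carries today's date. A possible variant that keeps the log chronological is sketched below. It is an assumption-laden illustration, not part of the commit: `cron_lines` is a hypothetical name, the command tuple is trimmed, and a seedable `random.Random` is used so the output shape can be checked.

```python
import random
from datetime import datetime, timedelta, timezone

CRON_COMMANDS = (
    "(root) CMD (run-parts /etc/cron.hourly)",
    "(backup) CMD (/usr/local/bin/backup.sh)",
)


def cron_lines(now: datetime, n: int, rng: random.Random) -> list[str]:
    """Return n syslog-shaped cron lines, oldest first, one per past hour."""
    lines = []
    for hours_ago in range(n, 0, -1):
        # Subtract real time so the date rolls back across midnight.
        stamp = (now - timedelta(hours=hours_ago)).replace(
            minute=rng.randint(0, 59), second=rng.randint(0, 59)
        )
        # Classic syslog pads the day with a space ("Apr  7"), not a zero.
        date_s = f"{stamp.strftime('%b')} {stamp.day:2d} {stamp.strftime('%H:%M:%S')}"
        pid = rng.randint(1000, 99999)
        lines.append(f"{date_s} hostname CRON[{pid}]: {rng.choice(CRON_COMMANDS)}")
    return lines


now = datetime(2026, 4, 27, 2, 30, tzinfo=timezone.utc)
sample = cron_lines(now, 5, random.Random(0))
```

With `now` at 02:30 on April 27, the five lines span 21:xx on April 26 through 01:xx on April 27, in ascending order.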
_DAEMON_LINES: tuple[str, ...] = (
"systemd[1]: Started Daily apt download activities.",
"systemd[1]: apt-daily.service: Succeeded.",
"systemd[1]: Reached target Multi-User System.",
"kernel: [UFW BLOCK] IN=eth0 OUT= MAC=…",
"sshd[2103]: pam_unix(sshd:session): session opened for user admin by (uid=0)",
"sshd[2103]: Received disconnect from 10.0.0.4 port 47282:11: disconnected by user",
"CRON[1894]: pam_unix(cron:session): session closed for user root",
)
def _body_log_daemon(persona: str, rng: secrets.SystemRandom) -> str:
    n = rng.randint(10, 30)
    lines = []
    base = datetime.now(timezone.utc)
    for _ in range(n):
        lines.append(
            f"{base.strftime('%b %d %H:%M:%S')} hostname "
            f"{rng.choice(_DAEMON_LINES)}"
        )
    return "\n".join(lines) + "\n"


def _body_cache_tmp(persona: str, rng: secrets.SystemRandom) -> str:
    # ~64-256 bytes of opaque session-ish payload. Most /tmp/.cache-*
    # files in the wild are short binary or k=v dumps. We emit ASCII
    # so docker exec write paths don't need binary-safety acrobatics.
    nbytes = rng.randint(64, 256)
    chars = "abcdefghijklmnopqrstuvwxyz0123456789"
    return "session=" + "".join(rng.choice(chars) for _ in range(nbytes)) + "\n"


def _body_email(persona: str, rng: secrets.SystemRandom) -> str:
    raise NotImplementedError(
        "email bodies come from the email driver, not realism.bodies"
    )


def _body_canary(persona: str, rng: secrets.SystemRandom) -> str:
    raise NotImplementedError(
        "canary bodies come from the canary cultivator (stage 7), "
        "not realism.bodies"
    )
# ── Dispatch ───────────────────────────────────────────────────────────────
_BODIES: dict[ContentClass, Callable[[str, secrets.SystemRandom], str]] = {
ContentClass.NOTE: _body_note,
ContentClass.TODO: _body_todo,
ContentClass.DRAFT: _body_draft,
ContentClass.SCRIPT: _body_script,
ContentClass.LOG_CRON: _body_log_cron,
ContentClass.LOG_DAEMON: _body_log_daemon,
ContentClass.CACHE_TMP: _body_cache_tmp,
ContentClass.EMAIL: _body_email,
ContentClass.CANARY_AWS_CREDS: _body_canary,
ContentClass.CANARY_ENV_FILE: _body_canary,
ContentClass.CANARY_GIT_CONFIG: _body_canary,
ContentClass.CANARY_SSH_KEY: _body_canary,
ContentClass.CANARY_HONEYDOC: _body_canary,
ContentClass.CANARY_HONEYDOC_DOCX: _body_canary,
ContentClass.CANARY_HONEYDOC_PDF: _body_canary,
ContentClass.CANARY_MYSQL_DUMP: _body_canary,
}
def make_body(
    content_class: ContentClass,
    persona: str,
    *,
    rand: Optional[secrets.SystemRandom] = None,
) -> str:
    """Return a template-generated body (``str``) for *content_class*.

    Stage 3 ships templates only; stage 6 adds an optional
    ``LLMBackend`` parameter that, when supplied and the breaker is
    closed, replaces the template return for user-classes.
    """
    rng = rand or secrets.SystemRandom()
    gen = _BODIES.get(content_class)
    if gen is None:
        raise KeyError(
            f"no body generator registered for content_class={content_class!r}"
        )
    return gen(persona, rng)
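
On reproducibility: `SystemRandom` cannot be pinned to a seed, so a reproducible test needs to inject a different RNG through `rand`. The sketch below shows the idea with `random.Random`, which offers the same `randint`, `sample`, `random` and `choice` methods. `body_todo` is a trimmed stand-in that copies the logic of `_body_todo`; it is not the project's test code, and how the real tests pin their RNG is not shown in this commit.

```python
import random

TODO_VERBS = ("rotate keys", "review pr", "clean up logs", "update docs")


def body_todo(rng: random.Random) -> str:
    """Stand-in for _body_todo: checkbox markdown, about a third pre-checked."""
    n = rng.randint(3, 7)
    items = rng.sample(TODO_VERBS, k=min(n, len(TODO_VERBS)))
    out = []
    for item in items:
        marker = "[x]" if rng.random() < 0.33 else "[ ]"
        out.append(f"- {marker} {item}")
    return "\n".join(out) + "\n"


# Same seed, same body: the property a reproducible test relies on.
a = body_todo(random.Random(1234))
b = body_todo(random.Random(1234))

# SystemRandom accepts seed() but ignores it, so it cannot be pinned.
sys_rng = random.SystemRandom()
sys_rng.seed(1234)
```

Because the generators only call methods that both classes share, the production path can keep passing `secrets.SystemRandom()` unchanged.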

decnet/realism/naming.py (new file, 192 lines)

@@ -0,0 +1,192 @@
"""Per-content-class filename generators.

The pre-realism orchestrator emitted ``notes-1777315854.txt``
(unix-epoch suffix), a tell at first glance. Real users name files
``notes.txt``, ``TODO.md``, ``backup-2025-04.sql.gz``. Real systems
write ``cron.log``, ``cron.log.1``, ``cron.log.2.gz`` (logrotate
shape, no epoch).

Stage 3 ships **deterministic templates only**, persona-conditioned.
Stage 6 wires LLM enrichment for the user-classes (``note``, ``todo``,
``draft``, ``script``); the deterministic templates remain the
fallback when LLM is disabled or times out.

The factory mirrors :func:`decnet.canary.factory.get_generator`:
caller passes a :class:`~decnet.realism.taxonomy.ContentClass`; we
return the namer registered for it. Renaming a content_class is a
schema change and would invalidate ``synthetic_files.path`` lookups,
so the dispatch is exhaustive: no silent fallbacks for unknown
classes.
"""
from __future__ import annotations
import secrets
import string
from typing import Callable, Optional
from decnet.realism.taxonomy import ContentClass
# Persona → home-dir convention. Most personas are linux-style; the
# rare "windows" persona gets ``C:\\Users\\<persona>\\Documents`` style
# paths (out of scope until per-OS personas land). For now everything
# is POSIX.
def _home(persona: str) -> str:
    """Return the canonical home directory for *persona*.

    The persona's ``name``, lowercased with spaces removed, is used as
    the linux username when the result is ASCII alphanumeric; otherwise
    we fall back to a generic ``user`` so the path doesn't reveal a
    persona display name on the decky filesystem.
    """
    candidate = persona.lower().replace(" ", "")
    if candidate.isalnum() and candidate.isascii() and candidate:
        return f"/home/{candidate}"
    return "/home/user"


def _random_token(rng: secrets.SystemRandom, length: int = 6) -> str:
    """Lowercase-alphanum token of length *length*, ``mkstemp``-style."""
    alphabet = string.ascii_lowercase + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))
# ── User-class namers ──────────────────────────────────────────────────────
_NOTE_NAMES: tuple[str, ...] = (
"notes.txt", "scratch.md", "ideas.txt", "Untitled-3.txt",
"draft.md", "keys.txt", "passwords.txt", "TODO.md",
)
_TODO_NAMES: tuple[str, ...] = (
"TODO.md", "todo.txt", "things.md", "tasks.txt", "punchlist.md",
)
_DRAFT_NAMES: tuple[str, ...] = (
"Q3-budget-DRAFT.md", "proposal.md", "letter.txt",
"rfc-internal.md", "memo.txt", "1on1-notes.md",
)
_SCRIPT_NAMES: tuple[str, ...] = (
"backup.sh", "deploy.sh", "cleanup.sh", "rotate.sh",
"fix.py", "tmp.py", "scratch.py",
)
def _name_user(
    persona: str, names: tuple[str, ...], rng: secrets.SystemRandom,
) -> str:
    return f"{_home(persona)}/{rng.choice(names)}"


def _name_note(persona: str, rng: secrets.SystemRandom) -> str:
    return _name_user(persona, _NOTE_NAMES, rng)


def _name_todo(persona: str, rng: secrets.SystemRandom) -> str:
    return _name_user(persona, _TODO_NAMES, rng)


def _name_draft(persona: str, rng: secrets.SystemRandom) -> str:
    return _name_user(persona, _DRAFT_NAMES, rng)


def _name_script(persona: str, rng: secrets.SystemRandom) -> str:
    return _name_user(persona, _SCRIPT_NAMES, rng)
# ── System-class namers ────────────────────────────────────────────────────
# logrotate skeleton: cron.log, cron.log.1, cron.log.2.gz. No epoch
# suffix — the realism failure today is `cron-1777317867.log`.
_CRON_LOGROTATE: tuple[str, ...] = (
"/var/log/cron.log", "/var/log/cron.log.1", "/var/log/cron.log.2.gz",
)
_DAEMON_LOGROTATE: tuple[str, ...] = (
"/var/log/daemon.log", "/var/log/syslog", "/var/log/messages",
"/var/log/auth.log", "/var/log/auth.log.1",
)
def _name_log_cron(persona: str, rng: secrets.SystemRandom) -> str:
    return rng.choice(_CRON_LOGROTATE)


def _name_log_daemon(persona: str, rng: secrets.SystemRandom) -> str:
    return rng.choice(_DAEMON_LOGROTATE)


def _name_cache_tmp(persona: str, rng: secrets.SystemRandom) -> str:
    # mkstemp shape: /tmp/.cache-XXXXXX with random alphanumerics.
    # The leading dot keeps it out of `ls` output by default.
    # Bandit B108 fires on the literal "/tmp/" path; suppressed at the
    # site because this is a path we are *generating for a target
    # decky*, not a file we are opening on the host.
    return f"/tmp/.cache-{_random_token(rng, 6)}"  # nosec B108
# ── Email + canary placeholders ────────────────────────────────────────────
# Email "names" (paths) are produced by the email driver's spool logic,
# not by realism naming. Canary paths are advisory — operators usually
# specify ``placement_path`` directly. Stage 7 of the realism migration
# refines canary placement based on persona + content_class.
def _name_email(persona: str, rng: secrets.SystemRandom) -> str:
    raise NotImplementedError(
        "email paths come from the email driver's spool logic, not "
        "realism.naming"
    )


def _name_canary(persona: str, rng: secrets.SystemRandom) -> str:
    raise NotImplementedError(
        "canary placement is set by the canary cultivator (stage 7), "
        "not realism.naming"
    )
# ── Dispatch ───────────────────────────────────────────────────────────────
_NAMERS: dict[ContentClass, Callable[[str, secrets.SystemRandom], str]] = {
ContentClass.NOTE: _name_note,
ContentClass.TODO: _name_todo,
ContentClass.DRAFT: _name_draft,
ContentClass.SCRIPT: _name_script,
ContentClass.LOG_CRON: _name_log_cron,
ContentClass.LOG_DAEMON: _name_log_daemon,
ContentClass.CACHE_TMP: _name_cache_tmp,
ContentClass.EMAIL: _name_email,
ContentClass.CANARY_AWS_CREDS: _name_canary,
ContentClass.CANARY_ENV_FILE: _name_canary,
ContentClass.CANARY_GIT_CONFIG: _name_canary,
ContentClass.CANARY_SSH_KEY: _name_canary,
ContentClass.CANARY_HONEYDOC: _name_canary,
ContentClass.CANARY_HONEYDOC_DOCX: _name_canary,
ContentClass.CANARY_HONEYDOC_PDF: _name_canary,
ContentClass.CANARY_MYSQL_DUMP: _name_canary,
}
def make_path(
    content_class: ContentClass,
    persona: str,
    *,
    rand: Optional[secrets.SystemRandom] = None,
) -> str:
    """Return a plausible absolute container-side path for *content_class*.

    Persona-conditioned for user-classes (``/home/<persona>/…``).
    System-classes ignore persona and pick from a logrotate-shaped
    skeleton. Email and canary classes raise; those paths come
    from the respective drivers, not from realism naming.
    """
    rng = rand or secrets.SystemRandom()
    namer = _NAMERS.get(content_class)
    if namer is None:
        raise KeyError(
            f"no namer registered for content_class={content_class!r}"
        )
    return namer(persona, rng)
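
The commit message mentions an anti-regression test that pins "no 8+ digit decimals in basenames". A check of that shape might look like the following sketch. The regex, the helper name `has_epoch_suffix` and the sample paths are assumptions chosen for illustration; the project's actual test is not part of this excerpt.

```python
import posixpath
import re

# Eight or more consecutive digits is the shape of a unix-epoch suffix.
EPOCH_LIKE = re.compile(r"\d{8,}")


def has_epoch_suffix(path: str) -> bool:
    """True when the basename contains a run of 8+ decimal digits."""
    return EPOCH_LIKE.search(posixpath.basename(path)) is not None


realistic = [
    "/home/alice/notes.txt",
    "/home/alice/Q3-budget-DRAFT.md",
    "/var/log/cron.log.2.gz",
    "/tmp/.cache-a1b2c3",
]
legacy = [
    "/home/alice/notes-1777315854.txt",
    "/var/log/cron-1777317867.log",
]
```

The six-character token used for `/tmp/.cache-XXXXXX` names is too short to trip the pattern even when every character happens to be a digit.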

decnet/realism/planner.py

@@ -1,13 +1,21 @@
"""Realism planner — picks the next ``(decky, persona, class, action)`` tuple.
Stage-1 stub: the public signature is in place so the orchestrator
worker (stage 3) can import it, but the body returns ``None`` ("nothing
to do this tick") until stage 3 wires the synthetic_files table and
naming/body generators.
Stage 3: returns ``create``-only plans (the edit branch lands in
stage 3b). Pure-function, deterministic given the same inputs:
caller passes deckies (with personas pre-resolved on each row),
``now``, and an RNG.
The eventual policy lives entirely in :func:`pick`; downstream
consumers should not branch on ``ContentClass`` themselves — let the
planner decide weights and rate-limits in one place.
The persona resolution split — topology-pool vs. global-pool — is
the orchestrator's job, not the planner's. Each decky dict reaching
:func:`pick` carries a ``_realism_personas`` key with the resolved
:class:`~decnet.realism.personas.EmailPersona` list. Keeps the
planner test-isolated and avoids forcing it to know about the
:class:`~decnet.web.db.repository.BaseRepository` / topology pool /
global pool.
Diurnal gating uses :func:`decnet.realism.diurnal.in_work_hours` per
persona; we filter the (decky, persona) pairs *before* picking, so a
persona outside its window is never considered.
"""
from __future__ import annotations
@@ -15,39 +23,110 @@ import secrets
from datetime import datetime
from typing import Any, Optional, Sequence
from decnet.realism.taxonomy import Plan
from decnet.realism import bodies, naming
from decnet.realism.diurnal import in_work_hours, sample_mtime
from decnet.realism.personas import EmailPersona
from decnet.realism.taxonomy import ContentClass, Plan
# Stage-3 weighted sampling:
# * User content (notes/todo/draft/script) gets the bulk — those are
# the realism win when a persona "looks busy."
# * System content (cron/daemon/cache) is plausible filler.
# * Email + canary are owned by other paths and not picked here.
_USER_CLASS_WEIGHTS: tuple[tuple[ContentClass, int], ...] = (
(ContentClass.NOTE, 30),
(ContentClass.TODO, 20),
(ContentClass.DRAFT, 15),
(ContentClass.SCRIPT, 10),
)
_SYSTEM_CLASS_WEIGHTS: tuple[tuple[ContentClass, int], ...] = (
(ContentClass.LOG_CRON, 12),
(ContentClass.LOG_DAEMON, 8),
(ContentClass.CACHE_TMP, 5),
)
def _weighted_pick(
    weights: tuple[tuple[ContentClass, int], ...],
    rng: secrets.SystemRandom,
) -> ContentClass:
    total = sum(w for _, w in weights)
    target = rng.randint(1, total)
    running = 0
    for cls, w in weights:
        running += w
        if target <= running:
            return cls
    return weights[-1][0]  # unreachable, satisfy mypy


def _eligible_pairs(
    deckies: Sequence[dict[str, Any]],
    now: datetime,
) -> list[tuple[dict[str, Any], EmailPersona]]:
    """Cross-product of deckies × resolved personas, diurnal-filtered.

    A decky with no personas (empty ``_realism_personas``) is skipped
    entirely; same fail-quiet semantics as the emailgen scheduler.
    """
    out: list[tuple[dict[str, Any], EmailPersona]] = []
    for decky in deckies:
        personas: list[EmailPersona] = decky.get("_realism_personas") or []
        for persona in personas:
            if in_work_hours(persona.active_hours, now):
                out.append((decky, persona))
    return out
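
The cumulative walk in `_weighted_pick` can be checked without any randomness by feeding it every possible `target` from 1 to the total weight: each class should be selected exactly as many times as its weight. The sketch below does that with plain strings standing in for `ContentClass` members; `pick_for_target` is a hypothetical helper that makes the random draw an explicit argument.

```python
from collections import Counter

USER_WEIGHTS = (("note", 30), ("todo", 20), ("draft", 15), ("script", 10))


def pick_for_target(weights, target):
    """Same cumulative walk as _weighted_pick, with the draw passed in."""
    running = 0
    for cls, w in weights:
        running += w
        if target <= running:
            return cls
    raise ValueError("target exceeds total weight")


total = sum(w for _, w in USER_WEIGHTS)
# One call per possible draw of rng.randint(1, total).
counts = Counter(pick_for_target(USER_WEIGHTS, t) for t in range(1, total + 1))
```

The standard library's `random.choices(population, weights=...)` samples from the same distribution, should a shorter implementation ever be preferred.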
def pick(
deckies: Sequence[dict[str, Any]],
now: datetime,
*,
repo: Any = None,
rand: Optional[secrets.SystemRandom] = None,
) -> Optional[Plan]:
"""Return the next :class:`Plan` for the orchestrator's tick.
"""Return a single :class:`Plan` for the orchestrator's tick.
Stage-1 stub returns ``None`` unconditionally so the orchestrator
can import this function before the real implementation lands. The
full policy (diurnal gate, action distribution 60/30/10
create/edit/leave, content-class weights, canary rate-limit) lands
in stage 3 of the realism migration.
Stage-3 policy: create-only. Stage 3b extends with the
create/edit/leave roll and the synthetic_files lookup for edits.
Parameters
----------
deckies :
Output of :meth:`BaseRepository.list_running_deckies`. Each
entry must carry ``uuid``, ``name``, ``services``,
``email_personas`` (topology-pool JSON or list).
now :
Tick timestamp. Injected so tests don't need to monkey-patch
:func:`datetime.utcnow`.
repo :
:class:`BaseRepository` for synthetic_files lookup (edit
action). Unused by the stage-3 create-only policy; stage 3b
needs it for the edit branch.
rand :
RNG for sampling. Defaults to a fresh
:class:`secrets.SystemRandom`.
Returns ``None`` when no eligible (decky, persona) pair exists —
the orchestrator treats that as "skip this tick" the same way the
pre-realism scheduler did.
"""
_ = (deckies, now, repo, rand) # silence unused-arg until stage 3
return None
    rng = rand or secrets.SystemRandom()
    eligible = _eligible_pairs(deckies, now)
    if not eligible:
        return None
    decky, persona = rng.choice(eligible)
    # User vs system content: biased toward user (realism wins are
    # bigger there). Once stage 3b ships edit-in-place, the edit
    # branch will reuse the same content_class as the existing row;
    # the create branch picks fresh here.
    if rng.random() < 0.7:
        content_class = _weighted_pick(_USER_CLASS_WEIGHTS, rng)
    else:
        content_class = _weighted_pick(_SYSTEM_CLASS_WEIGHTS, rng)
    target_path = naming.make_path(content_class, persona.name, rand=rng)
    body_hint = bodies.make_body(content_class, persona.name, rand=rng)
    mtime = sample_mtime(persona.active_hours, now, rand=rng)
    return Plan(
        decky_uuid=decky["uuid"],
        decky_name=decky["name"],
        persona=persona.name,
        content_class=content_class,
        action="create",
        target_path=target_path,
        mtime=mtime,
        body_hint=body_hint,
        notes=(
            f"persona={persona.name}",
            f"class={content_class.value}",
            f"window={persona.active_hours}",
        ),
    )
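
Combining the 70/30 user/system roll with the per-class weights gives the overall chance of each content class per created file. The arithmetic is worked below with exact fractions; the lowercase string keys are placeholders for the `ContentClass` members, whose actual values are not shown in this diff.

```python
from fractions import Fraction

USER = {"note": 30, "todo": 20, "draft": 15, "script": 10}
SYSTEM = {"log_cron": 12, "log_daemon": 8, "cache_tmp": 5}
USER_SHARE = Fraction(7, 10)  # mirrors `rng.random() < 0.7`


def class_probabilities():
    """P(class) = P(branch) * weight / sum of weights in that branch."""
    probs = {}
    for table, share in ((USER, USER_SHARE), (SYSTEM, 1 - USER_SHARE)):
        total = sum(table.values())
        for name, weight in table.items():
            probs[name] = share * Fraction(weight, total)
    return probs


probs = class_probabilities()
```

For example, a note comes out at 7/10 × 30/75 = 7/25 (28%) and a cache file at 3/10 × 5/25 = 3/50 (6%).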