Adds 12 new primitives across two waves of spec work this session.
meta.* layer (8 primitives) — corpus-snapshot footprint:
total_messages, corpus_span_days, msg_per_day, active_days,
activity_density, first_seen_ts, last_seen_ts, fingerprint_confidence.
Motivated by two actors with identical message counts (53 each) producing
indistinguishable profiles despite radically different presence shapes
(0.3-day burst vs 47-day long tail).
Language-aware characterization primitives (4 primitives):
stylometric.pos_ngram_signature — SimHash over POS bigram frequency vector;
syntactic skeleton fingerprint that survives full vocabulary paraphrase.
lexical.dialect_region — BCP-47 free_string (es-CL, es-AR, es-MX, …);
designed for EYENET integration with INGEOTEC regional-spanish-models.
lexical.evaluative_morphology_density — diminutive/augmentative/pejorative
suffix density; stable per-author trait baked into language acquisition.
lexical.optional_grammar_signature — SimHash over optional-grammar choice
points (compound/simple past, subjunctive, leísmo, relative pronoun);
high-reliability Spain vs LatAm discriminator.
Also fixes stale scratchpad.md references throughout (README.md is now the
authority), bumps behave-text to 0.1.3, and updates CHANGELOG.
behave-text
Text/messaging-domain behavioral observation registry. Defines what can be observed about an actor through their written messaging activity — stylometric fingerprints, lexical patterns, interaction rhythms, and governance-role signals.
BEHAVE-TEXT operates on derived features, not raw text. Sensors hash, aggregate, and classify before emitting — the raw message content never enters a BEHAVE observation. This is a tighter constraint than BEHAVE-SHELL because the source signal is text content; the PII risk is higher.
The topic prefix is actor.observation.text (not attacker.) because chat groups
include non-attacker roles — admins, buyers, sellers, bots, lurkers. The framing
is deliberately neutral: BEHAVE-TEXT observes actors, not adversaries.
Install
pip install behave-text
For local development:
pip install -e ../core/ -e ".[dev]"
Quickstart
from behave_text.spec import Observation, Window, TOPIC_PREFIX, event_topic_for
obs = Observation(
primitive="stylometric.capitalization_habit",
value="lowercase",
confidence=0.91,
window=Window(start_ts=1714000000.0, end_ts=1714086400.0),
source="behave/text-sensor/stylometry.py",
)
topic = event_topic_for("stylometric.capitalization_habit")
# → "actor.observation.text.stylometric.capitalization_habit"
Public API (behave_text.spec)
| Symbol | Description |
|---|---|
Observation |
Registry-aware subclass of behave_core.spec.Observation. Validates primitive and value against PRIMITIVE_REGISTRY. |
Window |
Re-exported from behave_core. |
ObservationValue |
Re-exported union type. |
PRIMITIVE_REGISTRY |
dict[str, ValueTypeSpec] — the full primitive catalog (47 entries). |
ValueKind |
Enum: CATEGORICAL, NUMERIC, HASH, ARRAY, FREE_STRING, BOOL. |
ValueTypeSpec |
Pydantic model: kind, allowed values, bounds, notes. |
is_known(primitive) |
bool — whether a primitive path is registered. |
get(primitive) |
Returns the ValueTypeSpec; raises KeyError if unknown. |
TOPIC_PREFIX |
"actor.observation.text" |
event_topic_for(primitive) |
Returns the full event bus topic string. |
Note: to_event_payload / from_event_payload (full round-trip helpers) are
present in behave-shell but not yet implemented here — status: planned.
Primitives
47 primitives across 7 categories.
meta.* — Corpus-snapshot footprint (8 primitives)
Meta primitives describe the actor's presence in the corpus window itself —
how many messages, how long a span, how densely distributed. They are not
stylometric features; they are the scaffolding that other primitives assume.
Several primitives (notably temporal_evolution.lifecycle_phase) implicitly
depend on these quantities; meta.* makes them first-class so downstream
attribution engines can access and weight them explicitly.
| Primitive | Kind | Description |
|---|---|---|
meta.total_messages |
numeric | Raw message count for this actor in the corpus snapshot. Anchor for msg_per_day and fingerprint_confidence. |
meta.corpus_span_days |
numeric | Wall-clock fractional days between first and last message. First-to-last only — blind to gaps. A 47-day span with 5 active days still yields 47. Recomputable from first_seen_ts / last_seen_ts. |
meta.msg_per_day |
numeric | total_messages / corpus_span_days. Separates bursty visitors (53 msgs / 0.3 days = 53/day) from long-tail lurkers (53 msgs / 47 days = 1.1/day). Undefined when span = 0; extractors emit null/omit rather than divide-by-zero. |
meta.active_days |
numeric | Distinct calendar days (UTC) with ≥1 message. Always ≤ corpus_span_days. Distinguishes a periodic visitor (span=47, active=3) from a near-daily regular (span=47, active=40). |
meta.activity_density |
numeric [0,1] | active_days / corpus_span_days. 1.0 = present every day of the window. Near-0 = appeared once or twice across a long window. Undefined when span = 0; emit null/omit for single-day actors. |
meta.first_seen_ts |
free_string | ISO 8601 timestamp (UTC offset) of the actor's earliest message. Anchors corpus_span_days in absolute time for cross-extraction comparison. |
meta.last_seen_ts |
free_string | ISO 8601 timestamp (UTC offset) of the actor's latest message. See first_seen_ts. |
meta.fingerprint_confidence |
categorical | Qualitative reliability of this actor's full fingerprint: low, medium, high. Attribution engines should weight all other observations by this before compositing. Derivation is extractor-defined — extractors declare their heuristic in the source label (e.g. #confidence-v1). |
stylometric.* — Writing style fingerprints (13 primitives)
Stylometric primitives capture the unconscious writing habits that distinguish one author from another. The field goes back to the Mosteller-Wallace Federalist Papers study (1963): function-word frequencies alone can attribute authorship with high accuracy in long-form English text. BEHAVE-TEXT adapts these methods to short-form Spanish chat, which introduces domain-specific challenges (short messages, informal register, code-switching, emoji). Calibration results from the Rutify corpus are noted inline where they affect interpretation.
| Primitive | Kind | Description |
|---|---|---|
stylometric.punctuation_style |
hash | Canonical punctuation-pattern fingerprint hash. Captures the author's consistent punctuation tics (double spaces, comma habits, no-period endings) as a searchable signature. |
stylometric.capitalization_habit |
categorical | Dominant capitalization rule. lowercase = no capitals. proper = standard sentence/title case. random_caps = no consistent rule. mixed_i = consistent lowercase 'i' mid-sentence — common in Spanish chat where the standalone-'I' habit doesn't apply but the behavior transfers. |
stylometric.emoji_usage |
categorical | Rate of emoji use. none, occasional, frequent, exclusive (messages rarely without emoji). Captures tone and register. |
stylometric.emoji_placement |
categorical | Emoji position relative to sentence-ending punctuation. pre_punctuation = 'Hola 😊.' post_punctuation = 'Hola. 😊' Individual authors are strikingly consistent in this micro-habit. |
stylometric.message_length_class |
categorical | Median message length bucket: short 1-5 words, medium 6-20, long 21-50, paragraph >50. See also message_length_variance_class for distribution shape. |
stylometric.message_length_variance_class |
categorical | Distribution shape of per-message word counts. tight CV<0.5 (always 1-3 words). varied 0.5≤CV<1.5 (normal mix). bimodal CV≥1.5 (mostly short with occasional rants). Two authors can share the same median length but have wildly different variance. |
stylometric.linebreak_style |
categorical | Whether the author sends one complete thought per message or bursts multiple short sequential messages. multi_line = habitual 3-5 short messages per turn. wall_of_text = dense blocks, rarely uses line breaks. Captures a stylistic rhythm that is hard to consciously alter. |
stylometric.typo_signature |
hash | SHA-256 of the canonical persistent-typo set — the specific recurring errors the author makes consistently (e.g. always writes tener as tenet, or porque as xq). Persistent typos are strong authorship signals because they reflect keyboard-motor habits. |
stylometric.function_word_distribution_top50 |
hash | 64-bit SimHash over the 50 most common Spanish function-word frequency vector. Based on the Mosteller-Wallace method. Calibration note (2026-05-02, Rutify corpus): within-author and cross-author Hamming distance distributions overlap (within median 8 bits, cross median 10 bits) in short-message chat — this primitive alone cannot discriminate authors. Engines should weight it low and composite with character n-grams and distinctive vocabulary. Kept in v0 for calibration grids. |
stylometric.function_word_distribution_top200 |
hash | 64-bit SimHash over the 200 most common Spanish function words. The wider list reaches into the long tail (rare-but-individual words like tampoco, aunque, mientras) that carry more discriminating signal in short-message corpora. Not yet emitted by v0 prototype — populated in v0.2. |
stylometric.character_ngram_simhash |
hash | 64-bit SimHash over character n-gram frequencies (default n=3), lowercased. Orthogonal to function-word distributions: captures punctuation tics, accent-stripping habits, typo patterns, and idiom fragments that survive paraphrase. Accents are preserved because accent-stripping is itself a stylistic tic. Source label declares n size (e.g. #char3gram). |
stylometric.distinctive_vocabulary_signature |
hash | 64-bit SimHash over a TF-IDF-weighted top-K rare-word vector. Captures the author's distinctive lexicon — words they use that other authors in the same corpus do not. Complementary to function-word distributions: where function_word_* captures common-word style, this captures individual lexical choice. Requires the full corpus for IDF computation. Source label declares top-K and corpus tag (e.g. #tfidf-top50). |
stylometric.pos_ngram_signature |
hash | 64-bit SimHash over a POS n-gram (default bigram) frequency vector. Captures syntactic skeleton independent of vocabulary — an author can change every word and retain the same grammatical fingerprint. Orthogonal to character n-grams and function-word distributions. Tagger-dependent: source label must declare tagger, language model, and n (e.g. #spacy-es_core_news_sm-bi). Calibration note: chat-domain text produces tagger noise — weight low until validated on labelled chat corpora. |
lexical.* — Vocabulary and linguistic patterns (11 primitives)
Lexical primitives characterize what and how an actor writes at the word and sentence level. Where stylometric primitives fingerprint unconscious micro-habits, lexical primitives capture deliberate linguistic choices — vocabulary richness, how questions are formed, register.
| Primitive | Kind | Description |
|---|---|---|
lexical.vocabulary_richness |
numeric [0,1] | Moving-Average Type-Token Ratio (MATTR) over a sliding window (default 50 tokens). Volume-independent: each window contributes its own unique/total ratio, the value is the mean. Avoids the standard TTR bias where larger corpora mechanically score lower. Source label declares window size. |
lexical.slang_density |
numeric [0,1] | Rate of slang terms per message, against a locale-tuned slang corpus. |
lexical.code_switching_rate |
numeric [0,1] | Language switches per N tokens (Solorio & Liu metric). A speaker who switches between Spanish and English, or Spanish and lunfardo/caló, will have a higher rate than a monolingual writer. |
lexical.code_switching_matrix_language |
free_string | BCP-47 tag of the dominant (matrix) language in code-switching texts (e.g. es-CL, es-AR). The matrix language is the grammatical scaffold; embedded languages appear as inserts. |
lexical.code_switching_embedded_languages |
array[free_string] | BCP-47 list of non-matrix languages observed in the actor's messages. |
lexical.sentence_complexity_class |
categorical | Dominant clause structure. simple = single-clause. compound = two independent clauses joined by coordinating conjunctions (pero, y, o). complex = dependent clauses and subordination (aunque, porque, cuando). Reflects education level and cognitive investment. |
lexical.question_formation_style |
categorical | How questions are formed. punctuation_only = question mark without interrogative words ('¿Cuánto?') — very common in Spanish chat. lexical = explicit interrogatives (¿qué, cómo, cuándo). formal = inverted subject-verb or formal register. |
lexical.imperative_style |
categorical | How commands and requests are framed. informal_directive = tú/vos imperative (dame, hazlo). formal_directive = usted imperative (hágame el favor). polite = conditional/modal softening (¿podría...?). Stable per-author trait in hierarchical contexts. |
lexical.dialect_region |
free_string | Dominant regional variety of the actor's matrix language as a BCP-47 language-region tag (e.g. es-CL, es-AR, es-MX, es-ES, en-US). Detected from lexical marker density against per-region vocabulary tables. Emit literal unknown below confidence threshold. Detection method declared in source label (e.g. #dialect-markers-v1). Complementary to code_switching_matrix_language, which derives language via switching analysis rather than direct marker lookup. |
lexical.evaluative_morphology_density |
numeric [0,1] | Rate of evaluative morpheme tokens / total tokens. Covers Spanish diminutives (-ito/-ita), augmentatives (-ón/-ote), pejoratives (-ejo/-ucho), and intensives (-azo). Heavy diminutive use is characteristic of Mexican/Central American Spanish; River Plate speakers use them significantly less. Stable per-author — baked into language acquisition and hard to consciously suppress. Source label declares morpheme set and tool version (e.g. #eval-morph-es-v1). |
lexical.optional_grammar_signature |
hash | 64-bit SimHash over the author's preference probability vector at optional-grammar choice points. For Spanish: compound vs simple past (he comido vs comí — high-reliability Spain/LatAm discriminator), subjunctive usage rate, leísmo/laísmo/loísmo clitic patterns, and relative pronoun choice (que vs el cual). Each choice point is a scalar [0,1]; the SimHash is computed over the concatenated vector. Choice-point set is extractor-defined and declared in source label (e.g. #optgrammar-es-v1). Requires sufficient corpus volume for stable probabilities — gate on meta.fingerprint_confidence before use. |
temporal_evolution.* — Behavioral change over time (1 primitive)
| Primitive | Kind | Description |
|---|---|---|
temporal_evolution.lifecycle_phase |
categorical | Auto-classified lifecycle stage from windowed within-corpus analysis. arrival_burst = first 24hr, first-window volume dominates (empirically validated against OxPayload's first 12 hours in Rutify). stable_member = low drift across the full tenure. fluctuating_member = tenure ≥24hr with median drift between stable and inflection thresholds — established noisy regulars (e.g. lamarabitch). inflection_member = long-tenure actor with a real behavioral shift in at least one window-pair. declining_member = monotonically decreasing per-window message counts. unknown = insufficient data. Window size adapts to tenure: <24hr → 2h, <7d → 12h, <30d → 1d, otherwise 7d. |
network.* — Governance and role signals (2 primitives)
Network primitives capture the actor's structural role in the group — inferred from interaction patterns rather than content — and a bot detector. These are heuristic composites built from other primitives; treat them as candidate signals, not verdicts.
| Primitive | Kind | Description |
|---|---|---|
network.is_likely_bot |
categorical | Heuristic bot detector. likely_bot when conversation_initiation_rate ≥ 0.95 AND attention_pattern = broadcast AND vocabulary_richness < 0.65. Validated (2026-05-03) against SangMata_beta_bot (caught) vs 11 high-volume humans (no false positives). Low-volume bots (e.g. QuotLyBot, 9 messages) sit below the fingerprint threshold. Source label declares heuristic version (e.g. #bot-heuristic-v1). |
network.governance_role_signal |
categorical | Heuristic role shape from interaction primitives + lifecycle. admin_pattern = init_rate ≥ 0.80, attention reciprocal, non-bot, non-arrival_burst. responder_pattern = init_rate ≤ 0.45, attention reciprocal. bot_pattern = matches is_likely_bot. regular = everything else above volume threshold. Empirically caught 4/4 high-volume Rutify admins, sebaImlI as responder, SangMata as bot. NOT a ground-truth admin label. |
interaction.* — Messaging behavior (6 primitives)
Interaction primitives characterize how the actor participates in conversations — timing, initiation rate, and attention patterns.
| Primitive | Kind | Description |
|---|---|---|
interaction.response_latency_class |
categorical | How quickly the actor responds to messages directed at them. immediate <30s (suggests active monitoring or automation). fast 30s-5min. normal 5-60min. slow 1-24hr. sporadic = no consistent pattern. |
interaction.conversation_initiation_rate |
numeric [0,1] | Thread-starting messages / total messages. High rate = the actor drives conversations. |
interaction.message_burst_rate |
categorical | Whether the actor sends multiple messages per turn. habitual = almost always bursts (3+ messages before any reply). single = almost always one message per turn. Tied to stylometric.linebreak_style multi_line. |
interaction.active_hours_class |
free_string | UTC active-hours window summary (e.g. 05:00-14:00 UTC). Free string — the window shape varies by actor and doesn't fit a closed enum. |
interaction.session_duration_class |
categorical | Typical session length: short <15min, medium 15-90min, long 90min-4hr, marathon >4hr. Shares the enum with behave_shell's temporal.session_duration. |
interaction.attention_pattern |
categorical | Reply-graph centrality shape. broadcast = sends to many, replies to few (one-to-many). focused = concentrates on a small set of interlocutors. reciprocal = balanced give-and-take. |
content.* — Content-derived signals, EXPERIMENTAL (6 primitives)
Content primitives are derived from message text through classifiers rather than structural/timing analysis. They carry the highest risk of false positives, are brittle to vocabulary drift, and are locale-specific. An attribution engine may choose to weight these at zero until field-validated against labeled data.
| Primitive | Kind | Description |
|---|---|---|
content.role_signal |
categorical | Locale-tuned role-vocabulary classifier. Values: admin, seller, buyer, lurker, newbie. May be moved to a separate IOC/keyword-detection layer after Rutify testing. EXPERIMENTAL |
content.transactional_language |
numeric [0,1] | Rate of transactional terms per message. Locale-specific; brittle to vocabulary drift. EXPERIMENTAL |
content.opsec_awareness |
numeric [0,1] | Rate of security-conscious phrases. HIGH FALSE-POSITIVE RISK on casual conversation about deleting files/messages. EXPERIMENTAL |
content.targeting_language |
array[free_string] | IOC-shaped target patterns (bank names, government portals, RUT ranges). Consider moving to a dedicated IOC layer. EXPERIMENTAL |
content.boasting_pattern |
categorical | Success-claim frequency: none, occasional, frequent. Corpus-dependent regex. EXPERIMENTAL |
content.conflict_style |
categorical | Dispute-tone classification: aggressive, defusing, appellate. Needs labelled training data. EXPERIMENTAL |
Schema
Machine-readable JSON Schema:
json/observation.schema.json
Regenerate after model changes:
python scripts/generate_schema.py
Tests
pytest tests/
Attribution recipes
attribution-recipes.md — placeholder document sketching
how an external attribution engine would consume actor.observation.text.* topics
to build actor profiles (credential_broker, low_skill_buyer, group_admin, etc.).
Not populated yet — awaiting Rutify corpus calibration. Not part of the BEHAVE spec.
License
Code and schemas: GPL-3.0-or-later Spec prose (this file, attribution-recipes.md): CC-BY-SA-4.0