2026-05-29 18:13:41 -04:00

behave-text

Text/messaging-domain behavioral observation registry. Defines what can be observed about an actor through their written messaging activity — stylometric fingerprints, lexical patterns, interaction rhythms, and governance-role signals.

BEHAVE-TEXT operates on derived features, not raw text. Sensors hash, aggregate, and classify before emitting — the raw message content never enters a BEHAVE observation. This is a tighter constraint than BEHAVE-SHELL because the source signal is text content; the PII risk is higher.

The topic prefix is actor.observation.text (not attacker.) because chat groups include non-attacker roles — admins, buyers, sellers, bots, lurkers. The framing is deliberately neutral: BEHAVE-TEXT observes actors, not adversaries.

Install

pip install behave-text

For local development:

pip install -e ../core/ -e ".[dev]"

Quickstart

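from behave_text.spec import Observation, Window, TOPIC_PREFIX, event_topic_for

# Build an observation; `primitive` and `value` are validated against PRIMITIVE_REGISTRY.
obs = Observation(
    primitive="stylometric.capitalization_habit",
    value="lowercase",
    confidence=0.91,
    window=Window(start_ts=1714000000.0, end_ts=1714086400.0),  # end_ts - start_ts = 86400
    source="behave/text-sensor/stylometry.py",
)

# Resolve the event bus topic for this primitive.
topic = event_topic_for("stylometric.capitalization_habit")
# → "actor.observation.text.stylometric.capitalization_habit"
assert topic.startswith(TOPIC_PREFIX)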

Public API (`behave_text.spec`)

Symbol	Description
`Observation`	Registry-aware subclass of `behave_core.spec.Observation`. Validates `primitive` and `value` against `PRIMITIVE_REGISTRY`.
`Window`	Re-exported from `behave_core`.
`ObservationValue`	Re-exported union type.
`PRIMITIVE_REGISTRY`	`dict[str, ValueTypeSpec]` — the full primitive catalog (47 entries).
`ValueKind`	Enum: `CATEGORICAL`, `NUMERIC`, `HASH`, `ARRAY`, `FREE_STRING`, `BOOL`.
`ValueTypeSpec`	Pydantic model: kind, allowed values, bounds, notes.
`is_known(primitive)`	`bool` — whether a primitive path is registered.
`get(primitive)`	Returns the `ValueTypeSpec`; raises `KeyError` if unknown.
`TOPIC_PREFIX`	`"actor.observation.text"`
`event_topic_for(primitive)`	Returns the full event bus topic string.

Note: to_event_payload / from_event_payload (full round-trip helpers) are present in behave-shell but not yet implemented here — status: planned.
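
To make the validation contract in the table above concrete, here is a minimal stand-alone sketch of how a registry lookup, value check, and topic derivation could behave. It deliberately does not import `behave_text`: `MiniKind`, `MiniSpec`, `MINI_REGISTRY`, and `validate` are hypothetical stand-ins, the functions only mirror the documented names, and the two registry entries are copied from the primitive tables in this README. The real package may be implemented differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MiniKind(Enum):
    """Hypothetical two-member subset of the documented ValueKind enum."""
    CATEGORICAL = "categorical"
    NUMERIC = "numeric"


@dataclass(frozen=True)
class MiniSpec:
    """Hypothetical, simplified stand-in for ValueTypeSpec."""
    kind: MiniKind
    allowed: tuple = ()              # permitted values for categorical kinds
    bounds: Optional[tuple] = None   # (low, high) for numeric kinds


TOPIC_PREFIX = "actor.observation.text"

# Two entries only, for illustration; the real registry holds 47.
MINI_REGISTRY = {
    "stylometric.capitalization_habit": MiniSpec(
        MiniKind.CATEGORICAL,
        allowed=("lowercase", "proper", "random_caps", "mixed_i"),
    ),
    "interaction.conversation_initiation_rate": MiniSpec(
        MiniKind.NUMERIC, bounds=(0.0, 1.0)
    ),
}


def is_known(primitive: str) -> bool:
    return primitive in MINI_REGISTRY


def get(primitive: str) -> MiniSpec:
    # Raises KeyError for unknown primitives, matching the documented behavior.
    return MINI_REGISTRY[primitive]


def validate(primitive: str, value) -> None:
    """Raise ValueError if the value does not fit the registered spec."""
    spec = get(primitive)
    if spec.kind is MiniKind.CATEGORICAL and value not in spec.allowed:
        raise ValueError(f"{value!r} is not allowed for {primitive}")
    if spec.kind is MiniKind.NUMERIC and spec.bounds is not None:
        low, high = spec.bounds
        if not (low <= value <= high):
            raise ValueError(f"{value!r} is outside [{low}, {high}]")


def event_topic_for(primitive: str) -> str:
    return f"{TOPIC_PREFIX}.{primitive}"
```

In this sketch an unknown primitive surfaces as `KeyError` from `get`, while a known primitive with a bad value surfaces as `ValueError`, so callers can tell the two failure modes apart.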

Primitives

47 primitives across 7 categories.

`meta.*` — Corpus-snapshot footprint (8 primitives)

Meta primitives describe the actor's presence in the corpus window itself — how many messages, how long a span, how densely distributed. They are not stylometric features; they are the scaffolding that other primitives assume. Several primitives (notably temporal_evolution.lifecycle_phase) implicitly depend on these quantities; meta.* makes them first-class so downstream attribution engines can access and weight them explicitly.

Primitive	Kind	Description
`meta.total_messages`	numeric	Raw message count for this actor in the corpus snapshot. Anchor for `msg_per_day` and `fingerprint_confidence`.
`meta.corpus_span_days`	numeric	Wall-clock fractional days between first and last message. First-to-last only — blind to gaps. A 47-day span with 5 active days still yields 47. Recomputable from `first_seen_ts` / `last_seen_ts`.
`meta.msg_per_day`	numeric	`total_messages / corpus_span_days`. Separates bursty visitors (53 msgs / 0.3 days ≈ 177/day) from long-tail lurkers (53 msgs / 47 days ≈ 1.1/day). Undefined when span = 0; extractors emit null or omit the observation rather than dividing by zero.
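`meta.active_days`	numeric	Distinct calendar days (UTC) with ≥1 message. Normally ≤ `corpus_span_days`, though it can exceed the fractional span when messages straddle UTC midnight (two messages minutes apart can touch two calendar days). Distinguishes a periodic visitor (span=47, active=3) from a near-daily regular (span=47, active=40).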
`meta.activity_density`	numeric [0,1]	`active_days / corpus_span_days`. 1.0 = present every day of the window. Near-0 = appeared once or twice across a long window. Undefined when span = 0; emit null/omit for single-day actors.
`meta.first_seen_ts`	free_string	ISO 8601 timestamp (UTC offset) of the actor's earliest message. Anchors `corpus_span_days` in absolute time for cross-extraction comparison.
`meta.last_seen_ts`	free_string	ISO 8601 timestamp (UTC offset) of the actor's latest message. See `first_seen_ts`.
`meta.fingerprint_confidence`	categorical	Qualitative reliability of this actor's full fingerprint: `low`, `medium`, `high`. Attribution engines should weight all other observations by this before compositing. Derivation is extractor-defined — extractors declare their heuristic in the source label (e.g. `#confidence-v1`).
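
The relationships between the meta primitives in the table above can be sketched in a few lines. This is an illustrative computation, not the extractor's code: it assumes message timestamps arrive as epoch seconds, the function name `meta_footprint` is hypothetical, and clamping `activity_density` to 1.0 is an assumption made here to keep the value inside the documented [0,1] range. `meta.fingerprint_confidence` is left out because its derivation is extractor-defined.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400.0


def _iso_utc(ts: float) -> str:
    """ISO 8601 string with an explicit UTC offset."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()


def meta_footprint(timestamps: list[float]) -> dict[str, object]:
    """Compute meta.* footprint values from epoch-second message timestamps."""
    if not timestamps:
        raise ValueError("need at least one message")
    first, last = min(timestamps), max(timestamps)
    # First-to-last wall-clock span; blind to gaps in between.
    span_days = (last - first) / SECONDS_PER_DAY
    # Distinct UTC calendar days with at least one message.
    active_days = len(
        {datetime.fromtimestamp(ts, tz=timezone.utc).date() for ts in timestamps}
    )
    defined = span_days > 0
    return {
        "meta.total_messages": len(timestamps),
        "meta.corpus_span_days": span_days,
        "meta.active_days": active_days,
        # Undefined when span is 0: emit None instead of dividing by zero.
        "meta.msg_per_day": len(timestamps) / span_days if defined else None,
        # Clamp to 1.0 (assumption) for short spans that straddle midnight.
        "meta.activity_density": (
            min(1.0, active_days / span_days) if defined else None
        ),
        "meta.first_seen_ts": _iso_utc(first),
        "meta.last_seen_ts": _iso_utc(last),
    }
```

A single-message actor has a span of 0, so both ratio fields come back as `None` and a caller can omit those observations entirely.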

`stylometric.*` — Writing style fingerprints (13 primitives)

Stylometric primitives capture the unconscious writing habits that distinguish one author from another. The field goes back to the Mosteller-Wallace Federalist Papers study (1963): function-word frequencies alone can attribute authorship with high accuracy in long-form English text. BEHAVE-TEXT adapts these methods to short-form Spanish chat, which introduces domain-specific challenges (short messages, informal register, code-switching, emoji). Calibration results from the Rutify corpus are noted inline where they affect interpretation.

Primitive	Kind	Description
`stylometric.punctuation_style`	hash	Canonical punctuation-pattern fingerprint hash. Captures the author's consistent punctuation tics (double spaces, comma habits, no-period endings) as a searchable signature.
`stylometric.capitalization_habit`	categorical	Dominant capitalization rule. `lowercase` = no capitals. `proper` = standard sentence/title case. `random_caps` = no consistent rule. `mixed_i` = consistent lowercase 'i' mid-sentence — common in Spanish chat where the standalone-'I' habit doesn't apply but the behavior transfers.
`stylometric.emoji_usage`	categorical	Rate of emoji use. `none`, `occasional`, `frequent`, `exclusive` (messages rarely without emoji). Captures tone and register.
`stylometric.emoji_placement`	categorical	Emoji position relative to sentence-ending punctuation. `pre_punctuation` = 'Hola 😊.' `post_punctuation` = 'Hola. 😊' Individual authors are strikingly consistent in this micro-habit.
`stylometric.message_length_class`	categorical	Median message length bucket: `short` 1-5 words, `medium` 6-20, `long` 21-50, `paragraph` >50. See also `message_length_variance_class` for distribution shape.
`stylometric.message_length_variance_class`	categorical	Distribution shape of per-message word counts. `tight` CV<0.5 (always 1-3 words). `varied` 0.5≤CV<1.5 (normal mix). `bimodal` CV≥1.5 (mostly short with occasional rants). Two authors can share the same median length but have wildly different variance.
`stylometric.linebreak_style`	categorical	Whether the author sends one complete thought per message or bursts multiple short sequential messages. `multi_line` = habitual 3-5 short messages per turn. `wall_of_text` = dense blocks, rarely uses line breaks. Captures a stylistic rhythm that is hard to consciously alter.
`stylometric.typo_signature`	hash	SHA-256 of the canonical persistent-typo set — the specific recurring errors the author makes consistently (e.g. always writes `tener` as `tenet`, or `porque` as `xq`). Persistent typos are strong authorship signals because they reflect keyboard-motor habits.
`stylometric.function_word_distribution_top50`	hash	64-bit SimHash over the frequency vector of the 50 most common Spanish function words. Based on the Mosteller-Wallace method. Calibration note (2026-05-02, Rutify corpus): in short-message chat, within-author and cross-author Hamming distance distributions overlap (within-author median 8 bits, cross-author median 10 bits), so this primitive alone cannot discriminate authors. Engines should weight it low and composite it with character n-grams and distinctive vocabulary. Kept in v0 for calibration grids.
`stylometric.function_word_distribution_top200`	hash	64-bit SimHash over the 200 most common Spanish function words. The wider list reaches into the long tail (rare-but-individual words like `tampoco`, `aunque`, `mientras`) that carry more discriminating signal in short-message corpora. Not yet emitted by v0 prototype — populated in v0.2.
`stylometric.character_ngram_simhash`	hash	64-bit SimHash over character n-gram frequencies (default n=3), lowercased. Orthogonal to function-word distributions: captures punctuation tics, accent-stripping habits, typo patterns, and idiom fragments that survive paraphrase. Accents are preserved because accent-stripping is itself a stylistic tic. Source label declares n size (e.g. `#char3gram`).
`stylometric.distinctive_vocabulary_signature`	hash	64-bit SimHash over a TF-IDF-weighted top-K rare-word vector. Captures the author's distinctive lexicon — words they use that other authors in the same corpus do not. Complementary to function-word distributions: where `function_word_*` captures common-word style, this captures individual lexical choice. Requires the full corpus for IDF computation. Source label declares top-K and corpus tag (e.g. `#tfidf-top50`).
`stylometric.pos_ngram_signature`	hash	64-bit SimHash over a POS n-gram (default bigram) frequency vector. Captures syntactic skeleton independent of vocabulary — an author can change every word and retain the same grammatical fingerprint. Orthogonal to character n-grams and function-word distributions. Tagger-dependent: source label must declare tagger, language model, and n (e.g. `#spacy-es_core_news_sm-bi`). Calibration note: chat-domain text produces tagger noise — weight low until validated on labelled chat corpora.
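
Several rows above describe a 64-bit SimHash over a frequency vector. A minimal sketch of that general technique, applied to character trigrams, is shown here. It is not the sensor's implementation: the per-feature hash (BLAKE2b truncated to 8 bytes), the tie-breaking rule, and the function names `simhash64`, `char_ngrams`, and `hamming` are all assumptions chosen for illustration.

```python
import hashlib
from collections import Counter


def simhash64(weights: dict[str, float]) -> int:
    """64-bit SimHash of a weighted feature vector."""
    acc = [0.0] * 64
    for feature, weight in weights.items():
        # Hash each feature to 64 bits (hash choice is an assumption).
        digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(64):
            # Each feature votes +weight or -weight on every bit position.
            acc[bit] += weight if (h >> bit) & 1 else -weight
    # A bit is set when the positive votes win; ties resolve to 0.
    return sum(1 << bit for bit in range(64) if acc[bit] > 0)


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Character n-gram counts, lowercased, with accents preserved."""
    lowered = text.lower()
    return Counter(lowered[i:i + n] for i in range(len(lowered) - n + 1))


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


# The raw text stays inside the sensor; only the integer would be emitted.
fingerprint = simhash64(char_ngrams("hola como estas"))
```

Because similar feature vectors produce fingerprints that differ in few bits, two fingerprints are compared with `hamming` rather than by equality. The same `simhash64` function would accept function-word counts or POS n-gram counts as its input vector.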

`lexical.*` — Vocabulary and linguistic patterns (11 primitives)

Lexical primitives characterize what and how an actor writes at the word and sentence level. Where stylometric primitives fingerprint unconscious micro-habits, lexical primitives capture deliberate linguistic choices — vocabulary richness, how questions are formed, register.

Primitive	Kind	Description
`lexical.vocabulary_richness`	numeric [0,1]	Moving-Average Type-Token Ratio (MATTR) over a sliding window (default 50 tokens). Volume-independent: each window contributes its own unique/total ratio, the value is the mean. Avoids the standard TTR bias where larger corpora mechanically score lower. Source label declares window size.
`lexical.slang_density`	numeric [0,1]	Rate of slang terms per message, against a locale-tuned slang corpus.
`lexical.code_switching_rate`	numeric [0,1]	Language switches per N tokens (Solorio & Liu metric). A speaker who switches between Spanish and English, or Spanish and lunfardo/caló, will have a higher rate than a monolingual writer.
`lexical.code_switching_matrix_language`	free_string	BCP-47 tag of the dominant (matrix) language in code-switching texts (e.g. `es-CL`, `es-AR`). The matrix language is the grammatical scaffold; embedded languages appear as inserts.
`lexical.code_switching_embedded_languages`	array[free_string]	BCP-47 list of non-matrix languages observed in the actor's messages.
`lexical.sentence_complexity_class`	categorical	Dominant clause structure. `simple` = single-clause. `compound` = two independent clauses joined by coordinating conjunctions (pero, y, o). `complex` = dependent clauses and subordination (aunque, porque, cuando). Reflects education level and cognitive investment.
`lexical.question_formation_style`	categorical	How questions are formed. `punctuation_only` = question marks around a statement with no interrogative word ('¿Vienes?'), very common in Spanish chat. `lexical` = explicit interrogatives (¿qué?, ¿cómo?, ¿cuándo?, ¿cuánto?). `formal` = inverted subject-verb or formal register.
`lexical.imperative_style`	categorical	How commands and requests are framed. `informal_directive` = tú/vos imperative (dame, hazlo). `formal_directive` = usted imperative (hágame el favor). `polite` = conditional/modal softening (¿podría...?). Stable per-author trait in hierarchical contexts.
`lexical.dialect_region`	free_string	Dominant regional variety of the actor's matrix language as a BCP-47 language-region tag (e.g. `es-CL`, `es-AR`, `es-MX`, `es-ES`, `en-US`). Detected from lexical marker density against per-region vocabulary tables. Emit literal `unknown` below confidence threshold. Detection method declared in source label (e.g. `#dialect-markers-v1`). Complementary to `code_switching_matrix_language`, which derives language via switching analysis rather than direct marker lookup.
`lexical.evaluative_morphology_density`	numeric [0,1]	Rate of evaluative morpheme tokens / total tokens. Covers Spanish diminutives (`-ito`/`-ita`), augmentatives (`-ón`/`-ote`), pejoratives (`-ejo`/`-ucho`), and intensives (`-azo`). Heavy diminutive use is characteristic of Mexican/Central American Spanish; River Plate speakers use them significantly less. Stable per-author — baked into language acquisition and hard to consciously suppress. Source label declares morpheme set and tool version (e.g. `#eval-morph-es-v1`).
`lexical.optional_grammar_signature`	hash	64-bit SimHash over the author's preference probability vector at optional-grammar choice points. For Spanish: compound vs simple past (`he comido` vs `comí` — high-reliability Spain/LatAm discriminator), subjunctive usage rate, leísmo/laísmo/loísmo clitic patterns, and relative pronoun choice (`que` vs `el cual`). Each choice point is a scalar [0,1]; the SimHash is computed over the concatenated vector. Choice-point set is extractor-defined and declared in source label (e.g. `#optgrammar-es-v1`). Requires sufficient corpus volume for stable probabilities — gate on `meta.fingerprint_confidence` before use.
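
The MATTR definition used by `lexical.vocabulary_richness` can be sketched as follows. This is a straightforward reading of the description above (mean of per-window unique/total ratios), assuming the input is already tokenized; falling back to plain TTR when the text is shorter than one window is an assumption, not something the spec states.

```python
def mattr(tokens: list[str], window: int = 50) -> float:
    """Moving-Average Type-Token Ratio over a sliding window."""
    if not tokens:
        raise ValueError("need at least one token")
    if len(tokens) <= window:
        # Fewer tokens than one full window: plain TTR (assumption).
        return len(set(tokens)) / len(tokens)
    # Each window contributes its own unique/total ratio.
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)
```

With `tokens = ["a", "b", "a", "b"]` and `window=2`, every window holds two distinct tokens, so MATTR is 1.0 while plain TTR over the whole sequence is 0.5. That gap is the volume bias the sliding window is meant to remove.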

`temporal_evolution.*` — Behavioral change over time (1 primitive)

Primitive	Kind	Description
`temporal_evolution.lifecycle_phase`	categorical	Auto-classified lifecycle stage from windowed within-corpus analysis. `arrival_burst` = first 24hr, first-window volume dominates (empirically validated against OxPayload's first 12 hours in Rutify). `stable_member` = low drift across the full tenure. `fluctuating_member` = tenure ≥24hr with median drift between stable and inflection thresholds — established noisy regulars (e.g. lamarabitch). `inflection_member` = long-tenure actor with a real behavioral shift in at least one window-pair. `declining_member` = monotonically decreasing per-window message counts. `unknown` = insufficient data. Window size adapts to tenure: <24hr → 2h, <7d → 12h, <30d → 1d, otherwise 7d.
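
The tenure-adaptive window rule and the `declining_member` test in the row above can be sketched as below. The drift thresholds for the other phases are not given in this README, so they are not modeled. The function names are hypothetical, and reading "monotonically decreasing" as strictly decreasing over at least two windows is an assumption.

```python
HOUR = 3600
DAY = 86400


def lifecycle_window_seconds(tenure_seconds: float) -> int:
    """Window size by tenure: <24hr -> 2h, <7d -> 12h, <30d -> 1d, else 7d."""
    if tenure_seconds < 24 * HOUR:
        return 2 * HOUR
    if tenure_seconds < 7 * DAY:
        return 12 * HOUR
    if tenure_seconds < 30 * DAY:
        return DAY
    return 7 * DAY


def window_counts(timestamps: list[float], window_s: int) -> list[int]:
    """Message count per consecutive window, anchored at the first message."""
    start = min(timestamps)
    n_windows = int((max(timestamps) - start) // window_s) + 1
    counts = [0] * n_windows
    for ts in timestamps:
        counts[int((ts - start) // window_s)] += 1
    return counts


def is_declining(counts: list[int]) -> bool:
    """True when per-window counts strictly decrease (assumed reading)."""
    return len(counts) >= 2 and all(a > b for a, b in zip(counts, counts[1:]))
```

For example, an actor with a five-hour tenure gets 2-hour windows, and per-window counts of `[3, 2, 1]` would satisfy the declining test under this reading.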

`network.*` — Governance and role signals (2 primitives)

Network primitives capture the actor's structural role in the group — inferred from interaction patterns rather than content — and a bot detector. These are heuristic composites built from other primitives; treat them as candidate signals, not verdicts.

Primitive	Kind	Description
`network.is_likely_bot`	categorical	Heuristic bot detector. `likely_bot` when `conversation_initiation_rate` ≥ 0.95 AND `attention_pattern` = `broadcast` AND `vocabulary_richness` < 0.65. Validated (2026-05-03) against SangMata_beta_bot (caught) vs 11 high-volume humans (no false positives). Low-volume bots (e.g. QuotLyBot, 9 messages) sit below the fingerprint threshold. Source label declares heuristic version (e.g. `#bot-heuristic-v1`).
`network.governance_role_signal`	categorical	Heuristic role shape from interaction primitives + lifecycle. `admin_pattern` = init_rate ≥ 0.80, attention reciprocal, non-bot, non-arrival_burst. `responder_pattern` = init_rate ≤ 0.45, attention reciprocal. `bot_pattern` = matches `is_likely_bot`. `regular` = everything else above volume threshold. Empirically caught 4/4 high-volume Rutify admins, sebaImlI as responder, SangMata as bot. NOT a ground-truth admin label.
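
The two heuristics above compose cleanly, which a short sketch makes visible. The thresholds are taken from the table; the order in which the role rules are checked, the boolean return type of `is_likely_bot`, and the assumption that the caller has already applied the volume threshold are choices made for this illustration.

```python
def is_likely_bot(
    init_rate: float, attention_pattern: str, vocabulary_richness: float
) -> bool:
    """Bot heuristic: high initiation, broadcast attention, low richness."""
    return (
        init_rate >= 0.95
        and attention_pattern == "broadcast"
        and vocabulary_richness < 0.65
    )


def governance_role_signal(
    init_rate: float,
    attention_pattern: str,
    vocabulary_richness: float,
    lifecycle_phase: str,
) -> str:
    """Heuristic role shape; assumes the actor is above the volume threshold."""
    if is_likely_bot(init_rate, attention_pattern, vocabulary_richness):
        return "bot_pattern"
    reciprocal = attention_pattern == "reciprocal"
    if init_rate >= 0.80 and reciprocal and lifecycle_phase != "arrival_burst":
        return "admin_pattern"
    if init_rate <= 0.45 and reciprocal:
        return "responder_pattern"
    return "regular"
```

Checking the bot rule first means an actor cannot be labelled `admin_pattern` while also matching the bot heuristic, which mirrors the "non-bot" condition in the table. The output is still a candidate signal, not a ground-truth label.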

`interaction.*` — Messaging behavior (6 primitives)

Interaction primitives characterize how the actor participates in conversations — timing, initiation rate, and attention patterns.

Primitive	Kind	Description
`interaction.response_latency_class`	categorical	How quickly the actor responds to messages directed at them. `immediate` <30s (suggests active monitoring or automation). `fast` 30s-5min. `normal` 5-60min. `slow` 1-24hr. `sporadic` = no consistent pattern.
`interaction.conversation_initiation_rate`	numeric [0,1]	Thread-starting messages / total messages. High rate = the actor drives conversations.
`interaction.message_burst_rate`	categorical	Whether the actor sends multiple messages per turn. `habitual` = almost always bursts (3+ messages before any reply). `single` = almost always one message per turn. Tied to the `multi_line` value of `stylometric.linebreak_style`.
`interaction.active_hours_class`	free_string	UTC active-hours window summary (e.g. `05:00-14:00 UTC`). Free string — the window shape varies by actor and doesn't fit a closed enum.
`interaction.session_duration_class`	categorical	Typical session length: `short` <15min, `medium` 15-90min, `long` 90min-4hr, `marathon` >4hr. Shares the enum with `behave_shell`'s `temporal.session_duration`.
`interaction.attention_pattern`	categorical	Reply-graph centrality shape. `broadcast` = sends to many, replies to few (one-to-many). `focused` = concentrates on a small set of interlocutors. `reciprocal` = balanced give-and-take.
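
The bucket boundaries in the table above translate directly into small classifiers. The sketch below assumes that "typical" session length means the median, that values falling exactly on a boundary belong to the upper bucket (except 4hr, which stays in `long`), and that latencies beyond 24 hours are left unclassified. The `sporadic` class is not modeled because the README gives no criterion for it. All function names are hypothetical.

```python
from statistics import median
from typing import Optional


def latency_bucket(seconds: float) -> Optional[str]:
    """Bucket one response latency; None if beyond the documented ranges."""
    if seconds < 30:
        return "immediate"
    if seconds < 5 * 60:
        return "fast"
    if seconds < 60 * 60:
        return "normal"
    if seconds <= 24 * 3600:
        return "slow"
    return None


def session_duration_class(session_minutes: list[float]) -> str:
    """Classify the median session length in minutes."""
    typical = median(session_minutes)
    if typical < 15:
        return "short"
    if typical < 90:
        return "medium"
    if typical <= 240:
        return "long"
    return "marathon"


def conversation_initiation_rate(is_thread_start: list[bool]) -> float:
    """Thread-starting messages divided by total messages."""
    return sum(is_thread_start) / len(is_thread_start)
```

Using the median rather than the mean keeps one unusually long session from moving an actor into the `marathon` bucket.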

`content.*` — Content-derived signals, EXPERIMENTAL (6 primitives)

Content primitives are derived from message text through classifiers rather than structural/timing analysis. They carry the highest risk of false positives, are brittle to vocabulary drift, and are locale-specific. An attribution engine may choose to weight these at zero until field-validated against labeled data.

Primitive	Kind	Description
`content.role_signal`	categorical	Locale-tuned role-vocabulary classifier. Values: `admin`, `seller`, `buyer`, `lurker`, `newbie`. May be moved to a separate IOC/keyword-detection layer after Rutify testing. `EXPERIMENTAL`
`content.transactional_language`	numeric [0,1]	Rate of transactional terms per message. Locale-specific; brittle to vocabulary drift. `EXPERIMENTAL`
`content.opsec_awareness`	numeric [0,1]	Rate of security-conscious phrases. HIGH FALSE-POSITIVE RISK on casual conversation about deleting files/messages. `EXPERIMENTAL`
`content.targeting_language`	array[free_string]	IOC-shaped target patterns (bank names, government portals, RUT ranges). Consider moving to a dedicated IOC layer. `EXPERIMENTAL`
`content.boasting_pattern`	categorical	Success-claim frequency: `none`, `occasional`, `frequent`. Corpus-dependent regex. `EXPERIMENTAL`
`content.conflict_style`	categorical	Dispute-tone classification: `aggressive`, `defusing`, `appellate`. Needs labelled training data. `EXPERIMENTAL`

Engine implementation notes

Cross-alphabet fingerprint comparisons are undefined

Several primitives produce hash-based fingerprints by hashing over character or syntactic sequences:

stylometric.character_ngram_simhash
stylometric.pos_ngram_signature
stylometric.function_word_distribution_top50 / top200
stylometric.distinctive_vocabulary_signature
lexical.optional_grammar_signature

These fingerprints are only meaningful within a single writing-system boundary. A Latin-script actor (Spanish, English, French, Portuguese) and a Cyrillic-script actor (Russian, Bulgarian, Serbian) share essentially no alphabetic character n-grams; any overlap comes from digits, punctuation, emoji, or URLs. Comparing their character_ngram_simhash values produces a Hamming distance that is numerically valid but semantically undefined: it measures incomparability, not dissimilarity.

The same applies to any other script boundary: Arabic vs Latin, Hangul vs Hiragana, Devanagari vs Cyrillic, and so on.

Engine rule: before compositing or comparing any hash-based fingerprint between two actors, gate on script/language compatibility. Use lexical.dialect_region or lexical.code_switching_matrix_language to determine whether two actors share a writing system. If they do not, treat the fingerprint distance as undefined rather than as evidence of dissimilarity — do not include it in the similarity composite.

Primitives that are not subject to this constraint (safe to compare across writing systems without gating):

All meta.* primitives — corpus-footprint metrics are script-agnostic.
All interaction.* primitives — timing and graph-structure signals are script-agnostic.
stylometric.emoji_usage, stylometric.emoji_placement — Unicode emoji are shared across scripts.
stylometric.capitalization_habit — only meaningful within scripts that have case; emit unknown for caseless scripts (Arabic, CJK, etc.).
network.*, temporal_evolution.* — structural signals, script-agnostic.
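
The engine rule in this section can be sketched as a gate in front of the Hamming distance. The language-to-script table below is a small hypothetical sample (an engine would need a fuller mapping), the script codes follow ISO 15924, and `script_of` and `gated_hamming` are illustrative names rather than part of the BEHAVE API. Tags are assumed to come from `lexical.dialect_region` or `lexical.code_switching_matrix_language`.

```python
from typing import Optional

# Hypothetical sample mapping from primary language subtag to default script.
SCRIPT_BY_LANGUAGE = {
    "es": "Latn", "en": "Latn", "pt": "Latn", "fr": "Latn",
    "ru": "Cyrl", "bg": "Cyrl",
    "ar": "Arab", "ko": "Kore", "hi": "Deva",
}


def script_of(bcp47_tag: str) -> Optional[str]:
    """Best-effort script for a BCP-47 tag; None when it cannot be determined."""
    subtags = bcp47_tag.split("-")
    # An explicit four-letter script subtag (e.g. sr-Cyrl-RS) takes priority.
    for sub in subtags[1:]:
        if len(sub) == 4 and sub.isalpha():
            return sub.title()
    return SCRIPT_BY_LANGUAGE.get(subtags[0].lower())


def gated_hamming(fp_a: int, tag_a: str, fp_b: int, tag_b: str) -> Optional[int]:
    """Hamming distance, or None when the scripts differ or are unknown."""
    script_a, script_b = script_of(tag_a), script_of(tag_b)
    if script_a is None or script_b is None or script_a != script_b:
        # Undefined, not "dissimilar": leave it out of the similarity composite.
        return None
    return bin(fp_a ^ fp_b).count("1")
```

Returning `None` instead of a large distance forces the compositing step to handle the undefined case explicitly. A literal `unknown` tag also yields `None` here, which treats an undetected dialect the same way as a script mismatch.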

Schema

Machine-readable JSON Schema: json/observation.schema.json

Regenerate after model changes:

python scripts/generate_schema.py

Tests

pytest tests/

Attribution recipes

attribution-recipes.md — placeholder document sketching how an external attribution engine would consume actor.observation.text.* topics to build actor profiles (credential_broker, low_skill_buyer, group_admin, etc.). Not populated yet — awaiting Rutify corpus calibration. Not part of the BEHAVE spec.

License

Code and schemas: GPL-3.0-or-later
Spec prose (this file, attribution-recipes.md): CC-BY-SA-4.0
