- Add readme field to all three pyproject.toml files (populates PyPI description page) - Bump versions to 0.1.1 and update behave-core pin in shell/text - Update Install sections in root, BEHAVE-SHELL, and BEHAVE-TEXT READMEs to lead with pip install from PyPI - Add CHANGELOG.md at project root (Keep-a-Changelog format)
201 lines
15 KiB
Markdown
201 lines
15 KiB
Markdown
<!-- SPDX-License-Identifier: CC-BY-SA-4.0 -->
|
|
# behave-text
|
|
|
|
[← repo](../README.md)
|
|
|
|
Text/messaging-domain behavioral observation registry. Defines what can be observed
|
|
about an actor through their written messaging activity — stylometric fingerprints,
|
|
lexical patterns, interaction rhythms, and governance-role signals.
|
|
|
|
BEHAVE-TEXT operates on **derived features, not raw text**. Sensors hash, aggregate,
|
|
and classify before emitting — the raw message content never enters a BEHAVE
|
|
observation. This is a tighter constraint than BEHAVE-SHELL because the source
|
|
signal *is* text content; the PII risk is higher.
|
|
|
|
The topic prefix is `actor.observation.text` (not `attacker.`) because chat groups
|
|
include non-attacker roles — admins, buyers, sellers, bots, lurkers. The framing
|
|
is deliberately neutral: BEHAVE-TEXT observes actors, not adversaries.
|
|
|
|
## Install
|
|
|
|
```bash
|
|
pip install behave-text
|
|
```
|
|
|
|
For local development:
|
|
|
|
```bash
|
|
pip install -e ../core/ -e ".[dev]"
|
|
```
|
|
|
|
## Quickstart
|
|
|
|
```python
|
|
from behave_text.spec import Observation, Window, TOPIC_PREFIX, event_topic_for
|
|
|
|
obs = Observation(
|
|
primitive="stylometric.capitalization_habit",
|
|
value="lowercase",
|
|
confidence=0.91,
|
|
window=Window(start_ts=1714000000.0, end_ts=1714086400.0),
|
|
source="behave/text-sensor/stylometry.py",
|
|
)
|
|
topic = event_topic_for("stylometric.capitalization_habit")
|
|
# → "actor.observation.text.stylometric.capitalization_habit"
|
|
```
|
|
|
|
## Public API (`behave_text.spec`)
|
|
|
|
| Symbol | Description |
|
|
|---|---|
|
|
| `Observation` | Registry-aware subclass of `behave_core.spec.Observation`. Validates `primitive` and `value` against `PRIMITIVE_REGISTRY`. |
|
|
| `Window` | Re-exported from `behave_core`. |
|
|
| `ObservationValue` | Re-exported union type. |
|
|
| `PRIMITIVE_REGISTRY` | `dict[str, ValueTypeSpec]` — the full primitive catalog (35 entries). |
|
|
| `ValueKind` | Enum: `CATEGORICAL`, `NUMERIC`, `HASH`, `ARRAY`, `FREE_STRING`, `BOOL`. |
|
|
| `ValueTypeSpec` | Pydantic model: kind, allowed values, bounds, notes. |
|
|
| `is_known(primitive)` | `bool` — whether a primitive path is registered. |
|
|
| `get(primitive)` | Returns the `ValueTypeSpec`; raises `KeyError` if unknown. |
|
|
| `TOPIC_PREFIX` | `"actor.observation.text"` |
|
|
| `event_topic_for(primitive)` | Returns the full event bus topic string. |
|
|
|
|
Note: `to_event_payload` / `from_event_payload` (full round-trip helpers) are
|
|
present in `behave-shell` but not yet implemented here — `status: planned`.
|
|
|
|
## Primitives
|
|
|
|
35 primitives across 6 categories.
|
|
|
|
---
|
|
|
|
### `stylometric.*` — Writing style fingerprints (12 primitives)
|
|
|
|
Stylometric primitives capture the unconscious writing habits that distinguish
|
|
one author from another. The field goes back to the Mosteller-Wallace Federalist
|
|
Papers study (1963): function-word frequencies alone can attribute authorship
|
|
with high accuracy in long-form English text. BEHAVE-TEXT adapts these methods
|
|
to short-form Spanish chat, which introduces domain-specific challenges (short
|
|
messages, informal register, code-switching, emoji). Calibration results from
|
|
the Rutify corpus are noted inline where they affect interpretation.
|
|
|
|
| Primitive | Kind | Description |
|
|
|---|---|---|
|
|
| `stylometric.punctuation_style` | hash | Canonical punctuation-pattern fingerprint hash. Captures the author's consistent punctuation tics (double spaces, comma habits, no-period endings) as a searchable signature. |
|
|
| `stylometric.capitalization_habit` | categorical | Dominant capitalization rule. `lowercase` = no capitals. `proper` = standard sentence/title case. `random_caps` = no consistent rule. `mixed_i` = consistent lowercase 'i' mid-sentence — common in Spanish chat where the standalone-'I' habit doesn't apply but the behavior transfers. |
|
|
| `stylometric.emoji_usage` | categorical | Rate of emoji use. `none`, `occasional`, `frequent`, `exclusive` (messages rarely without emoji). Captures tone and register. |
|
|
| `stylometric.emoji_placement` | categorical | Emoji position relative to sentence-ending punctuation. `pre_punctuation` = 'Hola 😊.' `post_punctuation` = 'Hola. 😊' Individual authors are strikingly consistent in this micro-habit. |
|
|
| `stylometric.message_length_class` | categorical | Median message length bucket: `short` 1-5 words, `medium` 6-20, `long` 21-50, `paragraph` >50. See also `message_length_variance_class` for distribution shape. |
|
|
| `stylometric.message_length_variance_class` | categorical | Distribution shape of per-message word counts. `tight` CV<0.5 (always 1-3 words). `varied` 0.5≤CV<1.5 (normal mix). `bimodal` CV≥1.5 (mostly short with occasional rants). Two authors can share the same median length but have wildly different variance. |
|
|
| `stylometric.linebreak_style` | categorical | Whether the author sends one complete thought per message or bursts multiple short sequential messages. `multi_line` = habitual 3-5 short messages per turn. `wall_of_text` = dense blocks, rarely uses line breaks. Captures a stylistic rhythm that is hard to consciously alter. |
|
|
| `stylometric.typo_signature` | hash | SHA-256 of the canonical persistent-typo set — the specific recurring errors the author makes consistently (e.g. always writes `tener` as `tenet`, or `porque` as `xq`). Persistent typos are strong authorship signals because they reflect keyboard-motor habits. |
|
|
| `stylometric.function_word_distribution_top50` | hash | 64-bit SimHash over the 50 most common Spanish function-word frequency vector. Based on the Mosteller-Wallace method. **Calibration note (2026-05-02, Rutify corpus):** within-author and cross-author Hamming distance distributions overlap (within median 8 bits, cross median 10 bits) in short-message chat — this primitive alone cannot discriminate authors. Engines should weight it low and composite with character n-grams and distinctive vocabulary. Kept in v0 for calibration grids. |
|
|
| `stylometric.function_word_distribution_top200` | hash | 64-bit SimHash over the 200 most common Spanish function words. The wider list reaches into the long tail (rare-but-individual words like `tampoco`, `aunque`, `mientras`) that carry more discriminating signal in short-message corpora. Not yet emitted by v0 prototype — populated in v0.2. |
|
|
| `stylometric.character_ngram_simhash` | hash | 64-bit SimHash over character n-gram frequencies (default n=3), lowercased. Orthogonal to function-word distributions: captures punctuation tics, accent-stripping habits, typo patterns, and idiom fragments that survive paraphrase. Accents are preserved because accent-stripping is itself a stylistic tic. Source label declares n size (e.g. `#char3gram`). |
|
|
| `stylometric.distinctive_vocabulary_signature` | hash | 64-bit SimHash over a TF-IDF-weighted top-K rare-word vector. Captures the author's distinctive lexicon — words they use that other authors in the same corpus do not. Complementary to function-word distributions: where `function_word_*` captures common-word style, this captures individual lexical choice. Requires the full corpus for IDF computation. Source label declares top-K and corpus tag (e.g. `#tfidf-top50`). |
|
|
|
|
---
|
|
|
|
### `lexical.*` — Vocabulary and linguistic patterns (8 primitives)
|
|
|
|
Lexical primitives characterize *what* and *how* an actor writes at the word and
|
|
sentence level. Where stylometric primitives fingerprint unconscious micro-habits,
|
|
lexical primitives capture deliberate linguistic choices — vocabulary richness,
|
|
how questions are formed, register.
|
|
|
|
| Primitive | Kind | Description |
|
|
|---|---|---|
|
|
| `lexical.vocabulary_richness` | numeric [0,1] | Moving-Average Type-Token Ratio (MATTR) over a sliding window (default 50 tokens). Volume-independent: each window contributes its own unique/total ratio, the value is the mean. Avoids the standard TTR bias where larger corpora mechanically score lower. Source label declares window size. |
|
|
| `lexical.slang_density` | numeric [0,1] | Rate of slang terms per message, against a locale-tuned slang corpus. |
|
|
| `lexical.code_switching_rate` | numeric [0,1] | Language switches per N tokens (Solorio & Liu metric). A speaker who switches between Spanish and English, or Spanish and lunfardo/caló, will have a higher rate than a monolingual writer. |
|
|
| `lexical.code_switching_matrix_language` | free_string | BCP-47 tag of the dominant (matrix) language in code-switching texts (e.g. `es-CL`, `es-AR`). The matrix language is the grammatical scaffold; embedded languages appear as inserts. |
|
|
| `lexical.code_switching_embedded_languages` | array[free_string] | BCP-47 list of non-matrix languages observed in the actor's messages. |
|
|
| `lexical.sentence_complexity_class` | categorical | Dominant clause structure. `simple` = single-clause. `compound` = two independent clauses joined by coordinating conjunctions (pero, y, o). `complex` = dependent clauses and subordination (aunque, porque, cuando). Reflects education level and cognitive investment. |
|
|
| `lexical.question_formation_style` | categorical | How questions are formed. `punctuation_only` = question mark without interrogative words ('¿Cuánto?') — very common in Spanish chat. `lexical` = explicit interrogatives (¿qué, cómo, cuándo). `formal` = inverted subject-verb or formal register. |
|
|
| `lexical.imperative_style` | categorical | How commands and requests are framed. `informal_directive` = tú/vos imperative (dame, hazlo). `formal_directive` = usted imperative (hágame el favor). `polite` = conditional/modal softening (¿podría...?). Stable per-author trait in hierarchical contexts. |
|
|
|
|
---
|
|
|
|
### `temporal_evolution.*` — Behavioral change over time (1 primitive)
|
|
|
|
| Primitive | Kind | Description |
|
|
|---|---|---|
|
|
| `temporal_evolution.lifecycle_phase` | categorical | Auto-classified lifecycle stage from windowed within-corpus analysis. `arrival_burst` = first 24hr, first-window volume dominates (empirically validated against OxPayload's first 12 hours in Rutify). `stable_member` = low drift across the full tenure. `fluctuating_member` = tenure ≥24hr with median drift between stable and inflection thresholds — established noisy regulars (e.g. lamarabitch). `inflection_member` = long-tenure actor with a real behavioral shift in at least one window-pair. `declining_member` = monotonically decreasing per-window message counts. `unknown` = insufficient data. Window size adapts to tenure: <24hr → 2h, <7d → 12h, <30d → 1d, otherwise 7d. |
|
|
|
|
---
|
|
|
|
### `network.*` — Governance and role signals (2 primitives)
|
|
|
|
Network primitives capture the actor's *structural role* in the group — inferred
|
|
from interaction patterns rather than content — and a bot detector. These are
|
|
heuristic composites built from other primitives; treat them as candidate signals,
|
|
not verdicts.
|
|
|
|
| Primitive | Kind | Description |
|
|
|---|---|---|
|
|
| `network.is_likely_bot` | categorical | Heuristic bot detector. `likely_bot` when `conversation_initiation_rate` ≥ 0.95 AND `attention_pattern` = `broadcast` AND `vocabulary_richness` < 0.65. Validated (2026-05-03) against SangMata_beta_bot (caught) vs 11 high-volume humans (no false positives). Low-volume bots (e.g. QuotLyBot, 9 messages) sit below the fingerprint threshold. Source label declares heuristic version (e.g. `#bot-heuristic-v1`). |
|
|
| `network.governance_role_signal` | categorical | Heuristic role shape from interaction primitives + lifecycle. `admin_pattern` = init_rate ≥ 0.80, attention reciprocal, non-bot, non-arrival_burst. `responder_pattern` = init_rate ≤ 0.45, attention reciprocal. `bot_pattern` = matches `is_likely_bot`. `regular` = everything else above volume threshold. Empirically caught 4/4 high-volume Rutify admins, sebaImlI as responder, SangMata as bot. NOT a ground-truth admin label. |
|
|
|
|
---
|
|
|
|
### `interaction.*` — Messaging behavior (6 primitives)
|
|
|
|
Interaction primitives characterize *how* the actor participates in conversations —
|
|
timing, initiation rate, and attention patterns.
|
|
|
|
| Primitive | Kind | Description |
|
|
|---|---|---|
|
|
| `interaction.response_latency_class` | categorical | How quickly the actor responds to messages directed at them. `immediate` <30s (suggests active monitoring or automation). `fast` 30s-5min. `normal` 5-60min. `slow` 1-24hr. `sporadic` = no consistent pattern. |
|
|
| `interaction.conversation_initiation_rate` | numeric [0,1] | Thread-starting messages / total messages. High rate = the actor drives conversations. |
|
|
| `interaction.message_burst_rate` | categorical | Whether the actor sends multiple messages per turn. `habitual` = almost always bursts (3+ messages before any reply). `single` = almost always one message per turn. Tied to `stylometric.linebreak_style multi_line`. |
|
|
| `interaction.active_hours_class` | free_string | UTC active-hours window summary (e.g. `05:00-14:00 UTC`). Free string — the window shape varies by actor and doesn't fit a closed enum. |
|
|
| `interaction.session_duration_class` | categorical | Typical session length: `short` <15min, `medium` 15-90min, `long` 90min-4hr, `marathon` >4hr. Shares the enum with `behave_shell`'s `temporal.session_duration`. |
|
|
| `interaction.attention_pattern` | categorical | Reply-graph centrality shape. `broadcast` = sends to many, replies to few (one-to-many). `focused` = concentrates on a small set of interlocutors. `reciprocal` = balanced give-and-take. |
|
|
|
|
---
|
|
|
|
### `content.*` — Content-derived signals, EXPERIMENTAL (6 primitives)
|
|
|
|
Content primitives are derived from message text through classifiers rather than
|
|
structural/timing analysis. They carry the highest risk of false positives, are
|
|
brittle to vocabulary drift, and are locale-specific. An attribution engine may
|
|
choose to weight these at zero until field-validated against labeled data.
|
|
|
|
| Primitive | Kind | Description |
|
|
|---|---|---|
|
|
| `content.role_signal` | categorical | Locale-tuned role-vocabulary classifier. Values: `admin`, `seller`, `buyer`, `lurker`, `newbie`. May be moved to a separate IOC/keyword-detection layer after Rutify testing. `EXPERIMENTAL` |
|
|
| `content.transactional_language` | numeric [0,1] | Rate of transactional terms per message. Locale-specific; brittle to vocabulary drift. `EXPERIMENTAL` |
|
|
| `content.opsec_awareness` | numeric [0,1] | Rate of security-conscious phrases. **HIGH FALSE-POSITIVE RISK** on casual conversation about deleting files/messages. `EXPERIMENTAL` |
|
|
| `content.targeting_language` | array[free_string] | IOC-shaped target patterns (bank names, government portals, RUT ranges). Consider moving to a dedicated IOC layer. `EXPERIMENTAL` |
|
|
| `content.boasting_pattern` | categorical | Success-claim frequency: `none`, `occasional`, `frequent`. Corpus-dependent regex. `EXPERIMENTAL` |
|
|
| `content.conflict_style` | categorical | Dispute-tone classification: `aggressive`, `defusing`, `appellate`. Needs labelled training data. `EXPERIMENTAL` |
|
|
|
|
---
|
|
|
|
## Schema
|
|
|
|
Machine-readable JSON Schema:
|
|
[`json/observation.schema.json`](json/observation.schema.json)
|
|
|
|
Regenerate after model changes:
|
|
```bash
|
|
python scripts/generate_schema.py
|
|
```
|
|
|
|
## Tests
|
|
|
|
```bash
|
|
pytest tests/
|
|
```
|
|
|
|
## Attribution recipes
|
|
|
|
[`attribution-recipes.md`](attribution-recipes.md) — placeholder document sketching
|
|
how an external attribution engine would consume `actor.observation.text.*` topics
|
|
to build actor profiles (`credential_broker`, `low_skill_buyer`, `group_admin`, etc.).
|
|
**Not populated yet** — awaiting Rutify corpus calibration. Not part of the BEHAVE spec.
|
|
|
|
## License
|
|
|
|
Code and schemas: [GPL-3.0-or-later](../LICENSE)
|
|
Spec prose (this file, attribution-recipes.md): [CC-BY-SA-4.0](../LICENSE.docs)
|