Compare commits
3 Commits
text/v0.1.
...
main
| Author | SHA1 | Date | |
|---|---|---|---|
| 0d18a3e30d | |||
| b182e2fe3b | |||
| 214ce50941 |
@@ -14,8 +14,12 @@ aggregates, and cryptographic hashes — never raw keystrokes or command text.
|
||||
## Install
|
||||
|
||||
```bash
|
||||
pip install -e ../core/ -e .
|
||||
# development (pytest + ruff):
|
||||
pip install behave-shell
|
||||
```
|
||||
|
||||
For local development (pytest + ruff):
|
||||
|
||||
```bash
|
||||
pip install -e ../core/ -e ".[dev]"
|
||||
```
|
||||
|
||||
|
||||
@@ -4,12 +4,13 @@ build-backend = "setuptools.build_meta"
|
||||
|
||||
[project]
|
||||
name = "behave-shell"
|
||||
version = "0.1.0"
|
||||
version = "0.1.1"
|
||||
description = "BEHAVE-SHELL — shell-session behavioral observation registry, layered on behave-core"
|
||||
readme = "README.md"
|
||||
requires-python = ">=3.11"
|
||||
license = { text = "GPL-3.0-or-later" }
|
||||
authors = [{ name = "ANTI" }]
|
||||
dependencies = ["pydantic>=2.6", "behave-core>=0.1.0"]
|
||||
dependencies = ["pydantic>=2.6", "behave-core>=0.1.1"]
|
||||
|
||||
[project.optional-dependencies]
|
||||
dev = ["pytest>=8", "pytest-cov", "ruff"]
|
||||
|
||||
@@ -19,8 +19,12 @@ is deliberately neutral: BEHAVE-TEXT observes actors, not adversaries.
|
||||
## Install
|
||||
|
||||
```bash
|
||||
pip install -e ../core/ -e .
|
||||
# development:
|
||||
pip install behave-text
|
||||
```
|
||||
|
||||
For local development:
|
||||
|
||||
```bash
|
||||
pip install -e ../core/ -e ".[dev]"
|
||||
```
|
||||
|
||||
@@ -47,7 +51,7 @@ topic = event_topic_for("stylometric.capitalization_habit")
|
||||
| `Observation` | Registry-aware subclass of `behave_core.spec.Observation`. Validates `primitive` and `value` against `PRIMITIVE_REGISTRY`. |
|
||||
| `Window` | Re-exported from `behave_core`. |
|
||||
| `ObservationValue` | Re-exported union type. |
|
||||
| `PRIMITIVE_REGISTRY` | `dict[str, ValueTypeSpec]` — the full primitive catalog (35 entries). |
|
||||
| `PRIMITIVE_REGISTRY` | `dict[str, ValueTypeSpec]` — the full primitive catalog (47 entries). |
|
||||
| `ValueKind` | Enum: `CATEGORICAL`, `NUMERIC`, `HASH`, `ARRAY`, `FREE_STRING`, `BOOL`. |
|
||||
| `ValueTypeSpec` | Pydantic model: kind, allowed values, bounds, notes. |
|
||||
| `is_known(primitive)` | `bool` — whether a primitive path is registered. |
|
||||
@@ -60,11 +64,33 @@ present in `behave-shell` but not yet implemented here — `status: planned`.
|
||||
|
||||
## Primitives
|
||||
|
||||
35 primitives across 6 categories.
|
||||
47 primitives across 7 categories.
|
||||
|
||||
---
|
||||
|
||||
### `stylometric.*` — Writing style fingerprints (12 primitives)
|
||||
### `meta.*` — Corpus-snapshot footprint (8 primitives)
|
||||
|
||||
Meta primitives describe the actor's presence in the corpus window itself —
|
||||
how many messages, how long a span, how densely distributed. They are not
|
||||
stylometric features; they are the scaffolding that other primitives assume.
|
||||
Several primitives (notably `temporal_evolution.lifecycle_phase`) implicitly
|
||||
depend on these quantities; `meta.*` makes them first-class so downstream
|
||||
attribution engines can access and weight them explicitly.
|
||||
|
||||
| Primitive | Kind | Description |
|
||||
|---|---|---|
|
||||
| `meta.total_messages` | numeric | Raw message count for this actor in the corpus snapshot. Anchor for `msg_per_day` and `fingerprint_confidence`. |
|
||||
| `meta.corpus_span_days` | numeric | Wall-clock fractional days between first and last message. First-to-last only — blind to gaps. A 47-day span with 5 active days still yields 47. Recomputable from `first_seen_ts` / `last_seen_ts`. |
|
||||
| `meta.msg_per_day` | numeric | `total_messages / corpus_span_days`. Separates bursty visitors (53 msgs / 0.3 days = 53/day) from long-tail lurkers (53 msgs / 47 days = 1.1/day). Undefined when span = 0; extractors emit null/omit rather than divide-by-zero. |
|
||||
| `meta.active_days` | numeric | Distinct calendar days (UTC) with ≥1 message. Always ≤ `corpus_span_days`. Distinguishes a periodic visitor (span=47, active=3) from a near-daily regular (span=47, active=40). |
|
||||
| `meta.activity_density` | numeric [0,1] | `active_days / corpus_span_days`. 1.0 = present every day of the window. Near-0 = appeared once or twice across a long window. Undefined when span = 0; emit null/omit for single-day actors. |
|
||||
| `meta.first_seen_ts` | free_string | ISO 8601 timestamp (UTC offset) of the actor's earliest message. Anchors `corpus_span_days` in absolute time for cross-extraction comparison. |
|
||||
| `meta.last_seen_ts` | free_string | ISO 8601 timestamp (UTC offset) of the actor's latest message. See `first_seen_ts`. |
|
||||
| `meta.fingerprint_confidence` | categorical | Qualitative reliability of this actor's full fingerprint: `low`, `medium`, `high`. Attribution engines should weight all other observations by this before compositing. Derivation is **extractor-defined** — extractors declare their heuristic in the source label (e.g. `#confidence-v1`). |
|
||||
|
||||
---
|
||||
|
||||
### `stylometric.*` — Writing style fingerprints (13 primitives)
|
||||
|
||||
Stylometric primitives capture the unconscious writing habits that distinguish
|
||||
one author from another. The field goes back to the Mosteller-Wallace Federalist
|
||||
@@ -88,10 +114,11 @@ the Rutify corpus are noted inline where they affect interpretation.
|
||||
| `stylometric.function_word_distribution_top200` | hash | 64-bit SimHash over the 200 most common Spanish function words. The wider list reaches into the long tail (rare-but-individual words like `tampoco`, `aunque`, `mientras`) that carry more discriminating signal in short-message corpora. Not yet emitted by v0 prototype — populated in v0.2. |
|
||||
| `stylometric.character_ngram_simhash` | hash | 64-bit SimHash over character n-gram frequencies (default n=3), lowercased. Orthogonal to function-word distributions: captures punctuation tics, accent-stripping habits, typo patterns, and idiom fragments that survive paraphrase. Accents are preserved because accent-stripping is itself a stylistic tic. Source label declares n size (e.g. `#char3gram`). |
|
||||
| `stylometric.distinctive_vocabulary_signature` | hash | 64-bit SimHash over a TF-IDF-weighted top-K rare-word vector. Captures the author's distinctive lexicon — words they use that other authors in the same corpus do not. Complementary to function-word distributions: where `function_word_*` captures common-word style, this captures individual lexical choice. Requires the full corpus for IDF computation. Source label declares top-K and corpus tag (e.g. `#tfidf-top50`). |
|
||||
| `stylometric.pos_ngram_signature` | hash | 64-bit SimHash over a POS n-gram (default bigram) frequency vector. Captures syntactic skeleton independent of vocabulary — an author can change every word and retain the same grammatical fingerprint. Orthogonal to character n-grams and function-word distributions. Tagger-dependent: source label must declare tagger, language model, and n (e.g. `#spacy-es_core_news_sm-bi`). Calibration note: chat-domain text produces tagger noise — weight low until validated on labelled chat corpora. |
|
||||
|
||||
---
|
||||
|
||||
### `lexical.*` — Vocabulary and linguistic patterns (8 primitives)
|
||||
### `lexical.*` — Vocabulary and linguistic patterns (11 primitives)
|
||||
|
||||
Lexical primitives characterize *what* and *how* an actor writes at the word and
|
||||
sentence level. Where stylometric primitives fingerprint unconscious micro-habits,
|
||||
@@ -108,6 +135,9 @@ how questions are formed, register.
|
||||
| `lexical.sentence_complexity_class` | categorical | Dominant clause structure. `simple` = single-clause. `compound` = two independent clauses joined by coordinating conjunctions (pero, y, o). `complex` = dependent clauses and subordination (aunque, porque, cuando). Reflects education level and cognitive investment. |
|
||||
| `lexical.question_formation_style` | categorical | How questions are formed. `punctuation_only` = question mark without interrogative words ('¿Cuánto?') — very common in Spanish chat. `lexical` = explicit interrogatives (¿qué, cómo, cuándo). `formal` = inverted subject-verb or formal register. |
|
||||
| `lexical.imperative_style` | categorical | How commands and requests are framed. `informal_directive` = tú/vos imperative (dame, hazlo). `formal_directive` = usted imperative (hágame el favor). `polite` = conditional/modal softening (¿podría...?). Stable per-author trait in hierarchical contexts. |
|
||||
| `lexical.dialect_region` | free_string | Dominant regional variety of the actor's matrix language as a BCP-47 language-region tag (e.g. `es-CL`, `es-AR`, `es-MX`, `es-ES`, `en-US`). Detected from lexical marker density against per-region vocabulary tables. Emit literal `unknown` below confidence threshold. Detection method declared in source label (e.g. `#dialect-markers-v1`). Complementary to `code_switching_matrix_language`, which derives language via switching analysis rather than direct marker lookup. |
|
||||
| `lexical.evaluative_morphology_density` | numeric [0,1] | Rate of evaluative morpheme tokens / total tokens. Covers Spanish diminutives (`-ito`/`-ita`), augmentatives (`-ón`/`-ote`), pejoratives (`-ejo`/`-ucho`), and intensives (`-azo`). Heavy diminutive use is characteristic of Mexican/Central American Spanish; River Plate speakers use them significantly less. Stable per-author — baked into language acquisition and hard to consciously suppress. Source label declares morpheme set and tool version (e.g. `#eval-morph-es-v1`). |
|
||||
| `lexical.optional_grammar_signature` | hash | 64-bit SimHash over the author's preference probability vector at optional-grammar choice points. For Spanish: compound vs simple past (`he comido` vs `comí` — high-reliability Spain/LatAm discriminator), subjunctive usage rate, leísmo/laísmo/loísmo clitic patterns, and relative pronoun choice (`que` vs `el cual`). Each choice point is a scalar [0,1]; the SimHash is computed over the concatenated vector. Choice-point set is extractor-defined and declared in source label (e.g. `#optgrammar-es-v1`). Requires sufficient corpus volume for stable probabilities — gate on `meta.fingerprint_confidence` before use. |
|
||||
|
||||
---
|
||||
|
||||
@@ -167,6 +197,49 @@ choose to weight these at zero until field-validated against labeled data.
|
||||
|
||||
---
|
||||
|
||||
## Engine implementation notes
|
||||
|
||||
### Cross-alphabet fingerprint comparisons are undefined
|
||||
|
||||
Several primitives produce hash-based fingerprints by hashing over character or
|
||||
syntactic sequences:
|
||||
|
||||
- `stylometric.character_ngram_simhash`
|
||||
- `stylometric.pos_ngram_signature`
|
||||
- `stylometric.function_word_distribution_top50` / `top200`
|
||||
- `stylometric.distinctive_vocabulary_signature`
|
||||
- `lexical.optional_grammar_signature`
|
||||
|
||||
These fingerprints are only **meaningful within a single writing-system boundary**.
|
||||
A Latin-script actor (Spanish, English, French, Portuguese) and a Cyrillic-script
|
||||
actor (Russian, Bulgarian, Serbian) share zero character n-grams by definition.
|
||||
Comparing their `character_ngram_simhash` values produces a Hamming distance that
|
||||
is numerically valid but semantically undefined — it does not measure dissimilarity,
|
||||
it measures incomparability.
|
||||
|
||||
The same applies to any other script boundary: Arabic vs Latin, Hangul vs Hiragana,
|
||||
Devanagari vs Cyrillic, and so on.
|
||||
|
||||
**Engine rule:** before compositing or comparing any hash-based fingerprint between
|
||||
two actors, gate on script/language compatibility. Use `lexical.dialect_region` or
|
||||
`lexical.code_switching_matrix_language` to determine whether two actors share a
|
||||
writing system. If they do not, treat the fingerprint distance as `undefined` rather
|
||||
than as evidence of dissimilarity — do not include it in the similarity composite.
|
||||
|
||||
Primitives that are **not** subject to this constraint (safe to compare across
|
||||
writing systems without gating):
|
||||
|
||||
- All `meta.*` primitives — corpus-footprint metrics are script-agnostic.
|
||||
- All `interaction.*` primitives — timing and graph-structure signals are
|
||||
script-agnostic.
|
||||
- `stylometric.emoji_usage`, `stylometric.emoji_placement` — Unicode emoji are
|
||||
shared across scripts.
|
||||
- `stylometric.capitalization_habit` — only meaningful within scripts that have
|
||||
case; emit `unknown` for caseless scripts (Arabic, CJK, etc.).
|
||||
- `network.*`, `temporal_evolution.*` — structural signals, script-agnostic.
|
||||
|
||||
---
|
||||
|
||||
## Schema
|
||||
|
||||
Machine-readable JSON Schema:
|
||||
|
||||
@@ -2,7 +2,7 @@
|
||||
|
||||
# BEHAVE-TEXT Attribution Recipes
|
||||
|
||||
> **This document is not part of BEHAVE-TEXT.** BEHAVE-TEXT (`scratchpad.md`) defines the observation taxonomy and emission envelope. It does **not** assert who an actor is, link sessions, or assign profiles. Those are attribution-engine concerns.
|
||||
> **This document is not part of BEHAVE-TEXT.** BEHAVE-TEXT (`README.md`) defines the observation taxonomy and emission envelope. It does **not** assert who an actor is, link sessions, or assign profiles. Those are attribution-engine concerns.
|
||||
>
|
||||
> This document is a **placeholder**. Recipes for the text domain wait for corpus calibration. The Rutify Telegram corpus (forthcoming) will be the labeling ground truth that drives the first concrete profiles.
|
||||
|
||||
|
||||
@@ -16,7 +16,7 @@ PII discipline notice (carried over from behave-core's envelope module):
|
||||
IS text content. Sensors must hash/aggregate before emitting.
|
||||
|
||||
Adding a new primitive is a deliberate registry edit. Drift between this file
|
||||
and `scratchpad.md` is a bug; v0 keeps the registry hand-written so PR review
|
||||
and `README.md` is a bug; v0 keeps the registry hand-written so PR review
|
||||
catches drift, v0.x may auto-extract from the markdown if drift becomes a
|
||||
maintenance issue.
|
||||
|
||||
@@ -109,10 +109,71 @@ def _array(of: ValueKind, notes: Optional[str] = None) -> ValueTypeSpec:
|
||||
|
||||
# ─── The registry ───────────────────────────────────────────────────────────
|
||||
#
|
||||
# 28 primitives across 4 layers. Mirrors scratchpad.md row-for-row.
|
||||
# 47 primitives across 7 layers. Mirrors README.md row-for-row.
|
||||
|
||||
PRIMITIVE_REGISTRY: dict[str, ValueTypeSpec] = {
|
||||
# ── stylometric.* (motor analog — 8) ──────────────────────────────────
|
||||
# ── meta.* (corpus-snapshot footprint — 8) ────────────────────────────
|
||||
"meta.total_messages": _num(
|
||||
min_val=0.0,
|
||||
notes="Raw message count for this actor in the corpus snapshot. Integer in "
|
||||
"practice; stored as float for spec uniformity. Dependency anchor: "
|
||||
"msg_per_day is derived from this; fingerprint_confidence is informed "
|
||||
"by this. Emit before deriving rates.",
|
||||
),
|
||||
"meta.corpus_span_days": _num(
|
||||
min_val=0.0,
|
||||
notes="Wall-clock duration in fractional days between the actor's earliest "
|
||||
"and latest message in the corpus snapshot. First-to-last only — blind "
|
||||
"to silence in between (a 47-day span with 5 active days still yields "
|
||||
"47). Complement with active_days and activity_density to get presence "
|
||||
"shape. Recomputable from first_seen_ts and last_seen_ts.",
|
||||
),
|
||||
"meta.msg_per_day": _num(
|
||||
min_val=0.0,
|
||||
notes="total_messages / corpus_span_days. The key rate that separates a "
|
||||
"bursty single-session visitor (53 msgs in 0.3 days → 53/day) from a "
|
||||
"long-tail lurker (53 msgs in 47 days → 1.1/day). Undefined when "
|
||||
"corpus_span_days = 0; extractors should emit null/omit rather than "
|
||||
"divide-by-zero in that edge case.",
|
||||
),
|
||||
"meta.active_days": _num(
|
||||
min_val=0.0,
|
||||
notes="Count of distinct calendar days (UTC) on which the actor sent at "
|
||||
"least one message. Always ≤ corpus_span_days. An actor with span=47 "
|
||||
"and active_days=3 is a periodic visitor who appears rarely; one with "
|
||||
"span=47 and active_days=40 is a near-daily regular. Use alongside "
|
||||
"activity_density for full presence shape.",
|
||||
),
|
||||
"meta.activity_density": _num(
|
||||
min_val=0.0, max_val=1.0,
|
||||
notes="active_days / corpus_span_days. Single scalar capturing 'how filled "
|
||||
"is the span?'. 1.0 = present every day of the window. Near-0 = "
|
||||
"appeared once or twice across a long window. Undefined when "
|
||||
"corpus_span_days = 0; emit null/omit for single-day actors.",
|
||||
),
|
||||
"meta.first_seen_ts": _str(
|
||||
notes="ISO 8601 timestamp (with UTC offset, e.g. '2025-11-03T14:22:07+00:00') "
|
||||
"of the actor's earliest message in the corpus snapshot. Combined with "
|
||||
"last_seen_ts, this anchors corpus_span_days in absolute time so "
|
||||
"observations from different extractions can be compared temporally.",
|
||||
),
|
||||
"meta.last_seen_ts": _str(
|
||||
notes="ISO 8601 timestamp (with UTC offset, e.g. '2025-12-20T09:11:43+00:00') "
|
||||
"of the actor's latest message in the corpus snapshot. See first_seen_ts.",
|
||||
),
|
||||
"meta.fingerprint_confidence": _cat(
|
||||
"low", "medium", "high",
|
||||
notes="Qualitative reliability rating for this actor's full fingerprint. "
|
||||
"An attribution engine should weight all other observations from this "
|
||||
"actor proportionally to this value before compositing. Derivation is "
|
||||
"EXTRACTOR-DEFINED — the registry specifies the semantic contract, not "
|
||||
"the formula. Extractors must declare their heuristic in the source "
|
||||
"label (e.g. '#confidence-v1'). Typical inputs: total_messages, "
|
||||
"corpus_span_days, active_days, and any domain-specific thresholds "
|
||||
"the extractor authors have calibrated.",
|
||||
),
|
||||
|
||||
# ── stylometric.* (motor analog — 13) ─────────────────────────────────
|
||||
"stylometric.punctuation_style": _hash(notes="canonical punctuation-pattern fingerprint"),
|
||||
"stylometric.capitalization_habit": _cat(
|
||||
"lowercase", "proper", "random_caps", "mixed_i",
|
||||
@@ -200,8 +261,23 @@ PRIMITIVE_REGISTRY: dict[str, ValueTypeSpec] = {
|
||||
"computation, performed once per extraction. Source label declares the "
|
||||
"top-K size and corpus tag (e.g. `#tfidf-top50`).",
|
||||
),
|
||||
"stylometric.pos_ngram_signature": _hash(
|
||||
notes="64-bit simhash over a POS n-gram (default bigram) frequency vector "
|
||||
"from the author's text corpus. Captures syntactic skeleton independent "
|
||||
"of vocabulary — an author can change every word they use and still "
|
||||
"retain the same POS-bigram fingerprint. ORTHOGONAL to character_ngram "
|
||||
"and function_word distributions: those capture surface form, this "
|
||||
"captures grammatical structure. Example signal: consistent ADJ-NOUN vs "
|
||||
"NOUN-ADJ ordering in Spanish, habitual ADV-VERB pre-position. "
|
||||
"TAGGER-DEPENDENT: source label MUST declare the tagger, language model, "
|
||||
"and n value (e.g. `#spacy-es_core_news_sm-bi` for spaCy Spanish "
|
||||
"small model, bigrams). Calibration note: chat-domain text is noisy — "
|
||||
"abbreviations, misspellings, and code-switching cause tagger errors "
|
||||
"that introduce fingerprint noise. Engines should weight low until "
|
||||
"calibrated against labelled chat corpora.",
|
||||
),
|
||||
|
||||
# ── lexical.* (cognitive analog — 8) ──────────────────────────────────
|
||||
# ── lexical.* (cognitive analog — 11) ─────────────────────────────────
|
||||
"lexical.vocabulary_richness": _num(
|
||||
min_val=0.0, max_val=1.0,
|
||||
notes="Moving-Average Type-Token Ratio (MATTR) over a sliding window "
|
||||
@@ -242,6 +318,52 @@ PRIMITIVE_REGISTRY: dict[str, ValueTypeSpec] = {
|
||||
"market contexts where hierarchical and peer relationships are expressed "
|
||||
"through register choice.",
|
||||
),
|
||||
"lexical.dialect_region": _str(
|
||||
notes="Dominant regional variety of the actor's matrix language, expressed as "
|
||||
"a BCP-47 language-region tag (e.g. `es-CL`, `es-AR`, `es-MX`, `es-ES`, "
|
||||
"`en-US`). Detected from lexical marker density against per-region "
|
||||
"vocabulary tables; detection method and marker set version declared in "
|
||||
"source label (e.g. `#dialect-markers-v1`). Emit the literal string "
|
||||
"`unknown` when the extractor falls below its confidence threshold — do "
|
||||
"not omit the observation, so downstream engines can distinguish "
|
||||
"'undetected' from 'not extracted'. Language-agnostic in concept; the "
|
||||
"marker vocabulary is language-specific. COMPLEMENTARY to "
|
||||
"lexical.code_switching_matrix_language, which captures the dominant "
|
||||
"language via switching analysis rather than direct regional-marker lookup.",
|
||||
),
|
||||
"lexical.evaluative_morphology_density": _num(
|
||||
min_val=0.0, max_val=1.0,
|
||||
notes="Rate of evaluative morpheme tokens / total tokens. Evaluative morphology "
|
||||
"encompasses suffixes that add expressive/emotional loading to a stem: "
|
||||
"diminutives (`-ito`/`-ita`/`-cito`/`-cita` — affection, minimization, "
|
||||
"softening), augmentatives (`-ón`/`-ona`/`-ote`/`-ota` — intensification), "
|
||||
"pejoratives (`-ejo`/`-eja`/`-ucho`/`-ucha` — contempt), and intensives "
|
||||
"(`-azo`/`-aza` — force or admiration by context). Heavy diminutive use "
|
||||
"is characteristic of Mexican and Central American Spanish; River Plate "
|
||||
"speakers use them significantly less. The density is stable per-author "
|
||||
"and hard to consciously suppress — it is baked into language acquisition. "
|
||||
"Language-agnostic in concept; detection (suffix rules or morphological "
|
||||
"analyser) is language-specific. Source label declares the morpheme set "
|
||||
"and tool version (e.g. `#eval-morph-es-v1`).",
|
||||
),
|
||||
"lexical.optional_grammar_signature": _hash(
|
||||
notes="64-bit simhash over a vector of the author's preference probabilities "
|
||||
"at optional-grammar choice points — positions where the language offers "
|
||||
"multiple grammatically correct options and individual authors make stable "
|
||||
"idiosyncratic choices. For Spanish: compound past vs simple past ratio "
|
||||
"(`he comido` vs `comí` — Spain strongly prefers compound for recent "
|
||||
"actions; Latin America almost universally uses simple past, making this "
|
||||
"a high-reliability Spain/LatAm discriminator), subjunctive usage rate "
|
||||
"(avoidance correlates with informal register or non-native acquisition), "
|
||||
"leísmo/laísmo/loísmo clitic patterns (`le vi` vs `lo vi` for masculine "
|
||||
"accusative — leísmo is characteristic of Castilian Spain), and relative "
|
||||
"pronoun choice (`que` vs `el cual/la cual` — register marker). Each "
|
||||
"choice point is a scalar [0,1] probability; the simhash is computed over "
|
||||
"the concatenated vector. EXTRACTOR-DEFINED: choice-point set declared in "
|
||||
"source label (e.g. `#optgrammar-es-v1`). Requires sufficient corpus "
|
||||
"volume for stable probability estimates — thin corpora produce noisy "
|
||||
"hashes; engines should gate on meta.fingerprint_confidence before use.",
|
||||
),
|
||||
|
||||
# ── temporal_evolution.* (lifecycle / change-over-time — 1) ───────────
|
||||
"temporal_evolution.lifecycle_phase": _cat(
|
||||
|
||||
@@ -4,12 +4,13 @@ build-backend = "setuptools.build_meta"
|
||||
|
||||
[project]
|
||||
name = "behave-text"
|
||||
version = "0.1.0"
|
||||
version = "0.1.3"
|
||||
description = "BEHAVE-TEXT — text/messaging-domain behavioral observation registry, layered on behave-core"
|
||||
readme = "README.md"
|
||||
requires-python = ">=3.11"
|
||||
license = { text = "GPL-3.0-or-later" }
|
||||
authors = [{ name = "ANTI" }]
|
||||
dependencies = ["pydantic>=2.6", "behave-core>=0.1.0"]
|
||||
dependencies = ["pydantic>=2.6", "behave-core>=0.1.1"]
|
||||
|
||||
[project.optional-dependencies]
|
||||
dev = ["pytest>=8", "pytest-cov", "ruff"]
|
||||
|
||||
@@ -1,9 +1,9 @@
|
||||
# SPDX-License-Identifier: GPL-3.0-or-later
|
||||
"""Registry coverage tests for BEHAVE-TEXT.
|
||||
|
||||
Asserts that every primitive listed in scratchpad.md's tables has exactly one
|
||||
Asserts that every primitive listed in README.md's tables has exactly one
|
||||
entry in PRIMITIVE_REGISTRY. Drift-detector — failing this test means
|
||||
scratchpad.md and the registry have diverged.
|
||||
README.md and the registry have diverged.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
@@ -13,9 +13,18 @@ from pathlib import Path
|
||||
|
||||
from behave_text.spec import PRIMITIVE_REGISTRY, ValueKind
|
||||
|
||||
# Primitive paths expected by scratchpad.md (hand-extracted; v0).
|
||||
# Primitive paths expected by README.md (hand-extracted; v0).
|
||||
EXPECTED_PRIMITIVES = {
|
||||
# stylometric.* (motor analog — 8)
|
||||
# meta.* (corpus-snapshot footprint — 8)
|
||||
"meta.total_messages",
|
||||
"meta.corpus_span_days",
|
||||
"meta.msg_per_day",
|
||||
"meta.active_days",
|
||||
"meta.activity_density",
|
||||
"meta.first_seen_ts",
|
||||
"meta.last_seen_ts",
|
||||
"meta.fingerprint_confidence",
|
||||
# stylometric.* (motor analog — 13)
|
||||
"stylometric.punctuation_style",
|
||||
"stylometric.capitalization_habit",
|
||||
"stylometric.emoji_usage",
|
||||
@@ -28,7 +37,8 @@ EXPECTED_PRIMITIVES = {
|
||||
"stylometric.function_word_distribution_top200",
|
||||
"stylometric.character_ngram_simhash",
|
||||
"stylometric.distinctive_vocabulary_signature",
|
||||
# lexical.* (cognitive analog — 8)
|
||||
"stylometric.pos_ngram_signature",
|
||||
# lexical.* (cognitive analog — 11)
|
||||
"lexical.vocabulary_richness",
|
||||
"lexical.slang_density",
|
||||
"lexical.code_switching_rate",
|
||||
@@ -37,6 +47,9 @@ EXPECTED_PRIMITIVES = {
|
||||
"lexical.sentence_complexity_class",
|
||||
"lexical.question_formation_style",
|
||||
"lexical.imperative_style",
|
||||
"lexical.dialect_region",
|
||||
"lexical.evaluative_morphology_density",
|
||||
"lexical.optional_grammar_signature",
|
||||
# temporal_evolution.* (lifecycle/change-over-time — 1, added v0.2)
|
||||
"temporal_evolution.lifecycle_phase",
|
||||
# network.* (governance/role-shape — 2, added v0.3)
|
||||
|
||||
67
CHANGELOG.md
Normal file
67
CHANGELOG.md
Normal file
@@ -0,0 +1,67 @@
|
||||
# Changelog
|
||||
|
||||
All notable changes to BEHAVE packages are documented here.
|
||||
Format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
|
||||
Versions follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
||||
|
||||
---
|
||||
|
||||
## [behave-text 0.1.3] — 2026-05-23
|
||||
|
||||
### behave-text
|
||||
|
||||
#### Added
|
||||
- `stylometric.pos_ngram_signature` — 64-bit SimHash over POS n-gram (default bigram)
|
||||
frequency vector. Captures syntactic skeleton independent of vocabulary. Tagger-dependent;
|
||||
source label must declare tagger + model + n. Calibration note: noisy on chat-domain text,
|
||||
weight low until validated.
|
||||
- `lexical.dialect_region` — BCP-47 language-region free_string (`es-CL`, `es-AR`, `es-MX`,
|
||||
`es-ES`, `en-US`, etc.) for the actor's dominant regional variety, detected from lexical
|
||||
marker density. Emit `unknown` below confidence threshold. Designed for EYENET integration
|
||||
with INGEOTEC `regional-spanish-models` vocabulary tables (MIT).
|
||||
- `lexical.evaluative_morphology_density` — numeric [0,1] rate of evaluative morpheme tokens
|
||||
(diminutives, augmentatives, pejoratives, intensives) per total tokens. Stable per-author
|
||||
trait baked into language acquisition; strong Spain/LatAm regional discriminator.
|
||||
- `lexical.optional_grammar_signature` — 64-bit SimHash over author preference probabilities
|
||||
at optional-grammar choice points (for Spanish: compound vs simple past, subjunctive usage,
|
||||
leísmo/laísmo/loísmo, relative pronoun choice). Choice-point set is extractor-defined and
|
||||
declared in source label.
|
||||
|
||||
---
|
||||
|
||||
## [behave-text 0.1.2] — 2026-05-23
|
||||
|
||||
### behave-text
|
||||
|
||||
#### Added
|
||||
- `meta.*` layer — 8 new corpus-snapshot primitives: `total_messages`, `corpus_span_days`,
|
||||
`msg_per_day`, `active_days`, `activity_density`, `first_seen_ts`, `last_seen_ts`,
|
||||
`fingerprint_confidence`. Fills the gap between actors with identical message counts but
|
||||
radically different presence shapes (bursty single-session vs long-tail lurker).
|
||||
|
||||
#### Fixed
|
||||
- Stale `scratchpad.md` references in `primitives.py` docstring, `tests/test_primitives.py`
|
||||
docstring, and `attribution-recipes.md` — `README.md` is now the authority.
|
||||
|
||||
---
|
||||
|
||||
## [0.1.0] — 2026-05-17
|
||||
|
||||
Initial public release of all three packages.
|
||||
|
||||
### behave-core
|
||||
|
||||
- Shared observation envelope and schema contract (`BehaveObservation`)
|
||||
- Pydantic v2 base models for domain-agnostic behavioral records
|
||||
|
||||
### behave-shell
|
||||
|
||||
- Shell-session behavioral observation registry
|
||||
- Primitive catalog covering command execution, session lifecycle, environment, and navigation events
|
||||
- Layered on `behave-core`
|
||||
|
||||
### behave-text
|
||||
|
||||
- Text/messaging-domain behavioral observation registry
|
||||
- Primitive catalog covering message composition, conversation, and metadata events
|
||||
- Layered on `behave-core`
|
||||
12
README.md
12
README.md
@@ -13,15 +13,17 @@ BEHAVE defines a structured observation envelope (`core`) and two domain registr
|
||||
|
||||
## Install
|
||||
|
||||
Each package is independently installable. Install `core` first as both registries depend on it.
|
||||
All packages are available on PyPI:
|
||||
|
||||
```bash
|
||||
pip install -e core/
|
||||
pip install -e BEHAVE-SHELL/
|
||||
pip install -e BEHAVE-TEXT/
|
||||
pip install behave-core
|
||||
pip install behave-shell
|
||||
pip install behave-text
|
||||
```
|
||||
|
||||
For development (includes pytest + ruff):
|
||||
`behave-core` is pulled in automatically as a dependency of the registries.
|
||||
|
||||
For local development (includes pytest + ruff):
|
||||
|
||||
```bash
|
||||
pip install -e "core/[dev]"
|
||||
|
||||
@@ -4,8 +4,9 @@ build-backend = "setuptools.build_meta"
|
||||
|
||||
[project]
|
||||
name = "behave-core"
|
||||
version = "0.1.0"
|
||||
version = "0.1.1"
|
||||
description = "BEHAVE shared observation envelope — schema contract used by BEHAVE-SHELL and BEHAVE-TEXT"
|
||||
readme = "README.md"
|
||||
requires-python = ">=3.11"
|
||||
license = { text = "GPL-3.0-or-later" }
|
||||
authors = [{ name = "ANTI" }]
|
||||
|
||||
Reference in New Issue
Block a user