2 Commits

Author SHA1 Message Date
0d18a3e30d modified: readme 2026-05-29 18:13:41 -04:00
b182e2fe3b feat(text): add meta.* corpus-footprint layer and 4 language-aware primitives (v0.1.3)
Adds 12 new primitives across two waves of spec work this session.

meta.* layer (8 primitives) — corpus-snapshot footprint:
  total_messages, corpus_span_days, msg_per_day, active_days,
  activity_density, first_seen_ts, last_seen_ts, fingerprint_confidence.
  Motivated by two actors with identical message counts (53 each) producing
  indistinguishable profiles despite radically different presence shapes
  (0.3-day burst vs 47-day long tail).

Language-aware characterization primitives (4 primitives):
  stylometric.pos_ngram_signature — SimHash over POS bigram frequency vector;
    syntactic skeleton fingerprint that survives full vocabulary paraphrase.
  lexical.dialect_region — BCP-47 free_string (es-CL, es-AR, es-MX, …);
    designed for EYENET integration with INGEOTEC regional-spanish-models.
  lexical.evaluative_morphology_density — diminutive/augmentative/pejorative
    suffix density; stable per-author trait baked into language acquisition.
  lexical.optional_grammar_signature — SimHash over optional-grammar choice
    points (compound/simple past, subjunctive, leísmo, relative pronoun);
    high-reliability Spain vs LatAm discriminator.

Also fixes stale scratchpad.md references throughout (README.md is now the
authority), bumps behave-text to 0.1.3, and updates CHANGELOG.
2026-05-23 01:54:12 -04:00
6 changed files with 258 additions and 15 deletions


@@ -51,7 +51,7 @@ topic = event_topic_for("stylometric.capitalization_habit")
| `Observation` | Registry-aware subclass of `behave_core.spec.Observation`. Validates `primitive` and `value` against `PRIMITIVE_REGISTRY`. |
| `Window` | Re-exported from `behave_core`. |
| `ObservationValue` | Re-exported union type. |
| `PRIMITIVE_REGISTRY` | `dict[str, ValueTypeSpec]` — the full primitive catalog (35 entries). | | `PRIMITIVE_REGISTRY` | `dict[str, ValueTypeSpec]` — the full primitive catalog (47 entries). |
| `ValueKind` | Enum: `CATEGORICAL`, `NUMERIC`, `HASH`, `ARRAY`, `FREE_STRING`, `BOOL`. |
| `ValueTypeSpec` | Pydantic model: kind, allowed values, bounds, notes. |
| `is_known(primitive)` | `bool` — whether a primitive path is registered. |
@@ -64,11 +64,33 @@ present in `behave-shell` but not yet implemented here — `status: planned`.
## Primitives
35 primitives across 6 categories. 47 primitives across 7 categories.
---
### `stylometric.*` — Writing style fingerprints (12 primitives) ### `meta.*` — Corpus-snapshot footprint (8 primitives)
Meta primitives describe the actor's presence in the corpus window itself —
how many messages, how long a span, how densely distributed. They are not
stylometric features; they are the scaffolding that other primitives assume.
Several primitives (notably `temporal_evolution.lifecycle_phase`) implicitly
depend on these quantities; `meta.*` makes them first-class so downstream
attribution engines can access and weight them explicitly.
| Primitive | Kind | Description |
|---|---|---|
| `meta.total_messages` | numeric | Raw message count for this actor in the corpus snapshot. Anchor for `msg_per_day` and `fingerprint_confidence`. |
| `meta.corpus_span_days` | numeric | Wall-clock fractional days between first and last message. First-to-last only — blind to gaps. A 47-day span with 5 active days still yields 47. Recomputable from `first_seen_ts` / `last_seen_ts`. |
| `meta.msg_per_day` | numeric | `total_messages / corpus_span_days`. Separates bursty visitors (53 msgs / 0.3 days ≈ 177/day) from long-tail lurkers (53 msgs / 47 days ≈ 1.1/day). Undefined when span = 0; extractors emit null/omit rather than divide-by-zero. |
| `meta.active_days` | numeric | Distinct calendar days (UTC) with ≥1 message. At most `ceil(corpus_span_days) + 1`, so it can exceed `corpus_span_days` for sub-day or midnight-straddling spans. Distinguishes a periodic visitor (span=47, active=3) from a near-daily regular (span=47, active=40). |
| `meta.activity_density` | numeric [0,1] | `active_days / corpus_span_days`. 1.0 = present every day of the window. Near-0 = appeared once or twice across a long window. Undefined when span = 0; emit null/omit for single-day actors. |
| `meta.first_seen_ts` | free_string | ISO 8601 timestamp (UTC offset) of the actor's earliest message. Anchors `corpus_span_days` in absolute time for cross-extraction comparison. |
| `meta.last_seen_ts` | free_string | ISO 8601 timestamp (UTC offset) of the actor's latest message. See `first_seen_ts`. |
| `meta.fingerprint_confidence` | categorical | Qualitative reliability of this actor's full fingerprint: `low`, `medium`, `high`. Attribution engines should weight all other observations by this before compositing. Derivation is **extractor-defined** — extractors declare their heuristic in the source label (e.g. `#confidence-v1`). |
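
The table above can be illustrated with a minimal sketch that derives the numeric and timestamp `meta.*` values from one actor's message timestamps. The helper name `meta_footprint` and its return shape are hypothetical and not part of `behave-text`; timestamps are assumed to be timezone-aware. `meta.fingerprint_confidence` is left out because its derivation is extractor-defined.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def meta_footprint(timestamps: list[datetime]) -> dict[str, Optional[object]]:
    """Hypothetical helper: derive meta.* values from timezone-aware timestamps."""
    if not timestamps:
        raise ValueError("need at least one message timestamp")
    # Normalise to UTC so calendar-day bucketing matches the "(UTC)" wording.
    ts = sorted(t.astimezone(timezone.utc) for t in timestamps)
    first, last = ts[0], ts[-1]
    span_days = (last - first).total_seconds() / 86400.0
    active_days = len({t.date() for t in ts})

    if span_days > 0:
        msg_per_day: Optional[float] = len(ts) / span_days
        # Assumption: clamp to the declared [0, 1] bound, because a sub-day
        # span still touches at least one calendar day and the raw ratio
        # would otherwise exceed 1.
        density: Optional[float] = min(1.0, active_days / span_days)
    else:
        # Span of zero: emit null rather than divide by zero.
        msg_per_day = None
        density = None

    return {
        "meta.total_messages": float(len(ts)),
        "meta.corpus_span_days": span_days,
        "meta.msg_per_day": msg_per_day,
        "meta.active_days": float(active_days),
        "meta.activity_density": density,
        "meta.first_seen_ts": first.isoformat(),
        "meta.last_seen_ts": last.isoformat(),
    }


# Two actors with 53 messages each: a 0.3-day burst and a 47-day long tail.
start = datetime(2025, 11, 3, 14, 0, tzinfo=timezone.utc)
burst = [start + timedelta(days=0.3) * i / 52 for i in range(53)]
long_tail = [start + timedelta(days=47) * i / 52 for i in range(53)]
print(meta_footprint(burst)["meta.msg_per_day"])      # roughly 177
print(meta_footprint(long_tail)["meta.msg_per_day"])  # roughly 1.1
```

The clamp on `activity_density` is one possible reading of the `[0,1]` bound, not something the registry prescribes; an extractor could equally emit null for sub-day spans.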
---
### `stylometric.*` — Writing style fingerprints (13 primitives)
Stylometric primitives capture the unconscious writing habits that distinguish
one author from another. The field goes back to the Mosteller-Wallace Federalist
@@ -92,10 +114,11 @@ the Rutify corpus are noted inline where they affect interpretation.
| `stylometric.function_word_distribution_top200` | hash | 64-bit SimHash over the 200 most common Spanish function words. The wider list reaches into the long tail (rare-but-individual words like `tampoco`, `aunque`, `mientras`) that carry more discriminating signal in short-message corpora. Not yet emitted by v0 prototype — populated in v0.2. |
| `stylometric.character_ngram_simhash` | hash | 64-bit SimHash over character n-gram frequencies (default n=3), lowercased. Orthogonal to function-word distributions: captures punctuation tics, accent-stripping habits, typo patterns, and idiom fragments that survive paraphrase. Accents are preserved because accent-stripping is itself a stylistic tic. Source label declares n size (e.g. `#char3gram`). |
| `stylometric.distinctive_vocabulary_signature` | hash | 64-bit SimHash over a TF-IDF-weighted top-K rare-word vector. Captures the author's distinctive lexicon — words they use that other authors in the same corpus do not. Complementary to function-word distributions: where `function_word_*` captures common-word style, this captures individual lexical choice. Requires the full corpus for IDF computation. Source label declares top-K and corpus tag (e.g. `#tfidf-top50`). |
| `stylometric.pos_ngram_signature` | hash | 64-bit SimHash over a POS n-gram (default bigram) frequency vector. Captures syntactic skeleton independent of vocabulary — an author can change every word and retain the same grammatical fingerprint. Orthogonal to character n-grams and function-word distributions. Tagger-dependent: source label must declare tagger, language model, and n (e.g. `#spacy-es_core_news_sm-bi`). Calibration note: chat-domain text produces tagger noise — weight low until validated on labelled chat corpora. |
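
As an illustration of the `pos_ngram_signature` row, here is a minimal sketch of a 64-bit SimHash over a POS bigram frequency vector. It assumes the messages have already been tagged (the input is a list of tag sequences, so no tagger is invoked), and it uses BLAKE2b as the per-feature hash. The function names and the choice of BLAKE2b are assumptions for illustration; the registry does not pin a hash function.

```python
import hashlib
from collections import Counter


def simhash64(weights: dict[str, float]) -> int:
    """Weighted 64-bit SimHash: each feature votes on every bit position."""
    acc = [0.0] * 64
    for feature, weight in weights.items():
        digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(64):
            acc[bit] += weight if (h >> bit) & 1 else -weight
    # A bit is set when the weighted vote for it is positive.
    return sum(1 << bit for bit in range(64) if acc[bit] > 0)


def pos_bigram_signature(tagged_messages: list[list[str]]) -> int:
    """Hypothetical helper: SimHash over relative POS bigram frequencies."""
    counts: Counter[str] = Counter()
    for tags in tagged_messages:
        # Bigrams never span message boundaries.
        counts.update(f"{a}>{b}" for a, b in zip(tags, tags[1:]))
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no POS bigrams in input")
    return simhash64({bigram: n / total for bigram, n in counts.items()})


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two 64-bit signatures."""
    return bin(a ^ b).count("1")


# Tag sequences only: the words themselves never reach the hash.
actor = [["DET", "NOUN", "ADJ", "VERB"], ["ADV", "VERB", "DET", "NOUN"]]
signature = pos_bigram_signature(actor)
print(f"{signature:016x}")
```

Because only tag sequences are hashed, two texts with the same tag sequences and entirely different vocabulary yield the same signature, which is the property the row describes. Tagger noise on chat text changes the tags themselves, so the calibration caveat still applies.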
---
### `lexical.*` — Vocabulary and linguistic patterns (8 primitives) ### `lexical.*` — Vocabulary and linguistic patterns (11 primitives)
Lexical primitives characterize *what* and *how* an actor writes at the word and
sentence level. Where stylometric primitives fingerprint unconscious micro-habits,
@@ -112,6 +135,9 @@ how questions are formed, register.
| `lexical.sentence_complexity_class` | categorical | Dominant clause structure. `simple` = single-clause. `compound` = two independent clauses joined by coordinating conjunctions (pero, y, o). `complex` = dependent clauses and subordination (aunque, porque, cuando). Reflects education level and cognitive investment. |
| `lexical.question_formation_style` | categorical | How questions are formed. `punctuation_only` = question mark without interrogative words ('¿Cuánto?') — very common in Spanish chat. `lexical` = explicit interrogatives (¿qué, cómo, cuándo). `formal` = inverted subject-verb or formal register. |
| `lexical.imperative_style` | categorical | How commands and requests are framed. `informal_directive` = tú/vos imperative (dame, hazlo). `formal_directive` = usted imperative (hágame el favor). `polite` = conditional/modal softening (¿podría...?). Stable per-author trait in hierarchical contexts. |
| `lexical.dialect_region` | free_string | Dominant regional variety of the actor's matrix language as a BCP-47 language-region tag (e.g. `es-CL`, `es-AR`, `es-MX`, `es-ES`, `en-US`). Detected from lexical marker density against per-region vocabulary tables. Emit literal `unknown` below confidence threshold. Detection method declared in source label (e.g. `#dialect-markers-v1`). Complementary to `code_switching_matrix_language`, which derives language via switching analysis rather than direct marker lookup. |
| `lexical.evaluative_morphology_density` | numeric [0,1] | Rate of evaluative morpheme tokens / total tokens. Covers Spanish diminutives (`-ito`/`-ita`), augmentatives (`-ón`/`-ote`), pejoratives (`-ejo`/`-ucho`), and intensives (`-azo`). Heavy diminutive use is characteristic of Mexican/Central American Spanish; River Plate speakers use them significantly less. Stable per-author — baked into language acquisition and hard to consciously suppress. Source label declares morpheme set and tool version (e.g. `#eval-morph-es-v1`). |
| `lexical.optional_grammar_signature` | hash | 64-bit SimHash over the author's preference probability vector at optional-grammar choice points. For Spanish: compound vs simple past (`he comido` vs `comí` — high-reliability Spain/LatAm discriminator), subjunctive usage rate, leísmo/laísmo/loísmo clitic patterns, and relative pronoun choice (`que` vs `el cual`). Each choice point is a scalar [0,1]; the SimHash is computed over the concatenated vector. Choice-point set is extractor-defined and declared in source label (e.g. `#optgrammar-es-v1`). Requires sufficient corpus volume for stable probabilities — gate on `meta.fingerprint_confidence` before use. |
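
For the `evaluative_morphology_density` row, a naive suffix-rule sketch shows the shape of the computation. The suffix list is taken from the row above plus common gendered variants; the function name, tokenizer, and `min_stem` guard are assumptions for illustration. A real extractor would likely use a morphological analyser, since plain suffix matching miscounts lexicalized words such as `camión` or `abrazo` and ignores plurals.

```python
import re
from typing import Optional

# Suffix inventory, longest alternatives first within each family.
EVALUATIVE_SUFFIXES = (
    "cito", "cita", "ito", "ita",   # diminutives
    "ón", "ona", "ote", "ota",      # augmentatives
    "ejo", "eja", "ucho", "ucha",   # pejoratives
    "azo", "aza",                   # intensives
)

# Runs of Unicode letters; digits, underscores and punctuation split tokens.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def evaluative_morphology_density(text: str, min_stem: int = 3) -> Optional[float]:
    """Hypothetical helper: evaluative-suffix tokens divided by total tokens."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    if not tokens:
        return None  # nothing to measure; the caller decides whether to omit
    hits = sum(
        1
        for token in tokens
        # Require a minimum stem length so short words like "ita" do not count.
        if any(
            token.endswith(suffix) and len(token) - len(suffix) >= min_stem
            for suffix in EVALUATIVE_SUFFIXES
        )
    )
    return hits / len(tokens)


print(evaluative_morphology_density("Dame un cafecito y un besito, perrazo"))
```

In the sample sentence three of the seven tokens (`cafecito`, `besito`, `perrazo`) carry an evaluative suffix, so the density is 3/7.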
---
@@ -171,6 +197,49 @@ choose to weight these at zero until field-validated against labeled data.
---
## Engine implementation notes
### Cross-alphabet fingerprint comparisons are undefined
Several primitives produce hash-based fingerprints by hashing over character,
word, or syntactic features:
- `stylometric.character_ngram_simhash`
- `stylometric.pos_ngram_signature`
- `stylometric.function_word_distribution_top50` / `top200`
- `stylometric.distinctive_vocabulary_signature`
- `lexical.optional_grammar_signature`
These fingerprints are only **meaningful within a single writing-system boundary**.
A Latin-script actor (Spanish, English, French, Portuguese) and a Cyrillic-script
actor (Russian, Bulgarian, Serbian) share zero character n-grams by definition.
Comparing their `character_ngram_simhash` values produces a Hamming distance that
is numerically valid but semantically undefined — it does not measure dissimilarity,
it measures incomparability.
The same applies to any other script boundary: Arabic vs Latin, Hangul vs Hiragana,
Devanagari vs Cyrillic, and so on.
**Engine rule:** before compositing or comparing any hash-based fingerprint between
two actors, gate on script/language compatibility. Use `lexical.dialect_region` or
`lexical.code_switching_matrix_language` to determine whether two actors share a
writing system. If they do not, treat the fingerprint distance as `undefined` rather
than as evidence of dissimilarity — do not include it in the similarity composite.
Primitives that are **not** subject to this constraint (safe to compare across
writing systems without gating):
- All `meta.*` primitives — corpus-footprint metrics are script-agnostic.
- All `interaction.*` primitives — timing and graph-structure signals are
script-agnostic.
- `stylometric.emoji_usage`, `stylometric.emoji_placement` — Unicode emoji are
shared across scripts.
- `stylometric.capitalization_habit` — only meaningful within scripts that have
case; emit `unknown` for caseless scripts (Arabic, CJK, etc.).
- `network.*`, `temporal_evolution.*` — structural signals, script-agnostic.
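
The engine rule above can be sketched as a gate in front of the Hamming distance. This is a minimal illustration under stated assumptions: the language-to-script table is a hypothetical, deliberately tiny mapping to ISO 15924 script codes, the function names are invented for the example, and treating `unknown` tags as incomparable is one conservative choice rather than something the spec mandates.

```python
from typing import Optional

# Hypothetical, deliberately incomplete language -> script table.
# Languages written in more than one script need an explicit script subtag.
SCRIPT_BY_LANGUAGE = {
    "es": "Latn", "en": "Latn", "fr": "Latn", "pt": "Latn",
    "ru": "Cyrl", "bg": "Cyrl", "ar": "Arab", "ko": "Hang", "hi": "Deva",
}


def script_of(tag: Optional[str]) -> Optional[str]:
    """Best-effort script for a BCP-47 style tag such as 'es-CL' or 'sr-Cyrl-RS'."""
    if not tag or tag == "unknown":
        return None
    subtags = tag.replace("_", "-").split("-")
    # An explicit four-letter script subtag wins over the lookup table.
    for sub in subtags[1:]:
        if len(sub) == 4 and sub.isalpha():
            return sub.title()
    return SCRIPT_BY_LANGUAGE.get(subtags[0].lower())


def gated_hamming(hash_a: int, hash_b: int, tag_a: str, tag_b: str) -> Optional[int]:
    """Hamming distance between 64-bit fingerprints, or None when undefined."""
    script_a, script_b = script_of(tag_a), script_of(tag_b)
    if script_a is None or script_b is None or script_a != script_b:
        # Undefined, not "maximally dissimilar": leave it out of the composite.
        return None
    return bin((hash_a ^ hash_b) & 0xFFFFFFFFFFFFFFFF).count("1")


print(gated_hamming(0b1011, 0b0001, "es-CL", "es-MX"))  # same script: 2
print(gated_hamming(0b1011, 0b0001, "es-CL", "ru-RU"))  # cross-script: None
```

Returning `None` instead of a large distance keeps the "undefined" case distinguishable, so a compositing step can skip the term rather than count it as evidence.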
---
## Schema
Machine-readable JSON Schema:


@@ -2,7 +2,7 @@
# BEHAVE-TEXT Attribution Recipes
> **This document is not part of BEHAVE-TEXT.** BEHAVE-TEXT (`scratchpad.md`) defines the observation taxonomy and emission envelope. It does **not** assert who an actor is, link sessions, or assign profiles. Those are attribution-engine concerns. > **This document is not part of BEHAVE-TEXT.** BEHAVE-TEXT (`README.md`) defines the observation taxonomy and emission envelope. It does **not** assert who an actor is, link sessions, or assign profiles. Those are attribution-engine concerns.
>
> This document is a **placeholder**. Recipes for the text domain wait for corpus calibration. The Rutify Telegram corpus (forthcoming) will be the labeling ground truth that drives the first concrete profiles.


@@ -16,7 +16,7 @@ PII discipline notice (carried over from behave-core's envelope module):
IS text content. Sensors must hash/aggregate before emitting.
Adding a new primitive is a deliberate registry edit. Drift between this file
and `scratchpad.md` is a bug; v0 keeps the registry hand-written so PR review and `README.md` is a bug; v0 keeps the registry hand-written so PR review
catches drift, v0.x may auto-extract from the markdown if drift becomes a
maintenance issue.
@@ -109,10 +109,71 @@ def _array(of: ValueKind, notes: Optional[str] = None) -> ValueTypeSpec:
# ─── The registry ───────────────────────────────────────────────────────────
#
# 28 primitives across 4 layers. Mirrors scratchpad.md row-for-row. # 47 primitives across 7 layers. Mirrors README.md row-for-row.
PRIMITIVE_REGISTRY: dict[str, ValueTypeSpec] = {
# ── stylometric.* (motor analog — 8) ────────────────────────────────── # ── meta.* (corpus-snapshot footprint — 8) ────────────────────────────
"meta.total_messages": _num(
min_val=0.0,
notes="Raw message count for this actor in the corpus snapshot. Integer in "
"practice; stored as float for spec uniformity. Dependency anchor: "
"msg_per_day is derived from this; fingerprint_confidence is informed "
"by this. Emit before deriving rates.",
),
"meta.corpus_span_days": _num(
min_val=0.0,
notes="Wall-clock duration in fractional days between the actor's earliest "
"and latest message in the corpus snapshot. First-to-last only — blind "
"to silence in between (a 47-day span with 5 active days still yields "
"47). Complement with active_days and activity_density to get presence "
"shape. Recomputable from first_seen_ts and last_seen_ts.",
),
"meta.msg_per_day": _num(
min_val=0.0,
notes="total_messages / corpus_span_days. The key rate that separates a "
"bursty single-session visitor (53 msgs in 0.3 days → ~177/day) from a "
"long-tail lurker (53 msgs in 47 days → 1.1/day). Undefined when "
"corpus_span_days = 0; extractors should emit null/omit rather than "
"divide-by-zero in that edge case.",
),
"meta.active_days": _num(
min_val=0.0,
notes="Count of distinct calendar days (UTC) on which the actor sent at "
"least one message. At most ceil(corpus_span_days) + 1, so it can "
"exceed the span for sub-day or midnight-straddling bursts. An actor with span=47 "
"and active_days=3 is a periodic visitor who appears rarely; one with "
"span=47 and active_days=40 is a near-daily regular. Use alongside "
"activity_density for full presence shape.",
),
"meta.activity_density": _num(
min_val=0.0, max_val=1.0,
notes="active_days / corpus_span_days. Single scalar capturing 'how filled "
"is the span?'. 1.0 = present every day of the window. Near-0 = "
"appeared once or twice across a long window. Undefined when "
"corpus_span_days = 0; emit null/omit for single-day actors.",
),
"meta.first_seen_ts": _str(
notes="ISO 8601 timestamp (with UTC offset, e.g. '2025-11-03T14:22:07+00:00') "
"of the actor's earliest message in the corpus snapshot. Combined with "
"last_seen_ts, this anchors corpus_span_days in absolute time so "
"observations from different extractions can be compared temporally.",
),
"meta.last_seen_ts": _str(
notes="ISO 8601 timestamp (with UTC offset, e.g. '2025-12-20T09:11:43+00:00') "
"of the actor's latest message in the corpus snapshot. See first_seen_ts.",
),
"meta.fingerprint_confidence": _cat(
"low", "medium", "high",
notes="Qualitative reliability rating for this actor's full fingerprint. "
"An attribution engine should weight all other observations from this "
"actor proportionally to this value before compositing. Derivation is "
"EXTRACTOR-DEFINED — the registry specifies the semantic contract, not "
"the formula. Extractors must declare their heuristic in the source "
"label (e.g. '#confidence-v1'). Typical inputs: total_messages, "
"corpus_span_days, active_days, and any domain-specific thresholds "
"the extractor authors have calibrated.",
),
# ── stylometric.* (motor analog — 13) ─────────────────────────────────
"stylometric.punctuation_style": _hash(notes="canonical punctuation-pattern fingerprint"),
"stylometric.capitalization_habit": _cat(
"lowercase", "proper", "random_caps", "mixed_i",
@@ -200,8 +261,23 @@ PRIMITIVE_REGISTRY: dict[str, ValueTypeSpec] = {
"computation, performed once per extraction. Source label declares the "
"top-K size and corpus tag (e.g. `#tfidf-top50`).",
),
"stylometric.pos_ngram_signature": _hash(
notes="64-bit simhash over a POS n-gram (default bigram) frequency vector "
"from the author's text corpus. Captures syntactic skeleton independent "
"of vocabulary — an author can change every word they use and still "
"retain the same POS-bigram fingerprint. ORTHOGONAL to character_ngram "
"and function_word distributions: those capture surface form, this "
"captures grammatical structure. Example signal: consistent ADJ-NOUN vs "
"NOUN-ADJ ordering in Spanish, habitual ADV-VERB pre-position. "
"TAGGER-DEPENDENT: source label MUST declare the tagger, language model, "
"and n value (e.g. `#spacy-es_core_news_sm-bi` for spaCy Spanish "
"small model, bigrams). Calibration note: chat-domain text is noisy — "
"abbreviations, misspellings, and code-switching cause tagger errors "
"that introduce fingerprint noise. Engines should weight low until "
"calibrated against labelled chat corpora.",
),
# ── lexical.* (cognitive analog — 8) ───────────────────────────────── # ── lexical.* (cognitive analog — 11) ─────────────────────────────────
"lexical.vocabulary_richness": _num(
min_val=0.0, max_val=1.0,
notes="Moving-Average Type-Token Ratio (MATTR) over a sliding window "
@@ -242,6 +318,52 @@ PRIMITIVE_REGISTRY: dict[str, ValueTypeSpec] = {
"market contexts where hierarchical and peer relationships are expressed "
"through register choice.",
),
"lexical.dialect_region": _str(
notes="Dominant regional variety of the actor's matrix language, expressed as "
"a BCP-47 language-region tag (e.g. `es-CL`, `es-AR`, `es-MX`, `es-ES`, "
"`en-US`). Detected from lexical marker density against per-region "
"vocabulary tables; detection method and marker set version declared in "
"source label (e.g. `#dialect-markers-v1`). Emit the literal string "
"`unknown` when the extractor falls below its confidence threshold — do "
"not omit the observation, so downstream engines can distinguish "
"'undetected' from 'not extracted'. Language-agnostic in concept; the "
"marker vocabulary is language-specific. COMPLEMENTARY to "
"lexical.code_switching_matrix_language, which captures the dominant "
"language via switching analysis rather than direct regional-marker lookup.",
),
"lexical.evaluative_morphology_density": _num(
min_val=0.0, max_val=1.0,
notes="Rate of evaluative morpheme tokens / total tokens. Evaluative morphology "
"encompasses suffixes that add expressive/emotional loading to a stem: "
"diminutives (`-ito`/`-ita`/`-cito`/`-cita` — affection, minimization, "
"softening), augmentatives (`-ón`/`-ona`/`-ote`/`-ota` — intensification), "
"pejoratives (`-ejo`/`-eja`/`-ucho`/`-ucha` — contempt), and intensives "
"(`-azo`/`-aza` — force or admiration by context). Heavy diminutive use "
"is characteristic of Mexican and Central American Spanish; River Plate "
"speakers use them significantly less. The density is stable per-author "
"and hard to consciously suppress — it is baked into language acquisition. "
"Language-agnostic in concept; detection (suffix rules or morphological "
"analyser) is language-specific. Source label declares the morpheme set "
"and tool version (e.g. `#eval-morph-es-v1`).",
),
"lexical.optional_grammar_signature": _hash(
notes="64-bit simhash over a vector of the author's preference probabilities "
"at optional-grammar choice points — positions where the language offers "
"multiple grammatically correct options and individual authors make stable "
"idiosyncratic choices. For Spanish: compound past vs simple past ratio "
"(`he comido` vs `comí` — Spain strongly prefers compound for recent "
"actions; Latin America almost universally uses simple past, making this "
"a high-reliability Spain/LatAm discriminator), subjunctive usage rate "
"(avoidance correlates with informal register or non-native acquisition), "
"leísmo/laísmo/loísmo clitic patterns (`le vi` vs `lo vi` for masculine "
"accusative — leísmo is characteristic of Castilian Spain), and relative "
"pronoun choice (`que` vs `el cual/la cual` — register marker). Each "
"choice point is a scalar [0,1] probability; the simhash is computed over "
"the concatenated vector. EXTRACTOR-DEFINED: choice-point set declared in "
"source label (e.g. `#optgrammar-es-v1`). Requires sufficient corpus "
"volume for stable probability estimates — thin corpora produce noisy "
"hashes; engines should gate on meta.fingerprint_confidence before use.",
),
# ── temporal_evolution.* (lifecycle / change-over-time — 1) ───────────
"temporal_evolution.lifecycle_phase": _cat(


@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
[project]
name = "behave-text"
version = "0.1.1" version = "0.1.3"
description = "BEHAVE-TEXT — text/messaging-domain behavioral observation registry, layered on behave-core"
readme = "README.md"
requires-python = ">=3.11"


@@ -1,9 +1,9 @@
# SPDX-License-Identifier: GPL-3.0-or-later
"""Registry coverage tests for BEHAVE-TEXT.
Asserts that every primitive listed in scratchpad.md's tables has exactly one Asserts that every primitive listed in README.md's tables has exactly one
entry in PRIMITIVE_REGISTRY. Drift-detector — failing this test means
scratchpad.md and the registry have diverged. README.md and the registry have diverged.
"""
from __future__ import annotations
@@ -13,9 +13,18 @@ from pathlib import Path
from behave_text.spec import PRIMITIVE_REGISTRY, ValueKind
# Primitive paths expected by scratchpad.md (hand-extracted; v0). # Primitive paths expected by README.md (hand-extracted; v0).
EXPECTED_PRIMITIVES = {
# stylometric.* (motor analog — 8) # meta.* (corpus-snapshot footprint — 8)
"meta.total_messages",
"meta.corpus_span_days",
"meta.msg_per_day",
"meta.active_days",
"meta.activity_density",
"meta.first_seen_ts",
"meta.last_seen_ts",
"meta.fingerprint_confidence",
# stylometric.* (motor analog — 13)
"stylometric.punctuation_style",
"stylometric.capitalization_habit",
"stylometric.emoji_usage",
@@ -28,7 +37,8 @@ EXPECTED_PRIMITIVES = {
"stylometric.function_word_distribution_top200",
"stylometric.character_ngram_simhash",
"stylometric.distinctive_vocabulary_signature",
# lexical.* (cognitive analog — 8) "stylometric.pos_ngram_signature",
# lexical.* (cognitive analog — 11)
"lexical.vocabulary_richness",
"lexical.slang_density",
"lexical.code_switching_rate",
@@ -37,6 +47,9 @@ EXPECTED_PRIMITIVES = {
"lexical.sentence_complexity_class",
"lexical.question_formation_style",
"lexical.imperative_style",
"lexical.dialect_region",
"lexical.evaluative_morphology_density",
"lexical.optional_grammar_signature",
# temporal_evolution.* (lifecycle/change-over-time — 1, added v0.2)
"temporal_evolution.lifecycle_phase",
# network.* (governance/role-shape — 2, added v0.3) # network.* (governance/role-shape — 2, added v0.3)


@@ -6,6 +6,45 @@ Versions follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
---
## [behave-text 0.1.3] — 2026-05-23
### behave-text
#### Added
- `stylometric.pos_ngram_signature` — 64-bit SimHash over POS n-gram (default bigram)
frequency vector. Captures syntactic skeleton independent of vocabulary. Tagger-dependent;
source label must declare tagger + model + n. Calibration note: noisy on chat-domain text,
weight low until validated.
- `lexical.dialect_region` — BCP-47 language-region free_string (`es-CL`, `es-AR`, `es-MX`,
`es-ES`, `en-US`, etc.) for the actor's dominant regional variety, detected from lexical
marker density. Emit `unknown` below confidence threshold. Designed for EYENET integration
with INGEOTEC `regional-spanish-models` vocabulary tables (MIT).
- `lexical.evaluative_morphology_density` — numeric [0,1] rate of evaluative morpheme tokens
(diminutives, augmentatives, pejoratives, intensives) per total tokens. Stable per-author
trait baked into language acquisition; regional discriminator (heavier in Mexican/Central American Spanish than in River Plate Spanish).
- `lexical.optional_grammar_signature` — 64-bit SimHash over author preference probabilities
at optional-grammar choice points (for Spanish: compound vs simple past, subjunctive usage,
leísmo/laísmo/loísmo, relative pronoun choice). Choice-point set is extractor-defined and
declared in source label.
---
## [behave-text 0.1.2] — 2026-05-23
### behave-text
#### Added
- `meta.*` layer — 8 new corpus-snapshot primitives: `total_messages`, `corpus_span_days`,
`msg_per_day`, `active_days`, `activity_density`, `first_seen_ts`, `last_seen_ts`,
`fingerprint_confidence`. Fills the gap between actors with identical message counts but
radically different presence shapes (bursty single-session vs long-tail lurker).
#### Fixed
- Stale `scratchpad.md` references in `primitives.py` docstring, `tests/test_primitives.py`
docstring, and `attribution-recipes.md`; `README.md` is now the authority.
---
## [0.1.0] — 2026-05-17
Initial public release of all three packages.