feat(text): add meta.* corpus-footprint layer and 4 language-aware primitives (v0.1.3)

Adds 12 new primitives across two waves of spec work this session. meta.* layer (8 primitives) — corpus-snapshot footprint: total_messages, corpus_span_days, msg_per_day, active_days, activity_density, first_seen_ts, last_seen_ts, fingerprint_confidence. Motivated by two actors with identical message counts (53 each) producing indistinguishable profiles despite radically different presence shapes (0.3-day burst vs 47-day long tail). Language-aware characterization primitives (4 primitives): stylometric.pos_ngram_signature — SimHash over POS bigram frequency vector; syntactic skeleton fingerprint that survives full vocabulary paraphrase. lexical.dialect_region — BCP-47 free_string (es-CL, es-AR, es-MX, …); designed for EYENET integration with INGEOTEC regional-spanish-models. lexical.evaluative_morphology_density — diminutive/augmentative/pejorative suffix density; stable per-author trait baked into language acquisition. lexical.optional_grammar_signature — SimHash over optional-grammar choice points (compound/simple past, subjunctive, leísmo, relative pronoun); high-reliability Spain vs LatAm discriminator. Also fixes stale scratchpad.md references throughout (README.md is now the authority), bumps behave-text to 0.1.3, and updates CHANGELOG.
2026-05-23 01:54:12 -04:00
parent 214ce50941
commit b182e2fe3b
6 changed files with 215 additions and 15 deletions
--- a/BEHAVE-TEXT/README.md
+++ b/BEHAVE-TEXT/README.md
@@ -51,7 +51,7 @@ topic = event_topic_for("stylometric.capitalization_habit")
 | `Observation` | Registry-aware subclass of `behave_core.spec.Observation`. Validates `primitive` and `value` against `PRIMITIVE_REGISTRY`. |
 | `Window` | Re-exported from `behave_core`. |
 | `ObservationValue` | Re-exported union type. |
-| `PRIMITIVE_REGISTRY` | `dict[str, ValueTypeSpec]` — the full primitive catalog (35 entries). |
+| `PRIMITIVE_REGISTRY` | `dict[str, ValueTypeSpec]` — the full primitive catalog (47 entries). |
 | `ValueKind` | Enum: `CATEGORICAL`, `NUMERIC`, `HASH`, `ARRAY`, `FREE_STRING`, `BOOL`. |
 | `ValueTypeSpec` | Pydantic model: kind, allowed values, bounds, notes. |
 | `is_known(primitive)` | `bool` — whether a primitive path is registered. |
@@ -64,11 +64,33 @@ present in `behave-shell` but not yet implemented here — `status: planned`.

 ## Primitives

-35 primitives across 6 categories.
+47 primitives across 7 categories.

 ---

-### `stylometric.*` — Writing style fingerprints (12 primitives)
+### `meta.*` — Corpus-snapshot footprint (8 primitives)
+
+Meta primitives describe the actor's presence in the corpus window itself —
+how many messages, how long a span, how densely distributed. They are not
+stylometric features; they are the scaffolding that other primitives assume.
+Several primitives (notably `temporal_evolution.lifecycle_phase`) implicitly
+depend on these quantities; `meta.*` makes them first-class so downstream
+attribution engines can access and weight them explicitly.
+
+| Primitive | Kind | Description |
+|---|---|---|
+| `meta.total_messages` | numeric | Raw message count for this actor in the corpus snapshot. Anchor for `msg_per_day` and `fingerprint_confidence`. |
+| `meta.corpus_span_days` | numeric | Wall-clock fractional days between first and last message. First-to-last only — blind to gaps. A 47-day span with 5 active days still yields 47. Recomputable from `first_seen_ts` / `last_seen_ts`. |
+| `meta.msg_per_day` | numeric | `total_messages / corpus_span_days`. Separates bursty visitors (53 msgs / 0.3 days = 53/day) from long-tail lurkers (53 msgs / 47 days = 1.1/day). Undefined when span = 0; extractors emit null/omit rather than divide-by-zero. |
+| `meta.active_days` | numeric | Distinct calendar days (UTC) with ≥1 message. Always ≤ `corpus_span_days`. Distinguishes a periodic visitor (span=47, active=3) from a near-daily regular (span=47, active=40). |
+| `meta.activity_density` | numeric [0,1] | `active_days / corpus_span_days`. 1.0 = present every day of the window. Near-0 = appeared once or twice across a long window. Undefined when span = 0; emit null/omit for single-day actors. |
+| `meta.first_seen_ts` | free_string | ISO 8601 timestamp (UTC offset) of the actor's earliest message. Anchors `corpus_span_days` in absolute time for cross-extraction comparison. |
+| `meta.last_seen_ts` | free_string | ISO 8601 timestamp (UTC offset) of the actor's latest message. See `first_seen_ts`. |
+| `meta.fingerprint_confidence` | categorical | Qualitative reliability of this actor's full fingerprint: `low`, `medium`, `high`. Attribution engines should weight all other observations by this before compositing. Derivation is **extractor-defined** — extractors declare their heuristic in the source label (e.g. `#confidence-v1`). |
+
+---
+
+### `stylometric.*` — Writing style fingerprints (13 primitives)

 Stylometric primitives capture the unconscious writing habits that distinguish
 one author from another. The field goes back to the Mosteller-Wallace Federalist
@@ -92,10 +114,11 @@ the Rutify corpus are noted inline where they affect interpretation.
 | `stylometric.function_word_distribution_top200` | hash | 64-bit SimHash over the 200 most common Spanish function words. The wider list reaches into the long tail (rare-but-individual words like `tampoco`, `aunque`, `mientras`) that carry more discriminating signal in short-message corpora. Not yet emitted by v0 prototype — populated in v0.2. |
 | `stylometric.character_ngram_simhash` | hash | 64-bit SimHash over character n-gram frequencies (default n=3), lowercased. Orthogonal to function-word distributions: captures punctuation tics, accent-stripping habits, typo patterns, and idiom fragments that survive paraphrase. Accents are preserved because accent-stripping is itself a stylistic tic. Source label declares n size (e.g. `#char3gram`). |
 | `stylometric.distinctive_vocabulary_signature` | hash | 64-bit SimHash over a TF-IDF-weighted top-K rare-word vector. Captures the author's distinctive lexicon — words they use that other authors in the same corpus do not. Complementary to function-word distributions: where `function_word_*` captures common-word style, this captures individual lexical choice. Requires the full corpus for IDF computation. Source label declares top-K and corpus tag (e.g. `#tfidf-top50`). |
+| `stylometric.pos_ngram_signature` | hash | 64-bit SimHash over a POS n-gram (default bigram) frequency vector. Captures syntactic skeleton independent of vocabulary — an author can change every word and retain the same grammatical fingerprint. Orthogonal to character n-grams and function-word distributions. Tagger-dependent: source label must declare tagger, language model, and n (e.g. `#spacy-es_core_news_sm-bi`). Calibration note: chat-domain text produces tagger noise — weight low until validated on labelled chat corpora. |

 ---

-### `lexical.*` — Vocabulary and linguistic patterns (8 primitives)
+### `lexical.*` — Vocabulary and linguistic patterns (11 primitives)

 Lexical primitives characterize *what* and *how* an actor writes at the word and
 sentence level. Where stylometric primitives fingerprint unconscious micro-habits,
@@ -112,6 +135,9 @@ how questions are formed, register.
 | `lexical.sentence_complexity_class` | categorical | Dominant clause structure. `simple` = single-clause. `compound` = two independent clauses joined by coordinating conjunctions (pero, y, o). `complex` = dependent clauses and subordination (aunque, porque, cuando). Reflects education level and cognitive investment. |
 | `lexical.question_formation_style` | categorical | How questions are formed. `punctuation_only` = question mark without interrogative words ('¿Cuánto?') — very common in Spanish chat. `lexical` = explicit interrogatives (¿qué, cómo, cuándo). `formal` = inverted subject-verb or formal register. |
 | `lexical.imperative_style` | categorical | How commands and requests are framed. `informal_directive` = tú/vos imperative (dame, hazlo). `formal_directive` = usted imperative (hágame el favor). `polite` = conditional/modal softening (¿podría...?). Stable per-author trait in hierarchical contexts. |
+| `lexical.dialect_region` | free_string | Dominant regional variety of the actor's matrix language as a BCP-47 language-region tag (e.g. `es-CL`, `es-AR`, `es-MX`, `es-ES`, `en-US`). Detected from lexical marker density against per-region vocabulary tables. Emit literal `unknown` below confidence threshold. Detection method declared in source label (e.g. `#dialect-markers-v1`). Complementary to `code_switching_matrix_language`, which derives language via switching analysis rather than direct marker lookup. |
+| `lexical.evaluative_morphology_density` | numeric [0,1] | Rate of evaluative morpheme tokens / total tokens. Covers Spanish diminutives (`-ito`/`-ita`), augmentatives (`-ón`/`-ote`), pejoratives (`-ejo`/`-ucho`), and intensives (`-azo`). Heavy diminutive use is characteristic of Mexican/Central American Spanish; River Plate speakers use them significantly less. Stable per-author — baked into language acquisition and hard to consciously suppress. Source label declares morpheme set and tool version (e.g. `#eval-morph-es-v1`). |
+| `lexical.optional_grammar_signature` | hash | 64-bit SimHash over the author's preference probability vector at optional-grammar choice points. For Spanish: compound vs simple past (`he comido` vs `comí` — high-reliability Spain/LatAm discriminator), subjunctive usage rate, leísmo/laísmo/loísmo clitic patterns, and relative pronoun choice (`que` vs `el cual`). Each choice point is a scalar [0,1]; the SimHash is computed over the concatenated vector. Choice-point set is extractor-defined and declared in source label (e.g. `#optgrammar-es-v1`). Requires sufficient corpus volume for stable probabilities — gate on `meta.fingerprint_confidence` before use. |

 ---