feat(text): add meta.* corpus-footprint layer and 4 language-aware primitives (v0.1.3)

Adds 12 new primitives across two waves of spec work this session.

meta.* layer (8 primitives), a corpus-snapshot footprint: total_messages, corpus_span_days, msg_per_day, active_days, activity_density, first_seen_ts, last_seen_ts, fingerprint_confidence. Motivated by two actors with identical message counts (53 each) producing indistinguishable profiles despite radically different presence shapes (0.3-day burst vs 47-day long tail).

Language-aware characterization primitives (4 primitives):

- stylometric.pos_ngram_signature: SimHash over POS bigram frequency vector; syntactic skeleton fingerprint that survives full vocabulary paraphrase.
- lexical.dialect_region: BCP-47 free_string (es-CL, es-AR, es-MX, …); designed for EYENET integration with INGEOTEC regional-spanish-models.
- lexical.evaluative_morphology_density: diminutive/augmentative/pejorative suffix density; stable per-author trait baked into language acquisition.
- lexical.optional_grammar_signature: SimHash over optional-grammar choice points (compound/simple past, subjunctive, leísmo, relative pronoun); high-reliability Spain vs LatAm discriminator.

Also fixes stale scratchpad.md references throughout (README.md is now the authority), bumps behave-text to 0.1.3, and updates CHANGELOG.
2026-05-23 01:54:12 -04:00
parent 214ce50941
commit b182e2fe3b
6 changed files with 215 additions and 15 deletions
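
The commit message motivates the meta.* layer with two actors who each have 53 messages but very different presence shapes. A minimal sketch of how such a footprint might be computed is shown here. The function name `corpus_footprint` and every formula in it are assumptions made for illustration (the commit does not define them): span is last minus first timestamp in days, `active_days` counts distinct UTC calendar dates, `activity_density` divides active days by calendar days covered, and `msg_per_day` floors the span at one day. `fingerprint_confidence` is left out because nothing in the commit indicates how it is derived.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86_400


def corpus_footprint(timestamps):
    """Hypothetical meta.* footprint from a list of epoch-second timestamps."""
    if not timestamps:
        raise ValueError("need at least one timestamp")
    first, last = min(timestamps), max(timestamps)
    span_days = (last - first) / SECONDS_PER_DAY
    # Distinct UTC calendar dates that contain at least one message.
    days = {datetime.fromtimestamp(t, tz=timezone.utc).date() for t in timestamps}
    # Calendar days covered by the corpus, counting both endpoints.
    calendar_days = (max(days) - min(days)).days + 1
    return {
        "total_messages": len(timestamps),
        "corpus_span_days": span_days,
        # Assumed convention: floor the span at one day so a short burst
        # does not divide by a near-zero span.
        "msg_per_day": len(timestamps) / max(span_days, 1.0),
        "active_days": len(days),
        "activity_density": len(days) / calendar_days,
        "first_seen_ts": first,
        "last_seen_ts": last,
    }


# Synthetic data only: two actors with 53 messages each, starting at a UTC midnight.
base = 19_000 * SECONDS_PER_DAY
burst = [base + (i * 25_920) // 52 for i in range(53)]                # 0.3-day burst
tail = [base + (i * 47 * SECONDS_PER_DAY) // 52 for i in range(53)]   # 47-day tail

b = corpus_footprint(burst)
t = corpus_footprint(tail)
print(b["total_messages"], b["corpus_span_days"], b["active_days"])
print(t["total_messages"], t["corpus_span_days"], t["active_days"])
```

With equal `total_messages`, the two synthetic actors separate on `corpus_span_days`, `active_days`, and `msg_per_day`, which is the gap the commit message describes.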
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,6 +6,45 @@ Versions follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

 ---

+## [behave-text 0.1.3] — 2026-05-23
+
+### behave-text
+
+#### Added
+- `stylometric.pos_ngram_signature` — 64-bit SimHash over POS n-gram (default bigram)
+  frequency vector. Captures syntactic skeleton independent of vocabulary. Tagger-dependent;
+  source label must declare tagger + model + n. Calibration note: noisy on chat-domain text,
+  weight low until validated.
+- `lexical.dialect_region` — BCP-47 language-region free_string (`es-CL`, `es-AR`, `es-MX`,
+  `es-ES`, `en-US`, etc.) for the actor's dominant regional variety, detected from lexical
+  marker density. Emit `unknown` below confidence threshold. Designed for EYENET integration
+  with INGEOTEC `regional-spanish-models` vocabulary tables (MIT).
+- `lexical.evaluative_morphology_density` — numeric [0,1] rate of evaluative morpheme tokens
+  (diminutives, augmentatives, pejoratives, intensives) per total tokens. Stable per-author
+  trait baked into language acquisition; strong Spain/LatAm regional discriminator.
+- `lexical.optional_grammar_signature` — 64-bit SimHash over author preference probabilities
+  at optional-grammar choice points (for Spanish: compound vs simple past, subjunctive usage,
+  leísmo/laísmo/loísmo, relative pronoun choice). Choice-point set is extractor-defined and
+  declared in source label.
+
+---
+
+## [behave-text 0.1.2] — 2026-05-23
+
+### behave-text
+
+#### Added
+- `meta.*` layer — 8 new corpus-snapshot primitives: `total_messages`, `corpus_span_days`,
+  `msg_per_day`, `active_days`, `activity_density`, `first_seen_ts`, `last_seen_ts`,
+  `fingerprint_confidence`. Fills the gap between actors with identical message counts but
+  radically different presence shapes (bursty single-session vs long-tail lurker).
+
+#### Fixed
+- Stale `scratchpad.md` references in `primitives.py` docstring, `tests/test_primitives.py`
+  docstring, and `attribution-recipes.md` — `README.md` is now the authority.
+
+---
+
 ## [0.1.0] — 2026-05-17

 Initial public release of all three packages.
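
The 0.1.3 entries above describe two 64-bit SimHash signatures without showing the hashing step. The following is a rough sketch of one common SimHash construction applied to POS bigram counts, not the package's actual extractor. The helper names (`pos_bigram_counts`, `simhash64`, `hamming`), the use of BLAKE2b as the per-feature hash, and the example tag sequence are all assumptions; tagging itself is assumed to happen upstream.

```python
import hashlib
from collections import Counter


def pos_bigram_counts(tags):
    """Count adjacent POS-tag pairs in a tag sequence."""
    return Counter(zip(tags, tags[1:]))


def simhash64(weighted_features):
    """64-bit SimHash over a {feature: weight} mapping.

    Each feature is hashed to 64 bits; its weight is added to a bit position
    where the hash has a 1 and subtracted where it has a 0. The signature
    keeps a 1 at every position whose total is positive.
    """
    totals = [0.0] * 64
    for feature, weight in weighted_features.items():
        # Assumed per-feature hash: BLAKE2b truncated to 8 bytes.
        digest = hashlib.blake2b(repr(feature).encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(64):
            totals[bit] += weight if (h >> bit) & 1 else -weight
    return sum(1 << bit for bit in range(64) if totals[bit] > 0)


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


# Hypothetical tag sequence; any two texts that tag to the same sequence
# produce the same signature regardless of the words used.
tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
counts = pos_bigram_counts(tags)
sig = simhash64(counts)
print(f"{sig:016x}")
```

Because only the sign of each bit total matters, raw counts and relative frequencies give the same signature, and two signatures are compared by Hamming distance rather than equality.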