feat(text): add meta.* corpus-footprint layer and 4 language-aware primitives (v0.1.3)
Adds 12 new primitives across two waves of spec work this session.
meta.* layer (8 primitives) — corpus-snapshot footprint:
total_messages, corpus_span_days, msg_per_day, active_days,
activity_density, first_seen_ts, last_seen_ts, fingerprint_confidence.
Motivated by two actors with identical message counts (53 each) producing
indistinguishable profiles despite radically different presence shapes
(0.3-day burst vs 47-day long tail).
Language-aware characterization primitives (4 primitives):
stylometric.pos_ngram_signature — SimHash over POS bigram frequency vector;
syntactic skeleton fingerprint that survives full vocabulary paraphrase.
lexical.dialect_region — BCP-47 free_string (es-CL, es-AR, es-MX, …);
designed for EYENET integration with INGEOTEC regional-spanish-models.
lexical.evaluative_morphology_density — diminutive/augmentative/pejorative
suffix density; stable per-author trait baked into language acquisition.
lexical.optional_grammar_signature — SimHash over optional-grammar choice
points (compound/simple past, subjunctive, leísmo, relative pronoun);
high-reliability Spain vs LatAm discriminator.
Also fixes stale scratchpad.md references throughout (README.md is now the
authority), bumps behave-text to 0.1.3, and updates CHANGELOG.
This commit is contained in:
39
CHANGELOG.md
39
CHANGELOG.md
@@ -6,6 +6,45 @@ Versions follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
||||
|
||||
---
|
||||
|
||||
## [behave-text 0.1.3] — 2026-05-23
|
||||
|
||||
### behave-text
|
||||
|
||||
#### Added
|
||||
- `stylometric.pos_ngram_signature` — 64-bit SimHash over POS n-gram (default bigram)
|
||||
frequency vector. Captures syntactic skeleton independent of vocabulary. Tagger-dependent;
|
||||
source label must declare tagger + model + n. Calibration note: noisy on chat-domain text,
|
||||
weight low until validated.
|
||||
- `lexical.dialect_region` — BCP-47 language-region free_string (`es-CL`, `es-AR`, `es-MX`,
|
||||
`es-ES`, `en-US`, etc.) for the actor's dominant regional variety, detected from lexical
|
||||
marker density. Emit `unknown` below confidence threshold. Designed for EYENET integration
|
||||
with INGEOTEC `regional-spanish-models` vocabulary tables (MIT).
|
||||
- `lexical.evaluative_morphology_density` — numeric [0,1] rate of evaluative morpheme tokens
|
||||
(diminutives, augmentatives, pejoratives, intensives) per total tokens. Stable per-author
|
||||
trait baked into language acquisition; strong Spain/LatAm regional discriminator.
|
||||
- `lexical.optional_grammar_signature` — 64-bit SimHash over author preference probabilities
|
||||
at optional-grammar choice points (for Spanish: compound vs simple past, subjunctive usage,
|
||||
leísmo/laísmo/loísmo, relative pronoun choice). Choice-point set is extractor-defined and
|
||||
declared in source label.
|
||||
|
||||
---
|
||||
|
||||
## [behave-text 0.1.2] — 2026-05-23
|
||||
|
||||
### behave-text
|
||||
|
||||
#### Added
|
||||
- `meta.*` layer — 8 new corpus-snapshot primitives: `total_messages`, `corpus_span_days`,
|
||||
`msg_per_day`, `active_days`, `activity_density`, `first_seen_ts`, `last_seen_ts`,
|
||||
`fingerprint_confidence`. Fills the gap between actors with identical message counts but
|
||||
radically different presence shapes (bursty single-session vs long-tail lurker).
|
||||
|
||||
#### Fixed
|
||||
- Stale `scratchpad.md` references in `primitives.py` docstring, `tests/test_primitives.py`
|
||||
docstring, and `attribution-recipes.md` — `README.md` is now the authority.
|
||||
|
||||
---
|
||||
|
||||
## [0.1.0] — 2026-05-17
|
||||
|
||||
Initial public release of all three packages.
|
||||
|
||||
Reference in New Issue
Block a user