From b182e2fe3bcb4cb2e063783211236b1781c76d6b Mon Sep 17 00:00:00 2001 From: anti Date: Sat, 23 May 2026 01:54:12 -0400 Subject: [PATCH] feat(text): add meta.* corpus-footprint layer and 4 language-aware primitives (v0.1.3) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds 12 new primitives across two waves of spec work this session. meta.* layer (8 primitives) — corpus-snapshot footprint: total_messages, corpus_span_days, msg_per_day, active_days, activity_density, first_seen_ts, last_seen_ts, fingerprint_confidence. Motivated by two actors with identical message counts (53 each) producing indistinguishable profiles despite radically different presence shapes (0.3-day burst vs 47-day long tail). Language-aware characterization primitives (4 primitives): stylometric.pos_ngram_signature — SimHash over POS bigram frequency vector; syntactic skeleton fingerprint that survives full vocabulary paraphrase. lexical.dialect_region — BCP-47 free_string (es-CL, es-AR, es-MX, …); designed for EYENET integration with INGEOTEC regional-spanish-models. lexical.evaluative_morphology_density — diminutive/augmentative/pejorative suffix density; stable per-author trait baked into language acquisition. lexical.optional_grammar_signature — SimHash over optional-grammar choice points (compound/simple past, subjunctive, leísmo, relative pronoun); high-reliability Spain vs LatAm discriminator. Also fixes stale scratchpad.md references throughout (README.md is now the authority), bumps behave-text to 0.1.3, and updates CHANGELOG. --- BEHAVE-TEXT/README.md | 34 +++++- BEHAVE-TEXT/attribution-recipes.md | 2 +- BEHAVE-TEXT/behave_text/spec/primitives.py | 130 ++++++++++++++++++++- BEHAVE-TEXT/pyproject.toml | 2 +- BEHAVE-TEXT/tests/test_primitives.py | 23 +++- CHANGELOG.md | 39 +++++++ 6 files changed, 215 insertions(+), 15 deletions(-) diff --git a/BEHAVE-TEXT/README.md b/BEHAVE-TEXT/README.md index 7ff928f..b0e6e3e 100644 --- a/BEHAVE-TEXT/README.md +++ b/BEHAVE-TEXT/README.md @@ -51,7 +51,7 @@ topic = event_topic_for("stylometric.capitalization_habit") | `Observation` | Registry-aware subclass of `behave_core.spec.Observation`. Validates `primitive` and `value` against `PRIMITIVE_REGISTRY`. | | `Window` | Re-exported from `behave_core`. | | `ObservationValue` | Re-exported union type. | -| `PRIMITIVE_REGISTRY` | `dict[str, ValueTypeSpec]` — the full primitive catalog (35 entries). | +| `PRIMITIVE_REGISTRY` | `dict[str, ValueTypeSpec]` — the full primitive catalog (47 entries). | | `ValueKind` | Enum: `CATEGORICAL`, `NUMERIC`, `HASH`, `ARRAY`, `FREE_STRING`, `BOOL`. | | `ValueTypeSpec` | Pydantic model: kind, allowed values, bounds, notes. | | `is_known(primitive)` | `bool` — whether a primitive path is registered. | @@ -64,11 +64,33 @@ present in `behave-shell` but not yet implemented here — `status: planned`. ## Primitives -35 primitives across 6 categories. +47 primitives across 7 categories. --- -### `stylometric.*` — Writing style fingerprints (12 primitives) +### `meta.*` — Corpus-snapshot footprint (8 primitives) + +Meta primitives describe the actor's presence in the corpus window itself — +how many messages, how long a span, how densely distributed. They are not +stylometric features; they are the scaffolding that other primitives assume. +Several primitives (notably `temporal_evolution.lifecycle_phase`) implicitly +depend on these quantities; `meta.*` makes them first-class so downstream +attribution engines can access and weight them explicitly. + +| Primitive | Kind | Description | +|---|---|---| +| `meta.total_messages` | numeric | Raw message count for this actor in the corpus snapshot. Anchor for `msg_per_day` and `fingerprint_confidence`. | +| `meta.corpus_span_days` | numeric | Wall-clock fractional days between first and last message. First-to-last only — blind to gaps. A 47-day span with 5 active days still yields 47. Recomputable from `first_seen_ts` / `last_seen_ts`. | +| `meta.msg_per_day` | numeric | `total_messages / corpus_span_days`. Separates bursty visitors (53 msgs / 0.3 days = 53/day) from long-tail lurkers (53 msgs / 47 days = 1.1/day). Undefined when span = 0; extractors emit null/omit rather than divide-by-zero. | +| `meta.active_days` | numeric | Distinct calendar days (UTC) with ≥1 message. Always ≤ `corpus_span_days`. Distinguishes a periodic visitor (span=47, active=3) from a near-daily regular (span=47, active=40). | +| `meta.activity_density` | numeric [0,1] | `active_days / corpus_span_days`. 1.0 = present every day of the window. Near-0 = appeared once or twice across a long window. Undefined when span = 0; emit null/omit for single-day actors. | +| `meta.first_seen_ts` | free_string | ISO 8601 timestamp (UTC offset) of the actor's earliest message. Anchors `corpus_span_days` in absolute time for cross-extraction comparison. | +| `meta.last_seen_ts` | free_string | ISO 8601 timestamp (UTC offset) of the actor's latest message. See `first_seen_ts`. | +| `meta.fingerprint_confidence` | categorical | Qualitative reliability of this actor's full fingerprint: `low`, `medium`, `high`. Attribution engines should weight all other observations by this before compositing. Derivation is **extractor-defined** — extractors declare their heuristic in the source label (e.g. `#confidence-v1`). | + +--- + +### `stylometric.*` — Writing style fingerprints (13 primitives) Stylometric primitives capture the unconscious writing habits that distinguish one author from another. The field goes back to the Mosteller-Wallace Federalist @@ -92,10 +114,11 @@ the Rutify corpus are noted inline where they affect interpretation. | `stylometric.function_word_distribution_top200` | hash | 64-bit SimHash over the 200 most common Spanish function words. The wider list reaches into the long tail (rare-but-individual words like `tampoco`, `aunque`, `mientras`) that carry more discriminating signal in short-message corpora. Not yet emitted by v0 prototype — populated in v0.2. | | `stylometric.character_ngram_simhash` | hash | 64-bit SimHash over character n-gram frequencies (default n=3), lowercased. Orthogonal to function-word distributions: captures punctuation tics, accent-stripping habits, typo patterns, and idiom fragments that survive paraphrase. Accents are preserved because accent-stripping is itself a stylistic tic. Source label declares n size (e.g. `#char3gram`). | | `stylometric.distinctive_vocabulary_signature` | hash | 64-bit SimHash over a TF-IDF-weighted top-K rare-word vector. Captures the author's distinctive lexicon — words they use that other authors in the same corpus do not. Complementary to function-word distributions: where `function_word_*` captures common-word style, this captures individual lexical choice. Requires the full corpus for IDF computation. Source label declares top-K and corpus tag (e.g. `#tfidf-top50`). | +| `stylometric.pos_ngram_signature` | hash | 64-bit SimHash over a POS n-gram (default bigram) frequency vector. Captures syntactic skeleton independent of vocabulary — an author can change every word and retain the same grammatical fingerprint. Orthogonal to character n-grams and function-word distributions. Tagger-dependent: source label must declare tagger, language model, and n (e.g. `#spacy-es_core_news_sm-bi`). Calibration note: chat-domain text produces tagger noise — weight low until validated on labelled chat corpora. | --- -### `lexical.*` — Vocabulary and linguistic patterns (8 primitives) +### `lexical.*` — Vocabulary and linguistic patterns (11 primitives) Lexical primitives characterize *what* and *how* an actor writes at the word and sentence level. Where stylometric primitives fingerprint unconscious micro-habits, @@ -112,6 +135,9 @@ how questions are formed, register. | `lexical.sentence_complexity_class` | categorical | Dominant clause structure. `simple` = single-clause. `compound` = two independent clauses joined by coordinating conjunctions (pero, y, o). `complex` = dependent clauses and subordination (aunque, porque, cuando). Reflects education level and cognitive investment. | | `lexical.question_formation_style` | categorical | How questions are formed. `punctuation_only` = question mark without interrogative words ('¿Cuánto?') — very common in Spanish chat. `lexical` = explicit interrogatives (¿qué, cómo, cuándo). `formal` = inverted subject-verb or formal register. | | `lexical.imperative_style` | categorical | How commands and requests are framed. `informal_directive` = tú/vos imperative (dame, hazlo). `formal_directive` = usted imperative (hágame el favor). `polite` = conditional/modal softening (¿podría...?). Stable per-author trait in hierarchical contexts. | +| `lexical.dialect_region` | free_string | Dominant regional variety of the actor's matrix language as a BCP-47 language-region tag (e.g. `es-CL`, `es-AR`, `es-MX`, `es-ES`, `en-US`). Detected from lexical marker density against per-region vocabulary tables. Emit literal `unknown` below confidence threshold. Detection method declared in source label (e.g. `#dialect-markers-v1`). Complementary to `code_switching_matrix_language`, which derives language via switching analysis rather than direct marker lookup. | +| `lexical.evaluative_morphology_density` | numeric [0,1] | Rate of evaluative morpheme tokens / total tokens. Covers Spanish diminutives (`-ito`/`-ita`), augmentatives (`-ón`/`-ote`), pejoratives (`-ejo`/`-ucho`), and intensives (`-azo`). Heavy diminutive use is characteristic of Mexican/Central American Spanish; River Plate speakers use them significantly less. Stable per-author — baked into language acquisition and hard to consciously suppress. Source label declares morpheme set and tool version (e.g. `#eval-morph-es-v1`). | +| `lexical.optional_grammar_signature` | hash | 64-bit SimHash over the author's preference probability vector at optional-grammar choice points. For Spanish: compound vs simple past (`he comido` vs `comí` — high-reliability Spain/LatAm discriminator), subjunctive usage rate, leísmo/laísmo/loísmo clitic patterns, and relative pronoun choice (`que` vs `el cual`). Each choice point is a scalar [0,1]; the SimHash is computed over the concatenated vector. Choice-point set is extractor-defined and declared in source label (e.g. `#optgrammar-es-v1`). Requires sufficient corpus volume for stable probabilities — gate on `meta.fingerprint_confidence` before use. | --- diff --git a/BEHAVE-TEXT/attribution-recipes.md b/BEHAVE-TEXT/attribution-recipes.md index 14836f5..5681a5d 100644 --- a/BEHAVE-TEXT/attribution-recipes.md +++ b/BEHAVE-TEXT/attribution-recipes.md @@ -2,7 +2,7 @@ # BEHAVE-TEXT Attribution Recipes -> **This document is not part of BEHAVE-TEXT.** BEHAVE-TEXT (`scratchpad.md`) defines the observation taxonomy and emission envelope. It does **not** assert who an actor is, link sessions, or assign profiles. Those are attribution-engine concerns. +> **This document is not part of BEHAVE-TEXT.** BEHAVE-TEXT (`README.md`) defines the observation taxonomy and emission envelope. It does **not** assert who an actor is, link sessions, or assign profiles. Those are attribution-engine concerns. > > This document is a **placeholder**. Recipes for the text domain wait for corpus calibration. The Rutify Telegram corpus (forthcoming) will be the labeling ground truth that drives the first concrete profiles. diff --git a/BEHAVE-TEXT/behave_text/spec/primitives.py b/BEHAVE-TEXT/behave_text/spec/primitives.py index 4435875..03b5cb9 100644 --- a/BEHAVE-TEXT/behave_text/spec/primitives.py +++ b/BEHAVE-TEXT/behave_text/spec/primitives.py @@ -16,7 +16,7 @@ PII discipline notice (carried over from behave-core's envelope module): IS text content. Sensors must hash/aggregate before emitting. Adding a new primitive is a deliberate registry edit. Drift between this file -and `scratchpad.md` is a bug; v0 keeps the registry hand-written so PR review +and `README.md` is a bug; v0 keeps the registry hand-written so PR review catches drift, v0.x may auto-extract from the markdown if drift becomes a maintenance issue. @@ -109,10 +109,71 @@ def _array(of: ValueKind, notes: Optional[str] = None) -> ValueTypeSpec: # ─── The registry ─────────────────────────────────────────────────────────── # -# 28 primitives across 4 layers. Mirrors scratchpad.md row-for-row. +# 47 primitives across 7 layers. Mirrors README.md row-for-row. PRIMITIVE_REGISTRY: dict[str, ValueTypeSpec] = { - # ── stylometric.* (motor analog — 8) ────────────────────────────────── + # ── meta.* (corpus-snapshot footprint — 8) ──────────────────────────── + "meta.total_messages": _num( + min_val=0.0, + notes="Raw message count for this actor in the corpus snapshot. Integer in " + "practice; stored as float for spec uniformity. Dependency anchor: " + "msg_per_day is derived from this; fingerprint_confidence is informed " + "by this. Emit before deriving rates.", + ), + "meta.corpus_span_days": _num( + min_val=0.0, + notes="Wall-clock duration in fractional days between the actor's earliest " + "and latest message in the corpus snapshot. First-to-last only — blind " + "to silence in between (a 47-day span with 5 active days still yields " + "47). Complement with active_days and activity_density to get presence " + "shape. Recomputable from first_seen_ts and last_seen_ts.", + ), + "meta.msg_per_day": _num( + min_val=0.0, + notes="total_messages / corpus_span_days. The key rate that separates a " + "bursty single-session visitor (53 msgs in 0.3 days → 53/day) from a " + "long-tail lurker (53 msgs in 47 days → 1.1/day). Undefined when " + "corpus_span_days = 0; extractors should emit null/omit rather than " + "divide-by-zero in that edge case.", + ), + "meta.active_days": _num( + min_val=0.0, + notes="Count of distinct calendar days (UTC) on which the actor sent at " + "least one message. Always ≤ corpus_span_days. An actor with span=47 " + "and active_days=3 is a periodic visitor who appears rarely; one with " + "span=47 and active_days=40 is a near-daily regular. Use alongside " + "activity_density for full presence shape.", + ), + "meta.activity_density": _num( + min_val=0.0, max_val=1.0, + notes="active_days / corpus_span_days. Single scalar capturing 'how filled " + "is the span?'. 1.0 = present every day of the window. Near-0 = " + "appeared once or twice across a long window. Undefined when " + "corpus_span_days = 0; emit null/omit for single-day actors.", + ), + "meta.first_seen_ts": _str( + notes="ISO 8601 timestamp (with UTC offset, e.g. '2025-11-03T14:22:07+00:00') " + "of the actor's earliest message in the corpus snapshot. Combined with " + "last_seen_ts, this anchors corpus_span_days in absolute time so " + "observations from different extractions can be compared temporally.", + ), + "meta.last_seen_ts": _str( + notes="ISO 8601 timestamp (with UTC offset, e.g. '2025-12-20T09:11:43+00:00') " + "of the actor's latest message in the corpus snapshot. See first_seen_ts.", + ), + "meta.fingerprint_confidence": _cat( + "low", "medium", "high", + notes="Qualitative reliability rating for this actor's full fingerprint. " + "An attribution engine should weight all other observations from this " + "actor proportionally to this value before compositing. Derivation is " + "EXTRACTOR-DEFINED — the registry specifies the semantic contract, not " + "the formula. Extractors must declare their heuristic in the source " + "label (e.g. '#confidence-v1'). Typical inputs: total_messages, " + "corpus_span_days, active_days, and any domain-specific thresholds " + "the extractor authors have calibrated.", + ), + + # ── stylometric.* (motor analog — 13) ───────────────────────────────── "stylometric.punctuation_style": _hash(notes="canonical punctuation-pattern fingerprint"), "stylometric.capitalization_habit": _cat( "lowercase", "proper", "random_caps", "mixed_i", @@ -200,8 +261,23 @@ PRIMITIVE_REGISTRY: dict[str, ValueTypeSpec] = { "computation, performed once per extraction. Source label declares the " "top-K size and corpus tag (e.g. `#tfidf-top50`).", ), + "stylometric.pos_ngram_signature": _hash( + notes="64-bit simhash over a POS n-gram (default bigram) frequency vector " + "from the author's text corpus. Captures syntactic skeleton independent " + "of vocabulary — an author can change every word they use and still " + "retain the same POS-bigram fingerprint. ORTHOGONAL to character_ngram " + "and function_word distributions: those capture surface form, this " + "captures grammatical structure. Example signal: consistent ADJ-NOUN vs " + "NOUN-ADJ ordering in Spanish, habitual ADV-VERB pre-position. " + "TAGGER-DEPENDENT: source label MUST declare the tagger, language model, " + "and n value (e.g. `#spacy-es_core_news_sm-bi` for spaCy Spanish " + "small model, bigrams). Calibration note: chat-domain text is noisy — " + "abbreviations, misspellings, and code-switching cause tagger errors " + "that introduce fingerprint noise. Engines should weight low until " + "calibrated against labelled chat corpora.", + ), - # ── lexical.* (cognitive analog — 8) ────────────────────────────────── + # ── lexical.* (cognitive analog — 11) ───────────────────────────────── "lexical.vocabulary_richness": _num( min_val=0.0, max_val=1.0, notes="Moving-Average Type-Token Ratio (MATTR) over a sliding window " @@ -242,6 +318,52 @@ PRIMITIVE_REGISTRY: dict[str, ValueTypeSpec] = { "market contexts where hierarchical and peer relationships are expressed " "through register choice.", ), + "lexical.dialect_region": _str( + notes="Dominant regional variety of the actor's matrix language, expressed as " + "a BCP-47 language-region tag (e.g. `es-CL`, `es-AR`, `es-MX`, `es-ES`, " + "`en-US`). Detected from lexical marker density against per-region " + "vocabulary tables; detection method and marker set version declared in " + "source label (e.g. `#dialect-markers-v1`). Emit the literal string " + "`unknown` when the extractor falls below its confidence threshold — do " + "not omit the observation, so downstream engines can distinguish " + "'undetected' from 'not extracted'. Language-agnostic in concept; the " + "marker vocabulary is language-specific. COMPLEMENTARY to " + "lexical.code_switching_matrix_language, which captures the dominant " + "language via switching analysis rather than direct regional-marker lookup.", + ), + "lexical.evaluative_morphology_density": _num( + min_val=0.0, max_val=1.0, + notes="Rate of evaluative morpheme tokens / total tokens. Evaluative morphology " + "encompasses suffixes that add expressive/emotional loading to a stem: " + "diminutives (`-ito`/`-ita`/`-cito`/`-cita` — affection, minimization, " + "softening), augmentatives (`-ón`/`-ona`/`-ote`/`-ota` — intensification), " + "pejoratives (`-ejo`/`-eja`/`-ucho`/`-ucha` — contempt), and intensives " + "(`-azo`/`-aza` — force or admiration by context). Heavy diminutive use " + "is characteristic of Mexican and Central American Spanish; River Plate " + "speakers use them significantly less. The density is stable per-author " + "and hard to consciously suppress — it is baked into language acquisition. " + "Language-agnostic in concept; detection (suffix rules or morphological " + "analyser) is language-specific. Source label declares the morpheme set " + "and tool version (e.g. `#eval-morph-es-v1`).", + ), + "lexical.optional_grammar_signature": _hash( + notes="64-bit simhash over a vector of the author's preference probabilities " + "at optional-grammar choice points — positions where the language offers " + "multiple grammatically correct options and individual authors make stable " + "idiosyncratic choices. For Spanish: compound past vs simple past ratio " + "(`he comido` vs `comí` — Spain strongly prefers compound for recent " + "actions; Latin America almost universally uses simple past, making this " + "a high-reliability Spain/LatAm discriminator), subjunctive usage rate " + "(avoidance correlates with informal register or non-native acquisition), " + "leísmo/laísmo/loísmo clitic patterns (`le vi` vs `lo vi` for masculine " + "accusative — leísmo is characteristic of Castilian Spain), and relative " + "pronoun choice (`que` vs `el cual/la cual` — register marker). Each " + "choice point is a scalar [0,1] probability; the simhash is computed over " + "the concatenated vector. EXTRACTOR-DEFINED: choice-point set declared in " + "source label (e.g. `#optgrammar-es-v1`). Requires sufficient corpus " + "volume for stable probability estimates — thin corpora produce noisy " + "hashes; engines should gate on meta.fingerprint_confidence before use.", + ), # ── temporal_evolution.* (lifecycle / change-over-time — 1) ─────────── "temporal_evolution.lifecycle_phase": _cat( diff --git a/BEHAVE-TEXT/pyproject.toml b/BEHAVE-TEXT/pyproject.toml index 214adf0..0966503 100644 --- a/BEHAVE-TEXT/pyproject.toml +++ b/BEHAVE-TEXT/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta" [project] name = "behave-text" -version = "0.1.1" +version = "0.1.3" description = "BEHAVE-TEXT — text/messaging-domain behavioral observation registry, layered on behave-core" readme = "README.md" requires-python = ">=3.11" diff --git a/BEHAVE-TEXT/tests/test_primitives.py b/BEHAVE-TEXT/tests/test_primitives.py index cea4cf9..7d8f2ba 100644 --- a/BEHAVE-TEXT/tests/test_primitives.py +++ b/BEHAVE-TEXT/tests/test_primitives.py @@ -1,9 +1,9 @@ # SPDX-License-Identifier: GPL-3.0-or-later """Registry coverage tests for BEHAVE-TEXT. -Asserts that every primitive listed in scratchpad.md's tables has exactly one +Asserts that every primitive listed in README.md's tables has exactly one entry in PRIMITIVE_REGISTRY. Drift-detector — failing this test means -scratchpad.md and the registry have diverged. +README.md and the registry have diverged. """ from __future__ import annotations @@ -13,9 +13,18 @@ from pathlib import Path from behave_text.spec import PRIMITIVE_REGISTRY, ValueKind -# Primitive paths expected by scratchpad.md (hand-extracted; v0). +# Primitive paths expected by README.md (hand-extracted; v0). EXPECTED_PRIMITIVES = { - # stylometric.* (motor analog — 8) + # meta.* (corpus-snapshot footprint — 8) + "meta.total_messages", + "meta.corpus_span_days", + "meta.msg_per_day", + "meta.active_days", + "meta.activity_density", + "meta.first_seen_ts", + "meta.last_seen_ts", + "meta.fingerprint_confidence", + # stylometric.* (motor analog — 13) "stylometric.punctuation_style", "stylometric.capitalization_habit", "stylometric.emoji_usage", @@ -28,7 +37,8 @@ EXPECTED_PRIMITIVES = { "stylometric.function_word_distribution_top200", "stylometric.character_ngram_simhash", "stylometric.distinctive_vocabulary_signature", - # lexical.* (cognitive analog — 8) + "stylometric.pos_ngram_signature", + # lexical.* (cognitive analog — 11) "lexical.vocabulary_richness", "lexical.slang_density", "lexical.code_switching_rate", @@ -37,6 +47,9 @@ EXPECTED_PRIMITIVES = { "lexical.sentence_complexity_class", "lexical.question_formation_style", "lexical.imperative_style", + "lexical.dialect_region", + "lexical.evaluative_morphology_density", + "lexical.optional_grammar_signature", # temporal_evolution.* (lifecycle/change-over-time — 1, added v0.2) "temporal_evolution.lifecycle_phase", # network.* (governance/role-shape — 2, added v0.3) diff --git a/CHANGELOG.md b/CHANGELOG.md index 5b3fbb0..b30a899 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -6,6 +6,45 @@ Versions follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html). --- +## [behave-text 0.1.3] — 2026-05-23 + +### behave-text + +#### Added +- `stylometric.pos_ngram_signature` — 64-bit SimHash over POS n-gram (default bigram) + frequency vector. Captures syntactic skeleton independent of vocabulary. Tagger-dependent; + source label must declare tagger + model + n. Calibration note: noisy on chat-domain text, + weight low until validated. +- `lexical.dialect_region` — BCP-47 language-region free_string (`es-CL`, `es-AR`, `es-MX`, + `es-ES`, `en-US`, etc.) for the actor's dominant regional variety, detected from lexical + marker density. Emit `unknown` below confidence threshold. Designed for EYENET integration + with INGEOTEC `regional-spanish-models` vocabulary tables (MIT). +- `lexical.evaluative_morphology_density` — numeric [0,1] rate of evaluative morpheme tokens + (diminutives, augmentatives, pejoratives, intensives) per total tokens. Stable per-author + trait baked into language acquisition; strong Spain/LatAm regional discriminator. +- `lexical.optional_grammar_signature` — 64-bit SimHash over author preference probabilities + at optional-grammar choice points (for Spanish: compound vs simple past, subjunctive usage, + leísmo/laísmo/loísmo, relative pronoun choice). Choice-point set is extractor-defined and + declared in source label. + +--- + +## [behave-text 0.1.2] — 2026-05-23 + +### behave-text + +#### Added +- `meta.*` layer — 8 new corpus-snapshot primitives: `total_messages`, `corpus_span_days`, + `msg_per_day`, `active_days`, `activity_density`, `first_seen_ts`, `last_seen_ts`, + `fingerprint_confidence`. Fills the gap between actors with identical message counts but + radically different presence shapes (bursty single-session vs long-tail lurker). + +#### Fixed +- Stale `scratchpad.md` references in `primitives.py` docstring, `tests/test_primitives.py` + docstring, and `attribution-recipes.md` — `README.md` is now the authority. + +--- + ## [0.1.0] — 2026-05-17 Initial public release of all three packages.