feat(text): add meta.* corpus-footprint layer and 4 language-aware primitives (v0.1.3)
Adds 12 new primitives across two waves of spec work this session.
meta.* layer (8 primitives) — corpus-snapshot footprint:
total_messages, corpus_span_days, msg_per_day, active_days,
activity_density, first_seen_ts, last_seen_ts, fingerprint_confidence.
Motivated by two actors with identical message counts (53 each) producing
indistinguishable profiles despite radically different presence shapes
(0.3-day burst vs 47-day long tail).
Language-aware characterization primitives (4 primitives):
stylometric.pos_ngram_signature — SimHash over POS bigram frequency vector;
syntactic skeleton fingerprint that survives full vocabulary paraphrase.
lexical.dialect_region — BCP-47 free_string (es-CL, es-AR, es-MX, …);
designed for EYENET integration with INGEOTEC regional-spanish-models.
lexical.evaluative_morphology_density — diminutive/augmentative/pejorative
suffix density; stable per-author trait baked into language acquisition.
lexical.optional_grammar_signature — SimHash over optional-grammar choice
points (compound/simple past, subjunctive, leísmo, relative pronoun);
high-reliability Spain vs LatAm discriminator.
Also fixes stale scratchpad.md references throughout (README.md is now the
authority), bumps behave-text to 0.1.3, and updates CHANGELOG.
This commit is contained in:
@@ -16,7 +16,7 @@ PII discipline notice (carried over from behave-core's envelope module):
|
||||
IS text content. Sensors must hash/aggregate before emitting.
|
||||
|
||||
Adding a new primitive is a deliberate registry edit. Drift between this file
|
||||
and `scratchpad.md` is a bug; v0 keeps the registry hand-written so PR review
|
||||
and `README.md` is a bug; v0 keeps the registry hand-written so PR review
|
||||
catches drift, v0.x may auto-extract from the markdown if drift becomes a
|
||||
maintenance issue.
|
||||
|
||||
@@ -109,10 +109,71 @@ def _array(of: ValueKind, notes: Optional[str] = None) -> ValueTypeSpec:
|
||||
|
||||
# ─── The registry ───────────────────────────────────────────────────────────
|
||||
#
|
||||
# 28 primitives across 4 layers. Mirrors scratchpad.md row-for-row.
|
||||
# 47 primitives across 7 layers. Mirrors README.md row-for-row.
|
||||
|
||||
PRIMITIVE_REGISTRY: dict[str, ValueTypeSpec] = {
|
||||
# ── stylometric.* (motor analog — 8) ──────────────────────────────────
|
||||
# ── meta.* (corpus-snapshot footprint — 8) ────────────────────────────
|
||||
"meta.total_messages": _num(
|
||||
min_val=0.0,
|
||||
notes="Raw message count for this actor in the corpus snapshot. Integer in "
|
||||
"practice; stored as float for spec uniformity. Dependency anchor: "
|
||||
"msg_per_day is derived from this; fingerprint_confidence is informed "
|
||||
"by this. Emit before deriving rates.",
|
||||
),
|
||||
"meta.corpus_span_days": _num(
|
||||
min_val=0.0,
|
||||
notes="Wall-clock duration in fractional days between the actor's earliest "
|
||||
"and latest message in the corpus snapshot. First-to-last only — blind "
|
||||
"to silence in between (a 47-day span with 5 active days still yields "
|
||||
"47). Complement with active_days and activity_density to get presence "
|
||||
"shape. Recomputable from first_seen_ts and last_seen_ts.",
|
||||
),
|
||||
"meta.msg_per_day": _num(
|
||||
min_val=0.0,
|
||||
notes="total_messages / corpus_span_days. The key rate that separates a "
|
||||
"bursty single-session visitor (53 msgs in 0.3 days → 53/day) from a "
|
||||
"long-tail lurker (53 msgs in 47 days → 1.1/day). Undefined when "
|
||||
"corpus_span_days = 0; extractors should emit null/omit rather than "
|
||||
"divide-by-zero in that edge case.",
|
||||
),
|
||||
"meta.active_days": _num(
|
||||
min_val=0.0,
|
||||
notes="Count of distinct calendar days (UTC) on which the actor sent at "
|
||||
"least one message. Always ≤ corpus_span_days. An actor with span=47 "
|
||||
"and active_days=3 is a periodic visitor who appears rarely; one with "
|
||||
"span=47 and active_days=40 is a near-daily regular. Use alongside "
|
||||
"activity_density for full presence shape.",
|
||||
),
|
||||
"meta.activity_density": _num(
|
||||
min_val=0.0, max_val=1.0,
|
||||
notes="active_days / corpus_span_days. Single scalar capturing 'how filled "
|
||||
"is the span?'. 1.0 = present every day of the window. Near-0 = "
|
||||
"appeared once or twice across a long window. Undefined when "
|
||||
"corpus_span_days = 0; emit null/omit for single-day actors.",
|
||||
),
|
||||
"meta.first_seen_ts": _str(
|
||||
notes="ISO 8601 timestamp (with UTC offset, e.g. '2025-11-03T14:22:07+00:00') "
|
||||
"of the actor's earliest message in the corpus snapshot. Combined with "
|
||||
"last_seen_ts, this anchors corpus_span_days in absolute time so "
|
||||
"observations from different extractions can be compared temporally.",
|
||||
),
|
||||
"meta.last_seen_ts": _str(
|
||||
notes="ISO 8601 timestamp (with UTC offset, e.g. '2025-12-20T09:11:43+00:00') "
|
||||
"of the actor's latest message in the corpus snapshot. See first_seen_ts.",
|
||||
),
|
||||
"meta.fingerprint_confidence": _cat(
|
||||
"low", "medium", "high",
|
||||
notes="Qualitative reliability rating for this actor's full fingerprint. "
|
||||
"An attribution engine should weight all other observations from this "
|
||||
"actor proportionally to this value before compositing. Derivation is "
|
||||
"EXTRACTOR-DEFINED — the registry specifies the semantic contract, not "
|
||||
"the formula. Extractors must declare their heuristic in the source "
|
||||
"label (e.g. '#confidence-v1'). Typical inputs: total_messages, "
|
||||
"corpus_span_days, active_days, and any domain-specific thresholds "
|
||||
"the extractor authors have calibrated.",
|
||||
),
|
||||
|
||||
# ── stylometric.* (motor analog — 13) ─────────────────────────────────
|
||||
"stylometric.punctuation_style": _hash(notes="canonical punctuation-pattern fingerprint"),
|
||||
"stylometric.capitalization_habit": _cat(
|
||||
"lowercase", "proper", "random_caps", "mixed_i",
|
||||
@@ -200,8 +261,23 @@ PRIMITIVE_REGISTRY: dict[str, ValueTypeSpec] = {
|
||||
"computation, performed once per extraction. Source label declares the "
|
||||
"top-K size and corpus tag (e.g. `#tfidf-top50`).",
|
||||
),
|
||||
"stylometric.pos_ngram_signature": _hash(
|
||||
notes="64-bit simhash over a POS n-gram (default bigram) frequency vector "
|
||||
"from the author's text corpus. Captures syntactic skeleton independent "
|
||||
"of vocabulary — an author can change every word they use and still "
|
||||
"retain the same POS-bigram fingerprint. ORTHOGONAL to character_ngram "
|
||||
"and function_word distributions: those capture surface form, this "
|
||||
"captures grammatical structure. Example signal: consistent ADJ-NOUN vs "
|
||||
"NOUN-ADJ ordering in Spanish, habitual ADV-VERB pre-position. "
|
||||
"TAGGER-DEPENDENT: source label MUST declare the tagger, language model, "
|
||||
"and n value (e.g. `#spacy-es_core_news_sm-bi` for spaCy Spanish "
|
||||
"small model, bigrams). Calibration note: chat-domain text is noisy — "
|
||||
"abbreviations, misspellings, and code-switching cause tagger errors "
|
||||
"that introduce fingerprint noise. Engines should weight low until "
|
||||
"calibrated against labelled chat corpora.",
|
||||
),
|
||||
|
||||
# ── lexical.* (cognitive analog — 8) ──────────────────────────────────
|
||||
# ── lexical.* (cognitive analog — 11) ─────────────────────────────────
|
||||
"lexical.vocabulary_richness": _num(
|
||||
min_val=0.0, max_val=1.0,
|
||||
notes="Moving-Average Type-Token Ratio (MATTR) over a sliding window "
|
||||
@@ -242,6 +318,52 @@ PRIMITIVE_REGISTRY: dict[str, ValueTypeSpec] = {
|
||||
"market contexts where hierarchical and peer relationships are expressed "
|
||||
"through register choice.",
|
||||
),
|
||||
"lexical.dialect_region": _str(
|
||||
notes="Dominant regional variety of the actor's matrix language, expressed as "
|
||||
"a BCP-47 language-region tag (e.g. `es-CL`, `es-AR`, `es-MX`, `es-ES`, "
|
||||
"`en-US`). Detected from lexical marker density against per-region "
|
||||
"vocabulary tables; detection method and marker set version declared in "
|
||||
"source label (e.g. `#dialect-markers-v1`). Emit the literal string "
|
||||
"`unknown` when the extractor falls below its confidence threshold — do "
|
||||
"not omit the observation, so downstream engines can distinguish "
|
||||
"'undetected' from 'not extracted'. Language-agnostic in concept; the "
|
||||
"marker vocabulary is language-specific. COMPLEMENTARY to "
|
||||
"lexical.code_switching_matrix_language, which captures the dominant "
|
||||
"language via switching analysis rather than direct regional-marker lookup.",
|
||||
),
|
||||
"lexical.evaluative_morphology_density": _num(
|
||||
min_val=0.0, max_val=1.0,
|
||||
notes="Rate of evaluative morpheme tokens / total tokens. Evaluative morphology "
|
||||
"encompasses suffixes that add expressive/emotional loading to a stem: "
|
||||
"diminutives (`-ito`/`-ita`/`-cito`/`-cita` — affection, minimization, "
|
||||
"softening), augmentatives (`-ón`/`-ona`/`-ote`/`-ota` — intensification), "
|
||||
"pejoratives (`-ejo`/`-eja`/`-ucho`/`-ucha` — contempt), and intensives "
|
||||
"(`-azo`/`-aza` — force or admiration by context). Heavy diminutive use "
|
||||
"is characteristic of Mexican and Central American Spanish; River Plate "
|
||||
"speakers use them significantly less. The density is stable per-author "
|
||||
"and hard to consciously suppress — it is baked into language acquisition. "
|
||||
"Language-agnostic in concept; detection (suffix rules or morphological "
|
||||
"analyser) is language-specific. Source label declares the morpheme set "
|
||||
"and tool version (e.g. `#eval-morph-es-v1`).",
|
||||
),
|
||||
"lexical.optional_grammar_signature": _hash(
|
||||
notes="64-bit simhash over a vector of the author's preference probabilities "
|
||||
"at optional-grammar choice points — positions where the language offers "
|
||||
"multiple grammatically correct options and individual authors make stable "
|
||||
"idiosyncratic choices. For Spanish: compound past vs simple past ratio "
|
||||
"(`he comido` vs `comí` — Spain strongly prefers compound for recent "
|
||||
"actions; Latin America almost universally uses simple past, making this "
|
||||
"a high-reliability Spain/LatAm discriminator), subjunctive usage rate "
|
||||
"(avoidance correlates with informal register or non-native acquisition), "
|
||||
"leísmo/laísmo/loísmo clitic patterns (`le vi` vs `lo vi` for masculine "
|
||||
"accusative — leísmo is characteristic of Castilian Spain), and relative "
|
||||
"pronoun choice (`que` vs `el cual/la cual` — register marker). Each "
|
||||
"choice point is a scalar [0,1] probability; the simhash is computed over "
|
||||
"the concatenated vector. EXTRACTOR-DEFINED: choice-point set declared in "
|
||||
"source label (e.g. `#optgrammar-es-v1`). Requires sufficient corpus "
|
||||
"volume for stable probability estimates — thin corpora produce noisy "
|
||||
"hashes; engines should gate on meta.fingerprint_confidence before use.",
|
||||
),
|
||||
|
||||
# ── temporal_evolution.* (lifecycle / change-over-time — 1) ───────────
|
||||
"temporal_evolution.lifecycle_phase": _cat(
|
||||
|
||||
Reference in New Issue
Block a user