From 0d18a3e30d08965395bee693cb89d9ee47b2defb Mon Sep 17 00:00:00 2001
From: anti
Date: Fri, 29 May 2026 18:13:41 -0400
Subject: [PATCH] modified: readme

---
 BEHAVE-TEXT/README.md | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/BEHAVE-TEXT/README.md b/BEHAVE-TEXT/README.md
index b0e6e3e..ef4aaf8 100644
--- a/BEHAVE-TEXT/README.md
+++ b/BEHAVE-TEXT/README.md
@@ -197,6 +197,49 @@ choose to weight these at zero until field-validated against labeled data.
 
 ---
 
+## Engine implementation notes
+
+### Cross-alphabet fingerprint comparisons are undefined
+
+Several primitives produce fingerprints by hashing character or
+syntactic sequences:
+
+- `stylometric.character_ngram_simhash`
+- `stylometric.pos_ngram_signature`
+- `stylometric.function_word_distribution_top50` / `top200`
+- `stylometric.distinctive_vocabulary_signature`
+- `lexical.optional_grammar_signature`
+
+These fingerprints are **meaningful only within a single writing system**.
+A Latin-script actor (Spanish, English, French, Portuguese) and a Cyrillic-script
+actor (Russian, Bulgarian, Serbian) share essentially no alphabetic character n-grams.
+Comparing their `character_ngram_simhash` values produces a Hamming distance that
+is numerically valid but semantically undefined: it measures incomparability,
+not dissimilarity.
+
+The same applies to any other script boundary: Arabic vs Latin, Hangul vs Hiragana,
+Devanagari vs Cyrillic, and so on.
+
+**Engine rule:** before compositing or comparing any hash-based fingerprint between
+two actors, gate on script/language compatibility. Use `lexical.dialect_region` or
+`lexical.code_switching_matrix_language` to determine whether two actors share a
+writing system. If they do not, treat the fingerprint distance as `undefined` rather
+than as evidence of dissimilarity, and exclude it from the similarity composite.
+
+Primitives that are **not** subject to this constraint (safe to compare across
+writing systems without gating):
+
+- All `meta.*` primitives: corpus-footprint metrics are script-agnostic.
+- All `interaction.*` primitives: timing and graph-structure signals are
+  script-agnostic.
+- `stylometric.emoji_usage`, `stylometric.emoji_placement`: Unicode emoji are
+  shared across scripts.
+- `stylometric.capitalization_habit`: only meaningful within scripts that have
+  case; emit `unknown` for caseless scripts (Arabic, CJK, etc.).
+- `network.*`, `temporal_evolution.*`: structural signals, script-agnostic.
+
+---
+
 ## Schema
 
 Machine-readable JSON Schema:
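
The engine rule added by this patch (gate hash-based fingerprint comparisons on a shared writing system, and leave undefined distances out of the composite) could be sketched in Python roughly as follows. Everything here is an assumption for illustration: the README does not specify the output format of `lexical.code_switching_matrix_language`, the fingerprint width, or how the composite is weighted, so the language codes, the `SCRIPT_BY_LANGUAGE` table, the 64-bit width, and all function names are hypothetical.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical lookup from an assumed ISO 639-1 matrix-language code to a
# writing system. The README does not define this mapping.
SCRIPT_BY_LANGUAGE = {
    "en": "Latin", "es": "Latin", "fr": "Latin", "pt": "Latin",
    "ru": "Cyrillic", "bg": "Cyrillic",
    "ar": "Arabic", "ko": "Hangul", "hi": "Devanagari",
}


def shared_script(lang_a: str, lang_b: str) -> Optional[str]:
    """Return the writing system both languages use, else None."""
    script_a = SCRIPT_BY_LANGUAGE.get(lang_a)
    script_b = SCRIPT_BY_LANGUAGE.get(lang_b)
    if script_a is None or script_a != script_b:
        return None  # unknown or different scripts: not comparable
    return script_a


def gated_simhash_distance(
    lang_a: str, hash_a: int, lang_b: str, hash_b: int, bits: int = 64
) -> Optional[float]:
    """Normalized Hamming distance in [0, 1], or None when undefined.

    Assumes each fingerprint is a `bits`-wide SimHash stored as an int.
    """
    if shared_script(lang_a, lang_b) is None:
        return None  # incomparable, which is not the same as dissimilar
    differing_bits = bin(hash_a ^ hash_b).count("1")
    return differing_bits / bits


def composite_similarity(
    components: Iterable[Tuple[Optional[float], float]]
) -> Optional[float]:
    """Weighted mean of (similarity, weight) pairs, skipping undefined ones.

    Weights are renormalized over the components that remain, so a gated-out
    fingerprint neither raises nor lowers the composite.
    """
    defined = [(s, w) for s, w in components if s is not None and w > 0]
    total_weight = sum(w for _, w in defined)
    if total_weight == 0:
        return None  # nothing comparable between the two actors
    return sum(s * w for s, w in defined) / total_weight
```

Returning `None` instead of a maximal distance keeps "incomparable" distinct from "dissimilar". Renormalizing over the remaining weights is one possible design choice, not something the README prescribes.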