modified: readme

2026-05-29 18:13:41 -04:00
parent b182e2fe3b
commit 0d18a3e30d


@@ -197,6 +197,49 @@ choose to weight these at zero until field-validated against labeled data.
---
## Engine implementation notes
### Cross-alphabet fingerprint comparisons are undefined
Several primitives produce hash-based fingerprints computed over character or
syntactic sequences:
- `stylometric.character_ngram_simhash`
- `stylometric.pos_ngram_signature`
- `stylometric.function_word_distribution_top50` / `top200`
- `stylometric.distinctive_vocabulary_signature`
- `lexical.optional_grammar_signature`
These fingerprints are **meaningful only within a single writing system**.
A Latin-script actor (Spanish, English, French, Portuguese) and a Cyrillic-script
actor (Russian, Bulgarian, Serbian written in Cyrillic) share essentially no
alphabetic character n-grams; any overlap comes from script-neutral characters
such as digits, punctuation, whitespace, or emoji.
Comparing their `character_ngram_simhash` values produces a Hamming distance that
is numerically valid but semantically undefined: it measures incomparability, not
dissimilarity.

The same applies to any other script boundary: Arabic vs Latin, Hangul vs Hiragana,
Devanagari vs Cyrillic, and so on.
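
To make the point concrete, here is a minimal sketch of a toy SimHash over character trigrams. It is not the engine's `character_ngram_simhash`; the helper names (`char_ngrams`, `simhash`, `hamming`), the trigram size, and the 64-bit width are assumptions chosen for illustration. The Latin and Cyrillic samples share no trigrams at all, yet the Hamming distance is still an ordinary integer that looks like a measurement.

```python
import hashlib


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of lowercased character n-grams in `text`."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def simhash(text: str, n: int = 3, bits: int = 64) -> int:
    """Toy SimHash over character n-grams (illustrative, not the engine's)."""
    weights = [0] * bits
    for gram in char_ngrams(text, n):
        digest = hashlib.sha256(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for bit in range(bits):
            # Each n-gram votes +1 or -1 on every bit position.
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(bits) if weights[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


latin = "the quick brown fox jumps over the lazy dog"
cyrillic = "съешь ещё этих мягких французских булок"

shared = char_ngrams(latin) & char_ngrams(cyrillic)  # empty set
distance = hamming(simhash(latin), simhash(cyrillic))  # some value in 0..64
```

Because the two inputs have nothing in common, `distance` reflects only how the hash bits happened to fall; it carries no information about how similar the two authors are.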
**Engine rule:** before compositing or comparing any hash-based fingerprint between
two actors, gate on script/language compatibility. Use `lexical.dialect_region` or
`lexical.code_switching_matrix_language` to determine whether two actors share a
writing system. If they do not, treat the fingerprint distance as `undefined` rather
than as evidence of dissimilarity, and exclude it from the similarity composite.
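
One way the engine rule could look in code is sketched below. Everything here is hypothetical: the `SCRIPT_OF` table stands in for whatever `lexical.dialect_region` or `lexical.code_switching_matrix_language` actually return, and `gated_distance` and `composite` are illustrative names, not engine APIs. The sketch also assumes that an unknown script should be treated the same as a mismatch.

```python
from typing import Optional

# Hypothetical language -> script lookup. A real engine would derive this from
# lexical.dialect_region or lexical.code_switching_matrix_language.
SCRIPT_OF = {
    "en": "Latin", "es": "Latin", "fr": "Latin", "pt": "Latin",
    "ru": "Cyrillic", "bg": "Cyrillic",
}


def gated_distance(hash_a: int, hash_b: int,
                   lang_a: str, lang_b: str,
                   bits: int = 64) -> Optional[float]:
    """Normalized Hamming distance, or None (undefined) across scripts."""
    script_a = SCRIPT_OF.get(lang_a)
    script_b = SCRIPT_OF.get(lang_b)
    # Assumption: an unknown script is handled like a script mismatch.
    if script_a is None or script_b is None or script_a != script_b:
        return None
    return bin(hash_a ^ hash_b).count("1") / bits


def composite(distances: dict[str, Optional[float]]) -> Optional[float]:
    """Average similarity over defined components only."""
    defined = [d for d in distances.values() if d is not None]
    if not defined:
        return None  # nothing comparable, so the composite is undefined too
    return 1.0 - sum(defined) / len(defined)
```

Returning `None` instead of a maximal distance keeps an undefined comparison from being averaged in as if it were strong evidence that two actors differ.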
Primitives that are **not** subject to this constraint (safe to compare across
writing systems without gating):
- All `meta.*` primitives: corpus-footprint metrics are script-agnostic.
- All `interaction.*` primitives: timing and graph-structure signals are
script-agnostic.
- `stylometric.emoji_usage`, `stylometric.emoji_placement`: Unicode emoji are
shared across scripts.
- `stylometric.capitalization_habit`: needs no script gate, but the value is
meaningful only for scripts that have case (such as Latin and Cyrillic); emit
`unknown` for caseless scripts (Arabic, CJK, etc.).
- `network.*`, `temporal_evolution.*`: structural signals, script-agnostic.
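
The `unknown` case for `stylometric.capitalization_habit` can be sketched as follows. This is an illustration under stated assumptions: the function name mirrors the primitive but is not its implementation, and the labels other than `unknown` (`all_lower`, `all_upper`, `mixed`) are invented for the example.

```python
def capitalization_habit(text: str) -> str:
    """Classify casing; return "unknown" when the text has no cased letters."""
    # str.isupper()/str.islower() are False for caseless letters such as
    # Arabic or CJK characters, so those never enter `cased`.
    cased = [ch for ch in text if ch.isupper() or ch.islower()]
    if not cased:
        return "unknown"
    upper_ratio = sum(ch.isupper() for ch in cased) / len(cased)
    if upper_ratio == 0:
        return "all_lower"  # hypothetical label
    if upper_ratio == 1:
        return "all_upper"  # hypothetical label
    return "mixed"          # hypothetical label
```

Latin and Cyrillic text both produce a comparable label here, while Arabic or CJK text yields `unknown` and can be skipped by the caller.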
---
## Schema
Machine-readable JSON Schema: