modified: readme
@@ -197,6 +197,49 @@ choose to weight these at zero until field-validated against labeled data.

---

## Engine implementation notes

### Cross-alphabet fingerprint comparisons are undefined

Several primitives produce fingerprints by hashing character or syntactic sequences:

- `stylometric.character_ngram_simhash`
- `stylometric.pos_ngram_signature`
- `stylometric.function_word_distribution_top50` / `top200`
- `stylometric.distinctive_vocabulary_signature`
- `lexical.optional_grammar_signature`

These fingerprints are **meaningful only within a single writing system**.
A Latin-script actor (Spanish, English, French, Portuguese) and a Cyrillic-script
actor (Russian, Bulgarian, Serbian) share essentially no character n-grams; only
digits, punctuation, and whitespace overlap.
Comparing their `character_ngram_simhash` values produces a Hamming distance that
is numerically valid but semantically undefined: it measures incomparability, not
dissimilarity.

The same applies to any other script boundary: Arabic vs Latin, Hangul vs Hiragana,
Devanagari vs Cyrillic, and so on.

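To see why the distance is well-formed but uninformative, here is a minimal sketch of a 64-bit SimHash over character trigrams. The helpers `char_ngrams`, `simhash64`, and `hamming`, the trigram size, and the choice of BLAKE2b are illustrative assumptions, not the engine's actual implementation of `stylometric.character_ngram_simhash`.

```python
import hashlib


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash64(text, n=3):
    """64-bit SimHash over character n-grams (illustrative only)."""
    counts = [0] * 64
    for gram in char_ngrams(text.lower(), n):
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # Each n-gram votes +1 or -1 on every bit position.
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    # A bit is set in the fingerprint when its votes are net positive.
    return sum(1 << bit for bit in range(64) if counts[bit] > 0)


def hamming(a, b):
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


latin_text = "the quick brown fox jumps over the lazy dog"
cyrillic_text = "съешь же ещё этих мягких французских булок"

latin = simhash64(latin_text)
cyrillic = simhash64(cyrillic_text)

shared = set(char_ngrams(latin_text)) & set(char_ngrams(cyrillic_text))
print(len(shared))                # 0: no trigram appears in both texts
print(hamming(latin, latin))      # 0: identical input, identical fingerprint
print(hamming(latin, cyrillic))   # an integer in 0..64 with no shared n-grams behind it
```

The last value is a valid Hamming distance, yet it is computed from two disjoint n-gram sets, so it says nothing about how similar the two writers are.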
**Engine rule:** before compositing or comparing any hash-based fingerprint between
two actors, gate on script/language compatibility. Use `lexical.dialect_region` or
`lexical.code_switching_matrix_language` to determine whether two actors share a
writing system. If they do not, treat the fingerprint distance as `undefined` rather
than as evidence of dissimilarity, and exclude it from the similarity composite.

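The rule above can be sketched as a gate that returns `None` for an undefined distance, plus a composite that skips undefined entries. Everything here is hypothetical: the `script` field (assumed to be derived upstream from `lexical.dialect_region` or `lexical.code_switching_matrix_language`), the function names, and the choice to renormalize the remaining weights are assumptions for illustration, not the engine's API.

```python
from typing import Optional


def gated_hamming(actor_a: dict, actor_b: dict, key: str) -> Optional[int]:
    """Hamming distance between two fingerprints, or None when undefined."""
    # "script" is a hypothetical field holding each actor's writing system.
    if actor_a["script"] != actor_b["script"]:
        return None  # undefined, not "maximally dissimilar"
    return bin(actor_a[key] ^ actor_b[key]).count("1")


def composite_similarity(distances: dict, weights: dict, bits: int = 64) -> Optional[float]:
    """Weighted mean similarity over the defined distances only."""
    defined = {k: d for k, d in distances.items() if d is not None}
    total = sum(weights[k] for k in defined)
    if total == 0:
        return None  # nothing comparable between these two actors
    # Assumption: weights of excluded fingerprints are dropped and the rest renormalized.
    return sum(weights[k] * (1 - d / bits) for k, d in defined.items()) / total


key = "character_ngram_simhash"
es = {"script": "Latin", key: 0b1011}
en = {"script": "Latin", key: 0b0011}
ru = {"script": "Cyrillic", key: 0b1000}

print(gated_hamming(es, en, key))  # 1
print(gated_hamming(es, ru, key))  # None

score = composite_similarity({"a": 16, "b": None}, {"a": 2.0, "b": 1.0})
print(score)  # 0.75
```

Returning `None` rather than a large number keeps an incomparable pair from being scored as a dissimilar pair further down the pipeline.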
Primitives that are **not** subject to this constraint (safe to compare across
writing systems without gating):

- All `meta.*` primitives: corpus-footprint metrics are script-agnostic.
- All `interaction.*` primitives: timing and graph-structure signals are
  script-agnostic.
- `stylometric.emoji_usage`, `stylometric.emoji_placement`: Unicode emoji are
  shared across scripts.
- `stylometric.capitalization_habit`: needs no script gate, but is only meaningful
  within scripts that have case; emit `unknown` for caseless scripts (Arabic, CJK, etc.).
- `network.*`, `temporal_evolution.*`: structural signals, script-agnostic.

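One way to encode the two lists in this section is a small lookup the engine could consult before comparing a primitive. This is a sketch under assumptions: `needs_script_gate` and `has_case` are hypothetical helpers, `top200` is expanded to a full name by analogy with `top50`, and primitives not mentioned in this section return `None` because the text does not classify them.

```python
from typing import Optional

# Hash-based fingerprints listed above as requiring a script gate.
GATED = {
    "stylometric.character_ngram_simhash",
    "stylometric.pos_ngram_signature",
    "stylometric.function_word_distribution_top50",
    "stylometric.function_word_distribution_top200",
    "stylometric.distinctive_vocabulary_signature",
    "lexical.optional_grammar_signature",
}

# Families and individual primitives listed above as script-agnostic.
UNGATED_PREFIXES = ("meta.", "interaction.", "network.", "temporal_evolution.")
UNGATED_NAMES = {
    "stylometric.emoji_usage",
    "stylometric.emoji_placement",
    "stylometric.capitalization_habit",
}


def needs_script_gate(primitive: str) -> Optional[bool]:
    """True if gated, False if script-agnostic, None if not classified here."""
    if primitive in GATED:
        return True
    if primitive in UNGATED_NAMES or primitive.startswith(UNGATED_PREFIXES):
        return False
    return None


def has_case(text: str) -> bool:
    """True if any character has distinct upper and lower forms."""
    return any(ch.lower() != ch.upper() for ch in text)


print(needs_script_gate("stylometric.character_ngram_simhash"))  # True
print(needs_script_gate("stylometric.emoji_usage"))              # False
print(has_case("Hello"))   # True
print(has_case("مرحبا"))   # False: Arabic has no case, so emit `unknown`
```

A check like `has_case` is one possible trigger for emitting `unknown` from `stylometric.capitalization_habit` on caseless input.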
---

## Schema

Machine-readable JSON Schema: