modified: readme
@@ -197,6 +197,49 @@ choose to weight these at zero until field-validated against labeled data.
---

## Engine implementation notes

### Cross-alphabet fingerprint comparisons are undefined

Several primitives produce hash-based fingerprints over character or syntactic
sequences:

- `stylometric.character_ngram_simhash`
- `stylometric.pos_ngram_signature`
- `stylometric.function_word_distribution_top50` / `top200`
- `stylometric.distinctive_vocabulary_signature`
- `lexical.optional_grammar_signature`

These fingerprints are **meaningful only within a single writing system**.
A Latin-script actor (Spanish, English, French, Portuguese) and a Cyrillic-script
actor (Russian, Bulgarian, Serbian written in Cyrillic) share no alphabetic
character n-grams; any overlap comes only from script-neutral characters such as
digits, punctuation, and whitespace. Comparing their `character_ngram_simhash`
values produces a Hamming distance that is numerically valid but semantically
undefined: it measures incomparability, not dissimilarity.

The same applies to any other script boundary: Arabic vs Latin, Hangul vs Hiragana,
Devanagari vs Cyrillic, and so on.
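
To make the point concrete, here is a minimal Python sketch. It is not the engine's implementation: `char_ngrams`, `simhash64`, and `hamming` are hypothetical helpers, and the trigram size and the BLAKE2b feature hash are assumptions chosen for illustration. It shows that a Latin-script and a Cyrillic-script sample share no trigrams, yet their SimHash values still yield an ordinary-looking integer distance.

```python
import hashlib
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def simhash64(features: Counter) -> int:
    """64-bit SimHash: each feature casts weighted votes on every bit."""
    votes = [0] * 64
    for feature, weight in features.items():
        digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(64):
            votes[bit] += weight if (h >> bit) & 1 else -weight
    return sum(1 << bit for bit in range(64) if votes[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


latin = char_ngrams("el gato duerme en la casa")
cyrillic = char_ngrams("кошка спит в доме")

# No trigram appears in both samples: every trigram contains at least one letter.
shared = set(latin) & set(cyrillic)

# Still a plain integer in [0, 64]. For inputs with no shared features it tends
# to sit near the chance level of 32 bits, so it carries no stylistic signal.
distance = hamming(simhash64(latin), simhash64(cyrillic))
print(len(shared), distance)
```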

**Engine rule:** before compositing or comparing any hash-based fingerprint between
two actors, gate on script/language compatibility. Use `lexical.dialect_region` or
`lexical.code_switching_matrix_language` to determine whether two actors share a
writing system. If they do not, treat the fingerprint distance as `undefined` rather
than as evidence of dissimilarity, and exclude it from the similarity composite.
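
One way the rule could look in code is sketched below, under stated assumptions. `ActorFingerprint`, `gated_hamming`, and `composite` are hypothetical names, the `script` label is assumed to have been derived upstream (for example from the two `lexical.*` primitives named above), and renormalizing the weights over the remaining signals is an illustrative choice rather than something this README specifies. `None` stands in for `undefined`.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ActorFingerprint:
    # Hypothetical container; field names are not taken from the engine.
    script: str   # writing-system label, assumed to be resolved upstream
    simhash: int  # fingerprint value, e.g. character_ngram_simhash


def gated_hamming(a: ActorFingerprint, b: ActorFingerprint) -> Optional[int]:
    """Hamming distance within one script; None (undefined) across scripts."""
    if a.script != b.script:
        return None
    return bin(a.simhash ^ b.simhash).count("1")


def composite(scores: Dict[str, Optional[float]],
              weights: Dict[str, float]) -> Optional[float]:
    """Weighted mean over defined scores only.

    Undefined scores are dropped and the remaining weights renormalized,
    so an undefined fingerprint never counts as zero similarity.
    """
    defined = {name: s for name, s in scores.items() if s is not None}
    total = sum(weights[name] for name in defined)
    if total == 0:
        return None
    return sum(weights[name] * s for name, s in defined.items()) / total


es = ActorFingerprint("Latin", 0b1011)
pt = ActorFingerprint("Latin", 0b1000)
ru = ActorFingerprint("Cyrillic", 0b0010)

same_script = gated_hamming(es, pt)   # 0b1011 ^ 0b1000 has two set bits
cross_script = gated_hamming(es, ru)  # None: scripts differ

# With the fingerprint undefined, only the script-agnostic signal contributes.
score = composite({"simhash": None, "timing": 0.8},
                  {"simhash": 0.5, "timing": 0.5})
```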

Primitives that are **not** subject to this constraint (safe to compare across
writing systems without gating):

- All `meta.*` primitives: corpus-footprint metrics are script-agnostic.
- All `interaction.*` primitives: timing and graph-structure signals are
  script-agnostic.
- `stylometric.emoji_usage`, `stylometric.emoji_placement`: Unicode emoji are
  shared across scripts.
- `stylometric.capitalization_habit`: comparable only between scripts that have
  case, such as Latin and Cyrillic; emit `unknown` for caseless scripts (Arabic, CJK, etc.).
- `network.*`, `temporal_evolution.*`: structural signals, script-agnostic.
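
The exemption list above can be expressed as a small lookup. The sketch below is illustrative only: `needs_script_gate` and `has_case` are hypothetical helpers, and gating every primitive that is not explicitly exempt is a conservative assumption of this sketch, not a rule stated in this README. `has_case` shows one possible test for deciding when `stylometric.capitalization_habit` should emit `unknown`.

```python
# Exemptions copied from the list above.
EXEMPT_PREFIXES = ("meta.", "interaction.", "network.", "temporal_evolution.")
EXEMPT_NAMES = {
    "stylometric.emoji_usage",
    "stylometric.emoji_placement",
    "stylometric.capitalization_habit",
}


def needs_script_gate(primitive: str) -> bool:
    """True unless the primitive is explicitly exempt.

    Assumption: anything not listed as exempt is gated, to stay on the safe side.
    """
    if primitive in EXEMPT_NAMES:
        return False
    return not primitive.startswith(EXEMPT_PREFIXES)


def has_case(text: str) -> bool:
    """True if any character has distinct lower and upper forms."""
    return any(ch.lower() != ch.upper() for ch in text)


print(needs_script_gate("stylometric.character_ngram_simhash"))  # gated
print(needs_script_gate("stylometric.emoji_usage"))              # exempt
print(has_case("Hello"), has_case("مرحبا"), has_case("漢字"))
```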

---

## Schema

Machine-readable JSON Schema: