# utils/scorer.py

Severity scoring for credential hits. No Telegram deps. Pure logic.

## Public API

```python
from utils.scorer import score_hit, score_hits, summarize, ScoredHit
from utils.scorer import CRITICAL, HIGH, MEDIUM, LOW, SEVERITY_EMOJI, SEVERITY_SCORES
```

### `score_hit(line: str) -> ScoredHit`

Score a single raw credential line. Parses ULP format (`url:user:pass`), runs all checks, returns a `ScoredHit`.

### `score_hits(lines: list[str]) -> list[ScoredHit]`

Score a list of lines. Returns sorted descending by score.

### `summarize(scored: list[ScoredHit]) -> dict`

Returns `{CRITICAL: n, HIGH: n, MEDIUM: n, LOW: n}`.

---

## ScoredHit dataclass

| Field | Type | Description |
|-------|------|-------------|
| `raw` | str | Original credential line |
| `severity` | str | CRITICAL / HIGH / MEDIUM / LOW |
| `score` | int | 40 / 30 / 20 / 10 |
| `reasons` | list[str] | Human-readable match reasons |
| `url` | str\|None | Parsed URL field |
| `username` | str\|None | Parsed username/email field |
| `password` | str\|None | Parsed password field |
| `.emoji` | property | 🔴🟠🟡🟢 |

---

## Scoring rules (highest match wins)

| Severity | Triggers |
|----------|----------|
| CRITICAL | Employee email domain after `@` in username/line · Privileged service URL (admin, vpn, ssh, rdp, gitlab, jira…) |
| HIGH | Internal service URL (intranet, erp, crm, sso, owa, sharepoint…) |
| MEDIUM | Client-facing URL (app, patient, booking, helpdesk…) |
| LOW | Org domain appears anywhere in line (baseline) |

Check 6 (no severity change): flags weak passwords ≤6 chars or common strings.

---

## Employee domain matching

Keywords in `config.TARGET_KEYWORDS` containing `@` become employee patterns.

Pattern: `@(?:[^a-zA-Z0-9.\-]|$)` (requires literal `@` before the domain).

**`user@gmail.com` on a URL containing `myorg.cl` does NOT trigger CRITICAL.**

Keywords without `@` go only to `ORG_DOMAINS` (LOW baseline).
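
The anchoring behaviour described above can be illustrated with a minimal sketch. It assumes the escaped domain sits between the literal `@` and the trailing boundary group, and that matching is case-insensitive; `build_employee_pattern` is a hypothetical helper name, not necessarily what the module defines.

```python
import re


def build_employee_pattern(domain: str) -> re.Pattern:
    """Hypothetical helper: match `domain` only when it directly follows '@'."""
    # The trailing group rejects look-alikes such as '@myorg.cl.evil.com'
    # or '@myorg.club' by requiring a non-domain character or end of line.
    # IGNORECASE is an assumption, not confirmed by the module docs.
    return re.compile(
        r"@" + re.escape(domain) + r"(?:[^a-zA-Z0-9.\-]|$)",
        re.IGNORECASE,
    )


pattern = build_employee_pattern("myorg.cl")

# Employee email in the username field: matches.
employee_hit = pattern.search("https://mail.example.com/:juan@myorg.cl:secret")

# Org domain only in the URL, personal email as username: no match.
client_hit = pattern.search("https://portal.myorg.cl/login:user@gmail.com:pass")

# Domain used as a prefix of a longer host: no match.
lookalike_hit = pattern.search("https://x.example/:juan@myorg.cl.evil.com:pass")
```

Under these assumptions only `employee_hit` is a match object; the other two are `None`, which is why a `gmail.com` user on an org URL would fall through to a lower severity.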

---

## ULP line parser (`ULP_PATTERN`)

Separators: `:` `;` `,` `|` `\t` (any of these between the three fields).

The URL field handles two common stealer-log complications:

1. **`://` not treated as separator**: the optional scheme prefix `(?:https?|ftp)://` is consumed before the character-class match, so `https://` never gets split at the colon.
2. **Port + path consumed into the URL**: the optional group `(?::\d+/[^\s:;,|\t]*)` absorbs `:port/path` when the port is pure digits immediately followed by `/`. This correctly handles `http://host:8085/path/:user:pass` but intentionally skips patterns like `:24145487-8` (RUT number: hyphen after digits, no `/`).

**Known limitation:** A bare port with no path (e.g. `https://host:8080:user:pass`) will mis-parse `8080` as the username. This is not observed in practice: stealer logs always include at least a trailing `/`.

---

## Module-level globals (rebuilt on import + via reload_from_config)

| Name | Type | Description |
|------|------|-------------|
| `EMPLOYEE_DOMAINS` | `list[tuple[str, Pattern]]` | `(domain_str, anchored_pattern)` for `@`-keywords |
| `ORG_DOMAINS` | `list[Pattern]` | Plain domain patterns for all keywords |

The scorer uses `import config as _config` (not `from config import TARGET_KEYWORDS`), so patching `config.TARGET_KEYWORDS` at runtime is sufficient: `_build_*` reads the live module attribute.

To rebuild after editing `config.TARGET_KEYWORDS` at runtime:

```python
import utils.scorer as scorer
scorer.reload_from_config()
```

### `reload_from_config() -> None`

Rebuilds `EMPLOYEE_DOMAINS` and `ORG_DOMAINS` from the current `config.TARGET_KEYWORDS`. Called by web config routes after `config.save_runtime_config()` writes new keyword groups.
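
Why the module-style import matters for reloading can be shown with a small, self-contained sketch. It uses a stand-in module object rather than the real `config`, and the function names and keyword values are illustrative assumptions only.

```python
import types

# Stand-in for the real `config` module (hypothetical contents).
config = types.ModuleType("config")
config.TARGET_KEYWORDS = ["@myorg.cl"]

# Style 1: hold a reference to the module and look the attribute up on each call,
# which is what `import config as _config` gives you.
_config = config


def build_from_module() -> list:
    return list(_config.TARGET_KEYWORDS)


# Style 2: equivalent of `from config import TARGET_KEYWORDS`;
# the name is bound once to whatever list existed at import time.
TARGET_KEYWORDS = config.TARGET_KEYWORDS


def build_from_snapshot() -> list:
    return list(TARGET_KEYWORDS)


# A runtime edit that rebinds the module attribute to a new list.
config.TARGET_KEYWORDS = ["@myorg.cl", "partner.example"]

live = build_from_module()       # sees the new keywords
stale = build_from_snapshot()    # still sees the original list
```

Note that the difference only shows up when the attribute is rebound to a new object; mutating the same list in place would be visible through both styles.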