# core/processor.py

Archive extraction and hit searching. No Telegram deps, no async.

## Public API

```python
from core.processor import compile_patterns, process_file
```

### `compile_patterns(keywords: list[str]) -> list[re.Pattern]`

Compiles a list of keyword strings into case-insensitive regex patterns. Call once at startup; pass the result everywhere patterns are needed.

```python
patterns = compile_patterns(config.TARGET_KEYWORDS)
```

### `process_file(filepath: Path, patterns, password=None) -> list[str]`

Full pipeline: unpack → search each `.txt` → recurse into nested archives → clean up everything.

Returns list of matching raw lines (hits). Deletes the original file and all extracted contents on completion.

```python
hits = process_file(Path("data/tmp/combo.zip"), patterns, password="infected")
```

---

## Internal functions

| Function | Signature | Description |
|----------|-----------|-------------|
| `search_file` | `(filepath, patterns) -> list[str]` | Stream-reads `.txt` line by line; ignores encoding errors |
| `unpack` | `(filepath, extra_password) -> (files, extract_dir\|None)` | Dispatches to correct extractor; plain `.txt` returned as-is |
| `extract_zip` | `(filepath, dest, extra_password)` | Tries no password first, then `ARCHIVE_PASSWORDS` list |
| `extract_7z` | `(filepath, dest, extra_password)` | Requires `py7zr`; skips if not installed |
| `extract_rar` | `(filepath, dest, extra_password)` | Requires `rarfile` + `unrar` binary |
| `_try_passwords` | `(extract_fn, passwords)` | Iterates password list, stops on first success |

---

## Supported formats

| Extension | Library | Notes |
|-----------|---------|-------|
| `.txt` | built-in | Stream-read, no load into memory |
| `.zip` | `zipfile` | stdlib |
| `.7z` | `py7zr` | optional; skipped if not installed |
| `.rar` | `rarfile` | optional; requires `unrar` system binary |

Nested archives are recursed **one level** only.

---

## Password order

1. `extra_password` (from message/channel carry-forward) - tried first
2. `config.ARCHIVE_PASSWORDS` - tried in order

---

## Cleanup guarantee

`process_file` always deletes:

- Extracted individual files
- Extract subdirectory
- Original downloaded file

Even if no hits are found.
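
One way to get the cleanup behaviour described above is a `try`/`finally` around the search step. The sketch below is illustrative only and is not taken from `core/processor.py`: `process_with_cleanup` and its `search` callback are hypothetical names, and running the cleanup when the search raises is an assumption that goes beyond what this document states.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def process_with_cleanup(
    filepath: Path,
    extract_dir: Path,
    search: Callable[[Path], list[str]],
) -> list[str]:
    """Hypothetical helper: search extracted .txt files, then always clean up."""
    hits: list[str] = []
    try:
        # Search every extracted .txt file in a stable order.
        for txt in sorted(extract_dir.rglob("*.txt")):
            hits.extend(search(txt))
    finally:
        # Runs whether or not any hits were found (and, by assumption,
        # also when `search` raises).
        shutil.rmtree(extract_dir, ignore_errors=True)  # extracted files + subdirectory
        filepath.unlink(missing_ok=True)                # original downloaded file
    return hits
```

Called with a throwaway directory created by `tempfile.mkdtemp()`, the helper leaves neither the archive nor the extract directory behind, even when `hits` comes back empty.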
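
The order given under "Password order" (carried-forward `extra_password` first, then the configured list) combined with the stop-on-first-success behaviour of `_try_passwords` could look roughly like the sketch below. It is a minimal illustration, not the module's actual code: `candidate_passwords` and `try_passwords` are hypothetical names, the `ARCHIVE_PASSWORDS` values are placeholders for `config.ARCHIVE_PASSWORDS`, skipping duplicates is an added assumption, and a failed attempt is assumed to be signalled by an exception.

```python
from typing import Callable, Iterable, Optional

# Placeholder standing in for config.ARCHIVE_PASSWORDS (values are made up).
ARCHIVE_PASSWORDS = ["infected", "1234"]


def candidate_passwords(
    extra_password: Optional[str], configured: Iterable[str]
) -> list[str]:
    """Return passwords in the order they should be tried."""
    ordered: list[str] = []
    if extra_password:
        ordered.append(extra_password)  # carried-forward password goes first
    for pw in configured:
        if pw not in ordered:  # assumption: skip duplicates
            ordered.append(pw)
    return ordered


def try_passwords(
    extract_fn: Callable[[str], None], passwords: Iterable[str]
) -> Optional[str]:
    """Call extract_fn with each password; stop at the first that succeeds."""
    for pw in passwords:
        try:
            extract_fn(pw)
        except Exception:
            # Assumption: a wrong password surfaces as an exception.
            continue
        return pw  # first success ends the loop
    return None  # no password worked
```

Returning the password that worked (or `None`) is a design choice of this sketch; it lets the caller log or reuse the successful password for nested archives.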