- Core Telegram monitoring pipeline (scraper, processor, notifier, downloaders)
- Textual TUI frontend with thread-safe event bus
- SQLite persistence, severity scoring, dedup cache
- Fixed ULP parser: handles https:// truncation, port+path URLs, semicolon separator
- Test suite: 88 tests across scorer, cache, database, processor
core/processor.py
Archive extraction and hit searching. No Telegram deps, no async.
Public API
from core.processor import compile_patterns, process_file
compile_patterns(keywords: list[str]) -> list[re.Pattern]
Compiles a list of keyword strings into case-insensitive regex patterns.
Call once at startup; pass the result everywhere patterns are needed.
patterns = compile_patterns(config.TARGET_KEYWORDS)
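As a rough illustration of what "compile once, reuse everywhere" might look like, here is a minimal sketch. The name `compile_patterns_sketch` is hypothetical, and escaping keywords with `re.escape` (treating them as literal strings rather than regex fragments) is an assumption; the real `compile_patterns` may behave differently.

```python
import re


def compile_patterns_sketch(keywords: list[str]) -> list[re.Pattern]:
    """Hypothetical stand-in for compile_patterns.

    Assumption: keywords are literal strings, so each one is escaped
    before compilation. Empty strings are skipped.
    """
    return [re.compile(re.escape(k), re.IGNORECASE) for k in keywords if k]


# Compile once, then reuse the same list for every line that is checked.
patterns = compile_patterns_sketch(["example.com", "Corp-VPN"])
line = "https://login.EXAMPLE.com:8443/admin;alice;hunter2"
is_hit = any(p.search(line) for p in patterns)
```

Because the patterns are compiled with `re.IGNORECASE`, the upper-case `EXAMPLE.com` in the sample line still counts as a match.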
process_file(filepath: Path, patterns, password=None) -> list[str]
Full pipeline: unpack → search each .txt → recurse into nested archives → clean up everything.
Returns a list of matching raw lines (hits). Deletes the original file and all extracted contents on completion.
hits = process_file(Path("data/tmp/combo.zip"), patterns, password="infected")
Internal functions
| Function | Signature | Description |
|---|---|---|
| `search_file` | `(filepath, patterns) -> list[str]` | Stream-reads .txt line by line; ignores encoding errors |
| `unpack` | `(filepath, extra_password) -> (files, extract_dir \| None)` | Dispatches to correct extractor; plain .txt returned as-is |
| `extract_zip` | `(filepath, dest, extra_password)` | Tries no password first, then ARCHIVE_PASSWORDS list |
| `extract_7z` | `(filepath, dest, extra_password)` | Requires py7zr; skips if not installed |
| `extract_rar` | `(filepath, dest, extra_password)` | Requires rarfile + unrar binary |
| `_try_passwords` | `(extract_fn, passwords)` | Iterates password list, stops on first success |
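The streaming behaviour described for `search_file` could be sketched roughly as follows. The function name `search_file_sketch`, the UTF-8 encoding, and the stripping of trailing newlines from hits are all assumptions made for illustration; only the line-by-line read and the tolerance for encoding errors come from the table above.

```python
import re
import tempfile
from pathlib import Path


def search_file_sketch(filepath: Path, patterns: list[re.Pattern]) -> list[str]:
    """Hypothetical stand-in for search_file."""
    hits = []
    # errors="ignore" drops undecodable bytes instead of raising, and
    # iterating the file handle reads one line at a time.
    with open(filepath, "r", encoding="utf-8", errors="ignore") as fh:
        for line in fh:
            stripped = line.rstrip("\r\n")  # assumption: hits carry no newline
            if any(p.search(stripped) for p in patterns):
                hits.append(stripped)
    return hits


# Demo on a throwaway file that contains an invalid UTF-8 byte sequence.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes(
        b"https://example.com/login:alice:pw\n"
        b"\xff\xfe broken bytes\n"
        b"other.org:bob:pw\n"
    )
    demo_hits = search_file_sketch(
        sample, [re.compile(r"example\.com", re.IGNORECASE)]
    )
```

Iterating the file object keeps memory use proportional to the longest line rather than to the file size, which matters for large combo lists.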
Supported formats
| Extension | Library | Notes |
|---|---|---|
| `.txt` | built-in | Stream-read, no load into memory |
| `.zip` | `zipfile` | stdlib |
| `.7z` | `py7zr` | optional; skipped if not installed |
| `.rar` | `rarfile` | optional; requires unrar system binary |
Nested archives are recursed one level only.
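One way to picture the format table is as a lookup from file extension to the optional module it needs. The sketch below is an illustration only: `REQUIRED_MODULE` and `can_handle` are hypothetical names, and the sketch does not check for the unrar system binary that `.rar` support also requires.

```python
import importlib.util
from pathlib import Path

# Extension -> third-party module required (None means stdlib only).
# Hypothetical table mirroring the supported formats listed above.
REQUIRED_MODULE = {
    ".txt": None,
    ".zip": None,
    ".7z": "py7zr",
    ".rar": "rarfile",
}


def can_handle(filepath: Path) -> bool:
    """Return True if the extension is supported and its module is importable."""
    suffix = filepath.suffix.lower()
    if suffix not in REQUIRED_MODULE:
        return False
    module = REQUIRED_MODULE[suffix]
    # find_spec returns None when a top-level module is not installed.
    return module is None or importlib.util.find_spec(module) is not None
```

With a check like this, a `.7z` file on a machine without py7zr would be skipped rather than raising an import error mid-pipeline.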
Password order
1. extra_password (from message/channel carry-forward): tried first
2. config.ARCHIVE_PASSWORDS: tried in order
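The ordering above, combined with the "stop on first success" behaviour of `_try_passwords`, might be sketched like this. The helper names, the de-duplication of repeated passwords, the broad `except Exception`, and the return value (the password that worked, or `None`) are all assumptions for illustration, not the module's confirmed behaviour.

```python
from typing import Callable, Iterable, Optional


def candidate_passwords(extra_password: Optional[str],
                        configured: Iterable[str]) -> list[str]:
    """Put extra_password first, then the configured list in its own order."""
    ordered: list[str] = []
    if extra_password:
        ordered.append(extra_password)
    for pw in configured:
        if pw not in ordered:  # assumption: skip duplicates
            ordered.append(pw)
    return ordered


def try_passwords_sketch(extract_fn: Callable[[str], None],
                         passwords: Iterable[str]) -> Optional[str]:
    """Call extract_fn with each password; stop at the first that succeeds."""
    for pw in passwords:
        try:
            extract_fn(pw)
        except Exception:  # simplification: real code would catch narrower errors
            continue
        return pw
    return None


# Demo with a fake extractor that only accepts "infected".
attempts: list[str] = []


def fake_extract(password: str) -> None:
    attempts.append(password)
    if password != "infected":
        raise RuntimeError("bad password")


order = candidate_passwords("hunter2", ["infected", "hunter2", "1234"])
winner = try_passwords_sketch(fake_extract, order)
```

In the demo, the carried-forward password is tried first, the configured list follows, and the third candidate is never attempted because the second one succeeds.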
Cleanup guarantee
process_file always deletes:
- Extracted individual files
- Extract subdirectory
- Original downloaded file
Even if no hits are found.
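A guarantee like this is commonly implemented with `try`/`finally`, so that cleanup runs whether the search finds hits, finds nothing, or raises. The sketch below assumes that approach; `process_with_cleanup` and its `search` callback are hypothetical and stand in for the real pipeline.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Optional


def process_with_cleanup(filepath: Path,
                         extract_dir: Optional[Path],
                         search: Callable[[Path], list[str]]) -> list[str]:
    """Run search, then always remove extracted contents and the original."""
    try:
        return search(filepath)
    finally:
        if extract_dir is not None:
            # Removes the extract subdirectory and every file inside it.
            shutil.rmtree(extract_dir, ignore_errors=True)
        # Removes the original downloaded file, even when there were no hits.
        filepath.unlink(missing_ok=True)


# Demo: a download plus an extract directory, and a search that finds nothing.
workdir = Path(tempfile.mkdtemp())
download = workdir / "combo.txt"
download.write_text("nothing interesting\n")
extracted = workdir / "extracted"
extracted.mkdir()
(extracted / "inner.txt").write_text("x\n")

result = process_with_cleanup(download, extracted, lambda p: [])
leftovers = sorted(p.name for p in workdir.iterdir())
shutil.rmtree(workdir, ignore_errors=True)  # tidy up the demo directory itself
```

Putting the deletions in `finally` means a failed search cannot leave extracted files behind on disk.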