# core/processor.py
Archive extraction and hit searching. No Telegram deps, no async.
## Public API
```python
from core.processor import compile_patterns, process_file
```
### `compile_patterns(keywords: list[str]) -> list[re.Pattern]`
Compiles a list of keyword strings into case-insensitive regex patterns.
Call once at startup; pass the result everywhere patterns are needed.
```python
patterns = compile_patterns(config.TARGET_KEYWORDS)
```
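
As a rough illustration of what such a compile step could look like, here is a minimal sketch. It is not the module's actual code: the name `compile_patterns_sketch` is hypothetical, and it assumes keywords are literal strings (so they are escaped) and that empty entries are dropped.

```python
import re


def compile_patterns_sketch(keywords: list[str]) -> list[re.Pattern]:
    """Hypothetical stand-in for compile_patterns.

    Assumption: keywords are literal text, so re.escape is applied.
    Assumption: empty strings are skipped.
    """
    return [re.compile(re.escape(k), re.IGNORECASE) for k in keywords if k]


# Compile once, then reuse the same list for every line that is checked.
patterns = compile_patterns_sketch(["example.com", "Internal-Portal", ""])

matched_upper = any(p.search("visit EXAMPLE.COM today") for p in patterns)
matched_lookalike = any(p.search("visit examplexcom today") for p in patterns)
```

Because the dot is escaped, `example.com` matches regardless of case but does not match the lookalike `examplexcom`.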
### `process_file(filepath: Path, patterns, password=None) -> list[str]`
Full pipeline: unpack → search each `.txt` → recurse into nested archives (one level only) → clean up everything.
Returns a list of matching raw lines (hits). Deletes the original file and all extracted contents on completion.
```python
hits = process_file(Path("data/tmp/combo.zip"), patterns, password="infected")
```
---
## Internal functions
| Function | Signature | Description |
|----------|-----------|-------------|
| `search_file` | `(filepath, patterns) -> list[str]` | Stream-reads `.txt` line by line; ignores encoding errors |
| `unpack` | `(filepath, extra_password) -> (files, extract_dir\|None)` | Dispatches to correct extractor; plain `.txt` returned as-is |
| `extract_zip` | `(filepath, dest, extra_password)` | Tries no password first, then `extra_password`, then the `ARCHIVE_PASSWORDS` list |
| `extract_7z` | `(filepath, dest, extra_password)` | Requires `py7zr`; skips if not installed |
| `extract_rar` | `(filepath, dest, extra_password)` | Requires `rarfile` + `unrar` binary |
| `_try_passwords` | `(extract_fn, passwords)` | Iterates password list, stops on first success |
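
The `search_file` row describes a streaming read that tolerates bad bytes. A minimal sketch of that idea follows. The name `search_file_sketch` is hypothetical, and it assumes UTF-8 decoding with `errors="ignore"` and that trailing newlines are stripped from returned hits; the real function may differ on both points.

```python
import re
import tempfile
from pathlib import Path


def search_file_sketch(filepath: Path, patterns: list[re.Pattern]) -> list[str]:
    """Hypothetical stand-in for search_file."""
    hits = []
    # Iterating over the file object reads one line at a time, so the whole
    # file is never held in memory. Undecodable bytes are silently dropped.
    with open(filepath, "r", encoding="utf-8", errors="ignore") as fh:
        for line in fh:
            if any(p.search(line) for p in patterns):
                hits.append(line.rstrip("\r\n"))  # assumption: newline stripped
    return hits


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    # The second line contains bytes that are not valid UTF-8.
    sample.write_bytes(b"alpha example.com\n\xff\xfe broken bytes\nbeta EXAMPLE.COM\n")
    demo_hits = search_file_sketch(
        sample, [re.compile(r"example\.com", re.IGNORECASE)]
    )
```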
---
## Supported formats
| Extension | Library | Notes |
|-----------|---------|-------|
| `.txt` | built-in | Stream-read, no load into memory |
| `.zip` | `zipfile` | stdlib |
| `.7z` | `py7zr` | optional; skipped if not installed |
| `.rar` | `rarfile` | optional; requires `unrar` system binary |

Nested archives are recursed **one level** only.

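
The table above amounts to a dispatch on file extension with two optional dependencies. One possible way to express that check is sketched below. The function name `pick_extractor` and its string return values are hypothetical, and it assumes extensions are matched case-insensitively and that unsupported files are skipped.

```python
import importlib.util
import shutil
from pathlib import Path


def pick_extractor(filepath: Path) -> str:
    """Hypothetical helper: name the handler for a file, or 'skip'."""
    ext = filepath.suffix.lower()  # assumption: case-insensitive matching
    if ext == ".txt":
        return "plain"
    if ext == ".zip":
        return "zip"  # zipfile ships with the standard library
    if ext == ".7z":
        # py7zr is optional; skip when it is not importable.
        return "7z" if importlib.util.find_spec("py7zr") is not None else "skip"
    if ext == ".rar":
        # rarfile needs both the Python package and the unrar binary.
        has_lib = importlib.util.find_spec("rarfile") is not None
        has_bin = shutil.which("unrar") is not None
        return "rar" if has_lib and has_bin else "skip"
    return "skip"  # assumption: unknown extensions are ignored


choices = [pick_extractor(Path(n)) for n in ("notes.TXT", "bundle.zip", "image.png")]
```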
---
## Password order
1. `extra_password` (from message/channel carry-forward) - tried first
2. `config.ARCHIVE_PASSWORDS` - tried in order
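
The ordering above, combined with the stop-on-first-success behaviour described for `_try_passwords`, can be sketched as follows. Both function names are hypothetical. De-duplicating the list and returning the winning password are assumptions, and the broad `except` would be narrowed to the extractor's own error types in real code.

```python
from typing import Callable, Iterable, Optional


def candidate_passwords(
    extra_password: Optional[str], archive_passwords: Iterable[str]
) -> list[str]:
    """Build the attempt order: extra_password first, then the config list."""
    ordered: list[str] = []
    if extra_password:
        ordered.append(extra_password)
    for pw in archive_passwords:
        if pw not in ordered:  # assumption: duplicates are tried only once
            ordered.append(pw)
    return ordered


def try_passwords_sketch(
    extract_fn: Callable[[str], None], passwords: Iterable[str]
) -> Optional[str]:
    """Call extract_fn with each password and stop at the first success."""
    for pw in passwords:
        try:
            extract_fn(pw)
        except Exception:  # sketch only: narrow this in real code
            continue
        return pw
    return None


attempts: list[str] = []


def fake_extract(pw: str) -> None:
    """Stand-in extractor that only accepts the password 'second'."""
    attempts.append(pw)
    if pw != "second":
        raise RuntimeError("bad password")


order = candidate_passwords("carry", ["first", "carry", "second", "third"])
winner = try_passwords_sketch(fake_extract, order)
```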
---
## Cleanup guarantee
`process_file` always deletes the following, even if no hits are found:
- Extracted individual files
- Extract subdirectory
- Original downloaded file
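
A guarantee like this is commonly implemented with `try`/`finally`, so that cleanup runs whether the search returns hits, returns nothing, or raises. The sketch below shows that pattern under stated assumptions: `process_with_cleanup` and its `work` callback are hypothetical names, not part of the module.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def process_with_cleanup(
    filepath: Path, extract_dir: Path, work: Callable[[Path], list[str]]
) -> list[str]:
    """Run work(extract_dir), then always remove extracted data and the original."""
    try:
        return work(extract_dir)
    finally:
        # Runs on success, on empty results, and on exceptions alike.
        shutil.rmtree(extract_dir, ignore_errors=True)
        filepath.unlink(missing_ok=True)


# Demo with placeholder files in a throwaway directory.
base = Path(tempfile.mkdtemp())
download = base / "download.bin"
download.write_bytes(b"placeholder")
extracted = base / "download_extracted"
extracted.mkdir()
(extracted / "a.txt").write_text("nothing interesting\n")

result = process_with_cleanup(download, extracted, lambda d: [])  # no hits
leftovers = sorted(p.name for p in base.iterdir())
base.rmdir()  # succeeds only because the directory is now empty
```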