Initial commit: ULPgrammer
- Core Telegram monitoring pipeline (scraper, processor, notifier, downloaders)
- Textual TUI frontend with thread-safe event bus
- SQLite persistence, severity scoring, dedup cache
- Fixed ULP parser: handles https:// truncation, port+path URLs, semicolon separator
- Test suite: 88 tests across scorer, cache, database, processor
core/processor.md
@@ -0,0 +1,69 @@
# core/processor.py

Archive extraction and hit searching. No Telegram deps, no async.

## Public API

```python
from core.processor import compile_patterns, process_file
```

### `compile_patterns(keywords: list[str]) -> list[re.Pattern]`

Compiles a list of keyword strings into case-insensitive regex patterns.
Call once at startup; pass the result everywhere patterns are needed.

```python
patterns = compile_patterns(config.TARGET_KEYWORDS)
```

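To make the "compile once, reuse everywhere" idea concrete, here is a minimal sketch of what such a helper could look like. It is not the module's actual code: the name `compile_patterns_sketch` is hypothetical, and escaping each keyword with `re.escape` (treating keywords as literal text rather than regex syntax) is an assumption the document does not confirm.

```python
import re


def compile_patterns_sketch(keywords: list[str]) -> list[re.Pattern]:
    """Hypothetical stand-in for compile_patterns, for illustration only."""
    # Assumption: keywords are literal strings, so regex metacharacters
    # such as "." are escaped. Empty keywords are skipped.
    return [re.compile(re.escape(k), re.IGNORECASE) for k in keywords if k]


# Placeholder keywords, not taken from any real config.
patterns = compile_patterns_sketch(["example.com", "Corp-VPN"])

line = "https://LOGIN.EXAMPLE.COM/auth:alice:hunter2"
matched = any(p.search(line) for p in patterns)
```

Because the patterns are compiled with `re.IGNORECASE`, the upper-case host in `line` still matches the lower-case keyword.
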
### `process_file(filepath: Path, patterns, password=None) -> list[str]`

Full pipeline: unpack → search each `.txt` → recurse into nested archives → clean up everything.
Returns list of matching raw lines (hits). Deletes the original file and all extracted contents on completion.

```python
hits = process_file(Path("data/tmp/combo.zip"), patterns, password="infected")
```

---

## Internal functions

| Function | Signature | Description |
|----------|-----------|-------------|
| `search_file` | `(filepath, patterns) -> list[str]` | Stream-reads `.txt` line by line; ignores encoding errors |
| `unpack` | `(filepath, extra_password) -> (files, extract_dir\|None)` | Dispatches to correct extractor; plain `.txt` returned as-is |
| `extract_zip` | `(filepath, dest, extra_password)` | Tries no password first, then `ARCHIVE_PASSWORDS` list |
| `extract_7z` | `(filepath, dest, extra_password)` | Requires `py7zr`; skips if not installed |
| `extract_rar` | `(filepath, dest, extra_password)` | Requires `rarfile` + `unrar` binary |
| `_try_passwords` | `(extract_fn, passwords)` | Iterates password list, stops on first success |

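The `search_file` row describes streaming a text file line by line while ignoring encoding errors. The sketch below shows one way that could work with the standard library. The name `search_file_sketch`, the UTF-8 encoding, and the stripping of trailing newlines from hits are all assumptions, not details from the module. The demo writes a throwaway file containing two invalid bytes to show that decoding does not raise.

```python
import re
import tempfile
from pathlib import Path


def search_file_sketch(filepath: Path, patterns: list[re.Pattern]) -> list[str]:
    """Hypothetical stand-in for search_file, for illustration only."""
    hits = []
    # errors="ignore" drops undecodable bytes instead of raising
    # UnicodeDecodeError. UTF-8 is an assumed encoding.
    with open(filepath, "r", encoding="utf-8", errors="ignore") as fh:
        for line in fh:  # iterating the handle reads one line at a time
            line = line.rstrip("\r\n")
            if any(p.search(line) for p in patterns):
                hits.append(line)
    return hits


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes(
        b"https://example.com/login:alice:pw1\n"
        b"https://other.test:bob:pw2\n"
        b"\xff\xfe broken EXAMPLE.com line\n"  # invalid UTF-8 bytes up front
    )
    demo_patterns = [re.compile(r"example\.com", re.IGNORECASE)]
    demo_hits = search_file_sketch(sample, demo_patterns)
```
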
---

## Supported formats

| Extension | Library | Notes |
|-----------|---------|-------|
| `.txt` | built-in | Stream-read, no load into memory |
| `.zip` | `zipfile` | stdlib |
| `.7z` | `py7zr` | optional; skipped if not installed |
| `.rar` | `rarfile` | optional; requires `unrar` system binary |

Nested archives are recursed **one level** only.

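The format table implies a dispatch on file extension in which optional libraries are skipped when missing. A rough sketch of that shape follows. `unpack_sketch` and the `_extracted` directory naming are hypothetical, password handling and nested-archive recursion are left out, and `.rar` is omitted for brevity (it falls through to the "unsupported" branch here). Plain `.txt` files are returned as-is with no extract directory, matching the `unpack` description.

```python
import importlib.util
import tempfile
import zipfile
from pathlib import Path
from typing import Optional


def unpack_sketch(filepath: Path) -> tuple[list[Path], Optional[Path]]:
    """Hypothetical extension dispatcher, for illustration only."""
    suffix = filepath.suffix.lower()
    if suffix == ".txt":
        return [filepath], None  # nothing to extract

    # Assumed naming scheme for the extract directory.
    extract_dir = filepath.with_name(filepath.stem + "_extracted")
    if suffix == ".zip":
        extract_dir.mkdir(exist_ok=True)
        with zipfile.ZipFile(filepath) as zf:
            zf.extractall(extract_dir)
    elif suffix == ".7z":
        # Optional dependency: skip quietly when py7zr is not installed.
        if importlib.util.find_spec("py7zr") is None:
            return [], None
        import py7zr

        extract_dir.mkdir(exist_ok=True)
        with py7zr.SevenZipFile(filepath, mode="r") as archive:
            archive.extractall(path=extract_dir)
    else:
        return [], None  # unsupported in this sketch

    files = [p for p in extract_dir.rglob("*") if p.is_file()]
    return files, extract_dir


with tempfile.TemporaryDirectory() as tmp:
    demo_archive = Path(tmp) / "combo.zip"
    with zipfile.ZipFile(demo_archive, "w") as zf:
        zf.writestr("a.txt", "line one\n")
    files, extract_dir = unpack_sketch(demo_archive)
    extracted_names = sorted(p.name for p in files)
    extract_dir_name = extract_dir.name
    plain_files, plain_dir = unpack_sketch(Path(tmp) / "list.txt")
```
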
---

## Password order

1. `extra_password` (from message/channel carry-forward), tried first
2. `config.ARCHIVE_PASSWORDS`, tried in order

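The ordering above, combined with the "stops on first success" behaviour of `_try_passwords`, can be sketched as follows. Everything here is illustrative: `build_password_list`, `try_passwords_sketch`, the placeholder password values, the de-duplication step, and the choice to treat any exception as a failed attempt are assumptions. The no-password attempt mentioned for `extract_zip` is not modelled. A fake extractor stands in for a real archive so the order of attempts can be observed.

```python
from typing import Callable, Optional

ARCHIVE_PASSWORDS = ["infected", "1234"]  # placeholder values, not real config


def build_password_list(extra_password: Optional[str], defaults: list[str]) -> list[str]:
    """Put the carried-forward password first, then the configured list."""
    ordered = []
    if extra_password:
        ordered.append(extra_password)
    for pw in defaults:
        if pw not in ordered:  # assumed: skip duplicates
            ordered.append(pw)
    return ordered


def try_passwords_sketch(extract_fn: Callable[[str], None], passwords: list[str]) -> Optional[str]:
    """Call extract_fn with each password; return the first one that works."""
    for pw in passwords:
        try:
            extract_fn(pw)
        except Exception:  # assumed: any error counts as a wrong password
            continue
        return pw
    return None


attempts = []


def fake_extract(pw: str) -> None:
    # Stand-in for a real extractor: only "1234" succeeds.
    attempts.append(pw)
    if pw != "1234":
        raise RuntimeError("Bad password")


candidates = build_password_list("from-message", ARCHIVE_PASSWORDS)
winner = try_passwords_sketch(fake_extract, candidates)
```
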
---

## Cleanup guarantee

`process_file` always deletes:
- Extracted individual files
- Extract subdirectory
- Original downloaded file

Even if no hits are found.
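
One common way to get this kind of guarantee is a `try`/`finally` block around the extract-and-search steps. The document does not say how `process_file` implements it, so the sketch below is only an assumption about the general shape. `process_zip_sketch`, the `_extracted` directory name, and the plain substring match are hypothetical simplifications, and only `.zip` input is handled.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def process_zip_sketch(filepath: Path, needle: str) -> list[str]:
    """Hypothetical zip-only pipeline showing cleanup in a finally block."""
    extract_dir = filepath.with_name(filepath.stem + "_extracted")
    hits: list[str] = []
    try:
        extract_dir.mkdir(exist_ok=True)
        with zipfile.ZipFile(filepath) as zf:
            zf.extractall(extract_dir)
        for txt in extract_dir.rglob("*.txt"):
            with open(txt, encoding="utf-8", errors="ignore") as fh:
                hits.extend(l.rstrip("\r\n") for l in fh if needle in l)
    finally:
        # Runs whether or not hits were found, and also if extraction raised.
        shutil.rmtree(extract_dir, ignore_errors=True)
        filepath.unlink(missing_ok=True)
    return hits


workdir = Path(tempfile.mkdtemp())
demo_zip = workdir / "combo.zip"
with zipfile.ZipFile(demo_zip, "w") as zf:
    zf.writestr("dump.txt", "https://other.test:bob:pw\n")

no_hits = process_zip_sketch(demo_zip, "example.com")
leftovers = list(workdir.iterdir())  # expected to be empty after cleanup
shutil.rmtree(workdir)
```

In the demo the archive contains no matching line, so the function returns an empty list and the working directory is left with neither the archive nor the extract directory.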