Initial commit: ULPgrammer

- Core Telegram monitoring pipeline (scraper, processor, notifier, downloaders)
- Textual TUI frontend with thread-safe event bus
- SQLite persistence, severity scoring, dedup cache
- Fixed ULP parser: handles https:// truncation, port+path URLs, semicolon separator
- Test suite: 88 tests across scorer, cache, database, processor
2026-04-02 01:58:49 -03:00
commit 48f486ac97
41 changed files with 5270 additions and 0 deletions

core/processor.md Normal file

@@ -0,0 +1,69 @@
# core/processor.py
Archive extraction and hit searching. No Telegram deps, no async.
## Public API
```python
from core.processor import compile_patterns, process_file
```
### `compile_patterns(keywords: list[str]) -> list[re.Pattern]`
Compiles a list of keyword strings into case-insensitive regex patterns.
Call once at startup; pass the result everywhere patterns are needed.
```python
patterns = compile_patterns(config.TARGET_KEYWORDS)
```
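A minimal sketch of what such a function could look like. The source does not say whether keywords are treated as literal strings or as raw regex fragments; this sketch assumes literals and escapes them with `re.escape`. Skipping empty keywords is also an assumption.

```python
import re


def compile_patterns(keywords: list[str]) -> list[re.Pattern]:
    """Compile each keyword into a case-insensitive pattern.

    Assumption: keywords are literal strings, so they are escaped.
    Empty keywords are skipped because an empty pattern matches every line.
    """
    return [re.compile(re.escape(kw), re.IGNORECASE) for kw in keywords if kw]


# Hypothetical keywords, for illustration only.
patterns = compile_patterns(["example.com", "Corp-VPN"])
line = "https://LOGIN.EXAMPLE.COM/auth:alice:hunter2"
matched = any(p.search(line) for p in patterns)
```

Compiling once and reusing the list avoids recompiling the same expressions for every line searched.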
### `process_file(filepath: Path, patterns, password=None) -> list[str]`
Full pipeline: unpack → search each `.txt` → recurse into nested archives → clean up everything.
Returns a list of matching raw lines (hits). On completion, deletes the original file and all extracted contents.
```python
hits = process_file(Path("data/tmp/combo.zip"), patterns, password="infected")
```
---
## Internal functions
| Function | Signature | Description |
|----------|-----------|-------------|
| `search_file` | `(filepath, patterns) -> list[str]` | Stream-reads `.txt` line by line; ignores encoding errors |
| `unpack` | `(filepath, extra_password) -> (files, extract_dir\|None)` | Dispatches to correct extractor; plain `.txt` returned as-is |
| `extract_zip` | `(filepath, dest, extra_password)` | Tries no password first, then `ARCHIVE_PASSWORDS` list |
| `extract_7z` | `(filepath, dest, extra_password)` | Requires `py7zr`; skips if not installed |
| `extract_rar` | `(filepath, dest, extra_password)` | Requires `rarfile` + `unrar` binary |
| `_try_passwords` | `(extract_fn, passwords)` | Iterates password list, stops on first success |
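
As an illustration of the `search_file` row, here is a hedged sketch of a streaming search that ignores encoding errors. The UTF-8 encoding and the stripping of line endings are assumptions; the table only states that the file is read line by line and that encoding errors are ignored.

```python
import re
from pathlib import Path


def search_file(filepath: Path, patterns: list[re.Pattern]) -> list[str]:
    """Return every line that matches at least one pattern."""
    hits: list[str] = []
    # errors="ignore" drops undecodable bytes instead of raising.
    # Assumption: the file is decoded as UTF-8.
    with open(filepath, "r", encoding="utf-8", errors="ignore") as fh:
        for line in fh:  # iterating the handle reads one line at a time
            line = line.rstrip("\r\n")
            if any(p.search(line) for p in patterns):
                hits.append(line)
    return hits
```

Iterating over the file handle keeps memory use proportional to the longest line rather than to the file size, which matters for large combo lists.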
---
## Supported formats
| Extension | Library | Notes |
|-----------|---------|-------|
| `.txt` | built-in | Stream-read; never loaded fully into memory |
| `.zip` | `zipfile` | stdlib |
| `.7z` | `py7zr` | optional; skipped if not installed |
| `.rar` | `rarfile` | optional; requires `unrar` system binary |
Nested archives are recursed **one level** only.
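
The "skipped if not installed" behavior can be sketched as a capability check before dispatch. The helper name `can_handle` and the `REQUIRED_MODULE` mapping are hypothetical and are not part of `core/processor.py`; the `unrar` lookup follows the note in the table above.

```python
import importlib.util
import shutil
from pathlib import Path

# Extension -> module that must be importable (None means nothing extra).
REQUIRED_MODULE = {
    ".txt": None,
    ".zip": "zipfile",
    ".7z": "py7zr",
    ".rar": "rarfile",
}


def can_handle(filepath: Path) -> bool:
    """Return True if the extension is supported and its dependency is present."""
    ext = filepath.suffix.lower()
    if ext not in REQUIRED_MODULE:
        return False
    module = REQUIRED_MODULE[ext]
    if module is not None and importlib.util.find_spec(module) is None:
        return False  # optional library missing: skip this file
    if ext == ".rar" and shutil.which("unrar") is None:
        return False  # library present but the system binary is not
    return True
```

Checking availability up front lets the caller skip an unsupported archive cleanly instead of failing partway through extraction.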
---
## Password order
1. `extra_password` (from message/channel carry-forward) — tried first
2. `config.ARCHIVE_PASSWORDS` — tried in order
---
## Cleanup guarantee
`process_file` always deletes the following, even if no hits are found:
- Extracted individual files
- Extract subdirectory
- Original downloaded file
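
One common way to get this kind of guarantee is a `try`/`finally` block. The sketch below is not the actual implementation; `with_cleanup` and its parameters are hypothetical names used only to show the pattern.

```python
import shutil
from pathlib import Path
from typing import Callable, Optional


def with_cleanup(filepath: Path,
                 extract_dir: Optional[Path],
                 work: Callable[[], list[str]]) -> list[str]:
    """Run work(), then remove the extract directory and the original file."""
    try:
        return work()
    finally:
        # Runs whether work() returned hits, returned nothing, or raised.
        if extract_dir is not None:
            shutil.rmtree(extract_dir, ignore_errors=True)
        filepath.unlink(missing_ok=True)  # missing_ok needs Python 3.8+
```

Because the deletion sits in `finally`, it also runs when the search raises, so no downloaded or extracted data is left on disk after a failure.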