Initial commit: ULPgrammer

- Core Telegram monitoring pipeline (scraper, processor, notifier, downloaders)
- Textual TUI frontend with thread-safe event bus
- SQLite persistence, severity scoring, dedup cache
- Fixed ULP parser: handles https:// truncation, port+path URLs, semicolon separator
- Test suite: 88 tests across scorer, cache, database, processor
2026-04-02 01:58:49 -03:00
commit 48f486ac97
41 changed files with 5270 additions and 0 deletions

core/scraper.md

@@ -0,0 +1,65 @@
# core/scraper.py
Telethon user-client layer. Handles live listening, backfill, and the single-message download pipeline.
## Public API
```python
from core.scraper import handle_message, backfill_all, register_handlers, warm_entity_cache
```
### `handle_message(client, bot, msg, source_name, patterns, password=None)`
**async.** Full pipeline for one document message:
1. Extract filename + size, check allowlist + size guard
2. Check `utils.cache` — skip if already seen
3. Try `tdl` download → Telethon fallback
4. `core.processor.process_file()` → hits
5. `core.notifier.notify()` if hits found
6. `utils.cache.mark_seen()`
Called by: live handler, `bot_downloader`, backfill fallback path.
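
The six steps above can be sketched as a single coroutine. This is a minimal illustration, not the module's code: `handle_message_sketch`, the `key` argument, the injected `download`/`process`/`notify` callables, and the allowlist and size values are all assumptions made for the example.

```python
import asyncio
import os

# Assumed values for illustration; the real allowlist and limit live elsewhere.
ALLOWED_EXTENSIONS = {".txt", ".zip", ".rar", ".7z"}
MAX_SIZE_BYTES = 2 * 1024**3


async def handle_message_sketch(key, filename, size, seen, download, process, notify):
    """Walk one document through the six steps and return a status string."""
    # 1. Extension allowlist + size guard
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS or size > MAX_SIZE_BYTES:
        return "skipped"
    # 2. Dedup check (a plain set stands in for utils.cache)
    if key in seen:
        return "duplicate"
    # 3. Download (the tdl-then-Telethon choice is hidden behind `download`)
    path = await download(filename)
    # 4. Scan the downloaded file for hits
    hits = process(path)
    # 5. Notify only when something matched
    if hits:
        await notify(hits)
    # 6. Record the message so it is not processed again
    seen.add(key)
    return "notified" if hits else "clean"
```

Passing the collaborators in as arguments keeps the sketch testable without a Telegram session; the real function takes `client`, `bot`, and `msg` instead.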
### `backfill_all(client, bot, patterns)`
**async.** Iterates `config.WATCHED_CHANNELS`, calls `backfill_channel()` for each.
No-op if `config.BACKFILL_LIMIT == 0`.
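
A rough sketch of that control flow, assuming channels are processed one after another (the source does not say whether they run sequentially or concurrently). `backfill_all_sketch` and its parameters are hypothetical stand-ins for the config values and `backfill_channel()`.

```python
import asyncio


async def backfill_all_sketch(channels, limit, backfill_channel):
    """Backfill each channel in turn; return how many channels were scanned."""
    # A limit of 0 disables backfill entirely.
    if limit == 0:
        return 0
    for channel in channels:
        # Sequential on purpose in this sketch: one channel at a time.
        await backfill_channel(channel, limit)
    return len(channels)
```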
### `register_handlers(client, bot, patterns)`
Registers a `NewMessage` Telethon event handler on `config.WATCHED_CHANNELS`.
Used in **CLI mode only** (`--no-tui`). The TUI manages its own handler via `_make_handler()` in `tui/app.py`.
### `warm_entity_cache(client)`
**async.** Iterates `client.iter_dialogs()` so Telethon caches entity mappings.
Call it before referring to channels by raw numeric ID: Telethon can only resolve a bare ID for an entity it has already seen.
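
The warm-up amounts to draining the dialog iterator once. In the sketch below, `warm_entity_cache_sketch` and `FakeClient` are hypothetical; `FakeClient` only mimics the async-iterator shape of Telethon's `client.iter_dialogs()` so the example runs without a session.

```python
import asyncio


async def warm_entity_cache_sketch(client):
    """Iterate every dialog once; return how many were seen."""
    count = 0
    async for _dialog in client.iter_dialogs():
        # Nothing to do per dialog: with a real Telethon client, fetching
        # the dialogs is what populates the entity cache.
        count += 1
    return count


class FakeClient:
    """Stand-in exposing only iter_dialogs(), for demonstration."""

    def __init__(self, dialogs):
        self._dialogs = dialogs

    async def iter_dialogs(self):
        # Async generator, so it can be consumed with `async for`.
        for dialog in self._dialogs:
            yield dialog
```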
---
## Internal functions
| Function | Description |
|----------|-------------|
| `get_filename(msg)` | Extracts filename from `MessageMediaDocument`; falls back to `{msg_id}{ext}` from MIME |
| `get_filesize(msg)` | Returns document size in bytes |
| `is_processable(filename, size)` | Checks extension allowlist + size limit; returns `(bool, reason)` |
| `_make_dest(msg, filename)` | Resolves the temp path; on a name collision, falls back to `{msg_id}_{filename}` |
| `_telethon_download(client, msg, dest, ...)` | Telethon fallback with tqdm progress + flood-wait handling. Posts `EvDownload*` bus events |
| `backfill_channel(client, bot, channel, patterns, limit)` | Scans history with password carry-forward; batches via tdl |
| `_process_batch(client, bot, batch, patterns)` | One tdl invocation for up to `TDL_AMOUNT` messages; per-file Telethon fallback |
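
Three of the helpers in the table are small enough to sketch. The versions below are illustrative only: the `_sketch` names, the allowlist, the size limit, the reason strings, and the explicit `tmp_dir`/`msg_id`/`mime_type` parameters are assumptions; the real functions take a Telethon message.

```python
import mimetypes
import os

# Assumed values for illustration only.
ALLOWED_EXTENSIONS = {".txt", ".zip", ".rar", ".7z"}
MAX_FILE_SIZE = 500 * 1024 * 1024


def is_processable_sketch(filename, size):
    """Return (ok, reason), mirroring the (bool, reason) contract."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        return False, f"extension {ext or '(none)'} not in allowlist"
    if size > MAX_FILE_SIZE:
        return False, f"size {size} exceeds limit {MAX_FILE_SIZE}"
    return True, "ok"


def fallback_filename_sketch(msg_id, mime_type):
    """Build `{msg_id}{ext}` when the document carries no filename."""
    ext = mimetypes.guess_extension(mime_type) or ""
    return f"{msg_id}{ext}"


def make_dest_sketch(tmp_dir, msg_id, filename):
    """Pick a destination path, prefixing the message ID on a collision."""
    dest = os.path.join(tmp_dir, filename)
    if os.path.exists(dest):
        dest = os.path.join(tmp_dir, f"{msg_id}_{filename}")
    return dest
```

Returning a reason alongside the boolean lets the caller log why a file was skipped without re-deriving the check.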
---
## Password carry-forward (backfill)
Channels often post the archive password as a separate text message.
`backfill_channel` iterates newest→oldest, carrying `last_password` so both older and newer file messages in the same scan pick it up.
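
A simplified sketch of the carry step, covering only files visited after a password message in scan order. How the real function treats files visited before the password is not shown here. The dict-based message shape, `PASSWORD_RE`, and `assign_passwords_sketch` are assumptions for illustration.

```python
import re

# Assumed pattern: "password: xyz", "pass=xyz", "pw: xyz".
PASSWORD_RE = re.compile(r"(?:pass(?:word)?|pw)\s*[:=]\s*(\S+)", re.IGNORECASE)


def assign_passwords_sketch(messages):
    """Pair each document message with the most recently seen password.

    `messages` is an iterable of dicts in scan order, each with the keys
    'id', 'text', and 'has_document'.
    """
    last_password = None
    assigned = []
    for msg in messages:
        if msg.get("has_document"):
            # File message: attach whatever password is being carried.
            assigned.append((msg["id"], last_password))
            continue
        match = PASSWORD_RE.search(msg.get("text") or "")
        if match:
            # Text message announcing a password: carry it forward.
            last_password = match.group(1)
    return assigned
```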
---
## Download strategy
```
is_tdl_available()?
  yes → download_single_with_tdl() / download_batch_with_tdl()
          ↓ failed?
        _telethon_download()
  no  → _telethon_download() directly
```
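
The same decision tree as a coroutine, for the single-file case. This is a sketch under assumptions: `download_with_fallback_sketch` is hypothetical, the downloaders are injected as callables, and tdl failure is modelled as either an exception or a `None` result, which may not match how the real helpers report it.

```python
import asyncio


async def download_with_fallback_sketch(msg, tdl_available, tdl_download, telethon_download):
    """Return (path, backend), preferring tdl and falling back to Telethon."""
    if tdl_available():
        try:
            path = await tdl_download(msg)
            if path is not None:
                return path, "tdl"
        except Exception:
            # Assumed failure signal; fall through to the Telethon path.
            pass
    return await telethon_download(msg), "telethon"
```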