# core/scraper.py

Telethon user-client layer. Handles live listening, backfill, and the single-message download pipeline.

## Public API

```python
from core.scraper import handle_message, backfill_all, register_handlers, warm_entity_cache
```

### `handle_message(client, bot, msg, source_name, patterns, password=None)`
**async.** Full pipeline for one document message:
1. Extract filename + size, check allowlist + size guard
2. Check `utils.cache`; skip if already seen
3. Try `tdl` download → Telethon fallback
4. `core.processor.process_file()` → hits
5. `core.notifier.notify()` if hits found
6. `utils.cache.mark_seen()`

Called by: live handler, `bot_downloader`, backfill fallback path.

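The six steps above can be sketched as a single coroutine. This is a minimal sketch and not the module's actual code: the allowlist, the size limit, the in-memory `seen` set, and the injected `download`, `process`, and `notify` callables are hypothetical stand-ins for `is_processable`, `utils.cache`, the tdl/Telethon downloaders, `core.processor.process_file()`, and `core.notifier.notify()`.

```python
import asyncio
from pathlib import PurePosixPath

# Hypothetical limits; the real values live in the project's config.
ALLOWED_EXTENSIONS = {".txt", ".zip", ".rar", ".7z"}
MAX_SIZE_BYTES = 2 * 1024**3


async def handle_document(filename, size, key, seen, download, process, notify):
    """Run one document through the pipeline; return a short status string."""
    # 1. Allowlist + size guard.
    if PurePosixPath(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        return "skipped: extension"
    if size > MAX_SIZE_BYTES:
        return "skipped: size"
    # 2. Dedup check.
    if key in seen:
        return "skipped: seen"
    # 3. Download (tdl first, Telethon fallback, hidden behind `download`).
    path = await download(filename)
    # 4. Process the file into hits.
    hits = process(path)
    # 5. Notify only when there is something to report.
    if hits:
        await notify(hits)
    # 6. Mark as seen so a re-post is skipped next time.
    seen.add(key)
    return f"processed: {len(hits)} hits"


async def _demo():
    sent = []

    async def fake_download(name):
        return f"/tmp/{name}"

    async def fake_notify(hits):
        sent.append(hits)

    seen = set()
    args = (seen, fake_download, lambda path: ["hit-1", "hit-2"], fake_notify)
    first = await handle_document("dump.txt", 1024, "chan:1", *args)
    second = await handle_document("dump.txt", 1024, "chan:1", *args)
    third = await handle_document("photo.jpg", 1024, "chan:2", *args)
    return first, second, third, sent


demo_result = asyncio.run(_demo())
```

In the demo, the second call with the same key is skipped by the dedup check and the `.jpg` file is rejected by the allowlist before any download is attempted.
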
### `backfill_all(client, bot, patterns)`
**async.** Iterates `config.WATCHED_CHANNELS`, calls `backfill_channel()` for each.
No-op if `config.BACKFILL_LIMIT == 0`.

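The control flow described here is small enough to sketch in full. The version below is only an illustration: `backfill_all_sketch`, the channel names, and the injected `backfill_channel` callable are hypothetical, and the real function also passes `client`, `bot`, and `patterns` through.

```python
import asyncio

# Hypothetical stand-in for config.WATCHED_CHANNELS.
WATCHED_CHANNELS = ["channel_a", "channel_b"]


async def backfill_all_sketch(channels, limit, backfill_channel):
    """Backfill each channel in turn; return how many channels were scanned."""
    if limit == 0:
        # Backfill disabled: do nothing at all.
        return 0
    for channel in channels:
        await backfill_channel(channel, limit)
    return len(channels)


async def _demo():
    visited = []

    async def fake_backfill_channel(channel, limit):
        visited.append((channel, limit))

    disabled = await backfill_all_sketch(WATCHED_CHANNELS, 0, fake_backfill_channel)
    enabled = await backfill_all_sketch(WATCHED_CHANNELS, 50, fake_backfill_channel)
    return disabled, enabled, visited


backfill_demo = asyncio.run(_demo())
```
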
### `register_handlers(client, bot, patterns)`
Registers a `NewMessage` Telethon event handler on `config.WATCHED_CHANNELS`.
Used in **CLI mode only** (`--no-tui`). The TUI manages its own handler via `_make_handler()` in `tui/app.py`.

### `warm_entity_cache(client)`
**async.** Iterates `client.iter_dialogs()` so Telethon caches entity mappings.
Must be called before using raw numeric channel IDs.

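The warm-up amounts to walking the dialog list once and discarding the results. The sketch below assumes nothing beyond an object with an async-iterable `iter_dialogs()`; `FakeClient` and `warm_entity_cache_sketch` are hypothetical names used so the example runs without a Telegram session, and the returned count is added purely for illustration.

```python
import asyncio


async def warm_entity_cache_sketch(client):
    """Walk every dialog once; return how many were seen."""
    count = 0
    # Iterating is the whole point: the client records entities as a side effect.
    async for _dialog in client.iter_dialogs():
        count += 1
    return count


class FakeClient:
    """Hypothetical stand-in exposing only iter_dialogs()."""

    def __init__(self, dialogs):
        self._dialogs = dialogs

    async def iter_dialogs(self):
        for dialog in self._dialogs:
            yield dialog


fake_client = FakeClient(["chat-1", "chat-2", "chat-3"])
warmed = asyncio.run(warm_entity_cache_sketch(fake_client))
```
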

---

## Internal functions

| Function | Description |
|----------|-------------|
| `get_filename(msg)` | Extracts filename from `MessageMediaDocument`; falls back to `{msg_id}{ext}` from MIME |
| `get_filesize(msg)` | Returns document size in bytes |
| `is_processable(filename, size)` | Checks extension allowlist + size limit; returns `(bool, reason)` |
| `_make_dest(msg, filename)` | Resolves temp path, handles collision with `{msg_id}_{filename}` |
| `_telethon_download(client, msg, dest, ...)` | Telethon fallback with tqdm progress + flood-wait handling. Posts `EvDownload*` bus events |
| `backfill_channel(client, bot, channel, patterns, limit)` | Scans history with password carry-forward; batches via tdl |
| `_process_batch(client, bot, batch, patterns)` | One tdl invocation for up to `TDL_AMOUNT` messages; per-file Telethon fallback |

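Two of the helpers in the table, the `(bool, reason)` check and the collision-safe destination path, can be illustrated as follows. This is a sketch under stated assumptions: the allowlist, the size limit, the reason strings, and the `_sketch` function names are hypothetical, and `make_dest_sketch` takes a directory and message ID directly instead of a Telethon message.

```python
import tempfile
from pathlib import Path

# Hypothetical allowlist and limit; the project defines its own.
ALLOWED_EXTENSIONS = {".txt", ".zip", ".rar", ".7z"}
MAX_SIZE_BYTES = 500 * 1024 * 1024


def is_processable_sketch(filename, size):
    """Return (ok, reason) for an extension allowlist plus size limit."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        return False, f"extension {ext or '(none)'} not allowed"
    if size > MAX_SIZE_BYTES:
        return False, f"size {size} exceeds {MAX_SIZE_BYTES}"
    return True, "ok"


def make_dest_sketch(tmp_dir, msg_id, filename):
    """Pick a destination path, prefixing the message ID on a name collision."""
    dest = Path(tmp_dir) / filename
    if dest.exists():
        dest = Path(tmp_dir) / f"{msg_id}_{filename}"
    return dest


with tempfile.TemporaryDirectory() as tmp:
    first = make_dest_sketch(tmp, 101, "dump.txt")
    first.write_text("placeholder")  # simulate an earlier download
    second = make_dest_sketch(tmp, 102, "dump.txt")
    dest_names = (first.name, second.name)

checks = (
    is_processable_sketch("dump.txt", 10),
    is_processable_sketch("photo.jpg", 10),
    is_processable_sketch("big.zip", MAX_SIZE_BYTES + 1),
)
```
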

---

## Password carry-forward (backfill)

Channels often post the archive password as a separate text message.
`backfill_channel` iterates newest→oldest, carrying `last_password` so both older and newer file messages in the same scan pick it up.

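One possible reading of the carry-forward is sketched below. Several parts are assumptions and not taken from the module: the password regex, the `(msg_id, kind, payload)` tuple shape, and in particular the `waiting` list, which is this sketch's guess at how files scanned before the password message (that is, newer files) could still receive it.

```python
import re

# Hypothetical pattern; real channels format passwords in many ways.
PASSWORD_RE = re.compile(r"pass(?:word)?\s*[:=]\s*(\S+)", re.IGNORECASE)


def assign_passwords(messages):
    """messages: newest-first list of (msg_id, kind, payload) tuples.

    Returns {msg_id: password or None} for every file message.
    """
    assigned = {}
    waiting = []  # files seen before any password turned up
    last_password = None
    for msg_id, kind, payload in messages:
        if kind == "text":
            match = PASSWORD_RE.search(payload)
            if match:
                last_password = match.group(1)
                # Newer files scanned earlier had no password yet: fill them in.
                for file_id in waiting:
                    assigned[file_id] = last_password
                waiting.clear()
        elif kind == "file":
            # Older files simply inherit whatever password is being carried.
            assigned[msg_id] = last_password
            if last_password is None:
                waiting.append(msg_id)
    return assigned


scan = [
    (30, "file", "part2.zip"),  # newer than the password message
    (20, "text", "Password: hunter2"),
    (10, "file", "part1.zip"),  # older than the password message
    (5, "text", "unrelated chatter"),
]
passwords = assign_passwords(scan)
```
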

---

## Download strategy

```
is_tdl_available()?
  yes → download_single_with_tdl() / download_batch_with_tdl()
          ↓ failed?
        _telethon_download()
  no  → _telethon_download() directly
```
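
The decision tree above maps onto a short try-then-fall-back coroutine. The sketch below is illustrative only: `download_with_fallback` and its injected callables are hypothetical, and it assumes a tdl failure surfaces as an exception, which the real `download_single_with_tdl()` may signal differently (for example through a return value).

```python
import asyncio


async def download_with_fallback(msg, tdl_available, tdl_download, telethon_download):
    """Try tdl when present, otherwise (or on failure) use the Telethon path."""
    if tdl_available():
        try:
            return "tdl", await tdl_download(msg)
        except Exception:
            # Any tdl failure falls through to the Telethon download.
            pass
    return "telethon", await telethon_download(msg)


async def _demo():
    async def tdl_ok(msg):
        return f"/tmp/tdl/{msg}"

    async def tdl_broken(msg):
        raise RuntimeError("tdl exited non-zero")

    async def telethon(msg):
        return f"/tmp/telethon/{msg}"

    return (
        await download_with_fallback("a.zip", lambda: True, tdl_ok, telethon),
        await download_with_fallback("b.zip", lambda: True, tdl_broken, telethon),
        await download_with_fallback("c.zip", lambda: False, tdl_ok, telethon),
    )


download_demo = asyncio.run(_demo())
```

The three demo calls cover each branch of the tree: tdl succeeds, tdl fails and Telethon takes over, and tdl is absent so Telethon is used directly.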