Initial commit: ULPgrammer
- Core Telegram monitoring pipeline (scraper, processor, notifier, downloaders)
- Textual TUI frontend with thread-safe event bus
- SQLite persistence, severity scoring, dedup cache
- Fixed ULP parser: handles https:// truncation, port+path URLs, semicolon separator
- Test suite: 88 tests across scorer, cache, database, processor
65
core/scraper.md
Normal file
@@ -0,0 +1,65 @@
# core/scraper.py

Telethon user-client layer. Handles live listening, backfill, and the single-message download pipeline.

## Public API

```python
from core.scraper import handle_message, backfill_all, register_handlers, warm_entity_cache
```

### `handle_message(client, bot, msg, source_name, patterns, password=None)`

**async.** Full pipeline for one document message:

1. Extract filename + size, check allowlist + size guard
2. Check `utils.cache`; skip if already seen
3. Try `tdl` download → Telethon fallback
4. `core.processor.process_file()` → hits
5. `core.notifier.notify()` if hits found
6. `utils.cache.mark_seen()`

Called by: live handler, `bot_downloader`, backfill fallback path.
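
The six steps above can be sketched as one coroutine. This is only an illustration of the control flow, not the project's code: `FakeMessage`, `Pipeline`, `handle_message_sketch`, the extension allowlist and the size cap are all hypothetical stand-ins, and whether the real function marks a message as seen when no hits are found is an assumption.

```python
import asyncio
import os
from dataclasses import dataclass, field

# Hypothetical limits; the real values live in the project's config.
ALLOWED_EXTENSIONS = {".txt", ".zip", ".rar", ".7z"}
MAX_SIZE_BYTES = 2 * 1024**3


@dataclass
class FakeMessage:
    """Stand-in for a Telethon document message."""
    id: int
    filename: str
    size: int


@dataclass
class Pipeline:
    """Injected collaborators; each one stands in for a real module."""
    seen: set = field(default_factory=set)        # plays the role of utils.cache
    notified: list = field(default_factory=list)  # plays the role of core.notifier

    async def download(self, msg):
        # Placeholder for "tdl download, then Telethon fallback".
        return f"/tmp/{msg.id}_{msg.filename}"

    def process_file(self, path, patterns):
        # Placeholder for core.processor.process_file(): naive substring match.
        return [p for p in patterns if p in path]


async def handle_message_sketch(pipe, msg, patterns):
    # 1. Allowlist + size guard.
    ext = os.path.splitext(msg.filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS or msg.size > MAX_SIZE_BYTES:
        return "skipped"
    # 2. Dedup check.
    if msg.id in pipe.seen:
        return "duplicate"
    # 3. Download.
    path = await pipe.download(msg)
    # 4. Scan the file for hits.
    hits = pipe.process_file(path, patterns)
    # 5. Notify only when there is something to report.
    if hits:
        pipe.notified.append((msg.id, hits))
    # 6. Mark as seen so later calls short-circuit at step 2.
    pipe.seen.add(msg.id)
    return "processed"
```

Passing the collaborators in as one object keeps the sketch testable without a Telegram session; the real function takes `client` and `bot` instead.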

### `backfill_all(client, bot, patterns)`

**async.** Iterates `config.WATCHED_CHANNELS`, calls `backfill_channel()` for each.
No-op if `config.BACKFILL_LIMIT == 0`.
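
A rough sketch of that loop, assuming channels are processed one after another (the source does not say whether the real function runs them sequentially or concurrently). `backfill_all_sketch` and its parameters are hypothetical; the config values are passed in explicitly so the example is self-contained.

```python
import asyncio


async def backfill_all_sketch(channels, backfill_limit, backfill_channel):
    """Call backfill_channel(channel, limit) once per watched channel.

    Returns the number of channels visited.
    """
    if backfill_limit == 0:
        return 0  # backfill disabled: do nothing
    visited = 0
    for channel in channels:
        await backfill_channel(channel, backfill_limit)
        visited += 1
    return visited
```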

### `register_handlers(client, bot, patterns)`

Registers a `NewMessage` Telethon event handler on `config.WATCHED_CHANNELS`.
Used in **CLI mode only** (`--no-tui`). The TUI manages its own handler via `_make_handler()` in `tui/app.py`.

### `warm_entity_cache(client)`

**async.** Iterates `client.iter_dialogs()` so Telethon caches entity mappings.
Must be called before using raw numeric channel IDs.
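
The warm-up amounts to draining the dialog iterator once. A minimal sketch, assuming only that the client exposes `iter_dialogs()` as an async iterator (as Telethon's `TelegramClient` does); `warm_entity_cache_sketch` and `FakeClient` are hypothetical names, and the returned count is added here purely for illustration.

```python
import asyncio


async def warm_entity_cache_sketch(client):
    """Walk every dialog once so the client can record id -> entity mappings."""
    count = 0
    async for _dialog in client.iter_dialogs():
        count += 1  # the side effect happens inside the client; we only count
    return count


class FakeClient:
    """Test double exposing only iter_dialogs(), so no network is needed."""

    async def iter_dialogs(self):
        for name in ("channel-a", "channel-b"):
            yield name
```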

---

## Internal functions

| Function | Description |
|----------|-------------|
| `get_filename(msg)` | Extracts filename from `MessageMediaDocument`; falls back to `{msg_id}{ext}` from MIME |
| `get_filesize(msg)` | Returns document size in bytes |
| `is_processable(filename, size)` | Checks extension allowlist + size limit; returns `(bool, reason)` |
| `_make_dest(msg, filename)` | Resolves temp path, handles collision with `{msg_id}_{filename}` |
| `_telethon_download(client, msg, dest, ...)` | Telethon fallback with tqdm progress + flood-wait handling. Posts `EvDownload*` bus events |
| `backfill_channel(client, bot, channel, patterns, limit)` | Scans history with password carry-forward; batches via tdl |
| `_process_batch(client, bot, batch, patterns)` | One tdl invocation for up to `TDL_AMOUNT` messages; per-file Telethon fallback |
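
To make two of the table rows concrete, here is a hedged sketch of the `(bool, reason)` contract of `is_processable` and the collision rule of `_make_dest`. The allowlist, the size cap, the reason strings and the explicit `temp_dir`/`msg_id` parameters are assumptions made for the example; the real helpers take a Telethon message and read their limits from config.

```python
import os

ALLOWED_EXTENSIONS = {".txt", ".zip", ".rar", ".7z"}  # hypothetical allowlist
MAX_SIZE_BYTES = 500 * 1024 * 1024                    # hypothetical size cap


def is_processable_sketch(filename, size):
    """Return (ok, reason) so callers can log why a file was rejected."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        return False, f"extension {ext or '(none)'} not in allowlist"
    if size > MAX_SIZE_BYTES:
        return False, f"size {size} exceeds limit {MAX_SIZE_BYTES}"
    return True, "ok"


def make_dest_sketch(temp_dir, msg_id, filename):
    """Use the plain filename, or prefix the message id if that path is taken."""
    dest = os.path.join(temp_dir, filename)
    if os.path.exists(dest):
        dest = os.path.join(temp_dir, f"{msg_id}_{filename}")
    return dest
```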

---

## Password carry-forward (backfill)

Channels often post the archive password as a separate text message.
`backfill_channel` iterates newest→oldest, carrying `last_password` so both older and newer file messages in the same scan pick it up.
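
One way such a scan could work is sketched below. It is an assumption, not the project's implementation: files that appear before any password in the newest→oldest walk (that is, files newer than the password message) are held in a pending list and filled in once a password turns up, while files seen afterwards (older ones) inherit `last_password` directly. `PASSWORD_RE`, `assign_passwords` and the `(msg_id, kind, payload)` tuple shape are hypothetical.

```python
import re

# Hypothetical password format: "pass: xyz" or "password=xyz".
PASSWORD_RE = re.compile(r"\bpass(?:word)?\s*[:=]\s*(\S+)", re.IGNORECASE)


def assign_passwords(messages):
    """Map file message ids to passwords.

    messages: iterable of (msg_id, kind, payload), newest first, where kind
    is "text" or "file". Returns {file_msg_id: password or None}.
    """
    assigned = {}
    pending = []          # files seen before any password was found
    last_password = None
    for msg_id, kind, payload in messages:
        if kind == "text":
            match = PASSWORD_RE.search(payload)
            if match:
                last_password = match.group(1)
                # Back-fill the newer files that were waiting for a password.
                for file_id in pending:
                    assigned[file_id] = last_password
                pending.clear()
        else:
            assigned[msg_id] = last_password
            if last_password is None:
                pending.append(msg_id)
    return assigned
```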

---

## Download strategy

```
is_tdl_available()?
  yes → download_single_with_tdl() / download_batch_with_tdl()
          ↓ failed?
        _telethon_download()
  no  → _telethon_download() directly
```
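
The decision tree above can be written as a small wrapper. This is a sketch under stated assumptions: the three callables are injected stand-ins for `is_tdl_available()`, the tdl download helpers and `_telethon_download()`, and a tdl failure is assumed to surface as either a falsy return value or an exception. `download_with_fallback` itself is a hypothetical name.

```python
import asyncio


async def download_with_fallback(msg, dest, *, tdl_available, tdl_download,
                                 telethon_download):
    """Prefer tdl; use the Telethon path when tdl is absent or fails.

    Returns the name of the backend that completed the download.
    """
    if tdl_available():
        try:
            if await tdl_download(msg, dest):
                return "tdl"
        except Exception:
            pass  # treat any tdl error as a failed attempt and fall through
    await telethon_download(msg, dest)
    return "telethon"
```

Returning the backend name is a convenience for logging and tests; the real functions may report success differently.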