anti 741e6bb0d3 Rename to stealergram, add pyproject.toml, purge em-dashes
- Rename project to stealergram throughout
- Add pyproject.toml (replaces requirements.txt split, folds pytest.ini)
- Replace all em-dashes with hyphens across all source files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 10:06:30 -04:00

# core/scraper.py
Telethon user-client layer. Handles live listening, backfill, and the single-message download pipeline.
## Public API
```python
from core.scraper import handle_message, backfill_all, register_handlers, warm_entity_cache
```
### `handle_message(client, bot, msg, source_name, patterns, password=None)`
**async.** Full pipeline for one document message:
1. Extract filename + size, check allowlist + size guard
2. Check `utils.cache` - skip if already seen
3. Try `tdl` download → Telethon fallback
4. `core.processor.process_file()` → hits
5. `core.notifier.notify()` if hits found
6. `utils.cache.mark_seen()`

Called by: live handler, `bot_downloader`, backfill fallback path.
### `backfill_all(client, bot, patterns)`
**async.** Iterates `config.WATCHED_CHANNELS`, calls `backfill_channel()` for each.
No-op if `config.BACKFILL_LIMIT == 0`.
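
A rough sketch of that control flow, assuming channels are processed one after another (the source does not say whether they run sequentially or concurrently). `backfill_all_sketch` and its parameters are hypothetical; the real function reads `config.WATCHED_CHANNELS` and `config.BACKFILL_LIMIT` itself.

```python
import asyncio


async def backfill_all_sketch(channels, limit, backfill_channel):
    """Call backfill_channel for each channel, or do nothing when limit is 0."""
    if limit == 0:
        return 0  # backfill disabled: no channel is touched
    for channel in channels:
        await backfill_channel(channel, limit)
    return len(channels)
```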
### `register_handlers(client, bot, patterns)`
Registers a `NewMessage` Telethon event handler on `config.WATCHED_CHANNELS`.
Used in **CLI mode only** (`--no-tui`). The TUI manages its own handler via `_make_handler()` in `tui/app.py`.
### `warm_entity_cache(client)`
**async.** Iterates `client.iter_dialogs()` so Telethon caches entity mappings.
Must be called before using raw numeric channel IDs.
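
A minimal sketch of the idea, assuming the function only needs to exhaust the dialog iterator. `client.iter_dialogs()` is Telethon's async dialog iterator; the function name and the returned count are illustrative additions, not part of the module.

```python
import asyncio


async def warm_entity_cache_sketch(client) -> int:
    """Walk every dialog once so the client can cache the entities it sees."""
    count = 0
    async for _dialog in client.iter_dialogs():
        count += 1  # the dialogs are discarded; iterating is the point
    return count
```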

---
## Internal functions
| Function | Description |
|----------|-------------|
| `get_filename(msg)` | Extracts filename from `MessageMediaDocument`; falls back to `{msg_id}{ext}` from MIME |
| `get_filesize(msg)` | Returns document size in bytes |
| `is_processable(filename, size)` | Checks extension allowlist + size limit; returns `(bool, reason)` |
| `_make_dest(msg, filename)` | Resolves temp path, handles collision with `{msg_id}_{filename}` |
| `_telethon_download(client, msg, dest, ...)` | Telethon fallback with tqdm progress + flood-wait handling. Posts `EvDownload*` bus events |
| `backfill_channel(client, bot, channel, patterns, limit)` | Scans history with password carry-forward; batches via tdl |
| `_process_batch(client, bot, batch, patterns)` | One tdl invocation for up to `TDL_AMOUNT` messages; per-file Telethon fallback |
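
Two of the helpers above are simple enough to sketch. This is an illustration under assumptions, not the module's code: the `ALLOWED_EXTENSIONS` and `MAX_FILE_SIZE` values are placeholders (the real ones live in the project's config), `make_dest_sketch` takes an explicit `tmp_dir` and `msg_id` rather than a Telethon message, and stripping directory components from the filename is an added precaution the source does not mention.

```python
from __future__ import annotations

from pathlib import Path

# Placeholder values for illustration only.
ALLOWED_EXTENSIONS = {".zip", ".rar", ".7z", ".txt"}
MAX_FILE_SIZE = 2 * 1024**3  # bytes


def is_processable(filename: str, size: int) -> tuple[bool, str]:
    """Return (ok, reason) after checking the extension allowlist and size limit."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        return False, f"extension {ext or '(none)'} not in allowlist"
    if size > MAX_FILE_SIZE:
        return False, f"size {size} exceeds limit {MAX_FILE_SIZE}"
    return True, "ok"


def make_dest_sketch(tmp_dir: Path, msg_id: int, filename: str) -> Path:
    """Pick a temp path; on a name collision, prefix the message ID."""
    safe_name = Path(filename).name  # drop any directory components
    dest = tmp_dir / safe_name
    if dest.exists():
        dest = tmp_dir / f"{msg_id}_{safe_name}"
    return dest
```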

---
## Password carry-forward (backfill)
Channels often post the archive password as a separate text message.
`backfill_channel` iterates newest→oldest, carrying `last_password` so both older and newer file messages in the same scan pick it up.

---
## Download strategy
```
is_tdl_available()?
  yes → download_single_with_tdl() / download_batch_with_tdl()
          ↓ failed?
        _telethon_download()
  no  → _telethon_download() directly
```
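
The decision tree above can be sketched in Python as follows. This is a hedged illustration: `download_with_fallback` and its injected callables are hypothetical, and because the source does not say how a tdl failure is reported, the sketch treats both a falsy return value and a raised exception as failure.

```python
import asyncio


async def download_with_fallback(msg, dest, *, tdl_available, tdl_download,
                                 telethon_download):
    """Try tdl first when it is available; otherwise, or on failure, use Telethon."""
    if tdl_available():
        try:
            if await tdl_download(msg, dest):
                return "tdl"
        except Exception:
            # Assumption: any tdl error is recoverable via the fallback path.
            pass
    await telethon_download(msg, dest)
    return "telethon"
```

Returning the name of the path taken is only there to make the behaviour easy to check; the real helpers post `EvDownload*` bus events instead.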