core/scraper.py
Telethon user-client layer. Handles live listening, backfill, and the single-message download pipeline.
Public API
from core.scraper import handle_message, backfill_all, register_handlers, warm_entity_cache
handle_message(client, bot, msg, source_name, patterns, password=None)
async. Full pipeline for one document message:
- Extract filename + size, check allowlist + size guard
- Check utils.cache - skip if already seen
- Try tdl download → Telethon fallback
- core.processor.process_file() → hits
- core.notifier.notify() if hits found
- utils.cache.mark_seen()
Called by: live handler, bot_downloader, backfill fallback path.
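The order of those steps can be sketched as follows. This is only an illustration of the control flow, not the module's actual code: handle_message_sketch, the dict-shaped msg, and every injected callable are hypothetical stand-ins, and keying the seen-cache on the message id is an assumption.

```python
import asyncio


async def handle_message_sketch(msg, *, is_processable, seen, download, process, notify):
    """Hypothetical stand-in for the pipeline; all collaborators are injected."""
    filename, size = msg["filename"], msg["size"]

    # 1. Allowlist + size guard.
    ok, reason = is_processable(filename, size)
    if not ok:
        return f"skipped: {reason}"

    # 2. Cache check (assumption: the cache is keyed on the message id).
    if msg["id"] in seen:
        return "skipped: already seen"

    # 3. Download (tdl with Telethon fallback in the real module).
    path = await download(msg)

    # 4. Scan the file, 5. notify only when there are hits.
    hits = process(path)
    if hits:
        await notify(hits)

    # 6. Mark as seen so a later call skips this message.
    seen.add(msg["id"])
    return f"processed: {len(hits)} hit(s)"
```

Injecting the collaborators keeps the sketch runnable without Telegram credentials; the real function takes client, bot, msg, source_name, patterns, and password as listed above.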
backfill_all(client, bot, patterns)
async. Iterates config.WATCHED_CHANNELS, calls backfill_channel() for each.
No-op if config.BACKFILL_LIMIT == 0.
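A minimal sketch of that behaviour, assuming the limit and channel list are passed in rather than read from config. The name backfill_all_sketch, its parameters, and its return value are hypothetical.

```python
import asyncio


async def backfill_all_sketch(channels, limit, backfill_channel):
    """Run backfill_channel for each channel; do nothing when limit is 0."""
    if limit == 0:
        return 0  # mirrors the BACKFILL_LIMIT == 0 no-op

    for channel in channels:
        # Channels are handled one after another, not concurrently (assumption).
        await backfill_channel(channel, limit)
    return len(channels)
```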
register_handlers(client, bot, patterns)
Registers a NewMessage Telethon event handler on config.WATCHED_CHANNELS.
Used in CLI mode only (--no-tui). The TUI manages its own handler via _make_handler() in tui/app.py.
warm_entity_cache(client)
async. Iterates client.iter_dialogs() so Telethon caches entity mappings.
Must be called before using raw numeric channel IDs.
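The warm-up amounts to draining the dialog iterator. The sketch below uses a FakeClient (hypothetical) so it runs without a Telegram session; with Telethon, client.iter_dialogs() is consumed with async for in the same way. Returning a count is an assumption made for illustration.

```python
import asyncio


class FakeClient:
    """Hypothetical stand-in exposing an async iter_dialogs() like Telethon's client."""

    def __init__(self, dialogs):
        self._dialogs = dialogs

    async def iter_dialogs(self):
        for dialog in self._dialogs:
            yield dialog


async def warm_entity_cache_sketch(client):
    """Iterate every dialog once; the iteration itself is what fills the cache."""
    count = 0
    async for _dialog in client.iter_dialogs():
        count += 1
    return count
```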
Internal functions
| Function | Description |
|---|---|
| get_filename(msg) | Extracts filename from MessageMediaDocument; falls back to {msg_id}{ext} from MIME |
| get_filesize(msg) | Returns document size in bytes |
| is_processable(filename, size) | Checks extension allowlist + size limit; returns (bool, reason) |
| _make_dest(msg, filename) | Resolves temp path, handles collision with {msg_id}_{filename} |
| _telethon_download(client, msg, dest, ...) | Telethon fallback with tqdm progress + flood-wait handling. Posts EvDownload* bus events |
| backfill_channel(client, bot, channel, patterns, limit) | Scans history with password carry-forward; batches via tdl |
| _process_batch(client, bot, batch, patterns) | One tdl invocation for up to TDL_AMOUNT messages; per-file Telethon fallback |
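Two of these helpers are small enough to sketch. The versions below are illustrations only: the allowlist contents, the size limit, the reason strings, and the function signatures are all hypothetical, chosen to show the (bool, reason) return shape and the {msg_id}_{filename} collision rule.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".zip", ".rar", ".7z", ".txt"}  # hypothetical allowlist
MAX_FILE_SIZE = 2 * 1024**3                            # hypothetical limit, in bytes


def is_processable_sketch(filename, size):
    """Return (ok, reason); the reason explains why a file was rejected."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        return False, f"extension {ext or '(none)'} not in allowlist"
    if size > MAX_FILE_SIZE:
        return False, f"size {size} exceeds limit {MAX_FILE_SIZE}"
    return True, "ok"


def make_dest_sketch(tmp_dir, msg_id, filename):
    """Pick a temp path; prefix the message id when the plain name is taken."""
    dest = Path(tmp_dir) / filename
    if dest.exists():
        dest = Path(tmp_dir) / f"{msg_id}_{filename}"
    return dest
```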
Password carry-forward (backfill)
Channels often post the archive password as a separate text message.
backfill_channel iterates newest→oldest, carrying last_password so both older and newer file messages in the same scan pick it up.
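One way such a carry-forward could work is sketched below. It is a guess at the mechanism, not the module's code: the PASSWORD_RE pattern, the dict-shaped messages, and the pending list (which back-fills newer files scanned before any password was found) are all assumptions.

```python
import re

# Hypothetical pattern; the real module may detect passwords differently.
PASSWORD_RE = re.compile(r"\bpass(?:word)?\s*[:=]\s*(\S+)", re.IGNORECASE)


def assign_passwords(messages):
    """messages are ordered newest -> oldest; returns [(file_msg_id, password)]."""
    last_password = None
    pending = []  # indexes into `out` for files seen before any password
    out = []
    for msg in messages:
        match = PASSWORD_RE.search(msg.get("text") or "")
        if match:
            last_password = match.group(1)
            # Newer files scanned earlier had no password yet: fill them in now.
            for index in pending:
                out[index] = (out[index][0], last_password)
            pending.clear()
        if msg.get("file"):
            out.append((msg["id"], last_password))
            if last_password is None:
                pending.append(len(out) - 1)
    return out
```

Files that never see a password in the scan keep None, which matches the password=None default on handle_message.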
Download strategy
is_tdl_available()?
  yes → download_single_with_tdl() / download_batch_with_tdl()
          ↓ failed?
        _telethon_download()
  no  → _telethon_download() directly
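The decision flow above can be sketched as a small wrapper. The function name and the injected callables are hypothetical, and how tdl reports failure is not stated in this document; the sketch assumes a falsy return value or a raised exception.

```python
import asyncio


async def download_with_fallback(msg, dest, *, tdl_available, tdl_download, telethon_download):
    """Try tdl first when it is installed; otherwise, or on failure, use Telethon."""
    if tdl_available():
        try:
            if await tdl_download(msg, dest):
                return "tdl"
        except Exception:
            # Assumption: any tdl error counts as a failed attempt.
            pass
    await telethon_download(msg, dest)
    return "telethon"
```

Returning the name of the path taken is only for illustration; it makes the fallback easy to observe in a test or log line.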