stealergram/core/scraper.md
anti 48f486ac97 Initial commit: ULPgrammer
- Core Telegram monitoring pipeline (scraper, processor, notifier, downloaders)
- Textual TUI frontend with thread-safe event bus
- SQLite persistence, severity scoring, dedup cache
- Fixed ULP parser: handles https:// truncation, port+path URLs, semicolon separator
- Test suite: 88 tests across scorer, cache, database, processor
2026-04-02 01:58:49 -03:00


core/scraper.py

Telethon user-client layer. Handles live listening, backfill, and the single-message download pipeline.

Public API

from core.scraper import handle_message, backfill_all, register_handlers, warm_entity_cache

handle_message(client, bot, msg, source_name, patterns, password=None)

async. Full pipeline for one document message:

  1. Extract filename + size, check allowlist + size guard
  2. Check utils.cache — skip if already seen
  3. Try tdl download → Telethon fallback
  4. core.processor.process_file() → hits
  5. core.notifier.notify() if hits found
  6. utils.cache.mark_seen()

Called by: live handler, bot_downloader, backfill fallback path.
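
The order of the six steps can be sketched as below. This is a minimal illustration, not the project's code: every callable is injected and hypothetical, and the choice to leave a message unmarked when the download fails is an assumption rather than documented behaviour.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def handle_message_sketch(
    filename: str,
    size: int,
    key: str,
    *,
    is_processable: Callable[[str, int], tuple[bool, str]],
    is_seen: Callable[[str], bool],
    download: Callable[[], Awaitable[Optional[str]]],
    process_file: Callable[[str], list],
    notify: Callable[[list], Awaitable[None]],
    mark_seen: Callable[[str], None],
) -> str:
    """Run the pipeline steps in order and return a short status string."""
    ok, reason = is_processable(filename, size)  # step 1: allowlist + size guard
    if not ok:
        return f"skipped: {reason}"
    if is_seen(key):                             # step 2: dedup cache
        return "skipped: already seen"
    path = await download()                      # step 3: tdl, then Telethon
    if path is None:
        return "failed: download"                # assumption: not marked seen
    hits = process_file(path)                    # step 4: extract hits
    if hits:
        await notify(hits)                       # step 5: only when hits exist
    mark_seen(key)                               # step 6: record in the cache
    return f"done: {len(hits)} hits"
```

Passing the collaborators in as arguments keeps the sketch testable without a Telegram session.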

backfill_all(client, bot, patterns)

async. Iterates config.WATCHED_CHANNELS, calls backfill_channel() for each.
No-op if config.BACKFILL_LIMIT == 0.
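
A rough sketch of the loop and the zero-limit guard, assuming the channel list and limit are passed in rather than read from config. The parameter names and the return value are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def backfill_all_sketch(
    channels: Sequence[str],
    limit: int,
    backfill_channel: Callable[[str, int], Awaitable[None]],
) -> int:
    """Backfill each channel in turn; return how many channels were scanned."""
    if limit == 0:
        # Mirrors the documented no-op when the backfill limit is 0.
        return 0
    for channel in channels:
        await backfill_channel(channel, limit)
    return len(channels)
```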

register_handlers(client, bot, patterns)

Registers a NewMessage Telethon event handler on config.WATCHED_CHANNELS.
Used in CLI mode only (--no-tui). The TUI manages its own handler via _make_handler() in tui/app.py.

warm_entity_cache(client)

async. Iterates client.iter_dialogs() so Telethon caches entity mappings.
Call it before resolving channels by raw numeric ID; without a cached entity, Telethon cannot map a bare ID to a channel.
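
The warm-up amounts to draining the dialog iterator once. The sketch below assumes only that `client` exposes `iter_dialogs()` as an async iterator, as Telethon's client does. The dialog count it returns is an addition for illustration.

```python
import asyncio


async def warm_entity_cache_sketch(client) -> int:
    """Walk every dialog once so the client can cache entities; return the count."""
    count = 0
    async for _dialog in client.iter_dialogs():
        # Nothing to do per dialog: iterating is what fills the cache.
        count += 1
    return count
```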


Internal functions

| Function | Description |
| --- | --- |
| get_filename(msg) | Extracts filename from MessageMediaDocument; falls back to {msg_id}{ext} from MIME |
| get_filesize(msg) | Returns document size in bytes |
| is_processable(filename, size) | Checks extension allowlist + size limit; returns (bool, reason) |
| _make_dest(msg, filename) | Resolves temp path, handles collision with {msg_id}_{filename} |
| _telethon_download(client, msg, dest, ...) | Telethon fallback with tqdm progress + flood-wait handling. Posts EvDownload* bus events |
| backfill_channel(client, bot, channel, patterns, limit) | Scans history with password carry-forward; batches via tdl |
| _process_batch(client, bot, batch, patterns) | One tdl invocation for up to TDL_AMOUNT messages; per-file Telethon fallback |
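
Two of these helpers are simple enough to sketch. The version below is illustrative only: the allowlist, the size limit, and the `tmp_dir` / `msg_id` parameters are hypothetical stand-ins for whatever the project config and the message object actually provide.

```python
from pathlib import Path

# Hypothetical values; the real allowlist and limit live in the project config.
ALLOWED_EXTENSIONS = {".txt", ".zip", ".rar", ".7z"}
MAX_FILE_SIZE = 2 * 1024**3  # 2 GiB


def is_processable_sketch(filename: str, size: int) -> tuple[bool, str]:
    """Return (ok, reason) after checking the extension and the size."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        return False, f"extension {ext or '(none)'} not allowed"
    if size > MAX_FILE_SIZE:
        return False, f"size {size} exceeds limit {MAX_FILE_SIZE}"
    return True, "ok"


def make_dest_sketch(tmp_dir: Path, msg_id: int, filename: str) -> Path:
    """Pick a destination path, prefixing the message id on a name collision."""
    dest = tmp_dir / filename
    if dest.exists():
        dest = tmp_dir / f"{msg_id}_{filename}"
    return dest
```

Returning a reason string alongside the boolean lets the caller log why a file was skipped without repeating the checks.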

Password carry-forward (backfill)

Channels often post the archive password as a separate text message.
backfill_channel iterates newest→oldest and carries the most recently seen password in last_password, so a file message in the same scan can pick it up whether it was posted before or after the password message.
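
One way such a carry-forward could work is sketched below. It is an assumption, not the project's implementation: the `Msg` type, the regex, and the `pending` buffer are all hypothetical. Files seen before any password are held back and filled in when a password message turns up later in the scan.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Optional

# Hypothetical pattern; real channels format passwords in many ways.
PASSWORD_RE = re.compile(r"pass(?:word)?\s*[:=]\s*(\S+)", re.IGNORECASE)


@dataclass
class Msg:
    id: int
    text: str = ""
    filename: Optional[str] = None  # set only for document messages


def assign_passwords(messages: Iterable[Msg]) -> dict[int, Optional[str]]:
    """Map each file message id to a password; expects newest→oldest order."""
    last_password: Optional[str] = None
    pending: list[int] = []  # file ids seen before any password was known
    result: dict[int, Optional[str]] = {}
    for msg in messages:
        if msg.filename is None:
            match = PASSWORD_RE.search(msg.text)
            if match:
                last_password = match.group(1)
                # Fill in newer files that were scanned before this message.
                for msg_id in pending:
                    result[msg_id] = last_password
                pending.clear()
            continue
        # Older files inherit whatever password has been carried so far.
        result[msg.id] = last_password
        if last_password is None:
            pending.append(msg.id)
    return result
```

With a scan of a newer file, a password message, and an older file, both files end up with the same password. A file with no password message anywhere in the scan maps to `None`.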


Download strategy

is_tdl_available()?
  yes → download_single_with_tdl() / download_batch_with_tdl()
          ↓ failed?
        _telethon_download()
  no  → _telethon_download() directly
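
The decision tree above can be sketched as a small async helper. The downloaders are injected and hypothetical, and treating both an exception and a `None` result as "failed" is an assumption about how failure is signalled.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# A downloader returns the saved path, or None when it could not fetch the file.
Downloader = Callable[[], Awaitable[Optional[str]]]


async def download_with_fallback(
    tdl_available: bool,
    tdl_download: Downloader,
    telethon_download: Downloader,
) -> Optional[str]:
    """Prefer tdl when available; fall back to Telethon on failure."""
    if tdl_available:
        try:
            path = await tdl_download()
        except Exception:
            path = None  # assumption: any tdl error triggers the fallback
        if path is not None:
            return path
    return await telethon_download()
```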