Initial commit: ULPgrammer

- Core Telegram monitoring pipeline (scraper, processor, notifier, downloaders)
- Textual TUI frontend with thread-safe event bus
- SQLite persistence, severity scoring, dedup cache
- Fixed ULP parser: handles https:// truncation, port+path URLs, semicolon separator
- Test suite: 88 tests across scorer, cache, database, processor
2026-04-02 01:58:49 -03:00
commit 48f486ac97
41 changed files with 5270 additions and 0 deletions

QUICK_REF.md Normal file

@@ -0,0 +1,182 @@
# ULP Monitor — Quick Reference
> For Claude Code: read the per-file `.md` alongside each `.py` before editing.
> Full docs in `README.md`.
---
## Project layout
```
ulp_monitor/
├── main.py Entry point (--no-tui flag for CLI mode)
├── config.py All settings — edit this for keywords, channels, paths
├── core/ Telegram I/O pipeline (all async, Telethon-dependent)
│ ├── scraper.py Live listener + backfill orchestration
│ ├── tdl_downloader.py tdl subprocess wrapper + Telethon fallback
│ ├── bot_downloader.py Inline "DOWNLOAD" button click flow
│ ├── processor.py Archive extraction (.zip/.7z/.rar) + line search
│ └── notifier.py Scoring → dedup → DB → hits.txt/csv → Telegram alert
├── utils/ Pure logic, no Telegram deps, no async
│ ├── scorer.py Severity scoring (CRITICAL/HIGH/MEDIUM/LOW)
│ ├── cache.py Seen file-ID dedup (data/cache.json)
│ └── database.py SQLite read/write (data/hits.db)
├── tui/ Textual TUI — runs in main thread
│ ├── app.py MonitorApp + all screens + bot thread launcher
│ └── events.py Thread-safe queue.Queue event bus
└── data/ Runtime output — gitignored
    ├── hits.db
    ├── hits.txt
    ├── hits.csv
    ├── cache.json
    ├── dedup.json
    └── logs/monitor.log
```
---
## Data flow
```
Telegram channel
  └─ new message with file / download button
       ├─ core/scraper.py          detects + guards (size, extension, dedup)
       ├─ core/tdl_downloader.py   downloads via tdl (batched)
       │    └─ core/scraper.py     Telethon fallback if tdl fails
       ├─ core/bot_downloader.py   handles inline button → bot reply flow
       ├─ core/processor.py        extracts archive → searches .txt line by line
       └─ core/notifier.py         scores → deduplicates → persists → alerts
            ├─ utils/scorer.py
            ├─ utils/database.py
            └─ tui/events.py       posts EvHit to TUI
```
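
The guard step in the flow above (size, extension, dedup) can be pictured roughly as follows. This is a minimal sketch, not the code in `core/scraper.py`: the function name `should_download` and its signature are hypothetical, and the constants simply mirror the values listed in the config table.

```python
from pathlib import Path

# Values copied from the config table; treating "4 GB" as binary gigabytes is an assumption.
ALLOWED_EXTENSIONS = {".txt", ".zip", ".7z", ".rar"}
MAX_FILE_SIZE = 4 * 1024**3


def should_download(file_id: str, file_name: str, size: int, seen_ids: set) -> bool:
    """Hypothetical guard: True only if the file passes all three checks."""
    if size > MAX_FILE_SIZE:
        return False  # too large to fetch
    if Path(file_name).suffix.lower() not in ALLOWED_EXTENSIONS:
        return False  # not a text dump or a supported archive
    if file_id in seen_ids:
        return False  # already handled in an earlier run
    return True
```

Keeping the guard free of side effects leaves it to the caller to decide when a file ID is recorded as seen, for example only after processing succeeds.
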
---
## Threading architecture
```
main thread (Textual's event loop)
  ├─ MonitorApp.on_mount()
  │    ├─ bus.init_bus()                     creates the queue.Queue (thread-safe, not tied to any event loop)
  │    ├─ threading.Thread → _run_bot_thread()
  │    └─ set_interval(0.1, _drain_bus)
  ├─ _drain_bus()                            [every 100ms]
  │    └─ queue.Queue.get_nowait() → dispatch to widgets
  └─ Textual widgets, screens, keybindings
bot thread (own asyncio event loop)
  └─ _bot_main()
       ├─ bot_client.connect() + sign_in()
       ├─ user_client.connect() + is_user_authorized()
       ├─ warm_entity_cache()
       ├─ _make_handler() → NewMessage handler registered
       ├─ backfill_all()
       └─ run_until_disconnected() + _watch_channels()   [gathered]
cross-thread communication
  bot → TUI: bus.post(event)                 [queue.Queue.put_nowait, always safe]
  TUI → bot: loop.call_soon_threadsafe()     [asyncio.Event.set for channel changes]
```
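
The bot → TUI half of this design is a plain producer/consumer queue. The sketch below is an illustration under assumptions, not the contents of `tui/events.py`: `EvHit` is named in the data flow, but its fields here are hypothetical, as are the `post` and `drain` helpers.

```python
import queue
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvHit:
    # Field names are illustrative; the real event may carry different data.
    severity: str
    line: str


# Unbounded queue, so put_nowait() cannot raise queue.Full.
_bus: "queue.Queue[EvHit]" = queue.Queue()


def post(event: EvHit) -> None:
    """Called from the bot thread; returns immediately."""
    _bus.put_nowait(event)


def drain(handle: Callable[[EvHit], None]) -> int:
    """Called on a timer from the UI thread; empties the queue without blocking."""
    handled = 0
    while True:
        try:
            event = _bus.get_nowait()
        except queue.Empty:
            return handled
        handle(event)
        handled += 1
```

In the TUI, something like `set_interval(0.1, lambda: drain(dispatch))` would play the role of `_drain_bus`, where `dispatch` is whatever routes events to widgets.
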
---
## Config quick reference (`config.py`)
| Setting | Type | Description |
|---------|------|-------------|
| `API_ID` | int | From my.telegram.org |
| `API_HASH` | str | From my.telegram.org |
| `BOT_TOKEN` | str | From @BotFather |
| `NOTIFY_CHAT_ID` | int | Your Telegram user/group ID |
| `SESSION_NAME` | str | Session file name (default: `monitor_session`) |
| `TARGET_KEYWORDS` | list[str] | Regex patterns. `@`-prefixed → employee email (CRITICAL). Plain → domain match (LOW) |
| `WATCHED_CHANNELS` | list[str\|int] | Usernames or `-100xxxxxxxxxx` IDs |
| `BACKFILL_LIMIT` | int | Messages to scan per channel on startup (0 = off) |
| `ALLOWED_EXTENSIONS` | set | `.txt .zip .7z .rar` |
| `MAX_FILE_SIZE` | int | Bytes (default 4 GB) |
| `ARCHIVE_PASSWORDS` | list[bytes] | Tried in order on locked archives |
| `TDL_NAMESPACE` | str\|None | `tdl login -n <name>` namespace |
| `TDL_THREADS` | int | Chunk workers per file (`-t`) |
| `TDL_PERFILE` | int | Concurrent files per tdl call (`-l`) |
| `TDL_AMOUNT` | int | Messages per batch |
| `TEMP_DIR` | Path | `data/tmp` |
| `HITS_FILE` | Path | `data/hits.txt` |
| `LOG_FILE` | Path | `data/logs/monitor.log` |
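
For orientation, a `config.py` matching the table might look like the sketch below. Every value is a placeholder or an assumption: the credentials are fake, the channel names and archive password are made up, and the `TDL_*` and `BACKFILL_LIMIT` numbers are arbitrary examples rather than the project's defaults.

```python
from pathlib import Path

# Telegram credentials (placeholders, not real values)
API_ID = 1234567
API_HASH = "replace-with-your-api-hash"
BOT_TOKEN = "replace-with-your-bot-token"
NOTIFY_CHAT_ID = 123456789
SESSION_NAME = "monitor_session"

# "@"-prefixed pattern matches employee emails, plain pattern matches the domain
TARGET_KEYWORDS = [r"@myorg\.cl", r"myorg\.cl"]
WATCHED_CHANNELS = ["example_channel", -1001234567890]  # hypothetical channels
BACKFILL_LIMIT = 0  # 0 = no backfill on startup

ALLOWED_EXTENSIONS = {".txt", ".zip", ".7z", ".rar"}
MAX_FILE_SIZE = 4 * 1024**3  # bytes; assumes "4 GB" means binary gigabytes
ARCHIVE_PASSWORDS = [b"example-password"]  # bytes, tried in order

TDL_NAMESPACE = None  # or the name passed to `tdl login -n <name>`
TDL_THREADS = 4       # example value
TDL_PERFILE = 2       # example value
TDL_AMOUNT = 10       # example value

TEMP_DIR = Path("data/tmp")
HITS_FILE = Path("data/hits.txt")
LOG_FILE = Path("data/logs/monitor.log")
```
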
---
## Severity scoring summary
| Severity | Score | Triggers |
|----------|-------|----------|
| CRITICAL | 40 | Employee email (`@myorg.cl` in username) · Privileged service URL (admin, vpn, rdp, gitlab…) |
| HIGH | 30 | Internal service URL (intranet, erp, sso, owa…) |
| MEDIUM | 20 | Client-facing URL (app, booking, helpdesk…) |
| LOW | 10 | Org domain appears anywhere in line |
`@`-keyword rule: the pattern requires a literal `@` immediately before the domain, so a line whose username is `user@gmail.com` and whose URL contains `myorg.cl` does **not** trigger CRITICAL.
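
The table and the `@` rule can be read as a first-match cascade. The sketch below is one possible reading, not the logic in `utils/scorer.py`. Assumptions: the service lists contain only the names spelled out in the table (the real lists are longer), service names are matched against hostname labels, and the service tiers only apply to hosts under the org domain.

```python
import re
from typing import Optional, Tuple
from urllib.parse import urlsplit

ORG_DOMAIN = "myorg.cl"
EMPLOYEE_EMAIL = re.compile("@" + re.escape(ORG_DOMAIN), re.IGNORECASE)
PRIVILEGED = {"admin", "vpn", "rdp", "gitlab"}
INTERNAL = {"intranet", "erp", "sso", "owa"}
CLIENT = {"app", "booking", "helpdesk"}


def _host(url: str) -> str:
    """Lower-cased hostname, tolerating URLs that lack a scheme."""
    if "//" not in url:
        url = "//" + url
    try:
        return (urlsplit(url).hostname or "").lower()
    except ValueError:
        return ""


def score(url: str, username: str) -> Optional[Tuple[str, int]]:
    """Return (severity, score) for one credential line, or None if nothing matches."""
    host = _host(url)
    labels = set(re.split(r"[.\-]", host))
    on_org = host == ORG_DOMAIN or host.endswith("." + ORG_DOMAIN)
    if EMPLOYEE_EMAIL.search(username) or (on_org and labels & PRIVILEGED):
        return "CRITICAL", 40
    if on_org and labels & INTERNAL:
        return "HIGH", 30
    if on_org and labels & CLIENT:
        return "MEDIUM", 20
    if ORG_DOMAIN in f"{url} {username}".lower():
        return "LOW", 10
    return None
```

With this reading, `score("https://app.myorg.cl/login", "user@gmail.com")` lands on MEDIUM rather than CRITICAL, because the username has no `@myorg.cl`.
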
---
## TUI keybindings
| Key | Action | Screen |
|-----|--------|--------|
| `s` | Search hits DB | → SearchScreen |
| `h` | Browse hits by severity | → HitsDBScreen |
| `k` | Edit keyword patterns live | → KeywordsScreen |
| `c` | Clear download + hits logs | main |
| `r` | Force-refresh stats bar | main |
| `q` / `ctrl+c` | Quit | any |
| `Escape` | Back to main | sub-screens |
| `1`/`2`/`3`/`4` | Filter CRITICAL/HIGH/MEDIUM/LOW | HitsDBScreen |
| `r` | Load recent 50 | HitsDBScreen |
---
## Per-file reference docs
| File | Reference |
|------|-----------|
| `utils/scorer.py` | `utils/scorer.md` |
| `utils/cache.py` | `utils/cache.md` |
| `utils/database.py` | `utils/database.md` |
| `core/scraper.py` | `core/scraper.md` |
| `core/processor.py` | `core/processor.md` |
| `core/notifier.py` | `core/notifier.md` |
| `core/tdl_downloader.py` | `core/tdl_downloader.md` |
| `core/bot_downloader.py` | `core/bot_downloader.md` |
| `tui/app.py` | `tui/app.md` |
| `tui/events.py` | `tui/events.md` |
---
## Common tasks
**Add a new keyword at runtime:** open the TUI → press `k` → add pattern → active immediately. Copy to `config.TARGET_KEYWORDS` to persist.
**Add a channel at runtime:** type username or numeric ID in the Channels panel → Add. Handler re-registers immediately. Edit `config.WATCHED_CHANNELS` to persist.
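
A rough picture of how a channel change in the TUI could wake the bot thread, following the TUI → bot path in the threading diagram. This is a sketch under assumptions: the `BotThread` class and its attribute names are hypothetical, and incrementing `reloads` stands in for re-registering the handler.

```python
import asyncio
import threading


class BotThread:
    """Hypothetical wrapper around a worker thread that owns its own asyncio loop."""

    def __init__(self) -> None:
        self.loop = asyncio.new_event_loop()
        self.channels_changed = None        # asyncio.Event, created on the bot loop
        self.ready = threading.Event()      # set once the asyncio.Event exists
        self.reloads = 0
        self.thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        asyncio.set_event_loop(self.loop)
        try:
            self.loop.run_until_complete(self._watch_channels())
        finally:
            self.loop.close()

    async def _watch_channels(self) -> None:
        self.channels_changed = asyncio.Event()
        self.ready.set()
        await self.channels_changed.wait()
        self.reloads += 1                   # stand-in for re-registering the handler

    def notify_channels_changed(self) -> None:
        """Safe to call from the TUI thread: schedules Event.set on the bot loop."""
        self.loop.call_soon_threadsafe(self.channels_changed.set)


def demo() -> int:
    bot = BotThread()
    bot.thread.start()
    bot.ready.wait(timeout=5)
    bot.notify_channels_changed()
    bot.thread.join(timeout=5)
    return bot.reloads
```

Calling `asyncio.Event.set` directly from another thread is not thread-safe, which is why the call is routed through `call_soon_threadsafe`.
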
**Query hits from CLI:**
```bash
sqlite3 data/hits.db "SELECT severity, username, url FROM hits WHERE seen_before=0 ORDER BY score DESC LIMIT 20"
```
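
The same query from Python, as a sketch using the standard `sqlite3` module. The column names come from the query above; the helper name `top_new_hits` is hypothetical, and nothing else about the schema is assumed.

```python
import sqlite3


def top_new_hits(conn: sqlite3.Connection, limit: int = 20) -> list:
    """Return (severity, username, url) rows for unseen hits, highest score first."""
    query = (
        "SELECT severity, username, url FROM hits "
        "WHERE seen_before = 0 ORDER BY score DESC LIMIT ?"
    )
    return conn.execute(query, (limit,)).fetchall()
```

Typical use would be `top_new_hits(sqlite3.connect("data/hits.db"))`, run from the project root.
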
**Re-process all files** (wipe cache):
```bash
rm data/cache.json data/dedup.json
```
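
Deleting the cache files forces a re-process if a missing file is treated as an empty cache. The sketch below illustrates that behaviour for a seen-ID cache. It is an assumption about how `utils/cache.py` might work: the `SeenCache` class and the JSON layout (a flat list of IDs) are hypothetical.

```python
import json
from pathlib import Path


class SeenCache:
    """Hypothetical seen file-ID cache persisted as a JSON list."""

    def __init__(self, path) -> None:
        self.path = Path(path)
        try:
            self.ids = set(json.loads(self.path.read_text(encoding="utf-8")))
        except FileNotFoundError:
            self.ids = set()  # a deleted cache file means nothing has been seen

    def add(self, file_id: str) -> bool:
        """Record file_id; return False if it was already seen."""
        if file_id in self.ids:
            return False
        self.ids.add(file_id)
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(sorted(self.ids)), encoding="utf-8")
        return True
```
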
**Check what's happening:** `tail -f data/logs/monitor.log`