diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..022ff7b
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,130 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Development workflow
+
+After every code change:
+1. Run `pytest`. Every test must pass.
+2. If all tests pass: present the change to the user, then commit.
+3. If any test fails: fix the bug and re-run before showing anything to the user.
+
+Never present code or commit while tests are failing.
+
+## Running tests
+
+```bash
+pip install -r requirements-dev.txt
+pytest                          # all tests
+pytest -v                       # verbose
+pytest tests/test_scorer.py     # single file
+```
+
+Tests cover `utils/scorer`, `utils/cache`, `utils/database`, and `core/processor`. They are fully isolated: no `.env` is required, and no real DB or cache files are touched. The `patched_keywords` fixture in `conftest.py` replaces `TARGET_KEYWORDS` with known test patterns; it must patch both `config.TARGET_KEYWORDS` and `scorer.TARGET_KEYWORDS`, because `from config import` creates a separate local binding in `scorer` that patching `config` alone does not change.
+
+## Running the monitor
+
+```bash
+source .venv/bin/activate    # activate the Python environment, if .venv exists
+python main.py               # TUI mode (default)
+python main.py --no-tui      # Plain CLI, logs to stdout and data/logs/monitor.log
+```
+
+The first run interactively prompts for the Telegram phone number and 2FA to create a session file.
+
+## Setup prerequisites
+
+```bash
+pip install -r requirements.txt
+# rarfile requires the unrar binary: sudo apt install unrar (Linux) or brew install rar (macOS)
+
+# tdl (strongly recommended for fast downloads):
+curl -sSL https://raw.githubusercontent.com/iyear/tdl/main/scripts/install.sh | bash
+tdl login -n monitor_session
+```
+
+If no `.env` file exists, ask the user to create it manually. Do not create it yourself, because it contains personal information.
+
+## Architecture
+
+### Data flow
+
+```
+Telegram channel message with file attachment
+  └─ core/scraper.py          detects attachment, guards (size/extension/dedup)
+  └─ core/tdl_downloader.py   downloads via tdl subprocess (batched)
+  └─ core/scraper.py          Telethon fallback if tdl fails
+  └─ core/bot_downloader.py   handles inline "DOWNLOAD" button → bot reply flow
+  └─ core/processor.py        extracts .zip/.7z/.rar, searches .txt line by line
+  └─ core/notifier.py         scores → deduplicates → writes DB/txt/csv → Telegram alert
+       ├─ utils/scorer.py
+       ├─ utils/database.py
+       └─ tui/events.py       posts EvHit to TUI event bus
+```
+
+### Threading model
+
+The TUI and the Telegram bot run in separate threads, each with its own event loop:
+
+- **Main thread**: Textual's event loop. Runs `MonitorApp` and drains the event bus every 100 ms via `_drain_bus()`.
+- **Bot thread**: its own `asyncio` event loop. Runs `_bot_main()` with both `user_client` and `bot_client`.
+- **Cross-thread communication**: bot → TUI via `bus.post()` (`queue.Queue.put_nowait`, safe to call from any thread); TUI → bot via `loop.call_soon_threadsafe()` (e.g., to signal channel list changes).
+
+### Module responsibilities
+
+| Module | Role |
+|--------|------|
+| `config.py` | All settings: edit keywords, channels, paths, and tdl tuning here |
+| `core/scraper.py` | Live listener + backfill orchestration; registers Telethon `NewMessage` handlers |
+| `core/tdl_downloader.py` | Wraps `tdl` subprocess for fast downloads; falls back to Telethon |
+| `core/bot_downloader.py` | Handles inline button click flow where files come via bot reply |
+| `core/processor.py` | Archive extraction (supports nested archives one level deep) + line-by-line search |
+| `core/notifier.py` | Scoring → dedup → DB insert → hits.txt/csv write → Telegram bot alert |
+| `utils/scorer.py` | Severity scoring; parses ULP lines (`url:user:pass`), classifies CRITICAL/HIGH/MEDIUM/LOW |
+| `utils/cache.py` | Seen file-ID dedup stored in `data/cache.json` |
+| `utils/database.py` | SQLite read/write for `data/hits.db` |
+| `tui/app.py` | `MonitorApp` + all screens (Search, HitsDB, Keywords) |
+| `tui/events.py` | Thread-safe `queue.Queue` event bus |
+
+### Severity scoring
+
+Keywords in `config.TARGET_KEYWORDS` that contain `@` (e.g. `r"@myorg\.cl"`) are **employee email domains** → CRITICAL on match. Keywords without `@` are plain domain matches → LOW baseline.
+
+| Severity | Score | Triggers |
+|----------|-------|----------|
+| CRITICAL | 40 | Employee email in username · Privileged service URL (admin, vpn, rdp, gitlab…) |
+| HIGH | 30 | Internal service URL (intranet, erp, sso, owa…) |
+| MEDIUM | 20 | Client-facing URL (app, booking, helpdesk…) |
+| LOW | 10 | Org domain appears anywhere in line |
+
+Telegram alerts fire for CRITICAL/HIGH/MEDIUM only. LOW is stored silently.
+
+## Per-file reference docs
+
+Each `.py` has a companion `.md` with design notes. **Always read the `.md` first, then the `.py` only if needed.** After making code changes, update the companion `.md` to match.
+
+## Useful CLI queries
+
+```bash
+# Query hits directly
+sqlite3 data/hits.db "SELECT severity, username, url FROM hits WHERE seen_before=0 ORDER BY score DESC LIMIT 20"
+
+# Wipe dedup cache to re-process files
+rm data/cache.json data/dedup.json
+
+# Follow live log
+tail -f data/logs/monitor.log
+```
+
+## TUI keybindings
+
+| Key | Action |
+|-----|--------|
+| `s` | Search hits DB |
+| `h` | Browse hits by severity (filter with `1`/`2`/`3`/`4`, recent with `r`) |
+| `k` | Edit keyword patterns live (changes take effect immediately) |
+| `c` | Clear logs |
+| `r` | Refresh stats |
+| `q` / `Escape` | Quit / back |
+
+Runtime keyword and channel changes are **not** persisted. Copy them to `config.py` to survive restarts.
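The cross-thread pattern described under "Threading model" can be sketched in isolation. This is a minimal illustration under stated assumptions, not the repository's code: `EventBus`, `BotThread`, `request_stop`, and the event strings are hypothetical names. Only the two primitives come from the description above: `queue.Queue.put_nowait` for bot → TUI and `loop.call_soon_threadsafe()` for TUI → bot.

```python
import asyncio
import queue
import threading


class EventBus:
    """Bot -> TUI channel (hypothetical): any thread may post, the UI thread drains."""

    def __init__(self):
        self._queue = queue.Queue()  # unbounded, so put_nowait never raises queue.Full

    def post(self, event):
        self._queue.put_nowait(event)

    def drain(self):
        """Return every queued event without blocking, as a periodic UI timer might."""
        events = []
        while True:
            try:
                events.append(self._queue.get_nowait())
            except queue.Empty:
                return events


class BotThread(threading.Thread):
    """Hypothetical stand-in for the bot thread: it owns its own asyncio loop."""

    def __init__(self, bus):
        super().__init__(daemon=True)
        self.bus = bus
        self.loop = asyncio.new_event_loop()
        self.ready = threading.Event()
        self._stop_requested = None  # asyncio.Event, created inside the loop

    def run(self):
        asyncio.set_event_loop(self.loop)
        try:
            self.loop.run_until_complete(self._main())
        finally:
            self.loop.close()

    async def _main(self):
        self._stop_requested = asyncio.Event()
        self.bus.post("bot: started")
        self.ready.set()
        await self._stop_requested.wait()
        self.bus.post("bot: stopped")

    def request_stop(self):
        # TUI -> bot: asyncio objects are not thread-safe, so the call is
        # handed to the bot's loop rather than made directly from this thread.
        self.loop.call_soon_threadsafe(self._stop_requested.set)


def demo():
    bus = EventBus()
    bot = BotThread(bus)
    bot.start()
    bot.ready.wait(timeout=5)   # wait until the bot loop is running
    bot.request_stop()          # signal the bot from the "TUI" thread
    bot.join(timeout=5)
    return bus.drain()
```

Calling `demo()` should return the two events in posting order. The asymmetry is deliberate: a `queue.Queue` is safe to touch from any thread, whereas anything that lives on the bot's event loop has to be reached through `call_soon_threadsafe()`.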