
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Development workflow

After every code change:
1. Run `pytest` — all tests must pass at 100%.
2. If 100% pass: present the change to the user, then commit.
3. If any test fails: fix the bug and re-run before showing anything to the user.

Never present code or commit while tests are failing.

## Running tests

```bash
pip install -r requirements-dev.txt
pytest           # all tests
pytest -v        # verbose
pytest tests/test_scorer.py  # single file
```

Tests cover `utils/scorer`, `utils/cache`, `utils/database`, and `core/processor`. They are fully isolated — no `.env` required, no real DB or cache files touched. The `patched_keywords` fixture in `conftest.py` replaces `TARGET_KEYWORDS` with known test patterns; it must patch both `config.TARGET_KEYWORDS` and `scorer.TARGET_KEYWORDS` (the local `from config import` binding).
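The need to patch both names can be shown with a minimal, self-contained sketch. It does not reproduce the real `conftest.py`: the two modules below are stand-ins built with `types.ModuleType`, the patterns are made up, and `unittest.mock` is used here in place of whatever patching mechanism the actual fixture relies on.

```python
import types
from unittest import mock

# Stand-ins for the real modules (hypothetical contents).
config = types.ModuleType("config")
config.TARGET_KEYWORDS = [r"@realorg\.example"]

scorer = types.ModuleType("scorer")
# Equivalent to `from config import TARGET_KEYWORDS` inside scorer.py:
# the name is bound in scorer's own namespace at import time.
scorer.TARGET_KEYWORDS = config.TARGET_KEYWORDS

test_patterns = [r"@test\.example"]

# Patching only config rebinds config.TARGET_KEYWORDS; scorer still
# holds a reference to the original list.
with mock.patch.object(config, "TARGET_KEYWORDS", test_patterns):
    only_config_patched = scorer.TARGET_KEYWORDS

# Patching both names makes the scorer see the test patterns.
with mock.patch.object(config, "TARGET_KEYWORDS", test_patterns), \
     mock.patch.object(scorer, "TARGET_KEYWORDS", test_patterns):
    both_patched = scorer.TARGET_KEYWORDS

print(only_config_patched)  # still the original patterns
print(both_patched)         # the test patterns
```

Both patches are undone when the `with` blocks exit, so later tests see the original values again.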

## Running the monitor

```bash
source .venv/bin/activate  # activate the Python environment, if .venv exists
python main.py             # TUI mode (default)
python main.py --no-tui    # Plain CLI, logs to stdout + data/logs/monitor.log
```

On first run, the monitor interactively prompts for your Telegram phone number and 2FA to create a session file.

## Setup prerequisites

```bash
pip install -r requirements.txt
# rarfile requires the unrar binary: sudo apt install unrar (Linux) or brew install rar (macOS)

# tdl (strongly recommended for fast downloads):
curl -sSL https://raw.githubusercontent.com/iyear/tdl/main/scripts/install.sh | bash
tdl login -n monitor_session
```

If no `.env` file exists, ask the user to create it manually. Do not create it yourself: it contains personal information.

## Architecture

### Data flow

```
Telegram channel message with file attachment
  └─ core/scraper.py          detects attachment, guards (size/extension/dedup)
       └─ core/tdl_downloader.py  downloads via tdl subprocess (batched)
           └─ core/scraper.py     Telethon fallback if tdl fails
       └─ core/bot_downloader.py  handles inline "DOWNLOAD" button → bot reply flow
       └─ core/processor.py       extracts .zip/.7z/.rar, searches .txt line by line
       └─ core/notifier.py        scores → deduplicates → writes DB/txt/csv → Telegram alert
            ├─ utils/scorer.py
            ├─ utils/database.py
            └─ tui/events.py      posts EvHit to TUI event bus
```
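The extraction and search step in the flow above can be sketched roughly as follows. This is an illustration only, not the code in `core/processor.py`: it handles `.zip` with the standard library (`.7z` and `.rar` would need third-party packages), and the function names, file names, and sample lines are all hypothetical.

```python
import io
import re
import zipfile


def search_zip(data, patterns, depth=0, prefix=""):
    """Yield (member_path, line) for every .txt line matching any pattern."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            lower = name.lower()
            if lower.endswith(".txt"):
                with archive.open(name) as handle:
                    # Iterate line by line so a large file is never loaded whole.
                    for raw in handle:
                        line = raw.decode("utf-8", errors="replace").rstrip("\r\n")
                        if any(rx.search(line) for rx in compiled):
                            yield prefix + name, line
            elif lower.endswith(".zip") and depth < 1:
                # Nested archives: descend one level only.
                yield from search_zip(
                    archive.read(name), patterns, depth + 1, prefix + name + "/"
                )


def make_zip(files):
    """Build an in-memory zip from a {name: text} mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


inner = make_zip({"logs/export.txt": "https://vpn.myorg.cl:bob:pw\nhttps://other.example:eve:pw\n"})
outer = make_zip({"readme.txt": "nothing here\n", "inner.zip": inner})
hits = list(search_zip(outer, [r"myorg\.cl"]))
print(hits)
```

Reading the archive from bytes keeps the sketch free of temporary files; a real implementation working on multi-gigabyte downloads would more likely open the archive from disk.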

### Threading model

The TUI and Telegram bot run in separate threads with different event loops:

- **Main thread**: Textual's event loop — runs `MonitorApp`, drains the event bus every 100ms via `_drain_bus()`
- **Bot thread**: own `asyncio` event loop — runs `_bot_main()` with both `user_client` and `bot_client`
- **Cross-thread communication**: bot → TUI via `bus.post()` (`queue.Queue.put_nowait`, which is thread-safe and never blocks); TUI → bot via `loop.call_soon_threadsafe()` (e.g., to signal channel list changes)
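
The two communication directions can be illustrated with a small standalone sketch. It uses only the standard library and does not involve Textual or Telethon; `bot_main`, `drain_bus`, the `state` dict, and the event strings are hypothetical stand-ins for `_bot_main()`, `_drain_bus()`, and the real event classes.

```python
import asyncio
import queue
import threading

bus = queue.Queue()             # unbounded, so put_nowait never raises queue.Full
loop_ready = threading.Event()  # lets the main thread wait for the bot loop
state = {}                      # shared handles to the bot loop and its event


async def bot_main():
    """Runs inside the bot thread's own asyncio event loop."""
    state["loop"] = asyncio.get_running_loop()
    state["channels_changed"] = asyncio.Event()
    bus.put_nowait("bot started")             # bot -> TUI
    loop_ready.set()
    await state["channels_changed"].wait()    # wait for a signal from the TUI
    bus.put_nowait("channel list reloaded")   # bot -> TUI


def drain_bus():
    """Collect everything currently queued without blocking."""
    events = []
    while True:
        try:
            events.append(bus.get_nowait())
        except queue.Empty:
            return events


thread = threading.Thread(target=lambda: asyncio.run(bot_main()), daemon=True)
thread.start()
loop_ready.wait(timeout=5)

# TUI -> bot: asyncio objects are not thread-safe, so the call is
# scheduled on the bot loop instead of being made directly.
state["loop"].call_soon_threadsafe(state["channels_changed"].set)
thread.join(timeout=5)

received = drain_bus()
print(received)
```

In the real application the drain would run on a timer in the main thread rather than once after the bot thread exits.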

### Module responsibilities

| Module | Role |
|--------|------|
| `config.py` | All settings — edit keywords, channels, paths, tdl tuning here |
| `core/scraper.py` | Live listener + backfill orchestration; registers Telethon `NewMessage` handlers |
| `core/tdl_downloader.py` | Wraps `tdl` subprocess for fast downloads; falls back to Telethon |
| `core/bot_downloader.py` | Handles inline button click flow where files come via bot reply |
| `core/processor.py` | Archive extraction (supports nested archives one level deep) + line-by-line search |
| `core/notifier.py` | Scoring → dedup → DB insert → hits.txt/csv write → Telegram bot alert |
| `utils/scorer.py` | Severity scoring; parses ULP lines (`url:user:pass`), classifies CRITICAL/HIGH/MEDIUM/LOW |
| `utils/cache.py` | Seen file-ID dedup stored in `data/cache.json` |
| `utils/database.py` | SQLite read/write for `data/hits.db` |
| `tui/app.py` | `MonitorApp` + all screens (Search, HitsDB, Keywords) |
| `tui/events.py` | Thread-safe `queue.Queue` event bus |

### Severity scoring

Keywords in `config.TARGET_KEYWORDS` with `@` (e.g. `r"@myorg\.cl"`) are **employee email domains** → CRITICAL on match. Keywords without `@` are plain domain matches → LOW baseline.

| Severity | Score | Triggers |
|----------|-------|----------|
| CRITICAL | 40 | Employee email in username · Privileged service URL (admin, vpn, rdp, gitlab…) |
| HIGH | 30 | Internal service URL (intranet, erp, sso, owa…) |
| MEDIUM | 20 | Client-facing URL (app, booking, helpdesk…) |
| LOW | 10 | Org domain appears anywhere in line |

Telegram alerts fire for CRITICAL/HIGH/MEDIUM only. LOW is stored silently.
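
As a rough illustration of the rules in the table, here is a minimal sketch of a classifier. It is not the implementation in `utils/scorer.py`: the token sets contain only the examples named above, the keyword patterns are placeholders, and matching on URL tokens is an assumption about how service names might be detected.

```python
import re

# Placeholder patterns; the real ones live in config.TARGET_KEYWORDS.
TARGET_KEYWORDS = [r"@myorg\.cl", r"myorg\.cl"]

# Example tokens taken from the table; the real lists are likely longer.
PRIVILEGED = {"admin", "vpn", "rdp", "gitlab"}
INTERNAL = {"intranet", "erp", "sso", "owa"}
CLIENT_FACING = {"app", "booking", "helpdesk"}
SCORES = {"CRITICAL": 40, "HIGH": 30, "MEDIUM": 20, "LOW": 10}


def classify(line):
    """Return (severity, score), or None if no keyword matches the line."""
    if not any(re.search(p, line, re.IGNORECASE) for p in TARGET_KEYWORDS):
        return None
    severity = "LOW"  # baseline: the org domain appears somewhere in the line
    # Split from the right so the colon in "https://" stays inside the URL.
    # A password that itself contains ":" would be mis-split by this sketch.
    parts = line.strip().rsplit(":", 2)
    if len(parts) == 3:
        url, username, _password = parts
        tokens = set(re.split(r"[^a-z0-9]+", url.lower()))
        email_patterns = [p for p in TARGET_KEYWORDS if "@" in p]
        employee = any(re.search(p, username, re.IGNORECASE) for p in email_patterns)
        if employee or tokens & PRIVILEGED:
            severity = "CRITICAL"
        elif tokens & INTERNAL:
            severity = "HIGH"
        elif tokens & CLIENT_FACING:
            severity = "MEDIUM"
    return severity, SCORES[severity]


def should_alert(severity):
    """Telegram alerts fire for everything except LOW."""
    return severity != "LOW"


print(classify("https://intranet.myorg.cl:bob:pw"))
```

Splitting the URL into tokens avoids accidental substring matches, such as `erp` inside `enterprise`.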

## Per-file reference docs

Each `.py` has a companion `.md` with design notes. **Always read the `.md` first, then the `.py` only if needed.** After making code changes, update the companion `.md` to match.

## Useful CLI queries

```bash
# Query hits directly
sqlite3 data/hits.db "SELECT severity, username, url FROM hits WHERE seen_before=0 ORDER BY score DESC LIMIT 20"

# Wipe dedup cache to re-process files
rm data/cache.json data/dedup.json

# Follow live log
tail -f data/logs/monitor.log
```
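
The first query can also be run from Python. The sketch below uses an in-memory database with a guessed schema: only the column names come from the query above, while the column types, sample rows, and the `top_new_hits` helper are assumptions. Against the real data, connect to `data/hits.db` instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # real data: sqlite3.connect("data/hits.db")

# Assumed schema, reduced to the columns the query uses.
conn.execute(
    "CREATE TABLE hits ("
    "severity TEXT, score INTEGER, username TEXT, url TEXT, seen_before INTEGER)"
)
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
    [
        ("LOW", 10, "bob", "https://www.myorg.cl", 0),
        ("CRITICAL", 40, "alice@myorg.cl", "https://shop.example.com", 0),
        ("HIGH", 30, "carol", "https://intranet.myorg.cl", 1),  # already seen
    ],
)


def top_new_hits(connection, limit=20):
    """Return unseen hits ordered by score, highest first."""
    cursor = connection.execute(
        "SELECT severity, username, url FROM hits "
        "WHERE seen_before = 0 ORDER BY score DESC LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()


rows = top_new_hits(conn)
for row in rows:
    print(row)
```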

## TUI keybindings

| Key | Action |
|-----|--------|
| `s` | Search hits DB |
| `h` | Browse hits by severity (filter with `1`/`2`/`3`/`4`, recent with `r`) |
| `k` | Edit keyword patterns live (changes take effect immediately) |
| `c` | Clear logs |
| `r` | Refresh stats |
| `q` / `Escape` | Quit / back |

Runtime keyword and channel changes are **not** persisted — copy them to `config.py` to survive restarts.