CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Development workflow

After every code change:

Run pytest — all tests must pass at 100%.
If 100% pass: present the change to the user, then commit.
If any test fails: fix the bug and re-run before showing anything to the user.

Never present code or commit while tests are failing.

Running tests

pip install -r requirements-dev.txt
pytest           # all tests
pytest -v        # verbose
pytest tests/test_scorer.py  # single file

Tests cover utils/scorer, utils/cache, utils/database, and core/processor. They are fully isolated: no .env is required, and no real DB or cache files are touched. The patched_keywords fixture in conftest.py replaces TARGET_KEYWORDS with known test patterns; it must patch both config.TARGET_KEYWORDS and scorer.TARGET_KEYWORDS (the module-local binding created by the scorer's from config import statement).
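
The double patch matters because a from-import copies the reference into the importing module's namespace at import time, so rebinding the name on config afterwards does not reach the scorer. A minimal sketch of that behaviour, using stand-in modules built with types.ModuleType instead of the repository's real config and scorer (the pattern values and the matches helper are hypothetical):

```python
import re
import types
from unittest import mock

# Stand-in for config.py (hypothetical pattern value).
config = types.ModuleType("config")
config.TARGET_KEYWORDS = [r"@prod\.example"]

# Stand-in for utils/scorer.py. This assignment mimics what
# `from config import TARGET_KEYWORDS` does: it creates a second,
# module-local name bound to the same list object.
scorer = types.ModuleType("scorer")
scorer.TARGET_KEYWORDS = config.TARGET_KEYWORDS


def matches(line: str) -> bool:
    """Hypothetical scorer helper that reads the scorer-local binding."""
    return any(re.search(p, line) for p in scorer.TARGET_KEYWORDS)


TEST_PATTERNS = [r"@test\.example"]
LINE = "https://x.test/login:alice@test.example:pw"

# Patching only config: the scorer still sees the original patterns.
with mock.patch.object(config, "TARGET_KEYWORDS", TEST_PATTERNS):
    config_only = matches(LINE)

# Patching both bindings, as the fixture must do.
with mock.patch.object(config, "TARGET_KEYWORDS", TEST_PATTERNS), \
        mock.patch.object(scorer, "TARGET_KEYWORDS", TEST_PATTERNS):
    both = matches(LINE)

print(config_only, both)  # False True
```

In a pytest fixture the same two replacements would typically be made with monkeypatch.setattr, which also restores the originals at teardown.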

Running the monitor

source .venv/bin/activate  # activate the Python virtual environment, if .venv exists
python main.py             # TUI mode (default)
python main.py --no-tui    # Plain CLI, logs to stdout + data/logs/monitor.log

The first run prompts interactively for the Telegram phone number and 2FA credentials in order to create a session file.

Setup prerequisites

pip install -r requirements.txt
# rarfile requires the unrar binary: sudo apt install unrar (Linux) or brew install rar (macOS)

# tdl (strongly recommended for fast downloads):
curl -sSL https://raw.githubusercontent.com/iyear/tdl/main/scripts/install.sh | bash
tdl login -n monitor_session

If no .env file exists, ask the user to create it manually. Do not create it yourself: it contains personal information.

Architecture

Data flow

Telegram channel message with file attachment
  └─ core/scraper.py          detects attachment, guards (size/extension/dedup)
       └─ core/tdl_downloader.py  downloads via tdl subprocess (batched)
           └─ core/scraper.py     Telethon fallback if tdl fails
       └─ core/bot_downloader.py  handles inline "DOWNLOAD" button → bot reply flow
       └─ core/processor.py       extracts .zip/.7z/.rar, searches .txt line by line
       └─ core/notifier.py        scores → deduplicates → writes DB/txt/csv → Telegram alert
            ├─ utils/scorer.py
            ├─ utils/database.py
            └─ tui/events.py      posts EvHit to TUI event bus
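
The three guards applied before a download (size, extension, dedup) could look roughly like the sketch below. This is an illustration only: should_download, MAX_SIZE_BYTES, ALLOWED_EXTENSIONS, the limit value, and the order of the checks are all assumptions, not the contents of core/scraper.py.

```python
from pathlib import PurePath

# Hypothetical limits; the real values would come from config.py.
MAX_SIZE_BYTES = 2 * 1024**3
ALLOWED_EXTENSIONS = {".zip", ".7z", ".rar", ".txt"}


def should_download(file_id: str, name: str, size: int, seen: set) -> bool:
    """Return True only if the attachment passes all three guards."""
    if size > MAX_SIZE_BYTES:
        return False  # size guard
    if PurePath(name).suffix.lower() not in ALLOWED_EXTENSIONS:
        return False  # extension guard
    if file_id in seen:
        return False  # dedup guard: this file ID was already handled
    seen.add(file_id)  # remember it so a repost is skipped next time
    return True


seen_ids = set()
print(should_download("f1", "dump.ZIP", 1024, seen_ids))   # True
print(should_download("f1", "dump.ZIP", 1024, seen_ids))   # False (duplicate)
print(should_download("f2", "photo.jpg", 1024, seen_ids))  # False (extension)
```

Running the cheap checks before the dedup lookup means rejected files never enter the seen set, so a later change to the limits can still pick them up.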

Threading model

The TUI and Telegram bot run in separate threads with different event loops:

Main thread: Textual's event loop — runs MonitorApp, drains the event bus every 100ms via _drain_bus()
Bot thread: own asyncio event loop — runs _bot_main() with both user_client and bot_client
Cross-thread communication: bot → TUI via bus.post() (queue.Queue.put_nowait, always safe); TUI → bot via loop.call_soon_threadsafe() (e.g., to signal channel list changes)
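
A minimal sketch of the two cross-thread directions, using only the standard library. The names bot_main, run_bot, shared, and channels_changed are hypothetical stand-ins for the real _bot_main() and its signalling; no Textual or Telethon code is involved.

```python
import asyncio
import queue
import threading

bus: queue.Queue = queue.Queue()  # bot -> TUI event bus


async def bot_main(ready: threading.Event, shared: dict) -> None:
    # Publish the bot loop and an asyncio.Event so the other thread can signal us.
    shared["loop"] = asyncio.get_running_loop()
    shared["channels_changed"] = asyncio.Event()
    ready.set()
    bus.put_nowait("bot started")            # safe from any thread
    await shared["channels_changed"].wait()  # wait for the TUI's signal
    bus.put_nowait("channel list reloaded")


def run_bot(ready: threading.Event, shared: dict) -> None:
    asyncio.run(bot_main(ready, shared))     # the bot thread owns this loop


ready = threading.Event()
shared: dict = {}
bot_thread = threading.Thread(target=run_bot, args=(ready, shared), daemon=True)
bot_thread.start()
ready.wait(timeout=5)

# TUI -> bot: asyncio objects are not thread-safe, so the set() call is
# scheduled on the bot's loop instead of being invoked directly.
shared["loop"].call_soon_threadsafe(shared["channels_changed"].set)
bot_thread.join(timeout=5)

# TUI side: drain everything currently queued, as a periodic drain would.
events = []
while True:
    try:
        events.append(bus.get_nowait())
    except queue.Empty:
        break
print(events)  # ['bot started', 'channel list reloaded']
```

The asymmetry is deliberate: queue.Queue is designed for use across threads, whereas anything touching the asyncio loop has to be handed over with call_soon_threadsafe.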

Module responsibilities

Module	Role
`config.py`	All settings — edit keywords, channels, paths, tdl tuning here
`core/scraper.py`	Live listener + backfill orchestration; registers Telethon `NewMessage` handlers
`core/tdl_downloader.py`	Wraps `tdl` subprocess for fast downloads; falls back to Telethon
`core/bot_downloader.py`	Handles inline button click flow where files come via bot reply
`core/processor.py`	Archive extraction (supports nested archives one level deep) + line-by-line search
`core/notifier.py`	Scoring → dedup → DB insert → hits.txt/csv write → Telegram bot alert
`utils/scorer.py`	Severity scoring; parses ULP lines (`url:user:pass`), classifies CRITICAL/HIGH/MEDIUM/LOW
`utils/cache.py`	Seen file-ID dedup stored in `data/cache.json`
`utils/database.py`	SQLite read/write for `data/hits.db`
`tui/app.py`	`MonitorApp` + all screens (Search, HitsDB, Keywords)
`tui/events.py`	Thread-safe `queue.Queue` event bus
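
As an illustration of the extraction and line-by-line search described for core/processor.py, here is a sketch limited to .zip archives via the standard-library zipfile module (.7z and .rar need other libraries and are left out). The function names search_zip and make_zip, the in-memory handling, and the depth parameter are assumptions, not the module's actual interface.

```python
import io
import re
import zipfile


def search_zip(data: bytes, patterns: list, depth: int = 0) -> list:
    """Return matching lines from .txt members, descending one nested level."""
    hits = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            name = info.filename.lower()
            if name.endswith(".txt"):
                with zf.open(info) as fh:
                    for raw in fh:  # stream line by line, never the whole file
                        line = raw.decode("utf-8", errors="replace").rstrip("\r\n")
                        if any(p.search(line) for p in patterns):
                            hits.append(line)
            elif name.endswith(".zip") and depth < 1:
                # Nested archive: recurse once, then stop.
                hits.extend(search_zip(zf.read(info), patterns, depth + 1))
    return hits


def make_zip(files: dict) -> bytes:
    """Build an in-memory zip for the demo."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    return buf.getvalue()


inner = make_zip({"b.txt": b"https://vpn.example.org:bob:pw\n"})
outer = make_zip({
    "a.txt": b"noise\nhttps://app.example.org:amy:pw\n",
    "inner.zip": inner,
})
found = search_zip(outer, [re.compile(r"example\.org")])
print(found)
```

Capping the recursion at one level keeps a maliciously nested archive from expanding without bound.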

Severity scoring

Keywords in config.TARGET_KEYWORDS that contain @ (e.g. r"@myorg\.cl") are treated as employee email domains and score CRITICAL on match. Keywords without @ are plain domain matches and score LOW as a baseline.

Severity	Score	Triggers
CRITICAL	40	Employee email in username · Privileged service URL (admin, vpn, rdp, gitlab…)
HIGH	30	Internal service URL (intranet, erp, sso, owa…)
MEDIUM	20	Client-facing URL (app, booking, helpdesk…)
LOW	10	Org domain appears anywhere in line

Telegram alerts fire for CRITICAL/HIGH/MEDIUM only. LOW is stored silently.
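
The rules above can be sketched as a small classifier. This is not the code in utils/scorer.py: the function names, the keyword list, the substring-based service patterns (built only from the examples named in the table), and the right-to-left split of ULP lines are all assumptions made for illustration.

```python
import re
from typing import Optional

# Scores taken from the table above.
SCORES = {"CRITICAL": 40, "HIGH": 30, "MEDIUM": 20, "LOW": 10}

# Hypothetical keyword list: one email-domain pattern, one plain domain.
TARGET_KEYWORDS = [r"@myorg\.cl", r"myorg\.cl"]

# Crude substring patterns; a real scorer would need stricter matching.
PRIVILEGED = re.compile(r"admin|vpn|rdp|gitlab", re.I)
INTERNAL = re.compile(r"intranet|erp|sso|owa", re.I)
CLIENT = re.compile(r"app|booking|helpdesk", re.I)


def parse_ulp(line: str) -> Optional[tuple]:
    """Split url:user:pass from the right so colons inside the URL survive.

    Assumption: the password itself contains no colon.
    """
    parts = line.strip().rsplit(":", 2)
    return tuple(parts) if len(parts) == 3 else None


def classify(line: str) -> Optional[str]:
    """Return a severity name, or None if no keyword matches the line."""
    if not any(re.search(k, line, re.I) for k in TARGET_KEYWORDS):
        return None
    parsed = parse_ulp(line)
    url, user = (parsed[0], parsed[1]) if parsed else (line, "")
    email_keywords = [k for k in TARGET_KEYWORDS if "@" in k]
    if any(re.search(k, user, re.I) for k in email_keywords) or PRIVILEGED.search(url):
        return "CRITICAL"
    if INTERNAL.search(url):
        return "HIGH"
    if CLIENT.search(url):
        return "MEDIUM"
    return "LOW"


def should_alert(severity: str) -> bool:
    """LOW hits are stored silently; everything above LOW triggers an alert."""
    return severity in {"CRITICAL", "HIGH", "MEDIUM"}


for sample in (
    "https://portal.myorg.cl/login:juan@myorg.cl:pw",
    "https://intranet.myorg.cl:ext@gmail.com:pw",
    "https://www.myorg.cl:ext@gmail.com:pw",
):
    sev = classify(sample)
    print(sev, SCORES[sev], should_alert(sev))
```

Checking the tiers from most to least severe means a line that qualifies for several tiers is reported once, at the highest.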

Per-file reference docs

Each .py has a companion .md with design notes. Always read the .md first, then the .py only if needed. After making code changes, update the companion .md to match.

Useful CLI queries

# Query hits directly
sqlite3 data/hits.db "SELECT severity, username, url FROM hits WHERE seen_before=0 ORDER BY score DESC LIMIT 20"

# Wipe dedup cache to re-process files
rm data/cache.json data/dedup.json

# Follow live log
tail -f data/logs/monitor.log
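
The hits query can also be run from Python with the standard-library sqlite3 module. In this sketch the function name top_new_hits is hypothetical, and the demo table holds only the columns named in the query above; the real schema in data/hits.db may differ.

```python
import sqlite3


def top_new_hits(conn: sqlite3.Connection, limit: int = 20) -> list:
    """Return (severity, username, url) for unseen hits, highest score first."""
    cur = conn.execute(
        "SELECT severity, username, url FROM hits "
        "WHERE seen_before = 0 ORDER BY score DESC LIMIT ?",
        (limit,),
    )
    return cur.fetchall()


# Demo against an in-memory database with an assumed minimal schema.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE hits (severity TEXT, score INTEGER, username TEXT, "
    "url TEXT, seen_before INTEGER)"
)
demo.executemany(
    "INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
    [
        ("LOW", 10, "ext@gmail.com", "https://www.myorg.cl", 0),
        ("CRITICAL", 40, "juan@myorg.cl", "https://portal.myorg.cl", 0),
        ("HIGH", 30, "ext@gmail.com", "https://intranet.myorg.cl", 1),
    ],
)
rows = top_new_hits(demo)
print(rows)
```

Against the real database, the connection would be opened with sqlite3.connect("data/hits.db") instead of the in-memory one.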

TUI keybindings

Key	Action
`s`	Search hits DB
`h`	Browse hits by severity (filter with `1`/`2`/`3`/`4`, recent with `r`)
`k`	Edit keyword patterns live (changes take effect immediately)
`c`	Clear logs
`r`	Refresh stats
`q` / `Escape`	Quit / back

Runtime keyword and channel changes are not persisted — copy them to config.py to survive restarts.
