Rename to stealergram, add pyproject.toml, purge em-dashes

- Rename project to stealergram throughout
- Add pyproject.toml (replaces requirements.txt split, folds pytest.ini)
- Replace all em-dashes with hyphens across all source files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 10:06:30 -04:00
parent 4c104cddd2
commit 741e6bb0d3
46 changed files with 244 additions and 191 deletions
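
The em-dash purge described in the commit message could be reproduced with a small script along these lines. This is a minimal sketch, not the tool that was actually used for this commit: the function names `purge_em_dashes` and `purge_tree` and the suffix list are assumptions made for illustration. Only U+2014 is replaced, so the box-drawing rules (`───`) used in section comments are left alone.

```python
from pathlib import Path

# Written as an escape so the script itself contains no literal em-dash.
EM_DASH = "\u2014"


def purge_em_dashes(text: str) -> str:
    """Replace every em-dash (U+2014) with a plain ASCII hyphen."""
    return text.replace(EM_DASH, "-")


def purge_tree(root: Path, suffixes=(".py", ".md", ".txt", ".toml")) -> int:
    """Rewrite matching files under root in place; return how many changed.

    The suffix list is an assumption; adjust it to the files you want covered.
    """
    changed = 0
    for path in root.rglob("*"):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        original = path.read_text(encoding="utf-8")
        cleaned = purge_em_dashes(original)
        if cleaned != original:
            path.write_text(cleaned, encoding="utf-8")
            changed += 1
    return changed
```

Calling `purge_tree(Path("."))` from the repository root would rewrite files in place, so it is best run on a clean working tree where the result can be reviewed with `git diff`.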

View File

@@ -15,8 +15,8 @@ NOTIFY_CHAT_ID=987654321
# ─── Session name (just a filename, no extension needed) ────────────────────
SESSION_NAME=monitor_session
# ─── tdl (fast Go downloader) optional but strongly recommended ───────────
# ─── tdl (fast Go downloader) - optional but strongly recommended ───────────
# Install: https://github.com/iyear/tdl
# After installing, run once: tdl login -n <SESSION_NAME>
# SESSION_NAME above is shared between Telethon and tdl no double login needed.
# SESSION_NAME above is shared between Telethon and tdl - no double login needed.
# If tdl is not on PATH the bot falls back to Telethon automatically.

View File

@@ -5,7 +5,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
## Development workflow
After every code change:
1. Run `pytest` all tests must pass at 100%.
1. Run `pytest` - all tests must pass at 100%.
2. If 100% pass: present the change to the user, then commit.
3. If any test fails: fix the bug and re-run before showing anything to the user.
@@ -20,7 +20,7 @@ pytest -v # verbose
pytest tests/test_scorer.py # single file
```
Tests cover `utils/scorer`, `utils/cache`, `utils/database`, and `core/processor`. They are fully isolated no `.env` required, no real DB or cache files touched. The `patched_keywords` fixture in `conftest.py` replaces `TARGET_KEYWORDS` with known test patterns; it must patch both `config.TARGET_KEYWORDS` and `scorer.TARGET_KEYWORDS` (the local `from config import` binding).
Tests cover `utils/scorer`, `utils/cache`, `utils/database`, and `core/processor`. They are fully isolated - no `.env` required, no real DB or cache files touched. The `patched_keywords` fixture in `conftest.py` replaces `TARGET_KEYWORDS` with known test patterns; it must patch both `config.TARGET_KEYWORDS` and `scorer.TARGET_KEYWORDS` (the local `from config import` binding).
## Running the monitor
@@ -66,15 +66,15 @@ Telegram channel message with file attachment
The TUI and Telegram bot run in separate threads with different event loops:
- **Main thread**: Textual's event loop runs `MonitorApp`, drains the event bus every 100ms via `_drain_bus()`
- **Bot thread**: own `asyncio` event loop runs `_bot_main()` with both `user_client` and `bot_client`
- **Main thread**: Textual's event loop - runs `MonitorApp`, drains the event bus every 100ms via `_drain_bus()`
- **Bot thread**: own `asyncio` event loop - runs `_bot_main()` with both `user_client` and `bot_client`
- **Cross-thread communication**: bot → TUI via `bus.post()` (`queue.Queue.put_nowait`, always safe); TUI → bot via `loop.call_soon_threadsafe()` (e.g., to signal channel list changes)
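
The two directions of cross-thread communication listed above can be sketched as follows. This is an illustrative stand-alone demo, not the project's code: `event_bus`, `_bot_main`'s parameters, and `run_demo` are hypothetical names, and the UI side uses a blocking `get()` where the real TUI polls on a 100ms timer.

```python
import asyncio
import queue
import threading

# bot -> UI direction: a plain thread-safe queue (hypothetical stand-in for the bus).
event_bus: "queue.Queue[str]" = queue.Queue()


async def _bot_main(channels_changed: asyncio.Event, stop: asyncio.Event) -> None:
    """Stand-in for the bot coroutine: report to the UI, then wait for signals."""
    event_bus.put_nowait("bot started")            # safe to call from any thread
    await channels_changed.wait()
    event_bus.put_nowait("channel list reloaded")
    await stop.wait()


def run_demo() -> list[str]:
    loop = asyncio.new_event_loop()
    ready = threading.Event()
    signals: dict[str, asyncio.Event] = {}

    def bot_thread() -> None:
        asyncio.set_event_loop(loop)
        # Create the asyncio.Event objects on the bot's own thread.
        signals["changed"] = asyncio.Event()
        signals["stop"] = asyncio.Event()
        ready.set()
        loop.run_until_complete(_bot_main(signals["changed"], signals["stop"]))
        loop.close()

    worker = threading.Thread(target=bot_thread, daemon=True)
    worker.start()
    ready.wait()

    received = [event_bus.get(timeout=2)]
    # UI -> bot: asyncio.Event.set() is not thread-safe, so schedule it on the bot loop.
    loop.call_soon_threadsafe(signals["changed"].set)
    received.append(event_bus.get(timeout=2))
    loop.call_soon_threadsafe(signals["stop"].set)
    worker.join(timeout=2)
    return received
```

The asymmetry is the point of the design: `queue.Queue` is safe to touch from any thread, while anything owned by an `asyncio` loop should only be mutated from that loop's thread, which is what `call_soon_threadsafe()` arranges.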
### Module responsibilities
| Module | Role |
|--------|------|
| `config.py` | All settings edit keywords, channels, paths, tdl tuning here |
| `config.py` | All settings - edit keywords, channels, paths, tdl tuning here |
| `core/scraper.py` | Live listener + backfill orchestration; registers Telethon `NewMessage` handlers |
| `core/tdl_downloader.py` | Wraps `tdl` subprocess for fast downloads; falls back to Telethon |
| `core/bot_downloader.py` | Handles inline button click flow where files come via bot reply |
@@ -127,4 +127,4 @@ tail -f data/logs/monitor.log
| `r` | Refresh stats |
| `q` / `Escape` | Quit / back |
Runtime keyword and channel changes are **not** persisted copy them to `config.py` to survive restarts.
Runtime keyword and channel changes are **not** persisted - copy them to `config.py` to survive restarts.

View File

@@ -1,4 +1,4 @@
# ULP Monitor Quick Reference
# ULP Monitor - Quick Reference
> For Claude Code: read the per-file `.md` alongside each `.py` before editing.
> Full docs in `README.md`.
@@ -10,7 +10,7 @@
```
ulp_monitor/
├── main.py Entry point (--no-tui flag for CLI mode)
├── config.py All settings edit this for keywords, channels, paths
├── config.py All settings - edit this for keywords, channels, paths
├── core/ Telegram I/O pipeline (all async, Telethon-dependent)
│ ├── scraper.py Live listener + backfill orchestration
@@ -24,11 +24,11 @@ ulp_monitor/
│ ├── cache.py Seen file-ID dedup (data/cache.json)
│ └── database.py SQLite read/write (data/hits.db)
├── tui/ Textual TUI runs in main thread
├── tui/ Textual TUI - runs in main thread
│ ├── app.py MonitorApp + all screens + bot thread launcher
│ └── events.py Thread-safe queue.Queue event bus
└── data/ Runtime output gitignored
└── data/ Runtime output - gitignored
├── hits.db
├── hits.txt
├── hits.csv
@@ -126,7 +126,7 @@ cross-thread communication
| MEDIUM | 20 | Client-facing URL (app, booking, helpdesk…) |
| LOW | 10 | Org domain appears anywhere in line |
`@`-keyword rule: pattern requires literal `@` before domain `user@gmail.com` on a URL containing `myorg.cl` does **not** trigger CRITICAL.
`@`-keyword rule: pattern requires literal `@` before domain - `user@gmail.com` on a URL containing `myorg.cl` does **not** trigger CRITICAL.
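
A minimal sketch of the `@`-keyword rule, assuming the two keyword styles shown in the README (`r"@myorg\.cl"` and a plain `myorg\.cl`). The `classify` function is hypothetical and deliberately reduced to two outcomes; the real scorer also assigns HIGH and MEDIUM, which are omitted here.

```python
import re

# Keyword with a literal '@': matches only when the org domain is an email domain.
EMPLOYEE_PATTERN = re.compile(r"@myorg\.cl", re.IGNORECASE)
# Keyword without '@': matches the org domain anywhere in the line.
ORG_PATTERN = re.compile(r"myorg\.cl", re.IGNORECASE)


def classify(line: str) -> str:
    """Hypothetical two-level classifier illustrating the '@' rule only."""
    if EMPLOYEE_PATTERN.search(line):
        return "CRITICAL"
    if ORG_PATTERN.search(line):
        return "LOW"
    return "NONE"


# An outside email address on an org URL: the domain matches, the '@' rule does not.
print(classify("https://portal.myorg.cl/login:user@gmail.com:hunter2"))
# An employee address on an unrelated site: the '@' rule matches.
print(classify("https://example.com/login:jdoe@myorg.cl:hunter2"))
```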
---

View File

@@ -33,7 +33,7 @@ ulp_monitor/
│ ├── processor.py Archive extraction + line-by-line search
│ └── notifier.py hits.txt / hits.csv writer + bot alerts
├── utils/ Pure logic no Telegram dependencies
├── utils/ Pure logic - no Telegram dependencies
│ ├── scorer.py Hit severity scoring
│ ├── cache.py Seen-file deduplication
│ └── database.py SQLite persistence layer
@@ -75,11 +75,11 @@ cp .env.example .env
Open `config.py` and set:
- **`TARGET_KEYWORDS`** your org's domains and email patterns.
- **`TARGET_KEYWORDS`** - your org's domains and email patterns.
Keywords with `@` (e.g. `r"@myorg\.cl"`) are **employee email domains** → CRITICAL.
Keywords without `@` are plain domain matches → LOW baseline.
- **`WATCHED_CHANNELS`** channel usernames or numeric IDs
- **`BACKFILL_LIMIT`** past messages to scan per channel on startup
- **`WATCHED_CHANNELS`** - channel usernames or numeric IDs
- **`BACKFILL_LIMIT`** - past messages to scan per channel on startup
### 5. Install dependencies
@@ -97,7 +97,7 @@ curl -sSL https://raw.githubusercontent.com/iyear/tdl/main/scripts/install.sh |
tdl login -n monitor_session
```
### 6. First run complete Telegram auth
### 6. First run - complete Telegram auth
```bash
python main.py --no-tui
@@ -130,9 +130,9 @@ python main.py --no-tui # plain CLI
| File | Description |
|------|-------------|
| `data/hits.db` | SQLite all hits with scores, severity, dedup flag |
| `data/hits.db` | SQLite - all hits with scores, severity, dedup flag |
| `data/hits.txt` | Human-readable grouped log |
| `data/hits.csv` | CSV easy to pull into Excel / pandas |
| `data/hits.csv` | CSV - easy to pull into Excel / pandas |
| `data/logs/monitor.log` | Full run log |
Telegram alerts fire for CRITICAL / HIGH / MEDIUM only. LOW is stored silently.
@@ -141,6 +141,6 @@ Telegram alerts fire for CRITICAL / HIGH / MEDIUM only. LOW is stored silently.
## Notes
- **Session files are sensitive** equivalent to a logged-in account. Gitignored, never share.
- **Flood limits** `FloodWaitError` is handled automatically.
- **Private channels** your user account must already be a member.
- **Session files are sensitive** - equivalent to a logged-in account. Gitignored, never share.
- **Flood limits** - `FloodWaitError` is handled automatically.
- **Private channels** - your user account must already be a member.

View File

@@ -1,5 +1,5 @@
"""
config.py Loads and validates all settings from .env
config.py - Loads and validates all settings from .env
"""
import json
@@ -29,30 +29,35 @@ RUNTIME_CONFIG_PATH = Path("./data/runtime_config.json")
# Add your org's domains, email patterns, IP ranges, known usernames, etc.
# All patterns are case-insensitive regex.
_DEFAULT_KEYWORDS: list[str] = [
r"sanatorioaleman\.cl",
r"@sanatorioaleman\.cl",
#r"sanatorioaleman\.cl",
#r"@sanatorioaleman\.cl",
#r"@hites\.cl",
#r"hites\.com",
# r"192\.168\.10\.", # internal IP range example
# r"specificuser", # known internal usernames
r"onion\.global",
r"@onion\.global",
]
# Use usernames (without @) or numeric channel IDs (-100xxxxxxxxxx)
_DEFAULT_CHANNELS: list[str | int] = [
#-1002230225603,
"cloudxlog",
#-1001967030016, # daisycloud
#"berserklogs", # berserklogs
#"BorwitaFreeLogs", # borwita
-1002748707556, # darkcloud
-1001684073398, # BHF Cloud
-1003163621939, # Wich Love from R
-1003611713618, # Khazan Cloud
-1003328682684, # LogsPlanet
-1003204260194, # JDP
-1002828367761, # HesoyamCloud
-1003513974925, # Slurm Logs
-1003599300787, # Arhont Corp
-1002582513379, # OnlyLogs
-1002788333372, # Ickis Cloud
#"cloudxlog",
##-1001967030016, # daisycloud
##"berserklogs", # berserklogs
##"BorwitaFreeLogs", # borwita
#-1002748707556, # darkcloud
#-1001684073398, # BHF Cloud
#-1003163621939, # Wich Love from R
#-1003611713618, # Khazan Cloud
#-1003328682684, # LogsPlanet
#-1003204260194, # JDP
#-1002828367761, # HesoyamCloud
#-1003513974925, # Slurm Logs
#-1003599300787, # Arhont Corp
#-1002582513379, # OnlyLogs
#-1002788333372, # Ickis Cloud
-1002643355608, # Cloud URL
#-1001234567890, # private channel by ID
]
@@ -149,5 +154,5 @@ TDL_PERFILE = 4
TDL_AMOUNT = 4
# Whether to use a Telegram takeout session for downloads (lower flood limits).
# Takeout sessions are rate-limited differently good for bulk backfill.
# Takeout sessions are rate-limited differently - good for bulk backfill.
TDL_TAKEOUT = True

View File

@@ -1 +1 @@
"""core Telegram I/O pipeline (scraper, downloader, processor, notifier)."""
"""core - Telegram I/O pipeline (scraper, downloader, processor, notifier)."""

View File

@@ -1,5 +1,5 @@
"""
bot_downloader.py Handles "click to download" inline button flows.
bot_downloader.py - Handles "click to download" inline button flows.
Some Telegram channels post messages with a DOWNLOAD button that triggers
a bot to send you the actual file. This module simulates that click and

View File

@@ -1,5 +1,5 @@
"""
notifier.py Persists hits to disk and sends Telegram bot alerts.
notifier.py - Persists hits to disk and sends Telegram bot alerts.
Includes:
- Severity scoring via scorer.py
@@ -31,7 +31,7 @@ log = logging.getLogger(__name__)
MAX_PREVIEW = 10 # hits to show per severity group in alert
DEDUP_FILE = Path("./data/dedup.json")
# Only alert immediately for these severities LOW hits are silent
# Only alert immediately for these severities - LOW hits are silent
ALERT_SEVERITIES = {CRITICAL, HIGH, MEDIUM}
@@ -124,7 +124,7 @@ def write_hits(scored_hits: list, source: str) -> None:
def write_hits_csv(scored_hits: list, source: str, filename: str) -> None:
"""Append new hits to hits.csv one row per hit, easy to import."""
"""Append new hits to hits.csv - one row per hit, easy to import."""
HITS_CSV.parent.mkdir(parents=True, exist_ok=True)
write_header = not HITS_CSV.exists()
timestamp = _timestamp()
@@ -152,13 +152,13 @@ async def send_alert(
) -> None:
"""
Send a Telegram alert grouped by severity.
Only includes CRITICAL, HIGH, MEDIUM LOW hits are omitted from alerts.
Only includes CRITICAL, HIGH, MEDIUM - LOW hits are omitted from alerts.
"""
summary = summarize(scored_hits)
alertable = [h for h in scored_hits if h.severity in ALERT_SEVERITIES]
if not alertable:
log.info(" No alertable hits (all LOW) skipping Telegram notification.")
log.info(" No alertable hits (all LOW) - skipping Telegram notification.")
return
lines = [
@@ -210,7 +210,7 @@ async def notify(bot: TelegramClient, hits: list[str], source: str, filename: st
# Score first
scored = score_hits(hits)
log.info(f" Scored {len(scored)} hit(s) {summarize(scored)}")
log.info(f" Scored {len(scored)} hit(s) - {summarize(scored)}")
# Deduplicate
new_hits, dupe_hits = deduplicate(scored)
@@ -222,7 +222,7 @@ async def notify(bot: TelegramClient, hits: list[str], source: str, filename: st
insert_hits(dupe_hits, source, filename, seen_before=True)
if not new_hits:
log.info(" All hits already seen before no alert sent.")
log.info(" All hits already seen before - no alert sent.")
return
# Push hits to TUI

View File

@@ -54,8 +54,8 @@ Nested archives are recursed **one level** only.
## Password order
1. `extra_password` (from message/channel carry-forward) tried first
2. `config.ARCHIVE_PASSWORDS` tried in order
1. `extra_password` (from message/channel carry-forward) - tried first
2. `config.ARCHIVE_PASSWORDS` - tried in order
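
The ordering above can be sketched as a small helper. The names `ordered_passwords` and `first_working` are hypothetical, and skipping a duplicate when `extra_password` also appears in the configured list is an assumption of this sketch, not documented behaviour.

```python
from typing import Callable, Iterable, Optional


def ordered_passwords(
    extra_password: Optional[str], archive_passwords: Iterable[str]
) -> list[str]:
    """Return candidates in trial order: extra_password first, then the config list."""
    ordered: list[str] = []
    if extra_password:
        ordered.append(extra_password)
    for candidate in archive_passwords:
        if candidate not in ordered:   # assumption: do not retry a duplicate
            ordered.append(candidate)
    return ordered


def first_working(
    passwords: Iterable[str], try_extract: Callable[[str], bool]
) -> Optional[str]:
    """Try each password in order; return the first that try_extract accepts."""
    for candidate in passwords:
        if try_extract(candidate):
            return candidate
    return None
```

Here `try_extract` stands for whatever callable attempts the extraction with one password and reports success.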
---

View File

@@ -1,8 +1,8 @@
"""
processor.py Archive extraction and hit searching logic.
processor.py - Archive extraction and hit searching logic.
Supports: .txt, .zip, .7z, .rar
Stream-processes files line by line safe for large combo lists.
Stream-processes files line by line - safe for large combo lists.
"""
import rarfile
@@ -40,7 +40,7 @@ def compile_patterns(keywords: list[str]) -> list[re.Pattern]:
def search_file(filepath: Path, patterns: list[re.Pattern]) -> list[str]:
"""
Stream-reads a text file line by line and returns lines matching any pattern.
Ignores encoding errors combo files are often messy.
Ignores encoding errors - combo files are often messy.
"""
hits: list[str] = []
try:
@@ -82,7 +82,7 @@ def extract_zip(filepath: Path, dest: Path, extra_password: str | None = None) -
except RuntimeError:
log.info(" ZIP is password-protected, trying common passwords...")
if not _try_passwords(try_extract, ARCHIVE_PASSWORDS):
log.warning(f" Could not unlock {filepath.name} skipping.")
log.warning(f" Could not unlock {filepath.name} - skipping.")
return []
extracted = [p for p in dest.rglob("*") if p.is_file()]
@@ -95,7 +95,7 @@ def extract_zip(filepath: Path, dest: Path, extra_password: str | None = None) -
def extract_7z(filepath: Path, dest: Path, extra_password: str | None = None) -> list[Path]:
if not HAS_7Z:
log.warning("py7zr not installed skipping .7z file.")
log.warning("py7zr not installed - skipping .7z file.")
return []
extracted: list[Path] = []
passwords = ARCHIVE_PASSWORDS.copy()
@@ -119,7 +119,7 @@ def extract_7z(filepath: Path, dest: Path, extra_password: str | None = None) ->
except Exception:
continue
if not success:
log.warning(f" Could not unlock {filepath.name} skipping.")
log.warning(f" Could not unlock {filepath.name} - skipping.")
return []
extracted = [p for p in dest.rglob("*") if p.is_file()]
@@ -130,7 +130,7 @@ def extract_7z(filepath: Path, dest: Path, extra_password: str | None = None) ->
def extract_rar(filepath: Path, dest: Path, extra_password: str | None = None) -> list[Path]:
if not HAS_RAR:
log.warning("rarfile not installed skipping .rar file.")
log.warning("rarfile not installed - skipping .rar file.")
return []
passwords = ARCHIVE_PASSWORDS.copy()
@@ -150,7 +150,7 @@ def extract_rar(filepath: Path, dest: Path, extra_password: str | None = None) -
except Exception:
log.info(" RAR may be password-protected, trying common passwords...")
if not _try_passwords(try_extract, ARCHIVE_PASSWORDS):
log.warning(f" Could not unlock {filepath.name} skipping.")
log.warning(f" Could not unlock {filepath.name} - skipping.")
return []
extracted = [p for p in dest.rglob("*") if p.is_file()]
@@ -184,7 +184,7 @@ def unpack(filepath: Path, extra_password: str | None = None) -> tuple[list[Path
return files, extract_dir
else:
# Plain file return as-is, no extract dir to clean up
# Plain file - return as-is, no extract dir to clean up
return [filepath], None
@@ -207,7 +207,7 @@ def process_file(filepath: Path, patterns, password: str | None = None) -> list[
log.info(f"{len(hits)} hit(s) in {f.name}")
all_hits.extend(hits)
# Nested archives recurse one level
# Nested archives - recurse one level
elif f.suffix.lower() in {".zip", ".7z", ".rar"} and f != filepath:
log.info(f" → Nested archive: {f.name}")
nested_hits = process_file(f, patterns)

View File

@@ -11,7 +11,7 @@ from core.scraper import handle_message, backfill_all, register_handlers, warm_e
### `handle_message(client, bot, msg, source_name, patterns, password=None)`
**async.** Full pipeline for one document message:
1. Extract filename + size, check allowlist + size guard
2. Check `utils.cache` skip if already seen
2. Check `utils.cache` - skip if already seen
3. Try `tdl` download → Telethon fallback
4. `core.processor.process_file()` → hits
5. `core.notifier.notify()` if hits found

View File

@@ -1,5 +1,5 @@
"""
scraper.py Telethon user client.
scraper.py - Telethon user client.
Handles:
- Listening for new file messages in watched channels
@@ -99,7 +99,7 @@ async def _telethon_download(client: TelegramClient, msg, dest: Path, filename:
"""Download a single file via Telethon. Returns True on success."""
_bid = batch_id or f"telethon_{int(time.monotonic_ns())}"
if batch_id is None:
# Standalone call (not already queued by tdl path) post queued event
# Standalone call (not already queued by tdl path) - post queued event
bus.post(bus.EvDownloadQueued(
batch_id=_bid, filename=filename,
size_mb=round(size / (1024 * 1024), 2),
@@ -165,12 +165,12 @@ async def handle_message(
size = get_filesize(msg)
ok, reason = is_processable(filename, size)
if not ok:
log.warning(f" handle_message: skipping '{filename}' {reason}")
log.warning(f" handle_message: skipping '{filename}' - {reason}")
return
doc_id = msg.media.document.id
if is_seen(doc_id):
log.info(f" Skipping {filename} already processed.")
log.info(f" Skipping {filename} - already processed.")
return
dest = _make_dest(msg, filename)
@@ -180,7 +180,7 @@ async def handle_message(
downloaded = await download_single_with_tdl(msg, dest) if is_tdl_available() else False
if not downloaded:
if is_tdl_available():
log.warning(" [tdl] failed falling back to Telethon")
log.warning(" [tdl] failed - falling back to Telethon")
downloaded = await _telethon_download(client, msg, dest, filename, size)
if not downloaded:
@@ -307,7 +307,7 @@ async def backfill_channel(
ok, reason = is_processable(filename, size)
if not ok:
log.warning(f" [Backfill] Skipping '{filename}' {reason}")
log.warning(f" [Backfill] Skipping '{filename}' - {reason}")
continue
if is_seen(msg.media.document.id):
@@ -319,13 +319,13 @@ async def backfill_channel(
if len(batch) >= TDL_AMOUNT:
await flush_batch()
else:
# No tdl fall straight through to single handle_message
# No tdl - fall straight through to single handle_message
await handle_message(client, bot, msg, source_name, patterns, password=password)
total += 1
await asyncio.sleep(0.5)
elif msg.buttons and has_download_button(msg):
# Bot-button messages can't be batched handle individually
# Bot-button messages can't be batched - handle individually
await flush_batch() # flush any pending batch first
await handle_bot_download_message(client, bot, msg, source_name, patterns, password=password)
total += 1
@@ -339,7 +339,7 @@ async def backfill_channel(
except Exception as e:
log.error(f"[Backfill] Error scanning {channel}: {e}")
log.info(f"[Backfill] Done: {channel} {total} file(s) processed")
log.info(f"[Backfill] Done: {channel} - {total} file(s) processed")
async def backfill_all(

View File

@@ -22,7 +22,7 @@ Used by the live handler and `bot_downloader`.
### `download_batch_with_tdl(entries: list[BatchEntry]) -> dict[int, bool]`
**async.** Downloads up to `TDL_AMOUNT` messages in a single `tdl dl` invocation.
Returns `{doc_id: True|False}` `False` means Telethon fallback needed.
Returns `{doc_id: True|False}` - `False` means Telethon fallback needed.
---
@@ -55,7 +55,7 @@ In CLI mode: subprocess inherits the terminal, progress bars render natively.
Each batch/single download gets a unique `data/tmp/_tdl_{monotonic_ns}/` staging dir.
After `tdl` exits, files are matched by name (with fuzzy stem fallback for `filenamify()` mangling) and moved to final `dest`. Staging dir is removed regardless of outcome.
`--template '{{ filenamify .FileName }}'` tdl uses the original Telegram filename, not its default `DialogID_MessageID_filename` format.
`--template '{{ filenamify .FileName }}'` - tdl uses the original Telegram filename, not its default `DialogID_MessageID_filename` format.
---

View File

@@ -1,10 +1,10 @@
"""
tdl_downloader.py Fast file downloads via tdl (Go MTProto implementation).
tdl_downloader.py - Fast file downloads via tdl (Go MTProto implementation).
Install: https://github.com/iyear/tdl
curl -sSL https://raw.githubusercontent.com/iyear/tdl/main/scripts/install.sh | bash
First-time setup log in once:
First-time setup - log in once:
tdl login # saves to namespace "default"
tdl login -n myns # saves to a named namespace
@@ -77,7 +77,7 @@ def _build_cmd(urls: list[str], staging_dir: Path) -> list[str]:
(no DialogID_MessageID_ prefix).
--continue is kept so interrupted downloads resume rather than restart.
--skip-same is intentionally omitted deduplication is handled upstream
--skip-same is intentionally omitted - deduplication is handled upstream
by is_seen(), and --skip-same can cause the .tmp rename to fail when a
same-named file already exists in the directory.
"""
@@ -103,7 +103,7 @@ def _build_cmd(urls: list[str], staging_dir: Path) -> list[str]:
# ─── Runner ───────────────────────────────────────────────────────────────────
# ANSI escape stripper tdl emits colour codes even when not a TTY
# ANSI escape stripper - tdl emits colour codes even when not a TTY
import re as _re
_ANSI_RE = _re.compile(r"\x1b\[[0-9;]*[mGKHFJA-Z]|\x1b=|\x1b>|\x1b\[\?[0-9]+[hl]")
@@ -141,7 +141,7 @@ async def _run_tdl(cmd: list[str], label: str) -> bool:
buf += chunk.decode(errors="replace")
# Split on both \r and \n; process all complete segments
parts = _re.split(r"[\r\n]", buf)
# Last element may be an incomplete segment keep in buffer
# Last element may be an incomplete segment - keep in buffer
buf = parts[-1]
for part in parts[:-1]:
clean = _strip_ansi(part).strip()
@@ -163,7 +163,7 @@ async def _run_tdl(cmd: list[str], label: str) -> bool:
log.info(f"[tdl] ✓ {label}")
return True
else:
log.error(f"[tdl] ✗ exit {proc.returncode} {label}")
log.error(f"[tdl] ✗ exit {proc.returncode} - {label}")
return False
except FileNotFoundError:
log.error("[tdl] binary not found at runtime")
@@ -260,7 +260,7 @@ async def download_batch_with_tdl(entries: list[BatchEntry]) -> dict[int, bool]:
return {}
if not is_tdl_available():
log.warning("[tdl] not available all entries need Telethon fallback")
log.warning("[tdl] not available - all entries need Telethon fallback")
return {e.doc_id: False for e in entries}
urls: list[str] = []
@@ -327,7 +327,7 @@ async def download_single_with_tdl(msg, dest: Path) -> bool:
bot_downloader where batching doesn't apply.
"""
if not is_tdl_available():
log.warning("[tdl] not available falling back to Telethon")
log.warning("[tdl] not available - falling back to Telethon")
return False
try:

View File

@@ -1,5 +1,5 @@
"""
main.py Entry point for the ULP credential monitor.
main.py - Entry point for the ULP credential monitor.
Usage:
python main.py # TUI mode (default)
@@ -55,7 +55,7 @@ def _start_web_thread(host: str, port: int) -> threading.Thread:
# ─── Plain CLI mode ───────────────────────────────────────────────────────────
async def _cli_main():
"""Original asyncio main runs without the TUI."""
"""Original asyncio main - runs without the TUI."""
logging.getLogger().addHandler(logging.StreamHandler(sys.stdout))
from telethon import TelegramClient
@@ -64,7 +64,7 @@ async def _cli_main():
from core.scraper import backfill_all, register_handlers, warm_entity_cache
log.info("=" * 60)
log.info(" ULP Credential Monitor CLI mode")
log.info(" ULP Credential Monitor - CLI mode")
log.info("=" * 60)
patterns = compile_patterns(config.TARGET_KEYWORDS)

46
pyproject.toml Normal file
View File

@@ -0,0 +1,46 @@
[build-system]
requires = ["setuptools>=68"]
build-backend = "setuptools.build_meta"
[project]
name = "stealergram"
version = "0.1.0"
description = "Telegram channel monitor - downloads, extracts, scores, and alerts on credential leaks"
requires-python = ">=3.11"
dependencies = [
# Telegram
"telethon",
"tgcrypto",
# TUI
"textual",
# Config
"python-dotenv",
# Progress bars (CLI mode)
"tqdm",
# Archive extraction
"py7zr",
"rarfile",
]
[project.optional-dependencies]
web = [
"fastapi",
"uvicorn[standard]",
"jinja2",
"python-multipart",
"bcrypt",
"python-jose[cryptography]",
]
dev = [
"pytest",
]
[project.scripts]
stealergram = "main:main"
[tool.setuptools.packages.find]
where = ["."]
exclude = ["tests*", "data*", "logs*", "tmp*"]
[tool.pytest.ini_options]
testpaths = ["tests"]

View File

@@ -15,7 +15,7 @@ tqdm
py7zr
rarfile
# Web frontend (optional only needed with --web)
# Web frontend (optional - only needed with --web)
fastapi
uvicorn[standard]
jinja2

View File

@@ -7,7 +7,7 @@ os.environ.setdefault("API_HASH", "dummy_hash_for_tests")
os.environ.setdefault("BOT_TOKEN", "0:dummy_bot_token")
os.environ.setdefault("NOTIFY_CHAT_ID", "99999")
# Web frontend test defaults set once here so all web test files see the same values.
# Web frontend test defaults - set once here so all web test files see the same values.
os.environ.setdefault("WEB_SECRET_KEY", "test-secret-key-for-pytest")
os.environ.setdefault("WEB_ADMIN_USER", "superadmin")
os.environ.setdefault("WEB_ADMIN_PASS", "superpass")
@@ -17,8 +17,8 @@ import config
import utils.scorer as scorer
# Two test keywords:
# @testcorp\.com employee email domain (triggers CRITICAL)
# testcorp\.com plain domain match (triggers LOW baseline)
# @testcorp\.com - employee email domain (triggers CRITICAL)
# testcorp\.com - plain domain match (triggers LOW baseline)
TEST_KEYWORDS = [r"@testcorp\.com", r"testcorp\.com"]
@@ -29,7 +29,7 @@ def patched_keywords(monkeypatch):
scorer's module-level globals so scoring logic uses known test patterns.
scorer.py now reads _config.TARGET_KEYWORDS at call time via `import config as _config`,
so patching config.TARGET_KEYWORDS is sufficient no direct scorer patch needed.
so patching config.TARGET_KEYWORDS is sufficient - no direct scorer patch needed.
"""
monkeypatch.setattr(config, "TARGET_KEYWORDS", TEST_KEYWORDS)
monkeypatch.setattr(scorer, "EMPLOYEE_DOMAINS", scorer._build_employee_domains())

View File

@@ -1,5 +1,5 @@
"""
Tests for utils/cache.py file-ID deduplication cache.
Tests for utils/cache.py - file-ID deduplication cache.
Each test gets an isolated cache file via the `isolated_cache` fixture
so tests never touch data/cache.json.

View File

@@ -1,5 +1,5 @@
"""
Tests for utils/database.py SQLite persistence layer.
Tests for utils/database.py - SQLite persistence layer.
Each test gets an isolated in-memory-equivalent DB via the `isolated_db`
fixture so tests never touch data/hits.db.
@@ -112,7 +112,7 @@ def test_by_severity_returns_correct_severity():
def test_by_severity_excludes_duplicates():
"""seen_before=1 rows must be invisible to by_severity they are stored for stats only."""
"""seen_before=1 rows must be invisible to by_severity - they are stored for stats only."""
hit = make_hit(severity=HIGH, url="intranet.testcorp.com")
db_module.insert_hits([hit], source="c", filename="f.txt", seen_before=True)
assert db_module.by_severity(HIGH) == []

View File

@@ -1,5 +1,5 @@
"""
Tests for tui/events.py subscribe/unsubscribe broadcast, signal_channel_changed.
Tests for tui/events.py - subscribe/unsubscribe broadcast, signal_channel_changed.
"""
import queue

View File

@@ -1,5 +1,5 @@
"""
Tests for core/processor.py archive extraction and line-by-line search.
Tests for core/processor.py - archive extraction and line-by-line search.
No Telegram deps, no async. Tests create real archive fixtures in tmp_path
so process_file's cleanup guarantee can be verified against actual disk state.
@@ -60,7 +60,7 @@ class TestSearchFile:
assert search_file(f, patterns) == ["testcorp.com|user|pass"]
def test_handles_encoding_errors_gracefully(self, tmp_path, patterns):
"""Combo files are often messy invalid bytes must not crash the search."""
"""Combo files are often messy - invalid bytes must not crash the search."""
f = tmp_path / "combo.txt"
f.write_bytes(
b"testcorp.com|user1|pass\n"
@@ -81,7 +81,7 @@ class TestSearchFile:
assert len(hits) == 2
# ─── process_file plain .txt ────────────────────────────────────────────────
# ─── process_file - plain .txt ────────────────────────────────────────────────
class TestProcessFilePlainText:
def test_returns_hits(self, tmp_path, patterns):
@@ -104,7 +104,7 @@ class TestProcessFilePlainText:
assert not f.exists()
# ─── process_file .zip extraction ──────────────────────────────────────────
# ─── process_file - .zip extraction ──────────────────────────────────────────
class TestProcessFileZip:
def _make_zip(self, tmp_path: Path, content: str, filename="content.txt") -> Path:
@@ -155,7 +155,7 @@ class TestProcessFileZip:
assert len(hits) == 2
# ─── process_file nested archives ──────────────────────────────────────────
# ─── process_file - nested archives ──────────────────────────────────────────
class TestProcessFileNested:
def test_nested_zip_is_recursed(self, tmp_path, patterns):
@@ -177,7 +177,7 @@ class TestProcessFileNested:
assert not (tmp_path / "outer").exists()
# ─── process_file password-protected .7z ───────────────────────────────────
# ─── process_file - password-protected .7z ───────────────────────────────────
class TestProcessFile7zPassword:
def test_unlocks_with_correct_password(self, tmp_path, patterns, monkeypatch):
@@ -218,6 +218,6 @@ class TestProcessFile7zPassword:
z.write(txt, "content.txt")
txt.unlink()
# No hits archive could not be opened
# No hits - archive could not be opened
hits = process_file(szf, patterns)
assert hits == []

View File

@@ -1,10 +1,10 @@
"""
Tests for utils/scorer.py severity scoring and ULP line parsing.
Tests for utils/scorer.py - severity scoring and ULP line parsing.
All tests use the `patched_keywords` fixture (see conftest.py) which
replaces TARGET_KEYWORDS with two entries:
@testcorp.com employee email domain (CRITICAL trigger)
testcorp.com plain domain match (LOW baseline)
@testcorp.com - employee email domain (CRITICAL trigger)
testcorp.com - plain domain match (LOW baseline)
"""
import pytest
@@ -50,7 +50,7 @@ class TestULPParsingRealWorld:
@pytest.mark.parametrize("line,exp_url,exp_user,exp_pass", [
# ── Protocol + port + path, colon separator ──────────────────────────
# Port is digits followed by '/' must be consumed as part of the URL.
# Port is digits followed by '/' - must be consumed as part of the URL.
(
"http://portal.fakehosp.example.com:88/:55512309-1:hunter2",
"http://portal.fakehosp.example.com:88/", "55512309-1", "hunter2",
@@ -91,7 +91,7 @@ class TestULPParsingRealWorld:
"jdoe@fakehosp.example.com", "Passw0rd!",
),
# ── Pipe separator (unambiguous port stays in URL) ──────────────────
# ── Pipe separator (unambiguous - port stays in URL) ──────────────────
(
"http://portal.fakehosp.example.com:88/|22.987.654-3|florida88",
"http://portal.fakehosp.example.com:88/", "22.987.654-3", "florida88",
@@ -113,7 +113,7 @@ class TestULPParsingRealWorld:
"portal.fakehosp.example.com:88/", "22.987.654-3", "florida88",
),
# ── No protocol, no port plain colon separators ────────────────────
# ── No protocol, no port - plain colon separators ────────────────────
(
"booking.fakehosp.example.com:66778899-7:correcthorse",
"booking.fakehosp.example.com", "66778899-7", "correcthorse",
@@ -234,7 +234,7 @@ class TestWeakPasswordFlags:
assert any("Common password" in r for r in hit.reasons)
def test_weak_password_does_not_escalate_severity(self, patched_keywords):
"""Weak password flags are informational they must not change severity."""
"""Weak password flags are informational - they must not change severity."""
hit = score_hit("testcorp.com|user|abc")
assert hit.severity == LOW

View File

@@ -1,5 +1,5 @@
"""
Tests for web/auth.py JWT token lifecycle, bcrypt helpers.
Tests for web/auth.py - JWT token lifecycle, bcrypt helpers.
"""
import pytest

View File

@@ -1,5 +1,5 @@
"""
Tests for web/db.py user store and refresh token management.
Tests for web/db.py - user store and refresh token management.
"""
import pytest

View File

@@ -1 +1 @@
"""tui Textual TUI frontend and event bus."""
"""tui - Textual TUI frontend and event bus."""

View File

@@ -34,8 +34,8 @@ MonitorApp (App)
### Threading model
- **Bot backend** → `threading.Thread(daemon=True)` with its own `asyncio.new_event_loop()`
Runs `_bot_main()` Telethon is completely isolated from Textual's loop.
- **TUI drain** → `set_interval(0.1, _drain_bus)` polls `queue.Queue` every 100ms on Textual's loop.
Runs `_bot_main()` - Telethon is completely isolated from Textual's loop.
- **TUI drain** → `set_interval(0.1, _drain_bus)` - polls `queue.Queue` every 100ms on Textual's loop.
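
The threading model above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: `bot_main`, `run_bot`, `drain`, and `events` are hypothetical names, and the real backend runs Telethon where this sketch just posts three strings.

```python
import asyncio
import queue
import threading

# Thread-safe queue shared between the bot thread and the UI thread.
events: queue.Queue = queue.Queue()


async def bot_main() -> None:
    """Stand-in for the bot backend (hypothetical): posts a few status events."""
    for i in range(3):
        await asyncio.sleep(0)
        events.put_nowait(f"status {i}")


def run_bot() -> None:
    """Thread target: owns a private event loop, isolated from the UI loop."""
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        loop.run_until_complete(bot_main())
    finally:
        loop.close()


def drain(q: queue.Queue) -> list:
    """Pop everything currently queued without blocking (the polling step)."""
    drained = []
    while True:
        try:
            drained.append(q.get_nowait())
        except queue.Empty:
            return drained


worker = threading.Thread(target=run_bot, daemon=True)
worker.start()
worker.join()  # the sketch waits; a UI would call drain() on a timer instead
received = drain(events)
```

In a UI, `drain` would be called from a periodic timer on the UI loop rather than after `join()`, so the UI thread never blocks on the bot.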
### Key methods
@@ -105,7 +105,7 @@ Changes apply immediately (handler re-registered). Not persisted to `config.py`
- Validates regex before adding
- On change: rebuilds `utils.scorer.EMPLOYEE_DOMAINS` and `ORG_DOMAINS`
- Bot handler recompiles patterns on the next incoming message automatically
- **Changes are in-memory only** copy to `config.py` to persist
- **Changes are in-memory only** - copy to `config.py` to persist
---

View File

@@ -1,5 +1,5 @@
"""
tui.py Textual TUI for the ULP credential monitor.
tui.py - Textual TUI for the ULP credential monitor.
Layout (main screen):
┌──────────────────────────────────┬──────────────────────────────────┐
@@ -14,13 +14,13 @@ Layout (main screen):
└─────────────────────────────────────────────────────────────────────┘
Additional screens (push/pop via keybindings):
• SearchScreen full-text search across hits DB [s]
• HitsDBScreen paginated recent / severity viewer [h]
• KeywordsScreen live-edit TARGET_KEYWORDS regex list [k]
• SearchScreen - full-text search across hits DB [s]
• HitsDBScreen - paginated recent / severity viewer [h]
• KeywordsScreen - live-edit TARGET_KEYWORDS regex list [k]
Architecture:
- The entire bot backend runs as a Textual Worker (asyncio task inside the
TUI event loop no threading needed).
TUI event loop - no threading needed).
- A second Worker runs _bus_consumer(), reading events from tui_events.queue
and dispatching to the right panel.
- Channel add/remove from the UI immediately re-registers Telethon handlers
@@ -29,7 +29,7 @@ Architecture:
into the download panel's RichLog.
- StatsPanel polls database.stats() every 10 s via set_interval().
- Keyword changes are applied in-memory immediately (scorer caches rebuilt);
NOT auto-persisted to config.py a notice banner reminds the user.
NOT auto-persisted to config.py - a notice banner reminds the user.
- Live patterns are recompiled from config.TARGET_KEYWORDS on every message
so keyword changes take effect without a handler restart.
"""
@@ -88,7 +88,7 @@ def _now() -> str:
class DownloadPanel(Vertical):
"""
Left panel two sub-logs stacked vertically:
Left panel - two sub-logs stacked vertically:
• top: tdl raw output (stripped ANSI), scrolling
• bottom: our own structured status entries
"""
@@ -158,7 +158,7 @@ class DownloadPanel(Vertical):
# ─── Hits panel ───────────────────────────────────────────────────────────────
class HitsPanel(Vertical):
"""Right panel scrollable color-coded hit log with live counter badge."""
"""Right panel - scrollable color-coded hit log with live counter badge."""
hit_count: reactive[int] = reactive(0)
@@ -208,7 +208,7 @@ class HitsPanel(Vertical):
class StatsPanel(Horizontal):
"""
Slim bar shows live DB stats, refreshed every 10 s.
Slim bar - shows live DB stats, refreshed every 10 s.
Also refreshed immediately whenever a new hit arrives.
"""
@@ -233,14 +233,14 @@ class StatsPanel(Horizontal):
def compose(self) -> ComposeResult:
yield Static("📊 DB Stats", id="stat-label")
yield Static("🔴 ", classes="stat-critical", id="stat-critical")
yield Static("🟠 ", classes="stat-high", id="stat-high")
yield Static("🟡 ", classes="stat-medium", id="stat-medium")
yield Static("🟢 ", classes="stat-low", id="stat-low")
yield Static("total: ", id="stat-total")
yield Static("unique: ", id="stat-unique")
yield Static("dupes: ", id="stat-dupes")
yield Static("sources: ", id="stat-sources")
yield Static("🔴 - ", classes="stat-critical", id="stat-critical")
yield Static("🟠 - ", classes="stat-high", id="stat-high")
yield Static("🟡 - ", classes="stat-medium", id="stat-medium")
yield Static("🟢 - ", classes="stat-low", id="stat-low")
yield Static("total: - ", id="stat-total")
yield Static("unique: - ", id="stat-unique")
yield Static("dupes: - ", id="stat-dupes")
yield Static("sources: - ", id="stat-sources")
def on_mount(self) -> None:
self.set_interval(10, self.refresh_stats)
@@ -266,7 +266,7 @@ class StatsPanel(Horizontal):
class ChannelPanel(Vertical):
"""
Bottom panel live-editable channel list.
Bottom panel - live-editable channel list.
Changes are applied immediately (Telethon handlers are re-registered).
To make them permanent, edit config.py's WATCHED_CHANNELS manually.
@@ -314,7 +314,7 @@ class ChannelPanel(Vertical):
def compose(self) -> ComposeResult:
yield Label(
"📡 Channels changes apply immediately | edit config.py to persist",
"📡 Channels - changes apply immediately | edit config.py to persist",
classes="panel-title",
)
with Horizontal(classes="controls"):
@@ -524,7 +524,7 @@ class HitsDBScreen(Screen):
status,
)
self.query_one("#db-status", Label).update(
f" {len(rows)} row(s) {label}"
f" {len(rows)} row(s) - {label}"
)
def _load_recent(self) -> None:
@@ -560,7 +560,7 @@ class KeywordsScreen(Screen):
• scorer's domain caches are rebuilt
• The bot handler recompiles patterns on the next message automatically
Changes are NOT written back to config.py a notice banner says so.
Changes are NOT written back to config.py - a notice banner says so.
"""
BINDINGS = [Binding("escape", "dismiss", "Back")]
@@ -601,7 +601,7 @@ class KeywordsScreen(Screen):
yield Header()
yield Label("🔑 Keyword / Pattern Editor", classes="screen-title")
yield Label(
"⚠ Changes are in-memory only copy patterns to config.py to persist across restarts.",
"⚠ Changes are in-memory only - copy patterns to config.py to persist across restarts.",
classes="notice",
)
with Horizontal(id="kw-controls"):
@@ -671,7 +671,7 @@ class KeywordsScreen(Screen):
except Exception as e:
log.warning(f"Could not rebuild scorer caches: {e}")
bus.post(bus.EvStatus(
f"Keywords updated {len(config.TARGET_KEYWORDS)} pattern(s) active"
f"Keywords updated - {len(config.TARGET_KEYWORDS)} pattern(s) active"
))
def action_dismiss(self) -> None:
@@ -721,7 +721,7 @@ class MonitorApp(App):
# The bot backend runs in its own thread with its own asyncio event
# loop, completely isolated from Textual. Telethon spawns background
# tasks via asyncio.ensure_future() and calls connect() which returns
# only after its receiver loop is scheduled both of these deadlock
# only after its receiver loop is scheduled - both of these deadlock
# inside Textual's managed loop. Running in a dedicated thread
# sidesteps all of that.
#
@@ -767,7 +767,7 @@ class MonitorApp(App):
"""
Called every 100 ms by set_interval(). Drains all pending events
from the thread-safe queue and dispatches them to the right widget.
Runs on Textual's event loop safe to call widget methods directly.
Runs on Textual's event loop - safe to call widget methods directly.
"""
q = bus.get_bus()
if q is None:
@@ -854,7 +854,7 @@ class MonitorApp(App):
async def _bot_main(self) -> None:
"""
Full bot backend runs inside the bot thread's own event loop.
Full bot backend - runs inside the bot thread's own event loop.
Telethon is free to schedule background tasks without interfering
with Textual's loop.
"""
@@ -870,7 +870,7 @@ class MonitorApp(App):
patterns = compile_patterns(config.TARGET_KEYWORDS)
bus.post(bus.EvStatus(
f"Starting {len(config.WATCHED_CHANNELS)} channel(s), "
f"Starting - {len(config.WATCHED_CHANNELS)} channel(s), "
f"{len(patterns)} pattern(s)"
))
@@ -894,9 +894,9 @@ class MonitorApp(App):
await user_client.connect()
log.info("[bot] user_client connected, checking auth...")
if not await user_client.is_user_authorized():
log.error("[bot] user_client not authorized run: python main.py --no-tui")
log.error("[bot] user_client not authorized - run: python main.py --no-tui")
bus.post(bus.EvStatus(
"Not authorized run --no-tui once to complete login",
"Not authorized - run --no-tui once to complete login",
level="error",
))
return
@@ -962,7 +962,7 @@ class MonitorApp(App):
log.info(f"[bot] Handler registered for {len(channels)} channel(s)")
bus.post(bus.EvStatus(f"Watching {len(channels)} channel(s)"))
# Channel-change event lives on this (bot) loop.
# Channel-change event - lives on this (bot) loop.
# Textual signals it thread-safely via _signal_channel_changed().
_ch_changed = asyncio.Event()
self._bot_loop_channel_event = _ch_changed
@@ -971,7 +971,7 @@ class MonitorApp(App):
bus.post(bus.EvStatus("Live listener active"))
await backfill_all(user_client, bot_client, patterns)
bus.post(bus.EvStatus("Backfill complete monitoring live"))
bus.post(bus.EvStatus("Backfill complete - monitoring live"))
async def _watch_channels():
while True:
@@ -1009,7 +1009,7 @@ class MonitorApp(App):
# ─── Entry point ──────────────────────────────────────────────────────────────
def run_tui() -> None:
# Do NOT call bus.init_bus() here the Queue must be created inside
# Do NOT call bus.init_bus() here - the Queue must be created inside
# Textual's event loop (see MonitorApp.on_mount). Calling it here
# would bind the Queue to the outer loop which is discarded when
# App.run() creates a new one.

View File

@@ -14,11 +14,11 @@ from tui.events import set_bot_context, signal_channel_changed
```
### `init_bus() -> queue.Queue`
Creates the `queue.Queue`. Called inside `MonitorApp.on_mount()` **must run on Textual's event loop**, not before `App.run()`.
Creates the `queue.Queue`. Called inside `MonitorApp.on_mount()` - **must run on Textual's event loop**, not before `App.run()`.
### `post(event: Any) -> None`
Fire-and-forget from any thread. Delivers to the TUI queue **and** all subscriber queues.
Uses `queue.Queue.put_nowait()` never blocks.
Uses `queue.Queue.put_nowait()` - never blocks.
### `get_bus() -> queue.Queue | None`
Returns the TUI queue for `_drain_bus()` to consume.
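
A rough sketch of how the three functions above could fit together, assuming an unbounded `queue.Queue` and a lock guarding the module state. The `subscribe()` helper and the `_subscribers` list are assumptions added for illustration; the document only says that subscriber queues exist.

```python
import queue
import threading
from typing import Any, List, Optional

_queue: Optional[queue.Queue] = None
_subscribers: List[queue.Queue] = []  # hypothetical fan-out targets
_lock = threading.Lock()


def init_bus() -> queue.Queue:
    """Create the TUI queue once and return it on every later call."""
    global _queue
    with _lock:
        if _queue is None:
            _queue = queue.Queue()
        return _queue


def subscribe() -> queue.Queue:
    """Hypothetical helper: register an extra consumer queue."""
    q: queue.Queue = queue.Queue()
    with _lock:
        _subscribers.append(q)
    return q


def post(event: Any) -> None:
    """Fire-and-forget: deliver to the TUI queue and every subscriber."""
    with _lock:
        targets = ([_queue] if _queue is not None else []) + list(_subscribers)
    for q in targets:
        q.put_nowait(event)  # unbounded queue, so this cannot raise queue.Full


def get_bus() -> Optional[queue.Queue]:
    """Return the TUI queue, or None if init_bus() has not run yet."""
    return _queue
```

Copying the target list under the lock and calling `put_nowait` outside it keeps the critical section short, so a slow consumer cannot delay other posters.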

View File

@@ -1,5 +1,5 @@
"""
tui_events.py Thread-safe event bus between the bot backend and the TUI.
tui_events.py - Thread-safe event bus between the bot backend and the TUI.
The bot backend runs in a dedicated thread with its own asyncio event loop
(completely isolated from Textual's loop). Events are posted via a standard
@@ -18,7 +18,7 @@ import threading
from dataclasses import dataclass, field
from typing import Any
# Thread-safe queue works across the bot thread and Textual's thread.
# Thread-safe queue - works across the bot thread and Textual's thread.
_queue: queue.Queue | None = None
_queue_lock = threading.Lock()

View File

@@ -1 +1 @@
"""utils pure logic modules with no Telegram dependencies."""
"""utils - pure logic modules with no Telegram dependencies."""

View File

@@ -11,7 +11,7 @@ from utils.cache import is_seen, mark_seen
### `is_seen(file_id: int) -> bool`
Returns `True` if this document ID has been processed before.
Loads from disk on every call (safe for multi-process, slightly slow for hot loops not an issue given download cadence).
Loads from disk on every call (safe for multi-process, slightly slow for hot loops - not an issue given download cadence).
### `mark_seen(file_id: int) -> None`
Adds `file_id` to the cache and persists to disk.
@@ -21,12 +21,12 @@ Adds `file_id` to the cache and persists to disk.
## Storage
- **File:** `data/cache.json`
- **Format:** JSON array of integers `[123456789, 987654321, ...]`
- **No expiry** grows indefinitely. Safe to delete to re-process all files.
- **Format:** JSON array of integers - `[123456789, 987654321, ...]`
- **No expiry** - grows indefinitely. Safe to delete to re-process all files.
---
## Notes
- `is_seen` + `mark_seen` are called in `core/scraper.py` after a successful download+process cycle, not before so a file that fails mid-process will be retried on next run.
- `is_seen` + `mark_seen` are called in `core/scraper.py` after a successful download+process cycle, not before - so a file that fails mid-process will be retried on next run.
- Not thread-safe (load/modify/save is not atomic). Acceptable because downloads are sequential within the bot loop.
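
The load-on-every-call behaviour described above could look roughly like this. It is a sketch under stated assumptions, not the module's actual code: `CACHE_PATH` points at a temporary directory here instead of `data/cache.json`, and the `_load` helper is hypothetical.

```python
import json
import os
import tempfile

# Assumption: a throwaway location for the sketch; the doc names data/cache.json.
CACHE_PATH = os.path.join(tempfile.mkdtemp(), "cache.json")


def _load(path: str = CACHE_PATH) -> set:
    """Read the JSON array of integers from disk; a missing file means empty."""
    try:
        with open(path, encoding="utf-8") as fh:
            return set(json.load(fh))
    except FileNotFoundError:
        return set()


def is_seen(file_id: int, path: str = CACHE_PATH) -> bool:
    """Reload from disk on each call, so other processes' writes are visible."""
    return file_id in _load(path)


def mark_seen(file_id: int, path: str = CACHE_PATH) -> None:
    """Load, add, save. Not atomic, matching the thread-safety note above."""
    seen = _load(path)
    seen.add(file_id)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(sorted(seen), fh)
```

Deleting the file resets the cache, since `_load` treats a missing file as an empty set.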

View File

@@ -1,5 +1,5 @@
"""
cache.py Tracks already-processed file IDs to avoid redownloading.
cache.py - Tracks already-processed file IDs to avoid redownloading.
Persists to a simple JSON file on disk.
"""

View File

@@ -85,5 +85,5 @@ Indexes: `url`, `username`, `source`, `timestamp`, `severity`.
## Notes
- Each query opens and closes its own connection via the `_connect()` context manager.
- `conn.row_factory = sqlite3.Row` rows support both index and column-name access.
- `conn.row_factory = sqlite3.Row` - rows support both index and column-name access.
- Transactions: commit on success, rollback on exception.
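
A minimal sketch of a connection helper with the properties listed above, written as an illustration and not copied from `database.py`. The `connect` name, the two-column table, and the temporary `DB_PATH` are assumptions made for the example.

```python
import os
import sqlite3
import tempfile
from contextlib import contextmanager

# Assumption: a throwaway database file so separate connections see the same data.
DB_PATH = os.path.join(tempfile.mkdtemp(), "hits.db")


@contextmanager
def connect(path: str = DB_PATH):
    """Open a connection per use; commit on success, roll back on error."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row  # rows allow index and column-name access
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        conn.close()


with connect() as conn:
    conn.execute("CREATE TABLE hits (url TEXT, username TEXT)")
    conn.execute("INSERT INTO hits VALUES (?, ?)", ("example.com", "jdoe"))

with connect() as conn:
    row = conn.execute("SELECT url, username FROM hits").fetchone()
```

After the second block, `row[0]` and `row["username"]` both work because of `sqlite3.Row`.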

View File

@@ -1,5 +1,5 @@
"""
database.py SQLite storage for credential hits.
database.py - SQLite storage for credential hits.
Schema:
hits table:

View File

@@ -51,7 +51,7 @@ Check 6 (no severity change): flags weak passwords ≤6 chars or common strings.
## Employee domain matching
Keywords in `config.TARGET_KEYWORDS` containing `@` become employee patterns.
Pattern: `@<domain>(?:[^a-zA-Z0-9.\-]|$)` requires literal `@` before the domain.
Pattern: `@<domain>(?:[^a-zA-Z0-9.\-]|$)` - requires literal `@` before the domain.
**`user@gmail.com` on a URL containing `myorg.cl` does NOT trigger CRITICAL.**
Keywords without `@` go only to `ORG_DOMAINS` (LOW baseline).
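
To illustrate the anchoring, here is a small sketch that builds the employee pattern from a keyword. `employee_pattern` is a hypothetical helper name, and the use of `re.escape` on the domain is an assumption; only the pattern shape comes from the text above.

```python
import re


def employee_pattern(keyword: str) -> re.Pattern:
    """Build `@<domain>(?:[^a-zA-Z0-9.\\-]|$)` for an '@'-prefixed keyword."""
    domain = keyword.lstrip("@")
    return re.compile("@" + re.escape(domain) + r"(?:[^a-zA-Z0-9.\-]|$)")


pat = employee_pattern("@testcorp.com")

# Employee email as the username: the literal '@' precedes the domain.
employee_hit = pat.search("https://portal.testcorp.com:jdoe@testcorp.com:pw")

# Org domain only in the URL, personal email as username: no match.
personal_hit = pat.search("https://testcorp.com/login:user@gmail.com:pw")

# Look-alike domain that merely starts with the org domain: no match.
lookalike_hit = pat.search("jdoe@testcorp.com.evil.example:pw")
```

The trailing group is what rejects the look-alike case: the character after the domain must not be one that could continue a hostname.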
@@ -64,11 +64,11 @@ Separators: `:` `;` `,` `|` `\t` (any of these between the three fields).
The URL field handles two common stealer-log complications:
1. **`://` not treated as separator** the optional scheme prefix `(?:https?|ftp)://` is consumed before the character-class match, so `https://` never gets split at the colon.
1. **`://` not treated as separator** - the optional scheme prefix `(?:https?|ftp)://` is consumed before the character-class match, so `https://` never gets split at the colon.
2. **Port + path consumed into the URL** the optional group `(?::\d+/[^\s:;,|\t]*)` absorbs `:port/path` when the port is pure digits immediately followed by `/`. This correctly handles `http://host:8085/path/:user:pass` but intentionally skips patterns like `:24145487-8` (RUT number hyphen after digits, no `/`).
2. **Port + path consumed into the URL** - the optional group `(?::\d+/[^\s:;,|\t]*)` absorbs `:port/path` when the port is pure digits immediately followed by `/`. This correctly handles `http://host:8085/path/:user:pass` but intentionally skips patterns like `:24145487-8` (RUT number - hyphen after digits, no `/`).
**Known limitation:** A bare port with no path (e.g. `https://host:8080:user:pass`) will mis-parse `8080` as the username. This is not observed in practice stealer logs always include at least a trailing `/`.
**Known limitation:** A bare port with no path (e.g. `https://host:8080:user:pass`) will mis-parse `8080` as the username. This is not observed in practice - stealer logs always include at least a trailing `/`.
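
The two rules and the known limitation can be reproduced with a regex assembled from the fragments quoted above. This is a sketch, not the scorer's actual pattern: `ULP_RE` and `parse_ulp` are hypothetical names, and treating the password as "the rest of the line" is an assumption.

```python
import re
from typing import Optional, Tuple

_SEP = r"[:;,|\t]"          # any one of the documented separators
_FIELD = r"[^\s:;,|\t]"     # a character that cannot be a separator

ULP_RE = re.compile(
    r"^("
    r"(?:(?:https?|ftp)://)?"   # optional scheme, consumed before any split
    rf"{_FIELD}+"               # host, plus path when there is no port
    rf"(?::\d+/{_FIELD}*)?"     # optional :port/path (digits followed by '/')
    r")"
    rf"{_SEP}({_FIELD}+)"       # username
    rf"{_SEP}(.+)$"             # password: rest of the line (assumption)
)


def parse_ulp(line: str) -> Optional[Tuple[str, str, str]]:
    """Return (url, username, password), or None if the line does not parse."""
    m = ULP_RE.match(line.strip())
    return (m.group(1), m.group(2), m.group(3)) if m else None


with_port = parse_ulp("http://portal.fakehosp.example.com:88/:55512309-1:hunter2")
no_port = parse_ulp("booking.fakehosp.example.com:66778899-7:correcthorse")
bare_port = parse_ulp("https://host:8080:user:pass")  # the known limitation
```

With this sketch, `bare_port` comes back with `8080` in the username slot, which is the mis-parse described above.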
---
@@ -79,7 +79,7 @@ The URL field handles two common stealer-log complications:
| `EMPLOYEE_DOMAINS` | `list[tuple[str, Pattern]]` | `(domain_str, anchored_pattern)` for `@`-keywords |
| `ORG_DOMAINS` | `list[Pattern]` | Plain domain patterns for all keywords |
scorer uses `import config as _config` (not `from config import TARGET_KEYWORDS`), so patching `config.TARGET_KEYWORDS` at runtime is sufficient `_build_*` reads the live module attribute.
scorer uses `import config as _config` (not `from config import TARGET_KEYWORDS`), so patching `config.TARGET_KEYWORDS` at runtime is sufficient - `_build_*` reads the live module attribute.
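
The difference between the two import styles can be shown with a stand-in module. This is only an illustration of Python's name binding: the `config` object below is built with `types.ModuleType`, and `read_live` and `bound_at_import` are hypothetical names.

```python
import types

# Stand-in for the real config module (assumption for the sketch).
config = types.ModuleType("config")
config.TARGET_KEYWORDS = ["old-pattern"]

# Equivalent of `from config import TARGET_KEYWORDS`: the name is bound once
# to whatever list the module attribute pointed at during import.
bound_at_import = config.TARGET_KEYWORDS


def read_live() -> list:
    """Equivalent of `import config as _config`: attribute looked up per call."""
    return config.TARGET_KEYWORDS


# A runtime patch that rebinds the module attribute to a new list.
config.TARGET_KEYWORDS = ["new-pattern"]

live_value = read_live()
stale_value = bound_at_import
```

Only the attribute lookup sees the patched list, which is why rebinding the module attribute is enough when the consumer reads it through the module object.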
To rebuild after editing `config.TARGET_KEYWORDS` at runtime:
```python

View File

@@ -1,24 +1,24 @@
"""
scorer.py Severity scoring for credential hits.
scorer.py - Severity scoring for credential hits.
Scoring logic (highest match wins):
CRITICAL Employee credentials (internal email domain)
CRITICAL - Employee credentials (internal email domain)
e.g. jdoe@yourclinic.cl:password
Admin/privileged service URLs
- Admin/privileged service URLs
e.g. admin., vpn., ssh., rdp., gitlab., jira.
HIGH Internal-facing services
HIGH - Internal-facing services
e.g. intranet., erp., crm., portal., citrix.
Password manager or SSO hits
Any credential where username looks like an employee email
- Password manager or SSO hits
- Any credential where username looks like an employee email
MEDIUM Client-facing portals
MEDIUM - Client-facing portals
e.g. app., patient., client., booking.
Domain match on a non-privileged service
- Domain match on a non-privileged service
LOW Generic domain keyword match
No URL parsed, just a raw domain mention
LOW - Generic domain keyword match
- No URL parsed, just a raw domain mention
Each scored hit gets a dict with:
- severity: CRITICAL / HIGH / MEDIUM / LOW

View File

@@ -1,5 +1,5 @@
"""
web/app.py FastAPI application factory.
web/app.py - FastAPI application factory.
Usage:
from web.app import create_app

View File

@@ -1,9 +1,9 @@
"""
web/auth.py JWT signing/verification and bcrypt password helpers.
web/auth.py - JWT signing/verification and bcrypt password helpers.
Tokens:
access HS256, 15 min TTL, payload: {sub, role, type:"access"}
refresh HS256, 7 day TTL, payload: {sub, jti, type:"refresh"}
access - HS256, 15 min TTL, payload: {sub, role, type:"access"}
refresh - HS256, 7 day TTL, payload: {sub, jti, type:"refresh"}
Both tokens live in httpOnly SameSite=Strict cookies.
The `type` claim prevents an access token being used as a refresh token.

View File

@@ -1,9 +1,9 @@
"""
web/db.py SQLite user store for the web frontend.
web/db.py - SQLite user store for the web frontend.
Tables:
users credentials + role + active flag
refresh_tokens JTI-indexed refresh token revocation list
users - credentials + role + active flag
refresh_tokens - JTI-indexed refresh token revocation list
Bootstrap: on first init, creates a superadmin from WEB_ADMIN_USER / WEB_ADMIN_PASS
env vars (required only on first run if the DB doesn't exist yet).
@@ -63,7 +63,9 @@ def init_db() -> None:
admin_pass = os.environ.get("WEB_ADMIN_PASS")
if not admin_pass:
raise RuntimeError(
"WEB_ADMIN_PASS env var is required on first run to create the superadmin."
"WEB_ADMIN_PASS env var is required on first run to bootstrap the superadmin. "
"Add WEB_ADMIN_PASS=<password> (and optionally WEB_ADMIN_USER=<username>) "
"to your .env file, then restart."
)
conn.execute(
"INSERT INTO users (id, username, password_hash, role, created_at) VALUES (?,?,?,?,?)",

View File

@@ -1,5 +1,5 @@
"""
web/dependencies.py FastAPI dependency functions.
web/dependencies.py - FastAPI dependency functions.
get_current_user: reads the access_token cookie, decodes + validates it,
loads the user row from web.db. Raises 401 if anything fails.

View File

@@ -1,5 +1,5 @@
"""
web/models.py Pydantic request/response schemas.
web/models.py - Pydantic request/response schemas.
"""
import re

View File

@@ -1,9 +1,9 @@
"""
web/routes/auth.py Login, logout, token refresh.
web/routes/auth.py - Login, logout, token refresh.
POST /login form submit; sets access_token + refresh_token cookies
POST /logout revokes refresh token, clears cookies
POST /refresh exchanges refresh_token cookie for a new access_token
POST /login - form submit; sets access_token + refresh_token cookies
POST /logout - revokes refresh token, clears cookies
POST /refresh - exchanges refresh_token cookie for a new access_token
"""
from fastapi import APIRouter, Form, HTTPException, Request, Response, status

View File

@@ -1,5 +1,5 @@
"""
web/routes/config_routes.py Keyword groups and channel list management.
web/routes/config_routes.py - Keyword groups and channel list management.
GET /config/keywords → render groups editor
PUT /config/keywords → validate + save groups, reload scorer

View File

@@ -1,5 +1,5 @@
"""
web/routes/dashboard.py Dashboard views and SSE live stream.
web/routes/dashboard.py - Dashboard views and SSE live stream.
GET / → redirect to /dashboard
GET /dashboard → overview: all groups, stats, live hit feed

View File

@@ -1,5 +1,5 @@
"""
web/routes/users.py User CRUD (superadmin only).
web/routes/users.py - User CRUD (superadmin only).
GET /users → list all users
POST /users → create a new user