Rename to stealergram, add pyproject.toml, purge em-dashes

- Rename project to stealergram throughout
- Add pyproject.toml (replaces requirements.txt split, folds pytest.ini)
- Replace all em-dashes with hyphens across all source files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-19 10:06:30 -04:00
parent 4c104cddd2
commit 741e6bb0d3
46 changed files with 244 additions and 191 deletions
--- a/utils/__init__.py
+++ b/utils/__init__.py
@@ -1 +1 @@
-"""utils — pure logic modules with no Telegram dependencies."""
+"""utils - pure logic modules with no Telegram dependencies."""
--- a/utils/cache.md
+++ b/utils/cache.md
@@ -11,7 +11,7 @@ from utils.cache import is_seen, mark_seen

 ### `is_seen(file_id: int) -> bool`
 Returns `True` if this document ID has been processed before.  
-Loads from disk on every call (safe for multi-process, slightly slow for hot loops — not an issue given download cadence).
+Loads from disk on every call (safe for multi-process, slightly slow for hot loops - not an issue given download cadence).

 ### `mark_seen(file_id: int) -> None`
 Adds `file_id` to the cache and persists to disk.
@@ -21,12 +21,12 @@ Adds `file_id` to the cache and persists to disk.
 ## Storage

 - **File:** `data/cache.json`
- **Format:** JSON array of integers — `[123456789, 987654321, ...]`
- **No expiry** — grows indefinitely. Safe to delete to re-process all files.
+- **Format:** JSON array of integers - `[123456789, 987654321, ...]`
+- **No expiry** - grows indefinitely. Safe to delete to re-process all files.

 ---

 ## Notes

- `is_seen` + `mark_seen` are called in `core/scraper.py` after a successful download+process cycle, not before — so a file that fails mid-process will be retried on next run.
+- `is_seen` + `mark_seen` are called in `core/scraper.py` after a successful download+process cycle, not before - so a file that fails mid-process will be retried on next run.
 - Not thread-safe (load/modify/save is not atomic). Acceptable because downloads are sequential within the bot loop.
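
For reference, the load/modify/save cycle that `utils/cache.md` describes could look roughly like the sketch below. It is not the code in `utils/cache.py`: the `_load` helper and the extra `path` parameter are assumptions added here so the example can run against a temporary file. Only the function names, the `data/cache.json` location, and the JSON-array-of-integers format come from the documentation.

```python
import json
from pathlib import Path

CACHE_PATH = Path("data/cache.json")  # location stated in cache.md


def _load(path: Path = CACHE_PATH) -> set[int]:
    """Read the JSON array from disk; a missing file means nothing was seen yet."""
    try:
        return set(json.loads(path.read_text()))
    except FileNotFoundError:
        return set()


def is_seen(file_id: int, path: Path = CACHE_PATH) -> bool:
    # Reloads from disk on every call, as the docs describe.
    return file_id in _load(path)


def mark_seen(file_id: int, path: Path = CACHE_PATH) -> None:
    # Load, modify, save: three separate steps, hence not atomic.
    seen = _load(path)
    seen.add(file_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(sorted(seen)))
```

Because the read and the write are separate steps, two concurrent writers could each load the same array and overwrite one another's addition, which is the non-atomicity the note above refers to.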
--- a/utils/cache.py
+++ b/utils/cache.py
@@ -1,5 +1,5 @@
 """
-cache.py — Tracks already-processed file IDs to avoid redownloading.
+cache.py - Tracks already-processed file IDs to avoid redownloading.
 Persists to a simple JSON file on disk.
 """

--- a/utils/database.md
+++ b/utils/database.md
@@ -85,5 +85,5 @@ Indexes: `url`, `username`, `source`, `timestamp`, `severity`.
 ## Notes

 - Each query opens and closes its own connection via the `_connect()` context manager.
- `conn.row_factory = sqlite3.Row` — rows support both index and column-name access.
+- `conn.row_factory = sqlite3.Row` - rows support both index and column-name access.
 - Transactions: commit on success, rollback on exception.
--- a/utils/database.py
+++ b/utils/database.py
@@ -1,5 +1,5 @@
 """
-database.py — SQLite storage for credential hits.
+database.py - SQLite storage for credential hits.

 Schema:
  hits table:
--- a/utils/scorer.md
+++ b/utils/scorer.md
@@ -51,7 +51,7 @@ Check 6 (no severity change): flags weak passwords ≤6 chars or common strings.
 ## Employee domain matching

 Keywords in `config.TARGET_KEYWORDS` containing `@` become employee patterns.  
-Pattern: `@<domain>(?:[^a-zA-Z0-9.\-]|$)` — requires literal `@` before the domain.  
+Pattern: `@<domain>(?:[^a-zA-Z0-9.\-]|$)` - requires literal `@` before the domain.  
 **`user@gmail.com` on a URL containing `myorg.cl` does NOT trigger CRITICAL.**

 Keywords without `@` go only to `ORG_DOMAINS` (LOW baseline).
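
The anchored employee pattern quoted above can be tried in isolation. The sketch below assumes a hypothetical helper name, `build_employee_pattern`, and assumes the domain is regex-escaped before being placed in the template; neither detail is confirmed by `scorer.py`.

```python
import re


def build_employee_pattern(domain: str) -> re.Pattern:
    """Build `@<domain>(?:[^a-zA-Z0-9.\\-]|$)` for one domain (hypothetical helper)."""
    # The trailing group demands a non-hostname character or end of string,
    # so a longer hostname such as "myorg.cl.evil.com" does not match.
    return re.compile("@" + re.escape(domain) + r"(?:[^a-zA-Z0-9.\-]|$)")


pattern = build_employee_pattern("myorg.cl")

# Employee address followed by a separator: matches.
print(bool(pattern.search("jdoe@myorg.cl:hunter2")))
# Org domain appears only in the URL, the address is gmail: no match.
print(bool(pattern.search("https://myorg.cl/login:user@gmail.com:pw")))
```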
@@ -64,11 +64,11 @@ Separators: `:` `;` `,` `|` `\t` (any of these between the three fields).

 The URL field handles two common stealer-log complications:

-1. **`://` not treated as separator** — the optional scheme prefix `(?:https?|ftp)://` is consumed before the character-class match, so `https://` never gets split at the colon.
+1. **`://` not treated as separator** - the optional scheme prefix `(?:https?|ftp)://` is consumed before the character-class match, so `https://` never gets split at the colon.

-2. **Port + path consumed into the URL** — the optional group `(?::\d+/[^\s:;,|\t]*)` absorbs `:port/path` when the port is pure digits immediately followed by `/`. This correctly handles `http://host:8085/path/:user:pass` but intentionally skips patterns like `:24145487-8` (RUT number — hyphen after digits, no `/`).
+2. **Port + path consumed into the URL** - the optional group `(?::\d+/[^\s:;,|\t]*)` absorbs `:port/path` when the port is pure digits immediately followed by `/`. This correctly handles `http://host:8085/path/:user:pass` but intentionally skips patterns like `:24145487-8` (RUT number - hyphen after digits, no `/`).

-**Known limitation:** A bare port with no path (e.g. `https://host:8080:user:pass`) will mis-parse `8080` as the username. This is not observed in practice — stealer logs always include at least a trailing `/`.
+**Known limitation:** A bare port with no path (e.g. `https://host:8080:user:pass`) will mis-parse `8080` as the username. This is not observed in practice - stealer logs always include at least a trailing `/`.

 ---

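
To make the URL-field rules above concrete, here is a minimal sketch that assembles the quoted fragments into one line pattern. The overall shape is an assumption: the names `LINE_RE` and `parse_line`, the `user` field definition, and the use of `\S+` for the password are illustrative and may differ from the real regex in `scorer.py`.

```python
import re

FIELD = r"[^\s:;,|\t]+"   # one field: no whitespace, no separator characters
SEP = r"[:;,|\t]"         # the separators listed in scorer.md

LINE_RE = re.compile(
    r"^(?P<url>"
    r"(?:(?:https?|ftp)://)?"       # optional scheme, consumed before the field match
    + FIELD
    + r"(?::\d+/[^\s:;,|\t]*)?"     # optional :port/path, digits must be followed by "/"
    + r")"
    + SEP + r"(?P<user>" + FIELD + r")"
    + SEP + r"(?P<password>\S+)$"   # assumption: the password is the rest of the line
)


def parse_line(line: str):
    """Return url/user/password for one log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


print(parse_line("http://host:8085/path/:user:pass"))
print(parse_line("example.cl/login:24145487-8:secret"))
print(parse_line("https://host:8080:user:pass"))  # bare port: the known limitation
```

With this sketch the first line keeps `:8085/path/` inside the URL, the second leaves the RUT-style value `24145487-8` as the username, and the third reproduces the documented mis-parse by reporting `8080` as the username.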
@@ -79,7 +79,7 @@ The URL field handles two common stealer-log complications:
 | `EMPLOYEE_DOMAINS` | `list[tuple[str, Pattern]]` | `(domain_str, anchored_pattern)` for `@`-keywords |
 | `ORG_DOMAINS` | `list[Pattern]` | Plain domain patterns for all keywords |

-scorer uses `import config as _config` (not `from config import TARGET_KEYWORDS`), so patching `config.TARGET_KEYWORDS` at runtime is sufficient — `_build_*` reads the live module attribute.
+scorer uses `import config as _config` (not `from config import TARGET_KEYWORDS`), so patching `config.TARGET_KEYWORDS` at runtime is sufficient - `_build_*` reads the live module attribute.

 To rebuild after editing `config.TARGET_KEYWORDS` at runtime:
 ```python
--- a/utils/scorer.py
+++ b/utils/scorer.py
@@ -1,24 +1,24 @@
 """
-scorer.py — Severity scoring for credential hits.
+scorer.py - Severity scoring for credential hits.

 Scoring logic (highest match wins):

-  CRITICAL  — Employee credentials (internal email domain)
+  CRITICAL - Employee credentials (internal email domain)
                e.g. jdoe@yourclinic.cl:password
-              — Admin/privileged service URLs
+ - Admin/privileged service URLs
                e.g. admin., vpn., ssh., rdp., gitlab., jira.

-  HIGH      — Internal-facing services
+  HIGH - Internal-facing services
                e.g. intranet., erp., crm., portal., citrix.
-              — Password manager or SSO hits
-              — Any credential where username looks like an employee email
+ - Password manager or SSO hits
+ - Any credential where username looks like an employee email

-  MEDIUM    — Client-facing portals
+  MEDIUM - Client-facing portals
                e.g. app., patient., client., booking.
-              — Domain match on a non-privileged service
+ - Domain match on a non-privileged service

-  LOW       — Generic domain keyword match
-              — No URL parsed, just a raw domain mention
+  LOW - Generic domain keyword match
+ - No URL parsed, just a raw domain mention

 Each scored hit gets a dict with:
  - severity:    CRITICAL / HIGH / MEDIUM / LOW