feat(clustering): wire high-weight edges end-to-end

The connected-components clusterer now writes attacker_identities rows and sets attackers.identity_id when high-weight signals (JA3 / HASSH / payload-hash / C2-endpoint exact match) agree across observations. Singletons stay un-fingerprinted and un-clustered.

Algorithm split:
- cluster_observations(observations): pure union-find over the high-weight edge function. Same code path for fixture validation and production tick.
- from_attacker_row(row): production-row adapter; recovers JA3 + HASSH from Attacker.fingerprints JSON. Payload + C2 join from logs in later commits; the function shape doesn't change.

Repo additions on BaseRepository + SQLModelRepository:
- list_attackers_for_clustering(limit=None)
- create_attacker_identity(row)
- set_attacker_identity_id(attacker_uuid, identity_uuid)

DummyRepo coverage stub updated.

v1 behavior is conservative: only assigns identities to observations whose identity_id is currently NULL. Multi-identity components are skipped this pass; merge / re-assign lands in commit 10 with revocable merges.

Fixture bounds tightened against the production clusterer:
- lone_wolf (F3): singletons stay singletons
- shared_wordlist (F1): credential-only overlap doesn't cluster (high-weight tier doesn't include credentials)
- vpn_hopping (F2, identity-level): 5 rotated IPs with stable JA3 + HASSH fold into one identity, ARI = 1.0, completeness = 1.0
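
Illustrative sketch of the union-find step described above, not the code in this commit. The Observation dataclass, its field names, and HIGH_WEIGHT_FIELDS are hypothetical stand-ins for whatever shape the real clusterer uses. It also assumes that a single exact match on any one high-weight signal is enough to create an edge, which the message does not spell out.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


# Hypothetical observation shape; the real one is not shown in this commit.
@dataclass(frozen=True)
class Observation:
    uuid: str
    ja3: Optional[str] = None
    hassh: Optional[str] = None
    payload_hash: Optional[str] = None
    c2_endpoint: Optional[str] = None
    credentials: tuple = ()  # low-weight signal, deliberately ignored below


# Assumed high-weight tier: exact-match signals only, no credentials.
HIGH_WEIGHT_FIELDS = ("ja3", "hassh", "payload_hash", "c2_endpoint")


def cluster_observations(observations):
    """Return multi-member components as sorted lists of observation uuids."""
    parent = {o.uuid: o.uuid for o in observations}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Index the first observation seen for each (signal, value) pair and
    # union every later observation carrying the same value into it.
    first_seen = {}
    for o in observations:
        for field in HIGH_WEIGHT_FIELDS:
            value = getattr(o, field)
            if not value:
                continue  # missing signal contributes no edge
            key = (field, value)
            if key in first_seen:
                union(first_seen[key], o.uuid)
            else:
                first_seen[key] = o.uuid

    groups = defaultdict(list)
    for o in observations:
        groups[find(o.uuid)].append(o.uuid)

    # Singletons stay un-clustered, matching the behavior described above.
    return [sorted(members) for members in groups.values() if len(members) > 1]


if __name__ == "__main__":
    # vpn_hopping-style input: five rotated IPs sharing JA3 + HASSH.
    rotated = [Observation(uuid=f"ip{i}", ja3="j", hassh="h") for i in range(5)]
    print(cluster_observations(rotated))
```

Under these assumptions, credential-only overlap never creates an edge because credentials are not in HIGH_WEIGHT_FIELDS, which is the property the shared_wordlist fixture checks.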
2026-04-26 08:19:56 -04:00
parent a9775c4000
commit de2f4c3a62
5 changed files with 631 additions and 23 deletions
--- a/tests/db/test_base_repo.py
+++ b/tests/db/test_base_repo.py
@@ -66,6 +66,9 @@ class DummyRepo(BaseRepository):
    async def count_identities(self): await super().count_identities(); return 0
    async def list_observations_for_identity(self, u, limit=50, offset=0): await super().list_observations_for_identity(u, limit, offset); return []
    async def count_observations_for_identity(self, u): await super().count_observations_for_identity(u); return 0
+    async def list_attackers_for_clustering(self, limit=None): await super().list_attackers_for_clustering(limit); return []
+    async def create_attacker_identity(self, row): await super().create_attacker_identity(row); return ""
+    async def set_attacker_identity_id(self, a, i): await super().set_attacker_identity_id(a, i)

@pytest.mark.asyncio
 async def test_base_repo_coverage():
@@ -133,6 +136,9 @@ async def test_base_repo_coverage():
    await dr.count_identities()
    await dr.list_observations_for_identity("a")
    await dr.count_observations_for_identity("a")
+    await dr.list_attackers_for_clustering()
+    await dr.create_attacker_identity({"uuid": "i"})
+    await dr.set_attacker_identity_id("a", "i")

    # Swarm methods: default NotImplementedError on BaseRepository.  Covering
    # them here keeps the coverage contract honest for the swarm CRUD surface.