feat(clustering): wire high-weight edges end-to-end

The connected-components clusterer now writes attacker_identities
rows + sets attackers.identity_id when high-weight signals (JA3 /
HASSH / payload-hash / C2-endpoint exact match) agree across
observations. Singletons stay un-fingerprinted and un-clustered.

Algorithm split:
- cluster_observations(observations) — pure union-find over the
  high-weight edge function. Same code path for fixture validation
  and production tick.
- from_attacker_row(row) — production-row adapter; recovers JA3 +
  HASSH from Attacker.fingerprints JSON. Payload-hash and C2-endpoint
  signals get joined from logs in later commits; the function
  signature doesn't change.
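
A minimal sketch of the union-find step described above, not the code in
this commit. The Observation dataclass, its field names, and the return
shape (a list of uuid lists) are hypothetical. The edge rule used here
(one shared non-empty high-weight value is enough to link two
observations) is also an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


# Hypothetical observation shape; the real field names are not shown here.
@dataclass(frozen=True)
class Observation:
    uuid: str
    ja3: Optional[str] = None
    hassh: Optional[str] = None
    payload_hash: Optional[str] = None
    c2_endpoint: Optional[str] = None


HIGH_WEIGHT_FIELDS = ("ja3", "hassh", "payload_hash", "c2_endpoint")


def cluster_observations(observations):
    """Group observation uuids into connected components."""
    parent = {o.uuid: o.uuid for o in observations}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Assumption: an exact match on any single signal draws an edge.
    first_seen = {}
    for o in observations:
        for field in HIGH_WEIGHT_FIELDS:
            value = getattr(o, field)
            if not value:
                continue  # a missing signal never links anything
            key = (field, value)
            if key in first_seen:
                union(first_seen[key], o.uuid)
            else:
                first_seen[key] = o.uuid

    components = defaultdict(list)
    for o in observations:
        components[find(o.uuid)].append(o.uuid)
    return list(components.values())
```

Singletons come back as one-element components in this sketch; a caller
would skip them so they stay un-clustered.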

Repo additions on BaseRepository + SQLModelRepository:
- list_attackers_for_clustering(limit=None)
- create_attacker_identity(row)
- set_attacker_identity_id(attacker_uuid, identity_uuid)
DummyRepo coverage stub updated.
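
For orientation, a sketch of how the three methods could look on an
abstract base with an in-memory implementation. ClusteringRepo and
InMemoryRepo are hypothetical names, rows are modelled as plain dicts,
and the return types are only inferred from what the DummyRepo stub
returns (a list, a string, nothing).

```python
import asyncio
import uuid as uuidlib
from abc import ABC, abstractmethod
from typing import Any, Optional


class ClusteringRepo(ABC):
    """Hypothetical stand-in for the clustering slice of BaseRepository."""

    @abstractmethod
    async def list_attackers_for_clustering(
        self, limit: Optional[int] = None
    ) -> list[dict[str, Any]]: ...

    @abstractmethod
    async def create_attacker_identity(self, row: dict[str, Any]) -> str: ...

    @abstractmethod
    async def set_attacker_identity_id(
        self, attacker_uuid: str, identity_uuid: str
    ) -> None: ...


class InMemoryRepo(ClusteringRepo):
    """Dict-backed implementation, for illustration only."""

    def __init__(self, attackers):
        self.attackers = {a["uuid"]: dict(a) for a in attackers}
        self.identities = {}

    async def list_attackers_for_clustering(self, limit=None):
        rows = list(self.attackers.values())
        return rows if limit is None else rows[:limit]

    async def create_attacker_identity(self, row):
        # Assumption: the caller may supply the uuid; otherwise mint one.
        identity_uuid = row.get("uuid") or str(uuidlib.uuid4())
        self.identities[identity_uuid] = dict(row, uuid=identity_uuid)
        return identity_uuid

    async def set_attacker_identity_id(self, attacker_uuid, identity_uuid):
        self.attackers[attacker_uuid]["identity_id"] = identity_uuid
```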

v1 behavior is conservative: identities are assigned only to
observations whose identity_id is currently NULL. Multi-identity
components are skipped in this pass; merge / re-assign lands in
commit 10 with revocable merges.
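
The conservative policy above could be sketched as a planning step like
the one below. plan_assignments and its argument shapes are hypothetical.
One behavior is an assumption the text does not settle: a component with
exactly one existing identity extends that identity to its NULL members.

```python
def plan_assignments(components, identity_of):
    """Decide which attackers get an identity in a v1-style pass.

    components:  list of lists of attacker uuids
    identity_of: dict mapping attacker uuid -> identity uuid or None
    Returns a list of (target_identity, uuids_to_assign) tuples, where
    target_identity is None when a new identity should be created.
    """
    plans = []
    for members in components:
        if len(members) < 2:
            continue  # singletons stay un-clustered
        existing = {
            identity_of[m] for m in members if identity_of[m] is not None
        }
        if len(existing) > 1:
            continue  # multi-identity component: skipped, no merge in v1
        unassigned = [m for m in members if identity_of[m] is None]
        if not unassigned:
            continue  # nothing with a NULL identity_id to fill in
        # Assumption: reuse the single existing identity if there is one.
        target = next(iter(existing)) if existing else None
        plans.append((target, unassigned))
    return plans
```

Existing non-NULL identity_id values are never overwritten in this
sketch, which matches the NULL-only rule stated above.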

Fixture bounds tightened against the production clusterer:
- lone_wolf (F3) — singletons stay singletons
- shared_wordlist (F1) — credential-only overlap doesn't cluster
  (high-weight tier doesn't include credentials)
- vpn_hopping (F2, identity-level) — 5 rotated IPs with stable JA3
  + HASSH fold into one identity, ARI = 1.0, completeness = 1.0
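
For reference, ARI can be computed by pair counting as in the sketch
below. The scorer the fixtures actually use is not shown in this commit,
so the function name and the label encoding here are assumptions.
Completeness is not covered by this sketch.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, predicted):
    """Adjusted Rand index between two labelings of the same items."""
    if len(truth) != len(predicted):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(truth), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on

    # Pairs that share a cluster in both labelings, in truth, in predicted.
    index = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    sum_truth = sum(comb(c, 2) for c in Counter(truth).values())
    sum_pred = sum(comb(c, 2) for c in Counter(predicted).values())

    expected = sum_truth * sum_pred / total_pairs
    max_index = (sum_truth + sum_pred) / 2
    if max_index == expected:
        # Degenerate case: both labelings are one single cluster, or both
        # are all singletons, so they are identical.
        return 1.0
    return (index - expected) / (max_index - expected)
```

Five rotated IPs that all land in one predicted cluster score 1.0 here
through the degenerate branch. Splitting them across clusters, with other
attackers present, drops the score below 1.0.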
2026-04-26 08:19:56 -04:00
parent a9775c4000
commit de2f4c3a62
5 changed files with 631 additions and 23 deletions

@@ -66,6 +66,9 @@ class DummyRepo(BaseRepository):
     async def count_identities(self): await super().count_identities(); return 0
     async def list_observations_for_identity(self, u, limit=50, offset=0): await super().list_observations_for_identity(u, limit, offset); return []
     async def count_observations_for_identity(self, u): await super().count_observations_for_identity(u); return 0
+    async def list_attackers_for_clustering(self, limit=None): await super().list_attackers_for_clustering(limit); return []
+    async def create_attacker_identity(self, row): await super().create_attacker_identity(row); return ""
+    async def set_attacker_identity_id(self, a, i): await super().set_attacker_identity_id(a, i)
 @pytest.mark.asyncio
 async def test_base_repo_coverage():
@@ -133,6 +136,9 @@ async def test_base_repo_coverage():
     await dr.count_identities()
     await dr.list_observations_for_identity("a")
     await dr.count_observations_for_identity("a")
+    await dr.list_attackers_for_clustering()
+    await dr.create_attacker_identity({"uuid": "i"})
+    await dr.set_attacker_identity_id("a", "i")
     # Swarm methods: default NotImplementedError on BaseRepository. Covering
     # them here keeps the coverage contract honest for the swarm CRUD surface.