feat(clustering): wire high-weight edges end-to-end

The connected-components clusterer now writes attacker_identities rows
and sets attackers.identity_id when high-weight signals (JA3 / HASSH /
payload-hash / C2-endpoint exact match) agree across observations.
Singletons stay un-fingerprinted and un-clustered.

Algorithm split:
- cluster_observations(observations): pure union-find over the
  high-weight edge function. Same code path for fixture validation and
  production tick.
- from_attacker_row(row): production-row adapter; recovers JA3 + HASSH
  from Attacker.fingerprints JSON. Payload + C2 join from logs in later
  commits; the function shape doesn't change.

Repo additions on BaseRepository + SQLModelRepository:
- list_attackers_for_clustering(limit=None)
- create_attacker_identity(row)
- set_attacker_identity_id(attacker_uuid, identity_uuid)

DummyRepo coverage stub updated.

v1 behavior is conservative: only assigns identities to observations
whose identity_id is currently NULL. Multi-identity components are
skipped this pass; merge / re-assign lands in commit 10 with revocable
merges.

Fixture bounds tightened against the production clusterer:
- lone_wolf (F3): singletons stay singletons
- shared_wordlist (F1): credential-only overlap doesn't cluster
  (high-weight tier doesn't include credentials)
- vpn_hopping (F2, identity-level): 5 rotated IPs with stable JA3 +
  HASSH fold into one identity, ARI = 1.0, completeness = 1.0
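
For readers unfamiliar with the approach, here is a minimal sketch of
union-find clustering over exact-match high-weight signals. It is not
the code in decnet.clustering.impl.connected_components. The Observation
dataclass, its field names, and the rule "any single shared signal value
forms an edge" are assumptions made for illustration; the real edge
function and input shape may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """Hypothetical input shape; the real clusterer's type may differ."""
    uuid: str
    ja3: Optional[str] = None
    hassh: Optional[str] = None
    payload_hash: Optional[str] = None
    c2_endpoint: Optional[str] = None


# Assumed high-weight tier, taken from the signal list in the message above.
HIGH_WEIGHT_FIELDS = ("ja3", "hassh", "payload_hash", "c2_endpoint")


def cluster_observations(observations):
    """Return lists of uuids that share an exact high-weight signal value.

    Assumption: one matching signal is enough to link two observations.
    Singletons are omitted from the result.
    """
    parent = {obs.uuid: obs.uuid for obs in observations}

    def find(x):
        # Walk to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first observation seen for each (field, value) pair and
    # union every later observation carrying the same pair into it.
    first_seen = {}
    for obs in observations:
        for field in HIGH_WEIGHT_FIELDS:
            value = getattr(obs, field)
            if not value:
                continue  # missing or empty signals never create edges
            key = (field, value)
            if key in first_seen:
                union(first_seen[key], obs.uuid)
            else:
                first_seen[key] = obs.uuid

    components = {}
    for obs in observations:
        components.setdefault(find(obs.uuid), []).append(obs.uuid)
    return [members for members in components.values() if len(members) > 1]
```

Indexing by (field, value) avoids comparing every pair of observations:
each observation is unioned with the first holder of each signal it
carries, so the pass stays close to linear in the number of signals.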
2026-04-26 08:19:56 -04:00
parent a9775c4000
commit de2f4c3a62
5 changed files with 631 additions and 23 deletions
--- a/decnet/web/db/repository.py
+++ b/decnet/web/db/repository.py
@@ -406,6 +406,49 @@ class BaseRepository(ABC):
        """Total ``Attacker`` rows FK'd to this identity."""
        pass

+    # ─── Identity resolution writes (clusterer worker) ─────────────────────
+    # Populated by ``decnet clusterer``. The read-only API on top of
+    # ``attacker_identities`` shipped in commit ``dc3d08d``; this is the
+    # write side. See ``decnet.clustering.impl.connected_components``.
+
+    @abstractmethod
+    async def list_attackers_for_clustering(
+        self, limit: Optional[int] = None,
+    ) -> list[dict[str, Any]]:
+        """Project every ``Attacker`` into the clusterer's input shape.
+
+        Returns dicts with at least ``uuid``, ``asn``, ``identity_id``,
+        and ``fingerprints`` (raw JSON list). The clusterer parses the
+        fingerprints list to recover JA3 / HASSH per observation. Empty
+        list when no attackers exist.
+
+        ``limit`` is optional — passed by callers that want to bound a
+        single tick's working set; leave ``None`` to fetch all.
+        """
+        pass
+
+    @abstractmethod
+    async def create_attacker_identity(self, row: dict[str, Any]) -> str:
+        """Insert a new ``AttackerIdentity`` row and return its uuid.
+
+        ``row`` must include ``uuid``; other fields are optional and
+        default per the model. Caller is responsible for generating
+        the uuid (so it can be used in the same tick to back-link
+        observations without a second round-trip).
+        """
+        pass
+
+    @abstractmethod
+    async def set_attacker_identity_id(
+        self, attacker_uuid: str, identity_uuid: str,
+    ) -> None:
+        """Set ``attackers.identity_id`` on a single observation row.
+
+        Idempotent — re-setting the same value is a no-op. Used by
+        the clusterer when it links an observation to an identity.
+        """
+        pass
+
    @abstractmethod
    async def get_attacker_commands(
        self,
--- a/decnet/web/db/sqlmodel_repo.py
+++ b/decnet/web/db/sqlmodel_repo.py
@@ -1468,6 +1468,52 @@ class SQLModelRepository(BaseRepository):
            result = await session.execute(statement)
            return result.scalar() or 0

+    # ─── Identity resolution writes (clusterer worker) ─────────────────────
+
+    async def list_attackers_for_clustering(
+        self, limit: Optional[int] = None,
+    ) -> list[dict[str, Any]]:
+        # Project the columns the clusterer's similarity graph reads.
+        # Keep it narrow so future denormalised projections (payloads
+        # joined from logs, c2 endpoints aggregated from sessions) can
+        # land here without churning every caller. ``fingerprints`` is
+        # the raw JSON list — the clusterer parses for JA3 / HASSH.
+        statement = select(
+            Attacker.uuid, Attacker.asn, Attacker.identity_id, Attacker.fingerprints,
+        ).order_by(Attacker.first_seen)
+        if limit is not None:
+            statement = statement.limit(limit)
+        async with self._session() as session:
+            result = await session.execute(statement)
+            return [
+                {
+                    "uuid": row.uuid,
+                    "asn": row.asn,
+                    "identity_id": row.identity_id,
+                    "fingerprints": row.fingerprints,
+                }
+                for row in result.all()
+            ]
+
+    async def create_attacker_identity(self, row: dict[str, Any]) -> str:
+        identity = AttackerIdentity(**row)
+        async with self._session() as session:
+            session.add(identity)
+            await session.commit()
+        return identity.uuid
+
+    async def set_attacker_identity_id(
+        self, attacker_uuid: str, identity_uuid: str,
+    ) -> None:
+        statement = (
+            update(Attacker)
+            .where(Attacker.uuid == attacker_uuid)
+            .values(identity_id=identity_uuid)
+        )
+        async with self._session() as session:
+            await session.execute(statement)
+            await session.commit()
+
    async def get_attacker_commands(
        self,
        uuid: str,