feat(clustering): wire high-weight edges end-to-end

The connected-components clusterer now writes attacker_identities
rows + sets attackers.identity_id when high-weight signals (JA3 /
HASSH / payload-hash / C2-endpoint exact match) agree across
observations. Singletons stay un-fingerprinted and un-clustered.

Algorithm split:
- cluster_observations(observations) — pure union-find over the
  high-weight edge function. Same code path for fixture validation
  and production tick.
- from_attacker_row(row) — production-row adapter; recovers JA3 +
  HASSH from Attacker.fingerprints JSON. Payload + C2 join from
  logs in later commits; the function shape doesn't change.
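
A minimal sketch of the union-find pass described above, not the shipped implementation. The `Observation` dataclass, its field names, and the `_sketch` suffix are hypothetical. The sketch also assumes that a single exact match on any non-null high-weight signal is enough to form an edge; the real edge function may require more.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """Hypothetical observation shape; field names are assumptions."""
    uuid: str
    ja3: Optional[str] = None
    hassh: Optional[str] = None
    payload_hash: Optional[str] = None
    c2_endpoint: Optional[str] = None


HIGH_WEIGHT_FIELDS = ("ja3", "hassh", "payload_hash", "c2_endpoint")


def cluster_observations_sketch(observations):
    """Group observation uuids into connected components via union-find."""
    parent = {o.uuid: o.uuid for o in observations}

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # (field, value) -> uuid of the first observation carrying that value.
    # Unioning with the first carrier avoids an O(n^2) pairwise comparison.
    first_carrier = {}
    for o in observations:
        for field in HIGH_WEIGHT_FIELDS:
            value = getattr(o, field)
            if value is None:
                continue  # a missing signal never creates an edge
            key = (field, value)
            if key in first_carrier:
                union(first_carrier[key], o.uuid)
            else:
                first_carrier[key] = o.uuid

    components = {}
    for o in observations:
        components.setdefault(find(o.uuid), []).append(o.uuid)
    return list(components.values())
```

Because the function takes plain observations and returns plain uuid lists, the same call can serve both fixture validation and a production tick, which is the property the split above is after.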

Repo additions on BaseRepository + SQLModelRepository:
- list_attackers_for_clustering(limit=None)
- create_attacker_identity(row)
- set_attacker_identity_id(attacker_uuid, identity_uuid)
DummyRepo coverage stub updated.
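
For illustration, an in-memory stand-in for the three methods might look like the sketch below. `InMemoryClusterRepo` and `demo_tick` are hypothetical names and this is not the project's DummyRepo; only the method names and signatures come from this commit.

```python
import asyncio
import uuid
from typing import Any, Optional


class InMemoryClusterRepo:
    """Hypothetical test double mirroring the three new repo methods."""

    def __init__(self, attackers: list[dict[str, Any]]) -> None:
        self._attackers = {a["uuid"]: dict(a) for a in attackers}
        self.identities: dict[str, dict[str, Any]] = {}

    async def list_attackers_for_clustering(
        self, limit: Optional[int] = None,
    ) -> list[dict[str, Any]]:
        # Return copies so callers cannot mutate the stored rows.
        rows = [dict(r) for r in self._attackers.values()]
        return rows if limit is None else rows[:limit]

    async def create_attacker_identity(self, row: dict[str, Any]) -> str:
        # The caller supplies the uuid, as the abstract method requires.
        self.identities[row["uuid"]] = dict(row)
        return row["uuid"]

    async def set_attacker_identity_id(
        self, attacker_uuid: str, identity_uuid: str,
    ) -> None:
        row = self._attackers.get(attacker_uuid)
        if row is not None:  # unknown uuid is a no-op, like an UPDATE matching 0 rows
            row["identity_id"] = identity_uuid


async def demo_tick() -> list[dict[str, Any]]:
    repo = InMemoryClusterRepo([
        {"uuid": "a1", "asn": 64500, "identity_id": None, "fingerprints": "[]"},
        {"uuid": "a2", "asn": 64501, "identity_id": None, "fingerprints": "[]"},
    ])
    identity_uuid = await repo.create_attacker_identity({"uuid": str(uuid.uuid4())})
    for row in await repo.list_attackers_for_clustering():
        await repo.set_attacker_identity_id(row["uuid"], identity_uuid)
    return await repo.list_attackers_for_clustering()
```

Running `asyncio.run(demo_tick())` returns both rows back-linked to the same freshly generated identity uuid.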

v1 behavior is conservative: the clusterer only assigns identities to
observations whose identity_id is currently NULL. Components that
already span more than one identity are skipped in this pass; merge /
re-assign lands in commit 10 with revocable merges.
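
The policy above could be sketched roughly as follows. `plan_assignments` and its arguments are hypothetical names. The handling of a component that already carries exactly one identity (its NULL members join that identity) is an assumption that the message does not confirm.

```python
import uuid
from typing import Optional


def plan_assignments(
    components: list[list[str]],
    identity_of: dict[str, Optional[str]],
) -> tuple[list[str], list[tuple[str, str]]]:
    """Return (new identity uuids, (attacker_uuid, identity_uuid) links)."""
    new_identities: list[str] = []
    links: list[tuple[str, str]] = []
    for members in components:
        if len(members) < 2:
            continue  # singletons stay un-clustered
        existing = {identity_of[m] for m in members if identity_of[m] is not None}
        if len(existing) > 1:
            continue  # multi-identity component: skipped in v1
        if existing:
            # ASSUMPTION: NULL members adopt the single existing identity.
            target = next(iter(existing))
        else:
            target = str(uuid.uuid4())
            new_identities.append(target)
        # Only rows whose identity_id is currently NULL are ever written.
        links.extend((m, target) for m in members if identity_of[m] is None)
    return new_identities, links
```

Keeping the plan as a pure function separates the decision from the repository writes, so each returned link can be replayed through set_attacker_identity_id.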

Fixture bounds tightened against the production clusterer:
- lone_wolf (F3) — singletons stay singletons
- shared_wordlist (F1) — credential-only overlap doesn't cluster
  (high-weight tier doesn't include credentials)
- vpn_hopping (F2, identity-level) — 5 rotated IPs with stable JA3
  + HASSH fold into one identity, ARI = 1.0, completeness = 1.0
2026-04-26 08:19:56 -04:00
parent a9775c4000
commit de2f4c3a62
5 changed files with 631 additions and 23 deletions

@@ -406,6 +406,49 @@ class BaseRepository(ABC):
        """Total ``Attacker`` rows FK'd to this identity."""
        pass

    # ─── Identity resolution writes (clusterer worker) ─────────────────────
    # Populated by ``decnet clusterer``. The read-only API on top of
    # ``attacker_identities`` shipped in commit ``dc3d08d``; this is the
    # write side. See ``decnet.clustering.impl.connected_components``.

    @abstractmethod
    async def list_attackers_for_clustering(
        self, limit: Optional[int] = None,
    ) -> list[dict[str, Any]]:
        """Project every ``Attacker`` into the clusterer's input shape.

        Returns dicts with at least ``uuid``, ``asn``, ``identity_id``,
        and ``fingerprints`` (raw JSON list). The clusterer parses the
        fingerprints list to recover JA3 / HASSH per observation. Empty
        list when no attackers exist.

        ``limit`` is optional; callers pass it to bound a single tick's
        working set, and ``None`` fetches all rows.
        """
        pass

    @abstractmethod
    async def create_attacker_identity(self, row: dict[str, Any]) -> str:
        """Insert a new ``AttackerIdentity`` row and return its uuid.

        ``row`` must include ``uuid``; other fields are optional and
        default per the model. Caller is responsible for generating
        the uuid (so it can be used in the same tick to back-link
        observations without a second round-trip).
        """
        pass

    @abstractmethod
    async def set_attacker_identity_id(
        self, attacker_uuid: str, identity_uuid: str,
    ) -> None:
        """Set ``attackers.identity_id`` on a single observation row.

        Idempotent: re-setting the same value is a no-op. Used by
        the clusterer when it links an observation to an identity.
        """
        pass

    @abstractmethod
    async def get_attacker_commands(
        self,

@@ -1468,6 +1468,52 @@ class SQLModelRepository(BaseRepository):
            result = await session.execute(statement)
            return result.scalar() or 0

    # ─── Identity resolution writes (clusterer worker) ─────────────────────

    async def list_attackers_for_clustering(
        self, limit: Optional[int] = None,
    ) -> list[dict[str, Any]]:
        # Project the columns the clusterer's similarity graph reads.
        # Keep it narrow so future denormalised projections (payloads
        # joined from logs, c2 endpoints aggregated from sessions) can
        # land here without churning every caller. ``fingerprints`` is
        # the raw JSON list; the clusterer parses it for JA3 / HASSH.
        statement = select(
            Attacker.uuid, Attacker.asn, Attacker.identity_id, Attacker.fingerprints,
        ).order_by(Attacker.first_seen)
        if limit is not None:
            statement = statement.limit(limit)
        async with self._session() as session:
            result = await session.execute(statement)
            return [
                {
                    "uuid": row.uuid,
                    "asn": row.asn,
                    "identity_id": row.identity_id,
                    "fingerprints": row.fingerprints,
                }
                for row in result.all()
            ]

    async def create_attacker_identity(self, row: dict[str, Any]) -> str:
        identity = AttackerIdentity(**row)
        async with self._session() as session:
            session.add(identity)
            await session.commit()
            return identity.uuid

    async def set_attacker_identity_id(
        self, attacker_uuid: str, identity_uuid: str,
    ) -> None:
        statement = (
            update(Attacker)
            .where(Attacker.uuid == attacker_uuid)
            .values(identity_id=identity_uuid)
        )
        async with self._session() as session:
            await session.execute(statement)
            await session.commit()

    async def get_attacker_commands(
        self,
        uuid: str,