feat(clustering): wire high-weight edges end-to-end

The connected-components clusterer now writes attacker_identities rows
+ sets attackers.identity_id when high-weight signals (JA3 / HASSH /
payload-hash / C2-endpoint exact match) agree across observations.
Singletons stay un-fingerprinted and un-clustered.

Algorithm split:
- cluster_observations(observations): pure union-find over the
  high-weight edge function. Same code path for fixture validation and
  production tick.
- from_attacker_row(row): production-row adapter; recovers JA3 + HASSH
  from Attacker.fingerprints JSON. Payload + C2 join from logs in later
  commits; the function shape doesn't change.

Repo additions on BaseRepository + SQLModelRepository:
- list_attackers_for_clustering(limit=None)
- create_attacker_identity(row)
- set_attacker_identity_id(attacker_uuid, identity_uuid)

DummyRepo coverage stub updated.

v1 behavior is conservative: only assigns identities to observations
whose identity_id is currently NULL. Multi-identity components are
skipped this pass; merge / re-assign lands in commit 10 with revocable
merges.

Fixture bounds tightened against the production clusterer:
- lone_wolf (F3): singletons stay singletons
- shared_wordlist (F1): credential-only overlap doesn't cluster
  (high-weight tier doesn't include credentials)
- vpn_hopping (F2, identity-level): 5 rotated IPs with stable JA3 +
  HASSH fold into one identity, ARI = 1.0, completeness = 1.0
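
For readers who do not have decnet.clustering.impl.connected_components at hand, the union-find step described above can be sketched roughly as follows. This is not the code from the commit. The observation field names (ja3, hassh, payload_hash, c2_endpoint) are hypothetical, and the sketch assumes one exact match on any high-weight field is enough to create an edge.

```python
from collections import defaultdict
from typing import Any, Hashable

# Hypothetical field names for the high-weight signals; the real
# observation shape is not shown in this commit.
HIGH_WEIGHT_FIELDS = ("ja3", "hassh", "payload_hash", "c2_endpoint")


def cluster_observations(observations: list[dict[str, Any]]) -> list[set[str]]:
    """Group observation uuids that share any exact high-weight value."""
    parent = {obs["uuid"]: obs["uuid"] for obs in observations}

    def find(x: str) -> str:
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # First uuid seen per (field, value); later matches are unioned into it.
    first_seen: dict[tuple[str, Hashable], str] = {}
    for obs in observations:
        for field in HIGH_WEIGHT_FIELDS:
            value = obs.get(field)
            if not value:
                continue  # a missing signal contributes no edge
            key = (field, value)
            if key in first_seen:
                union(first_seen[key], obs["uuid"])
            else:
                first_seen[key] = obs["uuid"]

    components: dict[str, set[str]] = defaultdict(set)
    for uuid in parent:
        components[find(uuid)].add(uuid)
    return list(components.values())
```

Singletons come back as one-element sets, so a caller can skip them. Credentials are not in the field tuple, which is why credential-only overlap never produces an edge in this sketch.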
@@ -406,6 +406,49 @@ class BaseRepository(ABC):
         """Total ``Attacker`` rows FK'd to this identity."""
         pass
 
+    # ─── Identity resolution writes (clusterer worker) ─────────────────────
+    # Populated by ``decnet clusterer``. The read-only API on top of
+    # ``attacker_identities`` shipped in commit ``dc3d08d``; this is the
+    # write side. See ``decnet.clustering.impl.connected_components``.
+
+    @abstractmethod
+    async def list_attackers_for_clustering(
+        self, limit: Optional[int] = None,
+    ) -> list[dict[str, Any]]:
+        """Project every ``Attacker`` into the clusterer's input shape.
+
+        Returns dicts with at least ``uuid``, ``asn``, ``identity_id``,
+        and ``fingerprints`` (raw JSON list). The clusterer parses the
+        fingerprints list to recover JA3 / HASSH per observation. Empty
+        list when no attackers exist.
+
+        ``limit`` is optional — passed by callers that want to bound a
+        single tick's working set; leave ``None`` to fetch all.
+        """
+        pass
+
+    @abstractmethod
+    async def create_attacker_identity(self, row: dict[str, Any]) -> str:
+        """Insert a new ``AttackerIdentity`` row and return its uuid.
+
+        ``row`` must include ``uuid``; other fields are optional and
+        default per the model. Caller is responsible for generating
+        the uuid (so it can be used in the same tick to back-link
+        observations without a second round-trip).
+        """
+        pass
+
+    @abstractmethod
+    async def set_attacker_identity_id(
+        self, attacker_uuid: str, identity_uuid: str,
+    ) -> None:
+        """Set ``attackers.identity_id`` on a single observation row.
+
+        Idempotent — re-setting the same value is a no-op. Used by
+        the clusterer when it links an observation to an identity.
+        """
+        pass
+
     @abstractmethod
     async def get_attacker_commands(
         self,

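
The docstring above says the clusterer parses the raw fingerprints list to recover JA3 / HASSH. A minimal sketch of such an adapter follows. The commit does not show the layout of the fingerprints JSON, so the {"type": ..., "value": ...} entry shape is purely an assumption for illustration, as is accepting either a JSON string or an already-decoded list.

```python
import json
from typing import Any, Optional


def from_attacker_row(row: dict[str, Any]) -> dict[str, Optional[str]]:
    """Turn a repository row into a clusterer observation (illustrative only)."""
    raw = row.get("fingerprints") or []
    # Assumption: the column arrives as a JSON string or a decoded list.
    entries = json.loads(raw) if isinstance(raw, str) else raw

    obs: dict[str, Optional[str]] = {
        "uuid": row["uuid"],
        "identity_id": row.get("identity_id"),
        "ja3": None,
        "hassh": None,
    }
    for entry in entries:
        if not isinstance(entry, dict):
            continue  # ignore anything that is not a fingerprint record
        # Hypothetical entry shape: {"type": "ja3", "value": "<hash>"}.
        kind = str(entry.get("type", "")).lower()
        value = entry.get("value")
        if kind in ("ja3", "hassh") and value and obs[kind] is None:
            obs[kind] = value  # keep the first value seen per signal
    return obs
```

Keeping the adapter separate from the clustering function means fixtures can feed observations straight in, while production rows pass through this one conversion point.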
@@ -1468,6 +1468,52 @@ class SQLModelRepository(BaseRepository):
             result = await session.execute(statement)
             return result.scalar() or 0
 
+    # ─── Identity resolution writes (clusterer worker) ─────────────────────
+
+    async def list_attackers_for_clustering(
+        self, limit: Optional[int] = None,
+    ) -> list[dict[str, Any]]:
+        # Project the columns the clusterer's similarity graph reads.
+        # Keep it narrow so future denormalised projections (payloads
+        # joined from logs, c2 endpoints aggregated from sessions) can
+        # land here without churning every caller. ``fingerprints`` is
+        # the raw JSON list — the clusterer parses for JA3 / HASSH.
+        statement = select(
+            Attacker.uuid, Attacker.asn, Attacker.identity_id, Attacker.fingerprints,
+        ).order_by(Attacker.first_seen)
+        if limit is not None:
+            statement = statement.limit(limit)
+        async with self._session() as session:
+            result = await session.execute(statement)
+            return [
+                {
+                    "uuid": row.uuid,
+                    "asn": row.asn,
+                    "identity_id": row.identity_id,
+                    "fingerprints": row.fingerprints,
+                }
+                for row in result.all()
+            ]
+
+    async def create_attacker_identity(self, row: dict[str, Any]) -> str:
+        identity = AttackerIdentity(**row)
+        async with self._session() as session:
+            session.add(identity)
+            await session.commit()
+            return identity.uuid
+
+    async def set_attacker_identity_id(
+        self, attacker_uuid: str, identity_uuid: str,
+    ) -> None:
+        statement = (
+            update(Attacker)
+            .where(Attacker.uuid == attacker_uuid)
+            .values(identity_id=identity_uuid)
+        )
+        async with self._session() as session:
+            await session.execute(statement)
+            await session.commit()
+
     async def get_attacker_commands(
         self,
         uuid: str,

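
To show how the three repository methods could fit together in one clusterer tick, here is a rough sketch under stated assumptions. InMemoryRepo and run_tick are hypothetical names, not part of the commit. The sketch also assumes that a component with exactly one existing identity has its NULL members linked to that identity, which the commit message does not spell out. Components are passed in precomputed to keep the example self-contained.

```python
import asyncio
import uuid as uuidlib
from typing import Any, Optional


class InMemoryRepo:
    """Hypothetical stand-in exposing the three methods added in this commit."""

    def __init__(self, attackers: list[dict[str, Any]]) -> None:
        self.attackers = {a["uuid"]: dict(a) for a in attackers}
        self.identities: dict[str, dict[str, Any]] = {}

    async def list_attackers_for_clustering(
        self, limit: Optional[int] = None,
    ) -> list[dict[str, Any]]:
        rows = list(self.attackers.values())
        return rows if limit is None else rows[:limit]

    async def create_attacker_identity(self, row: dict[str, Any]) -> str:
        self.identities[row["uuid"]] = row
        return row["uuid"]

    async def set_attacker_identity_id(
        self, attacker_uuid: str, identity_uuid: str,
    ) -> None:
        self.attackers[attacker_uuid]["identity_id"] = identity_uuid


async def run_tick(repo: InMemoryRepo, components: list[set[str]]) -> int:
    """Apply the conservative v1 rule; return the number of identities created."""
    rows = {r["uuid"]: r for r in await repo.list_attackers_for_clustering()}
    created = 0
    for component in components:
        if len(component) < 2:
            continue  # singletons stay un-clustered
        existing = {rows[u]["identity_id"] for u in component} - {None}
        if len(existing) > 1:
            continue  # multi-identity component: merging is out of scope for v1
        if existing:
            identity_uuid = next(iter(existing))  # assumption: reuse the sole identity
        else:
            # Caller generates the uuid so it can back-link in the same tick.
            identity_uuid = await repo.create_attacker_identity(
                {"uuid": str(uuidlib.uuid4())}
            )
            created += 1
        for u in sorted(component):
            if rows[u]["identity_id"] is None:  # only touch NULL rows
                await repo.set_attacker_identity_id(u, identity_uuid)
    return created
```

Because set_attacker_identity_id is idempotent and only NULL rows are written, re-running the same tick should leave already-linked observations untouched.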