Files

anti 862e4dbb31 merge: testing → main (reconcile 2-week divergence)

2026-04-28 18:36:00 -04:00

14 KiB

Raw Blame History

Identity Resolution — Design

Status: pre-implementation. This doc is the spec; code follows.

Roadmap pressure: Campaign Clustering (CAMPAIGN_CLUSTERING.md), Keystroke Dynamics (DEVELOPMENT_V2.md §1), Federation (DEVELOPMENT_V2.md §3).

Premise

The attackers table is keyed per-IP — one row every time we observe activity from a new source IP. That works for naive scoring, but it conflates two distinct concepts:

Observation event. "We saw activity from IP X starting at T1." Mutable; IPs come and go; the unit of ingestion on the wire.
Actor identity. "These N observations are the same hands." Semi-stable; recovered from signals the attacker can't cheaply rotate (JA3, HASSH, payload hashes, C2 callbacks, eventually keystroke rhythm).

A campaign is then one-level-up: "these M identities are coordinated." The clean ladder is Observation → Identity → Campaign, three levels, each derived from the level below by clustering on increasingly meta signals.

We will not ship a clusterer in this PR sequence. The plan here is the substrate the clusterer writes into — schema, API, bus topics, frontend hooks — landed empty so downstream work targets a stable shape and the campaign clustering fixtures can encode honest multi-row-per-actor scenarios.

Order of work, strictly:

This design doc.
Schema-only PR — attacker_identities table + nullable attackers.identity_id FK. Empty table, no production reads/writes.
Read-only API — /api/v1/identities/* returning empty lists / 404.
Frontend — conditional IdentityDetail page; AttackerDetail gains a "Identity: " badge when populated.
Bus topics + wiki — declare topics, document, no publishers yet.
Test factory adapter — campaign factory emits N rows per IP-rotating actor with shared truth_identity_id. Unblocks fixture 2 (vpn_hopping) and beyond.

The clusterer itself follows after fixtures 2–6 ship, on the substrate this PR sequence builds.

Why now, why not later

Pre-v1 schema changes are nearly free. SQLModel metadata.create_all() picks up new tables; new nullable columns are free; no Alembic until v1. Real production data is currently small and replayable.

Post-v1 the cost compounds. Real attacker rows accumulate, FKs proliferate, dashboard URLs get bookmarked, federation gossip locks in schema_version=1 payload shapes. Every month we wait, the migration becomes harder.

V2 keystroke dynamics needs an identity row. kd_digraph_simhash correlation is the feature that graduates fingerprint into identity. It needs a row to attach to. Without it, the V2 work either rebuilds this substrate from scratch, or hangs simhash off the per-IP observation table — which means an IP-rotating actor's typing rhythm gets fragmented across every IP they used.

Federation gossip is identity-level. Operators in different geographies will never share an IP. They may share an identity.

Why sibling-add, not rename

Considered: rename attackers → attacker_observations. Eliminates the "attacker means observation" lie at the schema layer.

Rejected. Costs:

126 occurrences of attacker_uuid across the codebase, mid-migration churn directly on top of DEBT-041 (commit 3eb67c9, just landed).
Frontend Attacker → Observation mismatches user mental model. Operators click "show me the attacker," not "show me the observation." Splunk, ELK, MISP, every CTI platform keeps the user-facing concept stable and exposes identity resolution as a derived view.
The lie is in documentation, not in code. Code already operates per-IP correctly; it's just named imprecisely. Fixing it via docstring + wiki is far cheaper than renaming.

Adopted: sibling-add. Keep the attackers table; document its semantic role as "per-IP observation." Add attacker_identities as a new sibling. Add nullable attackers.identity_id FK. The clusterer populates identities. Existing code paths are unchanged. Frontend AttackerDetail gains a conditional widget; new IdentityDetail page aggregates observations.

The "Attacker" vocabulary continues to mean "what the operator clicks in the dashboard" — the per-IP observation row. "Identity" is the analyst-facing concept, surfaced when the clusterer has resolved one.

Schema

`AttackerIdentity` (new)

Column	Type	Notes
`uuid`	TEXT PK	uuid4(); identities are NOT fingerprint-derived (fingerprints evolve as the actor's tooling changes; the row's identity must outlive its current fingerprints)
`schema_version`	INT, default 1	Federation-gossip compat from day one. Bumping feature definitions without a version field silently poisons other operators' clustering
`campaign_id`	TEXT FK nullable	Set by the campaign clusterer (downstream effort)
`first_seen_at`	TIMESTAMP	Earliest observation linked to this identity
`last_seen_at`	TIMESTAMP	Latest observation linked to this identity
`created_at` / `updated_at`	TIMESTAMP	Row audit
`confidence`	REAL nullable	Identity-cohesion score from clusterer; null until clusterer writes
`observation_count`	INT default 0	Denormalized for cheap dashboard reads. Maintained by the clusterer when it links/unlinks
`ja3_hashes`	TEXT (JSON list) nullable	Multiple TLS stacks per actor possible (different tools, different hosts)
`hassh_hashes`	TEXT (JSON list) nullable
`payload_simhashes`	TEXT (JSON list) nullable	64-bit ints serialized as hex strings
`c2_endpoints`	TEXT (JSON list) nullable	Domain or IP, dedup'd
`kd_digraph_simhash`	BINARY(8) nullable	V2 keystroke-dynamics hook. Same shape as `SessionProfile.kd_digraph_simhash`; identity-level value is the centroid (or majority vote) across the identity's sessions
`merged_into_uuid`	TEXT self-FK nullable	Soft-merge audit trail. When the clusterer combines two existing identities, the loser's row stays in place with `merged_into_uuid` pointing at the winner — preserves the audit trail without orphaning FKs
`notes`	TEXT nullable	Operator-editable. Free-form

All clusterer-populated fields are nullable; the table ships empty and is valid in that state.

`attackers` (extended)

One nullable column added:

Column	Type	Notes
`identity_id`	TEXT FK nullable, indexed	References `attacker_identities.uuid`. NULL until the clusterer resolves an identity

Migration: None needed. Pre-v1 SQLModel metadata.create_all() adds the new table and column. No data backfill (column is nullable).

Where intel lives — both, with clear semantics

DEBT-041 (3eb67c9) just re-keyed attacker_intel on attacker_uuid (observation level). That work is correct; we do not touch it here.

Observation-level intel (attacker_intel, current):

AbuseIPDB confidence, GreyNoise classification, abuse.ch matches, PTR records, GeoIP — all IP-scoped facts. An identity spanning 40 IPs has 40 distinct AbuseIPDB verdicts. We must not lose that granularity.

Identity-level intel (attacker_identity_intel, deferred):

Aggregate reputation (e.g. "this identity has been reported as malicious across 4 of 5 observed IPs").
Threat-actor naming from MISP/CTI feeds, where naming is actor-scoped not IP-scoped.
TTP / MITRE ATT&CK tags.

Different lifecycle (clusterer-driven, not enricher-driven), different inputs (aggregates over observations, not direct API calls), so it gets its own table and its own enricher when it ships. Not in this PR sequence.

The IdentityDetail API (read side) aggregates observation intel on read until the identity-level table exists.

Bus Topics

Three new topics. No publishers in this PR sequence — constants exist; publishers ship with the clusterer.

Topic	Payload	When
`identity.formed`	`{identity_uuid, observation_uuids: [], confidence, first_seen_at}`	Clusterer creates a new identity from one or more observations
`identity.observation.linked`	`{identity_uuid, observation_uuid, confidence_after}`	Clusterer attaches an observation to an existing identity (or re-attaches one previously linked elsewhere)
`identity.merged`	`{winner_uuid, loser_uuid, observation_uuids: [], confidence_after}`	Clusterer collapses two identities. The loser's row stays in place via `merged_into_uuid`; subscribers re-key any cached identity references to the winner

Deferred: identity.campaign.assigned. Adds opportunistically when the campaign clusterer ships. YAGNI before then.

Wiki: Service-Bus.md documents these in the same commit that adds the constants (per the project's feedback_wiki_bus_signals rule).

API Surface

All new endpoints are read-only and auth-gated identically to /api/v1/attackers/* (per project_health_auth_gated).

Method	Path	Returns
GET	`/api/v1/identities`	Paginated list of identities. Response shape mirrors `AttackersResponse`
GET	`/api/v1/identities/{uuid}`	Identity row + aggregated intel summary (rolled up from FK'd observations) + campaign stub if assigned
GET	`/api/v1/identities/{uuid}/observations`	Paginated list of `Attacker` observation rows that FK to this identity

While the table is empty, every endpoint returns either an empty list or 404 — both are valid responses.

AttackerDetail change (frontend, not API): when attackers.identity_id is non-null, render a "Identity: " badge linking to /identities/<uuid>. No change otherwise.

Frontend

AttackerDetail.tsx — conditional badge. Zero behavior change when identity_id is null.
IdentityDetail.tsx (new) — aggregates observations, fingerprint summary, intel summary, campaign link. Same visual vocabulary as AttackerDetail so operators feel at home.
Routing — /identities/:uuid alongside /attackers/:uuid.
Default browse remains "Attackers." There is no "Identities" tab in the main navigation until identities are populated; once they are, an "Identity Resolution" entry appears under the Analytics section (this is post-clusterer; out of scope here).

Risks

Confidence drift. The clusterer can rewrite identity assignments as evidence accumulates. An observation linked to identity-A today may move to identity-B tomorrow. UI must surface this without alarming operators ("This observation has been re-attributed; previous identity remains as a soft-merged reference"). The merged_into_uuid chain is the audit trail.
~~API URL stability.~~ Resolved in commit dc3d08d: the read-only API follows merged_into_uuid and surfaces the canonical winner. Loser UUIDs resolve to the winner row.
Schema-version lock-in for federation. schema_version=1 is what we ship. Any fingerprint added to the identity row post-v1 bumps the version. Operators behind by versions get a degraded gossip experience but should not crash — the receiver must tolerate unknown fields.
Observation FK proliferation. Today only attackers would carry identity_id. Tomorrow, SessionProfile, AttackerIntel, webhook payloads might want it too. Resist proliferation; the normalised path is observation.identity_id and identity-level facts go in attacker_identity_intel. We only carry identity_id on tables where joining via the observation row is materially slower at read time.
Identity-level intel scope creep. Easy to start moving DEBT-041 intel up to identity level "for cleanliness." Don't. AbuseIPDB results are IP-scoped facts; moving them up loses information. Identity-level intel is aggregate intel, a different thing.

Open Questions

~~Revocability of identity merges.~~ Resolved 2026-04-26: merges are revocable. identity.unmerged topic ships in decnet/bus/topics.py alongside the existing three so subscribers on identity.> get it from day one. Clusterer clears merged_into_uuid, re-links observations, publishes identity.unmerged + a fresh identity.formed for the resurrected side.
~~AttackerDetail UX when identity_id changes.~~ Resolved 2026-04-26: SSE channel modeled on the topology-mutator SSE. New endpoint subscribes to identity.>, JWT via ?token=, snapshot-on-connect + live forward. AttackerDetail and IdentityDetail consume it.
SessionProfile.identity_id FK. Does this PR sequence add it, or does it wait for V2 keystroke dynamics? Leaning wait — the FK is only useful when the identity-level keystroke similarity query exists, which is V2 work. Adds a column we don't read in v1 = unused complexity.
Webhook payload identity_id. Adds opportunistically once identities are populated. Not load-bearing for this PR sequence.
Identity-level intel table. Schema sketch is straightforward (uuid PK, identity_uuid FK, source, confidence, ttps JSON, timestamps), but the enricher is meaningfully different from the IP-scoped one. Defer entirely.

What is explicitly NOT in this design

The clusterer worker (decnet/clustering/ worker bin). Designed in CAMPAIGN_CLUSTERING.md §4; lands on top of this substrate.
attacker_identity_intel table.
SessionProfile.identity_id FK.
Webhook payload identity_id enrichment.
Renaming attackers → attacker_observations. Considered, rejected.
Identity-level federation gossip. The schema is federation-ready (schema_version, no operator-identifying fields); the gossip wire itself is V2.

Verification

After all 5 commits below land:

source .311/bin/activate

# Schema lands cleanly.
pytest tests/db/test_identity_schema.py -v

# API surface returns expected shapes against an empty identities table.
pytest tests/web/test_api_identities.py -v

# No regressions on the unchanged path.
pytest tests/web/ tests/profiler/ tests/correlation/ -v

# Bus topic constants importable; wiki updated.
python -c "from decnet.bus.topics import IDENTITY_FORMED, IDENTITY_OBSERVATION_LINKED, IDENTITY_MERGED; print('OK')"
test -f wiki-checkout/Identity-Resolution.md
grep -q "identity.formed" wiki-checkout/Service-Bus.md

# Factory adapter unblocks fixture 2.
pytest tests/clustering/test_campaign_factory.py -v

Manual smoke after schema + API + frontend:

decnet api then decnet web.
Browse to an existing AttackerDetail page → no badge (identity_id is NULL).
GET /api/v1/identities → {"data": [], "total": 0, ...}.
GET /api/v1/identities/<random-uuid> → 404.

14 KiB Raw Blame History Unescape Escape