# Identity Resolution — Design **Status:** pre-implementation. This doc is the spec; code follows. **Roadmap pressure:** Campaign Clustering (`CAMPAIGN_CLUSTERING.md`), Keystroke Dynamics (`DEVELOPMENT_V2.md` §1), Federation (`DEVELOPMENT_V2.md` §3). ## Premise The `attackers` table is keyed per-IP — one row every time we observe activity from a new source IP. That works for naive scoring, but it conflates two distinct concepts: - **Observation event.** "We saw activity from IP X starting at T1." Mutable; IPs come and go; the unit of *ingestion* on the wire. - **Actor identity.** "These N observations are the same hands." Semi-stable; recovered from signals the attacker can't cheaply rotate (JA3, HASSH, payload hashes, C2 callbacks, eventually keystroke rhythm). A campaign is then one-level-up: "these M identities are coordinated." The clean ladder is **Observation → Identity → Campaign**, three levels, each derived from the level below by clustering on increasingly meta signals. We will not ship a clusterer in this PR sequence. The plan here is the **substrate the clusterer writes into** — schema, API, bus topics, frontend hooks — landed empty so downstream work targets a stable shape and the campaign clustering fixtures can encode honest multi-row-per-actor scenarios. Order of work, strictly: 1. This design doc. 2. Schema-only PR — `attacker_identities` table + nullable `attackers.identity_id` FK. Empty table, no production reads/writes. 3. Read-only API — `/api/v1/identities/*` returning empty lists / 404. 4. Frontend — conditional `IdentityDetail` page; `AttackerDetail` gains a "Identity: " badge when populated. 5. Bus topics + wiki — declare topics, document, no publishers yet. 6. Test factory adapter — campaign factory emits N rows per IP-rotating actor with shared `truth_identity_id`. Unblocks fixture 2 (`vpn_hopping`) and beyond. The clusterer itself follows after fixtures 2–6 ship, on the substrate this PR sequence builds. --- ## Why now, why not later **Pre-v1 schema changes are nearly free.** SQLModel `metadata.create_all()` picks up new tables; new nullable columns are free; no Alembic until v1. Real production data is currently small and replayable. **Post-v1 the cost compounds.** Real attacker rows accumulate, FKs proliferate, dashboard URLs get bookmarked, federation gossip locks in `schema_version=1` payload shapes. Every month we wait, the migration becomes harder. **V2 keystroke dynamics needs an identity row.** `kd_digraph_simhash` correlation is *the* feature that graduates fingerprint into identity. It needs a row to attach to. Without it, the V2 work either rebuilds this substrate from scratch, or hangs simhash off the per-IP observation table — which means an IP-rotating actor's typing rhythm gets fragmented across every IP they used. **Federation gossip is identity-level.** Operators in different geographies will never share an IP. They may share an identity. --- ## Why sibling-add, not rename **Considered:** rename `attackers` → `attacker_observations`. Eliminates the "attacker means observation" lie at the schema layer. **Rejected.** Costs: - 126 occurrences of `attacker_uuid` across the codebase, mid-migration churn directly on top of DEBT-041 (commit `3eb67c9`, just landed). - Frontend `Attacker` → `Observation` mismatches user mental model. Operators click "show me the attacker," not "show me the observation." Splunk, ELK, MISP, every CTI platform keeps the user-facing concept stable and exposes identity resolution as a derived view. - The lie is in *documentation*, not in code. Code already operates per-IP correctly; it's just named imprecisely. Fixing it via docstring + wiki is far cheaper than renaming. **Adopted:** **sibling-add.** Keep the `attackers` table; document its semantic role as "per-IP observation." Add `attacker_identities` as a new sibling. Add nullable `attackers.identity_id` FK. The clusterer populates identities. Existing code paths are unchanged. Frontend `AttackerDetail` gains a conditional widget; new `IdentityDetail` page aggregates observations. The "Attacker" vocabulary continues to mean "what the operator clicks in the dashboard" — the per-IP observation row. "Identity" is the analyst-facing concept, surfaced when the clusterer has resolved one. --- ## Schema ### `AttackerIdentity` (new) | Column | Type | Notes | |---|---|---| | `uuid` | TEXT PK | uuid4(); identities are NOT fingerprint-derived (fingerprints evolve as the actor's tooling changes; the row's identity must outlive its current fingerprints) | | `schema_version` | INT, default 1 | Federation-gossip compat from day one. Bumping feature definitions without a version field silently poisons other operators' clustering | | `campaign_id` | TEXT FK nullable | Set by the campaign clusterer (downstream effort) | | `first_seen_at` | TIMESTAMP | Earliest observation linked to this identity | | `last_seen_at` | TIMESTAMP | Latest observation linked to this identity | | `created_at` / `updated_at` | TIMESTAMP | Row audit | | `confidence` | REAL nullable | Identity-cohesion score from clusterer; null until clusterer writes | | `observation_count` | INT default 0 | Denormalized for cheap dashboard reads. Maintained by the clusterer when it links/unlinks | | `ja3_hashes` | TEXT (JSON list) nullable | Multiple TLS stacks per actor possible (different tools, different hosts) | | `hassh_hashes` | TEXT (JSON list) nullable | | | `payload_simhashes` | TEXT (JSON list) nullable | 64-bit ints serialized as hex strings | | `c2_endpoints` | TEXT (JSON list) nullable | Domain or IP, dedup'd | | `kd_digraph_simhash` | BINARY(8) nullable | V2 keystroke-dynamics hook. Same shape as `SessionProfile.kd_digraph_simhash`; identity-level value is the centroid (or majority vote) across the identity's sessions | | `merged_into_uuid` | TEXT self-FK nullable | Soft-merge audit trail. When the clusterer combines two existing identities, the loser's row stays in place with `merged_into_uuid` pointing at the winner — preserves the audit trail without orphaning FKs | | `notes` | TEXT nullable | Operator-editable. Free-form | All clusterer-populated fields are nullable; the table ships empty and is valid in that state. ### `attackers` (extended) One nullable column added: | Column | Type | Notes | |---|---|---| | `identity_id` | TEXT FK nullable, indexed | References `attacker_identities.uuid`. NULL until the clusterer resolves an identity | **Migration:** None needed. Pre-v1 SQLModel `metadata.create_all()` adds the new table and column. No data backfill (column is nullable). --- ## Where intel lives — both, with clear semantics DEBT-041 (`3eb67c9`) just re-keyed `attacker_intel` on `attacker_uuid` (observation level). That work is correct; we do **not** touch it here. **Observation-level intel** (`attacker_intel`, current): - AbuseIPDB confidence, GreyNoise classification, abuse.ch matches, PTR records, GeoIP — all **IP-scoped facts**. An identity spanning 40 IPs has 40 distinct AbuseIPDB verdicts. We must not lose that granularity. **Identity-level intel** (`attacker_identity_intel`, deferred): - Aggregate reputation (e.g. "this identity has been reported as malicious across 4 of 5 observed IPs"). - Threat-actor naming from MISP/CTI feeds, where naming is actor-scoped not IP-scoped. - TTP / MITRE ATT&CK tags. Different lifecycle (clusterer-driven, not enricher-driven), different inputs (aggregates over observations, not direct API calls), so it gets its own table and its own enricher when it ships. **Not in this PR sequence.** The IdentityDetail API (read side) aggregates observation intel on read until the identity-level table exists. --- ## Bus Topics Three new topics. No publishers in this PR sequence — constants exist; publishers ship with the clusterer. | Topic | Payload | When | |---|---|---| | `identity.formed` | `{identity_uuid, observation_uuids: [], confidence, first_seen_at}` | Clusterer creates a new identity from one or more observations | | `identity.observation.linked` | `{identity_uuid, observation_uuid, confidence_after}` | Clusterer attaches an observation to an existing identity (or re-attaches one previously linked elsewhere) | | `identity.merged` | `{winner_uuid, loser_uuid, observation_uuids: [], confidence_after}` | Clusterer collapses two identities. The loser's row stays in place via `merged_into_uuid`; subscribers re-key any cached identity references to the winner | **Deferred:** `identity.campaign.assigned`. Adds opportunistically when the campaign clusterer ships. YAGNI before then. **Wiki:** `Service-Bus.md` documents these in the same commit that adds the constants (per the project's `feedback_wiki_bus_signals` rule). --- ## API Surface All new endpoints are read-only and auth-gated identically to `/api/v1/attackers/*` (per `project_health_auth_gated`). | Method | Path | Returns | |---|---|---| | GET | `/api/v1/identities` | Paginated list of identities. Response shape mirrors `AttackersResponse` | | GET | `/api/v1/identities/{uuid}` | Identity row + aggregated intel summary (rolled up from FK'd observations) + campaign stub if assigned | | GET | `/api/v1/identities/{uuid}/observations` | Paginated list of `Attacker` observation rows that FK to this identity | While the table is empty, every endpoint returns either an empty list or 404 — both are valid responses. **`AttackerDetail` change** (frontend, not API): when `attackers.identity_id` is non-null, render a "Identity: " badge linking to `/identities/`. No change otherwise. --- ## Frontend - **`AttackerDetail.tsx`** — conditional badge. Zero behavior change when `identity_id` is null. - **`IdentityDetail.tsx`** (new) — aggregates observations, fingerprint summary, intel summary, campaign link. Same visual vocabulary as `AttackerDetail` so operators feel at home. - **Routing** — `/identities/:uuid` alongside `/attackers/:uuid`. - Default browse remains "Attackers." There is no "Identities" tab in the main navigation until identities are populated; once they are, an "Identity Resolution" entry appears under the Analytics section (this is post-clusterer; out of scope here). --- ## Risks 1. **Confidence drift.** The clusterer can rewrite identity assignments as evidence accumulates. An observation linked to identity-A today may move to identity-B tomorrow. UI must surface this without alarming operators ("This observation has been re-attributed; previous identity remains as a soft-merged reference"). The `merged_into_uuid` chain is the audit trail. 2. ~~**API URL stability.**~~ Resolved in commit `dc3d08d`: the read-only API follows `merged_into_uuid` and surfaces the canonical winner. Loser UUIDs resolve to the winner row. 3. **Schema-version lock-in for federation.** `schema_version=1` is what we ship. Any fingerprint added to the identity row post-v1 bumps the version. Operators behind by versions get a degraded gossip experience but should not crash — the receiver must tolerate unknown fields. 4. **Observation FK proliferation.** Today only `attackers` would carry `identity_id`. Tomorrow, `SessionProfile`, `AttackerIntel`, webhook payloads might want it too. Resist proliferation; the normalised path is `observation.identity_id` and identity-level facts go in `attacker_identity_intel`. We only carry `identity_id` on tables where joining via the observation row is materially slower at read time. 5. **Identity-level intel scope creep.** Easy to start moving DEBT-041 intel up to identity level "for cleanliness." Don't. AbuseIPDB results are IP-scoped facts; moving them up loses information. Identity-level intel is *aggregate* intel, a different thing. --- ## Open Questions 1. ~~**Revocability of identity merges.**~~ **Resolved 2026-04-26:** merges are revocable. `identity.unmerged` topic ships in `decnet/bus/topics.py` alongside the existing three so subscribers on `identity.>` get it from day one. Clusterer clears `merged_into_uuid`, re-links observations, publishes `identity.unmerged` + a fresh `identity.formed` for the resurrected side. 2. ~~**`AttackerDetail` UX when `identity_id` changes.**~~ **Resolved 2026-04-26:** SSE channel modeled on the topology-mutator SSE. New endpoint subscribes to `identity.>`, JWT via `?token=`, snapshot-on-connect + live forward. `AttackerDetail` and `IdentityDetail` consume it. 3. **`SessionProfile.identity_id` FK.** Does this PR sequence add it, or does it wait for V2 keystroke dynamics? Leaning **wait** — the FK is only useful when the identity-level keystroke similarity query exists, which is V2 work. Adds a column we don't read in v1 = unused complexity. 4. **Webhook payload identity_id.** Adds opportunistically once identities are populated. Not load-bearing for this PR sequence. 5. **Identity-level intel table.** Schema sketch is straightforward (uuid PK, identity_uuid FK, source, confidence, ttps JSON, timestamps), but the enricher is meaningfully different from the IP-scoped one. Defer entirely. --- ## What is explicitly NOT in this design - The clusterer worker (`decnet/clustering/` worker bin). Designed in `CAMPAIGN_CLUSTERING.md` §4; lands on top of this substrate. - `attacker_identity_intel` table. - `SessionProfile.identity_id` FK. - Webhook payload `identity_id` enrichment. - Renaming `attackers` → `attacker_observations`. Considered, rejected. - Identity-level federation gossip. The schema is federation-ready (schema_version, no operator-identifying fields); the gossip wire itself is V2. --- ## Verification After all 5 commits below land: ```bash source .311/bin/activate # Schema lands cleanly. pytest tests/db/test_identity_schema.py -v # API surface returns expected shapes against an empty identities table. pytest tests/web/test_api_identities.py -v # No regressions on the unchanged path. pytest tests/web/ tests/profiler/ tests/correlation/ -v # Bus topic constants importable; wiki updated. python -c "from decnet.bus.topics import IDENTITY_FORMED, IDENTITY_OBSERVATION_LINKED, IDENTITY_MERGED; print('OK')" test -f wiki-checkout/Identity-Resolution.md grep -q "identity.formed" wiki-checkout/Service-Bus.md # Factory adapter unblocks fixture 2. pytest tests/clustering/test_campaign_factory.py -v ``` Manual smoke after schema + API + frontend: - `decnet api` then `decnet web`. - Browse to an existing AttackerDetail page → no badge (identity_id is NULL). - `GET /api/v1/identities` → `{"data": [], "total": 0, ...}`. - `GET /api/v1/identities/` → 404.