DECNET/development/IDENTITY_RESOLUTION.md
anti 943bb3a39d docs(identity): resolve merge revocability + SSE open questions
Open Question 1 (merge revocability): adopted. The clusterer will
clear merged_into_uuid on contradicting evidence and publish a new
identity.unmerged topic alongside the existing three identity.* topics
so subscribers on identity.> get it from day one.

Open Question 2 (AttackerDetail UX on identity_id change): adopted
SSE over refresh-on-focus. New endpoint will mirror the existing
topology mutator SSE (bus subscription on identity.>, JWT via ?token=,
snapshot-on-connect + live forward).

Risk 2 (API URL stability for soft-merged loser UUIDs): struck —
already shipped in commit dc3d08d (read-only API follows
merged_into_uuid and surfaces the canonical winner).
2026-04-26 07:33:36 -04:00

# Identity Resolution — Design
**Status:** pre-implementation. This doc is the spec; code follows.
**Roadmap pressure:** Campaign Clustering (`CAMPAIGN_CLUSTERING.md`),
Keystroke Dynamics (`DEVELOPMENT_V2.md` §1), Federation
(`DEVELOPMENT_V2.md` §3).
## Premise
The `attackers` table is keyed per-IP — one row every time we observe
activity from a new source IP. That works for naive scoring, but it
conflates two distinct concepts:
- **Observation event.** "We saw activity from IP X starting at T1."
Mutable; IPs come and go; the unit of *ingestion* on the wire.
- **Actor identity.** "These N observations are the same hands."
Semi-stable; recovered from signals the attacker can't cheaply rotate
(JA3, HASSH, payload hashes, C2 callbacks, eventually keystroke
rhythm).
A campaign is then one-level-up: "these M identities are coordinated."
The clean ladder is **Observation → Identity → Campaign**, three
levels, each derived from the level below by clustering on
increasingly meta signals.
We will not ship a clusterer in this PR sequence. The plan here is the
**substrate the clusterer writes into** — schema, API, bus topics,
frontend hooks — landed empty so downstream work targets a stable
shape and the campaign clustering fixtures can encode honest
multi-row-per-actor scenarios.
Order of work, strictly:
1. This design doc.
2. Schema-only PR — `attacker_identities` table + nullable
`attackers.identity_id` FK. Empty table, no production reads/writes.
3. Read-only API — `/api/v1/identities/*` returning empty lists / 404.
4. Frontend — conditional `IdentityDetail` page; `AttackerDetail`
gains an "Identity: <link>" badge when populated.
5. Bus topics + wiki — declare topics, document, no publishers yet.
6. Test factory adapter — campaign factory emits N rows per
IP-rotating actor with shared `truth_identity_id`. Unblocks
fixture 2 (`vpn_hopping`) and beyond.
The clusterer itself follows after fixtures 2–6 ship, on the
substrate this PR sequence builds.
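
As a rough illustration of step 6, the factory adapter could emit one observation row per source IP while stamping every row with the same ground-truth label. This is a minimal sketch, not the project's factory: `ObservationRow` and `make_rotating_actor` are hypothetical names, and only `truth_identity_id` comes from this document.

```python
import uuid
from dataclasses import dataclass
from typing import List


@dataclass
class ObservationRow:
    """Hypothetical stand-in for one per-IP row in the `attackers` table."""
    attacker_uuid: str
    source_ip: str
    truth_identity_id: str  # test-only ground-truth label


def make_rotating_actor(ips: List[str]) -> List[ObservationRow]:
    """Emit one observation per IP, all sharing a single truth identity."""
    truth_id = str(uuid.uuid4())
    return [
        ObservationRow(
            attacker_uuid=str(uuid.uuid4()),
            source_ip=ip,
            truth_identity_id=truth_id,
        )
        for ip in ips
    ]


# One actor hopping across three documentation-range IPs.
rows = make_rotating_actor(["203.0.113.5", "198.51.100.7", "192.0.2.44"])
```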
---
## Why now, why not later
**Pre-v1 schema changes are nearly free.** SQLModel
`metadata.create_all()` picks up new tables; new nullable columns are
free; no Alembic until v1. Real production data is currently small
and replayable.
**Post-v1 the cost compounds.** Real attacker rows accumulate, FKs
proliferate, dashboard URLs get bookmarked, federation gossip locks
in `schema_version=1` payload shapes. Every month we wait, the
migration becomes harder.
**V2 keystroke dynamics needs an identity row.** `kd_digraph_simhash`
correlation is *the* feature that graduates fingerprint into identity.
It needs a row to attach to. Without it, the V2 work either rebuilds
this substrate from scratch, or hangs simhash off the per-IP
observation table — which means an IP-rotating actor's typing rhythm
gets fragmented across every IP they used.
**Federation gossip is identity-level.** Operators in different
geographies will never share an IP. They may share an identity.
---
## Why sibling-add, not rename
**Considered:** rename `attackers` → `attacker_observations`.
Eliminates the "attacker means observation" lie at the schema layer.
**Rejected.** Costs:
- 126 occurrences of `attacker_uuid` across the codebase, mid-migration
churn directly on top of DEBT-041 (commit `3eb67c9`, just landed).
- Frontend `Attacker` → `Observation` mismatches the user's mental model.
Operators click "show me the attacker," not "show me the
observation." Splunk, ELK, MISP, every CTI platform keeps the
user-facing concept stable and exposes identity resolution as a
derived view.
- The lie is in *documentation*, not in code. Code already operates
per-IP correctly; it's just named imprecisely. Fixing it via
docstring + wiki is far cheaper than renaming.
**Adopted:** **sibling-add.** Keep the `attackers` table; document its
semantic role as "per-IP observation." Add `attacker_identities` as a
new sibling. Add nullable `attackers.identity_id` FK. The clusterer
populates identities. Existing code paths are unchanged. Frontend
`AttackerDetail` gains a conditional widget; new `IdentityDetail`
page aggregates observations.
The "Attacker" vocabulary continues to mean "what the operator clicks
in the dashboard" — the per-IP observation row. "Identity" is the
analyst-facing concept, surfaced when the clusterer has resolved one.
---
## Schema
### `AttackerIdentity` (new)
| Column | Type | Notes |
|---|---|---|
| `uuid` | TEXT PK | uuid4(); identities are NOT fingerprint-derived (fingerprints evolve as the actor's tooling changes; the row's identity must outlive its current fingerprints) |
| `schema_version` | INT, default 1 | Federation-gossip compat from day one. Bumping feature definitions without a version field silently poisons other operators' clustering |
| `campaign_id` | TEXT FK nullable | Set by the campaign clusterer (downstream effort) |
| `first_seen_at` | TIMESTAMP | Earliest observation linked to this identity |
| `last_seen_at` | TIMESTAMP | Latest observation linked to this identity |
| `created_at` / `updated_at` | TIMESTAMP | Row audit |
| `confidence` | REAL nullable | Identity-cohesion score from clusterer; null until clusterer writes |
| `observation_count` | INT default 0 | Denormalized for cheap dashboard reads. Maintained by the clusterer when it links/unlinks |
| `ja3_hashes` | TEXT (JSON list) nullable | Multiple TLS stacks per actor possible (different tools, different hosts) |
| `hassh_hashes` | TEXT (JSON list) nullable | |
| `payload_simhashes` | TEXT (JSON list) nullable | 64-bit ints serialized as hex strings |
| `c2_endpoints` | TEXT (JSON list) nullable | Domain or IP, dedup'd |
| `kd_digraph_simhash` | BINARY(8) nullable | V2 keystroke-dynamics hook. Same shape as `SessionProfile.kd_digraph_simhash`; identity-level value is the centroid (or majority vote) across the identity's sessions |
| `merged_into_uuid` | TEXT self-FK nullable | Soft-merge audit trail. When the clusterer combines two existing identities, the loser's row stays in place with `merged_into_uuid` pointing at the winner — preserves the audit trail without orphaning FKs |
| `notes` | TEXT nullable | Operator-editable. Free-form |
All clusterer-populated fields are nullable; the table ships empty and
is valid in that state.
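
The `kd_digraph_simhash` row mentions a majority vote across an identity's sessions. One way that could work is a bitwise vote over the 8-byte values, sketched below. The function name, the big-endian byte order, and the tie-breaking rule (ties resolve to 0) are assumptions made for illustration; the document does not specify them.

```python
from typing import List


def majority_simhash(session_hashes: List[bytes]) -> bytes:
    """Bitwise majority vote over 8-byte simhashes; ties resolve to 0."""
    if not session_hashes:
        raise ValueError("need at least one session simhash")
    if any(len(h) != 8 for h in session_hashes):
        raise ValueError("every simhash must be exactly 8 bytes")

    values = [int.from_bytes(h, "big") for h in session_hashes]
    result = 0
    for bit in range(64):
        ones = sum((v >> bit) & 1 for v in values)
        # Set the bit only when strictly more than half the sessions have it.
        if ones * 2 > len(values):
            result |= 1 << bit
    return result.to_bytes(8, "big")


# Three sessions that agree only on bit 3.
sessions = [(0b1100).to_bytes(8, "big"),
            (0b1010).to_bytes(8, "big"),
            (0b1001).to_bytes(8, "big")]
identity_hash = majority_simhash(sessions)
```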
### `attackers` (extended)
One nullable column added:
| Column | Type | Notes |
|---|---|---|
| `identity_id` | TEXT FK nullable, indexed | References `attacker_identities.uuid`. NULL until the clusterer resolves an identity |
**Migration:** None needed. Pre-v1 SQLModel `metadata.create_all()`
adds the new table and column. No data backfill (column is nullable).
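
To make the "ships empty and is valid" claim concrete, the two tables can be approximated in plain SQLite. This is only a sketch: the project uses SQLModel, so the DDL here is a stand-in, and the `uuid` and `ip` columns on `attackers` are placeholders for whatever that table already contains. The `campaign_id` foreign key target is omitted because the campaigns table is not defined in this document.

```python
import sqlite3

DDL = """
CREATE TABLE attacker_identities (
    uuid               TEXT PRIMARY KEY,
    schema_version     INTEGER NOT NULL DEFAULT 1,
    campaign_id        TEXT,
    first_seen_at      TIMESTAMP,
    last_seen_at       TIMESTAMP,
    created_at         TIMESTAMP,
    updated_at         TIMESTAMP,
    confidence         REAL,
    observation_count  INTEGER NOT NULL DEFAULT 0,
    ja3_hashes         TEXT,   -- JSON list
    hassh_hashes       TEXT,   -- JSON list
    payload_simhashes  TEXT,   -- JSON list of hex strings
    c2_endpoints       TEXT,   -- JSON list
    kd_digraph_simhash BLOB,   -- 8 bytes
    merged_into_uuid   TEXT REFERENCES attacker_identities(uuid),
    notes              TEXT
);
-- Placeholder shape for the existing table; only identity_id is new.
CREATE TABLE attackers (
    uuid        TEXT PRIMARY KEY,
    ip          TEXT NOT NULL,
    identity_id TEXT REFERENCES attacker_identities(uuid)
);
CREATE INDEX ix_attackers_identity_id ON attackers(identity_id);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# Existing write paths keep working without ever mentioning identity_id.
conn.execute("INSERT INTO attackers (uuid, ip) VALUES ('obs-1', '203.0.113.5')")
identity_count = conn.execute("SELECT COUNT(*) FROM attacker_identities").fetchone()[0]
linked_identity = conn.execute("SELECT identity_id FROM attackers").fetchone()[0]
```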
---
## Where intel lives — both, with clear semantics
DEBT-041 (`3eb67c9`) just re-keyed `attacker_intel` on `attacker_uuid`
(observation level). That work is correct; we do **not** touch it
here.
**Observation-level intel** (`attacker_intel`, current):
- AbuseIPDB confidence, GreyNoise classification, abuse.ch matches,
PTR records, GeoIP — all **IP-scoped facts**. An identity spanning
40 IPs has 40 distinct AbuseIPDB verdicts. We must not lose that
granularity.
**Identity-level intel** (`attacker_identity_intel`, deferred):
- Aggregate reputation (e.g. "this identity has been reported as
malicious across 4 of 5 observed IPs").
- Threat-actor naming from MISP/CTI feeds, where naming is
actor-scoped not IP-scoped.
- TTP / MITRE ATT&CK tags.
Different lifecycle (clusterer-driven, not enricher-driven), different
inputs (aggregates over observations, not direct API calls), so it
gets its own table and its own enricher when it ships. **Not in this
PR sequence.**
The IdentityDetail API (read side) aggregates observation intel on
read until the identity-level table exists.
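
A read-time rollup of observation intel might look like the sketch below, which reproduces the "4 of 5 observed IPs" style of summary. The `ObservationIntel` shape, the `abuse_confidence` field, and the threshold of 50 are all assumptions for illustration and are not taken from the `attacker_intel` schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ObservationIntel:
    """Hypothetical per-IP intel row; None means no verdict was returned."""
    ip: str
    abuse_confidence: Optional[int]


def summarize_intel(rows: List[ObservationIntel], threshold: int = 50) -> Dict[str, int]:
    """Aggregate per-IP verdicts without discarding the per-IP rows."""
    scored = [r for r in rows if r.abuse_confidence is not None]
    flagged = [r for r in scored if r.abuse_confidence >= threshold]
    return {
        "observed_ips": len(rows),
        "scored_ips": len(scored),
        "malicious_ips": len(flagged),
    }


intel = [
    ObservationIntel("203.0.113.5", 90),
    ObservationIntel("203.0.113.6", 75),
    ObservationIntel("198.51.100.7", 60),
    ObservationIntel("198.51.100.8", 55),
    ObservationIntel("192.0.2.44", 10),
]
summary = summarize_intel(intel)
```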
---
## Bus Topics
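Three topics are specified below; a fourth, `identity.unmerged`, was adopted on 2026-04-26 (see Open Questions, item 1) and its payload is not yet specified here. No publishers in this PR sequence: constants exist;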
publishers ship with the clusterer.
| Topic | Payload | When |
|---|---|---|
| `identity.formed` | `{identity_uuid, observation_uuids: [], confidence, first_seen_at}` | Clusterer creates a new identity from one or more observations |
| `identity.observation.linked` | `{identity_uuid, observation_uuid, confidence_after}` | Clusterer attaches an observation to an existing identity (or re-attaches one previously linked elsewhere) |
| `identity.merged` | `{winner_uuid, loser_uuid, observation_uuids: [], confidence_after}` | Clusterer collapses two identities. The loser's row stays in place via `merged_into_uuid`; subscribers re-key any cached identity references to the winner |
**Deferred:** `identity.campaign.assigned`. To be added opportunistically
when the campaign clusterer ships. YAGNI before then.
**Wiki:** `Service-Bus.md` documents these in the same commit that
adds the constants (per the project's `feedback_wiki_bus_signals`
rule).
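
The sketch below shows how the three constants and an `identity.>` subscription could fit together. The constant names match the import used in the Verification section; the matcher is an illustration that assumes NATS-style wildcard semantics (`*` matches exactly one token, `>` matches one or more trailing tokens), which this document does not confirm for the project's bus.

```python
IDENTITY_FORMED = "identity.formed"
IDENTITY_OBSERVATION_LINKED = "identity.observation.linked"
IDENTITY_MERGED = "identity.merged"


def matches(pattern: str, topic: str) -> bool:
    """Return True if `topic` matches `pattern` under NATS-style wildcards."""
    p_tokens = pattern.split(".")
    t_tokens = topic.split(".")
    for i, p in enumerate(p_tokens):
        if p == ">":
            # '>' must be the final token and needs at least one token to consume.
            return i == len(p_tokens) - 1 and len(t_tokens) > i
        if i >= len(t_tokens):
            return False
        if p != "*" and p != t_tokens[i]:
            return False
    return len(p_tokens) == len(t_tokens)


all_topics = [IDENTITY_FORMED, IDENTITY_OBSERVATION_LINKED, IDENTITY_MERGED]
delivered = [t for t in all_topics if matches("identity.>", t)]
```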
---
## API Surface
All new endpoints are read-only and auth-gated identically to
`/api/v1/attackers/*` (per `project_health_auth_gated`).
| Method | Path | Returns |
|---|---|---|
| GET | `/api/v1/identities` | Paginated list of identities. Response shape mirrors `AttackersResponse` |
| GET | `/api/v1/identities/{uuid}` | Identity row + aggregated intel summary (rolled up from FK'd observations) + campaign stub if assigned |
| GET | `/api/v1/identities/{uuid}/observations` | Paginated list of `Attacker` observation rows that FK to this identity |
While the table is empty, every endpoint returns either an empty list
or 404 — both are valid responses.
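
The empty-table behaviour can be sketched with two plain functions standing in for the route handlers. The `data` and `total` keys follow the smoke-test example in the Verification section; `limit`, `offset`, the 404 body, and the function names are assumptions, since the `AttackersResponse` shape is not reproduced in this document.

```python
from typing import Dict, Tuple


def list_identities(store: Dict[str, dict], limit: int = 50, offset: int = 0) -> dict:
    """Paginated list; an empty store yields an empty page, not an error."""
    rows = list(store.values())
    return {
        "data": rows[offset: offset + limit],
        "total": len(rows),
        "limit": limit,
        "offset": offset,
    }


def get_identity(store: Dict[str, dict], identity_uuid: str) -> Tuple[int, dict]:
    """Return (status, body); unknown UUIDs are a 404."""
    row = store.get(identity_uuid)
    if row is None:
        return 404, {"detail": "identity not found"}
    return 200, row


empty_store: Dict[str, dict] = {}
listing = list_identities(empty_store)
status, body = get_identity(empty_store, "00000000-0000-4000-8000-000000000000")
```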
**`AttackerDetail` change** (frontend, not API): when
`attackers.identity_id` is non-null, render an "Identity: <uuid-link>"
badge linking to `/identities/<uuid>`. No change otherwise.
---
## Frontend
- **`AttackerDetail.tsx`** — conditional badge. Zero behavior change
when `identity_id` is null.
- **`IdentityDetail.tsx`** (new) — aggregates observations, fingerprint
summary, intel summary, campaign link. Same visual vocabulary as
`AttackerDetail` so operators feel at home.
- **Routing** — `/identities/:uuid` alongside `/attackers/:uuid`.
- Default browse remains "Attackers." There is no "Identities" tab
in the main navigation until identities are populated; once they
are, an "Identity Resolution" entry appears under the Analytics
section (this is post-clusterer; out of scope here).
---
## Risks
1. **Confidence drift.** The clusterer can rewrite identity
assignments as evidence accumulates. An observation linked to
identity-A today may move to identity-B tomorrow. UI must surface
this without alarming operators ("This observation has been
re-attributed; previous identity remains as a soft-merged
reference"). The `merged_into_uuid` chain is the audit trail.
2. ~~**API URL stability.**~~ Resolved in commit `dc3d08d`: the
read-only API follows `merged_into_uuid` and surfaces the canonical
winner. Loser UUIDs resolve to the winner row.
3. **Schema-version lock-in for federation.** `schema_version=1` is
what we ship. Any fingerprint added to the identity row post-v1
bumps the version. Operators running older versions get a degraded
gossip experience but should not crash — the receiver must
tolerate unknown fields.
4. **Observation FK proliferation.** Today only `attackers` would
carry `identity_id`. Tomorrow, `SessionProfile`, `AttackerIntel`,
webhook payloads might want it too. Resist proliferation; the
normalised path is `observation.identity_id` and identity-level
facts go in `attacker_identity_intel`. We only carry `identity_id`
on tables where joining via the observation row is materially
slower at read time.
5. **Identity-level intel scope creep.** Easy to start moving DEBT-041
intel up to identity level "for cleanliness." Don't. AbuseIPDB
results are IP-scoped facts; moving them up loses information.
Identity-level intel is *aggregate* intel, a different thing.
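
For risk 2, following `merged_into_uuid` to the canonical winner amounts to walking a linked chain. The sketch below is not the code from commit `dc3d08d`; the function name and the in-memory mapping are hypothetical, and the cycle guard is an added safety measure that the document does not mention.

```python
from typing import Dict, Optional


def resolve_canonical(merged_into: Dict[str, Optional[str]], start_uuid: str) -> str:
    """Follow merged_into_uuid pointers until a row with no pointer is found.

    `merged_into` maps identity uuid -> merged_into_uuid (None for a winner).
    """
    seen = set()
    current = start_uuid
    while True:
        if current not in merged_into:
            raise KeyError(f"unknown identity: {current}")
        if current in seen:
            raise ValueError(f"merge cycle detected at {current}")
        seen.add(current)
        nxt = merged_into[current]
        if nxt is None:
            return current
        current = nxt


# id-a was merged into id-b, which was later merged into id-c.
chain = {"id-a": "id-b", "id-b": "id-c", "id-c": None}
winner = resolve_canonical(chain, "id-a")
```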
---
## Open Questions
1. ~~**Revocability of identity merges.**~~ **Resolved 2026-04-26:**
merges are revocable. `identity.unmerged` topic ships in
`decnet/bus/topics.py` alongside the existing three so subscribers
on `identity.>` get it from day one. Clusterer clears
`merged_into_uuid`, re-links observations, publishes
`identity.unmerged` + a fresh `identity.formed` for the
resurrected side.
2. ~~**`AttackerDetail` UX when `identity_id` changes.**~~ **Resolved
2026-04-26:** SSE channel modeled on the topology-mutator SSE.
New endpoint subscribes to `identity.>`, JWT via `?token=`,
snapshot-on-connect + live forward. `AttackerDetail` and
`IdentityDetail` consume it.
3. **`SessionProfile.identity_id` FK.** Does this PR sequence add it,
or does it wait for V2 keystroke dynamics? Leaning **wait** — the
FK is only useful when the identity-level keystroke similarity
query exists, which is V2 work. Adding a column we don't read in
v1 is unused complexity.
4. **Webhook payload `identity_id`.** To be added opportunistically once
identities are populated. Not load-bearing for this PR sequence.
5. **Identity-level intel table.** Schema sketch is straightforward
(uuid PK, identity_uuid FK, source, confidence, ttps JSON,
timestamps), but the enricher is meaningfully different from
the IP-scoped one. Defer entirely.
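
The unmerge sequence resolved in item 1 can be sketched against in-memory dictionaries. Everything here is illustrative: the `unmerge` function, its arguments, and the `identity.unmerged` payload are assumptions (the document does not define that payload), while the `identity.formed` payload keys follow the Bus Topics table. A real clusterer would also need to maintain the denormalized `observation_count`, which this sketch leaves out.

```python
from typing import Callable, Dict, List


def unmerge(
    identities: Dict[str, dict],
    observations: Dict[str, dict],
    loser_uuid: str,
    returning: List[str],
    publish: Callable[[str, dict], None],
) -> None:
    """Revoke a soft merge and announce the resurrected identity."""
    loser = identities[loser_uuid]
    winner_uuid = loser["merged_into_uuid"]
    if winner_uuid is None:
        raise ValueError(f"{loser_uuid} is not merged into anything")

    # 1. Clear the soft-merge pointer.
    loser["merged_into_uuid"] = None

    # 2. Re-link the observations that belong to the resurrected side.
    for obs_uuid in returning:
        observations[obs_uuid]["identity_id"] = loser_uuid

    # 3. Publish the revocation, then a fresh formation event.
    publish("identity.unmerged", {  # payload shape is an assumption
        "winner_uuid": winner_uuid,
        "loser_uuid": loser_uuid,
        "observation_uuids": list(returning),
    })
    publish("identity.formed", {
        "identity_uuid": loser_uuid,
        "observation_uuids": list(returning),
        "confidence": loser.get("confidence"),
        "first_seen_at": loser.get("first_seen_at"),
    })


events = []
identities = {
    "id-a": {"merged_into_uuid": None},
    "id-b": {"merged_into_uuid": "id-a"},
}
observations = {
    "obs-1": {"identity_id": "id-a"},
    "obs-2": {"identity_id": "id-a"},
}
unmerge(identities, observations, "id-b", ["obs-2"],
        lambda topic, payload: events.append((topic, payload)))
```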
---
## What is explicitly NOT in this design
- The clusterer worker (`decnet/clustering/` worker bin). Designed in
`CAMPAIGN_CLUSTERING.md` §4; lands on top of this substrate.
- `attacker_identity_intel` table.
- `SessionProfile.identity_id` FK.
- Webhook payload `identity_id` enrichment.
- Renaming `attackers` → `attacker_observations`. Considered, rejected.
- Identity-level federation gossip. The schema is federation-ready
(schema_version, no operator-identifying fields); the gossip wire
itself is V2.
---
## Verification
After all 5 commits below land:
```bash
source .311/bin/activate
# Schema lands cleanly.
pytest tests/db/test_identity_schema.py -v
# API surface returns expected shapes against an empty identities table.
pytest tests/web/test_api_identities.py -v
# No regressions on the unchanged path.
pytest tests/web/ tests/profiler/ tests/correlation/ -v
# Bus topic constants importable; wiki updated.
python -c "from decnet.bus.topics import IDENTITY_FORMED, IDENTITY_OBSERVATION_LINKED, IDENTITY_MERGED; print('OK')"
test -f wiki-checkout/Identity-Resolution.md
grep -q "identity.formed" wiki-checkout/Service-Bus.md
# Factory adapter unblocks fixture 2.
pytest tests/clustering/test_campaign_factory.py -v
```
Manual smoke after schema + API + frontend:
- `decnet api` then `decnet web`.
- Browse to an existing AttackerDetail page → no badge (identity_id is NULL).
- `GET /api/v1/identities` → `{"data": [], "total": 0, ...}`.
- `GET /api/v1/identities/<random-uuid>` → 404.