feat(clustering): combined edge weight + medium-tier wiring

The clusterer now replaces the single high-tier function call with a
tier-weighted sum. Tier multipliers (high=1.0, medium=0.6, low=0.2,
very_low=0.05) are tuned so the threshold (1.0) admits high-tier
agreement alone while every weaker tier, and every combination of
weaker tiers, stays below it: the three weak tiers together sum to
only 0.85.
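
The arithmetic behind that claim can be sketched as follows. This is a minimal illustration, not the code in this commit: `combined_weight` is a hypothetical stand-in for `combined_edge_weight` (which takes two observations), and representing per-tier agreement as a dict of scores in [0, 1] is an assumption. Only the multipliers and the threshold come from the message above.

```python
from itertools import combinations

# Multipliers and threshold as quoted in the commit message.
TIER_WEIGHTS = {"high": 1.0, "medium": 0.6, "low": 0.2, "very_low": 0.05}
EDGE_THRESHOLD = 1.0


def combined_weight(signals: dict[str, float]) -> float:
    """Tier-weighted sum; `signals` maps a tier name to a score in [0, 1].

    Hypothetical stand-in for combined_edge_weight, which operates on
    two observations in the real code.
    """
    return sum(TIER_WEIGHTS[tier] * score for tier, score in signals.items())


def weak_combinations_below_threshold() -> bool:
    """Check every non-empty subset of the weaker tiers at full agreement."""
    weak = ["medium", "low", "very_low"]
    for size in range(1, len(weak) + 1):
        for subset in combinations(weak, size):
            if combined_weight({t: 1.0 for t in subset}) >= EDGE_THRESHOLD:
                return False
    return True
```

Because the largest weak-tier total is 0.85, the threshold could move anywhere in (0.85, 1.0] without changing which of these combinations cluster.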

Per-tier discipline tested:
- high alone clusters
- medium alone does NOT cluster (supporting signal only)
- low alone does NOT cluster (fixture 1's failure mode)
- very-low alone does NOT cluster (fixture 2's failure mode)
- all three weak tiers stacked still don't reach threshold
- high + medium clusters (high already saturates)
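
The six cases above can be written as a small expectation table. This is a sketch under assumptions: `clusters` is a hypothetical helper that treats each listed tier as being in full agreement, and the real tests presumably build `Observation` fixtures instead.

```python
# Multipliers and threshold as quoted in the commit message.
TIER_WEIGHTS = {"high": 1.0, "medium": 0.6, "low": 0.2, "very_low": 0.05}
EDGE_THRESHOLD = 1.0


def clusters(*agreeing_tiers: str) -> bool:
    """True when the summed multipliers of the agreeing tiers reach the threshold."""
    return sum(TIER_WEIGHTS[t] for t in agreeing_tiers) >= EDGE_THRESHOLD


# (agreeing tiers, expected to cluster?) in the order of the list above.
CASES = [
    (("high",), True),
    (("medium",), False),
    (("low",), False),
    (("very_low",), False),
    (("medium", "low", "very_low"), False),
    (("high", "medium"), True),
]


def run_cases() -> list[bool]:
    """Return one pass/fail flag per case."""
    return [clusters(*tiers) == expected for tiers, expected in CASES]
```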

The combination is forward-compatible: low + very-low contributions
are computed today but always project to 0.0 because the production
adapter doesn't populate credentials / ASN-edge inputs into the
fixture path yet. Their contribution becomes load-bearing in commit 7
when the low-tier landing tightens the F1 / F2 bounds.
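
One way a contribution can be computed yet always come out as 0.0 is for the tier function to return zero whenever an input is missing. The sketch below is hypothetical throughout: the function name, the credential-set representation, and the Jaccard overlap are assumptions standing in for whatever the real low-tier function does.

```python
from typing import Optional

LOW_WEIGHT = 0.2  # low-tier multiplier from the commit message


def low_tier_contribution(
    creds_a: Optional[frozenset[str]],
    creds_b: Optional[frozenset[str]],
) -> float:
    """Weighted low-tier term; 0.0 whenever either side is unpopulated.

    Hypothetical: Jaccard overlap stands in for the real similarity.
    """
    if not creds_a or not creds_b:
        # Missing or empty inputs contribute nothing to the sum.
        return 0.0
    overlap = len(creds_a & creds_b) / len(creds_a | creds_b)
    return LOW_WEIGHT * overlap
```

With this shape, populating the inputs later changes the edge weight without any change to the summation itself.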

Fixture 4 (paused_campaign) ratchet added: high-tier signal carries
the multi-day-silence campaign into one identity. Time-agnostic
invariant — silence is irrelevant to the edge weight.
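
The time-agnostic invariant can be illustrated with a small union-find sketch. The `Obs` dataclass, its `fingerprint` field, and `edge_weight` are hypothetical; the only point carried over from the fixture is that the timestamp never enters the edge weight.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Obs:
    """Hypothetical observation; the real Observation has other fields."""
    observation_id: str
    fingerprint: str    # stands in for a high-tier signal
    seen_at: datetime   # deliberately unused by the edge weight


def edge_weight(a: Obs, b: Obs) -> float:
    """1.0 on high-tier agreement, else 0.0; timestamps are never read."""
    return 1.0 if a.fingerprint == b.fingerprint else 0.0


def cluster(observations: list[Obs], threshold: float = 1.0) -> dict[str, str]:
    """Map each observation id to its component root via union-find."""
    parent = {o.observation_id: o.observation_id for o in observations}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, a in enumerate(observations):
        for b in observations[i + 1:]:
            if edge_weight(a, b) >= threshold:
                parent[find(b.observation_id)] = find(a.observation_id)

    return {oid: find(oid) for oid in parent}
```

Two observations sharing a fingerprint end up under the same root whether they were seen minutes or days apart.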
2026-04-26 08:22:10 -04:00
parent de2f4c3a62
commit f7da33726c
4 changed files with 159 additions and 9 deletions

@@ -34,8 +34,9 @@ from typing import Any, Iterable, Optional
 from decnet.clustering.base import Clusterer, ClusterResult
 from decnet.clustering.impl.similarity import (
+    EDGE_THRESHOLD,
     Observation,
-    high_weight_edge,
+    combined_edge_weight,
 )
 from decnet.logging import get_logger
 from decnet.web.db.repository import BaseRepository
@@ -43,13 +44,6 @@ from decnet.web.db.repository import BaseRepository
 log = get_logger("clustering.connected_components")
-# Threshold above which an edge survives into the graph. The high-tier
-# functions return 1.0 on agreement, so a literal >= 1.0 cutoff means
-# "exact match required." Once medium-tier edges combine, this becomes
-# a tunable.
-_EDGE_THRESHOLD = 1.0
 def cluster_observations(
     observations: Iterable[Observation],
 ) -> dict[str, str]:
@@ -81,7 +75,7 @@ def cluster_observations(
     for i, a in enumerate(obs_list):
         for b in obs_list[i + 1:]:
-            if high_weight_edge(a, b) >= _EDGE_THRESHOLD:
+            if combined_edge_weight(a, b) >= EDGE_THRESHOLD:
                 union(a.observation_id, b.observation_id)
     # Roots: each unique find(o) is a component representative. Use