feat(clustering): combined edge weight + medium-tier wiring

The clusterer now replaces the single high-tier function call with a
tier-weighted sum. Tier multipliers (high=1.0, medium=0.6, low=0.2,
very_low=0.05) are tuned so the threshold (1.0) admits high-tier
agreement alone while leaving every weaker tier, and every combination
of weaker tiers, under threshold.

Per-tier discipline tested:

- high alone clusters
- medium alone does NOT cluster (supporting signal only)
- low alone does NOT cluster (fixture 1's failure mode)
- very-low alone does NOT cluster (fixture 2's failure mode)
- all three weak tiers stacked still don't reach threshold
  (0.6 + 0.2 + 0.05 = 0.85)
- high + medium clusters (high already saturates)

The combination is forward-compatible: low + very-low contributions are
computed today but always project to 0.0 because the production adapter
doesn't populate credentials / ASN-edge inputs into the fixture path
yet. Their contribution becomes load-bearing in commit 7, when the
low-tier landing tightens the F1 / F2 bounds.

Fixture 4 (paused_campaign) ratchet added: the high-tier signal carries
the multi-day-silence campaign into one identity. This is a
time-agnostic invariant: silence is irrelevant to the edge weight.
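
The per-tier discipline listed in this message can be checked with plain arithmetic. A minimal sketch, assuming each per-tier score is at its maximum of 1.0 when that tier agrees; the `weight` helper is hypothetical and only mirrors the multipliers quoted above:

```python
from itertools import combinations

# Multipliers and threshold as quoted in the commit message.
TIER_WEIGHTS = {"high": 1.0, "medium": 0.6, "low": 0.2, "very_low": 0.05}
EDGE_THRESHOLD = 1.0


def weight(agreeing_tiers):
    """Hypothetical helper: combined weight when each listed tier scores 1.0."""
    return sum(TIER_WEIGHTS[t] for t in agreeing_tiers)


weak = ["medium", "low", "very_low"]

# Every non-empty subset of the weak tiers must stay under threshold.
weak_subsets = [c for r in range(1, len(weak) + 1) for c in combinations(weak, r)]
assert all(weight(c) < EDGE_THRESHOLD for c in weak_subsets)

# High-tier agreement reaches the threshold, with or without medium.
assert weight(["high"]) >= EDGE_THRESHOLD
assert weight(["high", "medium"]) >= EDGE_THRESHOLD
```

Enumerating subsets rather than testing only the full stack also guards against a future retune that lets two weak tiers cross the threshold on their own.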
@@ -34,8 +34,9 @@ from typing import Any, Iterable, Optional
 
 from decnet.clustering.base import Clusterer, ClusterResult
 from decnet.clustering.impl.similarity import (
+    EDGE_THRESHOLD,
     Observation,
-    high_weight_edge,
+    combined_edge_weight,
 )
 from decnet.logging import get_logger
 from decnet.web.db.repository import BaseRepository
@@ -43,13 +44,6 @@ from decnet.web.db.repository import BaseRepository
 log = get_logger("clustering.connected_components")
 
 
-# Threshold above which an edge survives into the graph. The high-tier
-# functions return 1.0 on agreement, so a literal >= 1.0 cutoff means
-# "exact match required." Once medium-tier edges combine, this becomes
-# a tunable.
-_EDGE_THRESHOLD = 1.0
-
-
 def cluster_observations(
     observations: Iterable[Observation],
 ) -> dict[str, str]:
@@ -81,7 +75,7 @@ def cluster_observations(
 
     for i, a in enumerate(obs_list):
         for b in obs_list[i + 1:]:
-            if high_weight_edge(a, b) >= _EDGE_THRESHOLD:
+            if combined_edge_weight(a, b) >= EDGE_THRESHOLD:
                 union(a.observation_id, b.observation_id)
 
     # Roots: each unique find(o) is a component representative. Use
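
For readers without the full file, the pairwise loop in the hunk above can be sketched as a small union-find. This is an illustrative sketch, not the module's actual code: `cluster_by_threshold`, the `(id, payload)` tuples, and the toy weight function are assumptions, and the real `cluster_observations` may choose representatives differently.

```python
from typing import Callable, Hashable, Iterable, Tuple

Item = Tuple[Hashable, object]  # (observation id, payload); hypothetical shape


def cluster_by_threshold(
    items: Iterable[Item],
    edge_weight: Callable[[object, object], float],
    threshold: float = 1.0,
) -> dict:
    """Map each id to a component representative via union-find."""
    item_list = list(items)
    parent = {item_id: item_id for item_id, _ in item_list}

    def find(x):
        # Path halving: point each visited node at its grandparent.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    # Compare every unordered pair once; only edges at or above the
    # threshold merge their endpoints.
    for i, (id_a, a) in enumerate(item_list):
        for id_b, b in item_list[i + 1:]:
            if edge_weight(a, b) >= threshold:
                union(id_a, id_b)

    return {item_id: find(item_id) for item_id, _ in item_list}


# Toy weight: full agreement when payloads are equal, otherwise nothing.
labels = cluster_by_threshold(
    [("o1", "key-a"), ("o2", "key-a"), ("o3", "key-b")],
    lambda a, b: 1.0 if a == b else 0.0,
)
```

Because the weight function is passed in, swapping the high-tier call for a combined weight changes only the callable, which matches the one-line change in this hunk.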
@@ -162,6 +162,63 @@ def very_low_weight_edge(a: Observation, b: Observation) -> float:
     return 1.0 if a.asn == b.asn else 0.0
 
 
+# ─── Combined weight ────────────────────────────────────────────────────────
+
+#: Tier multipliers applied to the per-tier edge scores when combining
+#: into a single weight. Tuned so that:
+#:
+#: * High-tier agreement alone (1.0) crosses the 1.0 threshold.
+#: * Medium-tier alone (max 1.0) yields 0.6, below threshold.
+#: * Low-tier alone (max 1.0) yields 0.2, defeating fixture 1's
+#:   credential-overlap-only failure mode.
+#: * Very-low alone (max 1.0) yields 0.05, defeating fixture 2's
+#:   ASN-rotation failure mode.
+#:
+#: The ratio between tiers matters more than the absolute values: a
+#: tier should never combine its way past threshold without help from
+#: a stronger one.
+TIER_WEIGHTS = {
+    "high": 1.0,
+    "medium": 0.6,
+    "low": 0.2,
+    "very_low": 0.05,
+}
+
+#: Threshold a combined edge weight must meet to survive into the
+#: similarity graph. The connected-components impl drops anything
+#: under this before running union-find.
+EDGE_THRESHOLD = 1.0
+
+
+def combined_edge_weight(a: Observation, b: Observation) -> float:
+    """Sum of all four tier scores, weighted by :data:`TIER_WEIGHTS`.
+
+    Each per-tier function returns a score in ``[0, 1]``; the
+    weighted sum lets stronger tiers dominate without letting weaker
+    ones combine their way past threshold.
+
+    The connected-components clusterer compares this against
+    :data:`EDGE_THRESHOLD` to decide whether to draw an edge. Pure /
+    time-agnostic: fixture 7 forbids recency-decay weighting.
+
+    Commits 5–7 land each tier in the call site:
+
+    * Commit 5 (this commit): high + medium.
+    * Commit 6: + phase-handoff (a separate edge family, not a tier).
+    * Commit 7: + low + very_low.
+
+    Until commit 7 lands, the low / very_low contributions stay zero
+    by virtue of the underlying functions returning ``0.0`` whenever
+    their inputs are missing. The combination is forward-compatible.
+    """
+    return (
+        TIER_WEIGHTS["high"] * high_weight_edge(a, b)
+        + TIER_WEIGHTS["medium"] * medium_weight_edge(a, b)
+        + TIER_WEIGHTS["low"] * low_weight_edge(a, b)
+        + TIER_WEIGHTS["very_low"] * very_low_weight_edge(a, b)
+    )
+
+
 # ─── Adapter for the synthetic-corpus tests ─────────────────────────────────
 
 
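
The forward-compatibility note in the `combined_edge_weight` docstring depends on weak-tier functions returning 0.0 when their inputs are missing. A rough sketch of that behavior, assuming a hypothetical `Obs` dataclass: the field names, the `_agree` helper, and the mapping of fields to tiers are illustrative stand-ins, not the project's `Observation` or its real per-tier functions.

```python
from dataclasses import dataclass
from typing import Optional

TIER_WEIGHTS = {"high": 1.0, "medium": 0.6, "low": 0.2, "very_low": 0.05}


@dataclass(frozen=True)
class Obs:
    """Hypothetical stand-in for Observation; field names are assumptions."""
    fingerprint: Optional[str] = None  # drives the high tier in this sketch
    tooling: Optional[str] = None      # medium tier
    credential: Optional[str] = None   # low tier
    asn: Optional[int] = None          # very-low tier


def _agree(x, y) -> float:
    # A missing input on either side contributes nothing.
    if x is None or y is None:
        return 0.0
    return 1.0 if x == y else 0.0


def combined(a: Obs, b: Obs) -> float:
    return (
        TIER_WEIGHTS["high"] * _agree(a.fingerprint, b.fingerprint)
        + TIER_WEIGHTS["medium"] * _agree(a.tooling, b.tooling)
        + TIER_WEIGHTS["low"] * _agree(a.credential, b.credential)
        + TIER_WEIGHTS["very_low"] * _agree(a.asn, b.asn)
    )


# Weak-tier inputs unpopulated: only high + medium contribute.
today = combined(Obs("fp1", "tool-x"), Obs("fp1", "tool-x"))
# Matching ASN and nothing else: far below the 1.0 threshold.
asn_only = combined(Obs(asn=64500), Obs(asn=64500))
```

With this shape, populating the weak-tier fields later changes the sum without touching the call site.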
@@ -206,5 +263,8 @@ __all__ = [
     "medium_weight_edge",
     "low_weight_edge",
     "very_low_weight_edge",
+    "combined_edge_weight",
     "from_synthetic",
+    "EDGE_THRESHOLD",
+    "TIER_WEIGHTS",
 ]