Divergence Engine

The Divergence Engine is the core of CT Toolkit's security layer. It measures and mitigates identity drift in real time using a multi-tiered set of checks.

The Scoring Mechanism

The engine calculates a Divergence Score ($D$) between the agent's current interaction ($I_n$) and its Constitutional Kernel ($K$).

\[D = 1 - \text{CosineSimilarity}(\text{Embed}(I_n), \text{Embed}(K))\]

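$0.0$: Perfect alignment. The embedding of the response points in the same direction as the embedding of the kernel, so the response is consistent with the agent's identity.
$1.0$: Absolute drift. The two embeddings are orthogonal, so the response has no measurable relation to the agent's core commitments.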
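
The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not CT Toolkit's implementation: the function name `divergence_score` is hypothetical, and the two vectors are assumed to be embeddings already produced by whatever embedding model is in use.

```python
import math
from typing import Sequence


def divergence_score(interaction_vec: Sequence[float],
                     kernel_vec: Sequence[float]) -> float:
    """Return D = 1 - cosine_similarity(interaction_vec, kernel_vec).

    Hypothetical helper for illustration; the inputs stand in for
    Embed(I_n) and Embed(K).
    """
    if len(interaction_vec) != len(kernel_vec):
        raise ValueError("embeddings must have the same dimension")

    dot = sum(a * b for a, b in zip(interaction_vec, kernel_vec))
    norm_i = math.sqrt(sum(a * a for a in interaction_vec))
    norm_k = math.sqrt(sum(b * b for b in kernel_vec))
    if norm_i == 0.0 or norm_k == 0.0:
        # Cosine similarity is undefined for a zero vector.
        raise ValueError("embeddings must be non-zero")

    return 1.0 - dot / (norm_i * norm_k)
```

Because cosine similarity lies in $[-1, 1]$, this quantity can in principle exceed $1.0$ when the two embeddings point in opposing directions; how such values are treated is not stated here.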

Monitoring Tiers

CT Toolkit uses a "Progressive Hardening" approach to minimize latency while maximizing safety.

Tier 1: Embedding Cosine Similarity (ECS)

Method: Vector comparison using fast embedding models.
Cost: Near zero.
Latency: Minimal (< 50 ms).
Action: Low-level monitoring and warning.

Tier 2: LLM-as-Judge

Method: A secondary "Identity Judge" LLM analyzes the response against the kernel's text.
Trigger: Runs when the Tier 1 score exceeds `l2_threshold`.
Action: Can initiate Autonomous Self-Correction if the judge detects misalignment.

Tier 3: Identity Probe Battery (ICM)

Method: Runs the full suite of "Identity Consistency Measures" (ICM).
Trigger: Runs for high-stakes interactions or critical drift.
Action: Hard block of the response and notification of a human operator.
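
The escalation across the three tiers can be sketched as follows. This is an assumption-laden illustration rather than the toolkit's code: only `l2_threshold` is named in this page, while `evaluate`, `Verdict`, `l3_threshold`, `high_stakes`, the two callables, the default threshold values, and the action labels are hypothetical. The point it shows is that the expensive checks run only when the cheap Tier 1 score calls for them.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    tier: int    # highest tier that actually ran
    action: str  # illustrative label, not a CT Toolkit constant


def evaluate(
    score: float,
    judge_flags_misalignment: Callable[[], bool],
    probes_pass: Callable[[], bool],
    l2_threshold: float = 0.30,   # assumed default
    l3_threshold: float = 0.50,   # hypothetical parameter
    high_stakes: bool = False,    # hypothetical flag
) -> Verdict:
    """Escalate from the cheap score to the judge or probes only when needed."""
    # Tier 3: high-stakes interactions or severe drift go to the probe battery.
    if high_stakes or score > l3_threshold:
        if probes_pass():
            return Verdict(3, "allow")  # assumed outcome when probes pass
        return Verdict(3, "block_and_notify")

    # Tier 2: the judge is consulted only above l2_threshold.
    if score > l2_threshold:
        if judge_flags_misalignment():
            return Verdict(2, "self_correct")
        return Verdict(2, "allow")

    # Tier 1: the embedding score alone is enough; just keep monitoring.
    return Verdict(1, "monitor")
```

Passing the judge and the probes as callables keeps their cost deferred: neither is invoked unless the branch that needs it is reached.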

Summary of Tiers

| Tier | Score Range | Action | Cost |
|---|---|---|---|
| `ok` | 0.00 – 0.15 | No action | $ |
| `l1_warning` | 0.15 – 0.30 | Log & Monitor | $ |
| `l2_judge` | 0.30 – 0.50 | Judge Analysis | $$ |
| `l3_icm` | 0.50 – 0.80 | Identity Probes | $$$ |
| `critical` | 0.80+ | Immediate Block | $$$ |
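
As a rough sketch, the score bands in this table can be turned into a lookup. The function name `classify_tier` is hypothetical. Since adjacent ranges share their boundary values, the sketch assumes each band includes its lower bound and excludes its upper bound; the toolkit may resolve boundaries differently.

```python
from bisect import bisect_right

# Upper bounds of the first four bands, taken from the summary table.
_BOUNDS = [0.15, 0.30, 0.50, 0.80]
_TIERS = ["ok", "l1_warning", "l2_judge", "l3_icm", "critical"]


def classify_tier(score: float) -> str:
    """Map a divergence score to a tier label (hypothetical helper).

    Assumption: a score equal to a boundary falls into the higher tier.
    """
    if score < 0.0:
        raise ValueError("divergence score must be non-negative")
    # bisect_right counts how many bounds are <= score, which is
    # exactly the index of the band the score falls into.
    return _TIERS[bisect_right(_BOUNDS, score)]
```

For example, `classify_tier(0.42)` yields `"l2_judge"`, and any score of 0.80 or above yields `"critical"`.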

See Tiered Guardrails for the implementation details.