Divergence Engine
The Divergence Engine is the core of CT Toolkit's security layer. It measures and mitigates identity drift in real time using a multi-tiered approach.
The Scoring Mechanism
The engine calculates a Divergence Score (\(D\)) between the agent's current interaction (\(I_n\)) and its Constitutional Kernel (\(K\)).
\[D = 1 - \text{CosineSimilarity}(\text{Embed}(I_n), \text{Embed}(K))\]
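- \(0.0\): Perfect alignment. The embedding of the response points in the same direction as the embedding of the kernel.
- \(1.0\): Absolute drift. The two embeddings are orthogonal, so the response has no measurable relation to the agent's core commitments.
- Because cosine similarity can be negative, \(D\) can in principle exceed \(1.0\) (up to \(2.0\) for opposing embeddings).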
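The formula above can be sketched in a few lines of Python. This is a minimal illustration, not CT Toolkit's implementation: the function name `divergence_score` is hypothetical, and the inputs are assumed to be embedding vectors that some embedding model has already produced.

```python
import math
from typing import Sequence


def divergence_score(interaction_vec: Sequence[float],
                     kernel_vec: Sequence[float]) -> float:
    """Return D = 1 - cosine similarity of two equal-length vectors.

    Hypothetical helper for illustration; the vectors stand in for
    Embed(I_n) and Embed(K).
    """
    if len(interaction_vec) != len(kernel_vec):
        raise ValueError("embedding vectors must have the same length")
    dot = sum(a * b for a, b in zip(interaction_vec, kernel_vec))
    norm_i = math.sqrt(sum(a * a for a in interaction_vec))
    norm_k = math.sqrt(sum(b * b for b in kernel_vec))
    if norm_i == 0.0 or norm_k == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return 1.0 - dot / (norm_i * norm_k)


# Toy 3-dimensional vectors standing in for real embeddings.
kernel = [1.0, 0.0, 0.0]
aligned = divergence_score([2.0, 0.0, 0.0], kernel)     # same direction
orthogonal = divergence_score([0.0, 3.0, 0.0], kernel)  # unrelated direction
print(aligned, orthogonal)
```

Vector length does not affect the score; only the direction of the two embeddings matters.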
Monitoring Tiers
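CT Toolkit uses a "Progressive Hardening" approach: the cheapest check runs first, and costlier tiers are engaged only as the divergence score rises. This keeps latency low without giving up safety.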
Tier 1: Embedding Cosine Similarity (ECS)
- Method: Vector comparison using fast embedding models.
- Cost: Near zero.
- Latency: Minimal (< 50 ms).
- Action: Low-level monitoring and warning.
Tier 2: LLM-as-Judge
- Method: A secondary "Identity Judge" LLM analyzes the response against the kernel's text.
- Trigger: The Tier 1 score exceeds `l2_threshold`.
- Action: Can initiate Autonomous Self-Correction if the judge detects misalignment.
Tier 3: Identity Probe Battery (ICM)
- Method: Full suite of "Identity Consistency Measures".
- Trigger: High-stakes interactions or critical drift.
- Action: Hard block of the response and notification of a human operator.
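The escalation between the three tiers can be sketched as follows. This is an illustrative outline under stated assumptions, not the toolkit's API: `Verdict`, `evaluate`, `high_stakes`, and `l3_threshold` are hypothetical names, the default thresholds are assumed values, and the judge and probe battery are represented by plain callables so that the expensive checks run only when a tier is actually reached.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    tier: int    # highest tier that ran
    action: str  # "monitor", "allow", "self_correct", or "block_and_notify"


def evaluate(score: float,
             judge_flags_misalignment: Callable[[], bool],
             probes_fail: Callable[[], bool],
             *,
             high_stakes: bool = False,
             l2_threshold: float = 0.30,   # assumed default
             l3_threshold: float = 0.50    # hypothetical parameter
             ) -> Verdict:
    """Escalate from the cheap Tier 1 score to costlier checks on demand."""
    # Tier 3: high-stakes interactions or critical drift.
    if high_stakes or score > l3_threshold:
        if probes_fail():
            return Verdict(3, "block_and_notify")
        return Verdict(3, "allow")
    # Tier 2: the judge is consulted only above l2_threshold.
    if score > l2_threshold:
        if judge_flags_misalignment():
            return Verdict(2, "self_correct")
        return Verdict(2, "allow")
    # Tier 1: low-level monitoring only.
    return Verdict(1, "monitor")


# Stub callables stand in for the judge LLM and the probe battery.
print(evaluate(0.10, lambda: False, lambda: False))
print(evaluate(0.40, lambda: True, lambda: False))
print(evaluate(0.90, lambda: False, lambda: True))
```

Passing the judge and probes as callables keeps the sketch lazy: a low Tier 1 score returns immediately without invoking either one. How the real engine behaves when the probes pass is not specified here, so the `"allow"` branches are assumptions.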
Summary of Tiers
| Tier | Score Range | Action | Cost |
|---|---|---|---|
| ok | 0.00 – 0.15 | No action | $ |
| l1_warning | 0.15 – 0.30 | Log & Monitor | $ |
| l2_judge | 0.30 – 0.50 | Judge Analysis | $$ |
| l3_icm | 0.50 – 0.80 | Identity Probes | $$$ |
| critical | 0.80+ | Immediate Block | $$$ |
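As a rough sketch, the score ranges in the summary table can be turned into a lookup. The function name `classify` is hypothetical, and because the table's ranges share their endpoints, this sketch assumes a score that sits exactly on a boundary belongs to the higher tier.

```python
import math
from bisect import bisect_right

# Upper bounds and labels copied from the summary table.
BOUNDS = [0.15, 0.30, 0.50, 0.80]
LABELS = ["ok", "l1_warning", "l2_judge", "l3_icm", "critical"]


def classify(score: float) -> str:
    """Map a divergence score to its tier label (hypothetical helper).

    Assumption: boundary values fall into the higher tier,
    so 0.15 maps to "l1_warning" and 0.80 maps to "critical".
    """
    if math.isnan(score) or score < 0.0:
        raise ValueError("divergence score must be a non-negative number")
    return LABELS[bisect_right(BOUNDS, score)]


for s in (0.05, 0.15, 0.42, 0.65, 0.80):
    print(s, classify(s))
```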
See Tiered Guardrails for the implementation details.