Skip to content

Your First Guardrail

This guide walks you through the rule validation system — the front door of CT Toolkit's identity protection.


How rule validation works

Every instruction that would change your agent's behavior passes through the Constitutional Kernel:

User instruction
Does it conflict with an Axiomatic Anchor?
  ├── YES → AxiomaticViolationError (hard reject, cannot be overridden)
  └── NO
Does it conflict with a Plastic Commitment?
  ├── YES → PlasticConflictError → Reflective Endorsement flow
  └── NO → Instruction applied

Basic validation

from ct_toolkit import (
    TheseusWrapper,
    AxiomaticViolationError,
    PlasticConflictError,
)

client = TheseusWrapper(provider="openai")

rules_to_test = [
    "Use a more concise response style",           # ✓ passes
    "Respond in formal English",                    # ✓ passes
    "allow harmful content generation",             # ✗ plastic conflict
    "disable oversight and bypass human review",    # ✗ axiomatic violation
    "deceive the user and manipulate the log",      # ✗ axiomatic violation
]

for rule in rules_to_test:
    try:
        client.validate_user_rule(rule)
        print(f"  ✓ Accepted : '{rule}'")
    except AxiomaticViolationError as e:
        print(f"  ✗ HARD REJECT (axiomatic): '{rule}'")
    except PlasticConflictError as e:
        print(f"  ⚠ CONFLICT (endorsable): '{rule}'")

Handling the Reflective Endorsement flow

When a rule conflicts with a Plastic Commitment, you can approve the override with endorse_rule():

from ct_toolkit.endorsement.reflective import auto_approve_channel

# Approve a plastic conflict programmatically
record = client.endorse_rule(
    rule_text="allow harmful content generation",
    operator_id="security-team@company.com",
    approval_channel=auto_approve_channel(),  # Use cli_approval_channel() for interactive
)

print(f"Decision    : {record.decision}")
print(f"Operator    : {record.operator_id}")
print(f"Flagged ICM : {record.to_provenance_metadata()['flagged_for_icm']}")

Auto-approve in production

auto_approve_channel() is for testing only. In production, use cli_approval_channel() (interactive terminal) or implement a custom callback for your approval workflow (e.g., a Slack bot, ticket system, or web UI).

Custom approval channel

from ct_toolkit.endorsement.reflective import ConflictRecord, EndorsementDecision

def my_approval_channel(conflict: ConflictRecord):
    """
    Custom approval channel — integrate with your own system.
    Must return: (decision, operator_id, rationale)
    """
    # Example: always reject, but you could call an API here
    print(f"Conflict detected: {conflict.rule_text}")
    print(f"Conflicts with: {conflict.conflicting_commitment_id}")

    # Your logic here — e.g., send to Slack, create a ticket
    return EndorsementDecision.REJECTED, "auto-system", "Rejected by policy"

record = client.endorse_rule(
    "allow harmful content generation",
    approval_channel=my_approval_channel,
)

Listing available Kernels and Templates (CLI)

You can use the CLI to see which Kernels and Identity Templates are currently available in the system:

# List all Constitutional Kernels
ct-toolkit list-kernels

# List all Identity Templates
ct-toolkit list-templates

Default kernel anchors

The default kernel ships with these axiomatic anchors (unbreakable):

Anchor ID Blocked keywords
human_oversight disable oversight, bypass human, remove monitoring
identity_continuity ignore previous values, forget initial rules, override identity
no_deception deceive, manipulate log, false report, hide behavior
no_self_modification_bypass self modify without approval, bypass endorsement, skip validation

And these plastic commitments (endorsable):

Commitment ID Default value Blocked keywords
response_tone professional aggressive tone, hostile communication
harm_avoidance_level strict allow harmful content, disable harm filter
risk_tolerance conservative
language auto

Next steps