home / skills / doanchienthangdev / omgkit / guardrails-safety

This skill helps protect AI applications by applying input/output guardrails, PII masking, toxicity detection, and injection defenses to ensure safe, compliant

npx playbooks add skill doanchienthangdev/omgkit --skill guardrails-safety

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
3.9 KB
---
name: guardrails-safety
description: Protecting AI applications - input/output guards, toxicity detection, PII protection, injection defense, constitutional AI. Use when securing AI systems, preventing misuse, or ensuring compliance.
---

# Guardrails & Safety Skill

Protecting AI applications from misuse.

## Input Guardrails

```python
class InputGuard:
    def __init__(self):
        self.toxicity = load_toxicity_model()
        self.pii = PIIDetector()
        self.injection = InjectionDetector()

    def check(self, text):
        result = {"allowed": True, "issues": []}

        # Toxicity
        if self.toxicity.predict(text) > 0.7:
            result["allowed"] = False
            result["issues"].append("toxic")

        # PII
        pii = self.pii.detect(text)
        if pii:
            result["issues"].append(f"pii: {pii}")
            text = self.pii.redact(text)

        # Injection
        if self.injection.detect(text):
            result["allowed"] = False
            result["issues"].append("injection")

        result["sanitized"] = text
        return result
```

## Output Guardrails

```python
class OutputGuard:
    def check(self, output, context=None):
        result = {"allowed": True, "issues": []}

        # Factuality
        if context:
            if self.fact_checker.check(output, context) < 0.7:
                result["issues"].append("hallucination")

        # Toxicity
        if self.toxicity.predict(output) > 0.5:
            result["allowed"] = False
            result["issues"].append("toxic")

        # Citations
        invalid = self.citation_validator.check(output)
        if invalid:
            result["issues"].append(f"bad_citations: {len(invalid)}")

        return result
```

## Injection Detection

```python
class InjectionDetector:
    PATTERNS = [
        r"ignore (previous|all) instructions",
        r"forget (your|all) (instructions|rules)",
        r"you are now",
        r"new persona",
        r"act as",
        r"pretend to be",
        r"disregard",
    ]

    def detect(self, text):
        text_lower = text.lower()
        for pattern in self.PATTERNS:
            if re.search(pattern, text_lower):
                return True
        return False
```

## Constitutional AI

```python
class ConstitutionalFilter:
    def __init__(self, principles):
        self.principles = principles
        self.critic = load_model("critic")
        self.reviser = load_model("reviser")

    def filter(self, response):
        for principle in self.principles:
            critique = self.critic.generate(f"""
            Does this violate: "{principle}"?
            Response: {response}
            """)

            if "violates" in critique.lower():
                response = self.reviser.generate(f"""
                Rewrite to comply with: "{principle}"
                Original: {response}
                Critique: {critique}
                """)

        return response

PRINCIPLES = [
    "Do not provide harmful instructions",
    "Do not reveal personal information",
    "Acknowledge uncertainty",
    "Do not fabricate facts",
]
```

## PII Protection

```python
class PIIDetector:
    PATTERNS = {
        "email": r"\b[\w.-]+@[\w.-]+\.\w+\b",
        "phone": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",
        "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
        "credit_card": r"\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b",
    }

    def detect(self, text):
        found = {}
        for name, pattern in self.PATTERNS.items():
            matches = re.findall(pattern, text)
            if matches:
                found[name] = matches
        return found

    def redact(self, text):
        for name, pattern in self.PATTERNS.items():
            text = re.sub(pattern, f"[{name.upper()}]", text)
        return text
```

## Best Practices

1. Defense in depth (multiple layers)
2. Log all blocked content
3. Regular adversarial testing
4. Update patterns continuously
5. Fail closed (block if uncertain)

Overview

This skill protects AI applications with layered guardrails for inputs and outputs, focusing on toxicity, PII, prompt injection, and constitutional AI policies. It provides detection, sanitization, and revision mechanisms to reduce misuse and harmful behavior. The goal is to keep models compliant, safe, and auditable while preserving useful functionality.

How this skill works

Input guardrails run classifiers and pattern detectors to flag toxicity, find and redact PII, and detect injection attempts before the model sees user content. Output guardrails evaluate responses for toxicity, factuality against context, and citation validity, and can block or flag problematic outputs. A constitutional filter applies safety principles via critique-and-revise loops to rewrite responses that violate defined rules.

When to use it

  • Before sending user text to a model to prevent harmful prompts or leaked sensitive data
  • When returning model responses to users to ensure no toxic or fabricated content is delivered
  • During deployment of high-risk features such as code generation, medical or legal advice
  • When compliance and auditability are required (PII protection, logging of blocked events)
  • While running adversarial testing or red-team scenarios to validate defenses

Best practices

  • Apply defense in depth: combine toxicity models, regex detectors, and constitutional review
  • Log all blocked and sanitized events for auditing and continuous improvement
  • Fail closed: block or quarantine outputs when safety checks are uncertain
  • Continuously update and expand detection patterns and thresholds based on adversarial testing
  • Run regular adversarial and regression tests to validate changes to guardrails

Example use cases

  • Sanitizing chat inputs to remove emails, phone numbers, and credit card numbers before processing
  • Blocking or rewriting prompts that try to bypass instructions using injection patterns
  • Preventing a model from giving harmful instructions by applying constitutional principles and revising unsafe responses
  • Checking generated content against source documents to flag potential hallucinations or bad citations
  • Instrumenting customer-facing assistants with safety logging and automatic redaction for compliance

FAQ

How do you detect prompt injection?

Pattern-based detection scans normalized text for common injection phrases like ‘ignore previous instructions’, new persona claims, or directive overrides, and flags or blocks them before model processing.

What happens to detected PII?

PII is identified by regex patterns and either redacted from the text or logged and quarantined according to policy; redaction markers replace sensitive tokens to prevent leakage.

How does the constitutional filter work?

A critic model checks responses against a set of safety principles; if a violation is detected, a reviser model rewrites the response to comply while retaining intent and usefulness.