This skill helps you implement robust LLM guardrails using NeMo, Guardrails AI, and OpenAI, with validation, toxicity checks, and red-teaming.
To add this skill to your agents, run `npx playbooks add skill yonatangross/orchestkit --skill advanced-guardrails`, or review the files below.
---
name: advanced-guardrails
description: LLM guardrails with NeMo, Guardrails AI, and OpenAI. Input/output rails, hallucination prevention, fact-checking, toxicity detection, red-teaming patterns. Use when building LLM guardrails, safety checks, or red-team workflows.
version: 1.0.0
tags: [guardrails, nemo, safety, hallucination, factuality, red-teaming, colang, 2026]
context: fork
agent: ai-safety-auditor
author: OrchestKit
user-invocable: false
---
# Advanced Guardrails
Production LLM safety using NeMo Guardrails, Guardrails AI, and OpenAI moderation with red-teaming validation.
> **NeMo Guardrails 2026**: LangChain 1.x compatible, parallel rails execution, OpenTelemetry tracing. **DeepTeam**: 40+ vulnerabilities, OWASP Top 10 alignment.
## Overview
Use this skill when:
- Implementing input/output validation for LLM applications
- Preventing hallucinations and enforcing factuality
- Detecting and filtering toxic, harmful, or off-topic content
- Restricting LLM responses to specific domains/topics
- PII detection and redaction in LLM outputs
- Red-teaming and adversarial testing of LLM systems
- OWASP Top 10 for LLMs compliance
## Framework Comparison
| Framework | Best For | Key Features |
|-----------|----------|--------------|
| **NeMo Guardrails** | Programmable flows, Colang 2.0 | Input/output rails, fact-checking, dialog control |
| **Guardrails AI** | Validator-based, modular | 100+ validators, PII, toxicity, structured output |
| **OpenAI Guardrails** | Drop-in wrapper | Simple integration, moderation API |
| **DeepTeam** | Red teaming, adversarial | GOAT attacks, multi-turn jailbreaking, vulnerability scanning |
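The OpenAI moderation row has no snippet elsewhere in this skill, so here is a minimal sketch of using the Moderation API as a cheap pre-filter before heavier rails; the `precheck` helper name and the routing logic are illustrative assumptions.
```python
# Minimal sketch: OpenAI Moderation API as a cheap pre-filter before heavier rails.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def precheck(text: str) -> bool:
    """Return True when the Moderation API flags the text."""
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    return result.results[0].flagged

user_input = "How do I reset my router password?"
if precheck(user_input):
    print("Input blocked by moderation pre-check.")
else:
    print("Input passed; hand off to NeMo / Guardrails AI rails.")
```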
## Quick Reference
### NeMo Guardrails with Guardrails AI Integration
```yaml
# config.yml
models:
  - type: main
    engine: openai
    model: gpt-4o

rails:
  config:
    guardrails_ai:
      validators:
        - name: toxic_language
          parameters:
            threshold: 0.5
            validation_method: "sentence"
        - name: guardrails_pii
          parameters:
            entities: ["phone_number", "email", "ssn", "credit_card"]
        - name: restricttotopic
          parameters:
            valid_topics: ["technology", "support"]
        - name: valid_length
          parameters:
            min: 10
            max: 500
  input:
    flows:
      - guardrailsai check input $validator="guardrails_pii"
      - guardrailsai check input $validator="competitor_check"
  output:
    flows:
      - guardrailsai check output $validator="toxic_language"
      - guardrailsai check output $validator="restricttotopic"
      - guardrailsai check output $validator="valid_length"
```
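Assuming the YAML above lives in a `./config` directory, a minimal sketch of loading it with the NeMo Guardrails Python API looks like this (the directory path and the example message are placeholders):
```python
# Minimal sketch: load the config directory above and run a guarded generation.
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")  # directory containing config.yml (assumed path)
rails = LLMRails(config)

response = rails.generate(messages=[
    {"role": "user", "content": "My email is jane@example.com, can you help?"}
])
print(response["content"])  # input rails should have caught the PII before the model ran
```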
### Colang 2.0 Fact-Checking Rails
```colang
define flow answer question with facts
  """Enable fact-checking for RAG responses."""
  user ...
  $answer = execute rag()
  $check_facts = True  # Enables fact-checking rail
  bot $answer

define flow check hallucination
  """Block responses about people without verification."""
  user ask about people
  $check_hallucination = True  # Blocking mode
  bot respond about people

define flow restrict competitor mentions
  """Prevent discussing competitor products."""
  user ask about $competitor
  if $competitor in ["CompetitorA", "CompetitorB"]
    bot "I can only discuss our products."
  else
    bot respond normally
```
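The first flow calls `execute rag()`, which assumes a custom action is registered with the rails runtime; a hedged sketch of wiring that up (the retrieval logic itself is a placeholder):
```python
# Minimal sketch: register the custom `rag` action referenced by `execute rag()` above.
from nemoguardrails import LLMRails, RailsConfig

async def rag(context: dict) -> str:
    """Placeholder retrieval-augmented answer; swap in your real retriever."""
    question = context.get("last_user_message", "")
    return f"Grounded answer for: {question}"

config = RailsConfig.from_path("./config")
rails = LLMRails(config)
rails.register_action(rag, name="rag")  # makes `execute rag()` resolvable from Colang
```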
### Guardrails AI Validators
```python
import openai

from guardrails import Guard
from guardrails.hub import (
    ToxicLanguage,
    DetectPII,
    RestrictToTopic,
    ValidLength,
)

# Create a guard with layered validators
guard = Guard().use_many(
    ToxicLanguage(threshold=0.5, on_fail="filter"),
    DetectPII(
        pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "SSN"],
        on_fail="fix",  # Redacts PII
    ),
    RestrictToTopic(
        valid_topics=["technology", "customer support"],
        invalid_topics=["politics", "religion"],
        on_fail="refrain",
    ),
    ValidLength(min=10, max=500, on_fail="reask"),
)

def guarded_completion(user_input: str) -> str:
    """Call the LLM through the guard and return a validated response."""
    result = guard(
        llm_api=openai.chat.completions.create,
        model="gpt-4o",
        messages=[{"role": "user", "content": user_input}],
    )
    if result.validation_passed:
        return result.validated_output
    return "I cannot respond to that request."
```
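The guard above validates model output; the same library can wrap the input side as well. A sketch, assuming a separate `input_guard` applied before the LLM call (validator and entity names follow the example above):
```python
# Minimal sketch: a separate input-side guard run before the LLM call.
from guardrails import Guard
from guardrails.hub import DetectPII, ToxicLanguage

input_guard = Guard().use_many(
    DetectPII(pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER"], on_fail="fix"),  # redact PII
    ToxicLanguage(threshold=0.5, on_fail="exception"),  # hard-fail on toxic input
)

try:
    checked = input_guard.validate("Call me at 555-0100 about my account")
    sanitized_input = checked.validated_output  # PII redacted before it reaches the model
except Exception:
    sanitized_input = None  # reject the request entirely
```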
### DeepTeam Red Teaming
```python
from typing import Callable

from deepteam import red_team
from deepteam.vulnerabilities import (
    Bias, Toxicity, PIILeakage,
    PromptInjection, Jailbreaking,
    Misinformation, CompetitorEndorsement,
)

async def run_red_team_audit(
    target_model: Callable,
    attacks_per_vulnerability: int = 10,
) -> dict:
    """Run a comprehensive red-team audit against the target LLM."""
    results = await red_team(
        model=target_model,
        vulnerabilities=[
            Bias(categories=["gender", "race", "religion", "age"]),
            Toxicity(threshold=0.7),
            PIILeakage(types=["email", "phone", "ssn", "credit_card"]),
            PromptInjection(techniques=["direct", "indirect", "context"]),
            Jailbreaking(
                multi_turn=True,  # GOAT-style multi-turn attacks
                techniques=["dan", "roleplay", "context_manipulation"],
            ),
            Misinformation(domains=["health", "finance", "legal"]),
            CompetitorEndorsement(competitors=["competitor_list"]),
        ],
        attacks_per_vulnerability=attacks_per_vulnerability,
    )
    return {
        "total_attacks": results.total_attacks,
        "successful_attacks": results.successful_attacks,
        "attack_success_rate": results.successful_attacks / results.total_attacks,
        "vulnerabilities": [
            {
                "type": v.type,
                "severity": v.severity,
                "successful_prompts": v.successful_prompts[:3],
                "mitigation": v.suggested_mitigation,
            }
            for v in results.vulnerabilities
        ],
    }
```
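The audit needs a `target_model` callback that takes a prompt and returns the system's response. A hedged sketch of wiring the function above to an OpenAI-backed target (the model choice and system prompt are placeholders):
```python
# Minimal sketch: an async target-model callback fed to run_red_team_audit above.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def target_model(prompt: str) -> str:
    """The system under test: use the same prompt path as your production app."""
    completion = await client.chat.completions.create(
        model="gpt-4o",  # placeholder; point this at your deployed model
        messages=[
            {"role": "system", "content": "You are a support assistant."},  # assumed prompt
            {"role": "user", "content": prompt},
        ],
    )
    return completion.choices[0].message.content

report = asyncio.run(run_red_team_audit(target_model, attacks_per_vulnerability=5))
print(f"Attack success rate: {report['attack_success_rate']:.1%}")
```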
## OWASP Top 10 for LLMs Mapping
| OWASP LLM Risk | Guardrail Solution |
|----------------|-------------------|
| **LLM01: Prompt Injection** | NeMo input rails, Guardrails AI validators |
| **LLM02: Insecure Output** | Output rails, structured validation, sanitization |
| **LLM03: Training Data Poisoning** | N/A (training-time concern) |
| **LLM04: Model Denial of Service** | Rate limiting, token budgets, timeout rails |
| **LLM05: Supply Chain Vulnerabilities** | Dependency scanning, model provenance |
| **LLM06: Sensitive Info Disclosure** | PII detection, context separation, output filtering |
| **LLM07: Insecure Plugin Design** | Tool validation, permission boundaries |
| **LLM08: Excessive Agency** | Human-in-loop rails, action confirmation |
| **LLM09: Overreliance** | Factuality checking, confidence thresholds |
| **LLM10: Model Theft** | N/A (infrastructure concern) |
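For the LLM04 row, "token budgets" and "timeout rails" can be as simple as wrapping the model call; a minimal sketch, assuming an async `call_llm` helper and illustrative limits:
```python
# Minimal sketch: token budget + timeout around an LLM call (illustrative limits).
import asyncio

MAX_INPUT_CHARS = 8_000     # crude proxy for a token budget
REQUEST_TIMEOUT_S = 30      # hard ceiling per request

async def guarded_call(call_llm, user_input: str) -> str:
    if len(user_input) > MAX_INPUT_CHARS:
        return "Request too large."
    try:
        return await asyncio.wait_for(call_llm(user_input), timeout=REQUEST_TIMEOUT_S)
    except asyncio.TimeoutError:
        return "Request timed out."
```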
## Anti-Patterns (FORBIDDEN)
```python
# NEVER trust LLM output without validation
response = llm.generate(prompt)
return response  # Raw, unvalidated output!

# NEVER skip input sanitization
user_input = request.json["message"]
llm.generate(user_input)  # Prompt injection risk!

# NEVER use a single validation layer
if not is_toxic(output):  # Only one check
    return output

# ALWAYS use layered validation
guard = Guard().use_many(
    ToxicLanguage(threshold=0.5),
    DetectPII(on_fail="fix"),
    ValidLength(max=500),
)

# ALWAYS validate both input and output
input_result = input_guard.validate(user_input)
if not input_result.validation_passed:
    return "Invalid input"
llm_output = llm.generate(input_result.validated_output)
output_result = output_guard.validate(llm_output)
return output_result.validated_output
```
## Key Decisions
| Decision | Recommendation |
|----------|----------------|
| Framework choice | NeMo for flows, Guardrails AI for validators |
| Toxicity threshold | 0.5 for content apps, 0.3 for children's apps |
| PII handling | Redact for logs, block for outputs |
| Topic restriction | Allowlist preferred over blocklist |
| Fact-checking | Required for factual domains (health, finance, legal) |
| Red-teaming frequency | Pre-release + quarterly |
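The PII row translates to different `on_fail` policies for the same validator: a hedged sketch of one guard that redacts for logging and another that blocks the user-facing response (helper names are illustrative):
```python
# Minimal sketch: one DetectPII policy for logs (redact) and one for outputs (block).
from guardrails import Guard
from guardrails.hub import DetectPII

log_guard = Guard().use(DetectPII(pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER"], on_fail="fix"))
output_guard = Guard().use(DetectPII(pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER"], on_fail="exception"))

def log_safely(text: str) -> None:
    print(log_guard.validate(text).validated_output)  # PII redacted before it hits the logs

def release_output(text: str) -> str:
    try:
        return output_guard.validate(text).validated_output
    except Exception:
        return "Response withheld: it contained personal data."
```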
## Detailed Documentation
| Resource | Description |
|----------|-------------|
| [references/nemo-guardrails.md](references/nemo-guardrails.md) | NeMo Guardrails with Colang 2.0 |
| [references/guardrails-ai.md](references/guardrails-ai.md) | Guardrails AI validators and patterns |
| [references/openai-guardrails.md](references/openai-guardrails.md) | OpenAI Moderation API integration |
| [references/factuality-checking.md](references/factuality-checking.md) | Hallucination detection and grounding |
| [references/red-teaming.md](references/red-teaming.md) | DeepTeam and adversarial testing |
| [scripts/nemo-config.yaml](scripts/nemo-config.yaml) | Production NeMo configuration |
| [scripts/rails-pipeline.py](scripts/rails-pipeline.py) | Complete guardrails pipeline |
## Related Skills
- `llm-safety-patterns` - Context separation and attribution
- `llm-evaluation` - Quality assessment and hallucination detection
- `input-validation` - Request sanitization patterns
- `owasp-top-10` - Web security fundamentals
## Capability Details
### nemo-guardrails
**Keywords:** NeMo, guardrails, rails, Colang, dialog flow, input rails, output rails
**Solves:**
- Configure NeMo Guardrails for LLM safety
- Implement Colang 2.0 dialog flows
- Create input/output validation rails
### guardrails-ai-validators
**Keywords:** Guardrails AI, validator, PII, toxicity, topic restriction, structured output
**Solves:**
- Use Guardrails AI validators for output validation
- Detect and redact PII from LLM responses
- Restrict LLM to specific topics
### factuality-checking
**Keywords:** fact-check, hallucination, grounding, RAG verification, NLI
**Solves:**
- Verify LLM claims against source documents
- Detect hallucinations in generated content
- Implement grounding checks for RAG
### red-teaming
**Keywords:** red team, adversarial, jailbreak, GOAT, prompt injection, DeepTeam
**Solves:**
- Run adversarial testing on LLM systems
- Detect jailbreaking vulnerabilities
- Test prompt injection resistance
### owasp-llm-compliance
**Keywords:** OWASP LLM, LLM security, LLM vulnerabilities, LLM Top 10
**Solves:**
- Implement OWASP Top 10 for LLMs mitigations
- Audit LLM systems for security compliance
- Design secure LLM architectures
This skill provides production-ready LLM guardrails using NeMo Guardrails, Guardrails AI validators, and OpenAI moderation. It bundles input/output rails, hallucination prevention, PII detection/redaction, toxicity filtering, and red-teaming patterns for adversarial validation. Use it to harden LLM workflows and meet OWASP-style LLM risk controls in production deployments.
The skill configures input and output rails that run validators in parallel or sequential flows: PII checks, topic restriction, toxicity scoring, length enforcement, and fact-checking. It integrates Guardrails AI validators for modular checks, NeMo Colang flows for programmable dialog control, and DeepTeam-style red-team audits to surface vulnerabilities. Validation results drive actions such as filter, redact, reask, or block, enforcing safety policies automatically.
## FAQ
**Can I combine NeMo flows with Guardrails AI validators?**
Yes. Use NeMo Colang to orchestrate flows and call Guardrails AI validators for modular checks like PII, toxicity, and topic restrictions.

**How often should I run red-team audits?**
Run a full red-team audit before release and at regular intervals (quarterly is recommended), or after major model or prompt changes.