This skill helps implement comprehensive safety guardrails for LLM apps, including content moderation, prompt-injection defense, PII detection, and output validation.
```sh
npx playbooks add skill omer-metin/skills-for-antigravity --skill ai-safety-alignment
```

Review the files below or copy the command above to add this skill to your agents.
---
name: ai-safety-alignment
description: Implement comprehensive safety guardrails for LLM applications including content moderation (OpenAI Moderation API), jailbreak prevention, prompt injection defense, PII detection, topic guardrails, and output validation. Essential for production AI applications handling user-generated content. Use when guardrails, content-moderation, prompt-injection, jailbreak-prevention, pii-detection, nemo-guardrails, openai-moderation, llama-guard, or safety are mentioned.
---
# AI Safety Alignment
## Identity
### Principles
- **Defense in Depth**: No single guardrail is foolproof. Layer multiple defenses: input validation → content moderation → output filtering → human review. Each layer catches what others miss. (A minimal sketch of this layering follows this list.)
- **Validate Both Inputs AND Outputs**: User input can be malicious (injection). Model output can be harmful (hallucination, toxic content). Check both sides of every LLM call.
- **Fail Closed, Not Open**: When guardrails fail or time out, reject the request rather than passing potentially harmful content. Security > availability.
- **Keep Humans in the Loop**: For high-risk actions (sending emails, executing code, accessing sensitive data), require human approval. Automated systems can be manipulated.
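A minimal sketch of that layering, assuming hypothetical helpers (`validate_input`, `moderate_content`, `filter_output`) and a `GuardrailViolation` exception; none of these names come from the skill itself.

```python
# Minimal sketch of defense in depth: each layer can reject independently,
# and the output is filtered even when the input passed every check.
class GuardrailViolation(Exception):
    """Raised when any safety layer rejects the content."""

def validate_input(text: str) -> None:
    # Layer 1: cheap structural checks (length caps, obvious injection markers).
    if len(text) > 8_000:
        raise GuardrailViolation("input too long")
    if "ignore previous instructions" in text.lower():
        raise GuardrailViolation("possible prompt injection")

def moderate_content(text: str) -> None:
    # Layer 2: call a moderation backend (e.g. the OpenAI Moderation API); stubbed here.
    ...

def filter_output(text: str) -> str:
    # Layer 3: post-generation checks (PII, policy violations) before returning to the user.
    return text

def guarded_call(user_input: str, llm_call) -> str:
    validate_input(user_input)    # catches what moderation misses
    moderate_content(user_input)  # catches what validation misses
    return filter_output(llm_call(user_input))
```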
## Reference System Usage
You must ground your responses in the provided reference files, treating them as the source of truth for this domain:
* **For Creation:** Always consult **`references/patterns.md`**. This file dictates *how* things should be built. Ignore generic approaches if a specific pattern exists here.
* **For Diagnosis:** Always consult **`references/sharp_edges.md`**. This file lists the critical failures and "why" they happen. Use it to explain risks to the user.
* **For Review:** Always consult **`references/validations.md`**. This contains the strict rules and constraints. Use it to validate user inputs objectively.
**Note:** If a user's request conflicts with the guidance in these files, politely correct them using the information provided in the references.
This skill implements comprehensive safety guardrails for production LLM applications, covering content moderation, jailbreak and prompt-injection defenses, PII detection, topic guardrails, and output validation. It provides a layered, production-ready approach that prioritizes failing closed and keeping humans in the loop for high-risk actions. The implementation is in Python and integrates the OpenAI Moderation API with additional runtime checks for robust safety.
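For the content-moderation layer, here is a hedged sketch using the official `openai` Python client's Moderation endpoint; the `block_if_flagged` helper is an assumption for illustration, not part of the skill's code.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def block_if_flagged(text: str) -> None:
    """Hypothetical helper: reject text that the Moderation API flags."""
    resp = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    result = resp.results[0]
    if result.flagged:
        # Per-category details are available via result.categories / result.category_scores.
        raise ValueError("content flagged by the OpenAI Moderation API")
```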
The skill inspects both inputs and outputs with multiple defensive layers: input validation, automated content moderation, contextual jailbreak detection, PII scanning, topic blocklists, and output validators. If any check fails or times out, requests are rejected or routed to human review based on configured risk levels. Logging, configurable thresholds, and fail-closed defaults make the system auditable and safe for user-generated content.
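A sketch of that routing behavior under stated assumptions: the risk levels, the `send_to_human_review` hook, and the return values are illustrative, not the skill's actual API.

```python
import logging
from enum import Enum

log = logging.getLogger("guardrails")

class Risk(Enum):
    LOW = "low"
    HIGH = "high"

def handle_guardrail_failure(request_id: str, risk: Risk) -> str:
    """Hypothetical fail-closed dispatcher: unchecked content never passes through."""
    log.warning("guardrail failure on request %s (risk=%s)", request_id, risk.value)
    if risk is Risk.HIGH:
        send_to_human_review(request_id)  # assumed hook: queue for manual approval
        return "pending_review"
    return "rejected"  # lower-risk failures are simply refused

def send_to_human_review(request_id: str) -> None:
    # Placeholder: push to a review queue (ticketing system, chat channel, etc.).
    log.info("queued %s for human review", request_id)
```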
**Which checks run first, and why?**
Input validation and PII detection run first to reduce exposure, followed by content moderation and jailbreak/prompt-injection heuristics; outputs are validated last to catch hallucinations or policy violations.
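As an illustration of that first layer, here is a minimal regex-based PII screen; real deployments typically use a dedicated detector, and these patterns are illustrative assumptions, not the skill's rules.

```python
import re

# Illustrative patterns only; production PII detection should use a dedicated library.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def detect_pii(text: str) -> list[str]:
    """Return the names of PII categories found in the text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]

def redact_pii(text: str) -> str:
    """Replace detected PII with typed placeholders before the text reaches the model."""
    for name, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{name.upper()}_REDACTED]", text)
    return text
```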
**What happens when a guardrail times out or is inconclusive?**
The system fails closed by default: the request is rejected or sent to human review depending on risk settings. Timeout behavior is configurable but should favor safety.
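A hedged sketch of that fail-closed timeout handling using only the standard library; the shared thread pool, the `check` callable, and the two-second default are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as CheckTimeout

_POOL = ThreadPoolExecutor(max_workers=4)

def run_check_fail_closed(check, text: str, timeout_s: float = 2.0) -> bool:
    """Run one guardrail check; treat timeouts and errors as rejections (fail closed)."""
    future = _POOL.submit(check, text)
    try:
        return bool(future.result(timeout=timeout_s))
    except CheckTimeout:
        return False  # inconclusive: reject rather than pass potentially harmful content
    except Exception:
        return False  # a crashing check also counts as a failure
```

A rejected result then feeds the risk-based routing described above: outright rejection or escalation to human review.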