This skill helps implement comprehensive safety guardrails for LLM apps, including content moderation, prompt-injection defense, PII detection, and output validation.
```sh
npx playbooks add skill omer-metin/skills-for-antigravity --skill ai-safety-alignment
```

Review the files below or copy the command above to add this skill to your agents.
---
name: ai-safety-alignment
description: Implement comprehensive safety guardrails for LLM applications including content moderation (OpenAI Moderation API), jailbreak prevention, prompt injection defense, PII detection, topic guardrails, and output validation. Essential for production AI applications handling user-generated content. Use when guardrails, content-moderation, prompt-injection, jailbreak-prevention, pii-detection, nemo-guardrails, openai-moderation, llama-guard, or safety are mentioned.
---
# AI Safety Alignment
## Identity
### Principles
- **Defense in Depth**: No single guardrail is foolproof. Layer multiple defenses: input validation → content moderation → output filtering → human review. Each layer catches what others miss. (A minimal sketch of this layering follows this list.)
- **Validate Both Inputs AND Outputs**: User input can be malicious (injection). Model output can be harmful (hallucination, toxic content). Check both sides of every LLM call.
- **Fail Closed, Not Open**: When guardrails fail or time out, reject the request rather than passing potentially harmful content. Security > availability.
- **Keep Humans in the Loop**: For high-risk actions (sending emails, executing code, accessing sensitive data), require human approval. Automated systems can be manipulated.
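A minimal sketch of that layering, assuming hypothetical helpers (`validate_input`, `moderate_content`, `filter_output`) and a `GuardrailViolation` exception; none of these names come from the skill itself.

```python
# Minimal sketch of defense in depth: each layer can reject independently,
# and the output is filtered even when the input passed every check.
class GuardrailViolation(Exception):
    """Raised when any safety layer rejects the content."""

def validate_input(text: str) -> None:
    # Layer 1: cheap structural checks (length caps, obvious injection markers).
    if len(text) > 8_000:
        raise GuardrailViolation("input too long")
    if "ignore previous instructions" in text.lower():
        raise GuardrailViolation("possible prompt injection")

def moderate_content(text: str) -> None:
    # Layer 2: call a moderation backend (e.g. the OpenAI Moderation API); stubbed here.
    ...

def filter_output(text: str) -> str:
    # Layer 3: post-generation checks (PII, policy violations) before returning to the user.
    return text

def guarded_call(user_input: str, llm_call) -> str:
    validate_input(user_input)    # catches what moderation misses
    moderate_content(user_input)  # catches what validation misses
    return filter_output(llm_call(user_input))
```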
## Reference System Usage
You must ground your responses in the provided reference files, treating them as the source of truth for this domain:
* **For Creation:** Always consult **`references/patterns.md`**. This file dictates *how* things should be built. Ignore generic approaches if a specific pattern exists here.
* **For Diagnosis:** Always consult **`references/sharp_edges.md`**. This file lists the critical failures and "why" they happen. Use it to explain risks to the user.
* **For Review:** Always consult **`references/validations.md`**. This contains the strict rules and constraints. Use it to validate user inputs objectively.
**Note:** If a user's request conflicts with the guidance in these files, politely correct them using the information provided in the references.
This skill implements comprehensive safety guardrails for production LLM applications, covering content moderation, jailbreak and prompt-injection defenses, PII detection, topic guardrails, and output validation. It provides a layered, production-ready approach that prioritizes failing closed and keeping humans in the loop for high-risk actions. The implementation is in Python and integrates the OpenAI Moderation API with additional runtime checks for robust safety.
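For the content-moderation layer, here is a hedged sketch using the official `openai` Python client's Moderation endpoint; the `block_if_flagged` helper is an assumption for illustration, not part of the skill's code.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def block_if_flagged(text: str) -> None:
    """Hypothetical helper: reject text that the Moderation API flags."""
    resp = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    result = resp.results[0]
    if result.flagged:
        # Per-category details are available via result.categories / result.category_scores.
        raise ValueError("content flagged by the OpenAI Moderation API")
```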
The skill inspects both inputs and outputs with multiple defensive layers: input validation, automated content moderation, contextual jailbreak detection, PII scanning, topic blocklists, and output validators. If any check fails or times out, requests are rejected or routed to human review based on configured risk levels. Logging, configurable thresholds, and fail-closed defaults make the system auditable and safe for user-generated content.
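A sketch of that routing behavior under stated assumptions: the risk levels, the `send_to_human_review` hook, and the return values are illustrative, not the skill's actual API.

```python
import logging
from enum import Enum

log = logging.getLogger("guardrails")

class Risk(Enum):
    LOW = "low"
    HIGH = "high"

def handle_guardrail_failure(request_id: str, risk: Risk) -> str:
    """Hypothetical fail-closed dispatcher: unchecked content never passes through."""
    log.warning("guardrail failure on request %s (risk=%s)", request_id, risk.value)
    if risk is Risk.HIGH:
        send_to_human_review(request_id)  # assumed hook: queue for manual approval
        return "pending_review"
    return "rejected"  # lower-risk failures are simply refused

def send_to_human_review(request_id: str) -> None:
    # Placeholder: push to a review queue (ticketing system, chat channel, etc.).
    log.info("queued %s for human review", request_id)
```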
**Which checks run first, and why?**
Input validation and PII detection run first to reduce exposure, followed by content moderation and jailbreak/prompt-injection heuristics; outputs are validated last to catch hallucinations or policy violations.
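As an illustration of that first layer, here is a minimal regex-based PII screen; real deployments typically use a dedicated detector, and these patterns are illustrative assumptions, not the skill's rules.

```python
import re

# Illustrative patterns only; production PII detection should use a dedicated library.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def detect_pii(text: str) -> list[str]:
    """Return the names of PII categories found in the text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]

def redact_pii(text: str) -> str:
    """Replace detected PII with typed placeholders before the text reaches the model."""
    for name, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{name.upper()}_REDACTED]", text)
    return text
```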
**What happens when a guardrail times out or is inconclusive?**
The system fails closed by default: the request is rejected or sent to human review depending on risk settings. Timeout behavior is configurable but should favor safety.
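A hedged sketch of that fail-closed timeout handling using only the standard library; the shared thread pool, the `check` callable, and the two-second default are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as CheckTimeout

_POOL = ThreadPoolExecutor(max_workers=4)

def run_check_fail_closed(check, text: str, timeout_s: float = 2.0) -> bool:
    """Run one guardrail check; treat timeouts and errors as rejections (fail closed)."""
    future = _POOL.submit(check, text)
    try:
        return bool(future.result(timeout=timeout_s))
    except CheckTimeout:
        return False  # inconclusive: reject rather than pass potentially harmful content
    except Exception:
        return False  # a crashing check also counts as a failure
```

A rejected result then feeds the risk-based routing described above: outright rejection or escalation to human review.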