This skill helps you implement reinforcement learning algorithms and tune agents using rewards, policy gradients, PPO, Q-learning, and RLHF in Python.
Run `npx playbooks add skill omer-metin/skills-for-antigravity --skill reinforcement-learning` to add this skill to your agents.
---
name: reinforcement-learning
description: Use when implementing RL algorithms, training agents with rewards, or aligning LLMs with human feedback - covers policy gradients, PPO, Q-learning, RLHF, and GRPO
---
# Reinforcement Learning
## Reference System Usage
You must ground your responses in the provided reference files, treating them as the source of truth for this domain:
* **For Creation:** Always consult **`references/patterns.md`**. This file dictates *how* things should be built. Ignore generic approaches if a specific pattern exists here.
* **For Diagnosis:** Always consult **`references/sharp_edges.md`**. This file lists the critical failures and "why" they happen. Use it to explain risks to the user.
* **For Review:** Always consult **`references/validations.md`**. This contains the strict rules and constraints. Use it to validate user inputs objectively.
**Note:** If a user's request conflicts with the guidance in these files, politely correct them using the information provided in the references.
This skill helps implement and debug reinforcement learning (RL) algorithms and workflows, from classic Q-learning to policy-gradient methods and RLHF. It focuses on practical training, reward design, and aligning models with human feedback, while enforcing project-specific patterns and safety checks. Use it to build reproducible RL pipelines in Python and avoid common failure modes.
The skill inspects code, configuration, and experiment artifacts against three authoritative references: patterns for construction, sharp edges for known failure modes, and validations for strict rules. It suggests algorithm choices (Q-learning, DQN, policy gradients, PPO, GRPO, RLHF), reward shaping, optimization settings, and diagnostics for training instability. When applied to model alignment it also evaluates human-feedback loops and safety guardrails.
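As a concrete example of the simplest algorithm in that list, here is a minimal tabular Q-learning sketch on a toy deterministic chain environment. The environment, hyperparameters, and episode count are illustrative assumptions for this sketch, not values prescribed by the skill's references:

```python
import numpy as np

# Toy deterministic chain: states 0..4, actions 0 (left) / 1 (right).
# Reward 1.0 only on reaching the terminal state 4.
N_STATES, N_ACTIONS = 5, 2

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done

rng = np.random.default_rng(0)
q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma, eps = 0.1, 0.9, 0.2  # learning rate, discount, exploration rate

for episode in range(500):
    s, done = 0, False
    while not done:
        # Epsilon-greedy action selection.
        a = int(rng.integers(N_ACTIONS)) if rng.random() < eps else int(np.argmax(q[s]))
        s2, r, done = step(s, a)
        # Q-learning update: move Q(s, a) toward the bootstrapped TD target.
        target = r + gamma * (0.0 if done else np.max(q[s2]))
        q[s, a] += alpha * (target - q[s, a])
        s = s2

# The greedy policy for non-terminal states should be "go right" everywhere.
print(np.argmax(q[:4], axis=1))
```

The same epsilon-greedy loop and TD-target structure carry over to DQN; policy-gradient methods such as PPO and GRPO instead optimize the policy directly and need the separate stability diagnostics this skill covers.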
**What references does the skill use to guide recommendations?**
It consults three authoritative documents: patterns for how to build systems, sharp edges for known failure causes, and validations for strict rules and checks.
**Can this skill help with reward hacking and safety in RLHF?**
Yes. It provides diagnostics, adversarial tests, and design suggestions to reduce reward hacking and enforce human-feedback safety guards.
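One diagnostic of this kind is a sketch, under the assumption that you track both the learned proxy reward and a trusted held-out "gold" score per checkpoint: flag runs where the proxy keeps climbing while the gold signal degrades, a classic symptom of reward hacking. The function name, window, and data below are hypothetical:

```python
def reward_hacking_alert(proxy_rewards, gold_rewards, window=3, tol=0.0):
    """Flag likely reward hacking: proxy reward still improving while the
    trusted gold metric degrades over the last `window` checkpoints."""
    if len(proxy_rewards) < window + 1 or len(gold_rewards) < window + 1:
        return False  # not enough history to judge a trend
    proxy_trend = proxy_rewards[-1] - proxy_rewards[-1 - window]
    gold_trend = gold_rewards[-1] - gold_rewards[-1 - window]
    return proxy_trend > tol and gold_trend < -tol

# Example: proxy climbs steadily while gold collapses after checkpoint 3.
proxy = [0.2, 0.4, 0.6, 0.8, 0.9, 1.1]
gold  = [0.2, 0.35, 0.5, 0.45, 0.3, 0.2]
print(reward_hacking_alert(proxy, gold))  # True
```

In practice the gold signal might be a small human-labeled evaluation set or an adversarial test suite; the point is to compare the optimized proxy against an independent measure the policy cannot directly exploit.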