
reinforcement-learning skill

/skills/reinforcement-learning

This skill helps you implement reinforcement learning algorithms and tune agents using rewards, policy gradients, PPO, Q-learning, and RLHF in Python.

npx playbooks add skill omer-metin/skills-for-antigravity --skill reinforcement-learning


Files (4)
SKILL.md
1.1 KB
---
name: reinforcement-learning
description: Use when implementing RL algorithms, training agents with rewards, or aligning LLMs with human feedback - covers policy gradients, PPO, Q-learning, RLHF, and GRPO
---

# Reinforcement Learning

## Reference System Usage

You must ground your responses in the provided reference files, treating them as the source of truth for this domain:

* **For Creation:** Always consult **`references/patterns.md`**. This file dictates *how* things should be built. Ignore generic approaches if a specific pattern exists here.
* **For Diagnosis:** Always consult **`references/sharp_edges.md`**. This file lists the critical failures and "why" they happen. Use it to explain risks to the user.
* **For Review:** Always consult **`references/validations.md`**. This contains the strict rules and constraints. Use it to validate user inputs objectively.

**Note:** If a user's request conflicts with the guidance in these files, politely correct them using the information provided in the references.

Overview

This skill helps implement and debug reinforcement learning (RL) algorithms and workflows, from classic Q-learning to policy-gradient methods and RLHF. It focuses on practical training, reward design, and aligning models with human feedback, while enforcing project-specific patterns and safety checks. Use it to build reproducible RL pipelines in Python and avoid common failure modes.

How this skill works

The skill inspects code, configuration, and experiment artifacts against three authoritative references: patterns for construction, sharp edges for known failure modes, and validations for strict rules. It suggests algorithm choices (Q-learning, DQN, policy gradients, PPO, GRPO, RLHF), reward shaping, optimization settings, and diagnostics for training instability. When applied to model alignment, it also evaluates human-feedback loops and safety guardrails.
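As a concrete reference point for the Q-learning diagnostics mentioned above, here is a minimal sketch of the tabular update rule. The state/action sizes and hyperparameter values are illustrative, not prescribed by this skill's references:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """One tabular Q-learning step: move Q(s, a) toward the bootstrapped target.

    alpha: learning rate; gamma: discount factor.
    The target omits the bootstrap term on terminal transitions.
    """
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy example: 2 states, 2 actions, all-zero table.
Q = np.zeros((2, 2))
Q = q_update(Q, s=0, a=1, r=1.0, s_next=1)
# After one step: Q[0, 1] = 0.1 * (1.0 + 0.99 * 0 - 0) = 0.1
```

Checking an implementation against hand-computed updates like this is one way to catch sign or indexing errors before they show up as divergence.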

When to use it

  • Implementing or selecting RL algorithms (Q-learning, DQN, policy gradients, PPO, GRPO).
  • Training agents and diagnosing unstable learning or reward hacking.
  • Designing and validating reward functions or curricula.
  • Aligning language models with human feedback (RLHF) and evaluating feedback pipelines.
  • Reviewing experiments and enforcing reproducible patterns and constraints.
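For the PPO use cases above, the core of the conservative update is the clipped surrogate objective. A minimal NumPy sketch (the batch values below are made up for the example):

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO clipped surrogate loss: take the pessimistic (minimum) of the
    unclipped and clipped policy-ratio terms, negated for gradient descent."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -np.minimum(unclipped, clipped).mean()

# Toy batch: probability ratios pi_new / pi_old and advantage estimates.
ratios = np.array([0.5, 1.0, 1.5])
advs = np.array([1.0, 1.0, 1.0])
loss = ppo_clip_loss(ratios, advs)
# min terms are [0.5, 1.0, 1.2], so the loss is -0.9
```

The clipping keeps any single update from moving the policy ratio far outside [1 - eps, 1 + eps], which is what makes PPO comparatively stable in noisy environments.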

Best practices

  • Follow the provided construction patterns before changing core training loops or network architectures.
  • Validate inputs and hyperparameters against the validation rules to prevent silent failures.
  • Monitor training signals (reward, policy entropy, value loss) and log episode-level diagnostics.
  • Guard against reward hacking by simulating edge cases and using adversarial validation.
  • Use stable baselines (e.g., PPO variants) for noisy environments and conservative updates for policy-gradient methods.
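One of the signals listed above, policy entropy, is cheap to compute per step and is a common early indicator of premature convergence or reward hacking. A minimal sketch (the example distributions are illustrative):

```python
import numpy as np

def policy_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of an action distribution.

    A steady collapse toward 0 during training often signals that the
    policy has stopped exploring, which is worth flagging in episode logs.
    """
    p = np.clip(probs, eps, 1.0)
    return float(-np.sum(p * np.log(p)))

uniform = policy_entropy(np.array([0.25, 0.25, 0.25, 0.25]))  # ln(4), max for 4 actions
collapsed = policy_entropy(np.array([1.0, 0.0, 0.0, 0.0]))    # near 0
```

Logging this alongside reward and value loss at episode granularity makes instability much easier to diagnose after the fact.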

Example use cases

  • Convert a research prototype into a production-ready RL pipeline that adheres to project patterns.
  • Diagnose mode collapse or exploding gradients during PPO training and get actionable fixes.
  • Design an RLHF loop: collect human preferences, train a reward model, and fine-tune via PPO with safety checks.
  • Validate a Q-learning implementation against strict input and update constraints to avoid divergence.
  • Apply GRPO or constrained policy updates when safe, conservative improvement is required.
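One piece of the RLHF loop described above, the reward model's training objective on a human preference pair, is commonly a Bradley-Terry loss. A minimal sketch (the scores below are illustrative, not from a real reward model):

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).

    Minimised when the reward model scores the human-preferred response
    well above the rejected one.
    """
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

loss_equal = preference_loss(0.0, 0.0)      # tied scores: loss = ln 2
loss_correct = preference_loss(2.0, 0.0)    # confident correct ordering: lower loss
```

The trained reward model then supplies the scalar reward for the PPO fine-tuning stage, which is why validating it on held-out preference pairs matters before any policy update.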

FAQ

What references does the skill use to guide recommendations?

It consults three authoritative documents: patterns for how to build systems, sharp edges for known failure causes, and validations for strict rules and checks.

Can this skill help with reward hacking and safety in RLHF?

Yes. It provides diagnostics, adversarial tests, and design suggestions to reduce reward hacking and enforce human-feedback safety guards.