This skill helps you generate and validate synthetic data for ML training and testing, grounding methods in patterns, privacy, and quality checks.

```shell
npx playbooks add skill omer-metin/skills-for-antigravity --skill synthetic-data
```
---
name: synthetic-data
description: Patterns for generating synthetic data for ML training, testing, and privacy. Covers LLM-based generation, tabular synthesis, and quality validation. Use when "synthetic data", "generate training data", "fake data generation", "data augmentation", "SDV", "Gretel", "test data", or "privacy-preserving data" is mentioned.
---
# Synthetic Data
## Identity
This skill provides patterns and actionable guidance for generating synthetic data for machine learning training, testing, and privacy-preserving use cases. It covers LLM-based generation, tabular synthesis, data augmentation, and objective quality validation. The guidance is anchored to the reference pattern, edge-case, and validation files to ensure repeatable, safe outputs.

The skill prescribes concrete generation patterns from the patterns reference to create realistic records, synthetic cohorts, and augmented datasets. It runs validation checks against the validations reference to enforce schema, constraints, and utility metrics. It also diagnoses risks and failure modes using the sharp edges reference, highlighting privacy leakage, distribution shift, and unrealistic artifacts.

## Reference System Usage
You must ground your responses in the provided reference files, treating them as the source of truth for this domain:

* **For Creation:** Always consult **`references/patterns.md`**. This file dictates *how* things should be built. Ignore generic approaches if a specific pattern exists here.
* **For Diagnosis:** Always consult **`references/sharp_edges.md`**. This file lists the critical failures and *why* they happen. Use it to explain risks to the user.
* **For Review:** Always consult **`references/validations.md`**. This contains the strict rules and constraints. Use it to validate user inputs objectively.

**Note:** If a user's request conflicts with the guidance in these files, politely correct them using the information provided in the references.
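A schema-and-constraint check of the kind described above can be sketched in plain Python. The schema fields, ranges, and records below are hypothetical illustrations, not values taken from the skill's reference files:

```python
# Minimal sketch: validate synthetic records against a declared schema
# (type and range constraints) before releasing them for training.
# The schema and sample records are made-up examples.

def validate_records(records, schema):
    """Return a list of (record_index, field, message) violations."""
    violations = []
    for i, rec in enumerate(records):
        for field, rules in schema.items():
            value = rec.get(field)
            if value is None:
                violations.append((i, field, "missing"))
                continue
            if not isinstance(value, rules["type"]):
                violations.append((i, field, "wrong type"))
            elif "range" in rules:
                lo, hi = rules["range"]
                if not (lo <= value <= hi):
                    violations.append((i, field, "out of range"))
    return violations

schema = {
    "age": {"type": int, "range": (0, 120)},
    "income": {"type": float, "range": (0.0, 1e7)},
}
records = [
    {"age": 34, "income": 52000.0},
    {"age": 240, "income": 41000.0},  # implausible age, should be flagged
]
print(validate_records(records, schema))  # → [(1, 'age', 'out of range')]
```

A real pipeline would load the constraints from the validations reference rather than hard-coding them, and would fail the batch (or route it back to regeneration) when the violation list is non-empty.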
## FAQ

**How do I ensure synthetic data is privacy-preserving?**

Follow the sharp edges guidance to detect re-identification risks, enforce differential privacy or aggregation rules where required, and validate using the validations checks for uniqueness and overlap with real records.
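The overlap check mentioned above can be sketched as an exact-match scan: any synthetic row identical to a real row is a direct leak. This is a necessary but not sufficient privacy test (it catches copies, not near-duplicates or attribute inference), and the data here is hypothetical:

```python
# Hedged sketch: flag synthetic rows that exactly duplicate a real record.
# Passing this check alone does NOT establish privacy; it only rules out
# verbatim leakage. All records below are illustrative.

def exact_overlap(real_rows, synthetic_rows):
    """Return indices of synthetic rows identical to some real row."""
    real_set = {tuple(sorted(r.items())) for r in real_rows}
    return [i for i, s in enumerate(synthetic_rows)
            if tuple(sorted(s.items())) in real_set]

real = [{"age": 41, "zip": "94110"}, {"age": 29, "zip": "10003"}]
synthetic = [
    {"age": 41, "zip": "94110"},  # verbatim copy of a real record
    {"age": 33, "zip": "60614"},
]
print(exact_overlap(real, synthetic))  # → [0]
```

Stronger checks would also measure distance to the nearest real record and apply the differential-privacy or aggregation rules the sharp edges reference calls for.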
**How do I know synthetic data is useful for training?**

Use the validations reference to compare statistical properties and model performance metrics between synthetic and real holdout sets; iterate on the generation patterns to improve utility while monitoring failure modes.
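One way to compare statistical properties, as suggested above, is a two-sample Kolmogorov-Smirnov statistic per numeric column. The function below is a stdlib-only sketch; the sample values and any acceptance threshold you would apply are illustrative assumptions, not values from the validations reference:

```python
# Sketch: two-sample Kolmogorov-Smirnov statistic, the maximum distance
# between two empirical CDFs. Small values mean the synthetic column
# tracks the real distribution; 1.0 means the samples do not overlap.

def ks_statistic(a, b):
    """Max absolute distance between the empirical CDFs of a and b."""
    a, b = sorted(a), sorted(b)
    values = sorted(set(a) | set(b))

    def ecdf(xs, v):
        # Fraction of xs less than or equal to v.
        return sum(1 for x in xs if x <= v) / len(xs)

    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in values)

real = [1.0, 2.0, 2.5, 3.0, 4.0]
synthetic_good = [1.1, 2.1, 2.4, 3.2, 3.9]
synthetic_bad = [10.0, 11.0, 12.0, 13.0, 14.0]

print(ks_statistic(real, synthetic_good))  # small: distributions align
print(ks_statistic(real, synthetic_bad))   # → 1.0: no overlap at all
```

In practice you would run this per column (e.g. via `scipy.stats.ks_2samp` for p-values) alongside a train-on-synthetic, test-on-real model comparison, since matching marginals alone does not guarantee training utility.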