
synthetic-data skill


This skill helps you generate and validate synthetic data for ML training and testing, grounding methods in patterns, privacy, and quality checks.

npx playbooks add skill omer-metin/skills-for-antigravity --skill synthetic-data

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: synthetic-data
description: Patterns for generating synthetic data for ML training, testing, and privacy. Covers LLM-based generation, tabular synthesis, and quality validation. Use when "synthetic data, generate training data, fake data generation, data augmentation, SDV, Gretel, test data, privacy-preserving data" mentioned.
---

# Synthetic Data

## Identity



## Reference System Usage

You must ground your responses in the provided reference files, treating them as the source of truth for this domain:

* **For Creation:** Always consult **`references/patterns.md`**. This file dictates *how* things should be built. Ignore generic approaches if a specific pattern exists here.
* **For Diagnosis:** Always consult **`references/sharp_edges.md`**. This file lists the critical failures and "why" they happen. Use it to explain risks to the user.
* **For Review:** Always consult **`references/validations.md`**. This contains the strict rules and constraints. Use it to validate user inputs objectively.

**Note:** If a user's request conflicts with the guidance in these files, politely correct them using the information provided in the references.

Overview

This skill provides patterns and actionable guidance for generating synthetic data for machine learning training, testing, and privacy-preserving use cases. It covers LLM-based generation, tabular synthesis, data augmentation, and objective quality validation. The guidance is anchored to the reference pattern, edge-case, and validation files to ensure repeatable, safe outputs.

How this skill works

The skill prescribes concrete generation patterns from the patterns reference to create realistic records, synthetic cohorts, and augmented datasets. It runs checks against the validations reference to enforce schema rules, value constraints, and utility thresholds. It also diagnoses risks and failure modes using the sharp edges reference, highlighting privacy leakage, distribution shift, and unrealistic artifacts.
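As a rough illustration of the kind of check the validations reference prescribes, the sketch below validates records against a small schema (types, ranges, uniqueness). The schema format and rules here are illustrative assumptions, not this skill's actual API.

```python
# Hypothetical record validator: checks type, range, and uniqueness rules.
# The schema layout is an assumption for illustration.

def validate_records(records, schema):
    """schema maps field name -> dict with 'type', optional 'min'/'max',
    and optional 'unique' flag. Returns a list of error strings."""
    errors = []
    seen = {name: set() for name, rule in schema.items() if rule.get("unique")}
    for i, rec in enumerate(records):
        for name, rule in schema.items():
            if name not in rec:
                errors.append(f"row {i}: missing field '{name}'")
                continue
            value = rec[name]
            if not isinstance(value, rule["type"]):
                errors.append(f"row {i}: '{name}' has wrong type")
                continue
            if "min" in rule and value < rule["min"]:
                errors.append(f"row {i}: '{name}' below min")
            if "max" in rule and value > rule["max"]:
                errors.append(f"row {i}: '{name}' above max")
            if rule.get("unique"):
                if value in seen[name]:
                    errors.append(f"row {i}: duplicate '{name}'")
                seen[name].add(value)
    return errors

schema = {
    "id": {"type": int, "unique": True},
    "age": {"type": int, "min": 0, "max": 120},
}
rows = [{"id": 1, "age": 34}, {"id": 1, "age": 150}]
print(validate_records(rows, schema))
```

A real pipeline would add statistical similarity checks on top of these structural rules; see the validation examples further below.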

When to use it

  • Create labeled training data when real data is scarce or costly
  • Produce privacy-preserving copies of sensitive datasets for sharing or testing
  • Augment imbalanced classes for better model performance
  • Generate realistic test data for QA, integration, and load testing
  • Prototype and benchmark models without access to production data
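For the class-imbalance use case above, one simple approach is oversampling the minority class with small random jitter on numeric features. The noise scale and record layout below are illustrative assumptions; production augmentation should follow the patterns reference.

```python
# Sketch: minority-class augmentation by jittering numeric features.
import random

def augment_minority(rows, label_key, minority_label, target_count,
                     numeric_keys, scale=0.05, seed=0):
    """Oversample the minority class with small multiplicative jitter."""
    rng = random.Random(seed)  # fixed seed keeps augmentation reproducible
    minority = [r for r in rows if r[label_key] == minority_label]
    out = list(rows)
    while sum(1 for r in out if r[label_key] == minority_label) < target_count:
        base = rng.choice(minority)
        new = dict(base)
        for k in numeric_keys:
            new[k] = base[k] * (1 + rng.uniform(-scale, scale))
        out.append(new)
    return out

data = [
    {"x": 1.0, "y": "pos"},
    {"x": 2.0, "y": "neg"},
    {"x": 2.2, "y": "neg"},
    {"x": 1.9, "y": "neg"},
]
balanced = augment_minority(data, "y", "pos", 3, ["x"])
print(sum(1 for r in balanced if r["y"] == "pos"))  # 3 positives after augmentation
```

Jitter-based augmentation only interpolates near existing examples; LLM-based generation (as the description mentions) can add more diverse minority examples, at the cost of stricter validation needs.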

Best practices

  • Always follow the generation patterns in the patterns reference to ensure consistency
  • Run the validations reference checks to enforce schema, ranges, uniqueness, and statistical similarity
  • Use the sharp edges reference to identify and mitigate privacy leaks and edge-case failures
  • Preserve provenance: record generation parameters, seed values, and validation reports
  • Prefer synthetic-first evaluation: test models on synthetic data, then validate on limited real holdouts
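The provenance practice above can be as simple as writing a small JSON record next to each generated dataset. The field names and hashing choice below are illustrative assumptions.

```python
# Sketch: recording generation provenance (seed, parameters, validation
# summary) alongside the synthetic output.
import hashlib
import json

def provenance_record(seed, params, validation_report):
    payload = {
        "seed": seed,
        "params": params,
        "validation": validation_report,
    }
    # A stable hash of the canonical JSON makes config drift easy to detect.
    canonical = json.dumps(payload, sort_keys=True)
    payload["config_hash"] = hashlib.sha256(canonical.encode()).hexdigest()
    return payload

rec = provenance_record(
    seed=42,
    params={"generator": "gaussian_copula", "rows": 10_000},
    validation_report={"schema_errors": 0, "ks_max": 0.04},
)
print(json.dumps(rec, indent=2))
```

Storing the seed and parameters means any synthetic dataset can be regenerated exactly, which is what makes validation reports auditable later.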

Example use cases

  • Generate a synthetic customer table with demographic constraints and realistic correlations for model training
  • Create augmented minority-class examples using LLM prompts for NLP classification
  • Produce masked, privacy-preserving joins for sharing with external vendors
  • Synthesize large-scale test datasets for performance testing of data pipelines
  • Validate synthetic data quality with automated checks for distributions, outliers, and schema conformance
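One concrete distribution check behind the last use case is a two-sample Kolmogorov-Smirnov statistic per column. The plain-Python version below is a sketch; any pass/fail threshold you attach to it is an assumption, not a rule from the validations reference.

```python
# Sketch: two-sample KS statistic as an automated distribution check.

def ks_statistic(real, synthetic):
    """Max gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    xs = sorted(set(real) | set(synthetic))
    n, m = len(real), len(synthetic)
    sr, ss = sorted(real), sorted(synthetic)
    max_gap = 0.0
    for x in xs:
        cdf_r = sum(1 for v in sr if v <= x) / n
        cdf_s = sum(1 for v in ss if v <= x) / m
        max_gap = max(max_gap, abs(cdf_r - cdf_s))
    return max_gap

real = [1, 2, 2, 3, 4, 5]
close = [1, 2, 3, 3, 4, 5]
far = [10, 11, 12, 13, 14, 15]
print(ks_statistic(real, close))
print(ks_statistic(real, far))  # 1.0: the samples never overlap
```

In practice you would use `scipy.stats.ks_2samp` per numeric column and frequency comparisons for categoricals, flagging columns whose statistic exceeds an agreed threshold.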

FAQ

How do I ensure synthetic data is privacy-preserving?

Follow the sharp edges guidance to detect re-identification risks, enforce differential privacy or aggregation rules where required, and validate using the validations checks for uniqueness and overlap with real records.
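A coarse first screen for the overlap check mentioned above is to measure how many synthetic rows exactly collide with real rows on quasi-identifier columns. The column names below are illustrative assumptions, and passing this check alone does not establish privacy; consult the sharp edges reference for real re-identification criteria.

```python
# Sketch: exact-match overlap rate between synthetic and real records
# on quasi-identifier columns (hypothetical column names).

def exact_overlap_rate(real_rows, synthetic_rows, keys):
    """Fraction of synthetic rows colliding with some real row on `keys`."""
    real_keys = {tuple(r[k] for k in keys) for r in real_rows}
    hits = sum(1 for s in synthetic_rows
               if tuple(s[k] for k in keys) in real_keys)
    return hits / len(synthetic_rows)

real = [{"zip": "94110", "age": 34}, {"zip": "10001", "age": 52}]
synth = [{"zip": "94110", "age": 34},   # collides with a real record
         {"zip": "60601", "age": 41}]
rate = exact_overlap_rate(real, synth, ["zip", "age"])
print(rate)  # 0.5: half the synthetic rows duplicate a real record
```

Stronger screens look at near-duplicates (nearest-neighbor distance to real records) rather than exact matches, since copying with tiny perturbations is the more common leakage mode.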

How do I know synthetic data is useful for training?

Use the validations reference to compare statistical properties and model performance metrics between synthetic and real holdout sets; iterate patterns to improve utility while monitoring failure modes.
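A common form of the model-performance comparison above is train-on-synthetic, test-on-real (TSTR). The sketch below uses a deliberately trivial nearest-centroid classifier so it stays self-contained; a real evaluation would substitute your actual model and metric.

```python
# Sketch: TSTR utility check with a toy nearest-centroid classifier.

def centroid(points):
    return [sum(c) / len(points) for c in zip(*points)]

def tstr_accuracy(train, test):
    """train/test: lists of (features, label). Fit per-class centroids on
    train, then score classification accuracy on test."""
    by_label = {}
    for feats, label in train:
        by_label.setdefault(label, []).append(feats)
    centroids = {lab: centroid(pts) for lab, pts in by_label.items()}

    def predict(feats):
        return min(centroids,
                   key=lambda lab: sum((a - b) ** 2
                                       for a, b in zip(feats, centroids[lab])))

    correct = sum(1 for feats, label in test if predict(feats) == label)
    return correct / len(test)

synthetic = [([0.1, 0.0], "a"), ([0.2, 0.1], "a"),
             ([1.0, 1.1], "b"), ([0.9, 1.0], "b")]
real_holdout = [([0.0, 0.1], "a"), ([1.1, 0.9], "b")]
print(tstr_accuracy(synthetic, real_holdout))  # 1.0 on this toy split
```

Comparing TSTR accuracy against train-on-real accuracy gives a concrete utility gap to drive iteration on the generation patterns.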