This skill helps build and manage golden datasets for evaluation, covering quality validation, versioning, and CI integration across datasets and pipelines.

npx playbooks add skill yonatangross/orchestkit --skill golden-dataset

---
name: golden-dataset
license: MIT
compatibility: "Claude Code 2.1.34+."
description: Golden dataset lifecycle patterns for curation, versioning, quality validation, and CI integration. Use when building evaluation datasets, managing dataset versions, validating quality scores, or integrating golden tests into pipelines.
tags: [golden-dataset, evaluation, dataset-curation, dataset-validation, quality, llm-testing]
context: fork
agent: data-pipeline-engineer
version: 2.0.0
author: OrchestKit
user-invocable: false
complexity: medium
metadata:
  category: document-asset-creation
---

# Golden Dataset

Comprehensive patterns for building, managing, and validating golden datasets for AI/ML evaluation. Each category has individual rule files in `rules/`, loaded on demand.

## Quick Reference

| Category | Rules | Impact | When to Use |
| -------- | ----- | ------ | ----------- |
| [Curation](#curation) | 3 | HIGH | Content collection, annotation pipelines, diversity analysis |
| [Management](#management) | 3 | HIGH | Versioning, backup/restore, CI/CD automation |
| [Validation](#validation) | 3 | CRITICAL | Quality scoring, drift detection, regression testing |
| [Add Workflow](#add-workflow) | 1 | HIGH | 9-phase curation, quality scoring, bias detection, silver-to-gold |

Total: 10 rules across 4 categories

## Curation

Content collection, multi-agent annotation, and diversity analysis for golden datasets.

| Rule | File | Key Pattern |
| ---- | ---- | ----------- |
| Collection | `rules/curation-collection.md` | Content type classification, quality thresholds, duplicate prevention |
| Annotation | `rules/curation-annotation.md` | Multi-agent pipeline, consensus aggregation, Langfuse tracing |
| Diversity | `rules/curation-diversity.md` | Difficulty stratification, domain coverage, balance guidelines |
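
As a minimal sketch of the diversity guideline, the check below compares a document's test queries against the per-difficulty minimums from the Key Decisions table (3 trivial, 3 easy, 5 medium, 3 hard). The query dicts and their `difficulty` field are illustrative assumptions, not a fixed schema; see `rules/curation-diversity.md` for the authoritative balance guidance.

```python
from collections import Counter

# Minimum queries per difficulty tier, per the Key Decisions table below.
MIN_PER_DIFFICULTY = {"trivial": 3, "easy": 3, "medium": 5, "hard": 3}

def check_difficulty_balance(queries: list[dict]) -> dict:
    """Flag difficulty tiers that fall below the recommended minimums."""
    counts = Counter(q.get("difficulty", "unknown") for q in queries)
    gaps = {
        tier: minimum - counts[tier]
        for tier, minimum in MIN_PER_DIFFICULTY.items()
        if counts[tier] < minimum
    }
    return {"balanced": not gaps, "missing": gaps}
```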

## Management

Versioning, storage, and CI/CD automation for golden datasets.

| Rule | File | Key Pattern |
| ---- | ---- | ----------- |
| Versioning | `rules/management-versioning.md` | JSON backup format, embedding regeneration, disaster recovery |
| Storage | `rules/management-storage.md` | Backup strategies, URL contract, data integrity checks |
| CI Integration | `rules/management-ci.md` | GitHub Actions automation, pre-deployment validation, weekly backups |
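
The backup pattern can be sketched as below: documents are serialized to a dated JSON file with embedding vectors stripped, so the snapshot stays small and version-control friendly. The `embedding` field name and snapshot layout are assumptions for illustration, not the canonical backup format from `rules/management-versioning.md`.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def backup_dataset(documents: list[dict], backup_dir: Path) -> Path:
    """Write a version-controllable JSON backup, excluding embedding vectors."""
    now = datetime.now(timezone.utc)
    snapshot = {
        "created_at": now.isoformat(),
        "document_count": len(documents),
        # Embeddings are dropped here and regenerated on restore.
        "documents": [
            {k: v for k, v in doc.items() if k != "embedding"}
            for doc in documents
        ],
    }
    path = backup_dir / f"golden-dataset-{now:%Y%m%d}.json"
    path.write_text(json.dumps(snapshot, indent=2, ensure_ascii=False))
    return path
```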

## Validation

Quality scoring, drift detection, and regression testing for golden datasets.

| Rule | File | Key Pattern |
| ---- | ---- | ----------- |
| Quality | `rules/validation-quality.md` | Schema validation, content quality, referential integrity |
| Drift | `rules/validation-drift.md` | Duplicate detection, semantic similarity, coverage gap analysis |
| Regression | `rules/validation-regression.md` | Difficulty distribution, pre-commit hooks, full dataset validation |
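
A hedged sketch of the duplicate check, using plain cosine similarity over precomputed embeddings and the thresholds from the Key Decisions table (block at >= 0.90, warn at >= 0.85). The actual rule may use a different similarity backend or vector store query.

```python
import numpy as np

BLOCK_THRESHOLD = 0.90  # similarity at or above this blocks the entry
WARN_THRESHOLD = 0.85   # similarity at or above this flags it for review

def check_near_duplicate(candidate: np.ndarray, existing: list[np.ndarray]) -> dict:
    """Compare a candidate embedding against existing entries via cosine similarity."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    best = max((cosine(candidate, vec) for vec in existing), default=0.0)
    if best >= BLOCK_THRESHOLD:
        return {"action": "block", "similarity": best}
    if best >= WARN_THRESHOLD:
        return {"action": "warn", "similarity": best}
    return {"action": "allow", "similarity": best}
```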

## Add Workflow

Structured workflow for adding new documents to the golden dataset.

| Rule | File | Key Pattern |
| ---- | ---- | ----------- |
| Add Document | `rules/curation-add-workflow.md` | 9-phase curation, parallel quality analysis, bias detection |
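
The quality and confidence gates from the Key Decisions table can be combined into a simple routing step, sketched below. Which phase of the 9-phase workflow runs this gate is defined in `rules/curation-add-workflow.md`.

```python
QUALITY_THRESHOLD = 0.70     # minimum quality score for inclusion
CONFIDENCE_THRESHOLD = 0.65  # minimum consensus confidence for auto-include

def gate_candidate(quality_score: float, confidence: float) -> str:
    """Route a scored candidate: auto-include, send to manual review, or reject."""
    if quality_score < QUALITY_THRESHOLD:
        return "reject"
    if confidence >= CONFIDENCE_THRESHOLD:
        return "auto-include"
    return "manual-review"
```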

## Quick Start Example

```python
async def validate_before_add(document: dict, source_url_map: dict) -> dict:
    """Pre-addition validation for golden dataset entries."""
    errors = []
    source_url = document.get("source_url", "")

    # 1. URL contract check
    if "placeholder" in source_url:
        errors.append("URL must be canonical, not a placeholder")

    # 2. Duplicate prevention: the URL must not already exist in the dataset
    if source_url in source_url_map:
        errors.append(f"Duplicate source_url: {source_url}")

    # 3. Content quality
    if len(document.get("title", "")) < 10:
        errors.append("Title too short (min 10 chars)")

    # 4. Tag requirements
    if len(document.get("tags", [])) < 2:
        errors.append("At least 2 domain tags required")

    return {"valid": len(errors) == 0, "errors": errors}
```
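
A hypothetical invocation of the helper above, assuming an asyncio entry point and a `source_url -> id` map of entries already in the dataset:

```python
import asyncio

candidate = {
    "source_url": "https://example.com/articles/retrieval-eval",  # illustrative URL
    "title": "Evaluating retrieval quality with golden datasets",
    "tags": ["rag", "evaluation"],
}
existing_urls: dict[str, str] = {}  # source_url -> document id for current entries

result = asyncio.run(validate_before_add(candidate, existing_urls))
print(result)  # expected: {"valid": True, "errors": []}
```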

## Key Decisions

| Decision | Recommendation |
| -------- | -------------- |
| Backup format | JSON (version controlled, portable) |
| Embedding storage | Exclude from backup (regenerate on restore) |
| Quality threshold | >= 0.70 quality score for inclusion |
| Confidence threshold | >= 0.65 for auto-include |
| Duplicate threshold | >= 0.90 similarity blocks, >= 0.85 warns |
| Min tags per entry | 2 domain tags |
| Min test queries | 3 per document |
| Difficulty balance | Trivial 3, Easy 3, Medium 5, Hard 3 minimum |
| CI frequency | Weekly automated backup (Sunday 2am UTC) |

## Common Mistakes

1. Using placeholder URLs instead of canonical source URLs
2. Skipping embedding regeneration after restore
3. Not validating referential integrity between documents and queries
4. Over-indexing on articles (neglecting tutorials, research papers)
5. Missing difficulty distribution balance in test queries
6. Not running verification after backup/restore operations
7. Testing restore procedures in production instead of staging
8. Committing SQL dumps instead of JSON (not version-control friendly)

## Evaluations

See `test-cases.json` for 9 test cases across all categories.

## Related Skills

- `rag-retrieval` - Retrieval evaluation using golden dataset
- `langfuse-observability` - Tracing patterns for curation workflows
- `testing-patterns` - General testing patterns and strategies
- `ai-native-development` - Embedding generation for restore

## Capability Details

### curation

**Keywords:** golden dataset, curation, content collection, annotation, quality criteria

**Solves:**

- Classify document content types for golden dataset
- Run multi-agent quality analysis pipelines
- Generate test queries for new documents

### management

**Keywords:** golden dataset, backup, restore, versioning, disaster recovery

**Solves:**

- Backup and restore golden datasets with JSON
- Regenerate embeddings after restore
- Automate backups with CI/CD

### validation

**Keywords:** golden dataset, validation, schema, duplicate detection, quality metrics

**Solves:**

- Validate entries against document schema
- Detect duplicate or near-duplicate entries
- Analyze dataset coverage and distribution gaps

## Overview

This skill codifies golden-dataset lifecycle patterns for curation, versioning, validation, and CI integration. It provides concrete rules and decision points for building evaluation datasets, enforcing quality thresholds, and automating backups and tests. The patterns are language-agnostic and include orchestration tips for embedding regeneration and regression testing.

## How this skill works

The skill organizes rules into four categories—Curation, Management, Validation, and Add Workflow—each with actionable checks and recommended thresholds. It inspects document metadata, content quality, tag coverage, similarity scores, and CI hooks to validate entries and drive automated pipelines. Outputs include validation results, error lists, recommended actions, and CI-ready steps for backup/restore and regression tests.

## When to use it

- Building or curating evaluation/golden datasets for ML models
- Adding new documents and ensuring they meet quality and tag requirements
- Versioning and automating backups with weekly CI jobs
- Running dataset validation: schema checks, duplicate/drift detection, and regression tests
- Integrating golden tests into pre-deployment or pre-commit pipelines

## Best practices

- Use JSON for backups to keep dataset history version-controlled and portable
- Exclude embeddings from backups; regenerate embeddings on restore to avoid stale vectors
- Enforce minimum tags (>= 2) and minimum test queries (>= 3) per document
- Treat quality and confidence thresholds as gates (quality >= 0.70, confidence >= 0.65) for auto-include
- Run weekly automated backups and validate restore procedures in staging before production

## Example use cases

- Pre-add validation: reject placeholder URLs, short titles, or insufficient tags before insertion
- CI pipeline: run full dataset validation and regression checks on nightly or weekly schedules
- Restore workflow: restore a JSON backup, regenerate embeddings, then run referential integrity and quality scans (see the sketch after this list)
- Curation pipeline: multi-agent annotation with consensus aggregation and difficulty stratification
- Drift detection: scheduled similarity scans to surface duplicates and coverage gaps
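
A minimal sketch of that restore workflow, assuming `embed_text` from `app.shared.services.embeddings` is an async helper that maps text to a vector (its real signature may differ):

```python
import json
from pathlib import Path

from app.shared.services.embeddings import embed_text  # assumed async text -> vector helper

async def restore_dataset(backup_path: Path) -> list[dict]:
    """Load a JSON backup and regenerate embeddings before validation scans."""
    snapshot = json.loads(backup_path.read_text())
    documents = snapshot["documents"]

    # Embeddings are excluded from backups, so they are regenerated here.
    for doc in documents:
        doc["embedding"] = await embed_text(doc.get("title", ""))

    # Follow up with referential integrity checks and quality scans before serving.
    return documents
```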

## FAQ

### What format should backups use?

Use JSON backups for portability and version control; exclude embeddings and regenerate them after restore.

### What are the recommended quality thresholds?

Use a minimum quality score of 0.70 for inclusion and a confidence threshold of 0.65 for auto-inclusion; treat these as adjustable gates.

### How do I handle near-duplicates?

Block entries with similarity >=0.90 and flag entries >=0.85 for manual review; integrate semantic-similarity scans into validation.