
add-golden skill

/plugins/ork/skills/add-golden

This skill curates documents into a gold standard dataset using multi-agent validation, quality scoring, bias checks, and version tracking.

npx playbooks add skill yonatangross/orchestkit --skill add-golden

SKILL.md
---
name: add-golden
description: Curate and add documents to the golden dataset with multi-agent validation. Use when adding test data, creating golden datasets, saving examples.
context: fork
version: 2.0.0
author: OrchestKit
tags: [curation, golden-dataset, evaluation, testing, quality-scoring, bias-detection]
user-invocable: true
allowedTools: [Read, Write, Edit, Grep, Glob, Task, TaskCreate, TaskUpdate, mcp__memory__search_nodes]
skills: [golden-dataset-validation, llm-evaluation, test-data-management]
---

# Add to Golden Dataset

Multi-agent curation workflow with quality score explanations, bias detection, and version tracking.

## Quick Start

```bash
/add-golden https://example.com/article
/add-golden https://arxiv.org/abs/2312.xxxxx
```

---

## Task Management (CC 2.1.16)

```python
# Create main curation task
TaskCreate(
  subject="Add to golden dataset: {url}",
  description="Multi-agent curation with quality explanation",
  activeForm="Curating document"
)

# Create subtasks for the 9-phase process
phases = ["Fetch content", "Run quality analysis", "Explain scores",
          "Check bias", "Check diversity", "Validate", "Get approval",
          "Write to dataset", "Update version"]
active_forms = ["Fetching content", "Running quality analysis", "Explaining scores",
                "Checking bias", "Checking diversity", "Validating", "Getting approval",
                "Writing to dataset", "Updating version"]
for subject, active_form in zip(phases, active_forms):
    TaskCreate(subject=subject, activeForm=active_form)
```

---

## Workflow Overview

| Phase | Activities | Output |
|-------|------------|--------|
| **1. Input Collection** | Get URL, detect content type | Document metadata |
| **2. Fetch and Extract** | Parse document structure | Structured content |
| **3. Quality Analysis** | 4 parallel agents evaluate | Raw scores |
| **4. Quality Explanation** | Explain WHY each score was given | Score rationale |
| **5. Bias Detection** | Check for bias in content | Bias report |
| **6. Diversity Check** | Assess dataset balance | Diversity metrics |
| **7. Validation** | Schema, duplicates, gates | Validation status |
| **8. Silver-to-Gold** | Promote or mark as silver | Classification |
| **9. Version Tracking** | Track changes, rollback | Version entry |

---

## Phase 1-2: Input and Extraction

Detect content type: article, tutorial, documentation, research_paper.

Extract: title, sections, code blocks, key terms, metadata (author, date).
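
A minimal sketch of what the extracted record might look like, assuming a simple dataclass shape; the field names are illustrative, not a fixed schema.

```python
from dataclasses import dataclass, field

@dataclass
class ExtractedDocument:
    """Illustrative shape of the Phase 1-2 output; field names are assumptions."""
    url: str
    content_type: str                                # article | tutorial | documentation | research_paper
    title: str
    sections: list[str] = field(default_factory=list)
    code_blocks: list[str] = field(default_factory=list)
    key_terms: list[str] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)     # author, date, ...
```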

---

## Phase 3: Parallel Quality Analysis (4 Agents)

Launch ALL agents in ONE message with `run_in_background=True`; a launch sketch follows the table below.

| Agent | Focus | Output |
|-------|-------|--------|
| code-quality-reviewer | Accuracy, coherence, depth, relevance | Quality scores |
| workflow-architect | Keyword directness, paraphrase, reasoning | Difficulty level |
| data-pipeline-engineer | Primary/secondary domains, skill level | Tags |
| test-generator | Direct, paraphrased, multi-hop queries | Test queries |
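
A hedged sketch of the parallel launch, assuming a `Task(subagent_type=..., prompt=..., run_in_background=...)` call shape; the prompts and parameter names below are illustrative, not a confirmed signature.

```python
# All four evaluators fired in one message so they run concurrently.
evaluators = [
    ("code-quality-reviewer",  "Score accuracy, coherence, depth, relevance for {url}"),
    ("workflow-architect",     "Assess keyword directness, paraphrase coverage, reasoning depth"),
    ("data-pipeline-engineer", "Tag primary/secondary domains and skill level"),
    ("test-generator",         "Write direct, paraphrased, and multi-hop test queries"),
]
for agent, prompt in evaluators:
    Task(subagent_type=agent, prompt=prompt, run_in_background=True)
```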

See [Quality Scoring](references/quality-scoring.md) for detailed criteria.

---

## Phase 4: Quality Explanation

Each dimension gets a WHY explanation:

```markdown
### Accuracy: [N.NN]/1.0
**Why this score:**
- [Specific reason with evidence]
**What would improve it:**
- [Specific improvement]
```
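
A small sketch of how such a block could be rendered programmatically; `explain_dimension` is a hypothetical helper, not part of the skill's API.

```python
def explain_dimension(name: str, score: float,
                      reasons: list[str], improvements: list[str]) -> str:
    """Render one per-dimension explanation block in the format shown above."""
    lines = [f"### {name}: {score:.2f}/1.0", "**Why this score:**"]
    lines += [f"- {reason}" for reason in reasons]
    lines.append("**What would improve it:**")
    lines += [f"- {improvement}" for improvement in improvements]
    return "\n".join(lines)
```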

---

## Phase 5: Bias Detection

See [Bias Detection Guide](references/bias-detection-guide.md) for patterns.

Check for:
- Technology bias (favors specific tools)
- Recency bias (ignores LTS versions)
- Complexity bias (assumed knowledge)
- Vendor bias (promotes products)
- Geographic/cultural bias

| Bias Score | Action |
|------------|--------|
| 0-2 | Proceed normally |
| 3-5 | Add disclaimer |
| 6-8 | Require user review |
| 9-10 | Recommend against |
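
As a sketch, the table above maps to a simple threshold function; the helper name and return strings are illustrative.

```python
def bias_action(bias_score: int) -> str:
    """Map a 0-10 bias score to the curation action in the table above."""
    if bias_score <= 2:
        return "proceed"
    if bias_score <= 5:
        return "add disclaimer"
    if bias_score <= 8:
        return "require user review"
    return "recommend against inclusion"
```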

---

## Phase 6: Diversity Dashboard

Track dataset balance across:
- Domain distribution (AI/ML, Backend, Frontend, DevOps, Security)
- Difficulty distribution (trivial, easy, medium, hard, adversarial)

**Impact assessment:** Does the new document improve or worsen diversity?
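
One way the impact assessment could be approximated, using the share of the largest domain as a balance proxy; the real dashboard may use a different metric.

```python
from collections import Counter

def diversity_impact(existing_domains: list[str], new_domain: str) -> str:
    """Toy balance check: share of the largest domain before vs. after adding the document."""
    if not existing_domains:
        return "improves diversity"
    before = Counter(existing_domains)
    after = before + Counter([new_domain])
    skew_before = max(before.values()) / len(existing_domains)
    skew_after = max(after.values()) / (len(existing_domains) + 1)
    return "improves diversity" if skew_after < skew_before else "worsens or preserves balance"
```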

---

## Phase 7: Validation

- URL validation (no placeholders)
- Schema validation (required fields)
- Duplicate check (>80% similarity)
- Quality gates (min sections, content length)
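
A hypothetical sketch of these gates; the 2-section and 500-character minimums and the word-set similarity are stand-ins for whatever the skill actually enforces.

```python
def validation_failures(doc: dict, existing_docs: list[dict]) -> list[str]:
    """Return a list of failed gates for the checks listed above (illustrative thresholds)."""
    failures = []
    url = doc.get("url", "")
    if not url.startswith("http") or "{" in url:
        failures.append("url: missing or placeholder")
    for required in ("url", "title", "sections", "content"):
        if required not in doc:
            failures.append(f"schema: missing field '{required}'")

    def similar(a: str, b: str) -> float:            # toy stand-in for the real duplicate check
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

    if any(similar(doc.get("content", ""), d.get("content", "")) > 0.80 for d in existing_docs):
        failures.append("duplicate: >80% similar to an existing document")
    if len(doc.get("sections", [])) < 2 or len(doc.get("content", "")) < 500:
        failures.append("quality gate: too few sections or too little content")
    return failures
```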

---

## Phase 8: Silver-to-Gold Workflow

See [Silver-Gold Promotion](references/silver-gold-promotion.md) for criteria.

| Status | Criteria | Action |
|--------|----------|--------|
| **GOLD** | Score >= 0.75, no bias | Add to main dataset |
| **SILVER** | Score 0.55-0.74 | Add to silver, track |
| **REJECT** | Score < 0.55 | Do not add |

**Promotion criteria:** 7+ days in silver, quality >= 0.75, no negative feedback.
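
A minimal sketch of the classification and promotion rules above; treating a high-scoring but biased document as SILVER is an assumption.

```python
def classify(quality_score: float, bias_flagged: bool) -> str:
    """Default silver/gold thresholds from the table above."""
    if quality_score >= 0.75 and not bias_flagged:
        return "GOLD"
    if quality_score >= 0.55:
        return "SILVER"
    return "REJECT"

def promotable(days_in_silver: int, quality_score: float, has_negative_feedback: bool) -> bool:
    """Silver-to-gold promotion: 7+ days in silver, quality >= 0.75, no negative feedback."""
    return days_in_silver >= 7 and quality_score >= 0.75 and not has_negative_feedback
```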

---

## Phase 9: Version Tracking

```json
{
  "version": "1.2.3",
  "change_type": "ADD|UPDATE|REMOVE|PROMOTE",
  "document_id": "doc-123",
  "quality_score": 0.82,
  "rollback_available": true
}
```

| Update Type | Version Bump |
|-------------|--------------|
| Add/Update document | Patch (0.0.X) |
| Remove document | Minor (0.X.0) |
| Schema change | Major (X.0.0) |
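
The bump rules in the table can be sketched as follows; the `SCHEMA` label for schema changes is an assumption, and PROMOTE is treated like ADD/UPDATE here.

```python
def bump_version(version: str, change_type: str) -> str:
    """Semantic-version bump per the table above."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change_type == "SCHEMA":                      # schema change -> major
        return f"{major + 1}.0.0"
    if change_type == "REMOVE":                      # remove document -> minor
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"            # ADD / UPDATE / PROMOTE -> patch

bump_version("1.2.3", "ADD")   # -> "1.2.4"
```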

---

## Quality Scoring

| Dimension | Weight |
|-----------|--------|
| Accuracy | 0.25 |
| Coherence | 0.20 |
| Depth | 0.25 |
| Relevance | 0.30 |

**Formula:** `quality_score = accuracy*0.25 + coherence*0.20 + depth*0.25 + relevance*0.30`
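
The same formula as a small helper, with a worked example:

```python
def quality_score(accuracy: float, coherence: float, depth: float, relevance: float) -> float:
    """Weighted sum of the four dimensions; each input is expected in [0, 1]."""
    return accuracy * 0.25 + coherence * 0.20 + depth * 0.25 + relevance * 0.30

quality_score(0.9, 0.8, 0.85, 0.75)   # -> 0.8225
```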

---

## Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Score explanation | Required | Transparency, actionable feedback |
| Bias detection | Dedicated agent | Prevent dataset contamination |
| Two-tier system | Silver + Gold | Allow docs time to mature |
| Version tracking | Semantic versioning | Clear history, safe rollbacks |

---

## Related Skills

- `golden-dataset-validation` - Validate existing datasets
- `llm-evaluation` - LLM output evaluation patterns
- `test-data-management` - Test data strategies

---

**Version:** 2.0.0 (January 2026)

Overview

This skill curates and adds documents to a golden dataset using a multi-agent validation workflow. It runs parallel quality evaluators, bias and diversity checks, schema/duplicate validation, and version tracking to promote content from silver to gold. Use it to create reliable, explainable golden examples for testing and model evaluation.

How this skill works

You provide a URL or document reference and the skill fetches and extracts structured content (title, sections, code blocks, metadata). Four parallel agents produce quality scores, tags, and test queries; separate agents generate score explanations, run bias detection, and evaluate dataset diversity. Final validation gates decide silver/gold classification and record a semantic version entry when content is added or updated.

When to use it

  • Adding new test examples to the golden dataset
  • Promoting curated content from silver to gold
  • Creating explainable examples with quality rationale for evaluation
  • Saving benchmark documents for regression testing
  • Onboarding research or tutorial content into production datasets

Best practices

  • Provide canonical URLs or stable document IDs to avoid placeholder links
  • Let the document remain in silver for the recommended probationary period before auto-promotion
  • Review bias and diversity reports for flagged content before finalizing
  • Tune quality score thresholds to match your project risk tolerance
  • Record clear change_type and rationale in version entries for auditability

Example use cases

  • Curate research papers and extract reproducible code blocks for model tests
  • Add tutorials and capture difficulty/tags for balanced training splits
  • Collect product docs while detecting and flagging vendor bias
  • Build a balanced gold set across domains (AI, backend, frontend, security) for LLM evaluation
  • Store failing examples with traceable versions to reproduce model regressions

FAQ

What determines gold vs silver classification?

Classification is determined by the document's aggregated quality score, bias status, and validation gates. Default thresholds: gold >= 0.75, silver 0.55–0.74, reject < 0.55.

How are quality scores computed?

Four dimensions (accuracy, coherence, depth, relevance) are weighted (0.25, 0.20, 0.25, 0.30) and combined into a single quality_score with per-dimension explanations.