This skill curates documents into a gold standard dataset using multi-agent validation, quality scoring, bias checks, and version tracking.
```bash
npx playbooks add skill yonatangross/orchestkit --skill add-golden
```
Copy the command above to add this skill to your agents.
---
name: add-golden
description: Curate and add documents to the golden dataset with multi-agent validation. Use when adding test data, creating golden datasets, saving examples.
context: fork
version: 2.0.0
author: OrchestKit
tags: [curation, golden-dataset, evaluation, testing, quality-scoring, bias-detection]
user-invocable: true
allowedTools: [Read, Write, Edit, Grep, Glob, Task, TaskCreate, TaskUpdate, mcp__memory__search_nodes]
skills: [golden-dataset-validation, llm-evaluation, test-data-management]
---
# Add to Golden Dataset
Multi-agent curation workflow with quality score explanations, bias detection, and version tracking.
## Quick Start
```bash
/add-golden https://example.com/article
/add-golden https://arxiv.org/abs/2312.xxxxx
```
---
## Task Management (CC 2.1.16)
```python
# Create the main curation task
TaskCreate(
    subject=f"Add to golden dataset: {url}",
    description="Multi-agent curation with quality explanation",
    activeForm="Curating document"
)

# Create one subtask per phase of the 9-phase process
phases = ["Fetch content", "Run quality analysis", "Explain scores",
          "Check bias", "Check diversity", "Validate", "Get approval",
          "Write to dataset", "Update version"]
for phase in phases:
    TaskCreate(subject=phase, activeForm=f"Working on: {phase}")
```
---
## Workflow Overview
| Phase | Activities | Output |
|-------|------------|--------|
| **1. Input Collection** | Get URL, detect content type | Document metadata |
| **2. Fetch and Extract** | Parse document structure | Structured content |
| **3. Quality Analysis** | 4 parallel agents evaluate | Raw scores |
| **4. Quality Explanation** | Explain WHY each score | Score rationale |
| **5. Bias Detection** | Check for bias in content | Bias report |
| **6. Diversity Check** | Assess dataset balance | Diversity metrics |
| **7. Validation** | Schema, duplicates, gates | Validation status |
| **8. Silver-to-Gold** | Promote or mark as silver | Classification |
| **9. Version Tracking** | Track changes, rollback | Version entry |
---
## Phase 1-2: Input and Extraction
Detect content type: article, tutorial, documentation, research_paper.
Extract: title, sections, code blocks, key terms, metadata (author, date).
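A minimal sketch of the extraction output, assuming a simple dataclass shape; the field names are illustrative, not a fixed schema:
```python
from dataclasses import dataclass, field

@dataclass
class ExtractedDocument:
    """Structured content produced by Phase 2 (illustrative fields only)."""
    url: str
    content_type: str            # article | tutorial | documentation | research_paper
    title: str
    sections: list[str] = field(default_factory=list)
    code_blocks: list[str] = field(default_factory=list)
    key_terms: list[str] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)   # e.g. {"author": ..., "date": ...}

def detect_content_type(url: str, title: str) -> str:
    """Very rough heuristic; real detection would inspect the fetched content."""
    if "arxiv.org" in url:
        return "research_paper"
    if any(word in title.lower() for word in ("tutorial", "how to", "guide")):
        return "tutorial"
    if "docs." in url or "/docs/" in url:
        return "documentation"
    return "article"
```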
---
## Phase 3: Parallel Quality Analysis (4 Agents)
Launch ALL agents in ONE message with `run_in_background=True`.
| Agent | Focus | Output |
|-------|-------|--------|
| code-quality-reviewer | Accuracy, coherence, depth, relevance | Quality scores |
| workflow-architect | Keyword directness, paraphrase, reasoning | Difficulty level |
| data-pipeline-engineer | Primary/secondary domains, skill level | Tags |
| test-generator | Direct, paraphrased, multi-hop queries | Test queries |
See [Quality Scoring](references/quality-scoring.md) for detailed criteria.
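A sketch of the parallel launch, assuming the Task tool accepts an agent name, a prompt, and the `run_in_background` flag noted above; the parameter names and `document_text` variable are illustrative, not a confirmed signature:
```python
# Launch all four evaluators in one message so they run concurrently.
# Parameter names (subagent_type, prompt, run_in_background) are assumptions
# about the Task tool interface, shown for illustration only.
agents = {
    "code-quality-reviewer":  "Score accuracy, coherence, depth, relevance (0.0-1.0 each).",
    "workflow-architect":     "Classify difficulty: keyword directness, paraphrase, reasoning.",
    "data-pipeline-engineer": "Tag primary/secondary domains and skill level.",
    "test-generator":         "Generate direct, paraphrased, and multi-hop test queries.",
}

for agent_name, instructions in agents.items():
    Task(
        subagent_type=agent_name,
        prompt=f"{instructions}\n\nDocument:\n{document_text}",
        run_in_background=True,  # all four start before any result is awaited
    )
```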
---
## Phase 4: Quality Explanation
Each dimension gets a WHY explanation:
```markdown
### Accuracy: [N.NN]/1.0
**Why this score:**
- [Specific reason with evidence]
**What would improve it:**
- [Specific improvement]
```
---
## Phase 5: Bias Detection
See [Bias Detection Guide](references/bias-detection-guide.md) for patterns.
Check for:
- Technology bias (favors specific tools)
- Recency bias (ignores LTS versions)
- Complexity bias (assumed knowledge)
- Vendor bias (promotes products)
- Geographic/cultural bias
| Bias Score | Action |
|------------|--------|
| 0-2 | Proceed normally |
| 3-5 | Add disclaimer |
| 6-8 | Require user review |
| 9-10 | Recommend against |
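A small sketch mapping the score bands above to actions; the thresholds match the table, and the function name is illustrative:
```python
def bias_action(bias_score: int) -> str:
    """Map a 0-10 bias score to the action bands in the table above."""
    if bias_score <= 2:
        return "proceed"            # proceed normally
    if bias_score <= 5:
        return "add_disclaimer"     # add a disclaimer to the entry
    if bias_score <= 8:
        return "require_review"     # require user review before adding
    return "recommend_against"      # recommend against adding
```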
---
## Phase 6: Diversity Dashboard
Track dataset balance across:
- Domain distribution (AI/ML, Backend, Frontend, DevOps, Security)
- Difficulty distribution (trivial, easy, medium, hard, adversarial)
**Impact assessment:** Does the new document improve or worsen diversity?
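One way to make the impact assessment concrete, assuming a simple balance metric (deviation from an even domain split); this is illustrative rather than the skill's prescribed formula:
```python
from collections import Counter

def imbalance(domains: list[str]) -> float:
    """Mean absolute deviation from a perfectly even domain split (0 = balanced)."""
    counts = Counter(domains)
    total = sum(counts.values())
    ideal = 1 / len(counts)
    return sum(abs(c / total - ideal) for c in counts.values()) / len(counts)

def improves_diversity(existing_domains: list[str], new_domain: str) -> bool:
    """True if adding the new document moves the dataset closer to balance."""
    return imbalance(existing_domains + [new_domain]) <= imbalance(existing_domains)

# Example: a Backend-heavy dataset benefits from a Security document.
existing = ["Backend"] * 6 + ["AI/ML"] * 2 + ["Frontend"] * 2
print(improves_diversity(existing, "Security"))  # True
```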
---
## Phase 7: Validation
- URL validation (no placeholders)
- Schema validation (required fields)
- Duplicate check (>80% similarity)
- Quality gates (min sections, content length)
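A sketch of these gates as plain checks; the 80% similarity cut-off follows the list above, while the minimum section count, minimum length, placeholder tokens, and `difflib` similarity are stand-in assumptions:
```python
import difflib

def validate(doc: dict, existing_texts: list[str],
             min_sections: int = 3, min_chars: int = 500) -> list[str]:
    """Return a list of validation failures; an empty list means the document passes."""
    failures = []

    # URL validation: reject unfilled template placeholders.
    if any(token in doc.get("url", "") for token in ("{", "<", ">")):
        failures.append("placeholder URL")

    # Schema validation: required fields must be present and non-empty.
    for field in ("url", "title", "sections", "content"):
        if not doc.get(field):
            failures.append(f"missing required field: {field}")

    # Duplicate check: >80% similarity to any existing document.
    for text in existing_texts:
        if difflib.SequenceMatcher(None, doc.get("content", ""), text).ratio() > 0.80:
            failures.append("duplicate of an existing document")
            break

    # Quality gates: minimum structure and length.
    if len(doc.get("sections", [])) < min_sections:
        failures.append("too few sections")
    if len(doc.get("content", "")) < min_chars:
        failures.append("content too short")

    return failures
```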
---
## Phase 8: Silver-to-Gold Workflow
See [Silver-Gold Promotion](references/silver-gold-promotion.md) for criteria.
| Status | Criteria | Action |
|--------|----------|--------|
| **GOLD** | Score >= 0.75, no bias | Add to main dataset |
| **SILVER** | Score 0.55-0.74 | Add to silver, track |
| **REJECT** | Score < 0.55 | Do not add |
**Promotion criteria:** 7+ days in silver, quality >= 0.75, no negative feedback.
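A sketch of the classification step using the thresholds from the table; treating the 0-2 "proceed normally" bias band as "no bias" is an assumption:
```python
def classify(quality_score: float, bias_score: int) -> str:
    """Apply the gold/silver/reject thresholds from the table above."""
    if quality_score >= 0.75 and bias_score <= 2:
        return "GOLD"     # add to the main dataset
    if quality_score >= 0.55:
        return "SILVER"   # add to the silver tier and track for promotion
    return "REJECT"       # do not add

print(classify(0.82, 1))  # GOLD
print(classify(0.68, 4))  # SILVER
print(classify(0.50, 0))  # REJECT
```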
---
## Phase 9: Version Tracking
```json
{
"version": "1.2.3",
"change_type": "ADD|UPDATE|REMOVE|PROMOTE",
"document_id": "doc-123",
"quality_score": 0.82,
"rollback_available": true
}
```
| Update Type | Version Bump |
|-------------|--------------|
| Add/Update document | Patch (0.0.X) |
| Remove document | Minor (0.X.0) |
| Schema change | Major (X.0.0) |
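The bump rules from the table as a small helper; the `update_type` labels are illustrative:
```python
def bump_version(version: str, update_type: str) -> str:
    """Apply the version-bump table above to a MAJOR.MINOR.PATCH string."""
    major, minor, patch = map(int, version.split("."))
    if update_type == "schema_change":
        return f"{major + 1}.0.0"
    if update_type == "remove_document":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"   # add or update document

print(bump_version("1.2.3", "add_document"))     # 1.2.4
print(bump_version("1.2.3", "remove_document"))  # 1.3.0
print(bump_version("1.2.3", "schema_change"))    # 2.0.0
```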
---
## Quality Scoring
| Dimension | Weight |
|-----------|--------|
| Accuracy | 0.25 |
| Coherence | 0.20 |
| Depth | 0.25 |
| Relevance | 0.30 |
**Formula:** `quality_score = accuracy*0.25 + coherence*0.20 + depth*0.25 + relevance*0.30`
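The same formula as runnable code, with an example input:
```python
WEIGHTS = {"accuracy": 0.25, "coherence": 0.20, "depth": 0.25, "relevance": 0.30}

def quality_score(scores: dict[str, float]) -> float:
    """Weighted sum of the four dimensions; each score is expected in [0.0, 1.0]."""
    return sum(scores[dim] * weight for dim, weight in WEIGHTS.items())

example = {"accuracy": 0.90, "coherence": 0.80, "depth": 0.75, "relevance": 0.85}
print(round(quality_score(example), 2))  # 0.83
```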
---
## Key Decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Score explanation | Required | Transparency, actionable feedback |
| Bias detection | Dedicated agent | Prevent dataset contamination |
| Two-tier system | Silver + Gold | Allow docs time to mature |
| Version tracking | Semantic versioning | Clear history, safe rollbacks |
---
## Related Skills
- `golden-dataset-validation` - Validate existing datasets
- `llm-evaluation` - LLM output evaluation patterns
- `test-data-management` - Test data strategies
---
**Version:** 2.0.0 (January 2026)
This skill curates and adds documents to a golden dataset using a multi-agent validation workflow. It runs parallel quality evaluators, bias and diversity checks, schema/duplicate validation, and version tracking to promote content from silver to gold. Use it to create reliable, explainable golden examples for testing and model evaluation.
You provide a URL or document reference and the skill fetches and extracts structured content (title, sections, code blocks, metadata). Four parallel agents produce quality scores, tags, and test queries; separate agents generate score explanations, run bias detection, and evaluate dataset diversity. Final validation gates decide silver/gold classification and record a semantic version entry when content is added or updated.
**What determines gold vs. silver classification?**
A document's aggregated quality score, bias status, and validation gates. Default thresholds: gold >= 0.75, silver 0.55-0.74, reject < 0.55.

**How are quality scores computed?**
Four dimensions (accuracy, coherence, depth, relevance) are weighted (0.25, 0.20, 0.25, 0.30) and combined into a single quality_score with per-dimension explanations.