
This skill helps you build high-quality datasets by deduplicating, cleaning, and formatting data, and generating synthetic data for robust AI training.

npx playbooks add skill doanchienthangdev/omgkit --skill dataset-engineering

Files (1)
SKILL.md
---
name: dataset-engineering
description: Building and processing datasets - data quality, curation, deduplication, synthesis, annotation, formatting. Use when creating training data, improving data quality, or generating synthetic data.
---

# Dataset Engineering Skill

Building high-quality datasets for AI applications.

## Data Quality Dimensions

| Dimension | Description | Check |
|-----------|-------------|-------|
| Accuracy | Data is correct | Validation |
| Completeness | No missing values | Schema check |
| Consistency | No contradictions | Dedup |
| Timeliness | Up-to-date | Timestamps |
| Relevance | Matches use case | Filtering |
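
A minimal sketch of one of these checks, completeness, computes the fraction of records with every required field present and non-empty (the field names here are illustrative):

```python
def completeness_report(records, required_fields):
    """Fraction of records with every required field present and non-empty."""
    complete = sum(
        all(r.get(f) not in (None, "") for f in required_fields)
        for r in records
    )
    return complete / len(records) if records else 0.0

records = [
    {"text": "hello", "label": "greeting"},
    {"text": "bye", "label": ""},  # incomplete: empty label
]
```

The same pattern extends to other dimensions, e.g. a timestamp-recency check for timeliness.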

## Data Curation Pipeline

```python
class DataCurationPipeline:
    def __init__(self, deduplicator, cleaner, filter_, formatter):
        self.deduplicator = deduplicator
        self.cleaner = cleaner
        self.filter = filter_
        self.formatter = formatter

    def run(self, raw_data):
        # 1. Inspect samples and quality metrics before transforming
        self.inspect(raw_data)

        # 2. Deduplicate before expensive operations
        data = self.deduplicator.dedupe(raw_data)

        # 3. Clean and filter
        data = self.cleaner.clean(data)
        data = self.filter.filter(data)

        # 4. Format for training
        return self.formatter.format(data)
```

## Deduplication

```python
from datasketch import MinHash, MinHashLSH

class Deduplicator:
    def __init__(self, threshold=0.8, num_perm=128):
        self.num_perm = num_perm
        self.lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)

    def minhash(self, text):
        # Hash each word; MinHash approximates Jaccard similarity of word sets
        m = MinHash(num_perm=self.num_perm)
        for word in text.split():
            m.update(word.encode('utf8'))
        return m

    def dedupe(self, docs):
        unique = []
        for i, doc in enumerate(docs):
            mh = self.minhash(doc["text"])
            # query returns keys of already-indexed near-duplicates
            if not self.lsh.query(mh):
                self.lsh.insert(f"doc_{i}", mh)
                unique.append(doc)
        return unique
```

## Data Synthesis

### AI-Powered QA Generation
```python
import json

def generate_qa(document, model, n=5):
    prompt = f"""Generate {n} QA pairs from:

{document}

Format: [{{"question": "...", "answer": "..."}}]"""

    return json.loads(model.generate(prompt))
```

### Self-Instruct
```python
import random

def self_instruct(seeds, model, n=100):
    generated = []

    for _ in range(n):
        # Mix seed tasks with recent generations as in-context examples
        samples = random.sample(seeds + generated[-20:], 5)
        prompt = f"Examples:\n{format_examples(samples)}\n\nNew task:"

        new = model.generate(prompt)
        # format_examples, is_valid, is_diverse: user-supplied helpers
        if is_valid(new) and is_diverse(new, generated):
            generated.append(new)

    return generated
```

### Data Augmentation
```python
import random

def augment_text(text, model):
    # synonym_replace and back_translate are user-supplied helpers
    methods = [
        synonym_replace,   # lexical substitution
        back_translate,    # translate out and back
        model.rephrase,    # LLM paraphrase
    ]
    return random.choice(methods)(text)
```

## Data Formatting

### Instruction Format
```python
def format_instruction(example):
    return f"""### Instruction:
{example['instruction']}

### Input:
{example.get('input', '')}

### Response:
{example['output']}"""
```

### Chat Format
```python
def format_chat(conversation):
    return [
        {"role": turn["role"], "content": turn["content"]}
        for turn in conversation
    ]
```

## Best Practices

1. Inspect data before processing
2. Deduplicate before expensive operations
3. Use multiple synthesis methods
4. Validate synthetic data quality
5. Track data lineage
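
For practice 4, validation can be as simple as parsing the model output and discarding malformed or trivial items before they enter the training set. A minimal sketch for QA-style output (the `min_len` cutoff is an illustrative heuristic):

```python
import json

def validate_qa_pairs(raw_json, min_len=5):
    """Keep only well-formed QA pairs with non-trivial question and answer."""
    try:
        pairs = json.loads(raw_json)
    except json.JSONDecodeError:
        return []
    return [
        p for p in pairs
        if isinstance(p, dict)
        and len(str(p.get("question", ""))) >= min_len
        and len(str(p.get("answer", ""))) >= min_len
    ]
```

Automated filters like this catch structural failures; sample the survivors for human review to catch factual ones.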

Overview

This skill helps build and process high-quality datasets for AI applications, covering quality checks, deduplication, synthesis, annotation, and formatting. It provides practical pipelines and utilities to inspect data, remove duplicates, generate synthetic examples, and convert data into instruction or chat formats. Use it to produce reliable training data and to streamline dataset engineering workflows.

How this skill works

The skill inspects raw inputs for quality dimensions like accuracy, completeness, consistency, timeliness, and relevance. It runs a pipeline that deduplicates content, cleans and filters records, applies augmentation or synthesis methods, and formats outputs for model training. Deduplication uses locality-sensitive hashing (MinHash/LSH) patterns to detect near-duplicates; synthesis includes QA generation, self-instruct loops, and augmentation strategies. Final outputs are validated and formatted as instruction-response pairs or chat transcripts.

When to use it

  • Creating labeled training sets for supervised learning
  • Preparing data for fine-tuning or instruction tuning large models
  • Cleaning noisy, accumulated datasets before expensive model runs
  • Generating synthetic data to cover rare cases or expand coverage
  • Standardizing data formats for evaluation or shared datasets

Best practices

  • Inspect samples and compute data quality metrics before large-scale transforms
  • Deduplicate early to avoid wasted compute on redundant records
  • Combine multiple synthesis techniques (QA, self-instruct, augmentation) to increase diversity
  • Validate synthetic outputs automatically and with human review where possible
  • Record provenance and lineage for every transformed or generated item

Example use cases

  • Build a QA dataset by generating question-answer pairs from long documents and filtering for factuality
  • Run deduplication over web-scraped text to remove near-duplicates before model training
  • Use back-translation and synonym-replacement to augment scarce intent examples for a chatbot
  • Convert annotation exports into instruction format for instruction-tuning workflows
  • Apply self-instruct generation to expand a seed set into diverse training samples while enforcing validation rules

FAQ

How does deduplication handle near-duplicates?

It uses MinHash plus LSH to detect high similarity; you tune the threshold to balance recall and precision.

When should I validate synthetic data with humans?

Use human review for high-impact or safety-sensitive examples and periodically sample outputs during automated validation.

What formats are supported for final outputs?

The skill produces instruction-response text blocks and structured chat role-content lists suitable for common fine-tuning pipelines.