
knowledge-base-builder skill

/.claude/skills/knowledge-base-builder

This skill helps you build production-ready elizaOS knowledge bases with RAG, embeddings, and semantic search.

npx playbooks add skill dexploarer/hyper-forge --skill knowledge-base-builder

---
name: knowledge-base-builder
description: Create and optimize elizaOS knowledge bases with RAG, embeddings, and semantic search. Triggers on "create knowledge base", "build RAG system", or "setup agent knowledge"
allowed-tools: [Write, Read, Edit, Grep, Glob, Bash]
---

# Knowledge Base Builder Skill

Build production-ready knowledge bases for elizaOS agents with document ingestion, embeddings, and semantic retrieval.

## When to Use

- "Create a knowledge base for [domain]"
- "Build RAG system with [documents]"
- "Setup agent knowledge from [sources]"
- "Implement semantic search for agent"

## Capabilities

1. šŸ“š Document ingestion (markdown, PDF, text)
2. āœ‚ļø Smart chunking strategies
3. šŸ” Embedding generation
4. šŸ—„ļø Vector storage configuration
5. šŸŽÆ Semantic search optimization
6. šŸ”„ Knowledge updates and versioning
7. šŸ“Š Knowledge quality metrics

## Workflow

### Phase 1: Knowledge Requirements

**Questions to ask:**
1. What domain expertise is needed?
2. What document sources exist?
3. How often does knowledge change?
4. What query patterns are expected?

### Phase 2: Knowledge Structure

```
knowledge/
ā”œā”€ā”€ {domain}/
│   ā”œā”€ā”€ README.md           # Overview
│   ā”œā”€ā”€ core-concepts.md    # Fundamental knowledge
│   ā”œā”€ā”€ procedures.md       # Step-by-step guides
│   ā”œā”€ā”€ faq.md             # Common questions
│   ā”œā”€ā”€ examples.md        # Use case examples
│   └── glossary.md        # Terminology
└── embeddings/
    └── {domain}.json       # Pre-computed embeddings
```

### Phase 3: Document Format

````markdown
# {Topic Title}

## Summary
{Brief overview for quick reference}

## Key Concepts
- {Concept 1}: {Definition}
- {Concept 2}: {Definition}

## Detailed Explanation
{Comprehensive information}

## Examples
```{language}
{Code or usage examples}
```

## Related Topics
- [{Topic}](./related-topic.md)

## Last Updated
{Date}
````

### Phase 4: Character Integration

```typescript
export const character: Character = {
  // ... other config

  knowledge: [
    // Simple facts
    "Core fact about {domain}",
    "Important principle in {domain}",

    // File references
    {
      path: "./knowledge/{domain}/core-concepts.md",
      shared: true  // Available to all agents
    },

    // Directory loading
    {
      directory: "./knowledge/{domain}",
      shared: false  // Agent-specific
    }
  ],

  // Configure knowledge plugin
  plugins: [
    '@elizaos/plugin-knowledge',
    // ... other plugins
  ],

  settings: {
    // Embedding configuration
    embeddingModel: 'text-embedding-3-small',
    embeddingDimensions: 1536,

    // Retrieval settings
    knowledgeTopK: 5,            // Top results to return
    knowledgeMinScore: 0.7,       // Minimum similarity
    knowledgeDecay: 0.95,         // Time decay factor

    // Chunking strategy
    chunkSize: 1000,              // Characters per chunk
    chunkOverlap: 200,            // Overlap between chunks
  }
};
```

### Phase 5: Chunking Strategies

**Strategy 1: Fixed Size** (simple, balanced)
```typescript
function chunkFixedSize(text: string, size: number, overlap: number): string[] {
  // Guard: if overlap >= size, start never advances and the loop runs forever
  if (overlap >= size) {
    throw new Error('overlap must be smaller than chunk size');
  }

  const chunks: string[] = [];
  let start = 0;

  while (start < text.length) {
    const end = Math.min(start + size, text.length);
    chunks.push(text.slice(start, end));
    start += size - overlap;
  }

  return chunks;
}
```

**Strategy 2: Semantic** (intelligent, context-aware)
```typescript
function chunkSemantic(text: string): string[] {
  // Split on Markdown headers (h1-h6); the header markers themselves are consumed
  const sections = text.split(/\n#{1,6}\s/);

  // Further split large sections; chunkByParagraph is a helper that
  // packs paragraphs into chunks below a size budget
  return sections.flatMap(section => {
    if (section.length > 1000) {
      return chunkByParagraph(section);
    }
    return [section];
  });
}
```
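`chunkSemantic` delegates oversized sections to a `chunkByParagraph` helper the skill does not show. One possible sketch, assuming a greedy packing strategy and a 1,000-character default budget (both are illustrative choices, not part of the skill):

```typescript
// Split a section on blank lines, then greedily pack paragraphs
// into chunks of at most `maxLen` characters.
function chunkByParagraph(section: string, maxLen: number = 1000): string[] {
  const paragraphs = section
    .split(/\n\s*\n/)
    .map(p => p.trim())
    .filter(Boolean);

  const chunks: string[] = [];
  let current = '';

  for (const para of paragraphs) {
    // +2 accounts for the "\n\n" separator re-inserted between paragraphs
    if (current && current.length + para.length + 2 > maxLen) {
      chunks.push(current);
      current = para;
    } else {
      current = current ? `${current}\n\n${para}` : para;
    }
  }
  if (current) chunks.push(current);

  return chunks;
}
```

Packing whole paragraphs keeps sentence boundaries intact, at the cost of slightly uneven chunk sizes.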

**Strategy 3: Sliding Window** (comprehensive, overlapping)
```typescript
function chunkSlidingWindow(text: string, windowSize: number, step: number): string[] {
  const chunks: string[] = [];

  for (let i = 0; i < text.length; i += step) {
    const chunk = text.slice(i, i + windowSize);
    if (chunk.trim().length > 0) {
      chunks.push(chunk);
    }
  }

  return chunks;
}
```

### Phase 6: Embedding Optimization

```typescript
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Batch embedding generation to avoid one API call per chunk
async function generateEmbeddings(
  chunks: string[],
  model: string = 'text-embedding-3-small'
): Promise<number[][]> {
  const batchSize = 100;
  const embeddings: number[][] = [];

  for (let i = 0; i < chunks.length; i += batchSize) {
    const batch = chunks.slice(i, i + batchSize);

    const response = await openai.embeddings.create({
      model,
      input: batch,
    });

    embeddings.push(...response.data.map(d => d.embedding));
  }

  return embeddings;
}
```
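Pre-computed embeddings can be persisted to the `knowledge/embeddings/{domain}.json` files from the Phase 2 layout, so agents skip re-embedding on startup. A minimal sketch, assuming a simple `{ text, embedding }` record shape (an illustration, not a format elizaOS mandates):

```typescript
import { writeFileSync } from 'node:fs';

interface EmbeddedChunk {
  text: string;
  embedding: number[];
}

// Pair each chunk with its embedding; the arrays must line up one-to-one.
function toEmbeddedChunks(chunks: string[], embeddings: number[][]): EmbeddedChunk[] {
  if (chunks.length !== embeddings.length) {
    throw new Error('chunk/embedding count mismatch');
  }
  return chunks.map((text, i) => ({ text, embedding: embeddings[i] }));
}

// Persist records so retrieval can start without re-calling the embedding API.
function saveEmbeddings(domain: string, records: EmbeddedChunk[]): void {
  writeFileSync(
    `knowledge/embeddings/${domain}.json`,
    JSON.stringify(records, null, 2)
  );
}
```

Regenerate the file whenever the source documents change (see the versioning best practice below).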

### Phase 7: Search Implementation

```typescript
// Semantic search with hybrid ranking
async function searchKnowledge(
  query: string,
  runtime: IAgentRuntime,
  topK: number = 5
): Promise<Memory[]> {
  // Generate the query embedding (single-input counterpart of generateEmbeddings)
  const queryEmbedding = await generateEmbedding(query);

  // Semantic search
  const semanticResults = await runtime.searchMemories({
    embedding: queryEmbedding,
    limit: topK * 2,
    minScore: 0.7
  });

  // Keyword search
  const keywordResults = await runtime.searchMemories({
    query,
    limit: topK * 2
  });

  // Merge and rank results
  return mergeAndRank(semanticResults, keywordResults, topK);
}
```
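The `mergeAndRank` helper above is left undefined. A common way to fuse semantic and keyword result lists is reciprocal-rank fusion (RRF); the sketch below assumes a minimal result shape with an `id` field, a simplification of elizaOS's `Memory` type:

```typescript
// Minimal shape for illustration; elizaOS's Memory carries more fields.
interface ScoredMemory {
  id: string;
}

// Reciprocal-rank fusion: each list contributes 1/(k + rank) per item,
// so items that rank well in BOTH lists accumulate the highest scores.
function mergeAndRank(
  semantic: ScoredMemory[],
  keyword: ScoredMemory[],
  topK: number,
  k: number = 60 // damping constant; 60 is a conventional default
): ScoredMemory[] {
  const scores = new Map<string, { item: ScoredMemory; score: number }>();

  const accumulate = (list: ScoredMemory[]) => {
    list.forEach((item, rank) => {
      const entry = scores.get(item.id) ?? { item, score: 0 };
      entry.score += 1 / (k + rank + 1);
      scores.set(item.id, entry);
    });
  };

  accumulate(semantic);
  accumulate(keyword);

  return [...scores.values()]
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(e => e.item);
}
```

RRF needs no score normalization, which makes it robust when the two lists use incomparable scoring scales (cosine similarity vs. keyword match counts).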

### Phase 8: Quality Metrics

```typescript
interface KnowledgeMetrics {
  totalDocuments: number;
  totalChunks: number;
  avgChunkSize: number;
  embeddingCoverage: number;
  queryPerformance: {
    avgLatency: number;
    avgRelevance: number;
  };
}

async function assessKnowledgeQuality(
  runtime: IAgentRuntime
): Promise<KnowledgeMetrics> {
  // In a real implementation, gather these from the document store and
  // query logs; the values below are illustrative.
  return {
    totalDocuments: 50,
    totalChunks: 500,
    avgChunkSize: 800,
    embeddingCoverage: 0.98,
    queryPerformance: {
      avgLatency: 150, // ms
      avgRelevance: 0.85
    }
  };
}
```
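The `avgRelevance` figure in the metrics above can be computed as the mean cosine similarity between a query embedding and the embeddings of the chunks it retrieved. A small sketch (the helper names are illustrative):

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Average relevance of a retrieved result set against a query embedding.
function avgRelevance(query: number[], results: number[][]): number {
  if (results.length === 0) return 0;
  const total = results.reduce((sum, r) => sum + cosineSimilarity(query, r), 0);
  return total / results.length;
}
```

Tracking this per query over time (Best Practice 10) surfaces retrieval drift as the knowledge base grows.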

## Best Practices

1. **Document Structure**: Use clear headers and sections
2. **Chunk Size**: Balance between context and precision (500-1500 chars)
3. **Overlap**: Include 10-20% overlap for context preservation
4. **Updates**: Version knowledge files with dates
5. **Quality**: Regular review and refinement
6. **Performance**: Pre-compute embeddings when possible
7. **Privacy**: Never include sensitive data in knowledge base
8. **Organization**: Group related documents in directories
9. **Testing**: Validate retrieval quality with test queries
10. **Monitoring**: Track usage patterns and relevance scores

## Overview

This skill builds and optimizes elizaOS knowledge bases using document ingestion, embeddings, and semantic retrieval for RAG-enabled agents. It guides knowledge structure, chunking, embedding generation, vector storage, and search tuning to make agent knowledge production-ready. The goal is reliable, fast, and maintainable semantic search for agent conversations and tools.

## How this skill works

It ingests documents (Markdown, PDF, text), applies chunking strategies (fixed-size, semantic, sliding window), and generates embeddings in batches. Chunks and embeddings are stored in a vector store and served via semantic search with hybrid ranking (embeddings + keywords). It also includes tools for ongoing updates, versioning, and quality metrics to monitor relevance and latency.

## When to use it

- Create a domain-specific knowledge base for an elizaOS agent
- Build a RAG system from a large set of documents or manuals
- Set up semantic search for agent responses and tool grounding
- Migrate or version knowledge while preserving retrieval performance

## Best practices

- Use clear headers and a consistent document structure for semantic chunking
- Choose chunk sizes between 500 and 1,500 characters and include 10–20% overlap
- Pre-compute embeddings and batch requests to reduce latency and cost
- Version knowledge files and record last-updated dates for traceability

## Example use cases

- Ingest API docs and tutorials so the agent can answer developer questions with source citations
- Build a game design knowledge base for asset generation rules, workflows, and examples
- Implement semantic QA over technical manuals to reduce time-to-answer for support agents
- Keep knowledge current by scheduling embedding regeneration when documents change

## FAQ

**What chunking strategy should I choose?**

Start with fixed-size chunks (balanced) and evaluate relevance. Use semantic chunking for structured docs and sliding windows when full context overlap improves retrieval.

**How do I balance performance and accuracy?**

Pre-compute embeddings, tune top-K and min-score thresholds, and use hybrid ranking (embedding + keyword) to improve precision at lower latency.