
# knowledge-graph-builder

This skill helps you design and validate knowledge graphs, enabling semantic relationships, accurate fact verification, and advanced search across connected data.

```bash
npx playbooks add skill zpankz/mcp-skillset --skill knowledge-graph-builder
```

Review the SKILL.md below or copy the command above to add this skill to your agents.

---
name: Knowledge Graph Builder
description: Design and build knowledge graphs. Use when modeling complex relationships, building semantic search, or creating knowledge bases. Covers schema design, entity relationships, and graph database selection.
version: 1.0.0
---

# Knowledge Graph Builder

Build structured knowledge graphs that improve AI system performance by making relational knowledge explicit and queryable.

## Core Principle

**Knowledge graphs make implicit relationships explicit**, enabling AI systems to reason about connections, verify facts, and avoid hallucinations.

## When to Use Knowledge Graphs

### Use Knowledge Graphs When:
- ✅ Complex entity relationships are central to your domain
- ✅ Need to verify AI-generated facts against structured knowledge
- ✅ Semantic search and relationship traversal required
- ✅ Data has rich interconnections (people, organizations, products)
- ✅ Need to answer "how are X and Y related?" queries
- ✅ Building recommendation systems based on relationships
- ✅ Fraud detection or pattern recognition across connected data

### Don't Use Knowledge Graphs When:
- ❌ Simple tabular data (use relational DB)
- ❌ Purely document-based search (use RAG with vector DB)
- ❌ No significant relationships between entities
- ❌ Team lacks graph modeling expertise
- ❌ Read-heavy workload with no traversal (use traditional DB)

---

## 6-Phase Knowledge Graph Implementation

### Phase 1: Ontology Design

**Goal**: Define entities, relationships, and properties for your domain

**Entity Types** (Nodes):
- Person, Organization, Location, Product, Concept, Event, Document

**Relationship Types** (Edges):
- Hierarchical: IS_A, PART_OF, REPORTS_TO
- Associative: WORKS_FOR, LOCATED_IN, AUTHORED_BY, RELATED_TO
- Temporal: CREATED_ON, OCCURRED_BEFORE, OCCURRED_AFTER

**Properties** (Attributes):
- Node properties: id, name, type, created_at, metadata
- Edge properties: type, confidence, source, timestamp
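
These property lists map directly onto simple record types. As a working sketch, here is one plausible shape for the `Entity` and `Relationship` objects that the Python examples later in this skill assume (the exact fields are illustrative, not prescriptive):

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class Entity:
    """A node in the knowledge graph."""
    id: str
    name: str
    type: str                                 # e.g. "Person", "Organization"
    created_at: datetime = field(default_factory=datetime.utcnow)
    metadata: dict = field(default_factory=dict)
    description: str = ""                     # embedded for hybrid search (Phase 4)

@dataclass
class Relationship:
    """An edge in the knowledge graph."""
    source_id: str
    target_id: str
    type: str                                 # e.g. "WORKS_FOR"; validate against the ontology
    confidence: float = 1.0                   # 0.0-1.0 (see Key Principles)
    source: Optional[str] = None              # provenance of the assertion
    timestamp: datetime = field(default_factory=datetime.utcnow)
```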

**Example Ontology**:
```turtle
# RDF/Turtle format
@prefix : <http://example.org/ontology#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:Person a owl:Class ;
    rdfs:label "Person" .

:Organization a owl:Class ;
    rdfs:label "Organization" .

:worksFor a owl:ObjectProperty ;
    rdfs:domain :Person ;
    rdfs:range :Organization ;
    rdfs:label "works for" .
```

**Validation**:
- [ ] Entities cover all domain concepts
- [ ] Relationships capture key connections
- [ ] Ontology reviewed with domain experts
- [ ] Classification hierarchy defined (is-a relationships)

---

### Phase 2: Graph Database Selection

**Decision Matrix**:

**Neo4j** (Recommended for most):
- Pros: Mature, Cypher query language, graph algorithms, excellent visualization
- Cons: Licensing costs for enterprise, scaling complexity
- Use when: Complex queries, graph algorithms, team can learn Cypher

**Amazon Neptune**:
- Pros: Managed service, supports Gremlin and SPARQL, AWS integration
- Cons: Vendor lock-in, more expensive than self-hosted
- Use when: AWS infrastructure, need managed service, compliance requirements

**ArangoDB**:
- Pros: Multi-model (graph + document + key-value), AQL query language, JavaScript extensibility
- Cons: Smaller community, fewer graph-specific features
- Use when: Need document DB + graph in one system

**TigerGraph**:
- Pros: Best performance for deep traversals, parallel processing
- Cons: Complex setup, higher learning curve
- Use when: Massive graphs (billions of edges), real-time analytics

**Technology Stack**:
```yaml
graph_database: "Neo4j Community" # or Enterprise for production
vector_integration: "Pinecone" # For hybrid search
embeddings: "text-embedding-3-large" # OpenAI
etl: "Apache Airflow" # For data pipelines
```

**Neo4j Schema Setup**:
```cypher
// Create constraints for uniqueness
CREATE CONSTRAINT person_id IF NOT EXISTS
FOR (p:Person) REQUIRE p.id IS UNIQUE;

CREATE CONSTRAINT org_name IF NOT EXISTS
FOR (o:Organization) REQUIRE o.name IS UNIQUE;

// Create indexes for performance
CREATE INDEX entity_search IF NOT EXISTS
FOR (e:Entity) ON (e.name, e.type);

CREATE INDEX relationship_type IF NOT EXISTS
FOR ()-[r:RELATED_TO]-() ON (r.type, r.confidence);
```

---

### Phase 3: Entity Extraction & Relationship Building

**Goal**: Extract entities and relationships from data sources

**Data Sources**:
- Structured: Databases, APIs, CSV files
- Unstructured: Documents, web content, text files
- Semi-structured: JSON, XML, knowledge bases

**Entity Extraction Pipeline**:
```python
from typing import List

class EntityExtractionPipeline:
    def __init__(self):
        # Placeholder components: swap in your NER stack (e.g. spaCy,
        # Hugging Face) and your own linking/deduplication logic.
        self.ner_model = load_ner_model()
        self.entity_linker = EntityLinker()
        self.deduplicator = EntityDeduplicator()

    def process_text(self, text: str) -> List[Entity]:
        # 1. Extract named entities
        entities = self.ner_model.extract(text)

        # 2. Link to existing entities (entity resolution)
        linked_entities = self.entity_linker.link(entities)

        # 3. Deduplicate and resolve conflicts
        resolved_entities = self.deduplicator.resolve(linked_entities)

        return resolved_entities
```

**Relationship Extraction**:
```python
import spacy
from typing import List

class RelationshipExtractor:
    def __init__(self, model: str = "en_core_web_sm"):
        self.nlp = spacy.load(model)  # dependency parser for sentence splitting

    def extract_relationships(self, entities: List[Entity],
                              text: str) -> List[Relationship]:
        relationships = []

        # Use dependency parsing or an LLM for extraction
        doc = self.nlp(text)
        for sent in doc.sents:
            rels = self.extract_from_sentence(sent, entities)
            relationships.extend(rels)

        # Validate against the ontology (drop edges with undefined types)
        valid_relationships = self.validate_relationships(relationships)
        return valid_relationships
```

**LLM-Based Extraction** (for complex relationships):
```python
def extract_with_llm(text: str) -> List[Relationship]:
    prompt = f"""
    Extract entities and relationships from this text:
    {text}

    Format: (Entity1, Relationship, Entity2, Confidence)
    Only extract factual relationships.
    """

    response = llm.generate(prompt)
    relationships = parse_llm_response(response)
    return relationships
```
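
`parse_llm_response` is left undefined above. A minimal sketch, assuming the model follows the requested `(Entity1, Relationship, Entity2, Confidence)` line format (real outputs will need more defensive parsing):

```python
import re
from typing import List

# One tuple per line, e.g. "(Marie Curie, WORKS_FOR, University of Paris, 0.9)"
TUPLE_PATTERN = re.compile(r"\(([^,]+),\s*([^,]+),\s*([^,]+),\s*([\d.]+)\)")

def parse_llm_response(response: str) -> List[Relationship]:
    relationships = []
    for match in TUPLE_PATTERN.finditer(response):
        subject, predicate, obj, confidence = (g.strip() for g in match.groups())
        relationships.append(Relationship(
            source_id=subject,
            target_id=obj,
            type=predicate,
            confidence=float(confidence),
            source="llm_extraction",  # provenance for downstream filtering
        ))
    return relationships
```

In practice, also validate `type` against the ontology and drop tuples below a confidence threshold before writing to the graph.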

**Validation**:
- [ ] Entity extraction accuracy >85%
- [ ] Entity deduplication working
- [ ] Relationships validated against ontology
- [ ] Confidence scores assigned

---

### Phase 4: Hybrid Knowledge-Vector Architecture

**Goal**: Combine structured graph with semantic vector search

**Architecture**:
```python
class HybridKnowledgeSystem:
    def __init__(self):
        self.graph_db = Neo4jConnection()
        self.vector_db = PineconeClient()
        self.embedding_model = OpenAIEmbeddings()

    def store_entity(self, entity: Entity):
        # Store structured data in graph
        self.graph_db.create_node(entity)

        # Store embeddings in vector database
        embedding = self.embedding_model.embed(entity.description)
        self.vector_db.upsert(
            id=entity.id,
            values=embedding,
            metadata=entity.metadata
        )

    def hybrid_search(self, query: str, top_k: int = 10) -> SearchResults:
        # 1. Vector similarity search
        query_embedding = self.embedding_model.embed(query)
        vector_results = self.vector_db.query(
            vector=query_embedding,
            top_k=100
        )

        # 2. Graph traversal from vector results
        entity_ids = [r.id for r in vector_results.matches]
        graph_results = self.graph_db.get_subgraph(entity_ids, max_hops=2)

        # 3. Merge and rank results
        merged = self.merge_results(vector_results, graph_results)
        return merged[:top_k]
```
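
The `merge_results` step above is deliberately abstract. One simple heuristic, assuming vector matches carry a similarity `score` and that graph neighbors earn a fixed relevance bonus:

```python
def merge_results(self, vector_results, graph_results, graph_bonus: float = 0.1):
    # Start from vector similarity scores
    scores = {m.id: m.score for m in vector_results.matches}

    # Entities reachable by traversal are likely relevant even when their
    # embeddings rank lower, so boost (or add) them with a flat bonus
    for node in graph_results.nodes:
        scores[node.id] = scores.get(node.id, 0.0) + graph_bonus

    # Return entity ids ranked by combined score
    return sorted(scores, key=scores.get, reverse=True)
```

Tuning `graph_bonus` (or weighting by hop distance) is domain-specific; treat this as a starting point.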

**Benefits of Hybrid Approach**:
- Vector search: Semantic similarity, flexible queries
- Graph traversal: Relationship-based reasoning, context expansion
- Combined: Best of both worlds

---

### Phase 5: Query Patterns & API Design

**Common Query Patterns**:

**1. Find Entity**:
```cypher
MATCH (e:Entity {id: $entity_id})
RETURN e
```

**2. Find Relationships**:
```cypher
MATCH (source:Entity {id: $entity_id})-[r]-(target)
RETURN source, r, target
LIMIT 20
```

**3. Path Between Entities**:
```cypher
MATCH path = shortestPath(
  (source:Person {id: $source_id})-[*..5]-(target:Person {id: $target_id})
)
RETURN path
```

**4. Multi-Hop Traversal**:
```cypher
MATCH (p:Person {name: $name})-[:WORKS_FOR]->(o:Organization)-[:LOCATED_IN]->(l:Location)
RETURN p.name, o.name, l.city
```

**5. Recommendation Query**:
```cypher
// Find people similar to this person based on shared organizations
MATCH (p1:Person {id: $person_id})-[:WORKS_FOR]->(o:Organization)<-[:WORKS_FOR]-(p2:Person)
WHERE p1 <> p2
RETURN p2, COUNT(o) AS shared_orgs
ORDER BY shared_orgs DESC
LIMIT 10
```

**Knowledge Graph API**:
```python
class KnowledgeGraphAPI:
    def __init__(self, graph_db):
        self.graph = graph_db

    def find_entity(self, entity_name: str) -> Entity:
        """Find entity by name with fuzzy matching (requires the APOC plugin)"""
        query = """
        MATCH (e:Entity)
        WHERE e.name CONTAINS $name
        RETURN e
        ORDER BY apoc.text.levenshteinDistance(e.name, $name)
        LIMIT 1
        """
        return self.graph.run(query, name=entity_name).single()

    def find_relationships(self, entity_id: str,
                           relationship_type: str = None,
                           max_hops: int = 2) -> List[Relationship]:
        """Find relationships within specified hops"""
        # Hop bounds cannot be Cypher query parameters, so max_hops is
        # interpolated; coerce to int to rule out query injection.
        # (relationship_type filtering is omitted for brevity.)
        max_hops = int(max_hops)
        query = f"""
        MATCH (source:Entity {{id: $entity_id}})
        MATCH path = (source)-[r*1..{max_hops}]-(target)
        RETURN path, relationships(path) AS rels
        LIMIT 100
        """
        return self.graph.run(query, entity_id=entity_id).data()

    def get_subgraph(self, entity_ids: List[str],
                     max_hops: int = 2) -> Subgraph:
        """Get connected subgraph for multiple entities (requires APOC)"""
        max_hops = int(max_hops)  # same injection guard as above
        query = f"""
        MATCH (e:Entity)
        WHERE e.id IN $entity_ids
        CALL apoc.path.subgraphAll(e, {{maxLevel: {max_hops}}})
        YIELD nodes, relationships
        RETURN nodes, relationships
        """
        return self.graph.run(query, entity_ids=entity_ids).data()
```
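
A hypothetical wiring of this API, assuming the official `neo4j` Python driver (whose `run(...).single()` / `.data()` interface the class above matches when given a session):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    kg = KnowledgeGraphAPI(session)

    record = kg.find_entity("Apple")              # fuzzy name lookup
    entity_id = record["e"]["id"]

    paths = kg.find_relationships(entity_id, max_hops=2)
    subgraph = kg.get_subgraph([entity_id], max_hops=2)
```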

---

### Phase 6: AI Integration & Hallucination Prevention

**Goal**: Use knowledge graph to ground LLM responses and detect hallucinations

**Knowledge Graph RAG**:
```python
class KnowledgeGraphRAG:
    def __init__(self, kg_api, llm_client):
        self.kg = kg_api
        self.llm = llm_client

    def retrieve_context(self, query: str) -> str:
        # Extract entities from query
        entities = self.extract_entities_from_query(query)

        # Retrieve relevant subgraph
        subgraph = self.kg.get_subgraph(
            [e.id for e in entities],
            max_hops=2
        )

        # Format subgraph for LLM
        context = self.format_subgraph_for_llm(subgraph)
        return context

    def generate_with_grounding(self, query: str) -> GroundedResponse:
        context = self.retrieve_context(query)

        prompt = f"""
        Context from knowledge graph:
        {context}

        User query: {query}

        Answer based only on the provided context. Include source entities.
        """

        response = self.llm.generate(prompt)

        return GroundedResponse(
            response=response,
            sources=self.extract_sources(context),
            confidence=self.calculate_confidence(response, context)
        )
```
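
`format_subgraph_for_llm` is left undefined above. A minimal sketch that serializes each edge as a readable triple with provenance, assuming the subgraph exposes `nodes` and `relationships` collections:

```python
def format_subgraph_for_llm(self, subgraph) -> str:
    # One triple per line keeps the context compact and lets the LLM
    # cite sources that can be audited afterwards.
    names = {node.id: node.name for node in subgraph.nodes}
    lines = []
    for rel in subgraph.relationships:
        lines.append(
            f"({names.get(rel.source_id, rel.source_id)}) "
            f"-[{rel.type}, confidence={rel.confidence:.2f}, source={rel.source}]-> "
            f"({names.get(rel.target_id, rel.target_id)})"
        )
    return "\n".join(lines)
```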

**Hallucination Detection**:
```python
class HallucinationDetector:
    def __init__(self, knowledge_graph):
        self.kg = knowledge_graph

    def verify_claim(self, claim: str) -> VerificationResult:
        # Parse claim into (subject, predicate, object)
        parsed_claim = self.parse_claim(claim)

        # Query knowledge graph for evidence
        evidence = self.kg.find_evidence(
            parsed_claim.subject,
            parsed_claim.predicate,
            parsed_claim.object
        )

        if evidence:
            return VerificationResult(
                is_supported=True,
                evidence=evidence,
                confidence=evidence.confidence
            )

        # Check for contradictory evidence
        contradiction = self.kg.find_contradiction(parsed_claim)

        return VerificationResult(
            is_supported=False,
            is_contradicted=bool(contradiction),
            contradiction=contradiction
        )
```
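
`find_evidence` maps naturally onto a direct-edge lookup. A sketch under two assumptions: a session-style `run()` client as above, and a `valid_relationship_types` set derived from the ontology (edge types cannot be Cypher parameters, hence the check before interpolation):

```python
from collections import namedtuple

Evidence = namedtuple("Evidence", ["confidence", "source"])

def find_evidence(self, subject: str, predicate: str, obj: str):
    # Reject predicates outside the ontology before f-string interpolation
    if predicate not in self.valid_relationship_types:
        return None

    query = f"""
    MATCH (s:Entity {{name: $subject}})-[r:{predicate}]->(o:Entity {{name: $object}})
    RETURN r.confidence AS confidence, r.source AS source
    LIMIT 1
    """
    rows = self.graph.run(query, subject=subject, object=obj).data()
    return Evidence(rows[0]["confidence"], rows[0]["source"]) if rows else None
```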

---

## Key Principles

### 1. Start with Ontology
Define your schema before ingesting data. Changing the ontology later is expensive.

### 2. Entity Resolution is Critical
Deduplicate entities aggressively. "Apple Inc", "Apple", "Apple Computer" → same entity.
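
A minimal sketch of one common first line of defense: a curated alias table in front of fuzzier matching (the aliases here are illustrative):

```python
ALIASES = {
    "apple": "Apple Inc",
    "apple computer": "Apple Inc",
    "apple inc": "Apple Inc",
}

def canonicalize(name: str) -> str:
    # Normalize, then map known aliases to one canonical entity name;
    # unknown names fall through to fuzzy matching or embedding similarity.
    key = name.strip().lower().rstrip(".")
    return ALIASES.get(key, name.strip())
```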

### 3. Confidence Scores on Everything
Every relationship should have a confidence score (0.0-1.0) and a source, as in the sketch below.
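
A sketch of what that looks like at write time, assuming a session-style `run()` client and the `Relationship` shape from Phase 1:

```python
def upsert_relationship(graph, rel: Relationship):
    # MERGE keeps re-ingestion idempotent; confidence and source make
    # every assertion auditable and filterable downstream.
    graph.run(
        """
        MATCH (s:Entity {id: $source_id}), (t:Entity {id: $target_id})
        MERGE (s)-[r:RELATED_TO {type: $type}]->(t)
        SET r.confidence = $confidence, r.source = $source
        """,
        source_id=rel.source_id, target_id=rel.target_id,
        type=rel.type, confidence=rel.confidence, source=rel.source,
    )
```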

### 4. Incremental Building
Don't try to model the entire domain at once. Start with core entities and expand.

### 5. Hybrid Architecture Wins
Combine graph traversal (structured) with vector search (semantic) for best results.

---

## Common Use Cases

**1. Question Answering**:
- Extract entities from question
- Traverse graph to find answer
- Return path as explanation

**2. Recommendation**:
- Find similar entities via shared relationships
- Rank by relationship strength
- Return top-K recommendations

**3. Fraud Detection**:
- Model transactions as a graph
- Find suspicious patterns (cycles, anomalies); see the cycle-query sketch after this list
- Flag for review

**4. Knowledge Discovery**:
- Identify implicit relationships
- Suggest missing connections
- Validate with domain experts

**5. Semantic Search**:
- Hybrid vector + graph search
- Expand context via relationships
- Return rich connected results
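
For the fraud-detection case, the core query is short. A sketch, assuming a hypothetical schema of `Account` nodes connected by `SENT` transaction edges with an `amount` property:

```python
# Hypothetical schema: (a:Account)-[:SENT {amount: ...}]->(b:Account)
CYCLE_QUERY = """
MATCH path = (a:Account)-[:SENT*3..5]->(a)
WHERE ALL(r IN relationships(path) WHERE r.amount > $min_amount)
RETURN path
LIMIT 25
"""

def find_suspicious_cycles(graph, min_amount: float = 10_000):
    # Money returning to its origin through 3-5 hops is a classic
    # layering signal; surface these paths for human review.
    return graph.run(CYCLE_QUERY, min_amount=min_amount).data()
```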

---

## Technology Recommendations

**For MVPs (<10K entities)**:
- Neo4j Community Edition (free)
- SQLite for metadata
- OpenAI embeddings
- FastAPI for API layer

**For Production (10K-1M entities)**:
- Neo4j Enterprise or ArangoDB
- Pinecone for vector search
- Airflow for ETL
- GraphQL API

**For Enterprise (1M+ entities)**:
- Neo4j Enterprise or TigerGraph
- Distributed vector DB (Pinecone, Weaviate)
- Kafka for streaming
- Kubernetes deployment

---

## Validation Checklist

- [ ] Ontology designed and validated with domain experts
- [ ] Graph database selected and set up
- [ ] Entity extraction pipeline tested (>85% accuracy)
- [ ] Relationship extraction validated
- [ ] Hybrid search (graph + vector) implemented
- [ ] Query API created and documented
- [ ] AI integration tested (RAG or hallucination detection)
- [ ] Performance benchmarks met (query <100ms for common patterns)
- [ ] Data quality monitoring in place
- [ ] Backup and recovery tested

---

## Related Resources

**Related Skills**:
- `rag-implementer` - For hybrid KG+RAG systems
- `multi-agent-architect` - For knowledge-graph-powered agents
- `api-designer` - For KG API design

**Related Patterns**:
- `META/DECISION-FRAMEWORK.md` - Graph DB selection
- `STANDARDS/architecture-patterns/knowledge-graph-pattern.md` - KG architectures (when created)

**Related Playbooks**:
- `PLAYBOOKS/deploy-neo4j.md` - Neo4j deployment (when created)
- `PLAYBOOKS/build-kg-rag-system.md` - KG-RAG integration (when created)

## Overview

This skill designs and builds knowledge graphs to model complex relationships, power semantic search, and ground AI outputs. It covers ontology design, entity and relationship extraction, graph database selection, hybrid vector integration, query APIs, and hallucination prevention. Use it to create verifiable, traversable knowledge bases that improve reasoning and reduce AI errors.

## How this skill works

I guide you through a six-phase implementation: define an ontology, pick a graph database, extract entities and relationships, combine the graph with vector search, create query patterns and APIs, and integrate with LLMs for grounding. The system stores structured nodes and edges with confidence and provenance, indexes textual descriptions as embeddings, and runs hybrid searches that merge vector similarity with graph traversal. Validation gates and incremental builds keep accuracy high and the risk of schema drift low.

## When to use it

- You need to model rich, multi-entity relationships (people, orgs, products, events).
- You require semantic search plus relationship-aware reasoning or multi-hop queries.
- You must verify or cite facts to prevent LLM hallucinations.
- You are building recommendations, fraud detection, or pattern discovery across connected data.
- Your domain benefits from explicit ontologies and provenance for each assertion.

## Best practices

- Start with a clear ontology and validate it with domain experts before ingesting data.
- Perform aggressive entity resolution and deduplication to avoid fractured identities.
- Attach confidence scores and source metadata to every relationship and node.
- Adopt a hybrid architecture: vector embeddings for semantics and the graph for traversal.
- Build incrementally: model core entities first, expand iteratively, and monitor quality metrics.

## Example use cases

- Question answering: extract entities from a query, traverse subgraphs, and return the path as explanation.
- Recommendation systems: find similar items via shared relationships and rank by relationship strength.
- Fraud detection: represent transactions and accounts as a graph to surface suspicious cycles or clusters.
- Semantic search: combine vector recall with graph expansion to return contextual, connected results.
- Knowledge discovery: surface likely missing links and validate suggested connections with experts.

## FAQ

**Which graph database should I pick for an MVP?**

For small to medium MVPs, Neo4j Community is a practical choice; it offers Cypher, easy setup, and good tooling. Use ArangoDB if you need multi-model capabilities.

**How do I prevent ontology changes from breaking the system?**

Design the ontology before ingestion, version it, and migrate incrementally. Keep backward-compatible properties and run validation tests when evolving the schema.
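
A minimal sketch of one way to track this, assuming you store the schema version in the graph itself under a hypothetical `SchemaVersion` node:

```python
def get_schema_version(graph) -> int:
    rows = graph.run("MATCH (v:SchemaVersion) RETURN v.version AS version").data()
    return rows[0]["version"] if rows else 0

def migrate(graph, migrations: dict):
    # migrations: {target_version: cypher_statement}, applied in ascending
    # order so older graphs catch up incrementally without breaking readers.
    current = get_schema_version(graph)
    for version in sorted(v for v in migrations if v > current):
        graph.run(migrations[version])
        graph.run("MERGE (v:SchemaVersion) SET v.version = $version", version=version)
```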