
hkgb skill

/hkgb

This skill helps you build hybrid knowledge graphs by bridging structured domain data with LLM-extracted lexical graphs through a reliable join key.

npx playbooks add skill zpankz/mcp-skillset --skill hkgb

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: hkgb
description: "This skill should be used when building hybrid Knowledge Graphs that integrate structured data (CSV, databases) with automatically extracted entities from unstructured documents (PDFs, text). The pattern establishes a reliable join key between domain graphs and lexical graphs, enabling GraphRAG, document ingestion with metadata enrichment, and Knowledge Graph construction from heterogeneous sources using neo4j-graphrag SimpleKGPipeline."
---

# Hybrid Knowledge Graph Bridge

*Integration pattern for linking structured domain data to LLM-extracted lexical graphs*

## Problem

When building Knowledge Graphs from heterogeneous sources, two distinct graph types often need to coexist:

1. **Domain Graph** — Structured, curated data from CSV/databases representing business entities and relationships
2. **Lexical Graph** — Entities and relationships automatically extracted from unstructured documents via LLM

These graphs speak different languages: one is schema-driven and deterministic, the other is probabilistic and emergent. Without a deliberate bridge, they remain disconnected silos.

## Solution

The pattern establishes a reliable join key between the two graphs in five steps.

### Step 1: Specify the lexical graph schema

Before extraction, define the ontology that guides the LLM. This specification comprises three elements.

**Node Types** — The entities to extract. Some are simple labels, others are enriched with descriptions (to guide the LLM) and typed properties:

```python
NODE_TYPES = [
    "Entity",           # Simple label
    "Concept",
    "Process",
    {                   # Enriched with description
        "label": "Outcome",
        "description": "A result, benefit, or consequence of a process or action."
    },
    {                   # With typed properties
        "label": "Reference",
        "description": "An external resource such as a document, article, or dataset.",
        "properties": [
            {"name": "name", "type": "STRING", "required": True},
            {"name": "type", "type": "STRING"}
        ]
    },
]
```

**Relationship Types** — The possible verbs between entities:

```python
RELATIONSHIP_TYPES = [
    "RELATED_TO",
    "PART_OF",
    "USED_IN",
    "LEADS_TO",
    "REFERENCES"
]
```

**Patterns** — The valid (source, relationship, target) combinations. The LLM may only extract triplets that conform:

```python
PATTERNS = [
    ("Entity", "RELATED_TO", "Entity"),
    ("Concept", "RELATED_TO", "Entity"),
    ("Process", "PART_OF", "Entity"),
    ("Process", "LEADS_TO", "Outcome"),
    ("Reference", "REFERENCES", "Entity"),
]
```

### Step 2: Configure the extraction pipeline

The pipeline assembles the LLM, embedder, text splitter, and schema:

```python
import os

from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter

llm = OpenAILLM(
    model_name="gpt-4o",
    model_params={
        "temperature": 0,
        "response_format": {"type": "json_object"},
    }
)

embedder = OpenAIEmbeddings(model="text-embedding-ada-002")
text_splitter = FixedSizeSplitter(chunk_size=500, chunk_overlap=100)

kg_builder = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver,  # an existing neo4j.GraphDatabase.driver(...) instance
    neo4j_database=os.getenv("NEO4J_DATABASE"),
    embedder=embedder,
    from_pdf=True,
    text_splitter=text_splitter,
    schema={
        "node_types": NODE_TYPES,
        "relationship_types": RELATIONSHIP_TYPES,
        "patterns": PATTERNS
    },
)
```

The pipeline performs: PDF → chunks → schema-guided LLM extraction → node/relationship creation → embeddings.
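
To confirm each stage ran, you can count what a run wrote per document. A minimal sketch; `FROM_DOCUMENT` and `FROM_CHUNK` are assumed here as the lexical-graph relationship types neo4j-graphrag creates by default, so verify them against your installed version:

```python
import os

# Sketch: inspect what the pipeline produced for each ingested document.
# Assumes the default lexical-graph relationship types (FROM_DOCUMENT, FROM_CHUNK).
inspect_query = """
MATCH (d:Document)<-[:FROM_DOCUMENT]-(c:Chunk)
OPTIONAL MATCH (c)<-[:FROM_CHUNK]-(e)
RETURN d.path AS document, count(DISTINCT c) AS chunks, count(DISTINCT e) AS entities
"""
records, _, _ = neo4j_driver.execute_query(
    inspect_query, database_=os.getenv("NEO4J_DATABASE")
)
for r in records:
    print(f"{r['document']}: {r['chunks']} chunks, {r['entities']} entities")
```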

### Step 3: Transform the structured source into a dictionary

Each row of the CSV (representing the domain graph) becomes a Python dictionary:

```python
import csv
import os

with open(os.path.join(data_path, "metadata.csv"), encoding="utf8", newline="") as f:
    records = list(csv.DictReader(f))
# Each record: {"filename": "doc1.pdf", "category": "...", "author": "...", ...}
```

### Step 4: Add the common key to the dictionary

The pipeline creates `Document` nodes with a `path` property. This property serves as the bridge between the two graphs. Enrich the dictionary with a key that matches exactly what the pipeline stores:

```python
record["file_path"] = os.path.join(data_path, record["filename"])
# The same value passed to the pipeline becomes Document.path
```

The same value is then passed to the pipeline, which generates the lexical graph:

```python
import asyncio

result = asyncio.run(
    kg_builder.run_async(file_path=record["file_path"])
)
```

### Step 5: Join the two graphs via Cypher

A query uses the common key to attach the domain graph to the lexical graph:

```cypher
MATCH (d:Document {path: $file_path})
MERGE (e:DomainEntity {id: $entity_id})
SET e.category = $category,
    e.author = $author
MERGE (d)-[:BELONGS_TO]->(e)
```

The enriched dictionary is passed as parameters:

```python
neo4j_driver.execute_query(cypher, parameters_=record, database_=os.getenv("NEO4J_DATABASE"))
```

## Consequences

The pattern works because the dictionary key and `Document.path` contain identical values. This implicit key connects the lexical graph (entities extracted according to the specified schema) to the domain graph (business structure from structured sources). If these values diverge, the bridge fails silently — orphaned nodes accumulate undetected.
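
One way to keep the values from diverging is to compute the key once and reuse it on both sides. A minimal sketch; the `bridge_key` helper is illustrative, not part of neo4j-graphrag:

```python
import os

def bridge_key(data_path: str, filename: str) -> str:
    # Normalize once so "data//doc1.pdf" and "data/doc1.pdf" cannot
    # produce two different keys on the two sides of the bridge.
    return os.path.normpath(os.path.join(data_path, filename))

record["file_path"] = bridge_key(data_path, record["filename"])
# Use record["file_path"] both for kg_builder.run_async(...) and as the
# $file_path parameter in the join query, so Document.path always matches.
```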

## Verification

To ensure the bridge holds, verify that `Document` nodes are properly attached:

```cypher
// Orphan documents (broken bridge)
MATCH (d:Document)
WHERE NOT EXISTS { (d)-[:BELONGS_TO]->(:DomainEntity) }
RETURN d.path AS orphan

// Domain entities without documents (bridge never built)
MATCH (e:DomainEntity)
WHERE NOT EXISTS { (:Document)-[:BELONGS_TO]->(e) }
RETURN e.id AS missing
```
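
These checks can be scripted so they run after every ingestion batch. A minimal sketch, assuming the same `neo4j_driver` and `NEO4J_DATABASE` as above:

```python
import os

CHECKS = {
    "orphan documents": """
        MATCH (d:Document)
        WHERE NOT EXISTS { (d)-[:BELONGS_TO]->(:DomainEntity) }
        RETURN d.path AS value
    """,
    "unlinked domain entities": """
        MATCH (e:DomainEntity)
        WHERE NOT EXISTS { (:Document)-[:BELONGS_TO]->(e) }
        RETURN e.id AS value
    """,
}

for name, query in CHECKS.items():
    records, _, _ = neo4j_driver.execute_query(
        query, database_=os.getenv("NEO4J_DATABASE")
    )
    if records:
        print(f"{len(records)} {name}: {[r['value'] for r in records]}")
```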

## Complete Reference

For a complete implementation example, see `references/full_example.py`.

Overview

This skill implements a hybrid Knowledge Graph bridge that links structured domain data (CSV, databases) with lexical graphs extracted from unstructured documents (PDFs, text). It defines a schema-driven extraction pipeline and a reliable join key pattern so domain nodes and LLM-extracted entities connect in Neo4j. Use it to enable GraphRAG, metadata-enriched ingestion, and consistent KG construction from heterogeneous sources.

How this skill works

You specify a lexical graph schema (node types, relationship types, and valid patterns), then run a SimpleKGPipeline that splits text, calls an LLM to extract nodes and relationships, and writes Document nodes with a path property. Structured rows are transformed into dictionaries and enriched with the exact file path used by the pipeline. A Cypher MERGE on the shared path attaches domain entities to the lexical graph, producing a unified hybrid Knowledge Graph; the sketch below condenses the flow.
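
A sketch of the whole loop, assuming the `kg_builder`, `neo4j_driver`, `data_path`, and join `cypher` defined in the steps above, and a CSV whose columns match the Cypher parameters:

```python
import asyncio
import csv
import os

with open(os.path.join(data_path, "metadata.csv"), encoding="utf8", newline="") as f:
    for record in csv.DictReader(f):
        # The join key: identical to the value the pipeline stores as Document.path.
        record["file_path"] = os.path.join(data_path, record["filename"])

        # Lexical graph: chunk the PDF and extract schema-conforming entities.
        asyncio.run(kg_builder.run_async(file_path=record["file_path"]))

        # Domain graph: attach structured metadata via the shared key.
        neo4j_driver.execute_query(
            cypher, parameters_=record, database_=os.getenv("NEO4J_DATABASE")
        )
```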

When to use it

  • Ingesting PDFs or text alongside CSV or database records that should be connected in one KG.
  • Building GraphRAG retrieval where document-level metadata must map to extracted entities.
  • Creating a lineage between curated domain entities and LLM-extracted concepts or relations.
  • Enriching document ingestion with structured metadata for downstream querying and analytics.

Best practices

  • Define a clear lexical schema before extraction: list node types, relationship types, and allowed patterns.
  • Use a deterministic file path value as the join key and ensure CSV/database rows include that exact path.
  • Keep extraction temperature low and enforce structured response formats so the LLM follows the schema.
  • Validate bridge integrity regularly with Cypher queries that detect orphaned Documents or DomainEntity nodes.
  • Store typed properties on reference nodes to improve matching and downstream filtering.

Example use cases

  • Attach invoice PDFs to financial ledger rows so extracted line items and domain accounts link for auditing.
  • Index research papers and connect extracted concepts to a curated taxonomy stored in a database.
  • Ingest technical documentation and map LLM-extracted processes to product entities from a CSV inventory.
  • Power a QA system that pulls context from document extracts while showing authoritative domain metadata.

FAQ

What must match exactly to create the bridge?

The file path value stored on Document.path must match the key you add to your structured record; this shared value is the join parameter in the Cypher query.
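
A quick way to fail fast on a mismatch is to check for the Document node before running the join. A minimal sketch using the driver and record from the steps above:

```python
# Sketch: assert the join key actually resolves to a Document node.
records, _, _ = neo4j_driver.execute_query(
    "MATCH (d:Document {path: $p}) RETURN count(d) AS n",
    parameters_={"p": record["file_path"]},
)
assert records[0]["n"] == 1, f"no Document with path {record['file_path']}"
```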

How do I detect broken bridges?

Run the Cypher queries from the Verification section: they find Document nodes without outgoing BELONGS_TO relationships and DomainEntity nodes without incoming ones.