home / skills / ancoleman / ai-design-components / using-graph-databases

using-graph-databases skill

safe

This skill helps you design and implement graph databases for relationship-rich applications, boosting traversal efficiency, pattern matching, and multi-model

npx playbooks add skill ancoleman/ai-design-components --skill using-graph-databases

Review the files below or copy the command above to add this skill to your agents.

Files (11)

SKILL.md

12.7 KB

---
name: using-graph-databases
description: Graph database implementation for relationship-heavy data models. Use when building social networks, recommendation engines, knowledge graphs, or fraud detection. Covers Neo4j (primary), ArangoDB, Amazon Neptune, Cypher query patterns, and graph data modeling.
---

# Graph Databases

## Purpose

This skill guides selection and implementation of graph databases for applications where relationships between entities are first-class citizens. Unlike relational databases that model relationships through foreign keys and joins, graph databases natively represent connections as properties, enabling efficient traversal-heavy queries.

## When to Use This Skill

Use graph databases when:
- **Deep relationship traversals** (4+ hops): "Friends of friends of friends"
- **Variable/evolving relationships**: Schema changes don't break existing queries
- **Path finding**: Shortest route, network analysis, dependency chains
- **Pattern matching**: Fraud detection, recommendation engines, access control

**Do NOT use graph databases when**:
- Fixed schema with shallow joins (2-3 tables) → Use PostgreSQL
- Primarily aggregations/analytics → Use columnar databases
- Key-value lookups only → Use Redis/DynamoDB

## Quick Decision Framework

```
DATA CHARACTERISTICS?
├── Fixed schema, shallow joins (≤3 hops)
│   └─ PostgreSQL (relational)
│
├── Already on PostgreSQL + simple graphs
│   └─ Apache AGE (PostgreSQL extension)
│
├── Deep traversals (4+ hops) + general purpose
│   └─ Neo4j (battle-tested, largest ecosystem)
│
├── Multi-model (documents + graph)
│   └─ ArangoDB
│
├── AWS-native, serverless
│   └─ Amazon Neptune
│
└── Real-time streaming, in-memory
    └─ Memgraph
```

## Core Concepts

### Property Graph Model

Graph databases store data as:
- **Nodes** (vertices): Entities with labels and properties
- **Relationships** (edges): Typed connections with properties
- **Properties**: Key-value pairs on nodes and relationships

```
(Person {name: "Alice", age: 28})-[:FRIEND {since: "2020-01-15"}]->(Person {name: "Bob"})
```

### Query Languages

| Language | Databases | Readability | Best For |
|----------|-----------|-------------|----------|
| **Cypher** | Neo4j, Memgraph, AGE | ⭐⭐⭐⭐⭐ SQL-like | General purpose |
| **Gremlin** | Neptune, JanusGraph | ⭐⭐⭐ Functional | Cross-database |
| **AQL** | ArangoDB | ⭐⭐⭐⭐ SQL-like | Multi-model |
| **SPARQL** | Neptune, RDF stores | ⭐⭐⭐ W3C standard | Semantic web |

## Common Cypher Patterns

Reference `references/cypher-patterns.md` for comprehensive examples.

### Pattern 1: Basic Matching
```cypher
// Find all users at a company
MATCH (u:User)-[:WORKS_AT]->(c:Company {name: 'Acme Corp'})
RETURN u.name, u.title
```

### Pattern 2: Variable-Length Paths
```cypher
// Find friends up to 3 degrees away
MATCH (u:User {name: 'Alice'})-[:FRIEND*1..3]->(friend)
WHERE u <> friend
RETURN DISTINCT friend.name
LIMIT 100
```

### Pattern 3: Shortest Path
```cypher
// Find shortest connection between two users
MATCH path = shortestPath(
  (a:User {name: 'Alice'})-[*]-(b:User {name: 'Bob'})
)
RETURN path, length(path) AS distance
```

### Pattern 4: Recommendations
```cypher
// Collaborative filtering: Products liked by similar users
MATCH (u:User {id: $userId})-[:PURCHASED]->(p:Product)<-[:PURCHASED]-(similar)
MATCH (similar)-[:PURCHASED]->(rec:Product)
WHERE NOT exists((u)-[:PURCHASED]->(rec))
RETURN rec.name, count(*) AS score
ORDER BY score DESC
LIMIT 10
```

### Pattern 5: Fraud Detection
```cypher
// Detect circular money flows
MATCH path = (a:Account)-[:SENT*3..6]->(a)
WHERE all(r IN relationships(path) WHERE r.amount > 1000)
RETURN path, [r IN relationships(path) | r.amount] AS amounts
```

## Database Selection Guide

### Neo4j (Primary Recommendation)

**Use for**: General-purpose graph applications

**Strengths**:
- Most mature (2007), largest community (2M+ developers)
- 65+ graph algorithms (GDS library): PageRank, Louvain, Dijkstra
- Best tooling: Neo4j Browser, Bloom visualization
- Comprehensive Cypher support

**Installation**:
```bash
# Python driver
pip install neo4j

# TypeScript driver
npm install neo4j-driver

# Rust driver
cargo add neo4rs
```

Reference: `references/neo4j.md`

### ArangoDB

**Use for**: Multi-model applications (documents + graph)

**Strengths**:
- Store documents AND graph in one database
- AQL combines document and graph queries
- Schema flexibility with relationships

Reference: `references/arangodb.md`

### Apache AGE

**Use for**: Adding graph capabilities to existing PostgreSQL

**Strengths**:
- Extend PostgreSQL with graph queries
- No new infrastructure needed
- Query both relational and graph data

Reference: Implementation details in examples/

### Amazon Neptune

**Use for**: AWS-native, serverless deployments

**Strengths**:
- Fully managed, auto-scaling
- Supports Gremlin AND SPARQL
- AWS ecosystem integration

## Graph Data Modeling Patterns

Reference `references/graph-modeling.md` for comprehensive patterns.

### Best Practice 1: Relationships as First-Class Citizens

**Anti-pattern** (storing relationships in node properties):
```cypher
// BAD
(:Person {name: 'Alice', friend_ids: ['b123', 'c456']})
```

**Pattern** (explicit relationships):
```cypher
// GOOD
(:Person {name: 'Alice'})-[:FRIEND]->(:Person {id: 'b123'})
(:Person {name: 'Alice'})-[:FRIEND]->(:Person {id: 'c456'})
```

### Best Practice 2: Relationship Properties for Metadata

```cypher
// Track interaction details on relationships
(:Person)-[:FRIEND {
  since: '2020-01-15',
  strength: 0.85,
  last_interaction: datetime()
}]->(:Person)
```

### Best Practice 3: Bounded Traversals for Performance

```cypher
// SLOW: Unbounded traversal
MATCH (a)-[:FRIEND*]->(distant)
RETURN distant

// FAST: Bounded depth with index
MATCH (a)-[:FRIEND*1..4]->(distant)
WHERE distant.active = true
RETURN distant
LIMIT 100
```

### Best Practice 4: Avoid Supernodes

**Problem**: Nodes with thousands of relationships slow traversals.

**Solution**: Intermediate aggregation nodes
```cypher
// Instead of: (:User)-[:POSTED]->(:Post) [1M relationships]

// Use time partitioning:
(:User)-[:POSTED_IN]->(:Year {year: 2025})
       -[:HAS_MONTH]->(:Month {month: 12})
       -[:HAS_POST]->(:Post)
```

## Use Case Examples

### Social Network

Schema and implementation in `examples/social-graph/`

**Key features**:
- Friend recommendations (friends-of-friends)
- Mutual connections
- News feed generation
- Influence metrics

### Knowledge Graph for AI/RAG

Integration example in `examples/knowledge-graph/`

**Key features**:
- Hybrid vector + graph search
- Entity relationship mapping
- Context expansion for LLM prompts
- Semantic relationship traversal

**Integration with Vector Databases**:
```python
# Step 1: Vector search in Qdrant/pgvector
vector_results = qdrant.search(collection="concepts", query_vector=embedding)

# Step 2: Expand with graph relationships
concept_ids = [r.id for r in vector_results]
graph_context = neo4j.run("""
  MATCH (c:Concept) WHERE c.id IN $ids
  MATCH (c)-[:RELATED_TO|IS_A*1..2]-(related)
  RETURN c, related, relationships(path)
""", ids=concept_ids)
```

### Recommendation Engine

Examples in `examples/social-graph/`

**Strategies**:
1. **Collaborative filtering**: "Users who bought X also bought Y"
2. **Content-based**: "Products similar to what you like"
3. **Session-based**: "Recently viewed items"

### Fraud Detection

Pattern detection in examples/

**Detection patterns**:
- Circular money flows
- Shared devices across accounts
- Rapid transaction chains
- Connection pattern anomalies

## Performance Optimization

Reference `references/cypher-patterns.md` for detailed optimization.

### Indexing
```cypher
// Single-property index
CREATE INDEX user_email FOR (u:User) ON (u.email)

// Composite index (Neo4j 5.x+)
CREATE INDEX user_name_location FOR (u:User) ON (u.name, u.location)

// Full-text search
CREATE FULLTEXT INDEX product_search FOR (p:Product) ON EACH [p.name, p.description]
```

### Caching Expensive Aggregations
```cypher
// Materialize friend count as property
MATCH (u:User)-[:FRIEND]->(f)
WITH u, count(f) AS friendCount
SET u.friend_count = friendCount

// Query becomes instant
MATCH (u:User) WHERE u.friend_count > 100
RETURN u.name, u.friend_count
```

### Scaling Strategies

| Scale | Strategy | Implementation |
|-------|----------|----------------|
| **Vertical** | Add RAM/CPU | In-memory caching, larger instances |
| **Horizontal (Read)** | Read replicas | Neo4j Cluster, ArangoDB Cluster |
| **Horizontal (Write)** | Sharding | ArangoDB SmartGraphs, JanusGraph |
| **Caching** | App-level cache | Redis for hot paths |

## Language Integration

### Python (Neo4j)

Complete example in `examples/social-graph/python-neo4j/`

```python
from neo4j import GraphDatabase

class GraphDB:
    def __init__(self, uri: str, user: str, password: str):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))

    def find_friends_of_friends(self, user_id: str, max_depth: int = 2):
        query = """
        MATCH (u:User {id: $userId})-[:FRIEND*1..$maxDepth]->(fof)
        WHERE u <> fof
        RETURN DISTINCT fof.id, fof.name
        LIMIT 100
        """
        with self.driver.session() as session:
            result = session.run(query, userId=user_id, maxDepth=max_depth)
            return [dict(record) for record in result]

# Usage
db = GraphDB("bolt://localhost:7687", "neo4j", "password")
friends = db.find_friends_of_friends("u123", max_depth=3)
```

### TypeScript (Neo4j)

Complete example in `examples/social-graph/typescript-neo4j/`

```typescript
import neo4j, { Driver } from 'neo4j-driver'

class Neo4jService {
  private driver: Driver

  constructor(uri: string, username: string, password: string) {
    this.driver = neo4j.driver(uri, neo4j.auth.basic(username, password))
  }

  async findFriendsOfFriends(userId: string, maxDepth: number = 2) {
    const session = this.driver.session()
    try {
      const result = await session.run(
        `MATCH (u:User {id: $userId})-[:FRIEND*1..$maxDepth]->(fof)
         WHERE u <> fof
         RETURN DISTINCT fof.id, fof.name
         LIMIT 100`,
        { userId, maxDepth }
      )
      return result.records.map(r => r.toObject())
    } finally {
      await session.close()
    }
  }
}
```

### Go (ArangoDB)

```go
import (
    "github.com/arangodb/go-driver"
    "github.com/arangodb/go-driver/http"
)

func findFriendsOfFriends(db driver.Database, userId string, maxDepth int) ([]User, error) {
    query := `
        FOR vertex, edge, path IN 1..@maxDepth OUTBOUND @startVertex GRAPH 'socialGraph'
            FILTER vertex._id != @startVertex
            RETURN DISTINCT vertex
            LIMIT 100
    `

    cursor, err := db.Query(ctx, query, map[string]interface{}{
        "startVertex": userId,
        "maxDepth": maxDepth,
    })

    // Handle results...
}
```

## Schema Validation

Use `scripts/validate_graph_schema.py` to check for:
- Unbounded traversals (missing depth limits)
- Missing indexes on frequently queried properties
- Supernodes (nodes with excessive relationships)
- Relationship property consistency

Run validation:
```bash
python scripts/validate_graph_schema.py --database neo4j://localhost:7687
```

## Integration with Other Skills

### With databases-vector (Hybrid Search)
Combine vector similarity with graph context for AI/RAG applications.
See `examples/knowledge-graph/`

### With search-filter
Implement relationship-based queries: "Find all users within 3 degrees of connection"

### With ai-chat
Use knowledge graphs to enrich LLM context with structured relationships.

### With auth-security (ReBAC)
Implement relationship-based access control: "Can user X access resource Y through relation Z?"

## Common Schema Patterns

### Star Schema (Hub and Spokes)
```cypher
(:User)-[:PURCHASED]->(:Product)
(:User)-[:VIEWED]->(:Product)
(:User)-[:RATED]->(:Product)
```

### Hierarchical Schema (Trees)
```cypher
(:CEO)-[:MANAGES]->(:VP)-[:MANAGES]->(:Director)
```

### Temporal Schema (Event Sequences)
```cypher
(:Event {timestamp})-[:NEXT]->(:Event {timestamp})
```

## Getting Started

1. **Choose database**: Use decision framework above
2. **Design schema**: Reference `references/graph-modeling.md`
3. **Implement queries**: Use patterns from `references/cypher-patterns.md`
4. **Validate**: Run `scripts/validate_graph_schema.py`
5. **Optimize**: Add indexes, bound traversals, cache aggregations

## Further Reading

- `references/neo4j.md` - Neo4j setup, drivers, GDS algorithms
- `references/arangodb.md` - ArangoDB multi-model patterns
- `references/cypher-patterns.md` - Comprehensive Cypher query library
- `references/graph-modeling.md` - Data modeling best practices
- `examples/social-graph/` - Complete social network implementation
- `examples/knowledge-graph/` - Hybrid vector + graph for AI/RAG

Overview

This skill provides a practical guide to selecting and implementing graph databases for relationship-heavy applications. It focuses on Neo4j as the primary recommendation and also covers ArangoDB, Amazon Neptune, Apache AGE, query patterns (Cypher), and graph data modeling. The goal is to help engineers design performant graph schemas, write common queries, and integrate graphs with AI pipelines.

How this skill works

The skill explains the property graph model (nodes, relationships, properties) and maps common application needs to appropriate database choices via a decision framework. It presents core Cypher patterns (matching, variable-length paths, shortest path, recommendations, fraud detection), indexing and caching strategies, and language-specific examples in Python, TypeScript, Go. It also includes modeling best practices, performance tips, and validation scripts for common anti-patterns.

When to use it

Deep relationship traversals (4+ hops) or traversal-heavy queries
Variable or evolving relationship schemas where joins become brittle
Path finding, network analysis, dependency chains
Pattern matching tasks like fraud detection or complex recommendations
When you need multi-model capabilities (documents + graph) or AWS-native managed graph services

Best practices

Model relationships explicitly as first-class edges; avoid storing relationship lists as node properties
Store metadata on relationship properties (timestamps, strength) for richer queries
Bound traversals to a max depth and use indexes to prevent unbounded scans
Avoid supernodes by sharding or using intermediate aggregation/time-partition nodes
Materialize expensive aggregations (friend counts, scores) in node properties or cache results

Example use cases

Social network: friend recommendations, mutual connections, news feed and influence metrics
Recommendation engine: collaborative filtering, content-based and session-based strategies
Knowledge graph for AI/RAG: hybrid vector + graph search, entity expansion for LLM context
Fraud detection: detect circular money flows, shared devices, and anomalous transaction chains
AWS deployments: use Amazon Neptune for managed Gremlin/SPARQL workloads; pick ArangoDB for multi-model needs

FAQ

Which database should I pick for a general-purpose graph app?

Neo4j is the default choice for general-purpose graph workloads due to its maturity, Cypher support, and tooling. Choose ArangoDB for document+graph needs and Neptune for AWS-native managed services.

How do I avoid performance issues with large-degree nodes?

Avoid supernodes by partitioning (time or category), introduce intermediate aggregation nodes, and add appropriate indexes. Also bound traversal depth and materialize heavy aggregations when possible.