home / skills / kthorn / research-superpower / traversing-citations

traversing-citations skill

/skills/research/traversing-citations

This skill helps you traverse citation networks intelligently using Semantic Scholar to identify relevant backward and forward citations while deduplicating

npx playbooks add skill kthorn/research-superpower --skill traversing-citations

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
8.9 KB
---
name: Traversing Citation Networks
description: Smart backward and forward citation following via Semantic Scholar, with relevance filtering and deduplication
when_to_use: After finding relevant paper. When need to find related work. When following references or citations. When building citation graph. When exploring paper connections.
version: 1.0.0
---

# Traversing Citation Networks

## Overview

Intelligently follow citations backward (references) and forward (citing papers) using Semantic Scholar API.

**Core principle:** Only follow citations relevant to user's query. Avoid exponential explosion by filtering before traversing.

## When to Use

Use this skill when:
- Found a highly relevant paper (score ≥ 7)
- Need to find related work
- User asks "what papers cite this?"
- Building comprehensive understanding of a topic

**When NOT to use:**
- Paper scored < 7 (not relevant enough to follow)
- Already at 50 papers (check with user first)
- Citations look off-topic from abstract

## Citation Traversal Strategy

### 1. Get Paper ID from Semantic Scholar

**Lookup by DOI:**
```bash
curl "https://api.semanticscholar.org/graph/v1/paper/DOI:10.1234/example.2023?fields=paperId,title,year"
```

**Response:**
```json
{
  "paperId": "abc123def456",
  "title": "Paper Title",
  "year": 2023
}
```

**Save paperId** - needed for citations/references queries

### 2. Backward Traversal (References)

**Get references from paper:**
```bash
curl "https://api.semanticscholar.org/graph/v1/paper/abc123def456/references?fields=contexts,intents,title,year,abstract,externalIds&limit=100"
```

**Response format:**
```json
{
  "data": [
    {
      "citedPaper": {
        "paperId": "xyz789",
        "title": "Referenced Paper Title",
        "year": 2020,
        "abstract": "...",
        "externalIds": {
          "DOI": "10.5678/referenced.2020",
          "PubMed": "87654321"
        }
      },
      "contexts": [
        "...as described in previous work [15]...",
        "...we used the method from [15] to..."
      ],
      "intents": ["methodology", "background"]
    }
  ]
}
```

**Filter for relevance:**

For each reference, check:
1. **Context keywords**: Do citation contexts mention user's query terms?
   - Example: If user asks about "IC50 values", look for contexts mentioning "IC50", "activity", "potency"
2. **Title match**: Does title contain relevant keywords?
3. **Intent**: Is intent "methodology" or "result" (more relevant) vs "background" (less relevant)?

**Scoring:**
- Context keywords match: +3 points
- Title keywords match: +2 points
- Intent is methodology/result: +2 points
- Recent (< 5 years old): +1 point

**Only add to queue if score ≥ 5**

### 3. Forward Traversal (Citations)

**Get papers citing this one:**
```bash
curl "https://api.semanticscholar.org/graph/v1/paper/abc123def456/citations?fields=title,year,abstract,externalIds&limit=100"
```

**Response format:**
```json
{
  "data": [
    {
      "citingPaper": {
        "paperId": "def456ghi",
        "title": "Newer Paper Citing This",
        "year": 2024,
        "abstract": "We extended the work of [original paper]...",
        "externalIds": {
          "DOI": "10.9012/citing.2024"
        }
      }
    }
  ]
}
```

**Filter for relevance:**

For each citing paper:
1. **Title match**: Keywords present in title?
2. **Abstract match**: User's query terms in abstract?
3. **Recency**: Newer papers often build on findings (prioritize < 2 years)
4. **Citation count**: If Semantic Scholar provides, highly cited papers more likely relevant

**Scoring:**
- Title keywords match: +3 points
- Abstract keywords match: +2 points
- Recent (< 2 years): +2 points
- Moderate recency (2-5 years): +1 point

**Only add to queue if score ≥ 5**

### 4. Deduplication

**Before adding to queue:**

Check papers-reviewed.json:
```python
doi = paper["externalIds"].get("DOI")
if doi in papers_reviewed:
    skip  # Already processed
else:
    add to queue
```

**CRITICAL: After evaluating any paper from citation traversal, add it to papers-reviewed.json regardless of score. This prevents re-processing the same paper from multiple sources.**

**Track citation relationship** in citations/citation-graph.json:
```json
{
  "10.1234/example.2023": {
    "references": ["10.5678/ref1.2020", "10.5678/ref2.2021"],
    "cited_by": ["10.9012/cite1.2024", "10.9012/cite2.2024"]
  }
}
```

**CRITICAL: Use ONLY citation-graph.json for citation tracking. Do NOT create custom files like forward_citation_pmids.txt or citation_analysis.md. All findings go in SUMMARY.md.**

### 5. Process Queue

**Add relevant citations to processing queue:**
```json
{
  "doi": "10.5678/referenced.2020",
  "title": "Referenced Paper",
  "relevance_score": 7,
  "source": "backward_from:10.1234/example.2023",
  "context": "Method citation - describes IC50 measurement protocol"
}
```

**Then:**
- Evaluate using `evaluating-paper-relevance` skill
- If relevant, extract data and potentially traverse its citations too

## Smart Traversal Limits

**To avoid explosion:**
- Only traverse papers scoring ≥ 7 in initial evaluation
- Only follow citations scoring ≥ 5 in relevance filtering
- Limit traversal depth to 2 levels (original → references → references of references)
- Check with user after every 50 papers total

**Breadth-first strategy:**
1. Get all references + citations for current paper
2. Filter and score them
3. Add high-scoring ones to queue
4. Process next paper in queue
5. Repeat until queue empty or hit limit

## Progress Reporting

**Report as you traverse:**
```
🔗 Analyzing citations for: "Original Paper Title"
   → Found 45 references, 12 look relevant
   → Found 23 citing papers, 8 look relevant
   → Adding 20 papers to queue

📄 [51/127] Following reference: "Method for measuring IC50"
   Source: Referenced by original paper in Methods section
   Abstract score: 7 → Fetching full text...
```

## API Rate Limiting

**Semantic Scholar limits:**
- Free tier: 100 requests per 5 minutes
- With API key: 1000 requests per 5 minutes

**Be efficient:**
- Request multiple fields in one call (`?fields=title,abstract,externalIds,year`)
- Use `limit=100` to get more results per request
- Cache responses - don't re-fetch same paper

**If rate limited:**
- Wait 5 minutes
- Report to user: "⏸️ Rate limited by Semantic Scholar API. Waiting 5 minutes..."
- Consider getting API key for higher limits

## Integration with Other Skills

**After traversing citations:**
1. Queue now has N new papers to evaluate
2. For each, use `evaluating-paper-relevance` skill
3. If relevant, extract to SUMMARY.md
4. If highly relevant (≥9), traverse its citations too
5. Update citation-graph.json to track relationships

## Quick Reference

| Task | API Endpoint |
|------|--------------|
| Get paper by DOI | `GET /graph/v1/paper/DOI:{doi}?fields=paperId,title` |
| Get references | `GET /graph/v1/paper/{paperId}/references?fields=contexts,title,abstract,externalIds` |
| Get citations | `GET /graph/v1/paper/{paperId}/citations?fields=title,abstract,externalIds` |
| Check if processed | Look up DOI in papers-reviewed.json |
| Filter relevance | Score based on context/title/intent/recency |

## Relevance Filtering Checklist

Before adding citation to queue:
- [ ] Check if already in papers-reviewed.json (skip if yes)
- [ ] Score based on context/title keywords (need ≥ 5)
- [ ] Verify external ID (DOI or PMID) exists
- [ ] Add source tracking ("backward_from:DOI" or "forward_from:DOI")
- [ ] Add to queue with metadata

## Common Mistakes

**Not tracking all evaluated papers:** Only adding relevant papers to papers-reviewed.json → Add EVERY paper after evaluation to prevent re-review
**Creating custom analysis files:** Making forward_citation_pmids.txt, CITATION_ANALYSIS.md, etc. → Use ONLY citation-graph.json and SUMMARY.md
**Following all citations:** Exponential explosion → Filter before adding to queue
**Ignoring context:** Citation might be tangential → Read context strings
**Not deduplicating:** Re-process same papers → Always check papers-reviewed.json before and after evaluation
**Too deep:** Following 5+ levels → Limit to 2 levels, check with user
**Missing forward citations:** Only checking references → Use both backward and forward
**No rate limiting awareness:** API blocks you → Add delays, handle 429 errors

## Example Workflow

```
1. User asks: "Find selectivity data for BTK inhibitors"
2. Search finds Paper A (score: 9, has great IC50 data)
3. Traverse citations for Paper A:
   - References: 45 total, 12 relevant (mention "selectivity", "IC50")
   - Citations: 23 total, 8 relevant (newer papers on BTK)
4. Add 20 papers to queue
5. Evaluate first queued paper (score: 8)
6. Extract data, traverse its citations (add 5 more)
7. Continue until queue empty or user says stop
```

## Next Steps

After traversing citations:
- Process queued papers with `evaluating-paper-relevance`
- Update SUMMARY.md with new findings
- Check if reached checkpoint (50 papers or 5 minutes)
- If checkpoint: ask user to continue or stop

Overview

This skill intelligently traverses citation networks using the Semantic Scholar API to follow references (backward) and citing papers (forward) while filtering for relevance and deduplicating results. It prevents exponential exploration by scoring and queuing only relevant citations and tracking progress in a citation graph. The goal is focused discovery of related work and efficient expansion of a literature set.

How this skill works

Lookup a paper by DOI to obtain its Semantic Scholar paperId, then fetch references and citations with batched API calls. Each candidate citation is scored on context keywords, title/abstract matches, intent, recency, and citation count. Only items meeting score thresholds are queued, added to the reviewed registry, and recorded in the citation graph to avoid re-processing.

When to use it

  • You found a highly relevant seed paper (initial score ≥ 7) and want related work.
  • User asks “what papers cite this?” or requests follow-up literature for a result or method.
  • Building a comprehensive, bounded literature map for a topic or metric (e.g., IC50).
  • You need a breadth-first, limited-depth traversal to discover influential newer or foundational papers.
  • Avoid when the seed paper is low relevance (score < 7) or when already processing ~50 papers without user confirmation.

Best practices

  • Request multiple fields per API call and use limit=100 to reduce requests and stay within rate limits.
  • Filter before traversing: apply the relevance scoring rules and only queue items above thresholds.
  • Deduplicate aggressively: record every evaluated paper in the reviewed registry and use citation-graph.json for relationships.
  • Limit depth to two levels and checkpoint with the user after every 50 papers to avoid runaway exploration.
  • Respect rate limits: cache responses, batch fields, and implement backoff on 429 responses.

Example use cases

  • Expand from a paper with useful IC50 measurements to find methods papers and newer validation studies.
  • Identify recent papers that build on a seminal method to get the latest improvements or applications.
  • Construct a curated citation graph for a review or meta-analysis while avoiding tangential citations.
  • Find methodological sources cited in a results section to reproduce or adapt experimental protocols.
  • Quickly gather candidate papers for deeper evaluation by a separate relevance-evaluation skill.

FAQ

How do you avoid re-processing the same paper?

Every evaluated paper is added to the reviewed registry; the system checks this before queuing and records relationships in citation-graph.json.

What thresholds control traversal?

References and citations are scored; only references with score ≥ 5 and initial papers with score ≥ 7 are followed. Depth is limited to two levels and checkpoints every 50 papers.