
searching-literature skill

/skills/research/searching-literature

This skill searches PubMed with optimized queries, parses results, and extracts metadata to assemble a ready-to-evaluate literature set.

npx playbooks add skill kthorn/research-superpower --skill searching-literature

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: Searching Scientific Literature
description: PubMed search with keyword optimization, result parsing, and metadata extraction
when_to_use: When starting a literature search. When the user asks about papers, publications, or studies. When you need to find scientific articles. When building an initial paper list for a research question.
version: 1.0.0
---

# Searching Scientific Literature

## Overview

Search PubMed for scientific literature using optimized queries. Extract metadata and prepare papers for relevance evaluation.

**Core principle:** Cast a wide enough net to find relevant papers, but use targeted keywords to keep results manageable.

## When to Use

Use this skill when:
- Starting a new research question
- User asks "find papers about..."
- Need initial paper set for evaluation
- Searching for specific methods, compounds, diseases, techniques

## Search Strategy

### 1. Parse User Query

Extract:
- **Keywords**: Main concepts (e.g., "BTK inhibitor", "selectivity", "kinase")
- **Data types**: What user needs (IC50 values, methods, structures, results)
- **Constraints**: Date ranges, specific journals, author names
- **Synonyms**: Alternative terms (e.g., "Bruton's tyrosine kinase" = "BTK")

### 2. Construct PubMed Query

**Boolean operators:**
- AND - narrow results (must have both terms)
- OR - broaden results (either term)
- NOT - exclude terms

**Example queries:**
```
"BTK inhibitor"[Title/Abstract] AND selectivity[Title/Abstract]

("kinase inhibitor" OR "protein kinase") AND (selectivity OR "off-target")

"ibrutinib"[Title/Abstract] AND ("IC50" OR "inhibitory concentration")
```

**Field tags:**
- `[Title/Abstract]` - search title and abstract only
- `[Title]` - title only (more precise)
- `[Author]` - specific author
- `[Journal]` - specific journal
- `[Date - Publication]` (short form `[dp]`) - publication date or range, e.g. `2020:2024[dp]`
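
A minimal sketch of steps 1 and 2, assuming the parsed query is represented as a list of concepts, each with its own list of synonyms. The helper names (`quote`, `or_group`, `build_query`) are hypothetical and not part of this skill or of any PubMed library. Synonyms are joined with OR, and concepts are joined with AND.

```python
from __future__ import annotations


def quote(term: str) -> str:
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    if " " in term and not term.startswith('"'):
        return f'"{term}"'
    return term


def or_group(terms: list[str], field: str | None = None) -> str:
    """Join synonyms with OR, optionally tagging each term with a field tag."""
    tagged = [quote(t) + (f"[{field}]" if field else "") for t in terms]
    if len(tagged) == 1:
        return tagged[0]
    return "(" + " OR ".join(tagged) + ")"


def build_query(concepts: list[list[str]], field: str | None = None) -> str:
    """AND together one OR-group per concept."""
    return " AND ".join(or_group(group, field) for group in concepts)


# Hypothetical parsed query: two concepts, each with one synonym.
query = build_query(
    [["BTK", "Bruton's tyrosine kinase"], ["selectivity", "off-target"]],
    field="Title/Abstract",
)
print(query)
```

Keeping synonyms grouped per concept makes it easy to broaden a search (add a synonym to a group) or narrow it (add another concept) without rewriting the whole query string.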

### 3. Execute Search

**API endpoint:**
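```bash
# -G sends the parameters as a GET query string; --data-urlencode encodes each value
curl -G "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi" \
  --data-urlencode "db=pubmed" \
  --data-urlencode "term=YOUR_QUERY" \
  --data-urlencode "retmax=100" \
  --data-urlencode "retmode=json" \
  --data-urlencode "sort=relevance"
```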

**Parameters:**
- `db=pubmed` - search PubMed database
- `term=` - your query (URL encode spaces and special chars)
- `retmax=100` - max results (start with 100)
- `retmode=json` - return JSON
- `sort=relevance` - most relevant first (or `pub_date` for newest)

**Example bash:**
```bash
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=BTK+inhibitor+selectivity&retmax=100&retmode=json&sort=relevance"
```

**Response format:**
```json
{
  "esearchresult": {
    "count": "156",
    "retmax": "100",
    "idlist": ["12345678", "87654321", ...]
  }
}
```
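
The same request can be sketched in Python with only the standard library. The function names are hypothetical. URL building and response parsing are kept separate from the network call, so they can be checked against the sample response above without contacting NCBI.

```python
from __future__ import annotations

import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(query: str, retmax: int = 100, sort: str = "relevance") -> str:
    """Build the esearch URL; urlencode handles spaces and special characters."""
    params = {"db": "pubmed", "term": query, "retmax": retmax,
              "retmode": "json", "sort": sort}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def parse_esearch(body: str) -> tuple[int, list[str]]:
    """Return (total hit count, PMIDs on this page) from an esearch JSON body."""
    result = json.loads(body)["esearchresult"]
    return int(result["count"]), result["idlist"]


def fetch(url: str) -> str:
    """Perform the HTTP GET. Defined here but not called in this sketch."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


# Sample body shaped like the response format shown above.
sample = '{"esearchresult": {"count": "156", "retmax": "100", "idlist": ["12345678", "87654321"]}}'
total, pmids = parse_esearch(sample)
print(esearch_url("BTK inhibitor selectivity"))
print(total, pmids)
```

Note that `count` is the total number of matches, which can exceed `retmax`, so compare the two before assuming the ID list is complete.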

### 4. Fetch Paper Metadata

**API endpoint:**
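```bash
# Comma-separated PMIDs go in a single id parameter
curl -G "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi" \
  --data-urlencode "db=pubmed" \
  --data-urlencode "id=12345678,87654321" \
  --data-urlencode "retmode=json"
```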

**Extract from response:**
- Title
- Authors (list)
- Journal name
- Publication date
- Abstract (not included in esummary output; requires a separate efetch call)
- PMID
- DOI (if available in `articleids`)

**Getting DOI from PMID:**
```json
"articleids": [
  {"idtype": "pubmed", "value": "12345678"},
  {"idtype": "doi", "value": "10.1234/example.2023"}
]
```

**If DOI missing:**
- Use PMID as fallback identifier
- Try to resolve DOI via PubMed Central or publisher APIs later
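
The DOI lookup and PMID fallback can be sketched as below, assuming `articleids` has the shape shown above. The function names are hypothetical.

```python
from __future__ import annotations


def find_doi(articleids: list[dict]) -> str | None:
    """Return the DOI from an esummary articleids list, or None if absent."""
    for entry in articleids:
        if entry.get("idtype") == "doi" and entry.get("value"):
            return entry["value"]
    return None


def paper_identifier(pmid: str, articleids: list[dict]) -> str:
    """Prefer the DOI; fall back to the PMID when no DOI is listed."""
    return find_doi(articleids) or pmid


with_doi = [
    {"idtype": "pubmed", "value": "12345678"},
    {"idtype": "doi", "value": "10.1234/example.2023"},
]
without_doi = [{"idtype": "pubmed", "value": "87654321"}]

print(paper_identifier("12345678", with_doi))
print(paper_identifier("87654321", without_doi))
```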

## Output Format

Create list of paper objects:

```json
[
  {
    "pmid": "12345678",
    "doi": "10.1234/example.2023",
    "title": "Selective BTK inhibitors for autoimmune diseases",
    "authors": ["Smith J", "Doe A", "Johnson B"],
    "journal": "Nature Chemical Biology",
    "year": "2023",
    "abstract": "We developed a series of...",
    "source": "pubmed_search"
  }
]
```
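
One possible mapping from a single esummary record to this paper object is sketched below. The record field names (`uid`, `title`, `authors[].name`, `fulljournalname`, `pubdate`) are assumptions about the esummary JSON and should be checked against a live response. The abstract is passed in separately because it comes from efetch. `to_paper` is a hypothetical name.

```python
from __future__ import annotations


def to_paper(record: dict, abstract: str = "") -> dict:
    """Convert one esummary record (assumed field names) to a paper object."""
    doi = next(
        (a["value"] for a in record.get("articleids", []) if a.get("idtype") == "doi"),
        None,
    )
    return {
        "pmid": record["uid"],
        "doi": doi,  # None when the record lists no DOI
        "title": record.get("title", ""),
        "authors": [a["name"] for a in record.get("authors", [])],
        "journal": record.get("fulljournalname", ""),
        # Assumes pubdate starts with a four-digit year, e.g. "2023 Jan 15".
        "year": record.get("pubdate", "")[:4],
        "abstract": abstract,
        "source": "pubmed_search",
    }


# Illustrative record, not real data.
record = {
    "uid": "12345678",
    "title": "Selective BTK inhibitors for autoimmune diseases",
    "authors": [{"name": "Smith J"}, {"name": "Doe A"}],
    "fulljournalname": "Nature Chemical Biology",
    "pubdate": "2023 Jan 15",
    "articleids": [{"idtype": "doi", "value": "10.1234/example.2023"}],
}
print(to_paper(record, abstract="We developed a series of..."))
```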

## Error Handling

**Rate limits (CRITICAL - shared across all processes/subagents):**
- No API key: 3 requests/second (official limit)
- With API key: 10 requests/second
- **Single agent/script:** Use 500ms delays (2 req/sec, safe margin)
  - 350ms is theoretically sufficient but causes ~20% HTTP 429 errors in practice
- **Multiple parallel subagents:** Use longer delays to share capacity
  - 2 parallel: 1 second each (2 total req/sec)
  - 3 parallel: 1.5 seconds each (2 total req/sec)
  - 5 parallel: 2.5 seconds each (2 total req/sec)
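  - Formula: `delay_seconds = (num_parallel / rate_limit) + safety_margin`; the examples above fold the safety margin into a conservative `rate_limit` of 2 req/sec, giving `delay_seconds = num_parallel / 2`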
- **If you get HTTP 429 errors:** Wait 5 seconds, resume with doubled delays
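
The delay arithmetic above can be sketched as follows, assuming a conservative shared target of 2 requests per second across all agents, which is what the listed examples work out to. The function and constant names are hypothetical.

```python
from __future__ import annotations

PAUSE_AFTER_429 = 5.0  # seconds to wait after an HTTP 429, per the guidance above


def delay_seconds(num_parallel: int, target_rate: float = 2.0) -> float:
    """Per-agent delay so all agents together stay at target_rate req/sec."""
    if num_parallel < 1:
        raise ValueError("num_parallel must be at least 1")
    return num_parallel / target_rate


def delay_after_429(current_delay: float) -> float:
    """Double the per-request delay once the post-429 pause has elapsed."""
    return current_delay * 2


for agents in (1, 2, 3, 5):
    print(agents, delay_seconds(agents))
```

In a request loop, this would mean sleeping `delay_seconds(n)` between calls and, on a 429, sleeping `PAUSE_AFTER_429` before continuing with `delay_after_429(current)`.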

**Empty results:**
- Try broader terms
- Remove field tags
- Check for typos
- Use OR to add synonyms

**Too many results (>500):**
- Add more specific terms
- Use field tags to narrow
- Add date constraints
- Consider splitting into sub-queries
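
A small sketch of how the result count might drive the next step. The 500 threshold comes from the heading above. The `too_few` threshold of 5 is an assumption borrowed from the "only 5 results" example under Common Mistakes. `triage` and `ADJUSTMENTS` are hypothetical names.

```python
from __future__ import annotations

# Suggested adjustments, taken from the lists above.
ADJUSTMENTS = {
    "empty": ["try broader terms", "remove field tags", "check for typos", "add synonyms with OR"],
    "too_narrow": ["add synonyms with OR", "remove constraints"],
    "too_broad": ["add more specific terms", "use field tags", "add date constraints", "split into sub-queries"],
    "ok": [],
}


def triage(count: int, too_few: int = 5, too_many: int = 500) -> str:
    """Classify an esearch hit count into one of the ADJUSTMENTS keys."""
    if count == 0:
        return "empty"
    if count <= too_few:
        return "too_narrow"
    if count > too_many:
        return "too_broad"
    return "ok"


for count in (0, 5, 156, 5000):
    label = triage(count)
    print(count, label, ADJUSTMENTS[label])
```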

## Integration with Other Skills

After search completes:
1. **Save results** to research folder as `initial-search-results.json`
2. **For each paper**, call `evaluating-paper-relevance` skill
3. **Track in** `papers-reviewed.json` (use DOI as key, fallback to PMID)
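
A minimal sketch of steps 1 and 3, using the file names given above. The layout of `papers-reviewed.json` (a JSON object keyed by identifier, with a `status` field) is an assumption, since the tracking schema is not defined here. The function names are hypothetical.

```python
from __future__ import annotations

import json
from pathlib import Path


def save_results(papers: list[dict], folder: str | Path) -> Path:
    """Write the paper list to initial-search-results.json in the research folder."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "initial-search-results.json"
    path.write_text(json.dumps(papers, indent=2), encoding="utf-8")
    return path


def track_papers(papers: list[dict], folder: str | Path, status: str = "pending") -> dict:
    """Add papers to papers-reviewed.json, keyed by DOI with PMID as fallback."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "papers-reviewed.json"
    reviewed = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    for paper in papers:
        key = paper.get("doi") or paper["pmid"]
        # setdefault keeps any existing entry, so earlier review state is not overwritten.
        reviewed.setdefault(key, {"pmid": paper["pmid"], "status": status})
    path.write_text(json.dumps(reviewed, indent=2), encoding="utf-8")
    return reviewed
```

Calling `save_results(papers, folder)` and then `track_papers(papers, folder)` after a search leaves both files in the research folder. Re-running the tracking step on the same papers does not reset entries that already exist.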

## Quick Reference

| Task | Command |
|------|---------|
| Search PubMed | `curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=QUERY&retmax=100&retmode=json"` |
| Get metadata | `curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=PMID1,PMID2&retmode=json"` |
| URL encode query | Replace spaces with `+`, special chars with `%XX` |
| Narrow results | Use AND, add field tags, more specific terms |
| Broaden results | Use OR, remove field tags, add synonyms |

## Common Mistakes

- **Too narrow:** Only 5 results → Use OR, remove constraints
- **Too broad:** 5000 results → Add AND terms, use field tags
- **Missing abstracts:** Use efetch instead of esummary for full abstract text
- **DOI not found:** Many older papers lack DOI - use PMID as fallback
- **Rate limiting:** Add 500ms delays (single agent) or longer (parallel subagents sharing rate limit)
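
For the missing-abstracts case, a hedged sketch of an efetch call is shown below. It assumes the XML layout that efetch returns for PubMed (`PubmedArticle` elements containing `MedlineCitation/PMID` and one or more `AbstractText` elements). The function names are hypothetical, and the embedded XML is an illustrative fragment, not real data.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(pmids: list[str]) -> str:
    """Build an efetch URL requesting XML records for the given PMIDs."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return f"{EFETCH_URL}?{urlencode(params, safe=',')}"


def parse_abstracts(xml_text: str) -> dict[str, str]:
    """Map each PMID to its abstract text, joining structured sections."""
    abstracts = {}
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        # itertext() also collects text inside inline markup such as <i>.
        parts = ["".join(node.itertext()).strip() for node in article.iter("AbstractText")]
        abstracts[pmid] = " ".join(part for part in parts if part)
    return abstracts


sample_xml = (
    "<PubmedArticleSet><PubmedArticle><MedlineCitation>"
    '<PMID Version="1">12345678</PMID><Article><Abstract>'
    '<AbstractText Label="BACKGROUND">First.</AbstractText>'
    '<AbstractText Label="RESULTS">Second <i>part</i>.</AbstractText>'
    "</Abstract></Article></MedlineCitation></PubmedArticle></PubmedArticleSet>"
)
print(efetch_url(["12345678", "87654321"]))
print(parse_abstracts(sample_xml))
```

Records without an abstract map to an empty string, which makes missing abstracts easy to detect downstream.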

## Next Steps

After completing search:
- Announce: "Found N papers matching query"
- Begin evaluation using `skills/research/evaluating-paper-relevance`
- Update user with progress as papers are screened

Overview

This skill performs targeted PubMed searches with keyword optimization, result parsing, and metadata extraction to produce a clean initial set of papers for review. It balances broad coverage with focused filters so you find relevant literature without overwhelming volume. Outputs are ready for downstream relevance evaluation and tracking.

How this skill works

It parses the user query to extract keywords, data types (e.g., IC50, methods), constraints (date range, journals, authors), and synonyms. It builds PubMed queries using Boolean operators and field tags, executes searches via the NCBI E-utilities, retrieves PMIDs, and fetches metadata (title, authors, journal, date, abstract, DOI). Results are returned as structured paper objects and saved for follow-up processing.

When to use it

  • Starting a new literature review or research question
  • User asks “find papers about X” or requests evidence for a claim
  • Gathering an initial corpus of papers for manual or automated screening
  • Searching for specific methods, compounds, diseases, or metrics (e.g., IC50)
  • Preparing inputs for downstream relevance-evaluation or data-extraction skills

Best practices

  • Extract synonyms and include OR groups to avoid missing relevant papers
  • Start with retmax=100 and iterate: narrow with AND or field tags if there are too many results
  • Respect PubMed rate limits; use ~500 ms delay for single-agent scripts
  • If abstracts are missing, use efetch instead of esummary to retrieve full abstracts
  • Use DOI where available; fall back to PMID for older papers or missing DOIs

Example use cases

  • Find recent and classic papers on BTK inhibitors and selectivity, collecting IC50 reports
  • Compile a set of methods papers for a protocol comparison across years
  • Search for clinical trials and safety data on a specific compound or drug class
  • Create an initial corpus for systematic review screening with metadata and identifiers
  • Resolve DOIs for a list of PMIDs to prepare citation lists or DOI-based lookups

FAQ

What if the search returns too many results?

Narrow using AND terms, field tags like [Title/Abstract], date ranges, or split the topic into sub-queries.

How do you handle rate limits?

Use 500 ms delays per request for single-agent scripts. For parallel workers, lengthen the delays so total throughput stays within limits. On HTTP 429, wait 5 seconds, then resume with doubled delays.