
finding-open-access-papers skill


This skill finds free full-text versions of paywalled papers using Unpaywall, helping you access articles quickly.

npx playbooks add skill kthorn/research-superpower --skill finding-open-access-papers

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: Finding Open Access Papers
description: Use Unpaywall API to find free full-text versions of paywalled papers
when_to_use: When paper behind paywall. When PMC full text not available. When DOI resolution hits paywall. When need free access to paper. Before giving up on full text access.
version: 1.0.0
---

# Finding Open Access Papers

## Overview

Use Unpaywall to find legally available open access versions of papers that appear to be behind paywalls.

**Core principle:** Many paywalled papers have free versions (preprints, author manuscripts, institutional repositories). Unpaywall finds them.

## When to Use

Use this skill when:
- DOI resolution hits a paywall
- Paper not available in PubMed Central
- Publisher site requires subscription
- Need full text for highly relevant paper (score ≥7)

**Use BEFORE giving up on full text access**

## Unpaywall API

**Simple REST API - no authentication required for reasonable usage**

### Basic Request

```bash
curl "https://api.unpaywall.org/v2/DOI?email=YOUR_EMAIL"
```

**Parameters:**
- `DOI` - The paper's DOI (URL-encoded if needed)
- `email` - User's email (required, for courtesy/contact)

**IMPORTANT: Ask the user for their email at the start of the research session. Do NOT use placeholder or made-up email addresses.**

**Example:**
```bash
curl "https://api.unpaywall.org/v2/10.1038/nature12373?email=YOUR_EMAIL"
```
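The request URL can also be assembled programmatically, which takes care of the "URL-encoded if needed" caveat for DOIs that contain characters such as `<`, `>`, or `#`. The following is a minimal sketch using only the Python standard library; `build_unpaywall_url` and `UNPAYWALL_BASE` are hypothetical names introduced here, not part of Unpaywall.

```python
from urllib.parse import quote, urlencode

# Base URL taken from the curl examples above.
UNPAYWALL_BASE = "https://api.unpaywall.org/v2"


def build_unpaywall_url(doi: str, email: str) -> str:
    """Return the Unpaywall lookup URL for a DOI (hypothetical helper)."""
    if not email or "@" not in email:
        raise ValueError("a real contact email is required")
    # Keep "/" literal (every DOI contains one) but percent-encode characters
    # such as "<", ">", "#", "?" and spaces that would otherwise break the URL.
    encoded_doi = quote(doi.strip(), safe="/")
    return f"{UNPAYWALL_BASE}/{encoded_doi}?{urlencode({'email': email})}"
```

Call it with the DOI and the email the user supplied at the start of the session, then pass the result to `curl` or an HTTP client.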

### Response Format

```json
{
  "doi": "10.1038/nature12373",
  "title": "Paper Title",
  "is_oa": true,
  "best_oa_location": {
    "url": "https://europepmc.org/articles/pmc3858213",
    "url_for_pdf": "https://europepmc.org/articles/pmc3858213?pdf=render",
    "version": "publishedVersion",
    "license": "cc-by",
    "host_type": "repository"
  },
  "oa_locations": [
    {
      "url": "https://europepmc.org/articles/pmc3858213",
      "version": "publishedVersion"
    },
    {
      "url": "https://arxiv.org/abs/1234.5678",
      "version": "submittedVersion"
    }
  ]
}
```

## Key Response Fields

**`is_oa`** (boolean)
- `true` - Open access version available
- `false` - No free version found

**`best_oa_location`** (object or null)
- Unpaywall's recommended best open access source
- Prioritizes published versions over preprints
- Includes PDF URL when available

**`oa_locations`** (array)
- All known open access locations
- Includes repositories, preprint servers, institutional sites
- Ordered by quality/version

**`version`** types:
- `publishedVersion` - Final published version (best)
- `acceptedVersion` - Author's accepted manuscript (good)
- `submittedVersion` - Preprint before peer review (useful)

## Implementation Pattern

### 1. Check Unpaywall After Paywall Hit

```bash
# Try DOI first
curl -L "https://doi.org/10.1234/example.2023"

# If paywall detected (403, subscription required, etc):
curl "https://api.unpaywall.org/v2/10.1234/example.2023?email=YOUR_EMAIL"
```

### 2. Extract Best URL

```bash
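# Placeholders: set these to the paper's DOI and the user's real email
doi="DOI"
email="EMAIL"

# Fetch the JSON response
response=$(curl -s "https://api.unpaywall.org/v2/${doi}?email=${email}")

# Check if OA available (quote the variable so the JSON is passed through intact)
is_oa=$(printf '%s' "$response" | jq -r '.is_oa')

if [ "$is_oa" = "true" ]; then
  # Get best PDF URL, falling back to the landing page URL
  pdf_url=$(printf '%s' "$response" | jq -r '.best_oa_location.url_for_pdf // .best_oa_location.url')

  # Download (create the target folder first)
  mkdir -p papers
  curl -L -o "papers/paper.pdf" "$pdf_url"
fi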
```

### 3. Report to User

**When OA found:**
```
⚠️ Paper behind paywall at publisher
✓ Found open access version via Unpaywall!
   Source: Europe PMC (published version)
   PDF: https://europepmc.org/articles/pmc3858213?pdf=render
   → Downloading...
```

**When no OA found:**
```
⚠️ Paper behind paywall at publisher
✗ No open access version found via Unpaywall
   Options:
   - Request via institutional access
   - Contact authors for preprint
   - Continue with abstract only
```
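The two report templates above could be produced by one small formatter. This is a sketch only; `format_oa_report` is a hypothetical helper, and its arguments are assumed to be the PDF URL, version, and source of the chosen location.

```python
from typing import Optional


def format_oa_report(pdf_url: Optional[str], version: str = "unknown",
                     source: str = "unknown") -> str:
    """Build the user-facing message for a paywalled paper (hypothetical helper)."""
    lines = ["⚠️ Paper behind paywall at publisher"]
    if pdf_url:
        lines += [
            "✓ Found open access version via Unpaywall!",
            f"   Source: {source} ({version})",
            f"   PDF: {pdf_url}",
        ]
    else:
        # Same fallback options as the "no OA found" template.
        lines += [
            "✗ No open access version found via Unpaywall",
            "   Options:",
            "   - Request via institutional access",
            "   - Contact authors for preprint",
            "   - Continue with abstract only",
        ]
    return "\n".join(lines)
```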

### 4. Prioritize by Version

If multiple locations available:

**Priority order:**
1. `publishedVersion` from publisher or PMC
2. `acceptedVersion` from institutional repository
3. `submittedVersion` from preprint server (arXiv, bioRxiv)
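One way to apply this priority order to the `oa_locations` array is to rank each entry by its `version` and take the lowest rank. The sketch below assumes each entry is a dict with the fields shown in the sample response; `VERSION_RANK` and `pick_best_location` are hypothetical names, and preferring entries with a direct PDF link on ties is a design choice of this sketch rather than something Unpaywall prescribes.

```python
from typing import Any, Optional

# Lower rank = preferred, following the priority order listed above.
VERSION_RANK = {"publishedVersion": 0, "acceptedVersion": 1, "submittedVersion": 2}


def pick_best_location(locations: list[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Choose one entry from `oa_locations` by version, preferring PDFs on ties."""
    # Ignore entries that carry no usable link at all.
    usable = [loc for loc in locations if loc.get("url_for_pdf") or loc.get("url")]
    if not usable:
        return None
    return min(
        usable,
        key=lambda loc: (
            # Unknown or missing versions sort after the three known ones.
            VERSION_RANK.get(loc.get("version"), len(VERSION_RANK)),
            0 if loc.get("url_for_pdf") else 1,
        ),
    )
```

The same function can serve as a fallback when `best_oa_location` is missing from a response.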

## Integration with evaluating-paper-relevance

**Add to full text fetching workflow:**

```
Stage 2: Fetch Full Text

Try in order:
A. PubMed Central (free full text)
B. DOI resolution → If paywall, try Unpaywall
C. Unpaywall direct lookup
D. Preprints (bioRxiv, arXiv)
```

**Updated workflow:**

```bash
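# Outline only: has_pmc_fulltext, fetch_pmc, is_paywall, has_oa,
# fetch_unpaywall_pdf, and report_no_fulltext stand for functions you define.

# 1. Try PMC
pmc_result=$(curl -s "https://eutils.ncbi.nlm.nih.gov/...")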
if has_pmc_fulltext; then
  fetch_pmc
  exit 0
fi

# 2. Try DOI
doi_result=$(curl -L "https://doi.org/$doi")
if is_paywall; then
  # 3. Try Unpaywall
  unpaywall_result=$(curl "https://api.unpaywall.org/v2/$doi?email=$EMAIL")
  if has_oa; then
    fetch_unpaywall_pdf
    exit 0
  fi
fi

# 4. No full text available
report_no_fulltext
```
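The same "try each source in order" logic can be expressed in Python as a chain of fetchers. This is a sketch under assumptions: `fetch_full_text` and the `Fetcher` type are hypothetical, and the fetchers in the usage lines are stand-ins, not real PMC or Unpaywall clients.

```python
from typing import Callable, Iterable, Optional

# Each fetcher takes a DOI and returns a local path or URL, or None if it found nothing.
Fetcher = Callable[[str], Optional[str]]


def fetch_full_text(doi: str,
                    fetchers: Iterable[tuple[str, Fetcher]]) -> tuple[Optional[str], Optional[str]]:
    """Try each (name, fetcher) pair in order; return (source_name, result)."""
    for name, fetch in fetchers:
        try:
            result = fetch(doi)
        except Exception as exc:  # one failing source should not stop the chain
            print(f"{name} failed for {doi}: {exc}")
            continue
        if result:
            return name, result
    return None, None  # caller reports that no full text is available


# Usage with stand-in fetchers (illustration only).
source, location = fetch_full_text(
    "10.1234/example.2023",
    [
        ("PMC", lambda doi: None),                       # pretend PMC has no full text
        ("Unpaywall", lambda doi: "papers/example.pdf"), # pretend Unpaywall found a PDF
    ],
)
print(source, location)
```

Keeping the order in a list makes it easy to add a preprint-server step later without touching the control flow.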

## Rate Limiting

**Free tier (with email):**
- 100,000 requests per day
- No hard rate limit, but be respectful
- Include email in requests (required)

**Best practices:**
- Add 100ms delay between requests
- Cache responses (don't re-check same DOI)
- Only check for papers you actually need
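The delay and caching advice can be combined in a small wrapper around whatever lookup function is in use. The sketch below is an assumption-laden illustration: `ThrottledCache` is a hypothetical class, the cache is in-memory only, and the `sleep` and `clock` parameters exist so the behavior can be tested without real waiting.

```python
import time
from typing import Any, Callable, Optional


class ThrottledCache:
    """Cache lookups per DOI and space out the calls that do reach the network."""

    def __init__(self, lookup: Callable[[str], Any], min_interval: float = 0.1,
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup
        self._min_interval = min_interval  # 100 ms, as suggested above
        self._sleep = sleep
        self._clock = clock
        self._cache: dict[str, Any] = {}
        self._last_call: Optional[float] = None

    def get(self, doi: str) -> Any:
        key = doi.strip().lower()  # DOIs are case-insensitive
        if key in self._cache:
            return self._cache[key]  # never re-check the same DOI
        if self._last_call is not None:
            wait = self._min_interval - (self._clock() - self._last_call)
            if wait > 0:
                self._sleep(wait)
        result = self._lookup(doi)
        self._last_call = self._clock()
        self._cache[key] = result
        return result
```

Wrapping a lookup as `ThrottledCache(lambda d: find_open_access(d, email))` would throttle and cache a helper with that signature; persisting the cache to disk is left out of this sketch.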

## Python Helper Example

```python
import os
import time

import requests

def find_open_access(doi, email):
    """
    Find open access version via Unpaywall
    Returns: (pdf_url, version, source) or (None, None, None)
    """
    url = f"https://api.unpaywall.org/v2/{doi}"
    params = {"email": email}

    try:
        response = requests.get(url, params=params, timeout=10)
        response.raise_for_status()
        data = response.json()

        if not data.get('is_oa'):
            return None, None, None

        best_loc = data.get('best_oa_location')
        if not best_loc:
            return None, None, None

        # Prefer a direct PDF link; fall back to the landing page URL
        pdf_url = best_loc.get('url_for_pdf') or best_loc.get('url')
        version = best_loc.get('version') or 'unknown'
        source = best_loc.get('host_type') or 'unknown'

        return pdf_url, version, source

    except (requests.RequestException, ValueError) as e:
        # Network/HTTP failures and invalid JSON are both treated as "no result"
        print(f"Error checking Unpaywall for {doi}: {e}")
        return None, None, None

# Usage
doi = "10.1038/nature12373"
email = "YOUR_EMAIL"  # replace with the user's real email address
pdf_url, version, source = find_open_access(doi, email)

if pdf_url:
    print(f"Found {version} at {source}")
    print(f"PDF: {pdf_url}")
    # Download PDF (the URL may be a landing page if no direct PDF link exists)
    os.makedirs("papers", exist_ok=True)
    response = requests.get(pdf_url, timeout=30)
    response.raise_for_status()
    with open(f'papers/{doi.replace("/", "_")}.pdf', 'wb') as f:
        f.write(response.content)
else:
    print("No open access version found")

time.sleep(0.1)  # Rate limiting: pause before the next lookup in a batch
```

## Common Sources Found

**Repositories:**
- Europe PMC / PubMed Central
- Institutional repositories (university sites)
- PubMed Central international mirrors

**Preprint servers:**
- bioRxiv (biology)
- medRxiv (medicine)
- arXiv (physics, CS, math)
- ChemRxiv (chemistry)

**Publisher sites:**
- Open access journals
- Hybrid journals (OA articles in subscription journals)
- Delayed open access (embargo expired)

## Error Handling

**DOI not found:**
```json
{
  "error": true,
  "message": "DOI not found"
}
```
→ Check DOI format, try alternative identifiers
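A quick format check before calling the API can catch the most common cause of this error: a DOI pasted with a resolver prefix or stray whitespace. The sketch below is a heuristic, not an official DOI validator; `normalize_doi`, `DOI_PATTERN`, and `DOI_PREFIXES` are hypothetical names, and the pattern is deliberately loose.

```python
import re
from typing import Optional

# Loose heuristic: "10.", a 4-9 digit registrant code, "/", then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> Optional[str]:
    """Strip common prefixes and return the bare DOI, or None if it looks malformed."""
    candidate = raw.strip()
    lowered = candidate.lower()
    for prefix in DOI_PREFIXES:
        if lowered.startswith(prefix):
            candidate = candidate[len(prefix):].strip()
            break
    return candidate if DOI_PATTERN.match(candidate) else None
```

A `None` result suggests fixing the identifier first rather than spending a request on it.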

**Network errors:**
- Retry with exponential backoff
- Maximum 3 attempts
- Report to user if all fail

**Malformed response:**
- Check for `is_oa` field
- Fallback to `oa_locations` array if `best_oa_location` missing
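The retry policy for network errors (exponential backoff, at most 3 attempts) might look like the sketch below. `with_retries` is a hypothetical helper; the 1-second base delay is an assumption, and the injectable `sleep` is there so the policy can be exercised without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(action: Callable[[], T], attempts: int = 3, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `action`, retrying on I/O-type errors with doubling delays."""
    for attempt in range(attempts):
        try:
            return action()
        except OSError as exc:
            # OSError covers ConnectionError and TimeoutError; the exceptions
            # raised by `requests` also derive from it. Note this includes HTTP
            # errors from raise_for_status(), so handle "DOI not found" inside
            # `action` if it should not be retried.
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller report the failure
            delay = base_delay * (2 ** attempt)  # 1s, then 2s, ...
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
    raise ValueError("attempts must be at least 1")
```

After the final failure the exception propagates, which is the point at which the "report to user" step applies.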

## Quick Reference

| Task | Command |
|------|---------|
| Check if OA available | `curl "https://api.unpaywall.org/v2/DOI?email=EMAIL"` |
| Get best PDF URL | Parse `.best_oa_location.url_for_pdf` |
| List all OA sources | Parse `.oa_locations[]` |
| Check version type | Look at `.version` field |
| Download PDF | `curl -L -o paper.pdf "$pdf_url"` |

## Integration Points

**Called by:**
- `evaluating-paper-relevance` - When full text not in PMC
- `answering-research-questions` - For highly relevant papers

**Updates:**
- `papers-reviewed.json` - Note if OA found
- `SUMMARY.md` - Include OA source info

## Common Mistakes

- **Using placeholder email:** Using a made-up or placeholder address → Ask the user for their real email
- **Not including email:** `email` is a required parameter; requests without it will fail
- **Checking every paper:** Only check when needed (score ≥7, no PMC)
- **Ignoring version type:** Published version is better than a preprint
- **Single source only:** Check the `oa_locations` array for alternatives
- **No rate limiting:** Add delays even though there is no hard limit

## Success Criteria

Successful when:
- Paywalled paper's OA version found and downloaded
- Version type recorded (published/accepted/submitted)
- User informed about source and version
- Fallback options provided if no OA available

## Next Steps

After finding OA version:
- Download PDF to papers/ folder
- Note source and version in SUMMARY.md
- Continue with deep dive analysis
- If no OA: note in summary, continue with abstract only

Overview

This skill uses the Unpaywall API to locate legally available open access copies of papers that appear paywalled. It helps recover published versions, accepted manuscripts, or preprints so you can obtain full text without subscription access. It is designed to plug into a full-text fetching workflow and report source, version, and PDF URL.

How this skill works

Given a DOI, the skill queries the Unpaywall REST API (email parameter required) and inspects the JSON response fields like is_oa, best_oa_location, and oa_locations. If an open access location is found it prioritizes publishedVersion then acceptedVersion then submittedVersion, extracts a PDF or landing-page URL, and optionally downloads the file to a local folder. The skill includes simple rate-limiting, caching recommendations, and error handling for missing DOIs or network failures.

When to use it

  • After DOI resolution triggers a paywall or subscription prompt
  • When PubMed Central does not provide full text
  • Before contacting authors or requesting interlibrary loan
  • When a paper is highly relevant and you need the full text (e.g., relevance score ≥7)
  • As part of an automated full-text fetching stage in a research pipeline

Best practices

  • Ask the user for their real email at the start of the session and include it in all Unpaywall requests (do not use placeholders).
  • Cache Unpaywall responses and add a small delay (e.g., 100 ms) between requests to be respectful of the service.
  • Prefer best_oa_location.url_for_pdf when available; fall back to best_oa_location.url or entries in oa_locations.
  • Treat publishedVersion as highest quality, then acceptedVersion, then submittedVersion; record version and host_type with downloads.
  • Retry transient network errors with exponential backoff (max 3 attempts) and report failures clearly to the user.

Example use cases

  • You attempted DOI resolution and hit a subscription wall; this skill finds a Europe PMC published PDF and downloads it.
  • Automated pipeline checks for full text: PMC → DOI resolution → Unpaywall; if OA found, pipeline saves PDF and notes version.
  • Manual research: user supplies a DOI and email; skill returns best OA URL, version, and host so researcher can access full text.
  • Batch processing: script checks a shortlist of high-priority DOIs (score ≥7), caches results, and downloads available OA PDFs.

FAQ

Do I need an API key or authentication?

No API key is required, but you must include a valid email address in requests for courtesy and rate-tracking.

What happens if Unpaywall reports is_oa false?

If is_oa is false, no OA copies were found; fallback options include institutional access, contacting authors, or using the abstract only.

How should I handle multiple OA locations?

Prefer the best_oa_location (publishedVersion or PDF URL). If missing, inspect oa_locations and choose by version priority: published > accepted > submitted.