
semantic-scholar skill

/src/tools/semantic-scholar

This skill searches academic papers via the Semantic Scholar API and returns structured notes, with full text when available.

npx playbooks add skill bdambrosio/cognitive_workbench --skill semantic-scholar

Review the files below or copy the command above to add this skill to your agents.

Skill.md
---
name: semantic-scholar
type: python
description: "Search academic papers. Returns Collection of JSON Notes with fields text (full paper text via GROBID when PDF available, otherwise abstract), metadata.title, metadata.authors, metadata.year, metadata.citations, metadata.uri (alias: pdf_url), metadata.venue"
---

# semantic-scholar

Search academic papers using the Semantic Scholar API. Returns a Collection of structured Notes, with full paper text when a PDF is available.

## Input

- `query`: Query string (e.g., "attention mechanisms in neural networks")
- `limit`: Optional result limit (int, default: 10)

## Output

Success (`status: "success"`):
- `resource_id`: Collection ID containing structured Notes, each with:
  - `text`: Full paper text (via GROBID) or abstract
  - `format`: "paper"
  - `metadata.title`: Paper title
  - `metadata.authors`: List of authors
  - `metadata.year`: Publication year
  - `metadata.citations`: Citation count
  - `metadata.uri`: PDF URL (may be null for paywalled papers)
  - `metadata.venue`: Conference/journal name
  - `char_count`: Character count

## Behavior

- When GROBID is configured and a PDF is available, `text` contains the full paper content
- Otherwise `text` contains the abstract
- Requires `SEMANTIC_SCHOLAR_API_KEY` environment variable
- Requires `grobid_url` in YAML config for full text extraction
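
The fallback rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the skill's actual code: the function names and the `full_text` / `abstract` arguments are hypothetical, and only `SEMANTIC_SCHOLAR_API_KEY` and `grobid_url` come from this document.

```python
import os
from typing import Mapping, Optional


def api_key_present(env: Mapping[str, str] = os.environ) -> bool:
    """True when SEMANTIC_SCHOLAR_API_KEY is set to a non-empty value."""
    return bool(env.get("SEMANTIC_SCHOLAR_API_KEY"))


def full_text_possible(grobid_url: Optional[str], pdf_url: Optional[str]) -> bool:
    """Full text needs both a configured GROBID endpoint and a PDF link."""
    return bool(grobid_url) and bool(pdf_url)


def select_text(full_text: Optional[str], abstract: Optional[str]) -> str:
    """Prefer extracted full text; otherwise fall back to the abstract."""
    return full_text or abstract or ""
```

Downstream steps can treat `text` the same way in both cases; the only difference is how much of the paper it covers.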

## Content Structure

Each Note in the returned Collection has the following JSON structure:
```json
{
  "text": "Full paper text or abstract...",
  "format": "paper",
  "metadata": {
    "title": "Paper Title",
    "authors": ["Author 1", "Author 2"],
    "year": 2023,
    "citations": 150,
    "uri": "https://example.com/paper.pdf",
    "venue": "NeurIPS"
  },
  "char_count": 5000
}
```

**Important:** All result data is in the Note's `content` field (a dict). Engine metadata (creation date, source tool, etc.) is separate and accessed via `get_resource_metadata()`, not via `content['metadata']`.
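
As a rough illustration of reading these fields from Python, the sketch below parses the sample structure above and pulls values out of the `content` dict. The helper names `describe` and `has_pdf` are hypothetical and are not part of the skill or the engine.

```python
import json

# Sample Note content, copied from the JSON structure shown above.
note_content = json.loads("""
{
  "text": "Full paper text or abstract...",
  "format": "paper",
  "metadata": {
    "title": "Paper Title",
    "authors": ["Author 1", "Author 2"],
    "year": 2023,
    "citations": 150,
    "uri": "https://example.com/paper.pdf",
    "venue": "NeurIPS"
  },
  "char_count": 5000
}
""")


def describe(content: dict) -> str:
    """Build a one-line citation-style summary from content['metadata']."""
    meta = content["metadata"]
    authors = ", ".join(meta["authors"])
    return (
        f'{meta["title"]} ({meta["year"]}, {meta["venue"]}) '
        f'by {authors}; {meta["citations"]} citations'
    )


def has_pdf(content: dict) -> bool:
    """metadata.uri may be null (None in Python) for paywalled papers."""
    return content["metadata"].get("uri") is not None
```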

## Key Principle

**Results already contain the paper content in the `text` field** (full text when GROBID extraction was possible, otherwise the abstract). Use extract/synthesize directly on the Collection; do NOT project `metadata.uri` for fetching. The URI is a PDF link for reference only; the text content is already loaded.

## Common Workflows

**Direct synthesis (preferred):**
```json
{"type":"semantic-scholar","query":"BERT model","out":"$papers"}
{"type":"synthesize","target":"$papers","focus":"key contributions of BERT","out":"$summary"}
```

**Per-paper extraction then synthesis:**
```json
{"type":"semantic-scholar","query":"attention mechanisms","out":"$papers"}
{"type":"map","target":"$papers","operation":"extract","instruction":"Extract the main architectural innovation","out":"$innovations"}
{"type":"synthesize","target":"$innovations","focus":"comparison of approaches","out":"$report"}
```

**Filter by year then analyze:**
```json
{"type":"filter-structured","target":"$papers","where":"metadata.year > 2020","out":"$recent_papers"}
{"type":"synthesize","target":"$recent_papers","focus":"recent advances","out":"$summary"}
```

**Extract paper metadata:**
```json
{"type":"project","target":"$papers","fields":["metadata.title","metadata.year","metadata.citations"],"out":"$paper_info"}
```
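
For readers who want to see what the filter and project steps do to the data, here is a plain-Python sketch over a list of Note `content` dicts. It only mirrors the intent of `filter-structured` and `project`; the engine's real implementation is not shown in this document, and `get_path`, `filter_by_year`, and `project` are hypothetical helpers.

```python
from typing import Any, Dict, List


def get_path(content: Dict[str, Any], path: str) -> Any:
    """Resolve a dotted path such as 'metadata.year'; missing keys give None."""
    value: Any = content
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def filter_by_year(notes: List[Dict[str, Any]], after: int) -> List[Dict[str, Any]]:
    """Keep notes whose metadata.year is strictly greater than `after`."""
    kept = []
    for note in notes:
        year = get_path(note, "metadata.year")
        if year is not None and year > after:
            kept.append(note)
    return kept


def project(notes: List[Dict[str, Any]], fields: List[str]) -> List[Dict[str, Any]]:
    """Return lightweight views containing only the requested dotted fields."""
    return [{field: get_path(note, field) for field in fields} for note in notes]
```

Notes with a missing year are dropped by `filter_by_year` in this sketch; whether the engine does the same is an assumption.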

Overview

This skill searches academic papers via the Semantic Scholar API and returns a Collection of structured Notes containing paper text (full text when available) and rich metadata. It is designed to feed downstream extraction and synthesis workflows with ready-to-use paper content. Use it to gather papers, analyze contributions, or build literature summaries quickly.

How this skill works

The skill sends a user-provided query and optional limit to Semantic Scholar, then returns a Collection where each Note contains text (full paper via GROBID when a PDF is available, otherwise the abstract) plus metadata fields like title, authors, year, citations, PDF URL, and venue. It requires SEMANTIC_SCHOLAR_API_KEY in the environment and a grobid_url in the configuration if you want full-text extraction. Results are already loaded in Note.content.text, so downstream steps should operate on that text rather than fetching metadata.uri.
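
To make the flow more concrete, the sketch below builds a search request against the public Semantic Scholar Graph API and maps one result into the Note shape described on this page. Which endpoint and fields the skill really uses is an assumption. The mapping function, the abstract-only `text`, and `char_count = len(text)` are likewise illustrative guesses, and the GROBID step is omitted.

```python
import urllib.parse
import urllib.request

# Public paper search endpoint; the skill's actual endpoint is an assumption.
SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
FIELDS = "title,abstract,authors,year,citationCount,venue,openAccessPdf"


def build_search_request(query: str, limit: int = 10, api_key: str = "") -> urllib.request.Request:
    """Prepare (but do not send) a search request with the API key header."""
    params = urllib.parse.urlencode({"query": query, "limit": limit, "fields": FIELDS})
    headers = {"x-api-key": api_key} if api_key else {}
    return urllib.request.Request(f"{SEARCH_URL}?{params}", headers=headers)


def to_note_content(paper: dict) -> dict:
    """Map one API result to the Note content layout (hypothetical mapping)."""
    pdf = paper.get("openAccessPdf") or {}
    text = paper.get("abstract") or ""  # abstract fallback; no GROBID here
    return {
        "text": text,
        "format": "paper",
        "metadata": {
            "title": paper.get("title"),
            "authors": [a.get("name") for a in paper.get("authors") or []],
            "year": paper.get("year"),
            "citations": paper.get("citationCount"),
            "uri": pdf.get("url"),
            "venue": paper.get("venue"),
        },
        "char_count": len(text),
    }
```

Passing the request to `urllib.request.urlopen` would perform the actual call; that step is left out so the sketch runs without network access.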

When to use it

  • Collect a batch of papers for literature review or surveys
  • Extract key contributions, methods, or results across papers
  • Preprocess corpora for model training or citation analysis
  • Filter papers by year, venue, or citation count before synthesis
  • Quickly assemble inputs for automated summarization or meta-analysis

Best practices

  • Prefer direct extraction/synthesis on the returned Collection since Note.content.text already contains full text or abstract
  • Set reasonable limits to avoid excessive API usage; filter results locally by metadata.year or citations
  • Ensure SEMANTIC_SCHOLAR_API_KEY is set and configure grobid_url to enable full-text GROBID extraction
  • Do not rely on metadata.uri for fetching text; metadata.uri is a reference only and may point to paywalled PDFs
  • Project only the fields you need (title, year, citations) when building lightweight views

Example use cases

  • Search for 'attention mechanisms' and synthesize a summary of architectural innovations
  • Collect recent 2021–2024 NLP papers, filter by venue, and extract abstracts for a reading list
  • Build a citation-ranked list of seminal papers on a topic using metadata.citations
  • Map each paper to its primary methodological contribution and produce a comparative report
  • Feed full paper texts into an extraction pipeline to pull datasets, model details, or experimental results
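
The citation-ranked list mentioned above could look like the following in plain Python, assuming the Note `content` dicts have already been read out of the Collection into a list. `rank_by_citations` is a hypothetical helper, and treating a missing citation count as zero is a choice made for this sketch.

```python
from typing import Any, Dict, List, Tuple


def rank_by_citations(notes: List[Dict[str, Any]], top_n: int = 5) -> List[Tuple[str, int]]:
    """Return (title, citations) pairs, most cited first."""

    def citations(note: Dict[str, Any]) -> int:
        # Missing or null counts are treated as 0 in this sketch.
        return note["metadata"].get("citations") or 0

    ranked = sorted(notes, key=citations, reverse=True)
    return [(note["metadata"]["title"], citations(note)) for note in ranked[:top_n]]
```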

FAQ

Do I need to download PDFs to get the full text?

No. When GROBID is configured and a PDF is available, the skill returns full paper text in Note.content.text. You should not fetch the metadata.uri to obtain content.

What configuration is required to get full texts?

Set the SEMANTIC_SCHOLAR_API_KEY environment variable and provide a grobid_url in the skill configuration. Without GROBID or a PDF link, the skill falls back to abstracts.