extract-references skill

/src/tools/extract-references

This skill extracts references from PDFs using GROBID and returns a Collection of Notes, one per reference, each with structured citation metadata.

npx playbooks add skill bdambrosio/cognitive_workbench --skill extract-references

Review the files below or copy the command above to add this skill to your agents.

Skill.md
---
name: extract-references
type: python
description: "Extract bibliography/references from PDF files using GROBID and return a Collection of Notes (one per reference)."
schema_hint: {"path": "string (PDF file path or Note ID)", "grobid_url": "string"}
---

# extract-references

Extract bibliography/references from PDF files using GROBID. Returns a Collection of Notes, where each Note contains structured metadata for one reference (compatible with format-citation).

## Input

- `path`: PDF file path (absolute) or Note ID containing PDF URL/metadata (required)
- `grobid_url`: GROBID server URL (optional; taken from world_config when omitted)

## Output

Success returns:
- `resource_id`: Collection ID containing Notes (one Note per reference)
- Each Note contains:
  - `data`: Structured reference metadata (title, authors, year, venue, doi, url)
  - `metadata`: Source PDF, reference index, raw citation text
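
As a rough illustration of the shape described above, one reference Note might look like the dictionary below. This is only a sketch: the field names under `data` follow the list above, but the key names under `metadata` (`source_pdf`, `reference_index`, `raw_citation`), the sample values, and the `short_label` helper are all assumptions made for illustration, not the skill's actual schema.

```python
# Hypothetical shape of one reference Note. All values are made-up placeholders.
example_note = {
    "data": {
        "title": "An Example Paper Title",
        "authors": ["A. Author", "B. Author"],
        "year": "2020",
        "venue": "Example Journal",
        "doi": "10.1234/example.doi",
        "url": "https://doi.org/10.1234/example.doi",
    },
    "metadata": {
        "source_pdf": "/path/to/paper.pdf",  # assumed key name
        "reference_index": 0,                # assumed key name, 0-based here
        "raw_citation": "A. Author and B. Author. An Example Paper Title. "
                        "Example Journal, 2020.",  # assumed key name
    },
}


def short_label(note: dict) -> str:
    """Build a compact 'First author (year): title' label from a Note's data."""
    data = note["data"]
    first_author = data["authors"][0] if data.get("authors") else "Unknown"
    return f"{first_author} ({data.get('year', 'n.d.')}): {data.get('title', '')}"


print(short_label(example_note))
```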

## Behavior

- Uses GROBID to parse PDF and extract references from `<bibl>` elements
- Creates one Note per reference with structured metadata
- Returns empty Collection if no references found
- Reference Notes are compatible with `format-citation` tool

## Examples

```json
{"type":"extract-references","path":"/path/to/paper.pdf","out":"$refs"}
{"type":"extract-references","path":"$paper_note","out":"$refs"}
{"type":"extract-references","path":"$refs","format-citation":"bibtex","out":"$bibtex"}
```

Overview

This skill extracts bibliography and reference entries from PDF files using a GROBID server and returns a Collection of Notes, one per reference. Each Note contains structured citation metadata (title, authors, year, venue, DOI, URL) plus source metadata (source PDF, reference index, raw citation text) to support downstream citation formatting or linking. The Notes are compatible with the format-citation tool.

How this skill works

The skill sends the PDF (or a Note that contains a PDF URL/metadata) to a GROBID instance, which parses the document and locates <bibl> elements representing bibliography entries. For each extracted reference it builds a Note that includes structured fields and the raw citation text, then stores all Notes in a Collection and returns the Collection ID. If no references are found, an empty Collection is returned.
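
The parsing step can be sketched roughly as follows. The skill's own code is not shown here, so this is a minimal illustration under stated assumptions: that the TEI XML has already been fetched (GROBID exposes a `/api/processReferences` REST endpoint for this), and that each reference arrives as a TEI `biblStruct` element, which is how GROBID's TEI output usually represents a parsed bibliography entry. The function name `parse_references` and the `metadata` key names are hypothetical, and the `url` and raw citation text fields are left out for brevity.

```python
import xml.etree.ElementTree as ET

# GROBID's TEI output uses the standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def _text(element) -> str:
    """Return stripped element text, or an empty string if missing."""
    return element.text.strip() if element is not None and element.text else ""


def parse_references(tei_xml: str, source_pdf: str) -> list:
    """Turn TEI XML into a list of Note-like dicts (hypothetical shape)."""
    root = ET.fromstring(tei_xml)
    notes = []
    for index, bibl in enumerate(root.iterfind(".//tei:biblStruct", TEI_NS)):
        # Prefer the article-level title; fall back to the first title found.
        title_el = bibl.find(".//tei:title[@level='a']", TEI_NS)
        if title_el is None:
            title_el = bibl.find(".//tei:title", TEI_NS)

        # Join the name parts (forename, surname, ...) of each author.
        authors = []
        for pers in bibl.iterfind(".//tei:author/tei:persName", TEI_NS):
            parts = [p.text.strip() for p in pers if p.text and p.text.strip()]
            if parts:
                authors.append(" ".join(parts))

        date_el = bibl.find(".//tei:date[@when]", TEI_NS)
        year = date_el.get("when")[:4] if date_el is not None else ""

        notes.append({
            "data": {
                "title": _text(title_el),
                "authors": authors,
                "year": year,
                "venue": _text(bibl.find("./tei:monogr/tei:title", TEI_NS)),
                "doi": _text(bibl.find(".//tei:idno[@type='DOI']", TEI_NS)),
            },
            # Assumed key names for the provenance fields.
            "metadata": {"source_pdf": source_pdf, "reference_index": index},
        })
    return notes
```

A document with no `biblStruct` elements yields an empty list, mirroring the empty Collection described above.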

When to use it

  • You have a PDF of a research paper and need machine-readable bibliographic records.
  • You want to convert the reference list into Notes for indexing, citation formatting, or enrichment.
  • You need to extract metadata in bulk from PDFs for a literature database or review.
  • You plan to feed structured references into downstream tools like format-citation.
  • You want to attach provenance (source PDF and reference index) to every extracted citation.

Best practices

  • Run a local or dedicated GROBID instance for bulk or frequent processing to reduce latency and rate-limit issues.
  • Provide absolute PDF paths or Note IDs that include direct PDF URLs to avoid download failures.
  • Validate key fields (DOI, year, authors) after extraction and apply heuristics or manual checks for noisy PDFs.
  • Batch PDFs and monitor GROBID logs for parsing errors; retry problematic files against a different GROBID instance if needed.
  • Store the returned Collection ID and Note IDs for traceability and future reformatting or deduplication.
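
The field validation suggested above could be sketched along these lines. It is a heuristic illustration only: the function name `validate_reference`, the DOI pattern, and the plausible year range are assumptions chosen for this example, and the input is assumed to be the `data` dictionary of one reference Note.

```python
import re

# Loose DOI shape check: "10.", a registrant code, a slash, then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def validate_reference(data: dict) -> list:
    """Return a list of human-readable problems found in one reference."""
    problems = []

    # A DOI is optional, but if present it should look like a DOI.
    doi = (data.get("doi") or "").strip()
    if doi and not DOI_PATTERN.match(doi):
        problems.append(f"suspicious DOI: {doi!r}")

    # Year must be four ASCII digits within an assumed plausible range.
    year = str(data.get("year") or "").strip()
    if not (re.fullmatch(r"[0-9]{4}", year) and 1500 <= int(year) <= 2100):
        problems.append(f"missing or implausible year: {year!r}")

    if not data.get("authors"):
        problems.append("no authors extracted")

    return problems
```

References that return a non-empty list are candidates for manual review or reprocessing.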

Example use cases

  • Extract all references from a dissertation PDF to create a searchable citation database.
  • Convert a conference paper's bibliography into structured Notes for import into citation managers.
  • Automate building a literature review by extracting references from a folder of PDFs.
  • Generate BibTeX or other citation formats by feeding the returned Collection into a format-citation tool.
  • Link extracted references to DOIs and external databases for enrichment and deduplication.

FAQ

What input formats are supported?

Input must be an absolute path to a PDF file or the ID of a Note that contains a PDF URL/metadata. Other formats are not supported.

What happens if GROBID fails to parse references?

If GROBID cannot find <bibl> elements or parsing fails, the skill returns an empty Collection. Check logs, PDF quality, or try a different GROBID instance.