
extract-struct skill

/src/tools/extract-struct

This skill extracts structured metadata from academic paper text using an LLM and outputs JSON with title, authors, year, venue, and abstract.

npx playbooks add skill bdambrosio/cognitive_workbench --skill extract-struct

Review the files below or copy the command above to add this skill to your agents.

Skill.md
---
name: extract-struct
type: prompt_augmentation
description: "Extract structured metadata (title, authors, year, venue, abstract) from paper text using an LLM. Use it to convert metadata from online search results into JSON."
---

# extract-struct

Extract structured metadata from academic paper text using LLM analysis.

## Input

- `target`: Note ID or variable containing the full text or first pages of an academic paper

## Output

Returns a JSON Note with:
- `title`: Paper title
- `authors`: List of author names
- `year`: Publication year
- `venue`: Conference/journal if identifiable
- `abstract`: Paper abstract if present
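
A downstream consumer may want to check that a returned Note actually has this shape. The sketch below is illustrative only and is not part of the skill: `normalize_metadata` and `EXPECTED_FIELDS` are hypothetical names, and it assumes that missing fields should become `None`, that a lone author string means one author, and that a numeric year string should become an integer.

```python
import json
from typing import Any

# The five fields listed in the Output section above.
EXPECTED_FIELDS = ("title", "authors", "year", "venue", "abstract")


def normalize_metadata(raw: str) -> dict[str, Any]:
    """Parse a JSON string and coerce it to the five documented fields."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    authors = data.get("authors") or []
    if isinstance(authors, str):
        # Assumption: a single string means a single author.
        authors = [authors]

    year = data.get("year")
    if isinstance(year, str) and year.strip().isdigit():
        # Assumption: years are more useful downstream as integers.
        year = int(year.strip())

    return {
        "title": data.get("title"),
        "authors": [str(a).strip() for a in authors],
        "year": year,
        "venue": data.get("venue"),        # None when not identifiable
        "abstract": data.get("abstract"),  # None when not present
    }
```

Any extra keys the model returns are dropped, so the result always has exactly the five documented fields.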

## Behavior

- Uses an LLM to analyze paper text and extract structured fields
- Handles various paper formats and layouts
- Returns only JSON, no explanation text

## Planning Notes

- Provide full text or first few pages for best results
- Works best with academic papers that have clear title/author sections
- Run `fetch-text` first to get the paper content, then pass the result to this skill

## Example

```json
{"type":"extract-struct","target":"$paper_text","out":"$metadata"}
```

Overview

This skill extracts structured metadata (title, authors, year, venue, abstract) from academic paper text using a large language model. It converts unstructured paper content into clean JSON suitable for databases, search results, or bibliographic tools. The skill focuses on robustness across common paper layouts and noisy text extractions.

How this skill works

The skill analyzes the provided paper text (full text or first pages) with an LLM prompt tuned to identify title, authors, year, venue, and abstract. It handles variations in formatting, headers, and page artifacts, and normalizes the results into a single JSON object. The output is strictly JSON with the fields title, authors (array), year, venue, and abstract when present.
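
The flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the skill's actual implementation: the prompt wording, the `extract_struct` function, and the `fake_llm` stand-in are all hypothetical, and a real model call would replace `fake_llm`.

```python
import json
from typing import Callable

# Hypothetical prompt; the skill's real prompt is not shown in the source.
PROMPT_TEMPLATE = (
    "Extract metadata from the academic paper text below.\n"
    "Return only a JSON object with the keys title, authors (array), "
    "year, venue, and abstract. Use null for fields that are not present.\n\n"
    "PAPER TEXT:\n{text}"
)


def extract_struct(text: str, llm: Callable[[str], str]) -> dict:
    """Send the prompt to `llm` and parse the reply as one JSON object."""
    reply = llm(PROMPT_TEMPLATE.format(text=text))
    # Tolerate stray text around the object by slicing from the
    # first "{" to the last "}" before parsing.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in LLM reply")
    return json.loads(reply[start:end + 1])


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call, used only to make the sketch runnable.
    return ('Here: {"title": "A Study", "authors": ["J. Doe"], '
            '"year": 2021, "venue": null, "abstract": null}')


metadata = extract_struct("A Study\nJ. Doe\n2021", fake_llm)
```

Passing the model in as a callable keeps the parsing logic testable without network access.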

When to use it

  • You have raw paper text from PDFs or scraped pages and need structured citation metadata.
  • You need to convert search result snippets or scraped web metadata into consistent JSON.
  • You are preparing bibliographic records for ingestion into a database or reference manager.
  • You are batch-processing collections of papers for indexing or analysis.

Best practices

  • Provide the first pages or full text; title/author blocks are usually on the first page.
  • If text extraction is noisy, run OCR or a cleaner extraction step first.
  • When possible, include contextual headers or surrounding lines to improve venue/year detection.
  • Validate critical fields (author spelling, year) against a trusted source when high accuracy is required.
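
A few of these practices can be approximated with small helpers before and after calling the skill. The functions below are hypothetical and make loud assumptions: that the text extractor marks page breaks with a form feed character (some PDF-to-text tools do), that light cleanup is enough for the noise at hand, and that 1900 is a sensible lower bound for a publication year.

```python
import datetime
import re


def first_pages(text: str, max_pages: int = 2) -> str:
    """Keep the first `max_pages` pages, assuming form-feed page breaks."""
    pages = text.split("\f")
    return "\f".join(pages[:max_pages])


def clean_text(text: str) -> str:
    """Rejoin words hyphenated across line breaks and collapse runs of spaces."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return re.sub(r"[ \t]+", " ", text)


def plausible_year(year: object) -> bool:
    """Return True when `year` is an int between 1900 and next year."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(year, bool) or not isinstance(year, int):
        return False
    # Assumption: 1900 is an arbitrary lower bound; adjust for older corpora.
    return 1900 <= year <= datetime.date.today().year + 1
```

A failed `plausible_year` check is only a signal to look the record up in a trusted source; it does not say what the correct year is.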

Example use cases

  • Convert the paper text returned by a fetch-text step into a JSON metadata object for each result.
  • Normalize metadata for papers scraped from conference websites before adding to a catalog.
  • Extract abstracts and authors for training dataset curation or literature review tools.
  • Automatically populate fields in a citation manager or research database during batch imports.
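
For the database use cases, one possible way to store extracted records is a small SQLite table. This is only a sketch: the `papers` table, its column layout, and the `store_records` helper are assumptions made for illustration, and the author list is stored as a JSON string for simplicity.

```python
import json
import sqlite3


def store_records(conn: sqlite3.Connection, records: list[dict]) -> int:
    """Insert metadata records into a `papers` table and return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS papers ("
        "title TEXT, authors TEXT, year INTEGER, venue TEXT, abstract TEXT)"
    )
    rows = [
        (
            r.get("title"),
            json.dumps(r.get("authors") or []),  # author list stored as JSON text
            r.get("year"),
            r.get("venue"),      # stored as NULL when missing
            r.get("abstract"),   # stored as NULL when missing
        )
        for r in records
    ]
    conn.executemany("INSERT INTO papers VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM papers").fetchone()[0]


conn = sqlite3.connect(":memory:")
count = store_records(
    conn, [{"title": "A Study", "authors": ["J. Doe"], "year": 2021}]
)
```

A normalized schema with a separate authors table would suit larger catalogs better; the single-table layout just keeps the example short.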

FAQ

What input does the skill expect?

A Note ID or variable containing the paper text; ideally the first pages or full text.

What does the skill return?

A JSON object with title, authors (list), year, venue, and abstract when available.