
project skill

/src/primitives/project

This skill extracts the specified fields (top-level or nested, such as metadata.uri) from each Note in a Collection, returning only the requested fields for structured downstream use.

npx playbooks add skill bdambrosio/cognitive_workbench --skill project

Files (1)
Skill.md
3.0 KB
---
name: project
type: primitive
description: Extract metadata/structured fields from each Note in Collection (SQL SELECT)
---

# Project

## INPUT CONTRACT

- `target`: Collection (variable or ID)
- `fields`: List of field paths (strings, supports dot notation like `metadata.uri`)
- `out`: Variable name

**REQUIREMENTS:**
- Collection MUST contain Notes (not Collections)
- Each Note MUST be a dict/JSON object
- Every field path MUST resolve in each Note (a Note missing any requested field is excluded)

**NOT SUPPORTED:**
- ❌ Single Note as `target` (must be a Collection)
- ❌ Collection of arrays (must be dict Notes)
- ❌ Text parsing (use `refine` tool for LLM-based extraction from text)

## OUTPUT

Returns Collection of Notes, each containing only the requested fields. Notes missing any requested field are excluded.
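
The projection rule above can be sketched in Python. This is only an illustration of the described semantics, not the skill's actual implementation: the Collection is modeled as a plain list of dicts, the function name `project` mirrors the skill but is hypothetical here, and the nested output shape is an assumption.

```python
from typing import Any, Dict, List


def project(notes: List[Any], fields: List[str]) -> List[Dict[str, Any]]:
    """Keep only `fields` from each dict Note; drop Notes missing any field."""
    projected = []
    for note in notes:
        if not isinstance(note, dict):
            continue  # non-dict Notes violate the contract and are skipped
        out: Dict[str, Any] = {}
        complete = True
        for path in fields:
            keys = path.split(".")
            src: Any = note
            for key in keys:
                if not isinstance(src, dict) or key not in src:
                    complete = False
                    break
                src = src[key]
            if not complete:
                break
            # Assumption: the output keeps the nested shape rather than flattening it.
            dst = out
            for key in keys[:-1]:
                dst = dst.setdefault(key, {})
            dst[keys[-1]] = src
        if complete:
            projected.append(out)
    return projected


papers = [
    {"text": "A", "metadata": {"title": "T1", "year": 2023}},
    {"text": "B", "metadata": {"title": "T2"}},  # no year: excluded
    ["not", "a", "dict"],                        # not a dict: excluded
]
result = project(papers, ["metadata.title", "metadata.year"])
print(result)
```

Only the first Note survives, because it is the only one that has both requested fields.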

## CONTENT STRUCTURE

**For JSON Notes, content is a dict with fields:**
- Top-level fields: `text`, `format`, `char_count`
- Nested fields: `metadata.*` (e.g., `metadata.uri`, `metadata.title`, `metadata.year`)

**Example Note content structure (from semantic-scholar/search-web):**
```json
{
  "text": "Full text content...",
  "format": "paper",
  "metadata": {
    "title": "Paper Title",
    "authors": ["Author 1", "Author 2"],
    "year": 2023,
    "uri": "https://example.com/paper.pdf",
    "score": 0.95
  },
  "char_count": 5000
}
```

## FIELD ACCESS EXAMPLES

**Extract single field:**
```json
{"type":"project","target":"$papers","fields":["metadata.title"],"out":"$titles"}
```

**Extract multiple fields:**
```json
{"type":"project","target":"$papers","fields":["metadata.title","metadata.year"],"out":"$paper_info"}
```

**Extract nested metadata fields:**
```json
{"type":"project","target":"$search_results","fields":["metadata.uri","metadata.score"],"out":"$urls"}
```

**Extract top-level and nested fields:**
```json
{"type":"project","target":"$results","fields":["text","metadata.uri","char_count"],"out":"$filtered"}
```

## FAILURE SEMANTICS

**Empty Collection = expected when:**
- No Notes have all requested fields
- Type contract violated (non-dict Notes)

**Empty ≠ error** — indicates no matches, not failure.

**Actual failures:** Invalid target type, missing parameters, or malformed fields list.
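
The distinction between an empty result and an actual failure could be handled along these lines. This is a sketch under stated assumptions: the Collection is modeled as a Python list, `validate_project_args` and `summarize` are hypothetical helpers, and treating an empty `fields` list as malformed is a guess rather than documented behavior.

```python
from typing import Any, List


def validate_project_args(target: Any, fields: Any) -> None:
    """Raise for the conditions listed as actual failures."""
    if target is None or fields is None:
        raise ValueError("missing parameter: target and fields are required")
    if not isinstance(target, list):
        raise TypeError("invalid target type: expected a Collection (a list here)")
    # Assumption: an empty list or non-string entries count as malformed.
    if not isinstance(fields, list) or not fields or not all(
        isinstance(f, str) and f for f in fields
    ):
        raise ValueError("malformed fields list")


def summarize(projected: List[dict]) -> str:
    """An empty projection is a valid outcome, so it is reported rather than raised."""
    if not projected:
        return "no matches"
    return f"{len(projected)} note(s) projected"


validate_project_args([{"metadata": {"uri": "https://example.com"}}], ["metadata.uri"])
print(summarize([]))
print(summarize([{"metadata": {"uri": "https://example.com"}}]))
```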

## REPRESENTATION INVARIANTS

- Note containing JSON array ≠ Collection
- Use `split` to convert array → Collection before projecting
- Projected Notes preserve nested structure (e.g., `metadata.uri` stays as `metadata.uri`)
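
To make the first two invariants concrete, the sketch below models a Note whose content is a JSON array and converts it into a list of dict Notes before any field is selected. `split_array_note` is a hypothetical stand-in for the `split` primitive, whose real behavior is not described here; the nested output shape is likewise an assumption.

```python
import json
from typing import Any, Dict, List


def split_array_note(note_content: str) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for `split`: one JSON-array Note -> list of dict Notes."""
    items = json.loads(note_content)
    if not isinstance(items, list):
        raise TypeError("expected a JSON array")
    return [item for item in items if isinstance(item, dict)]


# A single Note holding an array: not projectable as-is.
array_note = json.dumps([
    {"metadata": {"uri": "https://example.com/a", "score": 0.9}},
    {"metadata": {"uri": "https://example.com/b", "score": 0.7}},
])

collection = split_array_note(array_note)

# After splitting, each element is a dict, so a nested field can be selected.
# Assumption: the projected Note keeps the metadata -> uri nesting.
uris = [
    {"metadata": {"uri": n["metadata"]["uri"]}}
    for n in collection
    if "uri" in n.get("metadata", {})
]
print(uris)
```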

## ANTI-PATTERNS

- ❌ `project(target=$note)` → Must be Collection
- ❌ `project(target=$coll_of_arrays)` → Elements must be dicts
- ❌ `project(target=$results, fields=["extract the author"])` → Use `refine` for text extraction
- ❌ Treating empty result as error → Empty = no matches

## USE CASES

- Extract `metadata.uri` from search results for `fetch-text`
- Extract `metadata.title` and `metadata.year` from papers for filtering
- Extract `metadata.source_id` and `metadata.score` from search results for analysis
- Project specific fields before `join` operations

Overview

This skill extracts structured metadata fields from every Note in a Collection, returning a new Collection where each Note contains only the requested fields. It is designed for JSON/dict Notes and preserves nested structure (e.g., metadata.uri). Notes missing any requested field are excluded rather than causing an error.

How this skill works

Provide a Collection target and a list of field paths (supports dot notation like metadata.uri). The skill inspects each Note (must be a dict/JSON object), selects the requested keys, and emits a Collection of Notes each containing only those fields. If a Note lacks any requested field it is filtered out; an empty result means no Notes matched the projection.

When to use it

  • You need to extract specific metadata (titles, URIs, years, scores) from search or paper results.
  • Preparing a slimmed-down dataset before running joins, aggregations, or downstream transforms.
  • Creating a list of URIs to feed into a fetch-text or downloader step.
  • Filtering out Notes that don't contain required fields for a subsequent pipeline stage.

Best practices

  • Ensure the target is a Collection of dict/JSON Notes; a single Note or a Collection of arrays is not supported.
  • Specify only existing field paths; missing fields cause Notes to be excluded (not an error).
  • Use dot notation for nested metadata (metadata.title, metadata.uri) to preserve structure.
  • Convert arrays to a Collection with split before projecting elements inside arrays.
  • Treat an empty output as a valid 'no matches' result, not a failure.

Example use cases

  • Extract metadata.uri from search results to build a list of links for fetch-text.
  • Project metadata.title and metadata.year from a papers Collection for metadata-driven filtering.
  • Select metadata.source_id and metadata.score to rank or analyze search hits.
  • Trim Notes to {text, metadata.uri, char_count} before joining with another dataset.

FAQ

What happens if some Notes lack one of the requested fields?

Those Notes are excluded from the output; the resulting Collection contains only Notes that have all requested fields.

Can I project fields from a Collection of arrays or plain text?

No. Elements must be dict/JSON objects. Convert arrays to a Collection with split first, or use the refine tool to extract fields from plain text.