
find-similar skill

/exa-core/skills/find-similar

This skill finds content similar to a given URL using AI, helping you discover related articles, papers, and sites quickly.

npx playbooks add skill benjaminjackson/exa-skills --skill find-similar

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md
---
name: exa-find-similar
description: Find web content similar to a given URL using AI-powered similarity matching. Use when you have an example page and want to discover related articles, papers, or websites with similar content, style, or topic.
---

# Exa Find Similar

Token-efficient strategies for finding similar content using exa-ai.

**Use `--help` to see available commands and verify usage before running:**
```bash
exa-ai <command> --help
```

## Critical Requirements

**MUST follow these rules when using exa-ai find-similar:**

### Shared Requirements

This skill inherits requirements from [Common Requirements](../../../docs/common-requirements.md):
- Schema design patterns → All schema operations
- Output format selection → All output operations

### MUST NOT Rules

1. **Avoid the `--text` flag**: Prefer structured output with schemas, or AI summaries, over raw full-page text extraction for better token efficiency (see the sketch below)
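
A minimal sketch of the preferred pattern (the URL is a placeholder):

```bash
# ❌ Avoid: raw full-page text for every result
exa-ai find-similar "https://example.com" --text --num-results 5

# ✅ Prefer: AI-generated summaries instead of raw text
exa-ai find-similar "https://example.com" --summary --num-results 5
```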

## Token Optimization

**Apply these strategies:**

- **Use toon format**: `--output-format toon` for 40% fewer tokens than JSON (use when reading output directly)
- **Use JSON + jq**: Extract only needed fields with jq (use when piping/processing output)
- **Use --summary**: Get AI-generated summaries instead of full page text
- **Use schemas**: Extract structured data with `--summary-schema` (always pipe to jq)
- **Limit results**: Use `--num-results N` to get only what you need

**IMPORTANT**: Choose one approach; don't mix them:
- **Approach 1: toon only** - Compact YAML-like output for direct reading
- **Approach 2: JSON + jq** - Extract specific fields programmatically
- **Approach 3: Schemas + jq** - Get structured data, always use JSON output (default) and pipe to jq

Examples:
```bash
# ❌ High token usage
exa-ai find-similar "https://example.com" --num-results 10

# ✅ Approach 1: toon format for direct reading (60% reduction)
exa-ai find-similar "https://example.com" --num-results 3 --output-format toon

# ✅ Approach 2: JSON + jq for field extraction (90% reduction)
exa-ai find-similar "https://example.com" --num-results 3 | jq -r '.results[].title'

# ❌ Don't mix toon with jq (toon is YAML-like, not JSON)
exa-ai find-similar "https://example.com" --output-format toon | jq -r '.results[].title'
```

## Quick Start

### Basic Similar Search
```bash
exa-ai find-similar "https://anthropic.com/claude" --num-results 5 --output-format toon
```

### Exclude Source Domain
```bash
exa-ai find-similar "https://openai.com/research/gpt-4" \
  --exclude-source-domain \
  --num-results 10
```

### Find Similar with Structured Data
```bash
exa-ai find-similar "https://techcrunch.com/ai-startup-funding" \
  --summary \
  --summary-schema '{"type":"object","properties":{"company_name":{"type":"string"},"funding_amount":{"type":"string"}}}' \
  --num-results 5 | jq -r '.results[].summary | fromjson | "\(.company_name): \(.funding_amount)"'
```

### Category-Specific Search
```bash
exa-ai find-similar "https://arxiv.org/abs/2305.10601" \
  --category "research paper" \
  --num-results 10
```

## Detailed Reference

For complete options, examples, and advanced usage, consult [REFERENCE.md](REFERENCE.md).

### Shared Requirements

<shared-requirements>

## Schema Design

### MUST: Use object wrapper for schemas

**Applies to**: answer, search, find-similar, get-contents

When using schema parameters (`--output-schema` or `--summary-schema`), always wrap properties in an object:

```json
{"type":"object","properties":{"field_name":{"type":"string"}}}
```

**DO NOT** use bare properties without the object wrapper:
```json
{"properties":{"field_name":{"type":"string"}}}  // ❌ Missing "type":"object"
```

**Why**: The Exa API requires a valid JSON Schema with an object type at the root level. Omitting this causes validation errors.

**Examples**:
```bash
# ✅ CORRECT - object wrapper included
exa-ai search "AI news" \
  --summary-schema '{"type":"object","properties":{"headline":{"type":"string"}}}'

# ❌ WRONG - missing object wrapper
exa-ai search "AI news" \
  --summary-schema '{"properties":{"headline":{"type":"string"}}}'
```

---

## Output Format Selection

### MUST NOT: Mix toon format with jq

**Applies to**: answer, context, search, find-similar, get-contents

`toon` format produces YAML-like output, not JSON. DO NOT pipe toon output to jq for parsing:

```bash
# ❌ WRONG - toon is not JSON
exa-ai search "query" --output-format toon | jq -r '.results'

# ✅ CORRECT - use JSON (default) with jq
exa-ai search "query" | jq -r '.results[].title'

# ✅ CORRECT - use toon for direct reading only
exa-ai search "query" --output-format toon
```

**Why**: jq expects valid JSON input. toon format is designed for human readability and produces YAML-like output that jq cannot parse.

### SHOULD: Choose one output approach

**Applies to**: answer, context, search, find-similar, get-contents

Pick one strategy and stick with it throughout your workflow:

1. **Approach 1: toon only** - Compact YAML-like output for direct reading
   - Use when: Reading output directly, no further processing needed
   - Token savings: ~40% reduction vs JSON
   - Example: `exa-ai search "query" --output-format toon`

2. **Approach 2: JSON + jq** - Extract specific fields programmatically
   - Use when: Need to extract specific fields or pipe to other commands
   - Token savings: ~80-90% reduction (extracts only needed fields)
   - Example: `exa-ai search "query" | jq -r '.results[].title'`

3. **Approach 3: Schemas + jq** - Structured data extraction with validation
   - Use when: Need consistent structured output across multiple queries
   - Token savings: ~85% reduction + consistent schema
   - Example: `exa-ai search "query" --summary-schema '{...}' | jq -r '.results[].summary | fromjson'`

**Why**: Mixing approaches increases complexity and token usage. Choosing one approach optimizes for your use case.

---

## Shell Command Best Practices

### MUST: Run commands directly, parse separately

**Applies to**: monitor, search (websets), research, and all skills using complex commands

When using the Bash tool with complex shell syntax, run commands directly and parse output in separate steps:

```bash
# ❌ WRONG - nested command substitution
webset_id=$(exa-ai webset-create --search '{"query":"..."}' | jq -r '.webset_id')

# ✅ CORRECT - run directly and save the output, then parse
exa-ai webset-create --search '{"query":"..."}' > output.json
# Then in a follow-up command:
webset_id=$(jq -r '.webset_id' output.json)
```

**Why**: Complex nested `$(...)` command substitutions can fail unpredictably in shell environments. Running commands directly and parsing separately improves reliability and makes debugging easier.

### MUST NOT: Use nested command substitutions

**Applies to**: All skills when using complex multi-step operations

Avoid nesting multiple levels of command substitution:

```bash
# ❌ WRONG - deeply nested
result=$(exa-ai search "$(cat query.txt | tr '\n' ' ')" --num-results $(cat config.json | jq -r '.count'))

# ✅ CORRECT - sequential steps
query=$(tr '\n' ' ' < query.txt)
count=$(jq -r '.count' config.json)
exa-ai search "$query" --num-results "$count"
```

**Why**: Nested command substitutions are fragile and hard to debug when they fail. Sequential steps make each operation explicit and easier to troubleshoot.

### SHOULD: Break complex commands into sequential steps

**Applies to**: All skills when working with multi-step workflows

For readability and reliability, break complex operations into clear sequential steps:

```bash
# ❌ Less maintainable - everything in one line
exa-ai webset-create --search '{"query":"startups","count":1}' | jq -r '.webset_id' | xargs -I {} exa-ai webset-search-create {} --query "AI" --behavior override

# ✅ More maintainable - clear steps
exa-ai webset-create --search '{"query":"startups","count":1}'
webset_id=$(jq -r '.webset_id' < output.json)
exa-ai webset-search-create $webset_id --query "AI" --behavior override
```

**Why**: Sequential steps are easier to understand, debug, and modify. Each step can be verified independently.

</shared-requirements>

Overview

This skill finds web content similar to a given URL using AI-powered similarity matching. It helps you discover related articles, papers, or sites that match the example page's topic, tone, or structure. The tool is optimized for token efficiency and supports compact output formats and structured schemas.

How this skill works

Provide an example page URL and the skill uses AI-powered semantic matching to locate related pages in the Exa index. You can control the result count, exclude the source domain, request summaries or structured fields, and choose compact output formats (toon, JSON with jq, or schemas + jq) to reduce token usage. Commands return result lists with metadata and optional AI-generated summaries.
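
For instance, a typical invocation might combine a seed URL, a result cap, source-domain exclusion, and compact output (the URL below is a placeholder):

```bash
# Find pages similar to a seed URL, skip the seed's own domain,
# and keep output compact for direct reading
exa-ai find-similar "https://example.com/flagship-report" \
  --exclude-source-domain \
  --num-results 5 \
  --output-format toon
```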

When to use it

  • You have one example page and want more pages with the same topic or style.
  • You need quick discovery of related research papers or news on the same subject.
  • You want to build curated reading lists or competitor monitoring from a seed URL.
  • You must extract consistent structured data (company, funding, author) across similar pages (see the schema sketch after this list).
  • You need token-efficient output for downstream automation or pipelines.
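
For the structured-data case above, a minimal sketch that extends the Quick Start schema with an author field (the seed URL and field names are illustrative assumptions):

```bash
# Same pattern as the Quick Start schema example, with an extra "author" field
exa-ai find-similar "https://example.com/seed-article" \
  --summary \
  --summary-schema '{"type":"object","properties":{"company_name":{"type":"string"},"funding_amount":{"type":"string"},"author":{"type":"string"}}}' \
  --num-results 5 | jq -r '.results[].summary | fromjson | "\(.author) | \(.company_name): \(.funding_amount)"'
```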

Best practices

  • Pick one output strategy and stick with it: toon for reading, JSON+jq for field extraction, or schemas+jq for structured results.
  • Avoid the --text flag; prefer summaries or schemas to limit token usage.
  • Wrap summary schemas in an object with type: 'object' to satisfy API validation.
  • Use --num-results to limit returned items and reduce token cost.
  • Do not pipe toon output to jq; toon is YAML-like. Use JSON output if you intend to parse with jq.
  • Break complex shell workflows into sequential steps and avoid nested command substitutions (a combined sketch follows this list).
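
A combined sketch of several of these practices; the `url` field and file name are assumptions, while `title` matches the JSON + jq examples above:

```bash
# Step 1: run the query once, capped at a few results, and save the raw JSON
exa-ai find-similar "https://example.com/seed" --num-results 3 > similar.json

# Step 2: extract only the needed fields in a separate step (no nested substitutions)
# "url" is assumed to be present alongside "title" in each result
jq -r '.results[] | "\(.title) - \(.url)"' similar.json
```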

Example use cases

  • Find news articles and blog posts similar to a high-traffic industry report for content planning.
  • Discover academic papers related to a known arXiv URL when preparing a literature review.
  • Extract funding amounts and company names from similar startup coverage using a summary schema piped to jq.
  • Create a daily watchlist of competitor pages similar to a flagship product page while excluding the source domain (sketched after this list).
  • Quickly produce a compact human-readable list of three top similar pages using --output-format toon.
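
A sketch of the watchlist case, assuming the flagship URL and file naming are placeholders and that scheduling (for example via cron) happens outside the command:

```bash
# Daily competitor watchlist: similar pages, excluding the flagship's own domain
today=$(date +%F)
exa-ai find-similar "https://example.com/flagship-product" \
  --exclude-source-domain \
  --num-results 10 > "watchlist-$today.json"

# Quick scan of titles only
jq -r '.results[].title' "watchlist-$today.json"
```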

FAQ

Which output format should I choose?

Choose based on your workflow: use toon for compact human-readable output, JSON+jq for simple field extraction, or schemas+jq for validated structured data.

How do I keep token usage low?

Use summaries or summary schemas, limit results with --num-results, and avoid raw --text extraction. Prefer the toon or JSON+jq patterns described above.
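
A compact pattern combining those suggestions (the URL is a placeholder; this assumes `--summary` combines with toon output just as it does with JSON):

```bash
# Few results, AI summaries instead of full text, compact toon output
exa-ai find-similar "https://example.com" --summary --num-results 3 --output-format toon
```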