This skill detects and fetches llms.txt files to enable rapid, LLM-optimized documentation ingestion before site scraping.

npx playbooks add skill jmagly/aiwg --skill llms-txt-support

---
name: llms-txt-support
description: Detect and use llms.txt files for LLM-optimized documentation. Use when checking if a site has LLM-ready docs before scraping.
tools: Read, Write, WebFetch
---

# llms.txt Support Skill

## Purpose

Single responsibility: Detect, fetch, and use llms.txt files that provide LLM-optimized documentation, enabling faster documentation ingestion than full-site scraping (a claimed 10x speedup). (BP-4)

## Background

The llms.txt standard (https://llmstxt.org/) provides a convention for websites to expose LLM-friendly documentation. Instead of scraping entire sites, check for llms.txt first.

**File hierarchy (check in order):**
1. `llms-full.txt` - Complete documentation (largest)
2. `llms.txt` - Standard documentation
3. `llms-small.txt` - Condensed documentation (smallest)

## Grounding Checkpoint (Archetype 1 Mitigation)

Before executing, VERIFY:

- [ ] Base URL is accessible
- [ ] All three llms.txt variants have been checked, in order
- [ ] File content is actual documentation (not an error page)
- [ ] File size is reasonable for the documentation scope

**DO NOT assume llms.txt exists. Always probe first.**

## Uncertainty Escalation (Archetype 2 Mitigation)

ASK USER instead of guessing when:

- Multiple llms.txt variants are found: which size should be used?
- llms.txt content appears partial or outdated
- A file is returned but its content looks like an error page
- The site has llms.txt but its content doesn't match the expected documentation

**NEVER assume llms.txt quality without verification.**

## Context Scope (Archetype 3 Mitigation)

| Context Type | Included | Excluded |
|--------------|----------|----------|
| RELEVANT | Target base URL, llms.txt content | Full site scraping |
| PERIPHERAL | llms.txt spec reference | Other sites' llms.txt |
| DISTRACTOR | Previous scraping attempts | Unrelated documentation |

## Workflow Steps

### Step 1: Detect llms.txt (Grounding)

```bash
# Check for llms.txt variants (in order of preference)
curl -I https://example.com/llms-full.txt
curl -I https://example.com/llms.txt
curl -I https://example.com/llms-small.txt

# Check common alternate locations
curl -I https://example.com/.well-known/llms.txt
curl -I https://docs.example.com/llms.txt
```
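
The probe order above can be captured in a small helper. This is a minimal sketch, not part of the skill itself: `candidate_urls` and `probe_llms_txt` are hypothetical names, and the sketch only covers paths under the base URL it is given (a docs subdomain would need a second call with that host).

```bash
# Hypothetical helper: list candidate llms.txt URLs for a base URL,
# in the probe order used above (three variants, then .well-known).
candidate_urls() {
  local base="${1%/}"   # strip one trailing slash, if present
  local name
  for name in llms-full.txt llms.txt llms-small.txt .well-known/llms.txt; do
    printf '%s/%s\n' "$base" "$name"
  done
}

# Hypothetical helper: print "<status> <url>" for each candidate.
# Needs network access; curl reports 000 when no HTTP response arrived.
probe_llms_txt() {
  local url status
  candidate_urls "$1" | while IFS= read -r url; do
    status=$(curl -s -o /dev/null -I -L --max-time 10 -w '%{http_code}' "$url")
    printf '%s %s\n' "${status:-000}" "$url"
  done
}
```

Calling `probe_llms_txt https://example.com` would print one status line per candidate. A 200 status alone does not prove the file is documentation, so the content still needs the validation in Step 2.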

### Step 2: Validate Content

```bash
# Fetch and inspect first 100 lines
curl -s https://example.com/llms.txt | head -100

# Check file size
curl -sI https://example.com/llms.txt | grep -i content-length

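# Verify it's not an error page: check the HTTP status, then look for HTML
# or error text in the first lines only (real docs often mention "error")
curl -s -o /dev/null -w '%{http_code}\n' https://example.com/llms.txt
curl -s https://example.com/llms.txt | head -20 | grep -iE '<!doctype html|<html|not found|404' && echo "WARNING: May be error page"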
```
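
The same checks can be run against an already downloaded file, which avoids fetching the URL several times. The function below is a sketch under stated assumptions: `looks_like_error_page` is a hypothetical name, and its heuristics (empty file, HTML near the top, no leading Markdown H1) are illustrative choices that may be too strict or too loose for some sites.

```bash
# Hypothetical helper: return 0 (true) when a downloaded file looks like
# an error page rather than documentation. Heuristics are assumptions.
looks_like_error_page() {
  local file="$1"

  # Missing or empty file: nothing usable was downloaded.
  [ -s "$file" ] || return 0

  # HTML markup near the top suggests a rendered error or login page.
  if head -n 20 "$file" | grep -qiE '<!doctype html|<html'; then
    return 0
  fi

  # Assumption: a real llms.txt opens with a Markdown H1 title, so the
  # first non-blank line should start with "# ".
  if ! awk 'NF { print; exit }' "$file" | grep -q '^# '; then
    return 0
  fi

  return 1
}
```

A caller might write `looks_like_error_page docs/llms.txt && echo "WARNING: May be error page"`. When the result is borderline, the escalation rules above still apply: ask the user rather than guess.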

### Step 3: Choose Variant

| Variant | Size | Use Case |
|---------|------|----------|
| `llms-full.txt` | Large (1MB+) | Complete documentation, full API reference |
| `llms.txt` | Medium | Standard use, balanced coverage |
| `llms-small.txt` | Small (<100KB) | Quick reference, limited context windows |

**Decision tree:**
1. If context window is limited → `llms-small.txt`
2. If complete coverage is needed → `llms-full.txt`
3. Default → `llms.txt`
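
The decision tree is small enough to express directly. In this sketch, `choose_variant` and its two `yes`/`no` arguments are hypothetical. It encodes only the three rules above and does not check whether the chosen variant was actually found during detection.

```bash
# Hypothetical helper encoding the decision tree above.
# $1 = "yes" if the context window is limited
# $2 = "yes" if complete coverage is needed
choose_variant() {
  local context_limited="$1" need_complete="$2"
  if [ "$context_limited" = "yes" ]; then
    echo "llms-small.txt"      # rule 1 takes priority
  elif [ "$need_complete" = "yes" ]; then
    echo "llms-full.txt"       # rule 2
  else
    echo "llms.txt"            # rule 3: default
  fi
}
```

Because rule 1 is tested first, a limited context window wins even when complete coverage is also requested. If the preferred variant is missing, the caller would need to pick the next available one or ask the user.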

### Step 4: Fetch and Process

```bash
# Download llms.txt (-f fails on HTTP errors; --create-dirs makes docs/ if missing)
curl -fsSL --create-dirs -o docs/llms.txt https://example.com/llms.txt

# Convert to skill format (if using skill-seekers)
skill-seekers scrape --llms-txt docs/llms.txt --name myskill

# Or process manually
# llms.txt is already LLM-optimized markdown
mkdir -p output/myskill/references
cp docs/llms.txt output/myskill/references/complete.md
```

### Step 5: Validate Output

```bash
# Check content structure
head -50 output/myskill/references/complete.md

# Verify sections
grep "^#" output/myskill/references/complete.md | head -20

# Check for code examples (counts fence lines; each example contributes two)
grep -c '```' output/myskill/references/complete.md
```

## Recovery Protocol (Archetype 4 Mitigation)

On error:

1. **PAUSE** - Note which variant failed
2. **DIAGNOSE** - Check error type:
   - `404 Not Found` → Try next variant or alternate location
   - `403 Forbidden` → May need authentication or a different User-Agent header
   - `Timeout` → Retry with longer timeout
   - `Invalid content` → Fall back to traditional scraping
3. **ADAPT** - Try alternate approach
4. **RETRY** - Next variant (max 3 attempts per variant)
5. **ESCALATE** - Inform user llms.txt unavailable, suggest scraping
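
The DIAGNOSE and RETRY steps could be sketched as two small functions. Both names (`recovery_action`, `retry3`) and the action labels are hypothetical. Treating `408` and `504` as timeouts is an assumption, and `000` is what curl's `%{http_code}` reports when no HTTP response was received.

```bash
# Hypothetical helper: map a probe result to the next recovery action.
# $1 is an HTTP status code, or the word "invalid" for bad content.
recovery_action() {
  case "$1" in
    200)         echo "use" ;;
    404)         echo "try-next-variant" ;;
    403)         echo "retry-with-auth-or-user-agent" ;;
    000|408|504) echo "retry-with-longer-timeout" ;;   # assumed timeout codes
    invalid)     echo "fall-back-to-scraping" ;;
    *)           echo "escalate-to-user" ;;
  esac
}

# Hypothetical helper: run a command up to 3 times (the per-variant
# limit above), stopping at the first success.
retry3() {
  local attempt=1
  while [ "$attempt" -le 3 ]; do
    "$@" && return 0
    attempt=$((attempt + 1))
  done
  return 1
}
```

For example, `retry3 curl -fsS --max-time 30 -o docs/llms.txt https://example.com/llms.txt` would attempt the download at most three times before the caller moves on to the next variant or escalates.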

## Checkpoint Support

State saved to: `.aiwg/working/checkpoints/llms-txt-support/`

```
checkpoints/llms-txt-support/
├── detection_results.json    # Which variants found
├── selected_variant.txt      # Which was chosen
└── content_hash.txt          # For cache validation
```
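
One way to populate `selected_variant.txt` and `content_hash.txt` is sketched below. The function names are hypothetical, and the sketch assumes GNU `sha256sum` is available (on macOS, `shasum -a 256` is the usual substitute). Writing `detection_results.json` is left out.

```bash
# Hypothetical helper: record the chosen variant and a content hash.
# $1 = checkpoint directory, $2 = variant name, $3 = downloaded file
save_checkpoint() {
  local dir="$1" variant="$2" file="$3"
  mkdir -p "$dir"
  printf '%s\n' "$variant" > "$dir/selected_variant.txt"
  # Keep only the hex digest, not the file name sha256sum appends.
  sha256sum "$file" | cut -d ' ' -f 1 > "$dir/content_hash.txt"
}

# Hypothetical helper: succeed only if the file still matches the
# stored hash, meaning the cached copy can be reused.
content_unchanged() {
  local dir="$1" file="$2"
  [ -f "$dir/content_hash.txt" ] || return 1
  [ "$(sha256sum "$file" | cut -d ' ' -f 1)" = "$(cat "$dir/content_hash.txt")" ]
}
```

A later run could call `content_unchanged .aiwg/working/checkpoints/llms-txt-support docs/llms.txt` after re-downloading and skip reprocessing when it succeeds.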

## llms.txt Format Reference

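Typical llms.txt structure (illustrative). The llmstxt.org proposal requires only the H1 title; the blockquote summary and H2 sections are optional, and in llms.txt itself the H2 sections usually hold lists of Markdown links to more detailed pages: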

```markdown
# Project Name

> Brief description of the project

## Overview
[High-level explanation]

## Installation
[Setup instructions]

## Quick Start
[Getting started guide]

## API Reference
[Detailed API documentation]

## Examples
[Code examples]

## FAQ
[Common questions]
```

## Detection Results Output

```json
{
  "base_url": "https://example.com",
  "detected": {
    "llms-full.txt": {
      "found": true,
      "url": "https://example.com/llms-full.txt",
      "size": 1523456,
      "last_modified": "2025-01-15T10:30:00Z"
    },
    "llms.txt": {
      "found": true,
      "url": "https://example.com/llms.txt",
      "size": 245678,
      "last_modified": "2025-01-15T10:30:00Z"
    },
    "llms-small.txt": {
      "found": false
    }
  },
  "recommended": "llms.txt",
  "reason": "Standard size, good for most use cases"
}
```
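
The `size` and `last_modified` fields can be derived from response headers. The helper below is a sketch with a hypothetical name, `header_value`. It reads raw headers on stdin, matches the header name case-insensitively, and keeps the last match so that the final response wins when redirects produce several header blocks. Converting the date to the ISO format shown above is not covered.

```bash
# Hypothetical helper: print the value of one response header.
# $1 = header name; raw headers (e.g. from `curl -sI`) arrive on stdin.
header_value() {
  tr -d '\r' | awk -v name="$1" '
    BEGIN { wanted = tolower(name) }
    {
      i = index($0, ":")
      if (i > 0 && tolower(substr($0, 1, i - 1)) == wanted) {
        value = substr($0, i + 1)
        sub(/^[ \t]+/, "", value)   # drop spaces after the colon
        found = 1                   # later matches overwrite earlier ones
      }
    }
    END { if (found) print value }
  '
}
```

Usage would look like `curl -sI https://example.com/llms.txt | header_value content-length`. Servers are not required to send `Content-Length` or `Last-Modified`, so an empty result should be treated as "unknown" rather than zero.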

## Known Sites with llms.txt

Sites known to support llms.txt (verify before use):

- Anthropic documentation
- Many modern API documentation sites
- Framework documentation following the standard

**Always verify - this list may be outdated.**

## Troubleshooting

| Issue | Diagnosis | Solution |
|-------|-----------|----------|
| No llms.txt found | Site doesn't publish llms.txt | Fall back to doc-scraper |
| Content seems wrong | Error page or redirect | Check actual content, verify URL |
| File too large | llms-full.txt exceeds the context budget | Use llms.txt or llms-small.txt |
| Outdated content | llms.txt not maintained | Merge llms.txt with freshly scraped content |

## Integration with doc-scraper

If llms.txt is incomplete or outdated, combine approaches:

```bash
# 1. Fetch llms.txt as base
curl -o base.md https://example.com/llms.txt

# 2. Scrape for additional/updated content
skill-seekers scrape --config config.json --skip-covered-by base.md

# 3. Merge results
# llms.txt provides structure, scraping fills gaps
```

## References

- llms.txt Standard: https://llmstxt.org/
- Skill Seekers llms.txt Detection: https://github.com/jmagly/Skill_Seekers/blob/main/docs/LLMS_TXT_SUPPORT.md
- REF-001: Production-Grade Agentic Workflows (BP-4, BP-9)
- REF-002: LLM Failure Modes (Archetype 1-4 mitigations)

Overview

This skill detects, fetches, and uses llms.txt files to unlock LLM-optimized documentation before any scraping. It checks llms-full.txt, llms.txt, and llms-small.txt in order, validates content quality, and selects the best variant for the task. The goal is faster, safer ingestion and fewer unnecessary site scrapes.

How this skill works

The skill probes standard locations and common alternates (.well-known and docs subdomains) and returns detection metadata including size and URLs. It validates that the file contains real documentation rather than an error or redirect, applies a decision rule to choose full/standard/small variants, and downloads the chosen file for downstream processing. It records checkpoints, supports retry and recovery logic, and escalates to the user when ambiguity or low confidence is detected.

When to use it

  • Before scraping a site to see if LLM-ready docs are available
  • When you need a compact, structured source for fast documentation ingestion
  • To minimize crawl scope and bandwidth by preferring llms.txt over full scrapes
  • When building agentic workflows that require reliable documentation grounding
  • When you need repeatable detection and caching of documentation sources

Best practices

  • Always probe in this order: llms-full.txt, llms.txt, llms-small.txt, then common alternate locations
  • Validate content for error pages, reasonable size, and expected markdown structure before trusting it
  • Ask the user when multiple variants are present or content looks partial or outdated
  • Use llms-small.txt for tight context windows, llms-full.txt for complete coverage, llms.txt as the default balance
  • Save detection results and content hashes to checkpoints for caching and auditability

Example use cases

  • Automated CI job that checks third-party APIs for LLM-ready docs and decides whether to scrape
  • Onboarding a new codebase: fetch llms.txt to provide immediate LLM context and examples
  • Agent orchestration step that verifies documentation quality before running knowledge extraction
  • Fallback flow where llms.txt is missing or invalid and a scraper is invoked to fill gaps
  • Merging llms.txt as the authoritative base and augmenting with targeted scraping for updates

FAQ

What if no llms.txt exists?

The skill falls back to a documented recovery protocol and recommends targeted scraping; it reports findings to the user.

Which variant should I pick when multiple are present?

Default to llms.txt for balance; choose llms-small.txt for strict context limits or llms-full.txt when full coverage is required; ask the user if unsure.

How do you detect error pages?

Validation checks look for common error strings, anomalous size, redirects, and unexpected HTML to avoid treating an error page as documentation.