
Parser skill

/Releases/v3.0/.claude/skills/Parser

This skill parses URLs, videos, and documents into structured JSON, extracting entities and deduplicating them so the output is ready to load into downstream systems.

npx playbooks add skill danielmiessler/personal_ai_infrastructure --skill parser

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: Parser
description: Parse URLs, files, videos to JSON. USE WHEN parse, extract, URL, transcript, entities, JSON, batch, content, YouTube, PDF, article. SkillSearch('parser') for docs.
---

## Customization

**Before executing, check for user customizations at:**
`~/.claude/skills/PAI/USER/SKILLCUSTOMIZATIONS/Parser/`

If this directory exists, load and apply any PREFERENCES.md, configurations, or resources found there. These override default behavior. If the directory does not exist, proceed with skill defaults.
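
The check described above could be sketched as follows. This is a minimal illustration, not the skill's actual loader: the function name `load_customizations` and the shape of its return value are assumptions, and only the directory path and the `PREFERENCES.md` filename come from this document.

```python
from pathlib import Path

# Path taken from the section above; everything else is illustrative.
CUSTOMIZATION_DIR = (
    Path.home() / ".claude/skills/PAI/USER/SKILLCUSTOMIZATIONS/Parser"
)


def load_customizations(directory=CUSTOMIZATION_DIR):
    """Return user overrides, or None when the skill defaults should apply.

    Hypothetical return shape: the text of PREFERENCES.md (if present)
    plus the names of any other files found in the directory.
    """
    directory = Path(directory)
    if not directory.is_dir():
        return None  # no customization directory: proceed with defaults

    prefs = directory / "PREFERENCES.md"
    return {
        "preferences": prefs.read_text(encoding="utf-8") if prefs.is_file() else None,
        "other_files": sorted(
            p.name for p in directory.iterdir() if p.name != "PREFERENCES.md"
        ),
    }
```

A caller would treat `None` as "use defaults" and otherwise apply the returned preferences before running any workflow.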


## 🚨 MANDATORY: Voice Notification (REQUIRED BEFORE ANY ACTION)

**You MUST send this notification BEFORE doing anything else when this skill is invoked.**

1. **Send voice notification**:
   ```bash
   curl -s -X POST http://localhost:8888/notify \
     -H "Content-Type: application/json" \
     -d '{"message": "Running the WorkflowName workflow in the Parser skill to ACTION"}' \
     > /dev/null 2>&1 &
   ```

2. **Output text notification**:
   ```
   Running the **WorkflowName** workflow in the **Parser** skill to ACTION...
   ```

**This is not optional. Execute this curl command immediately upon skill invocation.**
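
For callers that are not in a shell, the same two notifications could be produced from Python roughly as below. This is a sketch under assumptions: the helper names `build_notification` and `notify` are hypothetical, and unlike the backgrounded curl command this version blocks for up to a short timeout. The endpoint and message template are the ones shown above.

```python
import json
import urllib.request

NOTIFY_URL = "http://localhost:8888/notify"  # endpoint from the curl command


def build_notification(workflow, action):
    """Return (JSON payload bytes for the voice server, text notification)."""
    message = f"Running the {workflow} workflow in the Parser skill to {action}"
    payload = json.dumps({"message": message}).encode("utf-8")
    text = f"Running the **{workflow}** workflow in the **Parser** skill to {action}..."
    return payload, text


def notify(workflow, action, timeout=2.0):
    """POST the voice notification, then return the text notification."""
    payload, text = build_notification(workflow, action)
    request = urllib.request.Request(
        NOTIFY_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        urllib.request.urlopen(request, timeout=timeout).close()
    except OSError:
        # Like the curl's "> /dev/null 2>&1", delivery failures are ignored.
        pass
    return text
```

Calling `print(notify("ParseContent", "parse a URL"))` would cover both steps in order.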

# Parser

Parse any content into structured JSON with entity extraction and collision detection.

---


## Workflow Routing

**When executing a workflow, output this notification:**

```
Running the **WorkflowName** workflow in the **Parser** skill to ACTION...
```

| Workflow | Trigger | File |
|----------|---------|------|
| **ParseContent** | "parse this", "extract from URL" | `Workflows/ParseContent.md` |
| **BatchEntityExtractionGemini3** | "batch extract", "Gemini extraction" | `Workflows/BatchEntityExtractionGemini3.md` |
| **CollisionDetection** | "check duplicates", "entity collision" | `Workflows/CollisionDetection.md` |
| **DetectContentType** | "what type is this", "auto-detect" | `Workflows/DetectContentType.md` |

### Content Type Workflows

| Workflow | Trigger | File |
|----------|---------|------|
| **ExtractNewsletter** | "parse newsletter" | `Workflows/ExtractNewsletter.md` |
| **ExtractTwitter** | "parse tweet", "X thread" | `Workflows/ExtractTwitter.md` |
| **ExtractArticle** | "parse article", "web page" | `Workflows/ExtractArticle.md` |
| **ExtractYoutube** | "parse YouTube", "video transcript" | `Workflows/ExtractYoutube.md` |
| **ExtractPdf** | "parse PDF", "document" | `Workflows/ExtractPdf.md` |

### Security Workflows

| Workflow | Trigger | File |
|----------|---------|------|
| **ExtractBrowserExtension** | "analyze extension", "browser extension security" | `Workflows/ExtractBrowserExtension.md` |
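
To make the routing tables concrete, here is a rough sketch of trigger-based dispatch. It is an illustration only: in practice the agent interprets the request rather than matching substrings, and the `ROUTES` list and `route` function are hypothetical names. The trigger phrases and file paths are copied from the tables above.

```python
from typing import Optional

# (trigger phrases, workflow file); content-specific workflows are checked
# first so that a generic trigger such as "parse this" does not shadow them.
ROUTES = [
    (("parse newsletter",), "Workflows/ExtractNewsletter.md"),
    (("parse tweet", "X thread"), "Workflows/ExtractTwitter.md"),
    (("parse article", "web page"), "Workflows/ExtractArticle.md"),
    (("parse YouTube", "video transcript"), "Workflows/ExtractYoutube.md"),
    (("parse PDF", "document"), "Workflows/ExtractPdf.md"),
    (("analyze extension", "browser extension security"),
     "Workflows/ExtractBrowserExtension.md"),
    (("batch extract", "Gemini extraction"),
     "Workflows/BatchEntityExtractionGemini3.md"),
    (("check duplicates", "entity collision"), "Workflows/CollisionDetection.md"),
    (("what type is this", "auto-detect"), "Workflows/DetectContentType.md"),
    (("parse this", "extract from URL"), "Workflows/ParseContent.md"),
]


def route(request: str) -> Optional[str]:
    """Return the workflow file for the first matching trigger, else None."""
    lowered = request.lower()
    for triggers, workflow_file in ROUTES:
        if any(trigger.lower() in lowered for trigger in triggers):
            return workflow_file
    return None
```

Substring matching is deliberately crude; it only shows how the trigger column maps onto the file column.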

---

## Context Files

- **EntitySystem.md** - Entity extraction, GUIDs, collision detection reference

---

## Core Paths

- **Schema:** `Schema/content-schema.json`
- **Entity Index:** `entity-index.json`
- **Output:** `Output/`

---

## Examples

**Example 1: Parse YouTube video**
```
User: "parse this YouTube video for the newsletter"
--> Invokes ExtractYoutube workflow
--> Extracts transcript via YouTube API
--> Identifies people, companies, topics mentioned
--> Returns structured JSON with entities and key insights
```

**Example 2: Batch parse article URLs**
```
User: "parse these 5 URLs into JSON for the database"
--> Invokes ParseContent workflow for each
--> Detects content type for each URL
--> Extracts entities with collision detection
--> Assigns GUIDs, checks for duplicates
--> Returns validated JSON per schema
```

**Example 3: Check for duplicate content**
```
User: "have I already parsed this article?"
--> Invokes CollisionDetection workflow
--> Checks URL against entity index
--> Returns existing content ID if found
--> Skips re-parsing, saves time
```
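
The URL check in Example 3 might look something like the sketch below. The layout of `entity-index.json` is not documented here, so the `"urls"` mapping from normalized URL to content ID is a hypothetical structure, as are the function names.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def find_existing_content(url: str, index: dict) -> Optional[str]:
    """Return the stored content ID for this URL, or None if it is new.

    Assumes (hypothetically) that the index has a "urls" mapping of
    normalized URL -> content ID.
    """
    return index.get("urls", {}).get(normalize_url(url))
```

Normalizing before lookup means trivially different spellings of the same address (a trailing slash, a `#fragment`, a capitalized host) resolve to the same entry.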

---

## Quick Reference

- **Schema Version:** 1.0.0
- **Output Format:** JSON validated against `Schema/content-schema.json`
- **Entity Types:** people, companies, links, sources, topics
- **Deduplication:** Via entity-index.json with UUID v4 GUIDs
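
As an illustration of the deduplication bullet, the sketch below reuses an existing GUID when an entity is already known and mints a UUID v4 otherwise. The in-memory index shape (entity type, then normalized name, then GUID) and the name `resolve_guid` are assumptions; the real `entity-index.json` format may differ.

```python
import uuid

# Entity types as listed in the Quick Reference above.
ENTITY_TYPES = ("people", "companies", "links", "sources", "topics")


def resolve_guid(index: dict, entity_type: str, name: str):
    """Return (guid, is_new) for an entity, updating the index in place."""
    if entity_type not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {entity_type}")

    # Collapse whitespace and ignore case so "Ada  Lovelace" and
    # "ada lovelace" collide instead of creating two records.
    key = " ".join(name.split()).casefold()
    bucket = index.setdefault(entity_type, {})

    if key in bucket:
        return bucket[key], False  # collision: reuse the existing GUID

    guid = str(uuid.uuid4())
    bucket[key] = guid
    return guid, True
```

Because GUIDs are random, the index is the only record tying a name to its identifier, which is why it has to be loaded before extraction and persisted afterwards.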

Overview

This skill parses URLs, files, and videos into validated JSON with entity extraction, deduplication, and schema-guided output. It is designed to convert web pages, PDFs, YouTube videos, newsletters, and batches of sources into structured content suitable for databases and downstream automation. The skill supports GUID assignment and collision detection to avoid duplicate records.

How this skill works

Before any parsing begins, the skill emits a required voice/text notification indicating the workflow and action. It then auto-detects the content type, extracts raw text or transcripts, runs entity extraction (people, companies, topics, links, sources), and validates the output against the content schema. For batch jobs, it iterates over the URLs or files, applies collision detection against the entity index, assigns UUID v4 GUIDs, and writes validated JSON to the Output path.
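
The auto-detection step could be approximated with URL heuristics like the following. This is a sketch and not the DetectContentType workflow itself: the rules and the `detect_content_type` name are assumptions, and types such as newsletters generally cannot be recognized from a URL alone, so anything unrecognized falls back to the article extractor.

```python
from urllib.parse import urlsplit


def _host_matches(host: str, domain: str) -> bool:
    """True for the domain itself or any of its subdomains."""
    return host == domain or host.endswith("." + domain)


def detect_content_type(source: str) -> str:
    """Guess which extraction workflow fits a URL or file path."""
    parts = urlsplit(source)
    host = (parts.hostname or "").lower()
    path = parts.path.lower()

    if _host_matches(host, "youtube.com") or host == "youtu.be":
        return "ExtractYoutube"
    if _host_matches(host, "twitter.com") or _host_matches(host, "x.com"):
        return "ExtractTwitter"
    if path.endswith(".pdf"):
        return "ExtractPdf"
    return "ExtractArticle"  # default for ordinary web pages
```

The returned names correspond to the content type workflows listed in the routing tables.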

When to use it

  • Convert a YouTube video transcript or podcast into structured JSON for analysis
  • Parse PDFs, articles, newsletters, or tweets into a normalized schema
  • Batch-extract entities from multiple URLs for database ingestion
  • Detect whether content has already been parsed using collision detection
  • Auto-detect content type before applying the appropriate extraction workflow

Best practices

  • Place user customizations in ~/.claude/skills/PAI/USER/SKILLCUSTOMIZATIONS/Parser/ to override defaults
  • Run batch jobs with moderate concurrency to avoid rate limits on external APIs
  • Validate outputs against Schema/content-schema.json before importing into production systems
  • Use CollisionDetection to skip re-parsing known content and conserve resources
  • Keep entity-index.json backed up and versioned to preserve GUID stability
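
One way to keep batch concurrency moderate is a bounded worker pool, sketched below. The `parse_batch` helper and its `parse_one` callback are hypothetical names, and the default of four workers is an arbitrary starting point rather than a documented limit.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_batch(sources, parse_one, max_workers=4):
    """Parse sources with at most max_workers running at once.

    Results come back in input order. A failure on one source is
    recorded on that item instead of aborting the whole batch.
    """

    def safe_parse(source):
        try:
            return {"source": source, "ok": True, "result": parse_one(source)}
        except Exception as exc:  # keep going; report the error per item
            return {"source": source, "ok": False, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe_parse, sources))
```

Lowering `max_workers` is the simplest lever when an external API starts returning rate-limit errors.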

Example use cases

  • Parse a YouTube video to extract transcript, named entities, and key insights for a newsletter
  • Batch-parse five article URLs into JSON for a knowledge base import
  • Run collision detection to determine if an article URL is already in the entity index
  • Extract entities from a PDF research report and output a schema-compliant JSON file
  • Auto-detect the content type of mixed input and route each item to the appropriate extractor

FAQ

Is a notification sent before parsing?

Yes. The skill sends a required voice/text notification that names the workflow and action before any processing occurs.

What output format does the skill produce?

Output is JSON validated against Schema/content-schema.json and written to the Output directory.
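
A minimal sketch of the validate-then-write step is shown below. The contents of Schema/content-schema.json are not reproduced in this document, so `REQUIRED_FIELDS` is a hypothetical stand-in for the real schema, and the file naming by `id` is likewise an assumption. A full JSON Schema validator would be the natural replacement for the hand-written check.

```python
import json
from pathlib import Path

# Hypothetical stand-in: the real rules live in Schema/content-schema.json.
REQUIRED_FIELDS = {"id": str, "url": str, "entities": dict}


def validation_errors(record: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"{field} should be {expected.__name__}")
    return errors


def write_output(record: dict, output_dir=Path("Output")) -> Path:
    """Write the record as JSON, refusing records that fail validation."""
    errors = validation_errors(record)
    if errors:
        raise ValueError("; ".join(errors))

    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / f"{record['id']}.json"
    target.write_text(
        json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return target
```

Validating before writing keeps malformed records out of the Output directory altogether.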

How does deduplication work?

Deduplication uses entity-index.json with UUID v4 GUIDs to detect collisions and prevent re-parsing or duplicate records.