
document-to-narration skill


This skill converts written documents into narrated video scripts with word-level timing, generating scenes, audio, and VTT captions for smooth video narration.

npx playbooks add skill jwynia/agent-skills --skill document-to-narration


---
name: document-to-narration
description: "Convert written documents to narrated video scripts with TTS audio and word-level timing. Use when preparing essays, blog posts, or articles for video narration. Outputs scene files, audio, and VTT with precise word timestamps. Keywords: narration, voiceover, TTS, scenes, audio, timing, video script, spoken."
license: MIT
compatibility: Requires Deno, Python 3.12 with venv, ffmpeg, whisper-cpp
metadata:
  author: jwynia
  version: "1.0"
  domain: video-production
  type: generator
  mode: generative
---

# Document to Narration

Convert written documents into narrated video scripts with precise word-level timing.

## Core Principle

**The agent interprets; the document guides.** Rather than splitting on a rigid template, this skill uses agent judgment to find where the content naturally breathes, argues, and transitions. The document's argument flow determines scene breaks, not a predetermined structure.

## When to Use This Skill

Use this skill when:
- Converting a blog post or essay to video narration
- Preparing content for TTS audio generation
- Breaking long-form content into digestible scenes
- Creating word-level synchronized captions for video

Do NOT use this skill when:
- The content is already in scene/script format
- You need real-time voice synthesis (this is batch processing)
- Working with dialogue or multi-speaker content (single voice only)

## Prerequisites

- **Deno** installed (https://deno.land/)
- **Python 3.12** with venv support
- **ffmpeg** for audio conversion
- **whisper-cpp** (installed via @remotion/install-whisper-cpp)
- **TTS model** at `tts/model/` (not in git due to size - see Model Setup below)

## Complete Pipeline

There are two approaches: **per-scene** (legacy) and **full narration** (recommended).

### Full Narration Pipeline (Recommended)

Generates a single audio file for consistent volume and pacing:

```
Document (.md)
    ↓ [agent interprets scene breaks]
Scene .txt files (01-scene-name.txt, 02-scene-name.txt, ...)
    ↓ [TTS via narrate-full.py - SINGLE PASS]
full-narration.wav (one consistent audio file)
    ↓ [Whisper via transcribe-full.py]
full-narration.json + full-narration.vtt (word-level timing)
    ↓ [extract-scene-boundaries.py]
Scene timing boundaries for video composition
```

### Per-Scene Pipeline (Legacy)

Generates a separate audio file per scene, which **can cause volume inconsistencies**:

```
Scene .txt files
    ↓ [TTS via narrate-scenes.py - MULTIPLE PASSES]
Scene .wav files (volume may vary between scenes)
    ↓ [concatenate]
Combined audio (may have clipping at boundaries)
```

> **Warning:** Per-scene TTS generates audio with different volume levels and pacing. When concatenated, this causes audible jumps and clipping. Use the full narration pipeline instead.
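
If you want to verify the drift before committing to either pipeline, a stdlib-only sketch like the one below prints per-scene RMS levels. It assumes 16-bit PCM output from `narrate-scenes.py` in `./output/scenes/` and is illustrative only; it is not one of the skill's scripts.

```python
# Rough per-scene loudness check: large differences in RMS level explain the
# audible jumps when scene files are concatenated. Assumes 16-bit PCM WAVs.
import math
import wave
from array import array
from pathlib import Path

def rms_dbfs(path: Path) -> float:
    """Approximate RMS level of a 16-bit PCM WAV, in dBFS."""
    with wave.open(str(path), "rb") as w:
        samples = array("h", w.readframes(w.getnframes()))
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms / 32768) if rms else float("-inf")

for wav in sorted(Path("./output/scenes").glob("*.wav")):
    print(f"{wav.name}: {rms_dbfs(wav):.1f} dBFS")
```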

## Quick Start

### Full Narration Pipeline (Recommended)

```bash
cd .claude/skills/document-to-narration
source tts/.venv/bin/activate

# 1. Split document into scenes (manual or scripted)
deno run --allow-read --allow-write scripts/split-to-scenes.ts input.md --output ./output/

# 2. Generate single audio file
python scripts/narrate-full.py ./output/scenes/

# 3. Transcribe with word-level timestamps
python scripts/transcribe-full.py ./output/full-narration.wav

# 4. Extract scene boundaries for video timing
python scripts/extract-scene-boundaries.py ./output/scenes/ ./output/full-narration.json --typescript
```

### Legacy Per-Scene Pipeline

```bash
# 1. Split document into scenes
deno run --allow-read --allow-write scripts/split-to-scenes.ts input.md --output ./output/

# 2. Generate audio per scene (may have volume inconsistencies)
source tts/.venv/bin/activate
python scripts/narrate-scenes.py ./output/scenes/

# 3. Transcribe (DEPRECATED: transcribe-scenes.ts requires whisper-cpp)
# Use transcribe-full.py instead after concatenating audio
```

## Instructions

### Step 1: Setup (First Time Only)

#### Create Python Virtual Environment

```bash
cd .claude/skills/document-to-narration/tts
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

#### TTS Model Setup

The fine-tuned voice model (~7.8GB) is not included in git due to size.
Place your Qwen3-TTS model files in `tts/model/`:

```
tts/model/
├── config.json
├── generation_config.json
├── model.safetensors      # Main model weights
├── tokenizer_config.json
├── vocab.json
├── merges.txt
└── speech_tokenizer/
    └── ...
```
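
Before running TTS, it can help to confirm the expected files are actually in place. A minimal sketch, assuming the layout above (adjust the list if your Qwen3-TTS export differs):

```python
# Sanity-check the TTS model directory against the expected layout above.
from pathlib import Path

MODEL_DIR = Path("tts/model")
REQUIRED = [
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "vocab.json",
    "merges.txt",
]

missing = [name for name in REQUIRED if not (MODEL_DIR / name).exists()]
if missing or not (MODEL_DIR / "speech_tokenizer").is_dir():
    raise SystemExit(f"TTS model incomplete, missing: {missing or ['speech_tokenizer/']}")
print("TTS model files look complete.")
```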

#### Install Whisper (if not already installed)

The @remotion/install-whisper-cpp package handles this:
```typescript
import { installWhisperCpp, downloadWhisperModel } from '@remotion/install-whisper-cpp';

await installWhisperCpp({ to: './whisper-cpp', version: '1.5.5' });
await downloadWhisperModel({ model: 'medium', folder: './whisper-cpp' });
```

### Step 2: Prepare Your Document

The skill works best with:
- Markdown documents with clear heading structure (H1, H2)
- Well-structured arguments with distinct sections
- Content that reads naturally aloud

### Step 3: Run the Pipeline

```bash
deno run -A scripts/full-pipeline.ts /path/to/essay.md --output ./output/essay-name/
```

### Step 4: Review Output

```
output/essay-name/
├── scenes/
│   ├── 01-opening-hook.txt      # Scene script
│   ├── 01-opening-hook.wav      # Generated audio
│   ├── 01-opening-hook.vtt      # Word-level captions
│   ├── 02-core-argument.txt
│   ├── 02-core-argument.wav
│   ├── 02-core-argument.vtt
│   └── ...
└── manifest.json                # Complete timing data
```
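
To skim the result without opening each file, a short sketch that reads `manifest.json` and prints one line per scene (field names follow the manifest schema shown under Output Format below):

```python
# Summarize scenes from the manifest: number, slug, word count, and audio duration.
import json
from pathlib import Path

manifest = json.loads(Path("output/essay-name/manifest.json").read_text())
print(f"{manifest['total_scenes']} scenes, {manifest['total_duration_seconds']}s total")
for scene in manifest["scenes"]:
    print(f"  {scene['number']:02d} {scene['slug']}: "
          f"{scene['word_count']} words, {scene['audio_duration_seconds']:.1f}s")
```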

## Scene Boundary Heuristics

The agent identifies scene breaks using these heuristics:

### Strong Boundaries (Almost Always Break)
- H2 heading changes
- "Here's the thing" / "The point is" pivot statements
- Major metaphor introduction
- Explicit enumeration ("First...", "Second...")
- Significant perspective shifts

### Moderate Boundaries (Consider Breaking)
- Long paragraph after short ones (or vice versa)
- Example-to-principle transitions
- "But" / "However" / "Meanwhile" at paragraph start
- Question-then-answer patterns

### Weak Boundaries (Usually Keep Together)
- Paragraph-to-paragraph within same example
- Sequential evidence for same point
- Build-up to a punchline/reveal

### Scene Length Guidance
- **Target**: 100-300 words per scene (30-90 seconds of audio)
- **Minimum**: 50 words (avoid micro-scenes)
- **Maximum**: 500 words (avoid cognitive overload)
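
A quick way to check generated scenes against this guidance is to count words per file. A minimal sketch, assuming scene `.txt` files in `./output/scenes/`:

```python
# Flag scenes that fall outside the 50-500 word bounds from the guidance above.
from pathlib import Path

for txt in sorted(Path("./output/scenes").glob("*.txt")):
    words = len(txt.read_text().split())
    if words < 50:
        print(f"{txt.name}: {words} words - too short, consider merging")
    elif words > 500:
        print(f"{txt.name}: {words} words - too long, consider splitting")
    else:
        print(f"{txt.name}: {words} words - ok")
```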

## Anti-Patterns

### The Paragraph Slicer
**Pattern:** Breaking at every paragraph or heading mechanically.
**Problem:** Ignores argument flow. Scenes feel choppy and disconnected.
**Fix:** Look for rhetorical units, not structural units. Multiple paragraphs often form one scene.

### The Wall of Text
**Pattern:** Keeping entire sections as single scenes.
**Problem:** Creates TTS audio that's too long. Loses natural breathing room.
**Fix:** Target 100-300 words. Find the natural pause point within sections.

### The Verbatim Transcriber
**Pattern:** Copying written text exactly without spoken adaptation.
**Problem:** Written conventions don't work when spoken. Parentheticals, complex punctuation, and nested clauses confuse TTS and listeners.
**Fix:** Apply adaptation rules. Read it aloud mentally.

### The Over-Adapter
**Pattern:** Rewriting content so heavily it loses the author's voice.
**Problem:** The result doesn't sound like the original author.
**Fix:** Preserve voice, adjust mechanics. If the author uses rhetorical questions, keep them.

## Available Scripts

### scripts/split-to-scenes.ts

Parse a markdown document and output scene text files.

```bash
deno run --allow-read --allow-write scripts/split-to-scenes.ts input.md --output ./output/
deno run --allow-read --allow-write scripts/split-to-scenes.ts input.md --output ./output/ --adapt
deno run --allow-read scripts/split-to-scenes.ts input.md --dry-run
```

**Options:**
- `--output` - Directory for scene files (created if it doesn't exist)
- `--adapt` - Apply spoken adaptation rules
- `--dry-run` - Preview scene breaks without writing files

**Output:** Numbered `.txt` files and initial `manifest.json`

### scripts/narrate-full.py (Recommended)

Generate a single TTS audio file from all scene files. Produces consistent volume and pacing.

```bash
python scripts/narrate-full.py ./output/scenes/
python scripts/narrate-full.py ./output/scenes/ --force
python scripts/narrate-full.py ./output/scenes/ --speaker jwynia
python scripts/narrate-full.py ./output/scenes/ --output ./custom/path/audio.wav
```

**Options:**
- `--force` - Regenerate even if output exists
- `--speaker` - Speaker name (default: jwynia)
- `--output` - Custom output path (default: `../full-narration.wav`)

**Output:** Single `full-narration.wav` in parent directory of scenes
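
The underlying idea is simple: read the numbered scene files in order and synthesize them as one script, so volume and pacing stay consistent. A rough sketch of that idea only; the actual `narrate-full.py` handles the TTS call and output path itself:

```python
# The single-pass idea: combine numbered scene scripts before synthesis.
from pathlib import Path

scene_files = sorted(Path("./output/scenes").glob("*.txt"))
full_script = "\n\n".join(f.read_text().strip() for f in scene_files)
# full_script would then be synthesized in one TTS pass
print(f"{len(scene_files)} scenes, {len(full_script.split())} words to narrate")
```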

### scripts/narrate-scenes.py (Legacy)

Generate TTS audio for each scene file separately. **Not recommended** - can cause volume inconsistencies when concatenated.

```bash
python scripts/narrate-scenes.py ./output/scenes/
python scripts/narrate-scenes.py ./output/scenes/ --force
python scripts/narrate-scenes.py ./output/scenes/ --speaker jwynia
```

**Options:**
- `--force` - Regenerate even if output exists
- `--speaker` - Speaker name (default: jwynia)

**Output:** `.wav` files alongside each `.txt` file

### scripts/transcribe-full.py (Recommended)

Transcribe audio with word-level timestamps using the `openai-whisper` Python package.

```bash
python scripts/transcribe-full.py ./output/full-narration.wav
python scripts/transcribe-full.py ./output/full-narration.wav --model large-v3
python scripts/transcribe-full.py ./output/full-narration.wav --output-dir ./captions/
```

**Options:**
- `--model` - Whisper model: tiny, base, small, medium, large, large-v2, large-v3 (default: medium)
- `--output-dir` - Output directory (default: same as audio file)

**Output:**
- `.vtt` file with word-level timestamps
- `.json` file with captions array for Remotion

**Dependencies:** Requires `openai-whisper` in Python environment:
```bash
pip install openai-whisper
```
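
For reference, this is roughly how `openai-whisper` exposes word-level timing; the shipped script builds its VTT and JSON output on top of data shaped like this:

```python
# Word-level timestamps from openai-whisper (requires word_timestamps=True).
import whisper

model = whisper.load_model("medium")
result = model.transcribe("./output/full-narration.wav", word_timestamps=True)

for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['word'].strip()}: {word['start']:.2f}s -> {word['end']:.2f}s")
```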

### scripts/extract-scene-boundaries.py

Extract scene timing boundaries from transcript by matching scene opening phrases.

```bash
# Human-readable table
python scripts/extract-scene-boundaries.py ./output/scenes/ ./output/full-narration.json

# JSON output
python scripts/extract-scene-boundaries.py ./output/scenes/ ./output/full-narration.json --json

# TypeScript for Video.tsx
python scripts/extract-scene-boundaries.py ./output/scenes/ ./output/full-narration.json --typescript
```

**Options:**
- `--json` - Output as JSON array
- `--typescript` - Output as TypeScript code for Video.tsx scenes array

**Output:** Scene numbers, slugs, start times, and durations
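
The matching idea is to look for each scene's opening words in the word-level transcript and take the timestamp of the first match. A simplified sketch; it assumes `full-narration.json` is a flat list of `{"text", "startMs", "endMs"}` caption objects (adjust to the actual layout), and the real script is more tolerant of punctuation and near-misses:

```python
# Locate each scene's start time by matching its first few words against the transcript.
import json
from pathlib import Path

captions = json.loads(Path("./output/full-narration.json").read_text())
transcript = [c["text"].lower().strip(".,!?") for c in captions]

for txt in sorted(Path("./output/scenes").glob("*.txt")):
    opening = [w.lower().strip(".,!?") for w in txt.read_text().split()[:4]]
    for i in range(len(transcript) - len(opening) + 1):
        if transcript[i:i + len(opening)] == opening:
            print(f"{txt.stem}: starts at {captions[i]['startMs'] / 1000:.2f}s")
            break
```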

### scripts/transcribe-scenes.ts (Deprecated)

> **Deprecated:** Requires whisper-cpp binary which may not be installed. Use `transcribe-full.py` instead.

Transcribe per-scene audio files using whisper-cpp.

```bash
deno run --allow-read --allow-write --allow-run scripts/transcribe-scenes.ts ./output/scenes/
```

**Output:** `.vtt` files with word-level timestamps

### scripts/full-pipeline.ts

Orchestrate the complete pipeline.

```bash
deno run -A scripts/full-pipeline.ts input.md --output ./output/project-name/
```

**Options:**
- `--output` - Output directory (required)
- `--adapt` - Apply spoken adaptation
- `--skip-tts` - Skip audio generation (text only)
- `--skip-transcribe` - Skip Whisper transcription

## Output Format

### manifest.json

```json
{
  "source": "appliance-vs-trade-tool-draft.md",
  "created_at": "2024-01-15T10:30:00Z",
  "total_scenes": 9,
  "total_duration_seconds": 420,
  "scenes": [
    {
      "number": 1,
      "slug": "popcorn-opening",
      "word_count": 185,
      "audio_duration_seconds": 55.2,
      "files": {
        "text": "scenes/01-popcorn-opening.txt",
        "audio": "scenes/01-popcorn-opening.wav",
        "captions": "scenes/01-popcorn-opening.vtt"
      },
      "captions": [
        { "text": "Two", "startMs": 0, "endMs": 180, "confidence": 0.98 },
        { "text": "people", "startMs": 180, "endMs": 450, "confidence": 0.97 }
      ]
    }
  ]
}
```

### VTT Format

```vtt
WEBVTT

00:00.000 --> 00:00.180
Two

00:00.180 --> 00:00.450
people

00:00.450 --> 00:00.720
walk

00:00.720 --> 00:01.100
into
```
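
If you need the word timing outside of Remotion (for example, to drive your own text animation), the VTT is easy to parse. A minimal sketch, assuming the `MM:SS.mmm` cue format shown above:

```python
# Parse word-level VTT cues into (start_seconds, end_seconds, word) tuples.
import re
from pathlib import Path

CUE = re.compile(r"(\d+):(\d+\.\d+) --> (\d+):(\d+\.\d+)")

def to_seconds(minutes: str, seconds: str) -> float:
    return int(minutes) * 60 + float(seconds)

words = []
lines = Path("./output/full-narration.vtt").read_text().splitlines()
for i, line in enumerate(lines):
    match = CUE.match(line)
    if match and i + 1 < len(lines):
        start = to_seconds(match.group(1), match.group(2))
        end = to_seconds(match.group(3), match.group(4))
        words.append((start, end, lines[i + 1].strip()))

print(words[:4])  # e.g. [(0.0, 0.18, 'Two'), (0.18, 0.45, 'people'), ...]
```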

## Spoken Adaptation

When `--adapt` is enabled, the skill transforms written conventions to spoken equivalents:

| Written | Spoken |
|---------|--------|
| Parenthetical asides | Em-dash or separate sentence |
| "e.g." | "for example" |
| "i.e." | "that is" |
| Long nested clauses | Split into multiple sentences |
| Semicolons | Periods |
| `*emphasis*` | Context-appropriate stress |

**Preserve:**
- Author's voice and tone
- Rhetorical questions
- Deliberate repetition
- Key phrases and memorable formulations
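
The purely mechanical substitutions in the table lend themselves to simple text processing; the voice-level decisions (parentheticals, nested clauses, emphasis) stay with the agent. A small illustrative sketch, not the skill's actual `--adapt` implementation:

```python
# Apply a few mechanical written-to-spoken substitutions from the table above.
import re

def adapt_for_speech(text: str) -> str:
    text = re.sub(r"\be\.g\.,?\s*", "for example, ", text)
    text = re.sub(r"\bi\.e\.,?\s*", "that is, ", text)
    text = text.replace(";", ".")                  # semicolons read poorly aloud
    text = re.sub(r"\*(\w[^*]*)\*", r"\1", text)   # drop markdown emphasis markers
    return text

print(adapt_for_speech("Use scenes (i.e., short units); keep *key* phrases."))
```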

## Integration

### With remotion-designer
- Pass manifest scene list to remotion-designer
- Each scene becomes a visual design unit
- Word-level timing drives text animation

### With Remotion Compositions
```tsx
import { Audio, Sequence, staticFile, useVideoConfig } from 'remotion';
import manifest from './output/manifest.json';

// Lay scenes out back-to-back: each scene's duration determines the next scene's start frame.
// CaptionRenderer is your own component that animates scene.captions word by word.
export const NarrationTrack = () => {
  const { fps } = useVideoConfig();
  let accumulatedFrames = 0;
  return (
    <>
      {manifest.scenes.map((scene) => {
        const from = accumulatedFrames;
        const durationInFrames = Math.round(scene.audio_duration_seconds * fps);
        accumulatedFrames += durationInFrames;
        return (
          <Sequence key={scene.number} from={from} durationInFrames={durationInFrames}>
            <Audio src={staticFile(scene.files.audio)} />
            <CaptionRenderer captions={scene.captions} />
          </Sequence>
        );
      })}
    </>
  );
};
```

## Technical Notes

### WAV Format Conversion
Whisper requires 16kHz mono WAV. The pipeline handles conversion automatically:
```bash
ffmpeg -i input.wav -ar 16000 -ac 1 output_16khz.wav
```

### TTS Model
The fine-tuned voice model (~7.8GB) lives at `tts/model/` but is not tracked in git due to size (see Model Setup above). Uses Qwen3-TTS with a custom speaker embedding.

### Performance
- TTS: ~5-30 seconds per sentence (Apple Silicon MPS or NVIDIA CUDA)
- Whisper: ~0.5-2x realtime depending on model size
- Full essay (~2000 words): ~10-20 minutes total processing

## What This Skill Does NOT Do

- Generate video visuals (use remotion-designer)
- Real-time voice synthesis
- Multi-speaker dialogue
- Edit or improve the content's argument
- Make editorial changes beyond mechanical spoken adaptation

Overview

This skill converts written documents into narrated video scripts with TTS audio and precise word-level timing. It interprets the document to determine natural scene breaks, then outputs scene text files, a single consistent narration audio file (recommended), word-timed captions (VTT/JSON), and a manifest describing scene timings. Use it to prepare essays, blog posts, or articles for voiceover-driven video composition.

How this skill works

The agent analyzes the document and locates natural rhetorical breaks rather than applying a rigid template. It can split the input into numbered scene text files, apply optional spoken-adaptation rules, generate one unified TTS audio file for consistent volume and pacing, transcribe that audio to produce word-level timestamps, and map timestamps back to scene boundaries. Final outputs include per-scene scripts, audio/captions, and a manifest usable by video composition tools.

When to use it

  • Turning a long-form article, essay, or blog post into a narrated video script
  • Preparing content for high-quality TTS voiceover with consistent pacing and volume
  • Creating word-synchronized captions or VTT files for text animations
  • Breaking long content into digestible scenes for visual editing
  • When you want a single-voice batch pipeline rather than real-time or multi-speaker synthesis

Best practices

  • Prefer the full-narration pipeline to avoid volume and pacing jumps between scenes
  • Target 100–300 words per scene (about 30–90 seconds) and avoid micro-scenes under 50 words
  • Use Markdown with clear headings to help the agent identify strong boundaries
  • Enable spoken adaptation for smoother natural-speech phrasing without changing the author’s voice
  • Review scene boundaries and the manifest before final composition to tweak pacing or wording

Example use cases

  • Convert a how-to blog post into a sequence of scenes with timed captions for a tutorial video
  • Prepare an opinion essay for a narrated short documentary with synchronized on-screen text
  • Create a marketing explainer from a whitepaper, producing single-pass TTS audio and VTT captions for animations
  • Produce accessibility-friendly captions and precise timing for existing video projects
  • Batch-process a newsletter series into narrated clips for social platforms

FAQ

Does this support multiple speakers or dialogue?

No. The pipeline targets a single narrator voice. Multi-speaker dialogue and speaker attribution are not supported.

Why use a single audio file instead of per-scene audio?

A single full narration produces consistent volume and pacing. Per-scene TTS can create audible jumps and clipping when concatenated, so the full-narration pipeline is recommended.