analyze-video skill

/.claude/skills/analyze-video

This skill adds visual descriptions to audio transcripts by extracting frames with ffmpeg and incrementally building a visual transcript of the video.

npx playbooks add skill barefootford/buttercut --skill analyze-video

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md
3.0 KB
---
name: analyze-video
description: Adds visual descriptions to transcripts by extracting and analyzing video frames with ffmpeg. Creates a visual transcript with periodic visual descriptions of the video clip. Use when files have audio transcripts (transcript) but do not yet have visual transcripts (visual_transcript).
---

# Skill: Analyze Video

Add visual descriptions to audio transcripts by extracting JPG frames with ffmpeg and analyzing them. **Never read video files directly** - extract frames first.

## Prerequisites

Videos must have audio transcripts. Run **transcribe-audio** skill first if needed.

## Workflow

### 1. Copy & Clean Audio Transcript

Don't read the audio transcript. Copy it, then prepare it with prepare_visual_script.rb, which removes word-level timing data and prettifies the JSON for easier editing:

```bash
cp libraries/[library]/transcripts/video.json libraries/[library]/transcripts/visual_video.json
ruby .claude/skills/analyze-video/prepare_visual_script.rb libraries/[library]/transcripts/visual_video.json
```
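To confirm the preparation step worked before moving on, an optional sanity check (an assumption for illustration, not part of the skill) is to parse the prepared file:

```bash
# Optional check: confirm the prepared file is still valid JSON (not part of the skill)
ruby -rjson -e 'JSON.parse(File.read(ARGV[0])); puts "valid JSON"' \
  libraries/[library]/transcripts/visual_video.json
```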

### 2. Extract Frames (Binary Search)

Create frame directory: `mkdir -p tmp/frames/[video_name]`

**Videos ≤30s:** Extract one frame at 2s
**Videos >30s:** Extract start (2s), middle (duration/2), end (duration-2s)

```bash
ffmpeg -ss 00:00:02 -i video.mov -vframes 1 -vf "scale=1280:-1" tmp/frames/[video_name]/start.jpg
```
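The middle and end frames need the clip duration first. A minimal sketch, assuming ffprobe is available (it ships with ffmpeg) and reusing the bracketed placeholders from above:

```bash
# Duration in seconds, printed as a plain float
dur=$(ffprobe -v error -show_entries format=duration \
  -of default=noprint_wrappers=1:nokey=1 video.mov)

# Middle (duration/2) and end (duration-2s) timestamps
mid=$(awk -v d="$dur" 'BEGIN { printf "%.2f", d / 2 }')
end=$(awk -v d="$dur" 'BEGIN { printf "%.2f", d - 2 }')

# ffmpeg accepts fractional seconds for -ss
ffmpeg -ss "$mid" -i video.mov -vframes 1 -vf "scale=1280:-1" tmp/frames/[video_name]/middle.jpg
ffmpeg -ss "$end" -i video.mov -vframes 1 -vf "scale=1280:-1" tmp/frames/[video_name]/end.jpg
```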

**Subdivide when:** the start, middle, and end frames show different subjects, settings, or camera angles
**Stop when:** the footage no longer appears to change, or changes only slightly between samples
**Never sample** more frequently than once per 30 seconds
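For example (a hypothetical clip, not from the skill): if the start (2s) and middle (90s) frames of a 180-second video show different settings, sample the midpoint between them. The resulting gaps of roughly 44s still respect the 30-second floor:

```bash
# Midpoint between the 2s and 90s samples of a hypothetical 180s clip
ffmpeg -ss 00:00:46 -i video.mov -vframes 1 -vf "scale=1280:-1" tmp/frames/[video_name]/start_mid.jpg
```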

### 3. Add Visual Descriptions

Read the visual_video.json file you created in step 1.

**Read the JPG frames** from `tmp/frames/[video_name]/` using Read tool, then **Edit** `visual_video.json`:

Do this incrementally. You don't need to write a program or script for this step; just edit the JSON each time you read new frames.

**Dialogue segments - add `visual` field:**
```json
{
  "start": 2.917,
  "end": 7.586,
  "text": "Hey, good afternoon everybody.",
  "visual": "Man in red shirt speaking to camera in medium shot. Home office with bookshelf. Natural lighting.",
  "words": [...]
}
```

**B-roll segments - insert new entries:**
```json
{
  "start": 35.474,
  "end": 56.162,
  "text": "",
  "visual": "Green bicycle parked in front of building. Urban street with trees.",
  "b_roll": true,
  "words": []
}
```

**Guidelines:**
- Descriptions should be 3 sentences max.
- First segment: detailed (subject, setting, shot type, lighting, camera style)
- Continuing shots: brief if the shot is similar; up to 3 sentences if it changes drastically.

### 4. Cleanup & Return

```bash
rm -rf tmp/frames/[video_name]
```

Return structured response:
```
✓ [video_filename.mov] analyzed successfully
  Visual transcript: libraries/[library]/transcripts/visual_video.json
  Video path: /full/path/to/video_filename.mov
```

**DO NOT update library.yaml** - parent agent handles this to avoid race conditions in parallel execution.

Overview

This skill adds visual descriptions to existing audio transcripts by extracting and analyzing video frames with ffmpeg. It creates a visual transcript JSON that pairs short scene descriptions with the corresponding transcript segments, without ever reading video files directly.

How this skill works

First, copy and prepare the audio transcript JSON to remove word-level timing and make it editable. Then extract a small set of JPG frames (start, middle, end, or additional samples when the scene changes) using ffmpeg. Read the extracted frames and incrementally edit the visual transcript JSON, inserting concise visual descriptions for dialogue and b-roll segments. Finally, clean up temporary frames and return a structured success message.
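Condensed, the shell side of the workflow looks like this (bracketed paths are placeholders; the frame reading and JSON edits in between happen with the Read and Edit tools, not shell commands):

```bash
# 1. Copy and prepare the audio transcript
cp libraries/[library]/transcripts/video.json libraries/[library]/transcripts/visual_video.json
ruby .claude/skills/analyze-video/prepare_visual_script.rb libraries/[library]/transcripts/visual_video.json

# 2. Extract sample frames (start shown; add middle/end for videos >30s)
mkdir -p tmp/frames/[video_name]
ffmpeg -ss 00:00:02 -i video.mov -vframes 1 -vf "scale=1280:-1" tmp/frames/[video_name]/start.jpg

# 3. Read the frames and incrementally edit visual_video.json (Read/Edit tools)

# 4. Clean up
rm -rf tmp/frames/[video_name]
```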

When to use it

  • You already have an audio transcript for the video but no visual_transcript present.
  • You need accessible visual descriptions for editors, producers, or accessibility workflows.
  • You want short contextual scene annotations to guide video editing or scene selection.
  • You are preparing content for platforms that require visual metadata alongside captions.

Best practices

  • Always run the audio transcription first and work from the prepared visual_video.json file.
  • Extract frames into a temporary directory and never read video files directly—use ffmpeg frame exports.
  • Sample at most once per 30 seconds unless the footage clearly changes in subject, setting, or angle.
  • Keep descriptions to three sentences max; make the first description detailed (subject, setting, shot type, lighting).
  • Edit incrementally as you view frames rather than attempting to auto-generate everything in one pass.

Example use cases

  • Add short visual descriptions to interview transcripts so editors can quickly locate cutaways and J-cuts.
  • Generate visual metadata for long-form lectures where slides and camera framing change occasionally.
  • Create accessibility-friendly transcripts for documentary clips with mixed dialogue and b-roll.
  • Produce visual markers to guide automated or manual highlight extraction from event footage.

FAQ

Do I need special tools installed?

Yes — ffmpeg is required to extract JPG frames, and Ruby is used to run the prepare_visual_script.rb helper.

How many frames should I extract?

For videos ≤30s, extract one frame at 2s. For longer videos, extract start, middle, and end frames. Subdivide only when the visual content clearly changes; never sample more often than once per 30s.
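As code, that decision might look like this sketch (an illustration under the same ffprobe assumption as step 2, not part of the skill):

```bash
# Pick sample points from the clip duration, mirroring the rule above
dur=$(ffprobe -v error -show_entries format=duration \
  -of default=noprint_wrappers=1:nokey=1 video.mov)
if awk -v d="$dur" 'BEGIN { exit !(d <= 30) }'; then
  echo "sample at 2s"
else
  awk -v d="$dur" 'BEGIN { printf "sample at 2s, %.0fs, and %.0fs\n", d/2, d-2 }'
fi
```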