
transcribe-audio skill

/.claude/skills/transcribe-audio

This skill transcribes video audio with WhisperX, preserving original timestamps and producing a word-level JSON transcript for accurate playback and analysis.

npx playbooks add skill barefootford/buttercut --skill transcribe-audio

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md
2.5 KB
---
name: transcribe-audio
description: Transcribes video audio using WhisperX, preserving original timestamps. Creates JSON transcript with word-level timing. Use when you need to generate audio transcripts for videos.
---

# Skill: Transcribe Audio

Transcribes video audio using WhisperX and creates clean JSON transcripts with word-level timing data.

## When to Use
- Videos need audio transcripts before visual analysis

## Critical Requirements

Use WhisperX, NOT standard Whisper. WhisperX preserves the original video timeline, including leading silence, so transcripts match actual video timestamps. Run WhisperX directly on video files rather than extracting audio separately; this keeps transcript timestamps aligned with the source video.

## Workflow

### 1. Read Language from Library File

Read the library's `library.yaml` to get the language code:

```yaml
# Library metadata
library_name: [library-name]
language: en  # Language code stored here
...
```
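
If you script this lookup rather than reading the file by hand, a minimal Ruby sketch (the library path is a placeholder, and the `"en"` fallback is an assumption, not part of the skill contract):

```ruby
require "yaml"

# Load the library metadata and read the language code.
# Path and fallback value are illustrative assumptions.
library = YAML.load_file("libraries/my-library/library.yaml")
language = library.fetch("language", "en")
puts language # => "en"
```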

### 2. Run WhisperX

```bash
whisperx "/full/path/to/video.mov" \
  --language en \
  --model medium \
  --compute_type float32 \
  --device cpu \
  --output_format json \
  --output_dir libraries/[library-name]/transcripts
```
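
For scripted pipelines, a hedged Ruby sketch of the same invocation, feeding in the language code from step 1 (paths are placeholders; the flags mirror the command above):

```ruby
language   = "en"                                # from step 1
video_path = "/full/path/to/video.mov"           # placeholder
output_dir = "libraries/my-library/transcripts"  # placeholder

# Array form of system() bypasses the shell, so paths with spaces are safe.
cmd = [
  "whisperx", video_path,
  "--language", language,
  "--model", "medium",
  "--compute_type", "float32",
  "--device", "cpu",
  "--output_format", "json",
  "--output_dir", output_dir,
]
system(*cmd) or raise "WhisperX failed for #{video_path}"
```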

### 3. Prepare Audio Transcript

After WhisperX completes, format the JSON with the skill's prepare_audio_script.rb:

```bash
ruby .claude/skills/transcribe-audio/prepare_audio_script.rb \
  libraries/[library-name]/transcripts/video_name.json \
  /full/path/to/original/video_name.mov
```

This script (sketched after the list below):
- Adds video source path as metadata
- Removes unnecessary fields to reduce file size
- Prettifies JSON
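
For orientation, a minimal Ruby sketch of what a preparation step like this could look like. It is not the actual prepare_audio_script.rb; the `video_source` key, the stripped `score` field, and the assumed WhisperX JSON shape (`segments` containing `words` lists) are illustrative:

```ruby
require "json"

transcript_path, video_path = ARGV

data = JSON.parse(File.read(transcript_path))

# 1. Add the video source path as metadata (key name assumed).
data["video_source"] = video_path

# 2. Drop fields playback does not need; which keys WhisperX emits
#    and which ones to strip are assumptions here.
data["segments"]&.each do |segment|
  segment["words"]&.each { |word| word.delete("score") }
end

# 3. Prettify the JSON and write it back in place.
File.write(transcript_path, JSON.pretty_generate(data))
```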

### 4. Return Success Response

After audio preparation completes, return this structured response to the parent agent:

```
āœ“ [video_filename.mov] transcribed successfully
  Audio transcript: libraries/[library-name]/transcripts/video_name.json
  Video path: /full/path/to/video_filename.mov
```

**DO NOT update library.yaml**: the parent agent will handle this to avoid race conditions when running multiple transcriptions in parallel.

## Running in Parallel

This skill is designed to run inside a Task agent for parallel execution:
- Each agent handles ONE video file
- Multiple agents can run simultaneously
- Parent thread updates library.yaml sequentially after each agent completes (see the sketch below)
- No race conditions on shared YAML file
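
For context, a hypothetical sketch of that parent-side update (this skill never runs it; the `transcripts` key in library.yaml is an assumption):

```ruby
require "yaml"

# Parent agent: apply one agent's result at a time so the shared
# YAML file is never written concurrently.
def record_transcript(library_path, video, transcript)
  library = YAML.load_file(library_path)
  (library["transcripts"] ||= {})[video] = transcript # key name assumed
  File.write(library_path, library.to_yaml)
end

record_transcript(
  "libraries/my-library/library.yaml",
  "video_name.mov",
  "libraries/my-library/transcripts/video_name.json"
)
```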

## Next Step

After audio transcription, use the **analyze-video** skill to add visual descriptions and create the visual transcript.

## Installation

Ensure WhisperX is installed. Use the **setup** skill to verify dependencies.

Overview

This skill transcribes video audio with WhisperX and produces JSON transcripts that preserve original video timestamps. It generates word-level timing and includes the video source path as metadata. Use it when you need accurate, timeline-aligned transcripts for downstream analysis.

How this skill works

The skill runs WhisperX directly on video files (do not extract audio) so timestamps align with the original video timeline, including leading silence. After WhisperX completes, a Ruby preparation script cleans and prettifies the JSON, removes unnecessary fields, and injects video source metadata. The skill returns a concise success response pointing to the transcript path and original video file.

When to use it

  • You need accurate, timeline-aligned transcripts for videos before visual analysis.
  • You require word-level timing for subtitle generation or search indexing.
  • You must preserve original video timestamps including leading silence.
  • You plan to run many transcriptions in parallel inside task agents.
  • You want clean, compact JSON transcripts with source metadata included.

Best practices

  • Run WhisperX directly on video files; do not extract audio separately to avoid timestamp drift.
  • Read the library language code from the library metadata file and pass it to WhisperX.
  • Use the prepare script after WhisperX to remove extra fields and add video path metadata for traceability.
  • Run one video per task agent to enable safe parallel execution and avoid race conditions.
  • Do not update the library metadata file from this skill; let the parent agent handle library-level updates.

Example use cases

  • Generate searchable, timestamped transcripts for a video archive before running visual content analysis.
  • Produce word-accurate subtitles for training machine learning models that require aligned text/audio pairs.
  • Create JSON transcripts with per-word timing for indexing and video search features.
  • Run many simultaneous transcriptions across a media library using task agents, then hand results to a parent agent for cataloging.
  • Preprocess videos for a pipeline that combines audio transcripts with visual scene descriptions from a separate analysis skill.

FAQ

Why must WhisperX run on the original video file?

WhisperX preserves the original timeline when run on the video file. Extracting audio can shift or remove leading silence and break timestamp alignment.

Can this skill update library metadata when done?

No. The skill returns the transcript and video path. The parent agent should update library metadata to avoid race conditions when many transcriptions run in parallel.