
This skill transcribes audio to text with word-level timestamps using OpenAI Whisper, enabling precise timing for transcripts, subtitles, and speech analysis.

npx playbooks add skill benchflow-ai/skillsbench --skill whisper-transcription

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (4.4 KB)
---
name: whisper-transcription
description: "Transcribe audio/video to text with word-level timestamps using OpenAI Whisper. Use when you need speech-to-text with accurate timing information for each word."
---

# Whisper Transcription

OpenAI Whisper provides accurate speech-to-text with word-level timestamps.

## Installation

```bash
pip install openai-whisper
```
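
Whisper decodes audio via ffmpeg, so ffmpeg must also be installed and on your PATH, for example:

```bash
# Debian/Ubuntu
sudo apt-get install ffmpeg

# macOS (Homebrew)
brew install ffmpeg
```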

## Model Selection

**Use the `tiny` model for fast transcription** - it's sufficient for most tasks and runs much faster than the larger models:

| Model | Size | Speed | Accuracy |
|-------|------|-------|----------|
| tiny | 39 MB | Fastest | Good for clear speech |
| base | 74 MB | Fast | Better accuracy |
| small | 244 MB | Medium | High accuracy |

**Recommendation: Start with `tiny` - it handles clear interview/podcast audio well.**
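
If you want the model size to be configurable, a minimal sketch (the helper name and quality labels here are illustrative, not part of Whisper):

```python
import whisper

# Hypothetical helper: map rough audio quality to a model size.
def load_model_for(audio_quality: str = "clear"):
    sizes = {"clear": "tiny", "average": "base", "noisy": "small"}
    return whisper.load_model(sizes.get(audio_quality, "tiny"))
```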

## Basic Usage with Word Timestamps

```python
import whisper
import json

def transcribe_with_timestamps(audio_path, output_path):
    """
    Transcribe audio and get word-level timestamps.

    Args:
        audio_path: Path to audio/video file
        output_path: Path to save JSON output
    """
    # Use tiny model for speed
    model = whisper.load_model("tiny")

    # Transcribe with word timestamps
    result = model.transcribe(
        audio_path,
        word_timestamps=True,
        language="en"  # Specify language for better accuracy
    )

    # Extract words with timestamps
    words = []
    for segment in result["segments"]:
        if "words" in segment:
            for word_info in segment["words"]:
                words.append({
                    "word": word_info["word"].strip(),
                    "start": word_info["start"],
                    "end": word_info["end"]
                })

    with open(output_path, "w") as f:
        json.dump(words, f, indent=2)

    return words
```
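
A minimal invocation of the helper above (file names are placeholders):

```python
words = transcribe_with_timestamps("interview.mp3", "words.json")
print(words[:3])  # e.g. [{"word": "Hello", "start": 0.0, "end": 0.42}, ...]
```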

## Detecting Specific Words

```python
def find_words(transcription, target_words):
    """
    Find specific words in transcription with their timestamps.

    Args:
        transcription: List of word dicts with 'word', 'start', 'end'
        target_words: Set of words to find (lowercase)

    Returns:
        List of matches with word and timestamp
    """
    matches = []
    target_lower = {w.lower() for w in target_words}

    for item in transcription:
        word = item["word"].lower().strip()
        # Remove punctuation for matching
        clean_word = ''.join(c for c in word if c.isalnum())

        if clean_word in target_lower:
            matches.append({
                "word": clean_word,
                "timestamp": item["start"]
            })

    return matches
```
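
Combined with the transcription helper above, usage might look like this (the target words are just examples):

```python
words = transcribe_with_timestamps("interview.mp3", "words.json")
for hit in find_words(words, {"deadline", "budget"}):
    print(f'{hit["word"]} at {hit["timestamp"]:.2f}s')
```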

## Complete Example: Find Filler Words

```python
import whisper
import json

# Filler words to detect
FILLER_WORDS = {
    "um", "uh", "hum", "hmm", "mhm",
    "like", "so", "well", "yeah", "okay",
    "basically", "actually", "literally"
}

def detect_fillers(audio_path, output_path):
    # Load tiny model (fast!)
    model = whisper.load_model("tiny")

    # Transcribe
    result = model.transcribe(audio_path, word_timestamps=True, language="en")

    # Find fillers
    fillers = []
    for segment in result["segments"]:
        for word_info in segment.get("words", []):
            word = word_info["word"].lower().strip()
            clean = ''.join(c for c in word if c.isalnum())

            if clean in FILLER_WORDS:
                fillers.append({
                    "word": clean,
                    "timestamp": round(word_info["start"], 2)
                })

    with open(output_path, "w") as f:
        json.dump(fillers, f, indent=2)

    return fillers

# Usage
detect_fillers("/root/input.mp4", "/root/annotations.json")
```

## Audio Extraction (if needed)

Whisper can process video files directly, but extracting the audio track first often gives cleaner, faster results:

```bash
# Extract audio as 16kHz mono WAV
ffmpeg -i input.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav
```
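
If you would rather drive the extraction from Python, a sketch using the standard library (assumes ffmpeg is on your PATH):

```python
import subprocess

def extract_audio(video_path, wav_path):
    """Extract a 16 kHz mono WAV from a video file via ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,  # -y: overwrite output
         "-vn",                             # drop the video stream
         "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1",
         wav_path],
        check=True,
    )
```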

## Multi-Word Phrases

For detecting phrases like "you know" or "I mean":

```python
def find_phrases(transcription, phrases):
    """Find multi-word phrases in transcription."""
    matches = []
    # Normalize the same way as single-word matching: lowercase and
    # strip punctuation, so "know," still matches "know".
    words = [
        ''.join(c for c in w["word"].lower().strip() if c.isalnum())
        for w in transcription
    ]

    for phrase in phrases:
        phrase_words = phrase.lower().split()
        phrase_len = len(phrase_words)

        for i in range(len(words) - phrase_len + 1):
            if words[i:i + phrase_len] == phrase_words:
                matches.append({
                    "word": phrase,
                    "timestamp": transcription[i]["start"]
                })

    return matches
```
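
Using the word list from `transcribe_with_timestamps` above:

```python
phrase_hits = find_phrases(words, ["you know", "i mean"])
for hit in phrase_hits:
    print(f'{hit["word"]} at {hit["timestamp"]:.2f}s')
```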

Overview

This skill transcribes audio and video to text using OpenAI Whisper and returns word-level timestamps. It is built for speech-to-text tasks where precise per-word timing matters: use it to annotate, search, or analyze spoken content with time-aligned words.

How this skill works

The skill loads a Whisper model (the tiny model is recommended as a starting point for speed) and runs transcription with word_timestamps enabled. It extracts each word along with its start and end times from Whisper's segment output and writes a JSON list of word objects. Optional helpers detect specific words, filler words, or multi-word phrases by scanning the timestamped word list.

When to use it

  • Create searchable transcripts where you need exact word locations for jump-to-play in media players.
  • Generate annotations or subtitles requiring precise timing for each word.
  • Analyze speech patterns such as filler word usage, phrase frequency, or pacing.
  • Index long interviews, podcasts, or meetings for fast keyword lookup with timestamps.
  • Preprocess audio/video for downstream NLP tasks that need word-level alignment.

Best practices

  • Start with the tiny model for fast, clear-audio transcriptions and switch to base or small for higher accuracy on noisy audio.
  • Specify language explicitly to improve recognition accuracy.
  • If processing video, extract a 16 kHz mono WAV via ffmpeg for cleaner, faster transcription.
  • Normalize words (lowercase and strip punctuation) before matching to avoid false negatives.
  • Round timestamps for reporting but keep full-precision times for alignment tasks.

Example use cases

  • Produce time-aligned subtitles for a podcast and let users jump to specific words.
  • Detect and timestamp filler words in presentation coaching to give targeted feedback.
  • Find mentions of brand names or sensitive phrases in recorded interviews with exact playback offsets.
  • Build a phrase search feature that returns the start time of each phrase occurrence.
  • Create a searchable index of meeting transcripts so participants can navigate to exact discussion points.

FAQ

Which Whisper model should I choose?

Use tiny for fast, clear-audio cases; use base or small if accuracy is more important than speed.

Do I need to extract audio from video first?

Whisper accepts video directly, but extracting a 16 kHz mono WAV via ffmpeg often yields cleaner and faster results.