
This skill transcribes audio to text with word-level timestamps using OpenAI Whisper, enabling precise timing for transcripts, subtitles, and speech analysis.

npx playbooks add skill benchflow-ai/skillsbench --skill whisper-transcription

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (4.4 KB)
---
name: whisper-transcription
description: "Transcribe audio/video to text with word-level timestamps using OpenAI Whisper. Use when you need speech-to-text with accurate timing information for each word."
---

# Whisper Transcription

OpenAI Whisper provides accurate speech-to-text with word-level timestamps.

## Installation

```bash
pip install openai-whisper
```
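
Whisper decodes audio via ffmpeg, so ffmpeg must also be installed and on your PATH, for example:

```bash
# Debian/Ubuntu
sudo apt-get install ffmpeg

# macOS (Homebrew)
brew install ffmpeg
```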

## Model Selection

**Use the `tiny` model for fast transcription** - it's sufficient for most tasks and runs much faster than the larger models:

| Model | Size | Speed | Accuracy |
|-------|------|-------|----------|
| tiny | 39 MB | Fastest | Good for clear speech |
| base | 74 MB | Fast | Better accuracy |
| small | 244 MB | Medium | High accuracy |

**Recommendation: Start with `tiny` - it handles clear interview/podcast audio well.**
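
If you want the model size to be configurable, a minimal sketch (the helper name and quality labels here are illustrative, not part of Whisper):

```python
import whisper

# Hypothetical helper: map rough audio quality to a model size.
def load_model_for(audio_quality: str = "clear"):
    sizes = {"clear": "tiny", "average": "base", "noisy": "small"}
    return whisper.load_model(sizes.get(audio_quality, "tiny"))
```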

## Basic Usage with Word Timestamps

```python
import whisper
import json

def transcribe_with_timestamps(audio_path, output_path):
    """
    Transcribe audio and get word-level timestamps.

    Args:
        audio_path: Path to audio/video file
        output_path: Path to save JSON output
    """
    # Use tiny model for speed
    model = whisper.load_model("tiny")

    # Transcribe with word timestamps
    result = model.transcribe(
        audio_path,
        word_timestamps=True,
        language="en"  # Specify language for better accuracy
    )

    # Extract words with timestamps
    words = []
    for segment in result["segments"]:
        if "words" in segment:
            for word_info in segment["words"]:
                words.append({
                    "word": word_info["word"].strip(),
                    "start": word_info["start"],
                    "end": word_info["end"]
                })

    with open(output_path, "w") as f:
        json.dump(words, f, indent=2)

    return words
```
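
A minimal invocation of the helper above (file names are placeholders):

```python
words = transcribe_with_timestamps("interview.mp3", "words.json")
print(words[:3])  # e.g. [{"word": "Hello", "start": 0.0, "end": 0.42}, ...]
```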

## Detecting Specific Words

```python
def find_words(transcription, target_words):
    """
    Find specific words in transcription with their timestamps.

    Args:
        transcription: List of word dicts with 'word', 'start', 'end'
        target_words: Set of words to find (lowercase)

    Returns:
        List of matches with word and timestamp
    """
    matches = []
    target_lower = {w.lower() for w in target_words}

    for item in transcription:
        word = item["word"].lower().strip()
        # Remove punctuation for matching
        clean_word = ''.join(c for c in word if c.isalnum())

        if clean_word in target_lower:
            matches.append({
                "word": clean_word,
                "timestamp": item["start"]
            })

    return matches
```
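
Combined with the transcription helper above, usage might look like this (the target words are just examples):

```python
words = transcribe_with_timestamps("interview.mp3", "words.json")
for hit in find_words(words, {"deadline", "budget"}):
    print(f'{hit["word"]} at {hit["timestamp"]:.2f}s')
```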

## Complete Example: Find Filler Words

```python
import whisper
import json

# Filler words to detect
FILLER_WORDS = {
    "um", "uh", "hum", "hmm", "mhm",
    "like", "so", "well", "yeah", "okay",
    "basically", "actually", "literally"
}

def detect_fillers(audio_path, output_path):
    # Load tiny model (fast!)
    model = whisper.load_model("tiny")

    # Transcribe
    result = model.transcribe(audio_path, word_timestamps=True, language="en")

    # Find fillers
    fillers = []
    for segment in result["segments"]:
        for word_info in segment.get("words", []):
            word = word_info["word"].lower().strip()
            clean = ''.join(c for c in word if c.isalnum())

            if clean in FILLER_WORDS:
                fillers.append({
                    "word": clean,
                    "timestamp": round(word_info["start"], 2)
                })

    with open(output_path, "w") as f:
        json.dump(fillers, f, indent=2)

    return fillers

# Usage
detect_fillers("/root/input.mp4", "/root/annotations.json")
```

## Audio Extraction (if needed)

Whisper can process video files directly, but extracting the audio track first often gives cleaner, faster results:

```bash
# Extract audio as 16kHz mono WAV
ffmpeg -i input.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav
```
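
If you would rather drive the extraction from Python, a sketch using the standard library (assumes ffmpeg is on your PATH):

```python
import subprocess

def extract_audio(video_path, wav_path):
    """Extract a 16 kHz mono WAV from a video file via ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,  # -y: overwrite output
         "-vn",                             # drop the video stream
         "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1",
         wav_path],
        check=True,
    )
```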

## Multi-Word Phrases

For detecting phrases like "you know" or "I mean":

```python
def find_phrases(transcription, phrases):
    """Find multi-word phrases in transcription."""
    matches = []
    # Normalize the same way as single-word matching: lowercase and
    # strip punctuation, so "know," still matches "know".
    words = [
        ''.join(c for c in w["word"].lower().strip() if c.isalnum())
        for w in transcription
    ]

    for phrase in phrases:
        phrase_words = phrase.lower().split()
        phrase_len = len(phrase_words)

        for i in range(len(words) - phrase_len + 1):
            if words[i:i + phrase_len] == phrase_words:
                matches.append({
                    "word": phrase,
                    "timestamp": transcription[i]["start"]
                })

    return matches
```
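
Using the word list from `transcribe_with_timestamps` above:

```python
phrase_hits = find_phrases(words, ["you know", "i mean"])
for hit in phrase_hits:
    print(f'{hit["word"]} at {hit["timestamp"]:.2f}s')
```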

Overview

This skill transcribes audio and video to text using OpenAI Whisper and returns word-level timestamps. It is built for speech-to-text tasks where precise per-word timing matters: use it to annotate, search, or analyze spoken content with time-aligned words.

How this skill works

The skill loads a Whisper model (the tiny model is recommended as a starting point for speed) and runs transcription with word_timestamps enabled. It extracts each word along with its start and end times from Whisper's segment output and writes a JSON list of word objects. Optional helpers detect specific words, filler words, or multi-word phrases by scanning the timestamped word list.

When to use it

  • Create searchable transcripts where you need exact word locations for jump-to-play in media players.
  • Generate annotations or subtitles requiring precise timing for each word.
  • Analyze speech patterns such as filler word usage, phrase frequency, or pacing.
  • Index long interviews, podcasts, or meetings for fast keyword lookup with timestamps.
  • Preprocess audio/video for downstream NLP tasks that need word-level alignment.

Best practices

  • Start with the tiny model for fast, clear-audio transcriptions and switch to base or small for higher accuracy on noisy audio.
  • Specify language explicitly to improve recognition accuracy.
  • If processing video, extract a 16 kHz mono WAV via ffmpeg for cleaner, faster transcription.
  • Normalize words (lowercase and strip punctuation) before matching to avoid false negatives.
  • Round timestamps for reporting but keep full-precision times for alignment tasks.

Example use cases

  • Produce time-aligned subtitles for a podcast and let users jump to specific words.
  • Detect and timestamp filler words in presentation coaching to give targeted feedback.
  • Find mentions of brand names or sensitive phrases in recorded interviews with exact playback offsets.
  • Build a phrase search feature that returns the start time of each phrase occurrence.
  • Create a searchable index of meeting transcripts so participants can navigate to exact discussion points.

FAQ

Which Whisper model should I choose?

Use tiny for fast, clear-audio cases; use base or small if accuracy is more important than speed.

Do I need to extract audio from video first?

Whisper accepts video directly, but extracting a 16 kHz mono WAV via ffmpeg often yields cleaner and faster results.