---
name: whisper-transcription
description: "Transcribe audio/video to text with word-level timestamps using OpenAI Whisper. Use when you need speech-to-text with accurate timing information for each word."
---
# Whisper Transcription
OpenAI Whisper provides accurate speech-to-text with word-level timestamps.
## Installation
```bash
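# Note: Whisper also requires ffmpeg installed on the system (it uses it to decode audio)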
pip install openai-whisper
```
## Model Selection
**Use the `tiny` model for fast transcription.** It is sufficient for most tasks and runs much faster than the larger models:
| Model | Size | Speed | Accuracy |
|-------|------|-------|----------|
| tiny | 39 MB | Fastest | Good for clear speech |
| base | 74 MB | Fast | Better accuracy |
| small | 244 MB | Medium | High accuracy |
**Recommendation: Start with `tiny` - it handles clear interview/podcast audio well.**
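If you want to check the speed/accuracy trade-off on your own material, a quick timing sketch (`audio.wav` is a placeholder path):
```python
import time
import whisper

# Compare models on the same clip; "audio.wav" is a placeholder path
for name in ("tiny", "base"):
    model = whisper.load_model(name)
    start = time.time()
    result = model.transcribe("audio.wav")
    print(f"{name}: {time.time() - start:.1f}s -> {result['text'][:60]}...")
```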
## Basic Usage with Word Timestamps
```python
import whisper
import json

def transcribe_with_timestamps(audio_path, output_path):
    """
    Transcribe audio and get word-level timestamps.

    Args:
        audio_path: Path to audio/video file
        output_path: Path to save JSON output
    """
    # Use tiny model for speed
    model = whisper.load_model("tiny")

    # Transcribe with word timestamps
    result = model.transcribe(
        audio_path,
        word_timestamps=True,
        language="en"  # Specify language for better accuracy
    )

    # Extract words with timestamps
    words = []
    for segment in result["segments"]:
        if "words" in segment:
            for word_info in segment["words"]:
                words.append({
                    "word": word_info["word"].strip(),
                    "start": word_info["start"],
                    "end": word_info["end"]
                })

    with open(output_path, "w") as f:
        json.dump(words, f, indent=2)

    return words
```
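A typical call looks like this (the file names here are placeholders):
```python
# Hypothetical paths for illustration
words = transcribe_with_timestamps("interview.mp3", "words.json")
print(words[:3])  # First few dicts with "word", "start", "end" keys
```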
## Detecting Specific Words
```python
def find_words(transcription, target_words):
    """
    Find specific words in transcription with their timestamps.

    Args:
        transcription: List of word dicts with 'word', 'start', 'end'
        target_words: Set of words to find (lowercase)

    Returns:
        List of matches with word and timestamp
    """
    matches = []
    target_lower = {w.lower() for w in target_words}

    for item in transcription:
        word = item["word"].lower().strip()
        # Remove punctuation for matching
        clean_word = ''.join(c for c in word if c.isalnum())
        if clean_word in target_lower:
            matches.append({
                "word": clean_word,
                "timestamp": item["start"]
            })

    return matches
```
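Combined with the transcription helper above (again, file names are placeholders):
```python
words = transcribe_with_timestamps("interview.mp3", "words.json")
hits = find_words(words, {"um", "uh"})
for hit in hits:
    print(f"{hit['word']} at {hit['timestamp']:.2f}s")
```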
## Complete Example: Find Filler Words
```python
import whisper
import json

# Filler words to detect
FILLER_WORDS = {
    "um", "uh", "hum", "hmm", "mhm",
    "like", "so", "well", "yeah", "okay",
    "basically", "actually", "literally"
}

def detect_fillers(audio_path, output_path):
    # Load tiny model (fast!)
    model = whisper.load_model("tiny")

    # Transcribe
    result = model.transcribe(audio_path, word_timestamps=True, language="en")

    # Find fillers
    fillers = []
    for segment in result["segments"]:
        for word_info in segment.get("words", []):
            word = word_info["word"].lower().strip()
            clean = ''.join(c for c in word if c.isalnum())
            if clean in FILLER_WORDS:
                fillers.append({
                    "word": clean,
                    "timestamp": round(word_info["start"], 2)
                })

    with open(output_path, "w") as f:
        json.dump(fillers, f, indent=2)

    return fillers

# Usage
detect_fillers("/root/input.mp4", "/root/annotations.json")
```
## Audio Extraction (if needed)
Whisper can process video files directly, but for cleaner results:
```bash
# Extract audio as 16kHz mono WAV
ffmpeg -i input.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav
```
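If you prefer to script the extraction step, a minimal sketch using Python's subprocess module (the paths are placeholders, and it assumes `ffmpeg` is on PATH):
```python
import subprocess

def extract_audio(video_path, audio_path="audio.wav"):
    """Extract 16 kHz mono WAV from a video file using ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vn", "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1",
         audio_path],
        check=True,  # Raise if ffmpeg exits with an error
    )
    return audio_path
```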
## Multi-Word Phrases
For detecting phrases like "you know" or "I mean":
```python
def find_phrases(transcription, phrases):
    """Find multi-word phrases in transcription."""
    matches = []
    # Normalize words the same way as single-word matching: lowercase
    # with punctuation stripped, so "know," still matches "know"
    words = [
        ''.join(c for c in w["word"].lower() if c.isalnum())
        for w in transcription
    ]

    for phrase in phrases:
        phrase_words = phrase.lower().split()
        phrase_len = len(phrase_words)
        for i in range(len(words) - phrase_len + 1):
            if words[i:i + phrase_len] == phrase_words:
                matches.append({
                    "word": phrase,
                    "timestamp": transcription[i]["start"]
                })

    return matches
```
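A quick usage sketch, reusing the word list returned by `transcribe_with_timestamps` above:
```python
# "words" is the list returned by transcribe_with_timestamps
hits = find_phrases(words, ["you know", "i mean"])
for hit in hits:
    print(f'"{hit["word"]}" at {hit["timestamp"]:.2f}s')
```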
## Overview

This skill transcribes audio and video to text using OpenAI Whisper and returns word-level timestamps. It is built for accurate speech-to-text needs where precise timing per word is required. Use it to annotate, search, or analyze spoken content with time-aligned words.

The skill loads a Whisper model (start with the `tiny` model for speed) and runs transcription with `word_timestamps` enabled. It extracts each word along with its start and end times from Whisper's segment output and writes a JSON list of word objects. Optional helpers detect specific words, filler words, or multi-word phrases by scanning the timestamped word list.

## FAQ

**Which Whisper model should I choose?**
Use `tiny` for fast, clear-audio cases; use `base` or `small` if accuracy is more important than speed.

**Do I need to extract audio from video first?**
Whisper accepts video directly, but extracting a 16 kHz mono WAV via ffmpeg often yields cleaner and faster results.