---
name: Automatic Speech Recognition (ASR)
description: Transcribe audio segments to text using Whisper models. Use larger models (small, medium, large-v3) for better accuracy, or faster-whisper for optimized performance. Always align transcription timestamps with diarization segments for accurate speaker-labeled subtitles.
---
# Automatic Speech Recognition (ASR)
## Overview
After speaker diarization, you need to transcribe each speech segment to text. Whisper is the current state-of-the-art for ASR, with multiple model sizes offering different trade-offs between accuracy and speed.
## When to Use
- After speaker diarization is complete
- Need to generate speaker-labeled transcripts
- Creating subtitles from audio segments
- Converting speech segments to text
## Whisper Model Selection
### Model Size Comparison
| Model | Size | Speed | Accuracy | Best For |
|-------|------|-------|----------|----------|
| tiny | 39M | Fastest | Lowest | Quick testing, low accuracy needs |
| base | 74M | Fast | Low | Fast processing with moderate accuracy |
| small | 244M | Medium | Good | **Recommended balance** |
| medium | 769M | Slow | Very Good | High accuracy needs |
| large-v3 | 1550M | Slowest | **Best** | **Maximum accuracy** |
### Recommended: Use `small` or `large-v3`
**For best accuracy (recommended for this task):**
```python
import whisper
model = whisper.load_model("large-v3") # Best accuracy
result = model.transcribe(audio_path)
```
**For balanced performance:**
```python
import whisper
model = whisper.load_model("small") # Good balance
result = model.transcribe(audio_path)
```
## Faster-Whisper (Optimized Alternative)
For faster processing with similar accuracy, use `faster-whisper`:
```python
from faster_whisper import WhisperModel

# Use the small model with CPU int8 quantization
model = WhisperModel("small", device="cpu", compute_type="int8")

# Transcribe; segments is a generator of timed segments
segments, info = model.transcribe(audio_path, beam_size=5)

# Process segments
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```
**Advantages:**
- Faster than standard Whisper
- Lower memory usage with quantization
- Similar accuracy to standard Whisper
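If a GPU is available, faster-whisper can also run with CUDA and `float16` weights. The snippet below is a minimal sketch that assumes a CUDA-capable machine and falls back to the quantized CPU setup otherwise:

```python
from faster_whisper import WhisperModel

# Assumption: a CUDA-capable GPU is available; otherwise fall back to CPU int8
try:
    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
except Exception:
    model = WhisperModel("small", device="cpu", compute_type="int8")

segments, info = model.transcribe(audio_path, beam_size=5)
```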
## Aligning Transcriptions with Diarization Segments
After diarization, you need to map Whisper transcriptions to speaker segments:
```python
import whisper

# After diarization, you have turns with speaker labels
turns = [
    {'start': 0.8, 'duration': 0.86, 'speaker': 'SPEAKER_01'},
    {'start': 5.34, 'duration': 0.21, 'speaker': 'SPEAKER_01'},
    # ...
]

# Run Whisper transcription on the full audio
model = whisper.load_model("large-v3")
result = model.transcribe(audio_path)

# Map transcriptions to turns
transcripts = {}
for i, turn in enumerate(turns):
    turn_start = turn['start']
    turn_end = turn['start'] + turn['duration']

    # Find Whisper segments that overlap this diarization turn
    overlapping_text = []
    for seg in result['segments']:
        seg_start = seg['start']
        seg_end = seg['end']
        # Keep the segment if its time range overlaps the turn
        if seg_start < turn_end and seg_end > turn_start:
            overlapping_text.append(seg['text'].strip())

    # Combine overlapping segments into one transcript per turn
    transcripts[i] = ' '.join(overlapping_text) if overlapping_text else '[INAUDIBLE]'
```
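To sanity-check the mapping, you can print each turn with its speaker label and combined text; this reuses the `turns` and `transcripts` structures built above.

```python
# Quick inspection of the speaker-labeled transcript
for i, turn in enumerate(turns):
    start = turn['start']
    end = turn['start'] + turn['duration']
    print(f"[{start:7.2f}s -> {end:7.2f}s] {turn['speaker']}: {transcripts[i]}")
```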
## Handling Empty or Inaudible Segments
```python
# If no transcription was found for a segment
if not overlapping_text:
    transcripts[i] = '[INAUDIBLE]'

# Or mark very short turns (< 0.3 s) as inaudible
if turn['duration'] < 0.3:
    transcripts[i] = '[INAUDIBLE]'
```
## Language Detection
Whisper can auto-detect language, but you can also specify:
```python
# Auto-detect (recommended)
result = model.transcribe(audio_path)
# Or specify language for better accuracy
result = model.transcribe(audio_path, language="en")
```
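When the language is auto-detected, the result dictionary returned by `model.transcribe` includes it alongside the full text, which is useful for logging or for choosing downstream post-processing:

```python
# result is the dict returned by model.transcribe(...)
print(f"Detected language: {result['language']}")
print(f"Transcript preview: {result['text'][:80]}")
```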
## Best Practices
1. **Use larger models for better accuracy**: `small` minimum, `large-v3` for best results
2. **Align timestamps carefully**: Match Whisper segments with diarization turns
3. **Handle overlaps**: Multiple Whisper segments may overlap with one diarization turn
4. **Handle gaps**: Some diarization turns may have no corresponding transcription
5. **Post-process text**: Clean up punctuation and capitalization if needed (a cleanup sketch follows this list)
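A minimal cleanup sketch for point 5, under assumed conventions (collapse whitespace from joined segments, capitalize the first letter, end each caption with terminal punctuation); adjust to your subtitle style:

```python
import re

def clean_text(text):
    # Collapse whitespace left over from joining overlapping Whisper segments
    text = re.sub(r'\s+', ' ', text).strip()
    if not text or text == '[INAUDIBLE]':
        return text or '[INAUDIBLE]'
    # Capitalize the first character and ensure terminal punctuation
    text = text[0].upper() + text[1:]
    if text[-1] not in '.!?':
        text += '.'
    return text

transcripts = {i: clean_text(t) for i, t in transcripts.items()}
```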
## Common Issues
1. **Low transcription accuracy**: Use larger model (small → medium → large-v3)
2. **Slow processing**: Use faster-whisper or smaller model
3. **Misaligned timestamps**: Check time alignment between diarization and transcription
4. **Missing transcriptions**: Check for very short segments or silence (see the diagnostic sketch after this list)
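A quick diagnostic sketch for issues 3 and 4, assuming the `turns` and `transcripts` structures from the alignment step: it lists turns that ended up without text so you can separate genuinely short or silent turns from alignment problems.

```python
# Report diarization turns that received no transcription
for i, turn in enumerate(turns):
    if transcripts.get(i, '[INAUDIBLE]') == '[INAUDIBLE]':
        end = turn['start'] + turn['duration']
        reason = "very short turn" if turn['duration'] < 0.3 else "no overlapping Whisper segment"
        print(f"Turn {i} ({turn['speaker']}, {turn['start']:.2f}s-{end:.2f}s): {reason}")
```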
## Integration with Subtitle Generation
After transcription, combine with speaker labels for subtitles:
```python
def generate_subtitles_ass(turns, transcripts, output_path):
    with open(output_path, 'w', encoding='utf-8') as f:
        # ... write ASS header (script info, styles, events format) ...
        for i, turn in enumerate(turns):
            start_time = format_time(turn['start'])
            end_time = format_time(turn['start'] + turn['duration'])
            speaker = turn['speaker']
            text = transcripts.get(i, "[INAUDIBLE]")
            # Format: SPEAKER_XX: text
            f.write(f"Dialogue: 0,{start_time},{end_time},Default,,0,0,0,,{speaker}: {text}\n")
```
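The `format_time` helper is not defined in the snippet above; one plausible implementation for the ASS timestamp format (`H:MM:SS.cc`, centisecond precision) could look like this:

```python
def format_time(seconds):
    # ASS timestamps use H:MM:SS.cc (centisecond precision)
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = seconds % 60
    return f"{hours}:{minutes:02d}:{secs:05.2f}"
```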
## Performance Tips
1. **For accuracy**: Use `large-v3` model
2. **For speed**: Use `faster-whisper` with `small` model
3. **For memory**: Use `faster-whisper` with `int8` quantization
4. **Batch processing**: Process multiple segments together where possible (a per-turn transcription sketch follows this list)
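As an alternative to transcribing the whole file and mapping by overlap, you can load the audio once and transcribe each diarization turn's slice directly. This is a minimal sketch assuming `openai-whisper`, the `turns` structure from above, and CPU inference (`fp16=False`):

```python
import whisper
from whisper.audio import SAMPLE_RATE  # 16 kHz

model = whisper.load_model("small")
audio = whisper.load_audio(audio_path)  # mono float32 array at 16 kHz

per_turn_text = {}
for i, turn in enumerate(turns):
    start = int(turn['start'] * SAMPLE_RATE)
    end = int((turn['start'] + turn['duration']) * SAMPLE_RATE)
    clip = audio[start:end]
    if len(clip) < SAMPLE_RATE // 10:  # skip clips shorter than ~0.1 s
        per_turn_text[i] = '[INAUDIBLE]'
        continue
    result = model.transcribe(clip, fp16=False)
    per_turn_text[i] = result['text'].strip() or '[INAUDIBLE]'
```

Per-turn clips lose surrounding context, so whole-file transcription plus overlap mapping usually yields better punctuation; slicing is mainly useful when turns are long or memory is tight.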