---
name: Automatic Speech Recognition (ASR)
description: Transcribe audio segments to text using Whisper models. Use larger models (small, medium, large-v3) for better accuracy, or faster-whisper for optimized performance. Always align transcription timestamps with diarization segments for accurate speaker-labeled subtitles.
---
# Automatic Speech Recognition (ASR)
## Overview
After speaker diarization, you need to transcribe each speech segment to text. Whisper is the current state-of-the-art for ASR, with multiple model sizes offering different trade-offs between accuracy and speed.
## When to Use
- After speaker diarization is complete
- Need to generate speaker-labeled transcripts
- Creating subtitles from audio segments
- Converting speech segments to text
## Whisper Model Selection
### Model Size Comparison
| Model | Size | Speed | Accuracy | Best For |
|-------|------|-------|----------|----------|
| tiny | 39M | Fastest | Lowest | Quick testing, low accuracy needs |
| base | 74M | Fast | Low | Fast processing with moderate accuracy |
| small | 244M | Medium | Good | **Recommended balance** |
| medium | 769M | Slow | Very Good | High accuracy needs |
| large-v3 | 1550M | Slowest | **Best** | **Maximum accuracy** |
### Recommended: Use `small` or `large-v3`
**For best accuracy (recommended for this task):**
```python
import whisper
model = whisper.load_model("large-v3") # Best accuracy
result = model.transcribe(audio_path)
```
**For balanced performance:**
```python
import whisper
model = whisper.load_model("small") # Good balance
result = model.transcribe(audio_path)
```
## Faster-Whisper (Optimized Alternative)
For faster processing with similar accuracy, use `faster-whisper`:
```python
from faster_whisper import WhisperModel

# Use the small model with CPU int8 quantization
model = WhisperModel("small", device="cpu", compute_type="int8")

# Transcribe; segments is a generator of timed segments
segments, info = model.transcribe(audio_path, beam_size=5)

# Process segments
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```
**Advantages:**
- Faster than standard Whisper
- Lower memory usage with quantization
- Similar accuracy to standard Whisper
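If a GPU is available, faster-whisper can also run with CUDA and `float16` weights. The snippet below is a minimal sketch that assumes a CUDA-capable machine and falls back to the quantized CPU setup otherwise:

```python
from faster_whisper import WhisperModel

# Assumption: a CUDA-capable GPU is available; otherwise fall back to CPU int8
try:
    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
except Exception:
    model = WhisperModel("small", device="cpu", compute_type="int8")

segments, info = model.transcribe(audio_path, beam_size=5)
```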
## Aligning Transcriptions with Diarization Segments
After diarization, you need to map Whisper transcriptions to speaker segments:
```python
import whisper

# After diarization, you have turns with speaker labels
turns = [
    {'start': 0.8, 'duration': 0.86, 'speaker': 'SPEAKER_01'},
    {'start': 5.34, 'duration': 0.21, 'speaker': 'SPEAKER_01'},
    # ...
]

# Run Whisper transcription on the full audio
model = whisper.load_model("large-v3")
result = model.transcribe(audio_path)

# Map transcriptions to turns
transcripts = {}
for i, turn in enumerate(turns):
    turn_start = turn['start']
    turn_end = turn['start'] + turn['duration']

    # Find Whisper segments that overlap this diarization turn
    overlapping_text = []
    for seg in result['segments']:
        seg_start = seg['start']
        seg_end = seg['end']
        # Keep the segment if its time range overlaps the turn
        if seg_start < turn_end and seg_end > turn_start:
            overlapping_text.append(seg['text'].strip())

    # Combine overlapping segments into one transcript per turn
    transcripts[i] = ' '.join(overlapping_text) if overlapping_text else '[INAUDIBLE]'
```
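To sanity-check the mapping, you can print each turn with its speaker label and combined text; this reuses the `turns` and `transcripts` structures built above.

```python
# Quick inspection of the speaker-labeled transcript
for i, turn in enumerate(turns):
    start = turn['start']
    end = turn['start'] + turn['duration']
    print(f"[{start:7.2f}s -> {end:7.2f}s] {turn['speaker']}: {transcripts[i]}")
```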
## Handling Empty or Inaudible Segments
```python
# If no transcription was found for a segment
if not overlapping_text:
    transcripts[i] = '[INAUDIBLE]'

# Or mark very short turns (< 0.3 s) as inaudible
if turn['duration'] < 0.3:
    transcripts[i] = '[INAUDIBLE]'
```
## Language Detection
Whisper can auto-detect language, but you can also specify:
```python
# Auto-detect (recommended)
result = model.transcribe(audio_path)
# Or specify language for better accuracy
result = model.transcribe(audio_path, language="en")
```
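When the language is auto-detected, the result dictionary returned by `model.transcribe` includes it alongside the full text, which is useful for logging or for choosing downstream post-processing:

```python
# result is the dict returned by model.transcribe(...)
print(f"Detected language: {result['language']}")
print(f"Transcript preview: {result['text'][:80]}")
```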
## Best Practices
1. **Use larger models for better accuracy**: `small` minimum, `large-v3` for best results
2. **Align timestamps carefully**: Match Whisper segments with diarization turns
3. **Handle overlaps**: Multiple Whisper segments may overlap with one diarization turn
4. **Handle gaps**: Some diarization turns may have no corresponding transcription
5. **Post-process text**: Clean up punctuation and capitalization if needed (a cleanup sketch follows this list)
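A minimal cleanup sketch for point 5, under assumed conventions (collapse whitespace from joined segments, capitalize the first letter, end each caption with terminal punctuation); adjust to your subtitle style:

```python
import re

def clean_text(text):
    # Collapse whitespace left over from joining overlapping Whisper segments
    text = re.sub(r'\s+', ' ', text).strip()
    if not text or text == '[INAUDIBLE]':
        return text or '[INAUDIBLE]'
    # Capitalize the first character and ensure terminal punctuation
    text = text[0].upper() + text[1:]
    if text[-1] not in '.!?':
        text += '.'
    return text

transcripts = {i: clean_text(t) for i, t in transcripts.items()}
```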
## Common Issues
1. **Low transcription accuracy**: Use larger model (small → medium → large-v3)
2. **Slow processing**: Use faster-whisper or smaller model
3. **Misaligned timestamps**: Check time alignment between diarization and transcription
4. **Missing transcriptions**: Check for very short segments or silence (see the diagnostic sketch after this list)
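A quick diagnostic sketch for issues 3 and 4, assuming the `turns` and `transcripts` structures from the alignment step: it lists turns that ended up without text so you can separate genuinely short or silent turns from alignment problems.

```python
# Report diarization turns that received no transcription
for i, turn in enumerate(turns):
    if transcripts.get(i, '[INAUDIBLE]') == '[INAUDIBLE]':
        end = turn['start'] + turn['duration']
        reason = "very short turn" if turn['duration'] < 0.3 else "no overlapping Whisper segment"
        print(f"Turn {i} ({turn['speaker']}, {turn['start']:.2f}s-{end:.2f}s): {reason}")
```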
## Integration with Subtitle Generation
After transcription, combine with speaker labels for subtitles:
```python
def generate_subtitles_ass(turns, transcripts, output_path):
    with open(output_path, 'w', encoding='utf-8') as f:
        # ... write ASS header (script info, styles, events format) ...
        for i, turn in enumerate(turns):
            start_time = format_time(turn['start'])
            end_time = format_time(turn['start'] + turn['duration'])
            speaker = turn['speaker']
            text = transcripts.get(i, "[INAUDIBLE]")
            # Format: SPEAKER_XX: text
            f.write(f"Dialogue: 0,{start_time},{end_time},Default,,0,0,0,,{speaker}: {text}\n")
```
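The `format_time` helper is not defined in the snippet above; one plausible implementation for the ASS timestamp format (`H:MM:SS.cc`, centisecond precision) could look like this:

```python
def format_time(seconds):
    # ASS timestamps use H:MM:SS.cc (centisecond precision)
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = seconds % 60
    return f"{hours}:{minutes:02d}:{secs:05.2f}"
```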
## Performance Tips
1. **For accuracy**: Use `large-v3` model
2. **For speed**: Use `faster-whisper` with `small` model
3. **For memory**: Use `faster-whisper` with `int8` quantization
4. **Batch processing**: Process multiple segments together where possible (a per-turn transcription sketch follows this list)
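As an alternative to transcribing the whole file and mapping by overlap, you can load the audio once and transcribe each diarization turn's slice directly. This is a minimal sketch assuming `openai-whisper`, the `turns` structure from above, and CPU inference (`fp16=False`):

```python
import whisper
from whisper.audio import SAMPLE_RATE  # 16 kHz

model = whisper.load_model("small")
audio = whisper.load_audio(audio_path)  # mono float32 array at 16 kHz

per_turn_text = {}
for i, turn in enumerate(turns):
    start = int(turn['start'] * SAMPLE_RATE)
    end = int((turn['start'] + turn['duration']) * SAMPLE_RATE)
    clip = audio[start:end]
    if len(clip) < SAMPLE_RATE // 10:  # skip clips shorter than ~0.1 s
        per_turn_text[i] = '[INAUDIBLE]'
        continue
    result = model.transcribe(clip, fp16=False)
    per_turn_text[i] = result['text'].strip() or '[INAUDIBLE]'
```

Per-turn clips lose surrounding context, so whole-file transcription plus overlap mapping usually yields better punctuation; slicing is mainly useful when turns are long or memory is tight.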