
whisper-test skill


This skill transcribes WAV audio with OpenAI Whisper and reports intelligibility, optionally comparing results to expected text for quality assessment.

npx playbooks add skill trevors/dot-claude --skill whisper-test


SKILL.md
---
name: whisper-test
description: |
  Transcribe WAV audio files using OpenAI Whisper for intelligibility testing.
  Triggers on: "transcribe audio", "whisper test", "test audio output",
  "is the audio intelligible", "check speech quality", "run whisper",
  "speech to text test", "check if audio sounds right"
---

# Whisper Audio Intelligibility Test

Transcribe WAV audio files using OpenAI Whisper and report whether the speech
is intelligible. Optionally compare against expected text.

## Setup

Whisper is installed as a uv tool: `uv tool install openai-whisper`.

Since this machine may lack `ffmpeg`, always use the Python API approach, which
loads WAV files with scipy and bypasses the ffmpeg requirement.
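
The exact implementation of `transcribe.py` isn't reproduced here, but the core idea is to hand Whisper a numpy array instead of a file path. A minimal sketch of that approach, assuming 16-bit PCM WAV input (the downmixing and resampling details are illustrative):

```python
# Sketch of scipy-based loading; hands Whisper a numpy array so ffmpeg is never invoked.
# Assumes 16-bit PCM WAV; the actual transcribe.py may handle more cases.
import numpy as np
import whisper
from scipy.io import wavfile

def transcribe_wav(path: str, model_name: str = "large-v3", language: str = "en") -> str:
    rate, data = wavfile.read(path)            # load samples directly, no ffmpeg
    if data.ndim > 1:
        data = data.mean(axis=1)               # downmix stereo to mono
    audio = data.astype(np.float32) / 32768.0  # normalize int16 PCM to [-1, 1]
    if rate != 16000:
        # Whisper expects 16 kHz input; simple linear-interpolation resample.
        n = int(len(audio) * 16000 / rate)
        audio = np.interp(np.linspace(0, len(audio), n, endpoint=False),
                          np.arange(len(audio)), audio).astype(np.float32)
    model = whisper.load_model(model_name)
    result = model.transcribe(audio, language=language)
    return result["text"].strip()
```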

## Running Transcription

Use `uv run --no-project --with openai-whisper --with scipy --python 3.11` to
execute the transcription script:

```bash
uv run --no-project --with openai-whisper --with scipy --python 3.11 \
  python3 ~/.claude/skills/whisper-test/transcribe.py \
  [--model tiny|base|small|medium|large-v3] \
  [--language en] \
  [--expected "expected text"] \
  [--json] \
  file1.wav [file2.wav ...]
```

### Arguments

- `--model`: Whisper model size (default: `large-v3`). See model selection guide below.
- `--language`: Language hint (default: `en`).
- `--expected`: Expected transcription text. When provided, calculates Word Error Rate (WER); see the sketch after this list.
- `--json`: Output results as JSON instead of human-readable text.
- Positional: One or more WAV file paths.
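
The script's own WER code isn't shown here; the standard word-level Levenshtein formulation, which is what WER normally means, looks roughly like this:

```python
# Illustrative word-level WER (Levenshtein distance over tokens, as a fraction);
# the script's actual normalization and units may differ.
def word_error_rate(expected: str, hypothesis: str) -> float:
    ref = expected.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```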

### Model Selection

Use `large-v3` for TTS quality verification. Smaller models hallucinate or miss
words in synthesized speech, making them unreliable for judging output quality.

| Model      | VRAM   | When to use                                                     |
| ---------- | ------ | --------------------------------------------------------------- |
| `large-v3` | ~10 GB | **Default.** TTS evaluation, quality gating, regression testing |
| `medium`   | ~5 GB  | GPU memory constrained, still decent accuracy                   |
| `small`    | ~2 GB  | Quick smoke tests only                                          |
| `base`     | ~1 GB  | Not recommended for TTS — high hallucination rate               |
| `tiny`     | ~1 GB  | Not recommended for TTS — unreliable                            |

Observed with identical Qwen3-TTS 1.7B voice-cloned output:

- `large-v3`: "That's one tank. Flash attention pipeline." (key phrase captured)
- `base`: "That's one thing, flash attention pipeline." (close but hallucinated)

For poor-quality 0.6B output, `base` hallucinated "Charging Wheel" while
`large-v3` gave "Flat, splashes." — honest about the poor quality instead of
confabulating plausible words.

### Output Format

For each file, prints:

```text
filename.wav:
  transcription: "Hello world, this is a test."
  duration: 2.96s
  rms: 0.0866
  peak: 0.6832
  silence: 49.2%
  [wer: 0.0%]  (if --expected provided)
```

## Interpreting Results

| Transcription         | Meaning                                                                         |
| --------------------- | ------------------------------------------------------------------------------- |
| Matches expected text | Audio is intelligible and correct                                               |
| Partial match         | Audio has some speech but quality issues                                        |
| Empty string `""`     | Audio is unintelligible (noise, silence, or garbage)                            |
| Hallucinated text     | Model heard something in noise (common with Whisper, especially smaller models) |

### Audio Quality Indicators

- **RMS < 0.01**: Essentially silent
- **silence > 80%**: Mostly silence, likely no speech
- **peak < 0.05**: Very quiet, may not contain useful audio
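
These indicators can be computed directly from the WAV samples. A minimal sketch, where the 0.01 silence threshold is an assumption about how the script defines silence:

```python
# Sketch of the audio metrics reported alongside the transcription.
# Assumes 16-bit PCM WAV; the silence threshold is illustrative.
import numpy as np
from scipy.io import wavfile

def audio_metrics(path: str, silence_threshold: float = 0.01) -> dict:
    rate, data = wavfile.read(path)
    if data.ndim > 1:
        data = data.mean(axis=1)
    audio = data.astype(np.float32) / 32768.0
    abs_audio = np.abs(audio)
    return {
        "duration_s": len(audio) / rate,
        "rms": float(np.sqrt(np.mean(audio ** 2))),
        "peak": float(abs_audio.max()),
        "silence_pct": float(100.0 * np.mean(abs_audio < silence_threshold)),
    }
```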

### TTS-Specific Patterns

Voice-cloned TTS output often has these characteristics:

- **Garbled opening, clear ending**: Common with ICL voice cloning on short references. The model needs a few frames to "lock in" to the target voice.
- **Key phrases preserved**: Even when WER is high, domain-specific terms (e.g. "flash attention pipeline") often come through clearly.
- **Smaller models produce worse audio**: 0.6B models produce significantly less intelligible output than 1.7B — expect Whisper to reflect this.

## Batch Testing (TTS Variant Comparison)

When testing multiple TTS outputs against expected text:

```bash
uv run --no-project --with openai-whisper --with scipy --python 3.11 \
  python3 ~/.claude/skills/whisper-test/transcribe.py \
  --expected "Hello world, this is a test." \
  variant1.wav variant2.wav variant3.wav
```

This produces a comparison table showing which variants produce intelligible speech.
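
For automation, `--json` is the hook. The sketch below assumes the JSON output mirrors the human-readable fields above (file name, rms, silence, and wer as a percentage); that schema is an assumption, so check the script's actual output before wiring anything like this into CI:

```python
# Hypothetical gate around --json output. Field names and units are assumptions
# based on the human-readable format; verify against transcribe.py before use.
import json
import os
import subprocess

script = os.path.expanduser("~/.claude/skills/whisper-test/transcribe.py")
cmd = [
    "uv", "run", "--no-project",
    "--with", "openai-whisper", "--with", "scipy", "--python", "3.11",
    "python3", script,
    "--expected", "Hello world, this is a test.",
    "--json",
    "variant1.wav", "variant2.wav", "variant3.wav",
]
raw = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
for entry in json.loads(raw):
    if entry["rms"] < 0.01 or entry["silence"] > 80:
        verdict = "silent / no speech"
    elif entry["wer"] <= 10:
        verdict = "intelligible and close to expected"
    else:
        verdict = "speech present but degraded"
    print(f"{entry['file']}: {verdict}")
```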

## Docker / NGC Container Usage

When testing on a GPU box inside an NGC container (e.g. for CUDA flash-attn builds),
ffmpeg isn't available and apt can be slow. Two workarounds:

1. **Static ffmpeg binary** (fast, no apt):

   ```bash
   curl -sL https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-arm64-static.tar.xz \
     | tar xJ --strip-components=1 -C /usr/local/bin/ --wildcards "*/ffmpeg" "*/ffprobe"
   pip install openai-whisper
   ```

2. **Use scipy loader** (this script's default — no ffmpeg needed):

   ```bash
   pip install openai-whisper scipy
   python3 ~/.claude/skills/whisper-test/transcribe.py --model large-v3 output.wav
   ```

The script loads WAV files directly via scipy, bypassing Whisper's ffmpeg
dependency entirely. This works for WAV files (the standard TTS output format).

Overview

This skill transcribes WAV audio files using OpenAI Whisper to assess speech intelligibility and basic audio quality. It reports transcription text, audio metrics (duration, RMS, peak, silence), and can compute word error rate (WER) when expected text is provided. It is designed to run without ffmpeg by loading WAV data via scipy and supports model selection for different accuracy and resource trade-offs.

How this skill works

The skill loads one or more WAV files with scipy, extracts audio statistics (duration, RMS, peak, percent silence), and runs Whisper to produce a transcription. When an expected transcript is supplied, it computes WER to quantify correctness. Available model sizes are tiny, base, small, medium, and large-v3; large-v3 is recommended for reliable TTS verification. Output can be printed in human-readable form or emitted as JSON for automation.

When to use it

  • Verify whether synthesized TTS outputs are intelligible and match expected text.
  • Run quick smoke tests on voice-cloned or generated audio to spot major regressions.
  • Batch-compare multiple TTS variants against a target transcript.
  • Gate quality in CI for TTS systems where accuracy matters (use large-v3).
  • Detect silent, extremely quiet, or noise-only files before downstream processing.

Best practices

  • Use large-v3 for TTS evaluation; smaller models hallucinate or miss words.
  • Provide an expected transcript to get WER and a clearer pass/fail signal.
  • Prefer WAV inputs; the script loads WAV directly to avoid ffmpeg dependency.
  • Interpret audio metrics alongside transcription (e.g., low RMS or high silence implies unintelligible audio).
  • Run batch tests to compare variants and flag regressions automatically.

Example use cases

  • Quality gate for TTS model updates: run large-v3 and block deploys if WER exceeds a threshold.
  • Compare multiple voice-cloned variants to pick the clearest output for production.
  • Sanity-check recorded audio from an ingestion pipeline to reject silent or noisy files.
  • Automate nightly regression tests for synthesized speech to detect degradations early.

FAQ

What model should I pick for reliable results?

Use large-v3 for TTS verification; medium can be used if GPU memory is constrained, but tiny/base are unreliable for quality judgments.

Do I need ffmpeg installed?

No. The script loads WAV files via scipy to bypass ffmpeg. ffmpeg is only needed if you feed non-WAV formats.