
This skill enables high-accuracy audio transcription using OpenAI Whisper, with both API and local options, delivering fast, searchable transcripts.

npx playbooks add skill willsigmon/sigstack --skill whisper-expert

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (2.2 KB)
---
name: Whisper Expert
description: OpenAI Whisper - speech-to-text, transcription, local and API options
allowed-tools: Read, Edit, Bash, WebFetch
model: sonnet
---

# OpenAI Whisper Expert

High-accuracy speech-to-text transcription.

## Pricing (2026)

### API
- **$0.006/minute** flat rate
- No volume discounts
- 1-2 min minimum billing
- No speaker diarization included

### Self-Hosted
- **Free** (local processing)
- Requires GPU for speed
- Full control over data

## API Usage

```python
from openai import OpenAI

client = OpenAI()

# Transcribe audio
with open("audio.mp3", "rb") as file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=file,
        response_format="text"  # or "json", "srt", "vtt"
    )

print(transcript)
```

### With Timestamps
```python
with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word", "segment"]
    )

# In current openai SDK versions, segments are objects with attributes
for segment in transcript.segments:
    print(f"[{segment.start:.2f}s] {segment.text}")
```

## Local Installation

### faster-whisper (Recommended)
```bash
pip install faster-whisper
```

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda")
segments, info = model.transcribe("audio.mp3")

for segment in segments:
    print(f"[{segment.start:.2f}s] {segment.text}")
```

### whisper.cpp (C++ speed)
```bash
# Clone and build
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp && make

# Download model
./models/download-ggml-model.sh large-v3

# Transcribe (whisper.cpp expects 16-bit, 16 kHz WAV input)
./main -m models/ggml-large-v3.bin -f audio.wav
```

## Model Sizes

| Model | VRAM | Speed | Accuracy |
|-------|------|-------|----------|
| tiny | 1GB | 32x | Basic |
| base | 1GB | 16x | Good |
| small | 2GB | 6x | Better |
| medium | 5GB | 2x | Great |
| large-v3 | 10GB | 1x | Best |
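The table above can be turned into a small selection helper. This is an illustrative sketch: `pick_model` is a hypothetical name, and the VRAM figures are copied from the table, not measured.

```python
# VRAM requirements (GB) copied from the model table above.
MODEL_VRAM = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large-v3": 10}

# Accuracy ordering, best first, also from the table.
BY_ACCURACY = ["large-v3", "medium", "small", "base", "tiny"]

def pick_model(vram_gb: float) -> str:
    """Return the most accurate Whisper model that fits in vram_gb."""
    for name in BY_ACCURACY:
        if MODEL_VRAM[name] <= vram_gb:
            return name
    return "tiny"  # fallback: tiny also runs (slowly) on CPU

print(pick_model(6))  # a 6 GB GPU fits "medium"
```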

## Use Cases
- Podcast transcription
- Meeting notes
- Voice commands
- Accessibility
- Content indexing

## Limitations
- No real-time streaming (API)
- No speaker diarization (need separate service)
- No HIPAA BAA available

Use when: Transcription, voice-to-text, podcast processing, accessibility

Overview

This skill provides high-accuracy speech-to-text transcription using OpenAI Whisper with both API and self-hosted options. It supports multiple output formats (text, JSON, SRT, VTT) and timestamped segments for precise alignment. Pricing and deployment trade-offs are clear: a low per-minute API rate versus free local processing that requires GPU resources. The skill targets podcasting, meetings, accessibility, and general voice-to-text workflows.

How this skill works

The skill sends audio files to the Whisper API or runs one of several local implementations (faster-whisper, whisper.cpp) to produce transcripts. You can request simple text or verbose JSON with word- and segment-level timestamps for downstream processing. Local models run entirely on your hardware to keep data private and avoid API costs, while the API offers simple integration and consistent performance. Different model sizes balance VRAM, speed, and accuracy to match device constraints.

When to use it

  • Batch transcription of recorded audio like podcasts and interviews
  • When you need high-accuracy transcripts with optional timestamps
  • If you must keep audio data on-premises or avoid API costs (self-hosted)
  • For building searchable content indexes and accessibility captions
  • When you need a quick API integration for small-volume workloads
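For the batch case, a small stdlib helper can gather the files to feed into the API loop shown in the API Usage section. A minimal sketch; the extension list is an assumption, not an exhaustive set of formats Whisper accepts.

```python
from pathlib import Path

# Common audio extensions (illustrative; Whisper accepts several more).
AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".flac"}

def audio_files(folder: str) -> list[Path]:
    """Collect audio files for batch transcription, sorted for stable order."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.suffix.lower() in AUDIO_EXTS)

# Each returned path can then be opened and passed to
# client.audio.transcriptions.create(...) as in the API Usage section.
```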

Best practices

  • Choose model size based on available VRAM: large-v3 for best accuracy, smaller models for constrained GPUs
  • Use verbose JSON with timestamp granularities for precise segmenting or subtitles
  • Preprocess audio: normalize volume and remove noise for better results
  • For speaker labeling, run a separate diarization step since Whisper does not include it
  • Batch files to minimize API overhead and watch minimum billing thresholds

Example use cases

  • Transcribe weekly podcasts into articles and chaptered SRT files
  • Generate meeting notes with timestamps for action item extraction
  • Convert training videos to searchable captions and VTT subtitles
  • Implement voice command logging for analytics in product research
  • Local processing of sensitive interviews to maintain full data control
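The SRT use case above needs only a timestamp formatter over the segments Whisper returns. A hedged sketch, assuming segments arrive as (start, end, text) tuples; the function names and tuple shape are illustrative.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """Render (start, end, text) tuples as SRT subtitle blocks."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)

print(to_srt([(0.0, 2.5, "Hello there."), (2.5, 5.0, "Welcome to the show.")]))
```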

FAQ

Does the API support real-time streaming?

No. The API does not provide real-time streaming; use local solutions and lower-latency tooling for near-real-time needs.

Is speaker diarization included?

No. Speaker diarization is not provided; run a dedicated diarization pipeline after transcription if you need speaker labels.
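A common way to combine the two pipelines is to assign each transcript segment the speaker whose diarization turn contains the segment's midpoint. A hedged sketch with illustrative tuple shapes; neither Whisper nor any particular diarization tool emits exactly these structures.

```python
def label_segments(transcript_segments, speaker_turns):
    """Attach a speaker label to each transcript segment by midpoint overlap.

    transcript_segments: list of (start, end, text) from transcription.
    speaker_turns: list of (start, end, speaker) from a diarization tool.
    """
    labeled = []
    for start, end, text in transcript_segments:
        mid = (start + end) / 2
        # First diarization turn whose interval contains the midpoint.
        speaker = next(
            (who for s, e, who in speaker_turns if s <= mid < e),
            "unknown",
        )
        labeled.append((speaker, text))
    return labeled
```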

What are the cost differences between API and self-hosted?

API charges a flat per-minute rate with minimum billing; self-hosted is free to run but requires suitable GPU hardware and maintenance.
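Back-of-the-envelope arithmetic at the $0.006/minute rate from the pricing section (ignoring minimum-billing rounding):

```python
API_RATE_PER_MIN = 0.006  # flat API rate from the pricing section

def api_cost(minutes: float) -> float:
    """Estimated API cost in dollars for a given audio length."""
    return round(minutes * API_RATE_PER_MIN, 4)

print(api_cost(60))    # a one-hour podcast costs about $0.36
print(api_cost(1000))  # ~16.7 hours of audio costs about $6
```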