This skill transcribes audio to text using OpenAI Whisper, with support for multiple languages, selectable model sizes, and optional timestamps.

npx playbooks add skill trpc-group/trpc-agent-go --skill whisper

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md (1.9 KB)
---
name: whisper
description: Transcribe audio files to text using OpenAI Whisper
---

# Whisper Audio Transcription Skill

Transcribe audio files to text using OpenAI Whisper.

## Capabilities

- Transcribe audio files (MP3, WAV, M4A, FLAC, OGG, etc.) to text
- Support for 90+ languages with auto-detection
- Optional timestamp generation
- Multiple model sizes (tiny/base/small/medium/large)
- Output in plain text or JSON format

## Usage

### Basic Transcription

```bash
python3 scripts/transcribe.py <audio_file> <output_file>
```

### With Options

```bash
# Specify model size (default: base)
python3 scripts/transcribe.py audio.mp3 transcript.txt --model medium

# Specify language (improves accuracy)
python3 scripts/transcribe.py audio.mp3 transcript.txt --language zh

# Include timestamps
python3 scripts/transcribe.py audio.mp3 transcript.txt --timestamps

# JSON output with metadata
python3 scripts/transcribe.py audio.mp3 output.json --format json
```

## Parameters

- `audio_file` (required): Path to input audio file
- `output_file` (required): Path to output text/JSON file
- `--model`: Whisper model size (tiny/base/small/medium/large, default: base)
- `--language`: Language code (e.g., en, zh, es, fr; use auto for detection)
- `--timestamps`: Include word-level timestamps in output
- `--format`: Output format (text/json, default: text)
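
These flags map fairly directly onto the openai-whisper Python API. A minimal sketch of the core call a script like `scripts/transcribe.py` could make (illustrative only; the bundled script may differ, and this `transcribe` wrapper is hypothetical):

```python
from typing import Optional

import whisper


def transcribe(audio_file: str, model_size: str = "base",
               language: Optional[str] = None, timestamps: bool = False) -> dict:
    """Run Whisper once and return the library's raw result dict."""
    model = whisper.load_model(model_size)  # tiny / base / small / medium / large
    return model.transcribe(
        audio_file,
        language=language,           # None lets Whisper auto-detect the language
        word_timestamps=timestamps,  # adds per-word timing inside each segment
    )


result = transcribe("audio.mp3", model_size="medium", language="zh")
print(result["text"])
```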

## Model Sizes

| Model  | Parameters | Speed | Accuracy | Memory |
|--------|------------|-------|----------|--------|
| tiny   | 39M        | ~32x  | Good     | ~1GB   |
| base   | 74M        | ~16x  | Better   | ~1GB   |
| small  | 244M       | ~6x   | Great    | ~2GB   |
| medium | 769M       | ~2x   | Excellent| ~5GB   |
| large  | 1.5B       | 1x    | Best     | ~10GB  |

## Supported Audio Formats

MP3, WAV, M4A, FLAC, OGG, AAC, WMA, and more (via FFmpeg)

## Dependencies

- Python 3.8+
- openai-whisper
- ffmpeg

## Installation

```bash
pip install openai-whisper
sudo apt-get install ffmpeg  # Ubuntu/Debian
```
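
To verify the setup, a quick check along these lines (not part of the skill) confirms both dependencies are reachable:

```python
import shutil

import whisper

# FFmpeg must be on PATH; Whisper shells out to it to decode audio.
assert shutil.which("ffmpeg"), "ffmpeg not found on PATH"

# Lists model names such as tiny, base, small, medium, large.
print(whisper.available_models())
```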

Overview

This skill transcribes audio files to text using OpenAI Whisper models. It supports common audio formats, automatic language detection for 90+ languages, optional timestamps, multiple model sizes, and text or JSON output. Model sizes from tiny to large let you trade transcription speed and memory use against accuracy in practical workflows.

How this skill works

You provide an audio file and a target output path; the skill runs the selected Whisper model to produce a transcript. Options let you choose model size (tiny to large), force or auto-detect language, include word- or segment-level timestamps, and export plain text or structured JSON with metadata. FFmpeg is used to decode and normalize audio, so many file types are accepted.
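
For `--format json`, the result dict returned by openai-whisper already carries the needed structure (`text`, `segments`, and `language` are the library's standard fields); a sketch of how the export step might look:

```python
import json

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")  # dict with "text", "segments", "language"

with open("output.json", "w", encoding="utf-8") as f:
    json.dump(
        {
            "language": result["language"],  # detected or forced language code
            "text": result["text"],          # full transcript
            "segments": [                    # per-segment timing metadata
                {"start": s["start"], "end": s["end"], "text": s["text"]}
                for s in result["segments"]
            ],
        },
        f,
        ensure_ascii=False,
        indent=2,
    )
```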

When to use it

  • Transcribing interviews, meetings, podcasts, or lectures into text.
  • Generating captions or subtitles with optional timestamps for video content.
  • Quickly transcribing short voice notes or long multi-hour recordings depending on model choice.
  • Processing multilingual audio where automatic language detection is helpful.
  • Converting audio archives into searchable text for indexing or compliance.

Best practices

  • Choose model size based on the trade-off: tiny/base for speed and low memory, medium/large for the best accuracy on noisy or complex audio.
  • Preprocess noisy recordings with noise reduction and a consistent sample rate via FFmpeg for better accuracy; see the sketch after this list.
  • Specify language when known to improve transcription quality and reduce mis-detections.
  • Use timestamps when you need alignment for subtitles or highlight extraction; omit them for faster, smaller outputs.
  • Export JSON when downstream tooling needs structured segments or timing metadata; note that Whisper itself does not produce speaker labels.
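
One way to do the FFmpeg preprocessing mentioned above is to resample to 16 kHz mono WAV before transcription; a minimal sketch (file names are illustrative):

```python
import subprocess


def normalize(src: str, dst: str = "normalized.wav") -> str:
    """Convert any FFmpeg-readable input to 16 kHz mono WAV."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-ar", "16000",  # Whisper models expect 16 kHz input
         "-ac", "1",      # mono
         dst],
        check=True,
    )
    return dst
```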

Example use cases

  • Produce meeting minutes from a recorded team call with medium model and timestamps for quick reference.
  • Create subtitles for a multilingual webinar using automatic language detection and JSON export for editing.
  • Batch-transcribe a podcast series using the small model for cost/speed balance and plain text output for publishing; see the sketch after this list.
  • Index archived customer support calls by transcribing audio into searchable text for QA and training.
  • Transcribe voice memos on mobile devices with tiny or base model for near-real-time feedback.
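
For the batch scenario above, loading the model once and looping over files is usually enough; a sketch assuming the Python API (the `podcast/` directory is illustrative):

```python
from pathlib import Path

import whisper

model = whisper.load_model("small")  # load once, reuse across episodes

for audio in sorted(Path("podcast/").glob("*.mp3")):
    result = model.transcribe(str(audio))
    audio.with_suffix(".txt").write_text(result["text"], encoding="utf-8")
```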

FAQ

What audio formats are supported?

Most formats are supported (MP3, WAV, M4A, FLAC, OGG, AAC, WMA) because FFmpeg is used to decode and normalize audio.

Which model should I pick for accuracy vs. speed?

Use tiny/base for fastest, low-memory tasks; small for a balance; medium or large when you need higher accuracy on noisy or complex speech.