This skill transcribes audio using the gpt-4o-mini-transcribe model, with vocabulary hints and text replacements to correct misrecognized names and jargon.
To add this skill to your agents, run `npx playbooks add skill openclaw/skills --skill voice-transcribe`.
---
name: voice-transcribe
description: Transcribe audio files using OpenAI's gpt-4o-mini-transcribe model with vocabulary hints and text replacements. Requires uv (https://docs.astral.sh/uv/).
---
# voice-transcribe
transcribe audio files using openai's gpt-4o-mini-transcribe model.
## when to use
when receiving voice memos (especially via whatsapp), just run:
```bash
uv run /Users/darin/clawd/skills/voice-transcribe/transcribe <audio-file>
```
then respond based on the transcribed content.
## fixing transcription errors
if darin says a word was transcribed wrong, add it to `vocab.txt` (a hint to the model) or `replacements.txt` (a guaranteed fix applied to the text). see the custom vocabulary and text replacements sections below.
## supported formats
- mp3, mp4, mpeg, mpga, m4a, wav, webm, ogg, opus
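a quick pre-check can avoid uploading a file the model will reject. the sketch below is not part of the skill; `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names, and it only looks at the file extension, not the actual container or codec.

```python
from pathlib import Path

# extensions copied from the supported formats list; the names here are
# hypothetical and not taken from the transcribe script itself
SUPPORTED_EXTENSIONS = {
    "mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm", "ogg", "opus",
}


def is_supported(path: str) -> bool:
    """Return True if the file extension (case-insensitive) is in the list."""
    extension = Path(path).suffix.lower().lstrip(".")
    return extension in SUPPORTED_EXTENSIONS
```

an extension check is cheap but shallow: a renamed file will pass it and still fail at the api.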
## examples
```bash
# transcribe a voice memo (assumes `transcribe` is on your PATH; otherwise use the full `uv run` path shown above)
transcribe /tmp/voice-memo.ogg
# pipe to other tools
transcribe /tmp/memo.ogg | pbcopy
```
## setup
1. add your openai api key to `/Users/darin/clawd/skills/voice-transcribe/.env`:
```
OPENAI_API_KEY=sk-...
```
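for reference, a `.env` file like this is just `KEY=value` lines. how the script actually loads it is not documented here (it may well use a library), so the following is only a minimal sketch with a hypothetical `load_env_file` helper.

```python
from pathlib import Path


def load_env_file(path: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments.

    Hypothetical helper: the real script's loading logic is not shown
    in this document and may differ.
    """
    values: dict[str, str] = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # drop optional surrounding quotes around the value
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values
```

keep the `.env` file out of version control, since it holds the api key in plain text.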
## custom vocabulary
add words to `vocab.txt` (one per line) to help the model recognize names/jargon:
```
Clawdis
Clawdbot
```
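how the words reach the model is not spelled out here. one plausible route is the transcription api's optional `prompt` parameter; whether the script does exactly this is an assumption. the sketch below only covers reading the file and building a hint string, and `load_vocab` and `build_hint` are hypothetical names.

```python
from pathlib import Path


def load_vocab(path: str) -> list[str]:
    """Read one word per line, ignoring blank lines; a missing file means no hints."""
    vocab_file = Path(path)
    if not vocab_file.exists():
        return []
    lines = vocab_file.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def build_hint(words: list[str]) -> str:
    """Join the words into one string that could be sent as a prompt (assumed usage)."""
    return ", ".join(words)
```

hints only bias the model, so a word listed here can still come back misspelled.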
## text replacements
if the model still gets something wrong, add a replacement to `replacements.txt`:
```
wrong spelling -> correct spelling
```
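to make the `wrong -> correct` format concrete, here is a rough sketch of parsing such lines and applying them to a transcript. the function names are hypothetical, and whole-word, case-insensitive matching is an assumption; the script's real matching rules are not documented here.

```python
import re


def parse_replacements(text: str) -> list[tuple[str, str]]:
    """Turn 'wrong -> correct' lines into (wrong, correct) pairs."""
    pairs = []
    for line in text.splitlines():
        if "->" not in line:
            continue  # skip blank or malformed lines
        wrong, _, correct = line.partition("->")
        wrong, correct = wrong.strip(), correct.strip()
        if wrong:
            pairs.append((wrong, correct))
    return pairs


def apply_replacements(transcript: str, pairs: list[tuple[str, str]]) -> str:
    """Apply each pair in order (assumed: whole words, ignoring case)."""
    for wrong, correct in pairs:
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        # a function replacement keeps backslashes in `correct` literal
        transcript = pattern.sub(lambda match, fixed=correct: fixed, transcript)
    return transcript
```

for example, with the line `claude bot -> Clawdbot`, the transcript `ask Claude bot to reply` would become `ask Clawdbot to reply`.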
## notes
- assumes english (no language detection)
- uses gpt-4o-mini-transcribe model specifically
- caches by sha256 of audio file
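the caching note can be sketched as follows. this is an illustration under assumptions, not the script's code: the cache layout (one `.txt` file per hash), `cached_transcribe`, and the injected `transcribe_fn` callable are all hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Callable


def audio_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the file contents in chunks so large audio is not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as audio:
        for chunk in iter(lambda: audio.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_transcribe(
    audio_path: str,
    cache_dir: str,
    transcribe_fn: Callable[[str], str],
) -> str:
    """Return a cached transcript if one exists, else call transcribe_fn and store it."""
    cache_file = Path(cache_dir) / f"{audio_sha256(audio_path)}.txt"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    transcript = transcribe_fn(audio_path)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(transcript, encoding="utf-8")
    return transcript
```

keying on the content hash rather than the file name means a renamed copy of the same memo still hits the cache, while an edited file with the same name does not.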
This skill transcribes audio files using OpenAI's gpt-4o-mini-transcribe model, with support for vocabulary hints and text replacements. It is designed for quick command-line transcription of local audio files and caches results by the audio file's SHA256 to avoid repeated processing. The tool assumes English audio and supports the common audio formats used for voice memos.
You run the command on a local audio file, and the skill uploads it to the model, which returns a text transcript. It reads the optional `vocab.txt` to bias recognition toward specific names or jargon and uses `replacements.txt` to apply deterministic corrections. Because transcriptions are cached by the audio file's SHA256, repeated runs on the same file are fast.
Which audio formats are supported?
mp3, mp4, mpeg, mpga, m4a, wav, webm, ogg, and opus are supported.
How do I force a specific spelling for a mis-transcribed word?
Add an entry to `replacements.txt` in the format `wrong -> correct` to apply a deterministic correction.