
voice-transcribe skill


This skill transcribes audio with OpenAI's gpt-4o-mini-transcribe model, using vocabulary hints and text replacements to get names and jargon right.

npx playbooks add skill openclaw/skills --skill voice-transcribe


---
name: voice-transcribe
description: Transcribe audio files using OpenAI's gpt-4o-mini-transcribe model with vocabulary hints and text replacements. Requires uv (https://docs.astral.sh/uv/).
---

# voice-transcribe

transcribe audio files using openai's gpt-4o-mini-transcribe model.

## when to use

when receiving voice memos (especially via whatsapp), just run:
```bash
uv run /Users/darin/clawd/skills/voice-transcribe/transcribe <audio-file>
```
then respond based on the transcribed content.

## fixing transcription errors

if darin says a word was transcribed wrong, add it to `vocab.txt` (a hint to the model) or `replacements.txt` (a guaranteed fix). see the custom vocabulary and text replacements sections below.

## supported formats

- mp3, mp4, mpeg, mpga, m4a, wav, webm, ogg, opus
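
a quick pre-check can catch unsupported files before calling the script. this is a minimal sketch: `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names, and whether the real script checks extensions at all is an assumption.

```python
from pathlib import Path

# Extensions taken from the supported formats list above.
SUPPORTED_EXTENSIONS = {
    ".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm", ".ogg", ".opus",
}


def is_supported(path: str) -> bool:
    """Return True if the file extension (case-insensitive) is in the list."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

this only looks at the file name, so a mislabeled file would still pass the check.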

## examples

```bash
# transcribe a voice memo
transcribe /tmp/voice-memo.ogg

# pipe to other tools
transcribe /tmp/memo.ogg | pbcopy
```

## setup

1. add your openai api key to `/Users/darin/clawd/skills/voice-transcribe/.env`:
   ```
   OPENAI_API_KEY=sk-...
   ```

## custom vocabulary

add words to `vocab.txt` (one per line) to help the model recognize names/jargon:
```
Clawdis
Clawdbot
```
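
one plausible way the hints could reach the model is as a single prompt string built from the file. this is a sketch under that assumption, not the script's actual code; `load_vocab` and `build_prompt` are hypothetical helpers.

```python
from pathlib import Path


def load_vocab(path: Path) -> list[str]:
    """Return the non-empty, stripped lines of a vocab file ([] if it is missing)."""
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def build_prompt(words: list[str]) -> str:
    """Join vocabulary words into one comma-separated hint string."""
    return ", ".join(words)
```

openai's transcription endpoint accepts an optional `prompt` parameter, so a string like this could be passed there. a hint only biases recognition; it does not guarantee a spelling.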

## text replacements

if the model still gets something wrong, add a replacement to `replacements.txt`:
```
wrong spelling -> correct spelling
```
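
the `wrong -> correct` format can be parsed and applied in a few lines. this is a sketch, assuming case-sensitive substring matching applied in file order; the real script may match differently (for example on word boundaries). `parse_replacements` and `apply_replacements` are hypothetical names.

```python
def parse_replacements(text: str) -> list[tuple[str, str]]:
    """Parse 'wrong -> correct' lines; skip blank lines and lines without '->'."""
    pairs = []
    for line in text.splitlines():
        wrong, sep, correct = line.partition("->")
        if sep and wrong.strip():
            pairs.append((wrong.strip(), correct.strip()))
    return pairs


def apply_replacements(transcript: str, pairs: list[tuple[str, str]]) -> str:
    """Apply each replacement in order using plain substring matching."""
    for wrong, correct in pairs:
        transcript = transcript.replace(wrong, correct)
    return transcript
```

because the substitution runs on the finished transcript, it is deterministic in a way a vocabulary hint is not.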

## notes

- assumes english (no language detection)
- uses gpt-4o-mini-transcribe model specifically
- caches by sha256 of audio file

Overview

This skill transcribes audio files using OpenAI's gpt-4o-mini-transcribe model, with support for vocabulary hints and text replacements. It is designed for quick command-line transcription of local files and caches results by the SHA-256 hash of the audio, so an identical file is not processed twice. The tool assumes English audio and supports the common formats used for voice memos.

How this skill works

You run the command on a local audio file, and the skill uploads it to the model, which returns a text transcript. If vocab.txt is present, the skill reads it to bias recognition toward specific names or jargon, and it uses replacements.txt to apply deterministic corrections. The script caches each transcription under the audio file's SHA-256 hash, so repeated runs on the same file are fast.
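
The caching step can be sketched as follows. This is an illustration only: the cache directory layout, the `.txt` file naming, and the `cached_transcribe` helper are assumptions, not the script's actual implementation. The `transcribe` argument stands in for whatever function performs the API call.

```python
import hashlib
from pathlib import Path
from typing import Callable


def sha256_of_file(path: Path) -> str:
    """Hash the file in 1 MiB chunks so large recordings are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_transcribe(
    audio: Path, cache_dir: Path, transcribe: Callable[[Path], str]
) -> str:
    """Return the stored transcript for this audio content, or compute and store it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = cache_dir / f"{sha256_of_file(audio)}.txt"
    if entry.exists():
        return entry.read_text(encoding="utf-8")
    text = transcribe(audio)
    entry.write_text(text, encoding="utf-8")
    return text
```

Because the key is derived from the file's bytes and not its name, a renamed or moved copy of the same recording still hits the cache, while any re-encoded version counts as a new file.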

When to use it

- Transcribing voice memos received via messaging apps (WhatsApp, Telegram, etc.).
- Quickly converting interviews or short recordings into editable text.
- Piping transcript output to other tools or the clipboard for fast sharing.
- Getting model-driven transcription with custom vocabulary hints.
- Processing local English audio files (no language detection is performed).

Best practices

- Add uncommon names and jargon to vocab.txt (one entry per line) to improve recognition.
- Use replacements.txt for guaranteed fixes when a word is repeatedly mis-transcribed.
- Keep audio segments short and clear for better accuracy; avoid heavy background noise.
- Store your OpenAI API key in the skill's .env file, and keep that file private and out of version control.
- Rely on the SHA-256 cache to avoid re-transcribing identical files.

Example use cases

- Transcribe a WhatsApp voice memo and paste the text into a reply.
- Convert interview clips into editable notes for publishing or research.
- Batch-process meeting recordings and apply company-specific terminology via vocab.txt.
- Pipe the transcript into other scripts for keyword extraction or summarization.
- Quickly fix recurring transcription errors by adding replacements and re-running.

FAQ

Which audio formats are supported?

mp3, mp4, mpeg, mpga, m4a, wav, webm, ogg, and opus are supported.

How do I force a specific spelling for a mis-transcribed word?

Add an entry to replacements.txt in the format 'wrong -> correct' to apply a deterministic correction.