
transcribe skill


This skill transcribes audio to text with optional diarization and known-speaker hints, returning structured outputs for interviews, meetings, and recordings.

This is most likely a fork of the transcribe skill from OpenAI.

npx playbooks add skill derklinke/codex-config --skill transcribe

Review the files below or copy the command above to add this skill to your agents.

Files (7)
SKILL.md
---
name: "transcribe"
description: "Transcribe audio files to text with optional diarization and known-speaker hints. Use when a user asks to transcribe speech from audio/video, extract text from recordings, or label speakers in interviews or meetings."
---


# Audio Transcribe

Transcribe audio using OpenAI, with optional speaker diarization when requested. Prefer the bundled CLI for deterministic, repeatable runs.

## Workflow
1. Collect inputs: audio file path(s), desired response format (text/json/diarized_json), optional language hint, and any known speaker references.
2. Verify `OPENAI_API_KEY` is set. If missing, ask the user to set it locally (do not ask them to paste the key).
3. Run the bundled `transcribe_diarize.py` CLI with sensible defaults (fast text transcription); a sketch of this flow follows the list.
4. Validate the output: transcription quality, speaker labels, and segment boundaries; iterate with a single targeted change if needed.
5. Save outputs under `output/transcribe/` when working in this repo.
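
A minimal sketch of steps 2, 3, and 5 as a wrapper script, assuming `$TRANSCRIBE_CLI` points at the bundled script (see the Skill path section below); the paths are illustrative:

```bash
# Step 2: fail fast if the key is missing.
if [ -z "${OPENAI_API_KEY:-}" ]; then
  echo "OPENAI_API_KEY is not set; export it in your shell first." >&2
  exit 1
fi

# Steps 3 and 5: fast text transcription saved under output/transcribe/.
mkdir -p output/transcribe/demo
python3 "$TRANSCRIBE_CLI" path/to/audio.wav \
  --response-format text \
  --out output/transcribe/demo/audio.txt
```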

## Decision rules
- Default to `gpt-4o-mini-transcribe` with `--response-format text` for fast transcription.
- If the user wants speaker labels or diarization, use `--model gpt-4o-transcribe-diarize --response-format diarized_json` (a selection sketch follows this list).
- If audio is longer than ~30 seconds, keep `--chunking-strategy auto`.
- Prompting is not supported for `gpt-4o-transcribe-diarize`.
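
The same rules as a small shell sketch; the `want_diarization` variable is hypothetical glue for the example, not a CLI option, and the output directory is illustrative:

```bash
# Pick model and response format based on whether speaker labels are needed.
if [ "${want_diarization:-no}" = "yes" ]; then
  model="gpt-4o-transcribe-diarize"
  format="diarized_json"
else
  model="gpt-4o-mini-transcribe"
  format="text"
fi
python3 "$TRANSCRIBE_CLI" input.wav \
  --model "$model" \
  --response-format "$format" \
  --chunking-strategy auto \
  --out-dir output/transcribe/run1
```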

## Output conventions
- Use `output/transcribe/<job-id>/` for evaluation runs.
- Use `--out-dir` for multiple files to avoid overwriting (see the job-directory sketch below).
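
For example, a date-based job id keeps evaluation runs separate; the naming scheme here is just one option:

```bash
# Create a fresh job directory and write this run's output into it.
job_id="$(date +%Y%m%d-%H%M%S)"
mkdir -p "output/transcribe/$job_id"
python3 "$TRANSCRIBE_CLI" meeting.m4a \
  --response-format text \
  --out "output/transcribe/$job_id/meeting.txt"
```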

## Dependencies (install if missing)
Prefer `uv` for dependency management.

```bash
uv pip install openai
```
If `uv` is unavailable:
```bash
python3 -m pip install openai
```
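
A combined check that falls back to pip when `uv` is not on PATH:

```bash
# Install the OpenAI SDK with uv if available, otherwise with pip.
if command -v uv >/dev/null 2>&1; then
  uv pip install openai
else
  python3 -m pip install openai
fi
```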

## Environment
- `OPENAI_API_KEY` must be set for live API calls (a quick guard is sketched below).
- If the key is missing, instruct the user to create one in the OpenAI platform UI and export it in their shell.
- Never ask the user to paste the full key in chat.
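
A one-line guard you can put at the top of any wrapper script:

```bash
# Aborts with a hint if the key is unset or empty.
: "${OPENAI_API_KEY:?Set OPENAI_API_KEY in your shell before running}"
```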

## Skill path (set once)

```bash
export CODEX_HOME="${CODEX_HOME:-$HOME/.codex}"
export TRANSCRIBE_CLI="$CODEX_HOME/skills/transcribe/scripts/transcribe_diarize.py"
```

User-scoped skills install under `$CODEX_HOME/skills` (default: `~/.codex/skills`).
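
To confirm the path resolves before running anything:

```bash
# Sanity-check that the skill is installed where expected.
[ -f "$TRANSCRIBE_CLI" ] || echo "transcribe skill not found at $TRANSCRIBE_CLI" >&2
```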

## CLI quick start
Single file (fast text default):
```bash
python3 "$TRANSCRIBE_CLI" \
  path/to/audio.wav \
  --out transcript.txt
```

Diarization with known speakers (up to 4):
```bash
python3 "$TRANSCRIBE_CLI" \
  meeting.m4a \
  --model gpt-4o-transcribe-diarize \
  --known-speaker "Alice=refs/alice.wav" \
  --known-speaker "Bob=refs/bob.wav" \
  --response-format diarized_json \
  --out-dir output/transcribe/meeting
```
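
To skim speaker turns from the diarized output, something like the following works, assuming the JSON contains a `segments` array with `speaker` and `text` fields; check `references/api.md` for the actual schema, and note the output filename here is illustrative:

```bash
# Print "Speaker: text" per segment (field names are an assumption).
jq -r '.segments[] | "\(.speaker): \(.text)"' \
  output/transcribe/meeting/meeting.json
```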

Plain text output (explicit):
```bash
python3 "$TRANSCRIBE_CLI" \
  interview.mp3 \
  --response-format text \
  --out interview.txt
```
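
Batch over multiple files (a sketch; the glob and per-file directories are illustrative):

```bash
# One output directory per input file, so nothing gets overwritten.
for f in audio/*.mp3; do
  name="$(basename "$f" .mp3)"
  python3 "$TRANSCRIBE_CLI" "$f" \
    --response-format text \
    --out-dir "output/transcribe/$name"
done
```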

## Reference map
- `references/api.md`: supported formats, limits, response formats, and known-speaker notes.

Overview

This skill transcribes audio and video to text with optional speaker diarization and support for known-speaker hints. It favors a fast default transcription flow and switches to a diarization model when speaker labeling is requested. Outputs are saved under a predictable output path for repeatable evaluation runs.

How this skill works

Collect audio file paths, desired response format (text, json, or diarized_json), optional language hint, and any known-speaker references. Verify OPENAI_API_KEY is set, then run the bundled CLI with model and response-format chosen based on whether diarization is needed. Validate and iterate on the transcript and speaker segments if quality or labels need adjustment.

When to use it

  • Transcribe interviews, meetings, podcasts, or recorded talks into plain text.
  • Generate speaker-labeled transcripts for multi-person conversations, interviews, or research recordings.
  • Extract text from audio/video for search, captions, or note-taking workflows.
  • Batch-process multiple files while keeping outputs organized to avoid overwrites.

Best practices

  • Default to gpt-4o-mini-transcribe with --response-format text for fast, cost-effective transcription.
  • Use gpt-4o-transcribe-diarize with --response-format diarized_json when you need speaker labels or diarization.
  • Provide up to four known-speaker audio references to improve labeling accuracy for interviews or recurring participants.
  • If audio is longer than ~30 seconds, keep --chunking-strategy auto to handle long files reliably.
  • Set OPENAI_API_KEY in your environment; do not paste keys into chat.
  • Save outputs under output/transcribe/<job-id>/ for reproducible runs.

Example use cases

  • Transcribe a single interview to a plain text file for editing and publishing.
  • Produce a diarized JSON transcript of a meeting with known-speaker hints for minutes and speaker attribution.
  • Batch transcription of a podcast season with organized out-directories to preserve each episode’s results.
  • Convert lecture recordings into searchable text for study materials and indexing.

FAQ

What model should I use for fast transcription?

Use gpt-4o-mini-transcribe with --response-format text for fast, cost-effective transcripts.

How do I get accurate speaker labels?

Provide up to four known-speaker audio references and run with gpt-4o-transcribe-diarize and --response-format diarized_json; validate segments and iterate if needed.

What if OPENAI_API_KEY is not set?

Set OPENAI_API_KEY in your shell (create one in the OpenAI platform UI if you don’t have it). Do not paste the key into chat.