
transcribe skill


This skill transcribes audio to text with optional diarization and known-speaker hints, returning structured outputs for interviews, meetings, and recordings.

This is most likely a fork of the transcribe skill from OpenAI.

npx playbooks add skill derklinke/codex-config --skill transcribe

Review the files below or copy the command above to add this skill to your agents.

Files (7)
SKILL.md
---
name: "transcribe"
description: "Transcribe audio files to text with optional diarization and known-speaker hints. Use when a user asks to transcribe speech from audio/video, extract text from recordings, or label speakers in interviews or meetings."
---


# Audio Transcribe

Transcribe audio using OpenAI, with optional speaker diarization when requested. Prefer the bundled CLI for deterministic, repeatable runs.

## Workflow
1. Collect inputs: audio file path(s), desired response format (text/json/diarized_json), optional language hint, and any known speaker references.
2. Verify `OPENAI_API_KEY` is set. If missing, ask the user to set it locally (do not ask them to paste the key).
3. Run the bundled `transcribe_diarize.py` CLI with sensible defaults (fast text transcription); a sketch of this flow follows the list.
4. Validate the output: transcription quality, speaker labels, and segment boundaries; iterate with a single targeted change if needed.
5. Save outputs under `output/transcribe/` when working in this repo.
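
A minimal sketch of steps 2, 3, and 5 as a wrapper script, assuming `$TRANSCRIBE_CLI` points at the bundled script (see the Skill path section below); the paths are illustrative:

```bash
# Step 2: fail fast if the key is missing.
if [ -z "${OPENAI_API_KEY:-}" ]; then
  echo "OPENAI_API_KEY is not set; export it in your shell first." >&2
  exit 1
fi

# Steps 3 and 5: fast text transcription saved under output/transcribe/.
mkdir -p output/transcribe/demo
python3 "$TRANSCRIBE_CLI" path/to/audio.wav \
  --response-format text \
  --out output/transcribe/demo/audio.txt
```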

## Decision rules
- Default to `gpt-4o-mini-transcribe` with `--response-format text` for fast transcription.
- If the user wants speaker labels or diarization, use `--model gpt-4o-transcribe-diarize --response-format diarized_json` (a selection sketch follows this list).
- If audio is longer than ~30 seconds, keep `--chunking-strategy auto`.
- Prompting is not supported for `gpt-4o-transcribe-diarize`.
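
The same rules as a small shell sketch; the `want_diarization` variable is hypothetical glue for the example, not a CLI option, and the output directory is illustrative:

```bash
# Pick model and response format based on whether speaker labels are needed.
if [ "${want_diarization:-no}" = "yes" ]; then
  model="gpt-4o-transcribe-diarize"
  format="diarized_json"
else
  model="gpt-4o-mini-transcribe"
  format="text"
fi
python3 "$TRANSCRIBE_CLI" input.wav \
  --model "$model" \
  --response-format "$format" \
  --chunking-strategy auto \
  --out-dir output/transcribe/run1
```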

## Output conventions
- Use `output/transcribe/<job-id>/` for evaluation runs.
- Use `--out-dir` for multiple files to avoid overwriting (see the job-directory sketch below).
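
For example, a date-based job id keeps evaluation runs separate; the naming scheme here is just one option:

```bash
# Create a fresh job directory and write this run's output into it.
job_id="$(date +%Y%m%d-%H%M%S)"
mkdir -p "output/transcribe/$job_id"
python3 "$TRANSCRIBE_CLI" meeting.m4a \
  --response-format text \
  --out "output/transcribe/$job_id/meeting.txt"
```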

## Dependencies (install if missing)
Prefer `uv` for dependency management.

```bash
uv pip install openai
```
If `uv` is unavailable:
```bash
python3 -m pip install openai
```
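
A combined check that falls back to pip when `uv` is not on PATH:

```bash
# Install the OpenAI SDK with uv if available, otherwise with pip.
if command -v uv >/dev/null 2>&1; then
  uv pip install openai
else
  python3 -m pip install openai
fi
```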

## Environment
- `OPENAI_API_KEY` must be set for live API calls (a quick guard is sketched below).
- If the key is missing, instruct the user to create one in the OpenAI platform UI and export it in their shell.
- Never ask the user to paste the full key in chat.
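
A one-line guard you can put at the top of any wrapper script:

```bash
# Aborts with a hint if the key is unset or empty.
: "${OPENAI_API_KEY:?Set OPENAI_API_KEY in your shell before running}"
```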

## Skill path (set once)

```bash
export CODEX_HOME="${CODEX_HOME:-$HOME/.codex}"
export TRANSCRIBE_CLI="$CODEX_HOME/skills/transcribe/scripts/transcribe_diarize.py"
```

User-scoped skills install under `$CODEX_HOME/skills` (default: `~/.codex/skills`).
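
To confirm the path resolves before running anything:

```bash
# Sanity-check that the skill is installed where expected.
[ -f "$TRANSCRIBE_CLI" ] || echo "transcribe skill not found at $TRANSCRIBE_CLI" >&2
```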

## CLI quick start
Single file (fast text default):
```bash
python3 "$TRANSCRIBE_CLI" \
  path/to/audio.wav \
  --out transcript.txt
```

Diarization with known speakers (up to 4):
```bash
python3 "$TRANSCRIBE_CLI" \
  meeting.m4a \
  --model gpt-4o-transcribe-diarize \
  --known-speaker "Alice=refs/alice.wav" \
  --known-speaker "Bob=refs/bob.wav" \
  --response-format diarized_json \
  --out-dir output/transcribe/meeting
```
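
To skim speaker turns from the diarized output, something like the following works, assuming the JSON contains a `segments` array with `speaker` and `text` fields; check `references/api.md` for the actual schema, and note the output filename here is illustrative:

```bash
# Print "Speaker: text" per segment (field names are an assumption).
jq -r '.segments[] | "\(.speaker): \(.text)"' \
  output/transcribe/meeting/meeting.json
```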

Plain text output (explicit):
```bash
python3 "$TRANSCRIBE_CLI" \
  interview.mp3 \
  --response-format text \
  --out interview.txt
```
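
Batch over multiple files (a sketch; the glob and per-file directories are illustrative):

```bash
# One output directory per input file, so nothing gets overwritten.
for f in audio/*.mp3; do
  name="$(basename "$f" .mp3)"
  python3 "$TRANSCRIBE_CLI" "$f" \
    --response-format text \
    --out-dir "output/transcribe/$name"
done
```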

## Reference map
- `references/api.md`: supported formats, limits, response formats, and known-speaker notes.

Overview

This skill transcribes audio and video to text with optional speaker diarization and support for known-speaker hints. It favors a fast default transcription flow and switches to a diarization model when speaker labeling is requested. Outputs are saved under a predictable output path for repeatable evaluation runs.

How this skill works

Collect audio file paths, desired response format (text, json, or diarized_json), optional language hint, and any known-speaker references. Verify OPENAI_API_KEY is set, then run the bundled CLI with model and response-format chosen based on whether diarization is needed. Validate and iterate on the transcript and speaker segments if quality or labels need adjustment.

When to use it

  • Transcribe interviews, meetings, podcasts, or recorded talks into plain text.
  • Generate speaker-labeled transcripts for multi-person conversations, interviews, or research recordings.
  • Extract text from audio/video for search, captions, or note-taking workflows.
  • Batch-process multiple files while keeping outputs organized to avoid overwrites.

Best practices

  • Default to gpt-4o-mini-transcribe with --response-format text for fast, cost-effective transcription.
  • Use gpt-4o-transcribe-diarize with --response-format diarized_json when you need speaker labels or diarization.
  • Provide up to four known-speaker audio references to improve labeling accuracy for interviews or recurring participants.
  • If audio is longer than ~30 seconds, keep --chunking-strategy auto to handle long files reliably.
  • Set OPENAI_API_KEY in your environment; do not paste keys into chat.
  • Save outputs under output/transcribe/<job-id>/ for reproducible runs.

Example use cases

  • Transcribe a single interview to a plain text file for editing and publishing.
  • Produce a diarized JSON transcript of a meeting with known-speaker hints for minutes and speaker attribution.
  • Batch transcription of a podcast season with organized out-directories to preserve each episode’s results.
  • Convert lecture recordings into searchable text for study materials and indexing.

FAQ

What model should I use for fast transcription?

Use gpt-4o-mini-transcribe with --response-format text for fast, cost-effective transcripts.

How do I get accurate speaker labels?

Provide up to four known-speaker audio references and run with gpt-4o-transcribe-diarize and --response-format diarized_json; validate segments and iterate if needed.

What if OPENAI_API_KEY is not set?

Set OPENAI_API_KEY in your shell (create one in the OpenAI platform UI if you don’t have it). Do not paste the key into chat.