
transcribe skill

/skills/.curated/transcribe

This skill transcribes audio files to text with optional diarization and known-speaker hints, providing structured outputs for meetings, interviews, and other recordings.

npx playbooks add skill openai/skills --skill transcribe

Review the files below or copy the command above to add this skill to your agents.

Files (7)
SKILL.md
---
name: "transcribe"
description: "Transcribe audio files to text with optional diarization and known-speaker hints. Use when a user asks to transcribe speech from audio/video, extract text from recordings, or label speakers in interviews or meetings."
---


# Audio Transcribe

Transcribe audio using OpenAI, with optional speaker diarization when requested. Prefer the bundled CLI for deterministic, repeatable runs.

## Workflow
1. Collect inputs: audio file path(s), desired response format (text/json/diarized_json), optional language hint, and any known speaker references.
2. Verify `OPENAI_API_KEY` is set. If missing, ask the user to set it locally (do not ask them to paste the key).
3. Run the bundled `transcribe_diarize.py` CLI with sensible defaults (fast text transcription).
4. Validate the output: transcription quality, speaker labels, and segment boundaries; iterate with a single targeted change if needed.
5. Save outputs under `output/transcribe/` when working in this repo.
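The steps above can be sketched as a small wrapper function. This is a hypothetical helper, not part of the bundled skill; it assumes `TRANSCRIBE_CLI` points at the bundled script (see "Skill path" below).

```shell
# Sketch of the workflow above (a hypothetical wrapper, not the skill's own code).
run_transcribe() {
  local audio="$1" job="$2"
  # Step 2: verify the API key is present without ever printing it.
  if [ -z "${OPENAI_API_KEY:-}" ]; then
    echo "OPENAI_API_KEY is not set; export it in your shell first." >&2
    return 1
  fi
  # Step 5: keep outputs under output/transcribe/<job-id>/.
  mkdir -p "output/transcribe/$job"
  # Step 3: fast text transcription by default.
  python3 "$TRANSCRIBE_CLI" "$audio" \
    --response-format text \
    --out "output/transcribe/$job/transcript.txt"
}
```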

## Decision rules
- Default to `gpt-4o-mini-transcribe` with `--response-format text` for fast transcription.
- If the user wants speaker labels or diarization, use `--model gpt-4o-transcribe-diarize --response-format diarized_json`.
- If audio is longer than ~30 seconds, keep `--chunking-strategy auto`.
- Prompting is not supported for `gpt-4o-transcribe-diarize`.
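These rules can be encoded in a wrapper script. The helper below is a hypothetical illustration, not a flag of the bundled CLI:

```shell
# Hypothetical helper encoding the decision rules above: diarization requests
# get the diarization model and diarized_json; everything else stays fast.
pick_model_args() {
  if [ "$1" = "diarize" ]; then
    echo "--model gpt-4o-transcribe-diarize --response-format diarized_json"
  else
    echo "--model gpt-4o-mini-transcribe --response-format text"
  fi
}
```

Use it with word splitting intact, e.g. `python3 "$TRANSCRIBE_CLI" audio.wav $(pick_model_args diarize)`.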

## Output conventions
- Use `output/transcribe/<job-id>/` for evaluation runs.
- Use `--out-dir` for multiple files to avoid overwriting.
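When the user does not supply a job id, a timestamp is one workable choice (an assumption; the skill does not mandate a naming scheme):

```shell
# Create a fresh job directory; the timestamp-based id is an assumption,
# not a convention mandated by the skill.
job_id="$(date +%Y%m%d-%H%M%S)"
out_dir="output/transcribe/$job_id"
mkdir -p "$out_dir"
echo "$out_dir"
```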

## Dependencies (install if missing)
Prefer `uv` for dependency management.

```bash
uv pip install openai
```
If `uv` is unavailable:
```bash
python3 -m pip install openai
```
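The two install paths can be combined into one guard. The helper below is a sketch: it prints the command to run rather than running it, so you can review before executing (e.g. via `eval "$(pick_installer)"`).

```shell
# Prefer uv when it is on PATH, otherwise fall back to pip.
pick_installer() {
  if command -v uv >/dev/null 2>&1; then
    echo "uv pip install openai"
  else
    echo "python3 -m pip install openai"
  fi
}
```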

## Environment
- `OPENAI_API_KEY` must be set for live API calls.
- If the key is missing, instruct the user to create one in the OpenAI platform UI and export it in their shell.
- Never ask the user to paste the full key in chat.
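A quick way to confirm the key is present without leaking it is to report only its length (a hypothetical check, not part of the bundled CLI):

```shell
# Report key presence and length only; never print the key itself.
check_openai_key() {
  if [ -n "${OPENAI_API_KEY:-}" ]; then
    echo "OPENAI_API_KEY is set (${#OPENAI_API_KEY} characters)"
  else
    echo "OPENAI_API_KEY is missing; export it in your shell" >&2
    return 1
  fi
}
```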

## Skill path (set once)

```bash
export CODEX_HOME="${CODEX_HOME:-$HOME/.codex}"
export TRANSCRIBE_CLI="$CODEX_HOME/skills/transcribe/scripts/transcribe_diarize.py"
```

User-scoped skills install under `$CODEX_HOME/skills` (default: `~/.codex/skills`).
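A sanity check before the first run catches a wrong `CODEX_HOME` early (a sketch):

```shell
# Verify the bundled script exists where TRANSCRIBE_CLI points.
check_cli() {
  [ -f "${TRANSCRIBE_CLI:-}" ] || {
    echo "transcribe CLI not found at: ${TRANSCRIBE_CLI:-unset}" >&2
    return 1
  }
}
```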

## CLI quick start
Single file (fast text default):
```bash
python3 "$TRANSCRIBE_CLI" \
  path/to/audio.wav \
  --out transcript.txt
```

Diarization with known speakers (up to 4):
```bash
python3 "$TRANSCRIBE_CLI" \
  meeting.m4a \
  --model gpt-4o-transcribe-diarize \
  --known-speaker "Alice=refs/alice.wav" \
  --known-speaker "Bob=refs/bob.wav" \
  --response-format diarized_json \
  --out-dir output/transcribe/meeting
```

Plain text output (explicit):
```bash
python3 "$TRANSCRIBE_CLI" \
  interview.mp3 \
  --response-format text \
  --out interview.txt
```
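For batches, a loop with `--out-dir` keeps per-file outputs from colliding. This is a sketch, assuming `.mp3` inputs and `TRANSCRIBE_CLI` set as above:

```shell
# Transcribe every .mp3 in a directory into one job folder.
batch_transcribe() {
  local dir="$1" job="$2" f
  for f in "$dir"/*.mp3; do
    [ -e "$f" ] || continue  # glob matched nothing; skip quietly
    python3 "$TRANSCRIBE_CLI" "$f" \
      --response-format text \
      --out-dir "output/transcribe/$job"
  done
}
```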

## Reference map
- `references/api.md`: supported formats, limits, response formats, and known-speaker notes.

Overview

This skill transcribes audio and video files to text, with optional speaker diarization and support for known-speaker hints. It provides a CLI for repeatable runs, sensible defaults for fast transcription, and a diarization mode for labeling speakers in interviews or meetings. Outputs can be plain text, JSON, or diarized JSON and are written to an organized output directory.

How this skill works

You point the CLI at one or more audio files and choose a response format and optional language or known-speaker references. The tool defaults to gpt-4o-mini-transcribe for fast text output and switches to gpt-4o-transcribe-diarize when diarization or speaker labels are requested. It verifies that OPENAI_API_KEY is set, runs the transcription with chunking for long audio, validates segment boundaries and labels, then writes results to output/transcribe/<job-id> or a specified out-dir. For diarization with known speakers, you can provide up to four speaker reference files.

When to use it

  • Transcribe single audio or video files to readable text quickly.
  • Generate diarized transcripts with speaker labels for meetings or interviews.
  • Add known-speaker hints to reliably map audio segments to specific people.
  • Batch-process multiple recordings and save outputs to separate job folders.
  • Extract searchable text from recordings for notes, indexing, or subtitles.

Best practices

  • Ensure OPENAI_API_KEY is set in your shell before running the CLI; do not paste the key into chat.
  • Use gpt-4o-mini-transcribe with --response-format text for fast, low-cost transcripts.
  • Enable --model gpt-4o-transcribe-diarize and --response-format diarized_json for speaker labeling.
  • Provide known-speaker reference files (up to 4) for more accurate speaker mapping.
  • Use --chunking-strategy auto for audio longer than ~30 seconds to avoid timeouts and improve stability.

Example use cases

  • Transcribe a short interview to create a plain-text summary and timestamps.
  • Process a multi-hour meeting and produce a diarized JSON with speaker segments for minutes and action items.
  • Label participants in a panel discussion by supplying reference clips for each speaker.
  • Batch convert a folder of recordings into separate transcript files saved under output/transcribe/.
  • Extract subtitles or searchable text from podcast episodes for publishing or SEO.

FAQ

What if I don’t have an API key set?

Export an OpenAI API key in your shell (create one in the OpenAI platform UI if needed). The CLI will refuse to run if OPENAI_API_KEY is missing; do not paste the full key into chat.

Which model should I pick for diarization?

Use gpt-4o-transcribe-diarize with --response-format diarized_json for speaker labeling. Prompting is not supported with that diarization model.
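For the authoritative diarized_json schema, see references/api.md. Assuming each segment carries speaker and text fields (an assumption for illustration), a quick readable view with jq might look like:

```shell
# Toy diarized_json sample; the `speaker`/`text` field names are an
# assumption here — check references/api.md for the real schema.
cat > sample.diarized.json <<'EOF'
{"segments":[{"speaker":"Alice","text":"Hello."},{"speaker":"Bob","text":"Hi."}]}
EOF
# Print one "speaker: text" line per segment.
jq -r '.segments[] | "\(.speaker): \(.text)"' sample.diarized.json
```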