
audio-transcribe skill


npx playbooks add skill agntswrm/agent-media --skill audio-transcribe


---
name: audio-transcribe
description: Transcribes audio to text with timestamps and optional speaker identification. Use when you need to convert speech to text, create subtitles, transcribe meetings, or process voice recordings.
---

# Audio Transcribe

Transcribes audio files to text with timestamps. Supports automatic language detection and speaker identification (diarization), and outputs structured JSON with segment-level timing.

## Command

```bash
agent-media audio transcribe --in <path> [options]
```

## Inputs

| Option | Required | Description |
|--------|----------|-------------|
| `--in` | Yes | Input audio file path or URL (supports mp3, wav, m4a, ogg) |
| `--diarize` | No | Enable speaker identification |
| `--language` | No | Language code (auto-detected if not provided) |
| `--speakers` | No | Number of speakers hint for diarization |
| `--out` | No | Output path, filename or directory (default: ./) |
| `--provider` | No | Provider to use (local, fal, replicate) |

## Output

Returns a JSON object with transcription data:

```json
{
  "ok": true,
  "media_type": "audio",
  "action": "transcribe",
  "provider": "fal",
  "output_path": "transcription_123_abc.json",
  "transcription": {
    "text": "Full transcription text...",
    "language": "en",
    "segments": [
      { "start": 0.0, "end": 2.5, "text": "Hello.", "speaker": "SPEAKER_0" },
      { "start": 2.5, "end": 5.0, "text": "Hi there.", "speaker": "SPEAKER_1" }
    ]
  }
}
```
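
Because each segment carries `start`, `end`, and `text`, the output maps naturally onto subtitle formats. The following is a minimal sketch of turning the saved JSON into SRT cues. It assumes `jq` is installed and that the file at `output_path` has the shape shown above; `srt_time` and `segments_to_srt` are hypothetical helper names, not part of agent-media.

```bash
# Hypothetical helpers; not part of agent-media. Assumes jq is installed.

# Format seconds (e.g. 2.5) as an SRT timestamp (e.g. 00:00:02,500).
srt_time() {
  awk -v t="$1" 'BEGIN {
    ms = int(t * 1000 + 0.5)          # round to whole milliseconds
    h = int(ms / 3600000); ms -= h * 3600000
    m = int(ms / 60000);   ms -= m * 60000
    s = int(ms / 1000);    ms -= s * 1000
    printf "%02d:%02d:%02d,%03d\n", h, m, s, ms
  }'
}

# Print the segments of a transcription JSON file as numbered SRT cues.
segments_to_srt() {
  tab="$(printf '\t')"
  n=0
  # One tab-separated line per segment: start, end, text.
  jq -r '.transcription.segments[] | [.start, .end, .text] | @tsv' "$1" |
    while IFS="$tab" read -r start end text; do
      n=$((n + 1))
      printf '%d\n%s --> %s\n%s\n\n' "$n" "$(srt_time "$start")" "$(srt_time "$end")" "$text"
    done
}
```

For example, `segments_to_srt transcription_123_abc.json > interview.srt` would write one cue per segment. The sketch ignores the `speaker` field; prefixing it to the cue text is a straightforward extension if diarization was enabled.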

## Examples

Basic transcription (auto-detect language):
```bash
agent-media audio transcribe --in interview.mp3
```

Transcription with speaker identification:
```bash
agent-media audio transcribe --in meeting.wav --diarize
```

Transcription with a specific language and speaker count:
```bash
agent-media audio transcribe --in podcast.mp3 --diarize --language en --speakers 3
```

Use a specific provider:
```bash
agent-media audio transcribe --in audio.wav --provider replicate
```

## Extracting Audio from Video

To transcribe a video file, first extract the audio:

```bash
# Step 1: Extract audio from video
agent-media audio extract --in video.mp4 --format mp3

# Step 2: Transcribe the extracted audio (replace extracted_xxx.mp3 with the file name produced by step 1)
agent-media audio transcribe --in extracted_xxx.mp3
```
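
When scripting the two steps, the extracted file name has to be passed from the first command to the second. The sketch below shows one way to do that, under the assumption (not confirmed by this document) that `audio extract` prints a JSON object with an `output_path` field to stdout, as `audio transcribe` does. `json_output_path` is a hypothetical helper.

```bash
# Hypothetical helper; not part of agent-media.
# Reads a JSON object on stdin and prints the value of its "output_path" field.
json_output_path() {
  sed -n 's/.*"output_path"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage (assumption: `audio extract` prints JSON with "output_path" to stdout):
#   audio_file="$(agent-media audio extract --in video.mp4 --format mp3 | json_output_path)"
#   agent-media audio transcribe --in "$audio_file"
```

The `sed` match is deliberately simple and only handles a plain string value; a JSON parser such as `jq` would be a more robust choice where it is available.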

## Providers

### local

Runs locally on CPU using [Transformers.js](https://huggingface.co/docs/transformers.js), no API key required.

- Uses the Moonshine model (5x faster than Whisper)
- Models are downloaded on first use (~100 MB)
- Does NOT support diarization; use fal or replicate for speaker identification
- You may see a `mutex lock failed` error. It can be ignored: the output is correct if it contains `"ok": true`

```bash
agent-media audio transcribe --in audio.mp3 --provider local
```
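
Since the `mutex lock failed` message can appear even on a successful run, a script is better off checking for `"ok": true` in the output than reacting to the error text. A minimal sketch, assuming the JSON result is printed to stdout; `transcription_ok` is a hypothetical helper:

```bash
# Hypothetical helper; not part of agent-media.
# Succeeds (exit 0) when the text on stdin contains "ok": true.
# A plain-text match like this is crude; a JSON parser would be more robust,
# but this version tolerates stray non-JSON lines mixed into the output.
transcription_ok() {
  grep -Eq '"ok"[[:space:]]*:[[:space:]]*true'
}

# Usage (assumption: the JSON result is written to stdout):
#   result="$(agent-media audio transcribe --in audio.mp3 --provider local)"
#   printf '%s\n' "$result" | transcription_ok && echo "transcription succeeded"
```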

### fal

- Requires `FAL_API_KEY`
- Uses the `wizper` model for fast transcription (2x faster) when diarization is disabled
- Uses the `whisper` model when diarization is enabled (native diarization support)

### replicate

- Requires `REPLICATE_API_TOKEN`
- Uses `whisper-diarization` model with Whisper Large V3 Turbo
- Native diarization support with word-level timestamps
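
The provider notes above imply a simple selection rule: diarization needs fal or replicate and the matching credential, while plain transcription can run locally without a key. The sketch below encodes that rule. `pick_provider` is a hypothetical helper, and preferring fal over replicate when both keys are set is an arbitrary choice made for illustration.

```bash
# Hypothetical helper; not part of agent-media.
# Usage: pick_provider <yes|no>   (whether diarization is needed)
pick_provider() {
  need_diarize="$1"
  if [ "$need_diarize" = "yes" ]; then
    # local does not support diarization, so a hosted provider is required.
    if [ -n "${FAL_API_KEY:-}" ]; then
      echo fal
    elif [ -n "${REPLICATE_API_TOKEN:-}" ]; then
      echo replicate
    else
      echo "diarization needs FAL_API_KEY or REPLICATE_API_TOKEN" >&2
      return 1
    fi
  else
    # No diarization: local needs no API key.
    echo local
  fi
}
```

It could then be combined with the documented flags, for example `agent-media audio transcribe --in meeting.wav --diarize --provider "$(pick_provider yes)"`.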