
transcription-expert skill

/plugins/media/skills/transcription-expert

This skill helps you choose and compare transcription services like Whisper, Deepgram, and AssemblyAI for your use case.

npx playbooks add skill willsigmon/sigstack --skill transcription-expert


---
name: Transcription Expert
description: Audio/video transcription - Whisper, Deepgram, AssemblyAI comparison and usage
allowed-tools: Read, Edit, Bash, WebFetch
model: sonnet
---

# Transcription Expert

Choose the right transcription service for your use case.

## Pricing Comparison (2026)

| Service | Price/min | Speed | Diarization | Real-time |
|---------|-----------|-------|-------------|-----------|
| Whisper API | $0.006 | Slow | No (external tool) | No |
| Deepgram | $0.0043 | Fast (~20 s per audio hour) | Yes (built-in) | Yes |
| AssemblyAI | $0.0025 | Fast | Yes (+$0.02/hr) | Yes |
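The per-minute prices above turn into monthly spend with simple arithmetic. A minimal sketch using the base rates from the table (add-on fees such as AssemblyAI speaker labels excluded):

```python
# Base transcription price per minute of audio, from the table above
PRICE_PER_MIN = {
    "whisper": 0.006,
    "deepgram": 0.0043,
    "assemblyai": 0.0025,
}

def monthly_cost(service: str, audio_hours: float) -> float:
    # audio_hours of audio per month -> dollars, base transcription only
    return round(PRICE_PER_MIN[service] * audio_hours * 60, 2)

for svc in PRICE_PER_MIN:
    print(svc, monthly_cost(svc, 100))  # e.g. 100 audio-hours/month
```

At 100 audio-hours per month the gap is roughly $36 (Whisper) vs. $15 (AssemblyAI), before any diarization or feature add-ons.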

## When to Use Each

### Whisper
- One-time batch processing
- Self-hosting option (free)
- Privacy-sensitive (local)
- Best: Podcasts, offline processing

### Deepgram
- Real-time applications
- Live captioning
- Speaker identification built-in
- Best: Meetings, call centers, voice apps

### AssemblyAI
- Cheapest per-minute
- AI features (sentiment, topics)
- PII redaction
- Best: Content analysis, compliance

## Quick Implementations

### Whisper (OpenAI)
```python
from openai import OpenAI
client = OpenAI()

with open("audio.mp3", "rb") as f:
    # response_format can also be "srt", "vtt", or "verbose_json"
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=f
    )
print(transcript.text)
```

### Deepgram
```python
from deepgram import DeepgramClient, PrerecordedOptions

dg = DeepgramClient(api_key="...")
options = PrerecordedOptions(model="nova-3", diarize=True)

# Read the file and close it; pass the bytes as a buffer
with open("audio.mp3", "rb") as f:
    payload = {"buffer": f.read()}

response = dg.listen.rest.v("1").transcribe_file(payload, options)
print(response.results.channels[0].alternatives[0].transcript)
```

### AssemblyAI
```python
import assemblyai as aai

aai.settings.api_key = "..."
transcriber = aai.Transcriber()

transcript = transcriber.transcribe("audio.mp3")
if transcript.status == aai.TranscriptStatus.error:
    raise RuntimeError(transcript.error)
print(transcript.text)
```

## Speaker Diarization

### Deepgram (Built-in)
```python
options = PrerecordedOptions(diarize=True)
# Response includes speaker labels automatically
```

### AssemblyAI
```python
config = aai.TranscriptionConfig(speaker_labels=True)
# +$0.02/hr additional
```

### Whisper (Requires Extra)
```python
# Whisper has no diarization; pair it with a tool like pyannote
from pyannote.audio import Pipeline

# The pretrained pipeline is gated on Hugging Face; an access token is required
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization", use_auth_token="hf_..."
)
diarization = pipeline("audio.mp3")
```
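Pairing Whisper with pyannote still leaves the merge step: each transcript segment needs a speaker label. One common approach is maximum-overlap assignment. A sketch, assuming Whisper-style segment dicts and pyannote-style `(start, end, speaker)` turns (the exact field names here are illustrative):

```python
def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose
    diarization turn overlaps it the most.

    segments: [{"start": s, "end": e, "text": ...}]  # from Whisper
    turns:    [(start, end, "SPEAKER_00"), ...]      # from pyannote
    """
    labeled = []
    for seg in segments:
        best, best_overlap = None, 0.0
        for start, end, speaker in turns:
            # Length of the time interval shared by segment and turn
            overlap = min(seg["end"], end) - max(seg["start"], start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labeled.append({**seg, "speaker": best})
    return labeled
```

Maximum overlap is robust to small boundary disagreements between the two models; segments that straddle a speaker change get the dominant speaker.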

## Batch Processing
```python
import asyncio

async def transcribe(path):
    # Wrap your provider's SDK call here; for a blocking client,
    # use asyncio.to_thread(client.transcribe, path)
    ...

async def transcribe_batch(files):
    tasks = [transcribe(f) for f in files]
    return await asyncio.gather(*tasks)
```
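Unbounded `gather` can blow past provider rate limits on large libraries. A minimal sketch that caps in-flight requests with `asyncio.Semaphore` (`transcribe_one` here is a stand-in for your provider call):

```python
import asyncio

async def transcribe_one(path: str) -> str:
    # Placeholder: swap in your provider's SDK call
    await asyncio.sleep(0)  # simulate network I/O
    return f"transcript of {path}"

async def transcribe_batch(files, max_concurrency=5):
    # Semaphore caps concurrent requests to respect rate limits
    sem = asyncio.Semaphore(max_concurrency)

    async def worker(path):
        async with sem:
            return await transcribe_one(path)

    return await asyncio.gather(*(worker(f) for f in files))

results = asyncio.run(transcribe_batch([f"ep{i}.mp3" for i in range(3)]))
```

Tune `max_concurrency` to your provider's documented rate limits; results come back in input order because `gather` preserves ordering.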

## Output Formats
- Plain text
- SRT/VTT subtitles
- JSON with timestamps
- Word-level timing
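When a provider returns only word-level timing, subtitle formats like SRT are a short formatting step away. A sketch that emits one cue per word for illustration (real pipelines usually group words into phrase-length cues first; the input shape is an assumed `{"word", "start", "end"}` dict):

```python
def to_srt(words):
    """Render word-level timestamps as SRT cues.

    words: [{"word": str, "start": seconds, "end": seconds}]
    """
    def ts(t):
        # SRT timestamps look like HH:MM:SS,mmm
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        ms = int(round((t - int(t)) * 1000))
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    cues = []
    for i, w in enumerate(words, 1):
        cues.append(f"{i}\n{ts(w['start'])} --> {ts(w['end'])}\n{w['word']}\n")
    return "\n".join(cues)
```

VTT differs mainly in the `WEBVTT` header and a `.` instead of `,` in timestamps, so the same helper adapts easily.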

Use when: Podcast transcription, meeting notes, video subtitles, voice content indexing

Overview

This skill helps you choose and implement the right audio/video transcription pipeline by comparing Whisper, Deepgram, and AssemblyAI and providing practical usage patterns. It summarizes pricing, latency, diarization support, and recommended scenarios so you can match a service to your product, privacy needs, and budget. It includes quick implementation guidance, batching tips, and output format options.

How this skill works

The skill inspects trade-offs between cost, speed, diarization, and real-time capability to recommend one of three popular transcription providers. It explains where each service excels (batch vs. real-time, privacy vs. features) and outlines integration patterns for single-file, batch, and streaming workflows. It also covers speaker diarization approaches and common output formats like text, SRT/VTT, and JSON with timestamps.

When to use it

  • Choose Whisper for one-time batch jobs, self-hosting, or privacy-sensitive local processing.
  • Choose Deepgram when you need low-latency real-time transcription and built-in speaker diarization for meetings or live captions.
  • Choose AssemblyAI when per-minute cost and advanced content analysis features (sentiment, topics, PII redaction) matter.
  • Use batch processing for podcasts and large media libraries to maximize throughput and reduce cost.
  • Use streaming/real-time for live captioning, voice apps, or call center monitoring.

Best practices

  • Estimate end-to-end cost using price/min and additional diarization fees to avoid surprises.
  • Prototype with a small dataset to validate accuracy, speaker separation, and timestamps before bulk processing.
  • Use batching and async concurrency for high-volume workloads to improve throughput and latency.
  • Apply PII redaction or local processing for regulated content; prefer self-hosting or providers offering redaction when necessary.
  • Normalize audio (sample rate, channels, noise reduction) before transcription to improve accuracy.

Example use cases

  • Podcast pipeline: batch-transcribe episodes with Whisper (self-hosted or API) and export SRT/VTT for publishing.
  • Live meeting captions: use Deepgram for low-latency streaming and built-in speaker labels for meeting notes.
  • Content analysis and compliance: use AssemblyAI for cheapest per-minute cost plus topic extraction and PII redaction.
  • Call center monitoring: real-time Deepgram transcription with diarization and downstream sentiment/topic analytics.
  • Large media archive: run async batch jobs to produce searchable JSON transcripts with word-level timing.

FAQ

Which provider is cheapest?

AssemblyAI is typically the lowest per-minute price; include any extra feature fees (e.g., speaker labels) when estimating total cost.

Which option is best for speaker diarization?

Deepgram provides built-in diarization for real-time and prerecorded audio; AssemblyAI offers speaker labels as an add-on; Whisper requires a separate diarization tool like pyannote.

Can I mix providers in one pipeline?

Yes. Common patterns use one service for low-cost bulk transcription and another for live or diarized segments based on feature needs and cost.