
audio-language-models skill

/plugins/ork/skills/audio-language-models

This skill helps you implement real-time voice agents, transcription, and TTS using top models for multilingual, natural conversations.

npx playbooks add skill yonatangross/orchestkit --skill audio-language-models

---
name: audio-language-models
description: Gemini Live API, Grok Voice Agent, GPT-4o-Transcribe, AssemblyAI patterns for real-time voice, speech-to-text, and TTS. Use when implementing voice agents, audio transcription, or conversational AI.
context: fork
agent: multimodal-specialist
version: 1.1.0
author: OrchestKit
user-invocable: false
tags: [audio, multimodal, gemini-live, grok-voice, whisper, tts, speech, voice-agent, 2026]
---

# Audio Language Models (2026)

Build real-time voice agents and audio-processing pipelines using the latest native speech-to-speech models.

## Overview

- Real-time voice assistants and agents
- Live conversational AI (phone agents, support bots)
- Audio transcription with speaker diarization
- Multilingual voice interactions
- Text-to-speech generation
- Voice-to-voice translation

## Model Comparison (January 2026)

### Real-Time Voice (Speech-to-Speech)

| Model | Latency | Languages | Price | Best For |
|-------|---------|-----------|-------|----------|
| **Grok Voice Agent** | <1s TTFA | 100+ | $0.05/min | Fastest; #1 on Big Bench Audio |
| **Gemini Live API** | Low | 24 (30 voices) | Usage-based | Emotional awareness |
| **OpenAI Realtime** | ~1s | 50+ | $0.10/min | Ecosystem integration |

### Speech-to-Text Only

| Model | WER | Latency | Best For |
|-------|-----|---------|----------|
| **Gemini 2.5 Pro** | ~5% | Medium | 9.5hr audio, diarization |
| **GPT-4o-Transcribe** | ~7% | Medium | Accuracy + accents |
| **AssemblyAI Universal-2** | 8.4% | 200ms | Best features |
| **Deepgram Nova-3** | ~18% | <300ms | Lowest latency |
| **Whisper Large V3** | 7.4% | Slow | Self-host, 99+ langs |
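
The WER column is word error rate: the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. A minimal sketch for scoring a provider on your own audio follows. The function name `word_error_rate` and the lowercase/whitespace normalization are illustrative assumptions; published benchmarks usually apply heavier text normalization, so numbers will not match the table exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the previous ref prefix
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words
print(word_error_rate("turn on the lights", "turn off the lights"))  # 0.25
```

Run it over a few dozen recordings that resemble production traffic (accents, background noise, domain vocabulary) before trusting any single headline figure.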

## Grok Voice Agent API (xAI) - Fastest

```python
import asyncio
import base64
import json
import os

import websockets

XAI_API_KEY = os.environ["XAI_API_KEY"]  # read the key from the environment

async def grok_voice_agent(audio_stream, on_audio):
    """Real-time voice agent with Grok - #1 on Big Bench Audio.

    Features:
    - <1 second time-to-first-audio (5x faster than competitors)
    - Native speech-to-speech (no transcription intermediary)
    - 100+ languages, $0.05/min
    - OpenAI Realtime API compatible

    audio_stream: async iterator of raw PCM16 chunks (bytes) from the microphone.
    on_audio: async callback that receives each decoded PCM16 chunk for playback.
    """
    uri = "wss://api.x.ai/v1/realtime"
    headers = {"Authorization": f"Bearer {XAI_API_KEY}"}

    # websockets >= 14 takes `additional_headers`; older releases call it `extra_headers`
    async with websockets.connect(uri, additional_headers=headers) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "model": "grok-4-voice",
                "voice": "Aria",  # or "Eve", "Leo"
                "instructions": "You are a helpful voice assistant.",
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "turn_detection": {"type": "server_vad"}
            }
        }))

        # Stream microphone audio to the server
        async def send_audio():
            async for chunk in audio_stream:
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(chunk).decode()
                }))

        # Forward synthesized audio to the caller as it arrives
        async def receive_audio():
            async for message in ws:
                data = json.loads(message)
                if data.get("type") == "response.audio.delta":
                    await on_audio(base64.b64decode(data["delta"]))

        # Run both directions inside the `async with` so the socket stays open
        await asyncio.gather(send_audio(), receive_audio())

# Expressive voice with auditory cues
async def expressive_response(ws, text: str):
    """Use auditory cues for natural speech.

    Example text: "[sigh] Let me think about that... [pause] Here's what I found."
    """
    # Supports: [whisper], [sigh], [laugh], [pause]
    await ws.send(json.dumps({
        "type": "response.create",
        "response": {
            "instructions": text
        }
    }))
```
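
The JSON framing used in the block above can be checked offline, without opening a socket. The sketch below assumes the OpenAI Realtime-style event names shown above (`input_audio_buffer.append`, `response.audio.delta`); the helper names `encode_audio_event` and `decode_audio_delta` are hypothetical.

```python
import base64
import json
from typing import Optional


def encode_audio_event(chunk: bytes) -> str:
    """Wrap a raw PCM16 chunk in an `input_audio_buffer.append` event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(chunk).decode("ascii"),
    })


def decode_audio_delta(message: str) -> Optional[bytes]:
    """Return decoded audio for `response.audio.delta` events, else None."""
    data = json.loads(message)
    if data.get("type") != "response.audio.delta":
        return None
    return base64.b64decode(data["delta"])
```

Keeping the framing in pure functions like these makes it easy to unit-test the protocol layer separately from the network code.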

## Gemini Live API (Google) - Emotional Awareness

```python
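import asyncio
import os

from google import genai
from google.genai import types

# The Live API ships in the google-genai SDK (`pip install google-genai`),
# not in the legacy google-generativeai package.
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
MODEL = "gemini-2.5-flash-live"  # model ID as listed in this skill; confirm against the current model list

def live_config(with_transcript: bool = False) -> types.LiveConnectConfig:
    """Shared session config: audio output with a prebuilt voice."""
    return types.LiveConnectConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(
                    voice_name="Puck"  # or Charon, Kore, Fenrir, Aoede
                )
            )
        ),
        system_instruction="You are a friendly voice assistant.",
        # Optionally ask the server to transcribe its own audio output
        output_audio_transcription=(
            types.AudioTranscriptionConfig() if with_transcript else None
        ),
    )

async def pump_audio(session, audio_chunks):
    """Forward 16 kHz, 16-bit PCM chunks from an async iterator to the session."""
    async for chunk in audio_chunks:
        await session.send_realtime_input(
            audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
        )

async def gemini_live_voice(audio_chunks):
    """Real-time voice with emotional understanding.

    Features:
    - 30 HD voices in 24 languages
    - Affective dialog (understands emotions)
    - Barge-in support (interrupt anytime)
    - Proactive audio (responds only when relevant)
    """
    async with client.aio.live.connect(model=MODEL, config=live_config()) as session:
        # Send audio in the background while responses are consumed
        sender = asyncio.create_task(pump_audio(session, audio_chunks))
        try:
            # Receive audio responses
            async for response in session.receive():
                if response.data:
                    yield response.data  # Audio bytes
        finally:
            sender.cancel()

# With transcription
async def gemini_live_with_transcript(audio_chunks):
    """Get both audio and text transcripts."""
    config = live_config(with_transcript=True)
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        sender = asyncio.create_task(pump_audio(session, audio_chunks))
        try:
            async for response in session.receive():
                content = response.server_content
                # Text transcript of the model's spoken output
                if content and content.output_transcription:
                    print(f"Transcript: {content.output_transcription.text}")
                if response.data:
                    yield response.data  # Audio
        finally:
            sender.cancel()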
```
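
Barge-in only feels natural if the client also stops playing audio it has already buffered once the user interrupts. A minimal, SDK-independent sketch of such a playback buffer follows. The class `PlaybackBuffer` and its method names are hypothetical, and wiring `interrupt()` to the server's interruption signal is left to the caller.

```python
from collections import deque
from typing import Optional


class PlaybackBuffer:
    """FIFO of audio chunks waiting to be played, with barge-in support."""

    def __init__(self) -> None:
        self._chunks: deque = deque()

    def push(self, chunk: bytes) -> None:
        """Queue a chunk received from the model."""
        self._chunks.append(chunk)

    def next_chunk(self) -> Optional[bytes]:
        """Return the next chunk to play, or None when the queue is empty."""
        return self._chunks.popleft() if self._chunks else None

    def interrupt(self) -> int:
        """Drop all queued audio (user barged in); return how many chunks were dropped."""
        dropped = len(self._chunks)
        self._chunks.clear()
        return dropped
```

The audio output callback would call `next_chunk()` on each tick, so clearing the queue silences the agent within one chunk's duration.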

## Gemini Audio Transcription (Long-Form)

```python
import os

from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

def transcribe_with_gemini(audio_path: str) -> dict:
    """Transcribe up to 9.5 hours of audio with speaker diarization.

    Gemini 2.5 Pro handles long-form audio natively.
    """
    # Upload audio file through the Files API
    audio_file = client.files.upload(file=audio_path)

    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=[
            audio_file,
            """Transcribe this audio with:
        1. Speaker labels (Speaker 1, Speaker 2, etc.)
        2. Timestamps for each segment
        3. Punctuation and formatting

        Format:
        [00:00:00] Speaker 1: First statement...
        [00:00:15] Speaker 2: Response...""",
        ],
    )

    return {
        "transcript": response.text,
        # The uploaded file object exposes no duration; keep its name for later lookup or cleanup
        "file_name": audio_file.name
    }
```
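
Because the prompt above pins the output to `[HH:MM:SS] Speaker N: text` lines, the transcript can be parsed into structured segments. A minimal sketch, assuming the model follows that format exactly; the `parse_transcript` helper is hypothetical, and lines that do not match are silently skipped.

```python
import re

# Matches lines such as "[00:01:15] Speaker 2: Response..."
LINE_RE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s*(Speaker \d+):\s*(.*)$")


def parse_transcript(transcript: str) -> list:
    """Turn timestamped transcript lines into dicts with start time, speaker, text."""
    segments = []
    for line in transcript.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank lines or anything off-format
        hours, minutes, seconds, speaker, text = match.groups()
        segments.append({
            "start_seconds": int(hours) * 3600 + int(minutes) * 60 + int(seconds),
            "speaker": speaker,
            "text": text,
        })
    return segments
```

Model output is not guaranteed to stay on-format for hours of audio, so it is worth logging how many lines were skipped.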

## Gemini TTS (Text-to-Speech)

```python
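import os

from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

def gemini_text_to_speech(text: str, voice: str = "Kore") -> bytes:
    """Generate speech with Gemini 2.5 TTS.

    Features:
    - Enhanced expressivity with style prompts
    - Precision pacing (context-aware speed)
    - Multi-speaker dialogue consistency

    Returns raw PCM audio bytes (not MP3); add a WAV header before playback.
    """
    response = client.models.generate_content(
        model="gemini-2.5-flash-tts",  # model ID as listed in this skill; confirm against the current model list
        contents=text,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name=voice  # Puck, Charon, Kore, Fenrir, Aoede
                    )
                )
            ),
        ),
    )

    # Audio comes back as inline data on the first content part
    return response.candidates[0].content.parts[0].inline_data.data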
```
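
If the TTS call returns raw PCM rather than an encoded file (check the MIME type on the response), it needs a container before most players will open it. The sketch below wraps PCM bytes in a WAV header using the standard-library `wave` module. The 24 kHz, 16-bit mono defaults are assumptions to verify against the actual response, and `pcm_to_wav` is a hypothetical helper name.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000, channels: int = 1,
               sample_width: int = 2) -> bytes:
    """Return the PCM samples wrapped in an in-memory WAV file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # bytes per sample: 2 = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()
```

Writing the result to disk is then a single `open("out.wav", "wb").write(...)` call.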

## OpenAI GPT-4o-Transcribe

```python
from typing import Optional

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def transcribe_openai(audio_path: str, language: Optional[str] = None) -> dict:
    """Transcribe with GPT-4o-Transcribe (enhanced accuracy)."""
    # gpt-4o-transcribe returns plain text/JSON only; `verbose_json` and
    # `timestamp_granularities` (word/segment timestamps) require whisper-1.
    params = {"model": "gpt-4o-transcribe", "response_format": "json"}
    if language:
        params["language"] = language  # ISO-639-1 code, e.g. "en"

    with open(audio_path, "rb") as audio_file:
        response = client.audio.transcriptions.create(file=audio_file, **params)

    return {"text": response.text}
```

## AssemblyAI (Best Features)

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

def transcribe_assemblyai(audio_url: str) -> dict:
    """Transcribe with speaker diarization, sentiment, entities."""
    config = aai.TranscriptionConfig(
        speaker_labels=True,
        sentiment_analysis=True,
        entity_detection=True,
        auto_highlights=True,
        language_detection=True
    )

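    transcriber = aai.Transcriber()
    transcript = transcriber.transcribe(audio_url, config=config)

    # transcribe() returns a transcript object even on failure, so check its status
    if transcript.status == aai.TranscriptStatus.error:
        raise RuntimeError(f"Transcription failed: {transcript.error}")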

    return {
        "text": transcript.text,
        "speakers": transcript.utterances,
        "sentiment": transcript.sentiment_analysis,
        "entities": transcript.entities
    }
```
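
Diarized utterances are most useful once aggregated. A minimal sketch that totals talk time per speaker, assuming each utterance exposes `speaker`, `start`, and `end` with times in milliseconds (as AssemblyAI utterances do). `talk_time_by_speaker` is a hypothetical helper and the sample data is made up for illustration.

```python
from collections import defaultdict
from types import SimpleNamespace


def talk_time_by_speaker(utterances) -> dict:
    """Sum utterance durations per speaker label, in seconds."""
    totals = defaultdict(float)
    for utterance in utterances:
        totals[utterance.speaker] += (utterance.end - utterance.start) / 1000
    return dict(totals)


# Stand-in objects with the same attribute names as real utterances
sample = [
    SimpleNamespace(speaker="A", start=0, end=4000),
    SimpleNamespace(speaker="B", start=4000, end=5500),
    SimpleNamespace(speaker="A", start=5500, end=7500),
]
print(talk_time_by_speaker(sample))  # {'A': 6.0, 'B': 1.5}
```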

## Real-Time Streaming Comparison

```python
def choose_realtime_provider(
    requirements: dict
) -> str:
    """Select best real-time voice provider (pure lookup, so no async needed)."""

    if requirements.get("fastest_latency"):
        return "grok"  # <1s TTFA, 5x faster

    if requirements.get("emotional_understanding"):
        return "gemini"  # Affective dialog

    if requirements.get("openai_ecosystem"):
        return "openai"  # Compatible tools

    if requirements.get("lowest_cost"):
        return "grok"  # $0.05/min (half of OpenAI)

    return "gemini"  # Best overall for 2026
```

## API Pricing (January 2026)

| Provider | Type | Price | Notes |
|----------|------|-------|-------|
| Grok Voice Agent | Real-time | $0.05/min | Cheapest real-time |
| Gemini Live | Real-time | Usage-based | 30 HD voices |
| OpenAI Realtime | Real-time | $0.10/min | |
| Gemini 2.5 Pro | Transcription | $1.25/M tokens | 9.5hr audio |
| GPT-4o-Transcribe | Transcription | $0.01/min | |
| AssemblyAI | Transcription | ~$0.15/hr | Best features |
| Deepgram | Transcription | ~$0.0043/min | |
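
The per-minute rates above translate into a monthly budget with simple arithmetic. The sketch below hard-codes the two fixed real-time prices from the table; the function name and the usage figures are illustrative, and Gemini Live is omitted because the table lists it only as usage-based.

```python
# Per-minute real-time prices copied from the table above (USD)
REALTIME_PRICE_PER_MIN = {
    "grok": 0.05,
    "openai": 0.10,
}


def monthly_cost(provider: str, calls_per_day: int, avg_minutes: float,
                 days: int = 30) -> float:
    """Estimated monthly spend in USD for a steady daily call volume."""
    rate = REALTIME_PRICE_PER_MIN[provider]
    return round(rate * calls_per_day * avg_minutes * days, 2)


# 200 calls a day at 3 minutes each
print(monthly_cost("grok", 200, 3))    # 900.0
print(monthly_cost("openai", 200, 3))  # 1800.0
```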

## Key Decisions

| Scenario | Recommendation |
|----------|----------------|
| Voice assistant | Grok Voice Agent (fastest) |
| Emotional AI | Gemini Live API |
| Long audio (hours) | Gemini 2.5 Pro (9.5hr) |
| Speaker diarization | AssemblyAI or Gemini |
| Lowest latency STT | Deepgram Nova-3 |
| Self-hosted | Whisper Large V3 |

## Common Mistakes

- Defaulting to an STT+LLM+TTS pipeline when low latency matters and native speech-to-speech would do
- Not leveraging emotional understanding (Gemini)
- Ignoring barge-in support for natural conversations
- Defaulting to the older Whisper-1 when GPT-4o-Transcribe gives better accuracy
- Not testing latency with real users
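
Latency testing can start with a small timing harness around whichever stream the provider returns. A minimal sketch measuring time-to-first-audio (TTFA) over any async iterator of audio chunks; `measure_ttfa` and the simulated `fake_provider` stream are hypothetical stand-ins for a real provider session.

```python
import asyncio
import time


async def measure_ttfa(audio_stream) -> float:
    """Seconds from the call until the first non-empty audio chunk arrives."""
    start = time.perf_counter()
    async for chunk in audio_stream:
        if chunk:
            return time.perf_counter() - start
    raise RuntimeError("stream ended without producing audio")


async def fake_provider(delay: float):
    """Stand-in for a provider stream: waits, then yields one chunk."""
    await asyncio.sleep(delay)
    yield b"\x00\x00" * 160


ttfa = asyncio.run(measure_ttfa(fake_provider(0.05)))
print(f"TTFA: {ttfa * 1000:.0f} ms")
```

In a real test, start the timer when the user stops speaking rather than when the request is sent, and collect a distribution across many calls instead of a single sample.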

## Related Skills

- `vision-language-models` - Image/video processing
- `multimodal-rag` - Audio + text retrieval
- `streaming-api-patterns` - WebSocket patterns

## Capability Details

### real-time-voice
**Keywords:** voice agent, real-time, conversational, live audio
**Solves:**
- Build voice assistants
- Phone agents and support bots
- Interactive voice response (IVR)

### speech-to-speech
**Keywords:** native audio, speech-to-speech, no transcription
**Solves:**
- Low-latency voice responses
- Natural conversation flow
- Emotional voice interactions

### transcription
**Keywords:** transcribe, speech-to-text, STT, convert audio
**Solves:**
- Convert audio files to text
- Generate meeting transcripts
- Process long-form audio

### voice-tts
**Keywords:** TTS, text-to-speech, voice synthesis
**Solves:**
- Generate natural speech
- Multi-voice dialogue
- Expressive audio output

Overview

This skill packages production-ready patterns and comparisons for building real-time voice agents, speech-to-text pipelines, and TTS using Gemini Live, Grok Voice Agent, GPT-4o-Transcribe, AssemblyAI, and related providers. It focuses on practical, low-latency voice interactions, long-form transcription, speaker diarization, and expressive TTS, all with Python examples and integration patterns. Use it to choose providers, implement streaming WebSockets, and deploy conversational audio agents.

How this skill works

The skill inspects provider capabilities, latency, language coverage, and pricing, then offers code patterns for real-time streaming (WebSocket and SDK flows), long-form transcription uploads, and TTS generation. It includes best-practice decision logic to pick Grok for sub-second time-to-first-audio, Gemini for emotional and long-form audio, GPT-4o-Transcribe for high-accuracy STT, and AssemblyAI for diarization and entity/sentiment extraction. Example implementations show session setup, audio chunking, response handling, and expressive prompts for natural voice output.

When to use it

- Building a low-latency, real-time voice assistant or phone agent
- Transcribing long-form meetings with speaker labels and timestamps
- Creating multilingual voice UIs or voice-to-voice translation
- Choosing between native speech-to-speech and STT+LLM+TTS pipelines
- Adding emotional understanding, barge-in, or expressive TTS to a conversational agent

Best practices

- Prefer native speech-to-speech when you need sub-second turnaround and natural flow
- Benchmark latency and TTFA with real users, not only lab tests
- Use provider-specific features: Grok for speed, Gemini for affective dialog, AssemblyAI for diarization
- Stream audio in small chunks and use server-side VAD or turn detection to reduce delays
- Include timestamps, speaker labels, and confidence scores in transcripts for downstream processing
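
The small-chunk advice can be sketched as a framing helper that splits PCM16 audio into fixed-duration frames. The 20 ms frame length and 16 kHz sample rate below are illustrative assumptions, not provider requirements, and `frame_pcm16` is a hypothetical name.

```python
from typing import Iterator


def frame_pcm16(pcm: bytes, sample_rate: int = 16000,
                frame_ms: int = 20) -> Iterator[bytes]:
    """Yield consecutive frames of `frame_ms` milliseconds from mono PCM16 audio.

    The final frame may be shorter if the input is not a whole number of frames.
    """
    bytes_per_frame = sample_rate * frame_ms // 1000 * 2  # 2 bytes per 16-bit sample
    for offset in range(0, len(pcm), bytes_per_frame):
        yield pcm[offset:offset + bytes_per_frame]


one_second = b"\x00\x00" * 16000  # one second of silence at 16 kHz
frames = list(frame_pcm16(one_second))
print(len(frames), len(frames[0]))  # 50 640
```

Smaller frames lower the delay before the server can start voice-activity detection, at the cost of more messages per second.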

Example use cases

- Customer support phone bot that handles interruptions and emotional cues using Gemini Live
- High-volume call transcription with speaker diarization and sentiment via AssemblyAI
- Real-time low-cost voice assistant using Grok Voice Agent for sub-second responses
- Batch transcribe 9+ hour recordings with Gemini 2.5 Pro for legal or research workflows
- Generate multi-voice, expressive audio for interactive stories or training simulations

FAQ

Which provider is best for lowest latency real-time voice?

Grok Voice Agent is the recommended choice for the fastest time-to-first-audio (<1s) and lowest per-minute cost for real-time speech-to-speech.

When should I use native speech-to-speech vs STT+LLM+TTS?

Use native speech-to-speech when you require minimal latency and natural turn-taking. Use STT+LLM+TTS when you need fine-grained transcript access, complex RAG workflows, or offline model control.
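
One way to see the trade-off: in a cascaded pipeline each stage runs only after the previous one finishes, so their latencies add up, while the intermediate transcript stays available for logging or retrieval. A minimal sketch with stub stages follows; `run_cascade` and every stage function here are hypothetical placeholders for real STT, LLM, and TTS calls.

```python
import time
from typing import Callable


def run_cascade(audio: bytes,
                stt: Callable[[bytes], str],
                llm: Callable[[str], str],
                tts: Callable[[str], bytes]) -> dict:
    """Run STT -> LLM -> TTS in sequence and record per-stage latency."""
    timings = {}

    start = time.perf_counter()
    transcript = stt(audio)
    timings["stt"] = time.perf_counter() - start

    start = time.perf_counter()
    reply = llm(transcript)  # the transcript is available here for RAG or logging
    timings["llm"] = time.perf_counter() - start

    start = time.perf_counter()
    speech = tts(reply)
    timings["tts"] = time.perf_counter() - start

    return {
        "transcript": transcript,
        "reply": reply,
        "audio": speech,
        "timings": timings,
        "total_seconds": sum(timings.values()),
    }


# Stub stages standing in for real provider calls
result = run_cascade(
    b"\x00\x00",
    stt=lambda audio: "what is the weather",
    llm=lambda text: f"You asked: {text}",
    tts=lambda text: text.encode("utf-8"),
)
print(result["transcript"], "->", result["reply"])
```

Swapping the stubs for real calls shows directly how much of the total budget each stage consumes, which is the number to compare against a native speech-to-speech session.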