voice-agents skill

/voice-agents

This skill helps you design and optimize voice agents for low-latency conversations across speech-to-speech and pipeline architectures.

npx playbooks add skill xfstudio/skills --skill voice-agents

Review the files below or copy the command above to add this skill to your agents.

Files (1)

SKILL.md

2.2 KB

---
name: voice-agents
description: "Voice agents represent the frontier of AI interaction - humans speaking naturally with AI systems. The challenge isn't just speech recognition and synthesis, it's achieving natural conversation flow with sub-800ms latency while handling interruptions, background noise, and emotional nuance.  This skill covers two architectures: speech-to-speech (OpenAI Realtime API, lowest latency, most natural) and pipeline (STT→LLM→TTS, more control, easier to debug). Key insight: latency is the constraint. Hu"
source: vibeship-spawner-skills (Apache 2.0)
---

# Voice Agents

You are a voice AI architect who has shipped production voice agents handling
millions of calls. You understand the physics of latency - every component
adds milliseconds, and the sum determines whether conversations feel natural
or awkward.

Your core insight: Two architectures exist. Speech-to-speech (S2S) models like
OpenAI Realtime API preserve emotion and achieve lowest latency but are less
controllable. Pipeline architectures (STT→LLM→TTS) give you control at each
step but add latency. Mos

## Capabilities

- voice-agents
- speech-to-speech
- speech-to-text
- text-to-speech
- conversational-ai
- voice-activity-detection
- turn-taking
- barge-in-detection
- voice-interfaces

## Patterns

### Speech-to-Speech Architecture

Direct audio-to-audio processing for lowest latency

### Pipeline Architecture

Separate STT → LLM → TTS for maximum control
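
The staged flow above can be sketched as three callables run in sequence, with each stage timed separately so its cost is visible. This is a minimal illustration, not a production design: `run_pipeline` and the `fake_*` stand-ins are hypothetical names, and a real system would call actual STT, LLM, and TTS services and most likely stream between stages rather than wait for each to finish.

```python
import time
from typing import Callable, Dict, Tuple


def run_pipeline(
    audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> Tuple[bytes, Dict[str, float]]:
    """Run STT -> LLM -> TTS in sequence and record per-stage latency in ms."""
    timings: Dict[str, float] = {}

    start = time.perf_counter()
    transcript = stt(audio)
    timings["stt"] = (time.perf_counter() - start) * 1000.0

    start = time.perf_counter()
    reply_text = llm(transcript)
    timings["llm"] = (time.perf_counter() - start) * 1000.0

    start = time.perf_counter()
    reply_audio = tts(reply_text)
    timings["tts"] = (time.perf_counter() - start) * 1000.0

    timings["total"] = timings["stt"] + timings["llm"] + timings["tts"]
    return reply_audio, timings


# Hypothetical stand-ins so the sketch runs without any external service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


reply, timings = run_pipeline(b"hello", fake_stt, fake_llm, fake_tts)
print(reply, sorted(timings))
```

Because every stage boundary is an ordinary function call, this is also where logging, auditing, or safety checks could be attached, which is the control the pipeline approach trades latency for.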

### Voice Activity Detection Pattern

Detect when user starts/stops speaking
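
As a rough illustration of start/stop detection, the sketch below uses a plain energy threshold with a short "hangover" so brief pauses do not end a segment. The function names, the threshold, and the frame layout are assumptions made for this example. It is also the simplest possible form: on its own it amounts to the silence-only turn detection listed under Anti-Patterns, so treat it as a baseline that semantic signals would sit on top of.

```python
import math
from typing import List, Sequence, Tuple


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_speech_segments(
    frames: Sequence[Sequence[float]],
    threshold: float = 0.1,      # assumed energy threshold, tune per microphone
    hangover_frames: int = 2,    # quiet frames tolerated before ending a segment
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices of speech segments, end exclusive."""
    segments: List[Tuple[int, int]] = []
    start = None
    silent_run = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            if start is None:
                start = i            # user started speaking
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= hangover_frames:
                # user stopped speaking; exclude the trailing quiet frames
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        segments.append((start, len(frames) - silent_run))
    return segments


quiet = [0.0] * 160
loud = [0.5] * 160
frames = [quiet, loud, loud, quiet, loud, quiet, quiet, quiet]
print(detect_speech_segments(frames))
```

The single quiet frame in the middle of the example is bridged by the hangover, so the input yields one segment rather than two.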

## Anti-Patterns

### ❌ Ignoring Latency Budget

### ❌ Silence-Only Turn Detection

### ❌ Long Responses

## ⚠️ Sharp Edges

| Severity | Solution |
|----------|----------|
| critical | Measure and budget latency for each component |
| high | Target jitter metrics |
| high | Use semantic VAD |
| high | Implement barge-in detection |
| medium | Constrain response length in prompts |
| medium | Prompt for spoken format |
| medium | Implement noise handling |
| medium | Mitigate STT errors |
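
For the barge-in row in the table, one possible shape is a small debounced detector: it signals that agent playback should be cancelled only after the user has been speaking for several consecutive frames while the agent is talking. `BargeInDetector` and its `min_speech_frames` parameter are hypothetical names for this sketch, and the per-frame `user_speech` flag is assumed to come from whatever VAD is in use.

```python
from dataclasses import dataclass


@dataclass
class BargeInDetector:
    """Debounced barge-in signal based on consecutive user-speech frames."""

    min_speech_frames: int = 3   # assumed debounce length, tune per deployment
    _consecutive: int = 0

    def update(self, agent_speaking: bool, user_speech: bool) -> bool:
        """Return True when agent playback should be cancelled."""
        if not agent_speaking or not user_speech:
            # Nothing to interrupt, or the user went quiet: reset the run.
            self._consecutive = 0
            return False
        self._consecutive += 1
        if self._consecutive >= self.min_speech_frames:
            self._consecutive = 0
            return True
        return False


detector = BargeInDetector(min_speech_frames=3)
user_frames = [True, True, False, True, True, True]
signals = [detector.update(agent_speaking=True, user_speech=s) for s in user_frames]
print(signals)
```

The debounce is a trade-off: a larger `min_speech_frames` reduces false cancellations from coughs or background noise but makes the agent slower to yield.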

## Related Skills

Works well with: `agent-tool-builder`, `multi-agent-orchestration`, `llm-architect`, `backend`

Overview

This skill describes production-ready patterns for building low-latency voice agents. It contrasts speech-to-speech (S2S) and pipeline (STT→LLM→TTS) architectures, and emphasizes latency, turn-taking, and robustness to noise and interruptions. The guidance is practical and focused on making conversational audio feel natural under an 800 ms budget.

How this skill works

It inspects the end-to-end audio path and identifies where milliseconds are spent: capture, VAD, STT, LLM inference, TTS, and network jitter. Two architectures are presented: S2S for minimal hop count and preserved prosody, and pipeline for modular control and easier debugging. It also defines patterns for voice activity detection, barge-in detection, and semantic VAD to reduce false starts and improve responsiveness.
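
A simple way to make the per-component accounting concrete is to sum the stage latencies and compare the total with the 800 ms target. The sketch below is illustrative only: `check_budget` is a hypothetical helper, and the numbers in `example` are made-up placeholders, not measurements of any real system.

```python
from typing import Dict

BUDGET_MS = 800.0  # the end-to-end target used throughout this skill


def check_budget(component_ms: Dict[str, float], budget_ms: float = BUDGET_MS) -> dict:
    """Sum per-component latencies and report headroom against the budget."""
    total = sum(component_ms.values())
    return {
        "total_ms": total,
        "headroom_ms": budget_ms - total,
        "within_budget": total <= budget_ms,
        "largest_component": max(component_ms, key=component_ms.get),
    }


# Placeholder figures for illustration; replace with measured values.
example = {
    "capture": 20.0,
    "vad": 50.0,
    "stt": 150.0,
    "llm": 300.0,
    "tts": 120.0,
    "network": 60.0,
}
report = check_budget(example)
print(report)
```

Reporting the largest component alongside the total points directly at the stage where optimization effort is likely to pay off first.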

When to use it

- When sub-800 ms perceived latency is required for natural conversation
- When emotion and prosody must be preserved (prefer S2S)
- When you need fine-grained control, auditing, or custom safety checks (use pipeline)
- When building high-call-volume systems that require predictable jitter budgets
- When supporting noisy environments or overlapping speech

Best practices

- Measure and budget latency for every component; track p50/p90/p99 numbers
- Favor semantic VAD and barge-in detection over silence-only rules
- Constrain response length in prompts and enforce token/time caps to bound latency
- Handle STT errors with quick recovery prompts and confidence thresholds
- Instrument jitter and fallback strategies (e.g., lower-fidelity codec or short responses)
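
Tracking p50/p90/p99 can be done with a few lines over recorded end-to-end latencies. The sketch below uses the nearest-rank definition of a percentile, which is one of several common conventions; the helper names and the sample values are assumptions for illustration.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_summary(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize latencies at the percentiles named in the best practices."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 90, 99)}


# Illustrative latencies in milliseconds, not real measurements.
samples_ms = [100.0 * i for i in range(1, 11)]  # 100, 200, ..., 1000
summary = latency_summary(samples_ms)
print(summary)
```

With so few samples the p99 simply lands on the maximum; tail percentiles only become meaningful once a large number of turns has been recorded.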

Example use cases

- Customer support voice bot that must interrupt itself when agents or customers interject
- Hands-free in-car assistant requiring <800 ms round-trip latency for navigation queries
- Companion or therapy agent where preserved prosody and emotion matter (S2S)
- Voice-enabled IVR that needs strict auditability and control over policies (pipeline)
- Real-time conferencing assistant that summarizes or translates while handling overlap

FAQ

Which architecture should I choose first?

Start by defining your latency and control requirements: choose S2S for lowest latency and natural prosody, pipeline if you need modular control, auditing, or easier debugging.

How do I measure if a conversation feels natural?

Track end-to-end latency percentiles, user interruption success rate, and qualitative tests (A/B with human listeners). Aim for p90 well under 800 ms where possible.