---
name: voice-ai-integration
description: Build voice-enabled AI applications with speech recognition, text-to-speech, and voice-based interactions. Supports multiple voice providers and real-time processing. Use when creating voice assistants, voice-controlled applications, audio interfaces, or hands-free AI systems.
---
# Voice AI Integration
Build intelligent voice-enabled AI applications that understand spoken language and respond naturally through audio, creating seamless voice-first user experiences.
## Overview
Voice AI systems combine three key capabilities:
1. **Speech Recognition** - Convert audio input to text
2. **Natural Language Processing** - Understand intent and context
3. **Text-to-Speech** - Generate natural-sounding responses
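The three stages above compose into a single pipeline. A minimal sketch, with the providers injected as async callables so any STT/NLP/TTS backend can be plugged in (the stub providers here are placeholders, not real services):

```python
import asyncio

async def voice_pipeline(audio_bytes, stt, nlp, tts):
    """Run the three-stage voice pipeline: STT -> NLP -> TTS.

    `stt`, `nlp`, and `tts` are injected async callables, so any
    provider implementation can be swapped in without changing
    the orchestration.
    """
    text = await stt(audio_bytes)   # 1. speech -> text
    reply = await nlp(text)         # 2. understand intent, form reply
    return await tts(reply)         # 3. text -> speech

async def demo():
    # Stub providers standing in for real STT/NLP/TTS services
    async def stt(audio): return "turn on the lights"
    async def nlp(text): return f"OK, I will {text}."
    async def tts(text): return text.encode("utf-8")  # pretend audio bytes
    return await voice_pipeline(b"\x00\x01", stt, nlp, tts)

result = asyncio.run(demo())
```

Keeping the stages behind plain callables also makes the pipeline easy to unit-test without network access.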
## Speech Recognition Providers
See [examples/speech_recognition_providers.py](examples/speech_recognition_providers.py) for implementations:
- **Google Cloud Speech-to-Text**: High accuracy with automatic punctuation
- **OpenAI Whisper**: Robust multilingual speech recognition
- **Azure Speech Services**: Enterprise-grade speech recognition
- **AssemblyAI**: Async processing with high accuracy
## Text-to-Speech Providers
See [examples/text_to_speech_providers.py](examples/text_to_speech_providers.py) for implementations:
- **Google Cloud TTS**: Natural voices with multiple language support
- **OpenAI TTS**: Simple integration with high-quality output
- **Azure Speech Services**: Enterprise TTS with neural voices
- **Eleven Labs**: Premium voices with emotional control
## Voice Assistant Architecture
See [examples/voice_assistant.py](examples/voice_assistant.py) for `VoiceAssistant`:
- Complete voice pipeline: STT → NLP → TTS
- Conversation history management
- Multi-provider support (OpenAI, Google, Azure, etc.)
- Async processing for responsive interactions
## Real-Time Voice Processing
See [examples/realtime_voice_processor.py](examples/realtime_voice_processor.py) for `RealTimeVoiceProcessor`:
- Stream audio input from microphone
- Stream audio output to speakers
- Voice Activity Detection (VAD)
- Configurable sample rates and chunk sizes
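Voice Activity Detection can be sketched with a simple energy threshold over 16-bit PCM frames. This is a stdlib-only illustration, not the production-grade VAD (e.g. WebRTC VAD) you would deploy; the threshold value is an assumption to tune per microphone:

```python
import array
import math

def frame_rms(frame: bytes) -> float:
    """Root-mean-square energy of a frame of 16-bit mono PCM."""
    samples = array.array("h", frame)  # "h" = signed 16-bit
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Classify a frame as speech if its energy exceeds the threshold.

    `threshold` is an illustrative default; calibrate it against a few
    seconds of known silence from the actual microphone.
    """
    return frame_rms(frame) > threshold
```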
## Voice Agent Applications
### Voice-Controlled Smart Home
```python
class SmartHomeVoiceAgent:
    def __init__(self):
        self.voice_assistant = VoiceAssistant()
        self.devices = {
            "lights": SmartLights(),
            "temperature": SmartThermostat(),
            "security": SecuritySystem(),
        }

    async def handle_voice_command(self, audio_input):
        # Get text from voice
        command_text = await self.voice_assistant.process_voice_input(audio_input)

        # Parse intent
        intent = parse_smart_home_intent(command_text)

        # Execute command
        if intent.action == "turn_on_lights":
            self.devices["lights"].turn_on(intent.room)
        elif intent.action == "set_temperature":
            self.devices["temperature"].set(intent.value)

        # Confirm with voice
        response = f"I've {intent.action_description}"
        audio_output = await self.voice_assistant.synthesize_response(response)
        return audio_output
```
### Voice Meeting Transcription
```python
from datetime import datetime

class VoiceMeetingRecorder:
    def __init__(self):
        self.processor = RealTimeVoiceProcessor()
        self.transcripts = []

    async def record_and_transcribe_meeting(self, duration_seconds=3600):
        audio_stream = self.processor.stream_audio_input()
        buffer = []
        chunk_duration = 30   # Transcribe every 30 seconds
        sample_rate = 16000   # len(chunk) is assumed to count samples
        samples_recorded = 0

        for audio_chunk in audio_stream:
            buffer.append(audio_chunk)
            samples_recorded += len(audio_chunk)

            if sum(len(chunk) for chunk in buffer) >= chunk_duration * sample_rate:
                # Transcribe the buffered 30-second window
                transcript = transcribe_audio_whisper(buffer)
                self.transcripts.append({
                    "timestamp": datetime.now(),
                    "text": transcript,
                })
                buffer = []

            # Stop once the requested meeting length has been captured
            if samples_recorded >= duration_seconds * sample_rate:
                break

        return self.transcripts
```
## Best Practices
### Audio Quality
- ✓ Use 16kHz sample rate for speech recognition
- ✓ Handle background noise filtering
- ✓ Implement voice activity detection (VAD)
- ✓ Normalize audio levels
- ✓ Use appropriate audio format (WAV for quality)
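Level normalization from the checklist above can be done with a simple peak normalizer over 16-bit PCM. A stdlib-only sketch; the 90%-of-full-scale target is an assumption that leaves headroom against clipping in later processing:

```python
import array

def peak_normalize(pcm: bytes, target_peak: int = 29491) -> bytes:
    """Scale 16-bit mono PCM so its loudest sample hits the target peak.

    `target_peak` defaults to ~0.9 * 32767, leaving headroom so later
    filtering does not clip. Silence is returned unchanged.
    """
    samples = array.array("h", pcm)
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return pcm  # all-zero input: nothing to scale
    gain = target_peak / peak
    scaled = array.array("h", (
        max(-32768, min(32767, round(s * gain))) for s in samples
    ))
    return scaled.tobytes()
```

For material with large transients, RMS-based normalization or a compressor is usually a better fit than peak scaling.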
### Latency Optimization
- ✓ Use low-latency STT models
- ✓ Implement streaming transcription
- ✓ Cache common responses
- ✓ Use async processing
- ✓ Minimize network round trips
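Caching common responses ("Sorry, I didn't catch that") skips an entire TTS round trip. A minimal in-memory sketch; `synthesize` is any callable returning audio bytes, standing in for your provider's client:

```python
import hashlib

class TTSCache:
    """In-memory cache of synthesized audio, keyed by (voice, text)."""

    def __init__(self, synthesize):
        self._synthesize = synthesize  # callable(text, voice) -> bytes
        self._cache = {}
        self.hits = 0

    def _key(self, text, voice):
        return hashlib.sha256(f"{voice}:{text}".encode()).hexdigest()

    def get_audio(self, text, voice="default"):
        key = self._key(text, voice)
        if key in self._cache:
            self.hits += 1          # cache hit: no provider call
        else:
            self._cache[key] = self._synthesize(text, voice)
        return self._cache[key]
```

In production you would add an eviction policy (LRU or TTL) and persist entries, but the keying idea is the same.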
### Error Handling
- ✓ Handle network failures gracefully
- ✓ Implement fallback voices/providers
- ✓ Log audio processing failures
- ✓ Validate audio quality before processing
- ✓ Implement retry logic
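Transient STT/TTS network errors often succeed on a second attempt. A small retry helper with exponential backoff, sketched over any callable (the attempt count and delay are illustrative defaults):

```python
import time

def with_retries(fn, attempts=3, base_delay=0.01):
    """Call `fn`, retrying on exception with exponential backoff.

    Waits base_delay, then 2x, 4x, ... between attempts; after
    `attempts` failures the last exception is re-raised so the caller
    can fall back to another provider.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

In a real system you would catch only the provider's transient error types, not bare `Exception`, so auth or quota errors fail fast.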
### Privacy & Security
- ✓ Encrypt audio in transit
- ✓ Delete audio after processing
- ✓ Implement user consent mechanisms
- ✓ Log access to audio data
- ✓ Comply with data regulations (GDPR, CCPA)
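"Delete audio after processing" is easy to enforce with a context manager that removes the recording even when processing raises. A stdlib-only sketch:

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def ephemeral_audio(data: bytes):
    """Write audio to a temp file for processing, then always delete it.

    The `finally` block guarantees the raw recording does not linger on
    disk, even if transcription fails partway through.
    """
    fd, path = tempfile.mkstemp(suffix=".wav")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        yield path
    finally:
        if os.path.exists(path):
            os.remove(path)
```

Usage: `with ephemeral_audio(recording) as path: transcribe(path)`.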
## Common Challenges & Solutions
### Challenge: Accents and Dialects
**Solutions**:
- Use multilingual models
- Fine-tune on regional data
- Implement language detection
- Use domain-specific vocabularies
### Challenge: Background Noise
**Solutions**:
- Implement noise filtering
- Use beamforming techniques
- Pre-process audio with noise removal
- Deploy microphone arrays
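The simplest of these techniques, a noise gate, can be sketched in a few lines over 16-bit PCM. This is a crude stand-in for real noise suppression (spectral subtraction, RNNoise, beamforming), and the threshold is an assumption to calibrate against the room's noise floor:

```python
import array

def noise_gate(pcm: bytes, threshold: int = 300) -> bytes:
    """Mute samples whose amplitude is below the noise floor.

    Low-level hiss is zeroed out; speech-level samples pass through
    untouched. `threshold` should be tuned from a few seconds of
    known silence.
    """
    samples = array.array("h", pcm)
    gated = array.array("h", (s if abs(s) >= threshold else 0
                              for s in samples))
    return gated.tobytes()
```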
### Challenge: Long Audio Files
**Solutions**:
- Implement chunked processing
- Use streaming APIs
- Split into speaker turns
- Implement caching
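Chunked processing can be sketched as splitting long 16-bit mono PCM into overlapping windows. The one-second overlap (an illustrative choice) keeps words that straddle a boundary intact in at least one chunk, so merged transcripts do not drop words:

```python
def chunk_audio(pcm: bytes, sample_rate=16000, chunk_secs=30, overlap_secs=1):
    """Split long 16-bit mono PCM into overlapping fixed-size chunks.

    Consecutive chunks share `overlap_secs` of audio; the final chunk
    may be shorter than `chunk_secs`.
    """
    bytes_per_sec = sample_rate * 2  # 2 bytes per 16-bit sample
    size = chunk_secs * bytes_per_sec
    step = (chunk_secs - overlap_secs) * bytes_per_sec
    return [pcm[i:i + size] for i in range(0, len(pcm), step)]
```

Each chunk can then be sent to the STT provider independently (and in parallel), with the overlapping regions deduplicated when stitching transcripts back together.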
## Frameworks & Libraries
### Speech Recognition
- OpenAI Whisper
- Google Cloud Speech-to-Text
- Azure Speech Services
- AssemblyAI
- DeepSpeech
### Text-to-Speech
- Google Cloud Text-to-Speech
- OpenAI TTS
- Azure Text-to-Speech
- Eleven Labs
- Tacotron 2
## Getting Started
1. Choose STT and TTS providers
2. Set up authentication
3. Build basic voice pipeline
4. Add conversation management
5. Implement error handling
6. Test with real users
7. Monitor and optimize latency