azure-ai-voicelive-skill skill

/.github/skills/azure-ai-voicelive-skill

This skill enables real-time bidirectional audio with Azure AI Voice Live in Python, powering voice assistants and live speech interfaces.

npx playbooks add skill microsoft/skills --skill azure-ai-voicelive-skill

---
name: azure-ai-voicelive
description: Build real-time voice AI applications using Azure AI Voice Live SDK (azure-ai-voicelive). Use this skill when creating Python applications that need real-time bidirectional audio communication with Azure AI, including voice assistants, voice-enabled chatbots, real-time speech-to-speech translation, voice-driven avatars, or any WebSocket-based audio streaming with AI models. Supports Server VAD (Voice Activity Detection), turn-based conversation, function calling, MCP tools, avatar integration, and transcription.
---

# Azure AI Voice Live SDK

Build real-time voice AI applications with bidirectional WebSocket communication.

## Installation

```bash
pip install azure-ai-voicelive aiohttp
```

## Quick Start

```python
import asyncio
from azure.ai.voicelive.aio import connect
from azure.core.credentials import AzureKeyCredential

async def main():
    async with connect(
        endpoint="https://<region>.api.cognitive.microsoft.com",
        credential=AzureKeyCredential("<your-api-key>"),
        model="gpt-4o-realtime-preview"
    ) as conn:
        # Update session with instructions
        await conn.session.update(session={
            "instructions": "You are a helpful assistant.",
            "modalities": ["text", "audio"],
            "voice": "alloy"
        })
        
        # Listen for events
        async for event in conn:
            print(f"Event: {event.type}")
            if event.type == "response.audio_transcript.done":
                print(f"Transcript: {event.transcript}")
            elif event.type == "response.done":
                break

asyncio.run(main())
```

## Core Architecture

### Connection Setup

```python
from azure.ai.voicelive.aio import connect
from azure.core.credentials import AzureKeyCredential
from azure.identity.aio import DefaultAzureCredential

# API Key auth
async with connect(
    endpoint="https://<region>.api.cognitive.microsoft.com",
    credential=AzureKeyCredential("<key>"),
    model="gpt-4o-realtime-preview"
) as conn:
    ...

# Microsoft Entra ID (formerly Azure AD) auth; DefaultAzureCredential comes from the azure-identity package
async with connect(
    endpoint="https://<region>.api.cognitive.microsoft.com",
    credential=DefaultAzureCredential(),
    model="gpt-4o-realtime-preview",
    credential_scopes=["https://cognitiveservices.azure.com/.default"]
) as conn:
    ...
```

### Connection Resources

The `VoiceLiveConnection` exposes these resources:

| Resource | Purpose | Key Methods |
|----------|---------|-------------|
| `conn.session` | Session configuration | `update(session=...)` |
| `conn.response` | Model responses | `create()`, `cancel()` |
| `conn.input_audio_buffer` | Audio input | `append()`, `commit()`, `clear()` |
| `conn.output_audio_buffer` | Audio output | `clear()` |
| `conn.conversation` | Conversation state | `item.create()`, `item.delete()`, `item.truncate()` |
| `conn.transcription_session` | Transcription config | `update(session=...)` |

## Session Configuration

```python
from azure.ai.voicelive.models import RequestSession, FunctionTool

await conn.session.update(session=RequestSession(
    instructions="You are a helpful voice assistant.",
    modalities=["text", "audio"],
    voice="alloy",  # or "echo", "shimmer", "sage", etc.
    input_audio_format="pcm16",
    output_audio_format="pcm16",
    turn_detection={
        "type": "server_vad",
        "threshold": 0.5,
        "prefix_padding_ms": 300,
        "silence_duration_ms": 500
    },
    tools=[
        FunctionTool(
            type="function",
            name="get_weather",
            description="Get current weather",
            parameters={
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        )
    ]
))
```

## Audio Streaming

### Send Audio (Base64 PCM16)

```python
import base64

# Read audio chunk (16-bit PCM, 24kHz mono)
audio_chunk = await read_audio_from_microphone()
b64_audio = base64.b64encode(audio_chunk).decode()

await conn.input_audio_buffer.append(audio=b64_audio)
```
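
Long recordings usually need to be split before they are appended. The sketch below is one way to cut raw PCM16 bytes into fixed-duration, base64-encoded chunks using only the standard library. The helper name `chunk_pcm16_base64` and the 20 ms default are illustrative assumptions, not part of the SDK; each yielded string is meant to be passed as the `audio` argument shown above.

```python
import base64
from typing import Iterator

BYTES_PER_SAMPLE = 2  # 16-bit PCM, mono


def chunk_pcm16_base64(
    pcm: bytes, sample_rate: int = 24000, chunk_ms: int = 20
) -> Iterator[str]:
    """Yield base64 strings, each holding up to chunk_ms of mono PCM16 audio.

    Hypothetical helper for illustration; not provided by azure-ai-voicelive.
    """
    chunk_bytes = sample_rate * chunk_ms // 1000 * BYTES_PER_SAMPLE
    if chunk_bytes <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(pcm), chunk_bytes):
        yield base64.b64encode(pcm[start:start + chunk_bytes]).decode("ascii")


# One second of silence at 24 kHz mono is 48,000 bytes, i.e. 50 chunks of 20 ms.
silence = bytes(24000 * BYTES_PER_SAMPLE)
chunks = list(chunk_pcm16_base64(silence))
```

Smaller chunks lower latency at the cost of more messages; the right size depends on your capture pipeline.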

### Receive Audio

```python
import base64

async for event in conn:
    if event.type == "response.audio.delta":
        audio_bytes = base64.b64decode(event.delta)
        await play_audio(audio_bytes)
    elif event.type == "response.audio.done":
        print("Audio complete")
```
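
When debugging, it can help to save a response to disk instead of playing it. The following sketch collects base64 deltas and wraps the decoded PCM in a WAV container with the standard `wave` module. The function name `deltas_to_wav` is hypothetical, and the code assumes mono 16-bit audio at 24 kHz, matching the default `pcm16` format.

```python
import base64
import io
import wave


def deltas_to_wav(deltas: list[str], sample_rate: int = 24000) -> bytes:
    """Decode base64 PCM16 deltas and return them as WAV file bytes.

    Hypothetical helper; assumes mono, 16-bit little-endian samples.
    """
    pcm = b"".join(base64.b64decode(d) for d in deltas)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


# Two fake deltas of 4 bytes each (2 samples per delta).
fake_deltas = [base64.b64encode(bytes(4)).decode("ascii")] * 2
wav_bytes = deltas_to_wav(fake_deltas)
```

In a real event loop you would append each `event.delta` to a list and call the helper once the response's audio is complete.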

## Event Handling

```python
import base64
import json

async for event in conn:
    match event.type:
        # Session events
        case "session.created":
            print(f"Session: {event.session}")
        case "session.updated":
            print("Session updated")
        
        # Audio input events
        case "input_audio_buffer.speech_started":
            print(f"Speech started at {event.audio_start_ms}ms")
        case "input_audio_buffer.speech_stopped":
            print(f"Speech stopped at {event.audio_end_ms}ms")
        
        # Transcription events
        case "conversation.item.input_audio_transcription.completed":
            print(f"User said: {event.transcript}")
        case "conversation.item.input_audio_transcription.delta":
            print(f"Partial: {event.delta}")
        
        # Response events
        case "response.created":
            print(f"Response started: {event.response.id}")
        case "response.audio_transcript.delta":
            print(event.delta, end="", flush=True)
        case "response.audio.delta":
            audio = base64.b64decode(event.delta)
        case "response.done":
            print(f"Response complete: {event.response.status}")
        
        # Function calls
        case "response.function_call_arguments.done":
            # handle_function is an application-defined dispatcher (not part of the SDK)
            result = handle_function(event.name, event.arguments)
            await conn.conversation.item.create(item={
                "type": "function_call_output",
                "call_id": event.call_id,
                "output": json.dumps(result)
            })
            await conn.response.create()
        
        # Errors
        case "error":
            print(f"Error: {event.error.message}")
```
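
The function-call branch above relies on a `handle_function` helper that the snippet does not define. A minimal sketch of such a dispatcher follows, assuming `event.arguments` arrives as a JSON-encoded string and that tool names match the `name` given in the session's `tools` list. The `TOOLS` registry and the placeholder `get_weather` body are illustrative only.

```python
import json
from typing import Any, Callable


def get_weather(location: str) -> dict[str, Any]:
    # Placeholder data; a real tool would call a weather service here.
    return {"location": location, "forecast": "unknown"}


# Hypothetical registry mapping tool names to local callables.
TOOLS: dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def handle_function(name: str, arguments: str) -> dict[str, Any]:
    """Run the named tool with JSON-encoded arguments.

    Always returns a JSON-serialisable dict so the caller can json.dumps it.
    """
    tool = TOOLS.get(name)
    if tool is None:
        return {"error": f"unknown function: {name}"}
    try:
        kwargs = json.loads(arguments or "{}")
        return {"result": tool(**kwargs)}
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON, non-object payloads, or wrong parameter names.
        return {"error": f"bad arguments: {exc}"}


ok = handle_function("get_weather", '{"location": "Oslo"}')
missing = handle_function("get_time", "{}")
malformed = handle_function("get_weather", "not json")
```

Returning an error dict instead of raising keeps the conversation going: the model receives the failure as the function output and can respond accordingly.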

## Common Patterns

### Manual Turn Mode (No VAD)

```python
await conn.session.update(session={"turn_detection": None})

# Manually control turns
await conn.input_audio_buffer.append(audio=b64_audio)
await conn.input_audio_buffer.commit()  # End of user turn
await conn.response.create()  # Trigger response
```

### Interrupt Handling

```python
async for event in conn:
    if event.type == "input_audio_buffer.speech_started":
        # User interrupted - cancel current response
        await conn.response.cancel()
        await conn.output_audio_buffer.clear()
```
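
Cancelling the response stops the server from generating more audio, but chunks already received may still be waiting in a local playback buffer. One possible client-side arrangement, sketched here with `asyncio.Queue`, is to flush that buffer at the same moment. The `PlaybackQueue` class is a hypothetical helper, not an SDK type.

```python
import asyncio


class PlaybackQueue:
    """Holds decoded audio chunks until a player task consumes them."""

    def __init__(self) -> None:
        self._queue = asyncio.Queue()

    def push(self, chunk: bytes) -> None:
        self._queue.put_nowait(chunk)

    async def pop(self) -> bytes:
        return await self._queue.get()

    def flush(self) -> int:
        """Drop everything not yet played; return the number of chunks dropped."""
        dropped = 0
        while True:
            try:
                self._queue.get_nowait()
            except asyncio.QueueEmpty:
                return dropped
            dropped += 1


async def demo() -> tuple[int, bytes]:
    playback = PlaybackQueue()
    for i in range(3):
        playback.push(bytes([i]))      # audio from the interrupted response
    dropped = playback.flush()         # user started speaking
    playback.push(b"\x09")             # first chunk of the next response
    return dropped, await playback.pop()


result = asyncio.run(demo())
```

In the interrupt handler, `playback.flush()` would sit alongside the `cancel()` and `clear()` calls so stale audio never reaches the speaker.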

### Conversation History

```python
# Add system message
await conn.conversation.item.create(item={
    "type": "message",
    "role": "system",
    "content": [{"type": "input_text", "text": "Be concise."}]
})

# Add user message
await conn.conversation.item.create(item={
    "type": "message",
    "role": "user", 
    "content": [{"type": "input_text", "text": "Hello!"}]
})

await conn.response.create()
```

## Voice Options

| Voice | Description |
|-------|-------------|
| `alloy` | Neutral, balanced |
| `echo` | Warm, conversational |
| `shimmer` | Clear, professional |
| `sage` | Calm, authoritative |
| `coral` | Friendly, upbeat |
| `ash` | Deep, measured |
| `ballad` | Expressive |
| `verse` | Storytelling |

Azure voices: Use `AzureStandardVoice`, `AzureCustomVoice`, or `AzurePersonalVoice` models.

## Audio Formats

| Format | Sample Rate | Use Case |
|--------|-------------|----------|
| `pcm16` | 24kHz | Default, high quality |
| `pcm16-8000hz` | 8kHz | Telephony |
| `pcm16-16000hz` | 16kHz | Voice assistants |
| `g711_ulaw` | 8kHz | Telephony (US) |
| `g711_alaw` | 8kHz | Telephony (EU) |
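
To size buffers for a given format, it helps to know how many bytes a stretch of audio occupies. The sketch below derives that from the table, assuming mono audio, 2 bytes per sample for the `pcm16` variants, and 1 byte per sample for G.711. The `FORMATS` mapping and both function names are illustrative.

```python
# (sample_rate_hz, bytes_per_sample) for each format; mono is assumed.
FORMATS = {
    "pcm16": (24000, 2),
    "pcm16-8000hz": (8000, 2),
    "pcm16-16000hz": (16000, 2),
    "g711_ulaw": (8000, 1),
    "g711_alaw": (8000, 1),
}


def bytes_for(fmt: str, duration_ms: int) -> int:
    """Number of raw audio bytes needed for duration_ms of audio in fmt."""
    sample_rate, bytes_per_sample = FORMATS[fmt]
    return sample_rate * duration_ms // 1000 * bytes_per_sample


def duration_ms_of(fmt: str, num_bytes: int) -> float:
    """Playback duration in milliseconds of num_bytes of raw audio in fmt."""
    sample_rate, bytes_per_sample = FORMATS[fmt]
    return num_bytes / (sample_rate * bytes_per_sample) * 1000


pcm16_100ms = bytes_for("pcm16", 100)
ulaw_20ms = bytes_for("g711_ulaw", 20)
```

These sizes refer to raw bytes before base64 encoding, which adds roughly one third on the wire.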

## Turn Detection Options

```python
# Server VAD (default)
{"type": "server_vad", "threshold": 0.5, "silence_duration_ms": 500}

# Azure Semantic VAD (smarter detection)
{"type": "azure_semantic_vad"}
{"type": "azure_semantic_vad_en"}  # English optimized
{"type": "azure_semantic_vad_multilingual"}
```

## Error Handling

```python
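# Note: importing ConnectionError here shadows Python's built-in exception of the same name
from azure.ai.voicelive.aio import connect, ConnectionError, ConnectionClosed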

try:
    async with connect(...) as conn:
        async for event in conn:
            if event.type == "error":
                print(f"API Error: {event.error.code} - {event.error.message}")
except ConnectionClosed as e:
    print(f"Connection closed: {e.code} - {e.reason}")
except ConnectionError as e:
    print(f"Connection error: {e}")
```
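
Long-lived voice sessions may drop for transient reasons, and some applications choose to reconnect. The following is a generic retry loop with exponential backoff and jitter, offered as a sketch rather than SDK behavior. `run_with_retry` and its parameters are hypothetical; `session` stands for any coroutine function that opens a connection and processes events until it finishes.

```python
import asyncio
import random
from typing import Awaitable, Callable


async def run_with_retry(
    session: Callable[[], Awaitable[None]],
    retry_on: tuple[type[BaseException], ...],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> int:
    """Run session() until it returns normally; return the attempts used.

    Exceptions listed in retry_on trigger a retry; the last one is re-raised.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        attempt += 1
        try:
            await session()
            return attempt
        except retry_on:
            if attempt >= max_attempts:
                raise
            # Exponential backoff capped at max_delay, with jitter.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            await sleep(delay * random.uniform(0.5, 1.0))


async def demo() -> int:
    calls = 0

    async def flaky_session() -> None:
        nonlocal calls
        calls += 1
        if calls < 3:
            raise OSError("simulated drop")  # stand-in for a connection failure

    async def no_sleep(_: float) -> None:
        return None  # skip real waiting in the demo

    return await run_with_retry(flaky_session, retry_on=(OSError,), sleep=no_sleep)


attempts = asyncio.run(demo())
```

With the SDK, `retry_on` would plausibly be `(ConnectionClosed, ConnectionError)` from the import above. Whether a fresh connection should replay earlier conversation items is an application decision that this sketch does not address.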

## References

- **Detailed API Reference**: See [references/api-reference.md](references/api-reference.md)
- **Complete Examples**: See [references/examples.md](references/examples.md)
- **All Models & Types**: See [references/models.md](references/models.md)

Overview

This skill enables building real-time voice AI applications using the Azure AI Voice Live SDK for Python. It provides bidirectional WebSocket audio streaming, session and conversation management, server-side VAD and manual turn control, function calling, transcription, and avatar integration. Use it to create voice assistants, live translation, voice-driven avatars, and interactive voice chatbots.

How this skill works

The skill opens a persistent WebSocket connection to Azure AI and exposes a VoiceLiveConnection with resources for session configuration, input/output audio buffers, responses, conversation state, and transcription sessions. It streams base64-encoded PCM audio chunks to the input buffer, receives incremental audio and transcript events, and drives responses with create/cancel control. Session settings control modalities, voice selection, audio formats, and turn detection (server VAD or manual).

When to use it

- Building a low-latency voice assistant or voice-enabled chatbot.
- Implementing real-time speech-to-speech translation or voice relay.
- Creating voice-driven avatars or live character narration.
- Handling bidirectional audio over WebSocket with incremental transcripts.
- Integrating function calls (tools) and structured outputs in audio interactions.

Best practices

- Use server_vad or Azure semantic VAD to automate turn detection and avoid manual commit calls.
- Encode audio as base64 PCM16 at the configured sample rate; match input_audio_format and output_audio_format to your audio pipeline.
- Append audio in small chunks. In manual mode, commit at the end of each user turn; in VAD mode, watch the speech_started and speech_stopped events to manage interruptions.
- Keep conversation history concise; add system messages for consistent assistant behavior and truncate older items when needed.
- On interruption, call response.cancel() and output_audio_buffer.clear() so that stale audio does not overlap the next response.

Example use cases

- Real-time voice assistant on a web or mobile client using microphone capture and audio playback.
- Multilingual speech-to-speech translator that transcribes, translates, and streams synthesized voice in near real time.
- Interactive voice avatar that drives facial/viseme animations from response.audio.delta events.
- Contact-center agent augmentation that transcribes calls and triggers function tools for data lookup.
- Voice-enabled IoT device using telephony-friendly audio formats (g711_ulaw, g711_alaw).

FAQ

What authentication methods are supported?

API Key and Azure AD credentials are supported via AzureKeyCredential and DefaultAzureCredential with appropriate scopes.

How do I detect end of user speech?

Use server_vad or Azure semantic VAD for automatic turn detection, or set turn_detection to None (null in JSON) and call input_audio_buffer.commit() to mark the end of the user turn manually.