
alicloud-ai-audio-tts-realtime skill

/skills/cinience/alicloud-ai-audio-tts-realtime

This skill provides real-time Alibaba Cloud TTS with low-latency streaming for interactive applications, using the Qwen TTS Realtime models.

npx playbooks add skill openclaw/skills --skill alicloud-ai-audio-tts-realtime

Review the files below or copy the command above to add this skill to your agents.

Files (5)
SKILL.md
2.1 KB
---
name: alicloud-ai-audio-tts-realtime
description: Real-time speech synthesis with Alibaba Cloud Model Studio Qwen TTS Realtime models. Use when low-latency interactive speech is required, including instruction-controlled realtime synthesis.
---

Category: provider

# Model Studio Qwen TTS Realtime

Use realtime TTS models for low-latency streaming speech output.

## Critical model names

Use one of these exact model strings:
- `qwen3-tts-flash-realtime`
- `qwen3-tts-instruct-flash-realtime`
- `qwen3-tts-instruct-flash-realtime-2026-01-22`
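Because the realtime endpoints only accept these exact strings, it can help to validate model names before making a call. A minimal sketch (the helper name is illustrative, not part of the SDK):

```python
# The set below mirrors the supported model strings listed above.
KNOWN_REALTIME_MODELS = {
    "qwen3-tts-flash-realtime",
    "qwen3-tts-instruct-flash-realtime",
    "qwen3-tts-instruct-flash-realtime-2026-01-22",
}

def is_known_realtime_model(name: str) -> bool:
    """Return True only for the exact realtime model strings this skill supports."""
    return name in KNOWN_REALTIME_MODELS
```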

## Prerequisites

- Install SDK in a virtual environment:

```bash
python3 -m venv .venv
. .venv/bin/activate
python -m pip install dashscope
```
- Set `DASHSCOPE_API_KEY` in your environment, or add `dashscope_api_key` to `~/.alibabacloud/credentials`.
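The two credential sources can be resolved in the order given above. A sketch of that lookup, assuming the credentials file is INI-style (the exact file format is not specified by this skill, and the function name is illustrative):

```python
import os
import configparser
from pathlib import Path
from typing import Optional

def resolve_dashscope_api_key(
    credentials_path: str = "~/.alibabacloud/credentials",
) -> Optional[str]:
    """Prefer DASHSCOPE_API_KEY from the environment, then fall back to the
    credentials file. Assumes an INI-style file with a dashscope_api_key option."""
    key = os.environ.get("DASHSCOPE_API_KEY")
    if key:
        return key
    path = Path(credentials_path).expanduser()
    if path.exists():
        parser = configparser.ConfigParser()
        parser.read(path)
        for section in parser.sections():
            if parser.has_option(section, "dashscope_api_key"):
                return parser.get(section, "dashscope_api_key")
    return None
```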

## Normalized interface (tts.realtime)

### Request
- `text` (string, required)
- `voice` (string, required)
- `instruction` (string, optional)
- `sample_rate` (int, optional)

### Response
- `audio_base64_pcm_chunks` (array<string>)
- `sample_rate` (int)
- `finish_reason` (string)
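The request and response shapes above can be sketched as dataclasses (field names come from the interface; the class names and defaults are illustrative):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RealtimeTtsRequest:
    text: str                           # required
    voice: str                          # required
    instruction: Optional[str] = None   # only meaningful for instruct models
    sample_rate: Optional[int] = None   # server default applies when omitted

@dataclass
class RealtimeTtsResponse:
    audio_base64_pcm_chunks: List[str] = field(default_factory=list)
    sample_rate: int = 0
    finish_reason: str = ""
```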

## Operational guidance

- Use a WebSocket or streaming endpoint for realtime mode.
- Keep each utterance short for lower latency.
- For instruction models, keep instruction explicit and concise.
- Some SDK/runtime combinations may reject realtime model calls over `MultiModalConversation`; use the probe script below to verify compatibility.

## Local demo script

Use the probe script to verify realtime compatibility with your current SDK/runtime, and optionally fall back to a non-realtime model for immediate output:

```bash
.venv/bin/python skills/ai/audio/alicloud-ai-audio-tts-realtime/scripts/realtime_tts_demo.py \
  --text "这是一个 realtime 语音演示。" \
  --fallback \
  --output output/ai-audio-tts-realtime/audio/fallback-demo.wav
```

Strict mode (for CI / gating):

```bash
.venv/bin/python skills/ai/audio/alicloud-ai-audio-tts-realtime/scripts/realtime_tts_demo.py \
  --text "realtime health check" \
  --strict
```

## Output location

- Default output: `output/ai-audio-tts-realtime/audio/`
- Override base dir with `OUTPUT_DIR`.
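A minimal sketch of how a script might resolve that location (assuming `OUTPUT_DIR` replaces the default base directory entirely; the function name is illustrative):

```python
import os
from pathlib import Path

def resolve_output_dir() -> Path:
    """Honour OUTPUT_DIR when set, otherwise use the skill default."""
    return Path(os.environ.get("OUTPUT_DIR", "output/ai-audio-tts-realtime/audio"))
```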

## References

- `references/sources.md`

Overview

This skill provides real-time speech synthesis using Alibaba Cloud Model Studio Qwen TTS Realtime models for low-latency interactive voice output. It supports instruction-driven synthesis, streaming over websockets, and fallbacks to non-realtime models when needed. It is designed for short utterances and interactive scenarios where immediate audio feedback matters.

How this skill works

The skill connects to the Qwen realtime TTS models via a streaming/websocket endpoint and sends short text utterances with an optional instruction field. It returns incremental base64-encoded PCM audio chunks and a sample rate, allowing the client to play audio as it arrives. A probe/demo script verifies runtime compatibility and can optionally fall back to non-realtime synthesis.
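On the client side, the incremental chunks described above can be decoded and written to a WAV file with the standard library. A sketch, assuming 16-bit mono PCM (a common format for these models, but not confirmed by this skill):

```python
import base64
import wave

def pcm_chunks_to_wav(chunks, sample_rate, path, channels=1, sample_width=2):
    """Decode base64-encoded PCM chunks and append them to a WAV file in
    arrival order. Assumes 16-bit (sample_width=2) mono audio."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(base64.b64decode(chunk))
```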

When to use it

  • Interactive voice assistants that require sub-second response audio
  • Live narration for games, demos, or guided workflows
  • Instruction-controlled synthesis where tone or behavior must be specified
  • Low-latency customer support or conversational IVR systems
  • Health checks or CI gating for realtime TTS capability

Best practices

  • Keep each utterance short to minimize latency and improve stream responsiveness
  • Provide concise, explicit instructions when using an instruction-capable model
  • Use the probe script to verify SDK/runtime compatibility before production use
  • Prefer websocket/streaming endpoints rather than polling for realtime performance
  • Include fallback logic to a non-realtime model for environments that reject realtime calls

Example use cases

  • Real-time conversational agent that speaks responses as they are generated
  • Live demo application that plays synthesized speech with minimal delay
  • Interactive instructional system where voice behavior is controlled via instructions
  • CI health check that runs a strict realtime TTS probe to validate deployment

FAQ

Which exact model names are supported?

Use one of: qwen3-tts-flash-realtime, qwen3-tts-instruct-flash-realtime, or qwen3-tts-instruct-flash-realtime-2026-01-22.

What SDK setup is required?

Install the dashscope SDK in a virtual environment and set DASHSCOPE_API_KEY in your environment or add dashscope_api_key to ~/.alibabacloud/credentials.