This skill performs ASR transcription of incoming audio and generates TTS voice replies using Qwen models, returning plain text and `.ogg` voice notes for Telegram audio messaging.
Install with:

```bash
npx playbooks add skill ada20204/qwen-voice --skill qwen-voice
```
---
name: qwen-voice
description: "Use Qwen (DashScope/百炼) for speech tasks: (1) ASR speech-to-text transcription of user audio/voice messages (Telegram .ogg opus, wav, mp3) using qwen3-asr-flash, optionally with coarse timestamps via chunking; (2) TTS text-to-speech voice reply using qwen3-tts-flash with selectable voice (default Cherry) and output as .ogg voice note for Telegram."
---
# Qwen Voice (ASR + TTS)
Use the bundled scripts. Prefer environment variable `DASHSCOPE_API_KEY`. If missing, scripts attempt to read it from `~/.bashrc`.
## ASR (speech → text)
### Non-timestamp (default)
```bash
python3 skills/qwen-voice/scripts/qwen_asr.py --in /path/to/audio.ogg
```
### With timestamps (chunk-based)
```bash
python3 skills/qwen-voice/scripts/qwen_asr.py --in /path/to/audio.ogg --timestamps --chunk-sec 3
```
Notes:
- Timestamps are generated by fixed-length chunking (not word-level alignment).
- Input audio is converted to mono 16kHz WAV before sending.
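For intuition, the coarse timestamps produced by `--timestamps --chunk-sec 3` amount to fixed-length segment boundaries over the audio duration, conceptually like this (a sketch, not the script's actual code):

```python
def chunk_timestamps(duration_sec: float, chunk_sec: float = 3.0):
    """Yield (start, end) second boundaries for fixed-length chunks covering the audio."""
    start = 0.0
    while start < duration_sec:
        end = min(start + chunk_sec, duration_sec)
        yield (start, end)
        start = end

# A 7-second clip with 3-second chunks:
print(list(chunk_timestamps(7.0, 3.0)))  # [(0.0, 3.0), (3.0, 6.0), (6.0, 7.0)]
```

Each chunk is transcribed separately and labeled with its boundaries, which is why the resulting timestamps are coarse segment markers rather than word-level alignments.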
## TTS (text → speech)
### Preset voice (default: Cherry)
```bash
python3 skills/qwen-voice/scripts/qwen_tts.py --text '你好,我是 Pi。' --voice Cherry --out /tmp/out.ogg
```
### Clone voice (create once, reuse)
1) Create a voice profile from a sample audio:
```bash
python3 skills/qwen-voice/scripts/qwen_voice_clone.py --in ./voice_sample.ogg --name george --out work/qwen-voice/george.voice.json
```
2) Use the cloned voice to synthesize:
```bash
python3 skills/qwen-voice/scripts/qwen_tts.py --text '你好,我是 George。' --voice-profile work/qwen-voice/george.voice.json --out /tmp/out.ogg
```
Notes:
- `.ogg` output is Opus, suitable for Telegram voice messages.
- Voice cloning uses DashScope customization endpoint + Qwen realtime TTS model.
- Scripts use a local venv at `work/venv-dashscope` (auto-created on first run).
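The venv bootstrap mentioned in the last note can be pictured roughly as follows (a sketch; the scripts' actual bootstrap logic and installed packages are not shown in this document):

```python
import os
import venv

def ensure_venv(venv_dir: str = "work/venv-dashscope") -> str:
    """Create the venv on first run; return the interpreter path inside it.

    After creation, a bootstrap step would install the SDK into it, e.g.:
        <venv>/bin/python -m pip install dashscope
    """
    python = os.path.join(venv_dir, "bin", "python")
    if not os.path.isdir(venv_dir):
        venv.create(venv_dir, with_pip=True)  # first run only
    return python
```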
## Typical chat workflow
- When the user sends a voice message or audio file: run ASR and reply with the transcribed text.
- When the user explicitly asks for a voice reply: run TTS and send the generated `.ogg` as a voice note.
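Glue logic for the workflow above might look like this (the routing helper and the `/tmp/reply.ogg` path are illustrative; only the script invocations mirror the commands shown earlier):

```python
import subprocess

def handle_message(audio_path=None, wants_voice_reply=False, reply_text=""):
    """Route a chat event: incoming audio -> ASR text; voice request -> TTS .ogg."""
    if audio_path:
        # User sent audio: transcribe it and reply with text.
        out = subprocess.run(
            ["python3", "skills/qwen-voice/scripts/qwen_asr.py", "--in", audio_path],
            capture_output=True, text=True, check=True,
        )
        return ("text", out.stdout.strip())
    if wants_voice_reply:
        # User asked for a voice reply: synthesize and send the .ogg voice note.
        subprocess.run(
            ["python3", "skills/qwen-voice/scripts/qwen_tts.py",
             "--text", reply_text, "--out", "/tmp/reply.ogg"],
            check=True,
        )
        return ("voice", "/tmp/reply.ogg")
    return ("text", reply_text)
```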
## How it works
For ASR, the skill converts input audio to mono 16 kHz WAV and sends it to the qwen3-asr-flash model; when enabled, coarse timestamps are produced by fixed-length chunking. For TTS, text is sent to qwen3-tts-flash and returned as an Opus `.ogg` file (default voice: Cherry), suitable for Telegram voice notes. Voice cloning creates a reusable voice profile that can then be passed to the TTS script for personalized synthesis.
## FAQ
**How do I provide the API key?**
Set the `DASHSCOPE_API_KEY` environment variable. If it is not set, the scripts fall back to reading it from `~/.bashrc`.

**Are timestamps word-accurate?**
No. Timestamps come from fixed-length chunking and give coarse segment boundaries, not word-level alignment.

**What audio format does TTS produce?**
Opus `.ogg` files suitable for Telegram voice notes. Use the default Cherry voice or supply a cloned voice profile.