qwen-voice skill

This skill performs ASR transcription and TTS voice replies using Qwen-voice for Telegram audio messaging, returning text and .ogg voice notes.

npx playbooks add skill ada20204/qwen-voice --skill qwen-voice

SKILL.md

---
name: qwen-voice
description: "Use Qwen (DashScope/百炼) for speech tasks: (1) ASR speech-to-text transcription of user audio/voice messages (Telegram .ogg opus, wav, mp3) using qwen3-asr-flash, optionally with coarse timestamps via chunking; (2) TTS text-to-speech voice reply using qwen3-tts-flash with selectable voice (default Cherry) and output as .ogg voice note for Telegram."
---

# Qwen Voice (ASR + TTS)

Use the bundled scripts. Set the API key in the `DASHSCOPE_API_KEY` environment variable; if it is missing, the scripts attempt to read it from `~/.bashrc`.
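
The lookup order described above (environment first, then `~/.bashrc`) could look roughly like the sketch below. This is not the bundled scripts' code: the function name `resolve_api_key`, the regular expression, and the "last assignment wins" rule are assumptions made for illustration.

```python
import os
import re
from pathlib import Path
from typing import Optional

# Assumed line shape: export DASHSCOPE_API_KEY="sk-..." (quotes and "export" optional).
_EXPORT_RE = re.compile(
    r"""^\s*(?:export\s+)?DASHSCOPE_API_KEY=(['"]?)(.+?)\1\s*$"""
)


def resolve_api_key(bashrc: Optional[Path] = None) -> Optional[str]:
    """Hypothetical helper: return the key from the environment, else from bashrc."""
    key = os.environ.get("DASHSCOPE_API_KEY")
    if key:
        return key

    path = bashrc if bashrc is not None else Path.home() / ".bashrc"
    try:
        lines = path.read_text(encoding="utf-8", errors="ignore").splitlines()
    except OSError:
        return None  # no readable file, so no key

    found = None
    for line in lines:
        match = _EXPORT_RE.match(line)
        if match:
            found = match.group(2)  # a later assignment overrides an earlier one
    return found
```

Commented-out lines are skipped because the pattern is anchored at the start of the line. A real shell would also expand variables and strip trailing comments, which this sketch does not attempt.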

## ASR (speech → text)

### Non-timestamp (default)

```bash
python3 skills/qwen-voice/scripts/qwen_asr.py --in /path/to/audio.ogg
```

### With timestamps (chunk-based)

```bash
python3 skills/qwen-voice/scripts/qwen_asr.py --in /path/to/audio.ogg --timestamps --chunk-sec 3
```

Notes:
- Timestamps are generated by fixed-length chunking (not word-level alignment).
- Input audio is converted to mono 16kHz WAV before sending.
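
To make the fixed-length chunking concrete, here is a minimal sketch of how chunk boundaries and labelled transcript lines could be derived. It assumes each chunk is transcribed separately and then tagged with its window; `chunk_windows` and `label_segments` are hypothetical names, not functions from `qwen_asr.py`.

```python
import math
from typing import List, Tuple


def chunk_windows(duration_sec: float, chunk_sec: float) -> List[Tuple[float, float]]:
    """Split [0, duration_sec) into consecutive windows of at most chunk_sec."""
    if chunk_sec <= 0:
        raise ValueError("chunk_sec must be positive")
    count = math.ceil(duration_sec / chunk_sec)
    # Compute each start from the index to avoid accumulating float error.
    return [
        (i * chunk_sec, min((i + 1) * chunk_sec, duration_sec))
        for i in range(count)
    ]


def _mmss(seconds: float) -> str:
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def label_segments(windows: List[Tuple[float, float]], texts: List[str]) -> List[str]:
    """Pair each window with the text transcribed from that chunk."""
    return [
        f"[{_mmss(start)}-{_mmss(end)}] {text}"
        for (start, end), text in zip(windows, texts)
    ]
```

With `--chunk-sec 3`, a 7.5-second clip yields the windows 0–3 s, 3–6 s, and 6–7.5 s. A word that straddles a boundary may be split between two chunks, which is why the timestamps are only coarse.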

## TTS (text → speech)

### Preset voice (default: Cherry)

```bash
python3 skills/qwen-voice/scripts/qwen_tts.py --text '你好，我是 Pi。' --voice Cherry --out /tmp/out.ogg
```

### Clone voice (create once, reuse)

1) Create a voice profile from a sample audio:

```bash
python3 skills/qwen-voice/scripts/qwen_voice_clone.py --in ./voice_sample.ogg --name george --out work/qwen-voice/george.voice.json
```

2) Use the cloned voice to synthesize:

```bash
python3 skills/qwen-voice/scripts/qwen_tts.py --text '你好，我是 George。' --voice-profile work/qwen-voice/george.voice.json --out /tmp/out.ogg
```

Notes:
- `.ogg` output is Opus, suitable for Telegram voice messages.
- Voice cloning uses the DashScope customization endpoint together with a Qwen realtime TTS model.
- Scripts use a local venv at `work/venv-dashscope` (auto-created on first run).
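
The "auto-created on first run" behaviour for the venv can be approximated with the standard-library `venv` module. The sketch below is an assumption about how such a step might work, not the scripts' actual logic; `venv_python` and `ensure_venv` are hypothetical helpers, and dependency installation is left out.

```python
import sys
import venv
from pathlib import Path


def venv_python(venv_dir: Path) -> Path:
    """Return the interpreter path inside a venv (the layout differs on Windows)."""
    if sys.platform == "win32":
        return venv_dir / "Scripts" / "python.exe"
    return venv_dir / "bin" / "python"


def ensure_venv(venv_dir: Path, with_pip: bool = True) -> Path:
    """Create the venv only if its interpreter is missing, then return that path."""
    python = venv_python(venv_dir)
    if not python.exists():
        venv.create(venv_dir, with_pip=with_pip)
    return python
```

A caller would then run `ensure_venv(Path("work/venv-dashscope"))` once and use the returned interpreter for later invocations, so repeated runs skip the creation step.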

## Typical chat workflow

- When the user sends a voice message or audio file: run ASR and reply with the transcribed text.
- When the user explicitly asks for a voice reply: run TTS and send the generated `.ogg` as a voice note.
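
The two rules above can be expressed as a small routing function that decides which bundled script to invoke. This is only a sketch: the `Incoming` dataclass and `plan_commands` are hypothetical, and the message fields are assumptions about what a chat handler would know. The script paths and flags are the ones shown earlier in this file.

```python
from dataclasses import dataclass
from typing import List, Optional

ASR_SCRIPT = "skills/qwen-voice/scripts/qwen_asr.py"
TTS_SCRIPT = "skills/qwen-voice/scripts/qwen_tts.py"


@dataclass
class Incoming:
    """Hypothetical view of one chat turn."""
    audio_path: Optional[str] = None   # set when the user sent voice or audio
    reply_text: Optional[str] = None   # text the agent intends to answer with
    wants_voice_reply: bool = False    # True only on an explicit request


def plan_commands(
    msg: Incoming, out_path: str = "/tmp/out.ogg", voice: str = "Cherry"
) -> List[List[str]]:
    """Return the argv lists to run for this turn, in order."""
    commands: List[List[str]] = []
    if msg.audio_path:
        commands.append(["python3", ASR_SCRIPT, "--in", msg.audio_path])
    if msg.wants_voice_reply and msg.reply_text:
        commands.append([
            "python3", TTS_SCRIPT,
            "--text", msg.reply_text,
            "--voice", voice,
            "--out", out_path,
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Building argv lists rather than shell strings avoids quoting problems when the reply text contains quotes or non-ASCII characters.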

Overview

This skill provides ASR and TTS capabilities using Qwen (DashScope) for handling voice in chat flows. It converts user audio into text and can synthesize natural-sounding voice replies as .ogg voice notes suitable for Telegram. The tool supports chunk-based coarse timestamps and optional custom voice cloning for personalized TTS.

How this skill works

For ASR, the skill converts input audio to mono 16 kHz WAV and sends it to the qwen3-asr-flash model; timestamps are produced by fixed-length chunking when enabled. For TTS, text is sent to qwen3-tts-flash and returned as an Opus .ogg file (default voice: Cherry). Voice cloning is supported by creating a reusable voice profile and then synthesizing with that profile.
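
The two format conversions mentioned above (input to mono 16 kHz WAV, output to Opus `.ogg`) are commonly done with `ffmpeg`. The sketch below assumes `ffmpeg` is installed with `libopus` support. The source does not say which tool the scripts use, and the 16-bit PCM codec and 32k bitrate are illustrative choices, not documented settings.

```python
from typing import List


def asr_wav_command(src: str, dst: str) -> List[str]:
    """ffmpeg argv: any input -> mono, 16 kHz, 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",        # overwrite dst if it exists
        "-i", src,
        "-ac", "1",            # one channel (mono)
        "-ar", "16000",        # 16 kHz sample rate
        "-c:a", "pcm_s16le",   # assumed sample format
        dst,
    ]


def telegram_ogg_command(src: str, dst: str, bitrate: str = "32k") -> List[str]:
    """ffmpeg argv: any input -> mono Opus in an Ogg container."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-ac", "1",
        "-c:a", "libopus",
        "-b:a", bitrate,       # assumed bitrate for speech
        dst,
    ]
```

Either list can be executed with `subprocess.run(cmd, check=True)`, for example `asr_wav_command("voice.ogg", "voice.wav")` before transcription.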

When to use it

- Automatically transcribing incoming voice messages or uploaded audio files.
- Providing spoken replies when users request a voice response.
- Generating coarse timestamps for long recordings to quickly locate content.
- Creating a branded or personalized voice via the cloning workflow.
- Preparing Telegram-compatible .ogg voice notes for chat bots or services.

Best practices

- Set DASHSCOPE_API_KEY in the environment so the scripts do not have to fall back to reading it from ~/.bashrc.
- Use chunk-based timestamps only when coarse timing is acceptable; they are not word-level alignments.
- Send clean, reasonably paced audio. Sample rate conversion is handled automatically, but recording quality still affects accuracy.
- Reuse cloned voice profiles for consistent TTS output instead of recloning for every request.
- Keep TTS output as Opus .ogg for seamless Telegram voice note delivery.

Example use cases

- Transcribe a Telegram .ogg voice message to text for moderation, indexing, or reply drafting.
- Reply to a user with a spoken message by synthesizing text to .ogg using the Cherry voice.
- Produce timestamped transcriptions for meeting recordings or lecture segments using chunking.
- Create a custom assistant persona by cloning a brand voice and generating responses in that voice.
- Batch-process uploaded mp3 or wav files to generate searchable transcripts and optional audio responses.

FAQ

How do I provide the API key?

Set the DASHSCOPE_API_KEY environment variable. If it is not set, the scripts attempt to read the key from ~/.bashrc.

Are timestamps word-accurate?

No. Timestamps are produced by fixed-length chunking and provide coarse segment boundaries, not word-level alignment.

What audio format is produced for TTS?

TTS outputs Opus .ogg files suitable for Telegram voice notes. Use the default Cherry voice or supply a cloned voice profile.