vocal-chat skill

This skill enables voice-to-voice WhatsApp conversations by transcribing incoming audio and replying with local TTS audio.

npx playbooks add skill openclaw/skills --skill vocal-chat

Review the files below or copy the command above to add this skill to your agents.

Files (2)

SKILL.md

1.1 KB

---
name: walkie-talkie
description: Handles voice-to-voice conversations on WhatsApp. Automatically transcribes incoming audio and responds with local TTS audio. Use when the user wants to "talk" instead of type.
---

# Walkie-Talkie Mode

This skill automates the voice-to-voice loop on WhatsApp using local transcription and local TTS.

## Workflow

1. **Incoming Audio**: When a user sends an audio/ogg/opus file:
   - Use `tools/transcribe_voice.sh` to get the text.
   - Process the text as a normal user prompt.

2. **Outgoing Response**:
   - In addition to the text reply, generate speech using `bin/sherpa-onnx-tts`.
   - Send the resulting `.ogg` file back to the user as a voice note.

## Triggers

- User sends an audio message.
- User says "activa modo walkie-talkie" (Spanish for "activate walkie-talkie mode") or "hablemos por voz" ("let's talk by voice").

## Constraints

- Use local tools only (ffmpeg, whisper-cpp, sherpa-onnx-tts).
- Maintain a fast response time (RTF < 0.5).
- Always reply with BOTH text (for clarity) and audio.
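
The RTF constraint can be checked with a little arithmetic. The real-time factor is processing time divided by the duration of the audio, so 1.2 s of processing for a 4 s clip gives an RTF of 0.30. The sketch below is illustrative only: the function names `rtf` and `within_budget` are hypothetical and not part of the skill.

```bash
# rtf <processing-seconds> <audio-seconds>
# Prints the real-time factor (processing time / audio duration).
rtf() {
  awk -v p="$1" -v a="$2" 'BEGIN { if (a <= 0) exit 1; printf "%.2f\n", p / a }'
}

# within_budget <processing-seconds> <audio-seconds> [limit]
# Succeeds when the RTF is below the limit (default 0.5, taken from the
# constraint above). Both function names are hypothetical helpers.
within_budget() {
  awk -v p="$1" -v a="$2" -v lim="${3:-0.5}" \
    'BEGIN { exit !(a > 0 && p / a < lim) }'
}

rtf 1.2 4.0   # prints 0.30
```

The audio duration can be read with `ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 /tmp/reply.ogg`. How the processing time is measured is left to the caller.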

## Manual Execution (Internal)

To respond with voice manually:
```bash
bin/sherpa-onnx-tts /tmp/reply.ogg "Tu mensaje aquí"
```
Then send `/tmp/reply.ogg` via the `message` tool, passing the path as `filePath`.

Overview

This skill automates voice-to-voice conversations on WhatsApp by transcribing incoming audio locally and replying with locally generated TTS audio. It maintains a conversational loop in which every voice message is converted to text, processed as a prompt, and answered with both a text message and an .ogg voice note. Use it when you prefer talking over typing for fast, natural exchanges.

How this skill works

When an audio/ogg/opus file arrives on WhatsApp, the skill runs a local transcription tool (whisper-cpp via a helper script) to extract the text. The transcribed text is handled as a normal user prompt, and a text reply is generated for clarity. That reply is then converted to speech by a local TTS engine (sherpa-onnx-tts), producing an .ogg file that is sent back alongside the text message.
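
The loop described above could be sketched roughly as follows. The helper interfaces are assumptions: `tools/transcribe_voice.sh` is assumed to take an audio path and print the transcript on stdout, `bin/sherpa-onnx-tts` is called as `<output.ogg> <text>` as in the manual example in SKILL.md, and `generate_reply` is a hypothetical placeholder for the agent's normal prompt handling.

```bash
# Helper locations. Their interfaces are assumptions (see the text above).
TRANSCRIBE="${TRANSCRIBE:-tools/transcribe_voice.sh}"
TTS="${TTS:-bin/sherpa-onnx-tts}"

# Hypothetical placeholder: the agent's usual prompt handling goes here.
generate_reply() {
  printf 'You said: %s' "$1"
}

# handle_voice_note <incoming-audio> <outgoing-ogg>
# Prints the text reply on stdout and writes the spoken reply to <outgoing-ogg>.
handle_voice_note() {
  local in_audio="$1" out_ogg="$2" transcript reply
  transcript="$("$TRANSCRIBE" "$in_audio")" || return 1
  [ -n "$transcript" ] || return 1      # nothing was recognised
  reply="$(generate_reply "$transcript")"
  "$TTS" "$out_ogg" "$reply" || return 1
  printf '%s\n' "$reply"
}
```

A caller would run something like `handle_voice_note /tmp/incoming.ogg /tmp/reply.ogg`, then send the printed text and `/tmp/reply.ogg` together, which satisfies the both-text-and-audio rule.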

When to use it

Holding hands-free exchanges, or exchanges faster than typing allows.
Chatting in noisy or mobile environments where voice is easier.
Conducting short, real-time voice conversations over WhatsApp.
Testing local-only voice processing without cloud services.
Providing accessibility-friendly voice responses for users.

Best practices

Keep voice prompts concise to preserve low response time (RTF < 0.5 where possible; RTF is a ratio of processing time to audio duration, not a number of seconds).
Run transcription and TTS on the same host to avoid transfer delays.
Fall back to text if audio quality or transcription confidence is low.
Always send both text and audio so recipients can read or play the reply.
Monitor CPU/GPU load of ffmpeg, whisper-cpp, and sherpa-onnx-tts to maintain responsiveness.
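
The fall-back-to-text practice needs some test for an unusable transcript. The skill does not specify one, so the sketch below uses assumed heuristics: the transcript is treated as unusable when it is empty, contains only bracketed markers such as `[BLANK_AUDIO]`, or has fewer than `MIN_WORDS` words. The function name and the `MIN_WORDS` variable are hypothetical.

```bash
# needs_text_fallback <transcript>
# Succeeds (exit 0) when the transcript looks unusable, so the reply should be
# text only. The heuristics are assumptions, not part of the skill itself.
needs_text_fallback() {
  local cleaned words
  # Drop bracketed markers like "[BLANK_AUDIO]", then collapse whitespace.
  cleaned="$(printf '%s' "$1" | sed -e 's/\[[^]]*\]//g' | tr -s '[:space:]' ' ')"
  # Count the remaining words (tr strips the padding some wc versions add).
  words="$(printf '%s' "$cleaned" | wc -w | tr -d ' ')"
  [ "$words" -lt "${MIN_WORDS:-1}" ]
}
```

For example, `if needs_text_fallback "$transcript"; then ...` could skip the TTS step and send a short text message asking the user to repeat the voice note.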

Example use cases

Converting incoming WhatsApp voice notes to text and replying with a spoken answer.
Providing a walkie-talkie style interface for rapid team coordination.
Enabling conversational demos where the user says a command and hears the action result.
Offering accessible voice responses for users with visual impairments.
Archiving both the transcribed text and TTS audio for backup or review.

FAQ

Which local tools are required?

ffmpeg for audio handling, whisper-cpp (or equivalent) for transcription, and sherpa-onnx-tts for text-to-speech.

Does the skill always send audio replies?

Yes — it returns both a text message for clarity and an .ogg voice note generated locally.