
This skill enables audio-based interaction with a local voice agent, transcribing speech with local Whisper and replying via AWS Polly for hands-free conversations.

npx playbooks add skill openclaw/skills --skill voice-agent

Review the files below or copy the command above to add this skill to your agents.

Files (4)
SKILL.md
---
name: voice-agent
display-name: AI Voice Agent Backend
version: 1.1.0
description: Local Voice Input/Output for Agents using the AI Voice Agent API.
author: trevisanricardo
homepage: https://github.com/ricardotrevisan/ai-conversational-skill
user-invocable: true
disable-model-invocation: false
---

# Voice Agent

This skill allows you to speak and listen to the user using a local Voice Agent API.
It is client-only and does not start containers or services.
It uses **local Whisper** for Speech-to-Text transcription and **AWS Polly** for Text-to-Speech generation.

## Prerequisite
Requires a running backend API at `http://localhost:8000`.
Backend setup instructions are in this repository:
- `README.md`
- `walkthrough.md`
- `DOCKER_README.md`

## Behavior Guidelines
-   **Audio First**: When the user communicates via audio (files), your PRIMARY mode of response is **Audio File**.
-   **Silent Delivery**: When sending an audio response, **DO NOT** send a text explanation like "I sent an audio". Just send the audio file.
-   **Workflow**:
    1.  User sends audio.
    2.  Use `transcribe` to read it.
    3.  You think of a response.
    4.  Use `synthesize` to generate the audio file.
    5.  You send the file.
    6.  **STOP**. Do not add text commentary.
-   **Failure Handling**: If `health` fails or connection errors occur, do not attempt service management from this skill. Ask the user to start or fix the backend using the repository docs.

## Tools

### Transcribe File
To transcribe an audio file with **local Whisper STT**, run the client script with the `transcribe` command.

```bash
python3 {baseDir}/scripts/client.py transcribe "/path/to/audio/file.ogg"
```

### Synthesize to File
To generate audio from text with **AWS Polly TTS** and save it to a file, run the client script with the `synthesize` command.

```bash
python3 {baseDir}/scripts/client.py synthesize "Text to speak" --output "/path/to/output.mp3"
```

### Health Check
To check if the voice agent API is running and healthy:

```bash
python3 {baseDir}/scripts/client.py health
```
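
The three commands above compose into the audio-first workflow from the Behavior Guidelines. A minimal sketch in Python, assuming the client script path shown in the examples (substitute your actual `{baseDir}`):

```python
import subprocess

BASE_DIR = "/path/to/skill"  # placeholder for {baseDir}

def client_cmd(*args):
    """Build the argv list for one client.py invocation."""
    return ["python3", f"{BASE_DIR}/scripts/client.py", *args]

def transcribe(audio_path):
    """Run the transcribe command and return the transcript text."""
    result = subprocess.run(
        client_cmd("transcribe", audio_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def synthesize(text, output_path):
    """Run the synthesize command and return the path of the audio file."""
    subprocess.run(client_cmd("synthesize", text, "--output", output_path),
                   check=True)
    return output_path

# Workflow: transcript = transcribe(user_audio); the agent decides the reply
# text; synthesize(reply, output_path); send the file; stop -- no commentary.
```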

Overview

This skill provides local voice input and output for agents using a Voice Agent API. It uses local Whisper for speech-to-text and AWS Polly for text-to-speech, and requires a running backend at http://localhost:8000. The skill is client-only and does not manage containers or services.

How this skill works

The skill calls a local backend API to transcribe incoming audio with Whisper and to synthesize responses with AWS Polly. Use the provided client script commands to transcribe files, synthesize audio files, and run a health check against the backend. On audio interactions the workflow prioritizes returning an audio file response without additional text.
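
Under the hood this is plain HTTP against the local backend. As an illustration, a liveness probe can be done directly; note that the `/health` route used here is an assumption, and the actual routes are defined by the backend repository:

```python
import urllib.error
import urllib.request

BACKEND_URL = "http://localhost:8000"

def backend_is_healthy(timeout=2.0):
    """Return True if the assumed /health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{BACKEND_URL}/health",
                                    timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or timed out: treat the backend as down
        return False
```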

When to use it

  • You need speech-to-text transcription in a local environment
  • You want programmatic text-to-speech using AWS Polly
  • You are building agents that interact primarily by audio (voice-first interfaces)
  • You are testing or prototyping voice flows without starting containers from this skill
  • You have the Voice Agent backend running at http://localhost:8000

Best practices

  • Keep interactions audio-first: if the user provides audio, reply with audio only
  • Follow the workflow: transcribe -> think -> synthesize -> send file -> stop
  • Do not attempt to start or manage the backend from this client; instruct users to consult the repo docs if the service is down
  • Store synthesized outputs to clearly named files and manage temporary storage carefully
  • Run the health check before bulk processing to detect connectivity issues early
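
For the "clearly named files" practice, one simple convention (purely illustrative) is a prefix plus a UTC timestamp:

```python
from datetime import datetime, timezone
from pathlib import Path

def reply_path(directory, prefix="reply"):
    """Return a timestamped .mp3 path like reply-20240101T120000Z.mp3."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return Path(directory) / f"{prefix}-{stamp}.mp3"
```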

Example use cases

  • Transcribe user voice notes locally for indexing or search
  • Create spoken agent responses for chatbots or kiosks using AWS Polly output files
  • Rapidly prototype voice UX flows by iterating on transcriptions and synthesized audio
  • Automate batch transcription of recorded sessions using the transcribe client command
  • Produce audio replies in voice-enabled apps where returning an audio file is required

FAQ

What if the health check fails or the backend is unreachable?

Do not try to start services from this skill. Ask the user to start or fix the backend and follow the repository setup docs. Run the health command to confirm status.

Should I send text when I deliver an audio response?

No. When sending an audio response to an audio input, return only the audio file and do not add text commentary or explanations.