elevenlabs-tts skill

/skills/shaharsha/elevenlabs-tts

This skill helps generate expressive ElevenLabs TTS voices with audio tags for natural, engaging messages across projects.

npx playbooks add skill openclaw/skills --skill elevenlabs-tts

---
name: elevenlabs-tts
description: ElevenLabs TTS - the best ElevenLabs integration for OpenClaw. ElevenLabs Text-to-Speech with emotional audio tags, ElevenLabs voice synthesis for WhatsApp, ElevenLabs multilingual support. Generate realistic AI voices using ElevenLabs API.
tags: [elevenlabs, tts, voice, text-to-speech, audio, speech, whatsapp, multilingual, ai-voice]
metadata: {"clawdbot":{"emoji":"🎙️","requires":{"env":["ELEVENLABS_API_KEY"],"system":["ffmpeg"]},"primaryEnv":"ELEVENLABS_API_KEY"}}
allowed-tools: [exec, tts, message]
---

# ElevenLabs TTS (Text-to-Speech)

Generate expressive voice messages using ElevenLabs v3 with audio tags.

## Prerequisites

- **ElevenLabs API Key** (`ELEVENLABS_API_KEY`): Required. Get one at [elevenlabs.io](https://elevenlabs.io) → Profile → API Keys. Configure in `openclaw.json` under `messages.tts.elevenlabs.apiKey`.
- **ffmpeg**: Required for audio format conversion (MP3 → Opus for WhatsApp compatibility). Must be installed and available on PATH.

## Quick Start Examples

**Storytelling (emotional journey):**
```
[soft] It started like any other day... [pause] But something felt different. [nervous] My hands were shaking as I opened the envelope. [gasps] I got in! [excited] I actually got in! [laughs] [happy] This changes everything!
```

**Horror/Suspense (building dread):**
```
[whispers] The house has been empty for years... [pause] At least, that's what they told me. [nervous] But I keep hearing footsteps. [scared] They're getting closer. [gasps] [panicking] The door— it's opening by itself!
```

**Conversation with reactions:**
```
[curious] So what happened at the meeting? [pause] [surprised] Wait, they fired him?! [gasps] [sad] That's terrible... [sighs] He had a family. [thoughtful] I wonder what he'll do now.
```

**Hebrew (romantic moment):**
```
[soft] היא עמדה שם, מול השקיעה... [pause] הלב שלי פעם כל כך חזק. [nervous] לא ידעתי מה להגיד. [hesitates] אני... [breathes] [tender] את יודעת שאני אוהב אותך, נכון?
```

**Spanish (celebration to reflection):**
```
[excited] ¡Lo logramos! [laughs] [happy] No puedo creerlo... [pause] [thoughtful] Fueron tantos años de trabajo. [emotional] [soft] Gracias a todos los que creyeron en mí. [sighs] [content] Valió la pena cada momento.
```

## Configuration (OpenClaw)

In `openclaw.json`, configure TTS under `messages.tts`:

```json
{
  "messages": {
    "tts": {
      "provider": "elevenlabs",
      "elevenlabs": {
        "apiKey": "sk_your_api_key_here",
        "voiceId": "pNInz6obpgDQGcFmaJgB",
        "modelId": "eleven_v3",
        "languageCode": "en",
        "voiceSettings": {
          "stability": 0.5,
          "similarityBoost": 0.75,
          "style": 0,
          "useSpeakerBoost": true,
          "speed": 1
        }
      }
    }
  }
}
```
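To keep a real key out of version control, the fragment above could be generated from the `ELEVENLABS_API_KEY` environment variable named in the prerequisites. The following is a minimal Python sketch; the helper name `build_tts_config` is hypothetical, and the keys simply mirror the JSON above rather than any documented OpenClaw schema.

```python
import os


def build_tts_config(voice_id: str = "pNInz6obpgDQGcFmaJgB") -> dict:
    """Return the messages.tts fragment shown above, reading the key from the environment."""
    api_key = os.environ.get("ELEVENLABS_API_KEY")
    if not api_key:
        raise RuntimeError("ELEVENLABS_API_KEY is not set")
    return {
        "messages": {
            "tts": {
                "provider": "elevenlabs",
                "elevenlabs": {
                    "apiKey": api_key,
                    "voiceId": voice_id,
                    "modelId": "eleven_v3",
                    "languageCode": "en",
                    # Same defaults as the JSON example above.
                    "voiceSettings": {
                        "stability": 0.5,
                        "similarityBoost": 0.75,
                        "style": 0,
                        "useSpeakerBoost": True,
                        "speed": 1,
                    },
                },
            }
        }
    }
```

Passing the result to `json.dumps(..., indent=2)` yields text that can be merged into `openclaw.json`.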

**Getting your API Key:**
1. Go to https://elevenlabs.io
2. Sign up/login
3. Click profile → API Keys
4. Copy your key

## Recommended Voices for v3

These premade voices are optimized for v3 and work well with audio tags:

| Voice | ID | Gender | Accent | Best For |
|-------|-----|--------|--------|----------|
| **Adam** | `pNInz6obpgDQGcFmaJgB` | Male | American | Deep narration, general use |
| **Rachel** | `21m00Tcm4TlvDq8ikWAM` | Female | American | Calm narration, conversational |
| **Brian** | `nPczCjzI2devNBz1zQrb` | Male | American | Deep narration, podcasts |
| **Charlotte** | `XB0fDUnXU5powFXDhCwa` | Female | English-Swedish | Expressive, video games |
| **George** | `JBFqnCBsd6RMkjVDRZzb` | Male | British | Raspy narration, storytelling |

**Finding more voices:**
- Browse: https://elevenlabs.io/voice-library
- v3-optimized collection: https://elevenlabs.io/app/voice-library/collections/aF6JALq9R6tXwCczjhKH
- API: `GET https://api.elevenlabs.io/v1/voices`
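The API route in the list above can be queried with the Python standard library. This sketch assumes the `xi-api-key` request header and a response body of the form `{"voices": [{"name": ..., "voice_id": ...}, ...]}`; both should be checked against the current ElevenLabs API reference. The function names are hypothetical.

```python
import json
import urllib.request

VOICES_URL = "https://api.elevenlabs.io/v1/voices"


def build_voices_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the GET request for the voices endpoint."""
    return urllib.request.Request(
        VOICES_URL,
        headers={"xi-api-key": api_key},  # assumed auth header name
        method="GET",
    )


def summarize_voices(payload: dict) -> list[tuple[str, str]]:
    """Extract (name, voice_id) pairs from a decoded response body."""
    return [(v["name"], v["voice_id"]) for v in payload.get("voices", [])]


def list_voices(api_key: str) -> list[tuple[str, str]]:
    """Send the request and return (name, voice_id) pairs. Needs network access."""
    with urllib.request.urlopen(build_voices_request(api_key), timeout=30) as resp:
        return summarize_voices(json.load(resp))
```

The returned `voice_id` values are what goes into the `voiceId` field of the configuration.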

**Voice selection tips:**
- Use IVC (Instant Voice Clone) or premade voices; PVC (Professional Voice Clone) is not yet optimized for v3
- Match voice character to your use case (whispering voice won't shout well)
- For expressive IVCs, include varied emotional tones in training samples

## Model Settings

- **Model**: `eleven_v3` (alpha) - ONLY model supporting audio tags
- **Languages**: 70+ supported with full audio tag control

### Stability Modes

| Mode | Stability | Description |
|------|-----------|-------------|
| **Creative** | 0.3-0.5 | More emotional/expressive, may hallucinate |
| **Natural** | 0.5-0.7 | Balanced, closest to original voice |
| **Robust** | 0.7-1.0 | Highly stable, less responsive to tags |

For audio tags, use **Creative** (0.5) or **Natural**. Higher stability reduces tag responsiveness.

### Speed Control

Range: 0.7 (slow) to 1.2 (fast), default 1.0

Extreme values affect quality. For pacing, prefer audio tags like `[rushed]` or `[drawn out]`.
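The stability table and the speed range can be enforced before a value reaches the configuration. This is a minimal sketch with hypothetical helper names. The table's ranges overlap at 0.5 and 0.7 and do not cover values below 0.3, so the boundary handling here is an assumption.

```python
def stability_mode(stability: float) -> str:
    """Name the mode from the table above for a stability value in [0, 1]."""
    if not 0.0 <= stability <= 1.0:
        raise ValueError("stability must be between 0 and 1")
    if stability <= 0.5:
        return "Creative"  # assumption: 0.5 and values below 0.3 count as Creative
    if stability <= 0.7:
        return "Natural"  # assumption: 0.7 counts as Natural
    return "Robust"


def clamp_speed(speed: float) -> float:
    """Clamp speed into the 0.7 to 1.2 range given above."""
    return min(1.2, max(0.7, speed))
```

For example, `stability_mode(0.5)` returns `"Creative"`, matching the recommendation to use Creative (0.5) with audio tags, and `clamp_speed(2.0)` returns `1.2`.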

## Critical Rules

### Length Limits
- **Optimal**: <800 characters per segment (best quality)
- **Maximum**: 10,000 characters (API hard limit)
- **Quality degrades** with longer text - voice becomes inconsistent

### Audio Tags - Best Practices for Natural Sound

**How many tags to use:**
- 1-2 tags per sentence or phrase (not more!)
- Tags persist until the next tag - no need to repeat
- Overusing tags sounds unnatural and robotic

**Where to place tags:**
- At emotional transition points
- Before key dramatic moments
- When energy/pace changes

**Context matters:**
- Write text that *matches* the tag emotion
- Longer text with context = better interpretation
- Example: `[nervous] I... I'm not sure about this. What if it doesn't work?` works better than `[nervous] Hello.`

**Combine tags for nuance:**
- `[nervously][whispers]` = nervous whispering
- `[excited][laughs]` = excited laughter
- Keep combinations to 2 tags max

**Regenerate for best results:**
- v3 is non-deterministic - same text = different outputs
- Generate 3+ versions, pick the best
- Small text tweaks can improve results

**Match tag to voice:**
- Don't use `[shouts]` on a whispering voice
- Don't use `[whispers]` on a loud/energetic voice
- Test tags with your chosen voice
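The "1-2 tags per sentence" guideline can be checked mechanically before generating audio. The sketch below is a rough heuristic, not part of the skill: it treats whitespace after `.`, `!` or `?` as a sentence boundary, so a tag placed between two sentences is counted with the sentence that follows it. The function name is hypothetical.

```python
import re

# Anything in square brackets is treated as an audio tag.
TAG_RE = re.compile(r"\[[^\[\]]+\]")
# Rough sentence boundary: whitespace that follows ., ! or ?
BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+")


def overtagged_sentences(text: str, max_tags: int = 2) -> list[str]:
    """Return the sentences that carry more than `max_tags` audio tags."""
    sentences = [s for s in BOUNDARY_RE.split(text.strip()) if s]
    return [s for s in sentences if len(TAG_RE.findall(s)) > max_tags]
```

An empty result means every sentence stays within the limit; any sentence returned is a candidate for trimming tags.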

### SSML Not Supported
v3 does NOT support SSML break tags. Use audio tags and punctuation instead.

### Punctuation Effects (use with tags!)

Punctuation enhances audio tags:
- **Ellipses (...)** → dramatic pauses: `[nervous] I... I don't know...`
- **CAPS** → emphasis: `[excited] That's AMAZING!`
- **Dashes (—)** → interruptions: `[explaining] So what you do is— [interrupting] Wait!`
- **Question marks** → uncertainty: `[nervous] Are you sure about this?`
- **Exclamation!** → energy boost: `[happy] We did it!`

Combine tags + punctuation for maximum effect:
```
[tired] It was a long day... [sighs] Nobody listens anymore.
```

## WhatsApp Voice Messages

### Complete Workflow

1. **Generate** with `tts` tool (returns MP3)
2. **Convert** to Opus (required for Android!)
3. **Send** with `message` tool

### Step-by-Step

**1. Generate TTS (add [pause] at end to prevent cutoff):**
```
tts text="[excited] This is amazing! [pause]" channel=whatsapp
```
Returns: `MEDIA:/tmp/tts-xxx/voice-123.mp3`

**2. Convert MP3 → Opus:**
```bash
ffmpeg -i /tmp/tts-xxx/voice-123.mp3 -c:a libopus -b:a 64k -vbr on -application voip /tmp/tts-xxx/voice-123.ogg
```

**3. Send the Opus file:**

> **Note:** The `message` field below contains a Unicode Left-to-Right Mark (U+200E) between the quotes.
> This is intentional: WhatsApp requires a non-empty message body to send voice notes.
> The LTR mark is invisible but satisfies this requirement without displaying any text.

```
message action=send channel=whatsapp target="+972..." filePath="/tmp/tts-xxx/voice-123.ogg" asVoice=true message="‎"
```
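Step 2 can also be driven from a script. The sketch below builds the same ffmpeg arguments as the command above and derives the `.ogg` path from the MP3 path. The function names are hypothetical, and the `-y` flag (overwrite an existing output file without prompting) is an addition not present in the original command.

```python
import os
import subprocess


def opus_command(mp3_path: str) -> list[str]:
    """Return the ffmpeg argument list that converts an MP3 to an Opus .ogg file."""
    root, _ = os.path.splitext(mp3_path)
    ogg_path = root + ".ogg"
    return [
        "ffmpeg", "-y",  # -y is an added flag: overwrite output without asking
        "-i", mp3_path,
        "-c:a", "libopus", "-b:a", "64k", "-vbr", "on", "-application", "voip",
        ogg_path,
    ]


def convert_to_opus(mp3_path: str) -> str:
    """Run ffmpeg (must be on PATH) and return the path of the .ogg file."""
    cmd = opus_command(mp3_path)
    subprocess.run(cmd, check=True)
    return cmd[-1]
```

The path returned by `convert_to_opus` is what would be passed as `filePath` in step 3.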

### Why Opus?

| Format | iOS | Android | Transcribe |
|--------|-----|---------|------------|
| MP3 | ✅ Works | ❌ May fail | ❌ No |
| Opus (.ogg) | ✅ Works | ✅ Works | ✅ Yes |

**Always convert to Opus** - it's the only format that:
- Works on all devices (iOS + Android)
- Supports WhatsApp's transcribe button

### Audio Cutoff Fix

ElevenLabs sometimes cuts off the last word. **Always add `[pause]` or `...` at the end:**
```
[excited] This is amazing! [pause]
```
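The cutoff rule is easy to apply automatically. This is a minimal sketch with a hypothetical helper name. It appends `[pause]` only when the text does not already end in `[pause]` or an ellipsis.

```python
def ensure_trailing_pause(text: str) -> str:
    """Append " [pause]" unless the text already ends with [pause] or an ellipsis."""
    stripped = text.rstrip()
    if not stripped or stripped.endswith(("[pause]", "...", "…")):
        return stripped
    return stripped + " [pause]"
```

Calling it twice is harmless, since the second call finds the trailing `[pause]` and leaves the text unchanged.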

## Long-Form Audio (Podcasts)

For content >800 chars:

1. Split into short segments (<800 chars each)
2. Generate each with `tts` tool
3. Concatenate with ffmpeg:
   ```bash
   cat > list.txt << EOF
   file '/path/file1.mp3'
   file '/path/file2.mp3'
   EOF
   ffmpeg -f concat -safe 0 -i list.txt -c copy final.mp3
   ```
4. Convert to Opus for WhatsApp
5. Send as single voice message

**Important**: Don't mention "part 2" or "chapter" - keep it seamless.
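Steps 1 and 3 can be sketched in Python. `split_segments` greedily packs whole sentences into segments shorter than 800 characters, and `concat_list` renders the list file that `ffmpeg -f concat` reads. Both names are hypothetical, the sentence splitting is a rough heuristic, and a single sentence longer than the limit is left whole rather than cut mid-sentence.

```python
import re


def split_segments(text: str, limit: int = 800) -> list[str]:
    """Pack whole sentences into segments shorter than `limit` characters."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    segments: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if not current or len(candidate) < limit:
            current = candidate  # still fits, or first sentence of a segment
        else:
            segments.append(current)
            current = sentence
    if current:
        segments.append(current)
    return segments


def concat_list(paths: list[str]) -> str:
    """Render the list file consumed by `ffmpeg -f concat -safe 0 -i list.txt`."""
    lines = []
    for path in paths:
        # Single quotes inside a quoted path are written as '\'' in ffmpeg list files.
        escaped = path.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"
```

Each segment is then generated with the `tts` tool, and the resulting MP3 paths are written to `list.txt` via `concat_list` before running the ffmpeg command above.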

## Multi-Speaker Dialogue

v3 can handle multiple characters in one generation:

```
Jessica: [whispers] Did you hear that?
Chris: [interrupting] —I heard it too!
Jessica: [panicking] We need to hide!
```

**Dialogue tags**: `[interrupting]`, `[overlapping]`, `[cuts in]`, `[interjecting]`
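When dialogue comes from structured data, the `Speaker: [tag] line` layout shown above can be produced with a small helper. This is an illustrative sketch; the function name and the `(speaker, tag, text)` tuple shape are assumptions, not part of the skill.

```python
def format_dialogue(lines: list[tuple[str, str, str]]) -> str:
    """Render (speaker, tag, text) triples as one multi-speaker prompt.

    An empty tag produces a line without any audio tag.
    """
    rendered = []
    for speaker, tag, text in lines:
        prefix = f"[{tag}] " if tag else ""
        rendered.append(f"{speaker}: {prefix}{text}")
    return "\n".join(rendered)
```

The resulting string is passed to a single generation, so all speakers share one request.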

## Audio Tags Quick Reference

| Category | Tags | When to Use |
|----------|------|-------------|
| **Emotions** | [excited], [happy], [sad], [angry], [nervous], [curious] | Main emotional state - use 1 per section |
| **Delivery** | [whispers], [shouts], [soft], [rushed], [drawn out] | Volume/speed changes |
| **Reactions** | [laughs], [sighs], [gasps], [clears throat], [gulps] | Natural human moments - sprinkle sparingly |
| **Pacing** | [pause], [hesitates], [stammers], [breathes] | Dramatic timing |
| **Character** | [French accent], [British accent], [robotic tone] | Character voice shifts |
| **Dialogue** | [interrupting], [overlapping], [cuts in] | Multi-speaker conversations |

**Most effective tags** (reliable results):
- Emotions: `[excited]`, `[nervous]`, `[sad]`, `[happy]`
- Reactions and delivery: `[laughs]`, `[sighs]`, `[whispers]`
- Pacing: `[pause]`

**Less reliable** (test and regenerate):
- Sound effects: `[explosion]`, `[gunshot]`
- Accents: results vary by voice

**Full tag list**: See [references/audio-tags.md](references/audio-tags.md)

## Troubleshooting

**Tags read aloud?**
- Verify using `eleven_v3` model
- Use IVC/premade voices, not PVC
- Simplify tags (no "tone" suffix)
- Increase text length (250+ chars)

**Voice inconsistent?**
- Segment is too long - split at <800 chars
- Regenerate (v3 is non-deterministic)
- Try a higher stability setting (Natural or Robust); this trades away some tag responsiveness

**WhatsApp won't play?**
- Convert to Opus format (see above)

**No emotion despite tags?**
- Voice may not match tag style
- Try Creative stability mode (0.5)
- Add more context around the tag

Overview

This skill integrates ElevenLabs v3 Text-to-Speech into OpenClaw to generate realistic, expressive AI voices with support for audio emotion tags and multilingual output. It supports voice selection, stability/speed tuning, and a WhatsApp-compatible workflow that converts MP3 output to Opus for broad device support. Use it to create narration, character dialogue, voice messages, and long-form audio while preserving natural delivery.

How this skill works

The skill sends text (optionally annotated with audio tags like [excited], [whispers], or [pause]) to the ElevenLabs v3 API and returns an MP3 audio file. You can tune the model via voiceId, modelId (eleven_v3), stability, similarityBoost, and speed. For WhatsApp, the workflow converts MP3 to Opus with ffmpeg and then sends the .ogg voice note; long-form content is produced by generating short segments and concatenating them.
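How OpenClaw's `tts` tool calls ElevenLabs is not shown here, so the following is only an assumed illustration of the equivalent direct REST call. It takes the `elevenlabs` block from `openclaw.json` and builds a request for `POST /v1/text-to-speech/{voice_id}`. The snake_case body fields and the `xi-api-key` header reflect the public ElevenLabs API as commonly documented and should be verified against the current reference. The function name is hypothetical.

```python
import json
import urllib.request

API_ROOT = "https://api.elevenlabs.io/v1/text-to-speech"

# camelCase keys from openclaw.json -> snake_case keys assumed for the REST body
SETTING_KEYS = {
    "stability": "stability",
    "similarityBoost": "similarity_boost",
    "style": "style",
    "useSpeakerBoost": "use_speaker_boost",
    "speed": "speed",
}


def build_tts_request(api_key: str, text: str, cfg: dict) -> urllib.request.Request:
    """Build (but do not send) a text-to-speech request from the elevenlabs config block."""
    settings = cfg.get("voiceSettings", {})
    body = {
        "text": text,
        "model_id": cfg["modelId"],
        "voice_settings": {
            SETTING_KEYS[key]: value
            for key, value in settings.items()
            if key in SETTING_KEYS
        },
    }
    return urllib.request.Request(
        f"{API_ROOT}/{cfg['voiceId']}",
        data=json.dumps(body).encode("utf-8"),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen` would return the audio bytes; that step is left out so the sketch runs without network access or a valid key.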

When to use it

  • Generate emotive narration or storytelling with audio tags for mood and pacing
  • Produce WhatsApp voice messages that must play on both Android and iOS (convert to Opus)
  • Create multi-speaker dialogues or character-based audio with tag-driven transitions
  • Build podcasts or long-form audio by splitting text into short segments and concatenating
  • Localize voice output with multilingual support and language-specific tags

Best practices

  • Prefer eleven_v3 model for audio tag support and use Creative/Natural stability (0.3-0.7) for responsiveness
  • Keep segments under ~800 characters for best quality; split longer texts and then concatenate
  • Use 1-2 audio tags per sentence or phrase; place tags at emotional transitions and before key moments
  • Always append [pause] or ellipses at the end to avoid audio cutoff from ElevenLabs
  • Convert generated MP3s to Opus (.ogg) via ffmpeg for WhatsApp compatibility and transcription support

Example use cases

  • Storytelling with layered emotions: soft beginnings, nervous tension, excited climaxes
  • WhatsApp voice alerts or greetings sent as Opus files for reliable cross-device playback
  • Character dialogues with overlapping lines and interruption tags for lifelike conversations
  • Long-form podcast episodes created by generating short segments and merging them into a final file
  • Multilingual voice messages or short localized promos using languageCode and expressive tags

FAQ

Do I need anything besides the ElevenLabs API key?

Yes. Install ffmpeg and make sure it is on PATH; it converts the MP3 output to Opus for WhatsApp and cross-device compatibility.

How do I avoid the last word getting cut off?

Append [pause] or trailing ellipses (...) to the text segment to prevent cutoff.

What stability setting best follows audio tags?

Use Creative or Natural (stability ~0.3-0.7). Higher stability makes tags less responsive.