
local-llama-tts skill


This skill synthesizes speech locally using llama-tts and OuteTTS-1.0-0.6B to generate WAV audio from text.

npx playbooks add skill openclaw/skills --skill local-llama-tts

SKILL.md
---
name: local-llama-tts
description: Local text-to-speech using llama-tts (llama.cpp) and OuteTTS-1.0-0.6B model.
metadata:
  {
    "openclaw":
      {
        "emoji": "🔊",
        "requires": { "bins": ["llama-tts"] },
      },
  }
---

# Local Llama TTS

Synthesize speech locally using `llama-tts` and the `OuteTTS-1.0-0.6B` model.

## Usage

You can use the wrapper script:
- `scripts/tts-local.sh [options] "<text>"`

### Options
- `-o, --output <file>`: Output WAV file (default: `output.wav`)
- `-s, --speaker <file>`: Speaker reference file (optional)
- `-t, --temp <value>`: Temperature (default: `0.4`)
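
For illustration, the wrapper's option handling might look like the following minimal sketch. The actual `scripts/tts-local.sh` may differ; the model filenames and the `llama-tts` flags (`-m`, `-mv`, `--temp`, `-o`, `-p`, `--speaker-file`) are assumptions here, and the command is printed rather than executed:

```shell
#!/bin/sh
# Hypothetical sketch of the wrapper's option parsing; the real
# scripts/tts-local.sh may differ. Builds and prints the llama-tts
# invocation instead of running it (flag names are assumptions).
build_tts_cmd() {
    output="output.wav"
    speaker=""
    temp="0.4"
    while [ $# -gt 1 ]; do
        case "$1" in
            -o|--output)  output=$2;  shift 2 ;;
            -s|--speaker) speaker=$2; shift 2 ;;
            -t|--temp)    temp=$2;    shift 2 ;;
            *) echo "unknown option: $1" >&2; return 1 ;;
        esac
    done
    text=$1
    cmd="llama-tts -m OuteTTS-1.0-0.6B-Q4_K_M.gguf -mv WavTokenizer-Large-75-Q4_0.gguf --temp $temp -o $output"
    if [ -n "$speaker" ]; then
        cmd="$cmd --speaker-file $speaker"
    fi
    printf '%s -p "%s"\n' "$cmd" "$text"
}

build_tts_cmd -o hello.wav -t 0.6 "Hello from local TTS"
```

Running the last line prints the assembled command so it can be inspected before anything is synthesized.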

## Scripts

- **Location:** `scripts/tts-local.sh` (inside skill folder)
- **Model:** `/data/public/machine-learning/models/text-to-speach/OuteTTS-1.0-0.6B-Q4_K_M.gguf`
- **Vocoder:** `/data/public/machine-learning/models/text-to-speach/WavTokenizer-Large-75-Q4_0.gguf`
- **GPU:** Enabled via `llama-tts` (requires a GPU-enabled build of llama.cpp).

## Setup

1. **Model:** Download from [OuteAI/OuteTTS-1.0-0.6B-GGUF](https://huggingface.co/OuteAI/OuteTTS-1.0-0.6B-GGUF/resolve/main/OuteTTS-1.0-0.6B-Q4_K_M.gguf?download=true)
2. **Vocoder:** Download from [ggml-org/WavTokenizer](https://huggingface.co/ggml-org/WavTokenizer/resolve/main/WavTokenizer-Large-75-Q5_1.gguf?download=true) (note: the script's default path expects a Q4_0 file; the Q5_1 quantization linked here is a higher-quality alternative, so rename the download or update the script path if you use it).

Place files in `/data/public/machine-learning/models/text-to-speach/` or update `scripts/tts-local.sh`.
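
As a quick sanity check before first use, a sketch like the following can confirm that both files are where the script expects them (the directory and filenames follow the defaults above):

```shell
#!/bin/sh
# Verify that the model and vocoder files the script expects are
# present. Directory and filenames follow the defaults documented above.
check_models() {
    dir=$1
    status=0
    for f in OuteTTS-1.0-0.6B-Q4_K_M.gguf WavTokenizer-Large-75-Q4_0.gguf; do
        if [ -f "$dir/$f" ]; then
            echo "found:   $f"
        else
            echo "missing: $f"
            status=1
        fi
    done
    return $status
}

check_models /data/public/machine-learning/models/text-to-speach \
    || echo "one or more files are missing; see the download links above"
```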

## Sampling Configuration
The model card recommends the following settings (hardcoded in the script):
- **Temperature:** 0.4
- **Repetition Penalty:** 1.1
- **Repetition Range:** 64
- **Top-k:** 40
- **Top-p:** 0.9
- **Min-p:** 0.05
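
Expressed as llama.cpp-style sampler flags, the configuration above would look roughly like this. The flag names follow llama.cpp's common sampling options and are an assumption about what the script passes; the command is only assembled and printed, not run:

```shell
#!/bin/sh
# The recommended sampling settings as llama.cpp-style sampler flags.
# Flag names are assumptions based on llama.cpp's common options; the
# full command is printed for inspection rather than executed.
SAMPLING="--temp 0.4 --repeat-penalty 1.1 --repeat-last-n 64 --top-k 40 --top-p 0.9 --min-p 0.05"
MODEL_DIR=/data/public/machine-learning/models/text-to-speach
echo "llama-tts -m $MODEL_DIR/OuteTTS-1.0-0.6B-Q4_K_M.gguf" \
     "-mv $MODEL_DIR/WavTokenizer-Large-75-Q4_0.gguf" \
     "$SAMPLING -p \"Hello\" -o output.wav"
```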

Overview

This skill provides local text-to-speech using llama-tts with the OuteTTS-1.0-0.6B model and a local vocoder. It wraps llama-tts in a simple script to synthesize WAV output without sending text to remote services. The setup targets offline use and can leverage GPU acceleration when available.

How this skill works

The skill runs llama-tts with the specified GGUF voice model and a local WavTokenizer vocoder to generate audio. A helper script (scripts/tts-local.sh) exposes options for output path, speaker reference, and sampling temperature. The script embeds recommended sampling parameters for stable, natural output and can be pointed to alternate model locations if needed.

When to use it

  • You need offline text-to-speech with no cloud dependencies.
  • You want to run TTS on local GPU-enabled machines for speed and privacy.
  • You need reproducible voice synthesis with fixed sampling settings.
  • You want a simple script wrapper to integrate TTS into pipelines or automation.
  • You need to test different speaker reference files or temperatures locally.

Best practices

  • Place model and vocoder files in the default /data/public/machine-learning/models/text-to-speach/ path or update the script paths.
  • Use the bundled script scripts/tts-local.sh to avoid manual CLI mistakes and ensure consistent sampling config.
  • Start with the recommended temperature (0.4) and repetition settings; adjust temperature in small increments to change expressiveness.
  • Provide a speaker reference file for consistent voice characteristics when available.
  • Run on GPU when possible for faster synthesis; ensure llama-tts is configured for GPU.

Example use cases

  • Generate narration audio for videos or demos entirely offline.
  • Integrate local TTS into a private voice assistant or kiosk without cloud calls.
  • Batch-produce spoken prompts or alerts on an on-prem server using the script.
  • Experiment with speaker references to create custom voice identities for prototypes.
  • Create test audio datasets for speech research without exposing text data externally.
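
The batch-production case above can be sketched as a small loop over a prompts file, one WAV per line. The `TTS_CMD` variable, `batch_tts` helper, and output filenames are illustrative; the default assumes the wrapper script is reachable at `scripts/tts-local.sh`:

```shell
#!/bin/sh
# Batch synthesis sketch: one WAV per line of a prompts file.
# TTS_CMD defaults to the wrapper script but can be overridden
# (e.g. with `echo`) to dry-run without llama-tts installed.
TTS_CMD=${TTS_CMD:-scripts/tts-local.sh}

batch_tts() {
    i=0
    while IFS= read -r line; do
        i=$((i + 1))
        $TTS_CMD -o "prompt-$i.wav" "$line"
    done < "$1"
}

# Example (dry run): TTS_CMD=echo batch_tts prompts.txt
```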

FAQ

What model files do I need and where should they go?

You need the OuteTTS-1.0-0.6B GGUF model and a WavTokenizer vocoder. Place them in /data/public/machine-learning/models/text-to-speach/ or update scripts/tts-local.sh to point to your locations.

How do I control output quality and style?

Use the script options: set --temp for temperature, provide a speaker reference file with --speaker, and rely on the bundled sampling configuration (repetition penalty, top-k/top-p) for stable quality. Adjust temperature slowly to change expressiveness.