
local-tts skill

/skills/local-tts

This skill generates high-quality local speech from text using MLX-accelerated Kokoro-82M, with offline operation and flexible voice presets.

npx playbooks add skill krishagel/geoffrey --skill local-tts

Review the files below or copy the command above to add this skill to your agents.

Files (5)
SKILL.md
---
name: local-tts
version: 1.0.0
description: Local text-to-speech using MLX and Kokoro model
triggers:
  - local-tts
  - local tts
  - generate audio locally
  - kokoro tts
dependencies:
  - mlx-audio (via uv --with)
  - pydub (via uv --with)
---

# Local TTS Skill

Generate high-quality speech audio locally using Apple Silicon MLX acceleration and the Kokoro-82M model. No API keys or recurring costs.

## Quick Start

```bash
# Generate MP3 from text
uv run --with mlx-audio --with pydub skills/local-tts/scripts/generate_audio.py \
    --text "Hello, this is a test." \
    --output ~/Desktop/test.mp3

# Generate from file
uv run --with mlx-audio --with pydub skills/local-tts/scripts/generate_audio.py \
    --file /tmp/script.txt \
    --voice af_heart \
    --output ~/Desktop/podcast.mp3

# List available voices
uv run --with mlx-audio skills/local-tts/scripts/list_voices.py
```

## Parameters

| Parameter | Required | Default | Description |
|-----------|----------|---------|-------------|
| `--text` | One of `--text`/`--file` | - | Text to convert |
| `--file` | One of `--text`/`--file` | - | Path to a text file |
| `--voice` | No | `af_heart` | Voice preset |
| `--output` | Yes | - | Output file path (.mp3, .wav) |
| `--model` | No | `Kokoro-82M-bf16` | Model to use |
| `--list-voices` | No | - | Show available voices |

## Voice Presets

### American English Female (prefix: af_)
- `af_heart` - Warm, friendly **(default)**
- `af_bella` - Soft, calm
- `af_nova` - Clear, professional
- `af_river` - Clear, confident
- `af_sarah` - Soft, expressive

### American English Male (prefix: am_)
- `am_adam` - Clear, professional
- `am_echo` - Deep, smooth
- `am_liam` - Articulate, conversational
- `am_michael` - Soft, measured

### British English (prefix: bf_, bm_)
- `bf_emma` - Clear, refined female
- `bm_daniel` - Clear, professional male
- `bm_george` - Distinguished male

See `references/voices.md` for full list.

## Output Format

```json
{
  "success": true,
  "file": "/Users/hagelk/Desktop/podcast.mp3",
  "voice": "af_heart",
  "model": "Kokoro-82M-bf16",
  "characters": 9824,
  "chunks": 20,
  "duration_seconds": 612.5,
  "generation_time": 45.2
}
```
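The JSON result is easy to consume programmatically. A minimal sketch, assuming the script prints exactly one JSON object to stdout (the values below are the illustrative ones from the example above; real runs will differ):

```python
import json

# Hypothetical result captured from the script's stdout.
raw = '''{
  "success": true,
  "file": "/Users/hagelk/Desktop/podcast.mp3",
  "voice": "af_heart",
  "model": "Kokoro-82M-bf16",
  "characters": 9824,
  "chunks": 20,
  "duration_seconds": 612.5,
  "generation_time": 45.2
}'''

result = json.loads(raw)
if result["success"]:
    # Realtime factor: seconds of audio produced per second of wall-clock time.
    factor = result["duration_seconds"] / result["generation_time"]
    print(f"Wrote {result['file']} ({result['duration_seconds']:.0f}s audio, "
          f"~{factor:.1f}x realtime)")
```

In a pipeline you would capture the script's stdout (e.g. via `subprocess.run(..., capture_output=True)`) and parse it the same way.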

## Performance

| Hardware | Speed | Notes |
|----------|-------|-------|
| M3 Pro 36GB | ~3-4x realtime | First run slower (model loading) |
| M1/M2 Mac Mini 8GB | ~1.5x realtime | Works well for briefings |
| M1/M2 Mac Mini 16GB | ~2x realtime | Comfortable headroom |
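The realtime factors above translate directly into expected wall-clock time: divide the audio length by the factor. A quick illustrative calculation (numbers assumed from the table):

```python
def generation_time_estimate(audio_seconds: float, realtime_factor: float) -> float:
    """Rough wall-clock generation time at a given realtime factor."""
    return audio_seconds / realtime_factor

# A 10-minute briefing on an M1/M2 Mac Mini at ~2x realtime:
print(generation_time_estimate(600, 2.0))  # 300.0 seconds, i.e. ~5 minutes
```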

## Technical Details

- **Model**: Kokoro-82M-bf16 (~200MB download on first run)
- **Sample rate**: 24kHz mono
- **Chunking**: Text split at ~400 chars per chunk for quality
- **Concatenation**: Chunks joined seamlessly via pydub
- **Formats**: MP3, WAV, M4A, OGG
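The chunking step can be approximated as follows. This is an illustrative sketch, not the script's actual implementation: it splits on sentence boundaries and packs sentences into ~400-character chunks (the pydub concatenation step is omitted, since it needs real audio data):

```python
import re

def chunk_text(text: str, max_chars: int = 400) -> list[str]:
    """Split text at sentence boundaries into chunks of roughly max_chars.

    A single sentence longer than max_chars is kept whole; a real
    implementation might split it further at commas or clause boundaries.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks

chunks = chunk_text("This is a sentence. " * 50)
print(len(chunks), max(len(c) for c in chunks))  # 3 399
```

Because chunks break only at sentence boundaries, joining them with a space reconstructs the original text exactly, so nothing is lost across chunk seams.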

## Important Notes

1. **MUST use `--with` flags** - Do not use PEP 723 inline deps. mlx-audio requires uv's cached environment.

2. **First run is slower** - The model (~200MB) downloads and espeak dependencies initialize.

3. **Model cached at**: `~/.cache/huggingface/hub/models--mlx-community--Kokoro-82M-bf16/`
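The cache path in note 3 can be checked before a long run to predict whether the first-run download will occur. A small shell sketch (the path is taken verbatim from the note above):

```bash
#!/bin/sh
# Check whether the Kokoro model is already in the Hugging Face cache.
MODEL_DIR="$HOME/.cache/huggingface/hub/models--mlx-community--Kokoro-82M-bf16"

if [ -d "$MODEL_DIR" ]; then
    echo "Model cached; generation starts immediately."
else
    echo "Model not cached; first run will download ~200MB."
fi
```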

## Integration with Morning Briefing

The morning-briefing skill uses this for podcast generation:

```bash
uv run --with mlx-audio --with pydub skills/local-tts/scripts/generate_audio.py \
    --file /tmp/morning_briefing_podcast.txt \
    --voice af_heart \
    --output ~/Desktop/morning_briefing.mp3
```

Overview

This skill provides local text-to-speech using Apple Silicon MLX acceleration and the Kokoro-82M model so you can generate high-quality audio without external APIs or recurring costs. It focuses on fast, offline generation on Mac hardware with multiple voice presets and common output formats (MP3, WAV, M4A, OGG).

How this skill works

The skill loads the Kokoro-82M model locally and uses mlx-audio for hardware-accelerated inference on Apple Silicon. Input text is chunked (~400 characters) for stable generation, then audio chunks are concatenated with pydub and written to the requested file format. No API keys are required; the model downloads once to a local cache.

When to use it

  • Create voiceovers, podcasts, or briefings locally without cloud services
  • Generate batch TTS files for internal tools or offline apps
  • Produce short- or long-form narration on Apple Silicon Macs
  • Convert scripts or text files to polished MP3/WAV outputs
  • Test voices and presets before integrating into larger pipelines

Best practices

  • Always run with the required --with flags (mlx-audio and pydub) to ensure the cached environment is used
  • Prefer the default voice for a warm result, but test presets (af_, am_, bf_, bm_) to match tone
  • Supply --file for long scripts to avoid shell quoting issues; use --text for quick snippets
  • Allow the first run extra time — the model (~200MB) and espeak dependencies download and initialize
  • Choose MP3 for distribution and WAV for further editing to preserve quality

Example use cases

  • Generate a morning briefing podcast from a text file and save to Desktop
  • Create a narrated demo video soundtrack using a professional preset voice
  • Batch-convert documentation or training scripts into audio files for accessibility
  • Produce short social audio clips by specifying text and a target MP3 output

FAQ

Do I need an API key or internet access to run?

No API key is required. Internet is only needed for the initial model download; subsequent runs are fully local.

Which hardware is recommended?

Apple Silicon (M1/M2/M3) is recommended. M3 Pro shows ~3–4x realtime; M1/M2 provide roughly 1.5–2x depending on memory.