
piper-tts-training skill


This skill helps you train and export Piper TTS voices for offline deployment, covering data prep, validation, fine-tuning, and ONNX export.

npx playbooks add skill sammcj/agentic-coding --skill piper-tts-training

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: piper-tts-training
description: Train custom TTS voices for Piper (ONNX format) using fine-tuning or from-scratch approaches. Use when creating new synthetic voices, fine-tuning existing Piper checkpoints, preparing audio datasets for TTS training, or deploying voice models to devices like Raspberry Pi or Home Assistant. Covers dataset preparation, Whisper-based validation, training configuration, and ONNX export.
# model: inherit
# allowed-tools: Read,Write,Bash,Grep
---

# Piper TTS Voice Training

Train custom text-to-speech voices compatible with Piper's lightweight ONNX runtime.

## Overview

Piper produces fast, offline TTS suitable for embedded devices. Training involves:
1. Corpus preparation (text covering phonetic range)
2. Audio generation or recording
3. Quality validation via Whisper transcription
4. Fine-tuning from existing checkpoint (recommended) or training from scratch
5. ONNX export for deployment

**Fine-tuning vs from-scratch:**
- Fine-tuning: ~1,300 phrases + 1,000 epochs (days on modest GPU)
- From scratch: ~13,000+ phrases + 2,000+ epochs (weeks/months)

## Workflow

### 1. Corpus Preparation

Gather 1,300-1,500+ phrases covering a broad phonetic range:
- Use piper-recording-studio corpus as base
- Add domain-specific phrases for your use case
- Include varied sentence structures and lengths

**Critical for non-US English:** Ensure the corpus uses correct regional spelling. See [Localisation](#localisation-for-australian-new-zealand-and-uk-english).
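As an illustration of corpus assembly, the sketch below merges a base corpus with domain phrases, removes duplicates, and reports simple length statistics so you can confirm sentence variety. The file names are assumptions, not part of the skill.

```python
# Merge the base corpus with domain-specific phrases, dedupe, and report
# simple length statistics. File names here are placeholders for illustration.
from pathlib import Path

base = Path("base_corpus.txt").read_text(encoding="utf-8").splitlines()
domain = Path("domain_phrases.txt").read_text(encoding="utf-8").splitlines()

phrases = []
seen = set()
for line in base + domain:
    text = line.strip()
    if text and text.lower() not in seen:
        seen.add(text.lower())
        phrases.append(text)

lengths = [len(p.split()) for p in phrases]
print(f"{len(phrases)} unique phrases, "
      f"{min(lengths)}-{max(lengths)} words per phrase")

Path("corpus.txt").write_text("\n".join(phrases) + "\n", encoding="utf-8")
```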

### 2. Audio Generation

Generate or record training audio as 22050 Hz mono WAV.

**If using voice cloning (e.g., Chatterbox TTS):**
- Generate at source sample rate (often 24kHz)
- Convert to 22050Hz: `sox -v 0.95 input.wav -r 22050 -t wav output.wav`
- The `-v 0.95` prevents clipping during resampling
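If you have a whole directory of cloned audio to convert, the same sox invocation can be scripted; this is a minimal sketch in which the directory names are placeholders and sox is assumed to be on your PATH:

```python
# Batch-resample cloned audio to 22050 Hz mono WAV using the sox command
# shown above (-c 1 forces mono output). Directory names are placeholders.
import subprocess
from pathlib import Path

src_dir = Path("cloned_24khz")
dst_dir = Path("wavs_22050")
dst_dir.mkdir(exist_ok=True)

for wav in sorted(src_dir.glob("*.wav")):
    subprocess.run(
        ["sox", "-v", "0.95", str(wav),
         "-r", "22050", "-c", "1", "-t", "wav", str(dst_dir / wav.name)],
        check=True,
    )
```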

**Recording requirements:**
- Consistent microphone position and room acoustics
- Minimal background noise
- Natural speaking pace (not reading voice)

### 3. Quality Validation with Whisper

Automate quality checks instead of listening to every sample manually:

```python
import whisper
from piper_phonemize import phonemize_text

model = whisper.load_model("base")

def validate_sample(audio_path, expected_text):
    result = model.transcribe(audio_path)
    transcribed = result["text"].strip()

    # Compare phonemically to handle spelling/punctuation differences
    expected_phonemes = phonemize_text(expected_text, "en-gb")
    transcribed_phonemes = phonemize_text(transcribed, "en-gb")

    return expected_phonemes == transcribed_phonemes
```

Retry failed samples up to 3 times. Target 95%+ dataset coverage.
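Building on `validate_sample` above, a validation loop with retries might look like the sketch below. `generate_audio()` is a hypothetical placeholder for your own synthesis or re-recording step and is not part of Piper.

```python
# Validate every sample, regenerating failures up to 3 times, then report
# dataset coverage. generate_audio() is a placeholder for your own
# synthesis/recording step.
from pathlib import Path

MAX_RETRIES = 3

def validate_dataset(samples, wav_dir):
    """samples: dict mapping sample id -> expected text."""
    passed = 0
    for sample_id, text in samples.items():
        wav_path = Path(wav_dir) / f"{sample_id}.wav"
        ok = False
        for _ in range(MAX_RETRIES):
            if validate_sample(str(wav_path), text):
                ok = True
                break
            generate_audio(text, wav_path)  # hypothetical: re-synthesise and retry
        passed += ok
    coverage = passed / len(samples)
    print(f"Coverage: {coverage:.1%} (target: 95%+)")
    return coverage
```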

### 4. Dataset Format (LJSpeech)

Structure your dataset:
```
dataset/
├── metadata.csv
└── wavs/
    ├── sample_0001.wav
    ├── sample_0002.wav
    └── ...
```

**metadata.csv format:** `{id}|{text}` (pipe-separated, no headers)
```
sample_0001|The quick brown fox jumps over the lazy dog.
sample_0002|Pack my box with five dozen liquor jugs.
```

### 5. Preprocessing

Convert to PyTorch tensors:
```bash
python3 -m piper_train.preprocess \
    --language en-gb \
    --input-dir dataset/ \
    --output-dir piper_training_dir/ \
    --dataset-format ljspeech
```

Use `en-gb` for Australian/NZ/UK voices (espeak-ng phoneme set).
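Preprocessing writes a `config.json` and a `dataset.jsonl` into the output directory. A quick sanity check is sketched below; the exact keys inside `config.json` are assumed from typical piper_train output, so they are read defensively.

```python
# Sanity-check the preprocessing output: report the configured sample rate,
# espeak voice, and the number of preprocessed utterances.
import json
from pathlib import Path

out_dir = Path("piper_training_dir")

config = json.loads((out_dir / "config.json").read_text())
print("sample rate:", config.get("audio", {}).get("sample_rate"))
print("espeak voice:", config.get("espeak", {}).get("voice"))

num_utterances = sum(1 for _ in (out_dir / "dataset.jsonl").open())
print("utterances preprocessed:", num_utterances)
```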

### 6. Training

**Fine-tuning (recommended):**
```bash
python3 -m piper_train \
    --dataset-dir piper_training_dir/ \
    --accelerator gpu \
    --devices 1 \
    --batch-size 12 \
    --max_epochs 3000 \
    --resume_from_checkpoint ljspeech-2000.ckpt \
    --checkpoint-epochs 100 \
    --quality high \
    --precision 32
```

**Key parameters:**
- `--batch-size`: Reduce if VRAM limited (12 works on 8GB)
- `--resume_from_checkpoint`: Start from LJSpeech high-quality checkpoint
- `--precision 32`: More stable than mixed precision
- `--validation-split 0.0 --num-test-examples 0`: Skip validation for small datasets

Monitor with TensorBoard: watch `loss_disc_all` for convergence.
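If you prefer to check convergence from a script rather than the TensorBoard UI, the event files can be read directly. The log directory layout (`lightning_logs/version_0` under the dataset dir) and the scalar tag name are assumptions based on PyTorch Lightning defaults and the metric named above; adjust to match your run.

```python
# Print the most recent loss_disc_all values from the Lightning event files.
# Log path and tag name are assumptions; adjust to match your run.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

log_dir = "piper_training_dir/lightning_logs/version_0"
acc = EventAccumulator(log_dir)
acc.Reload()

for event in acc.Scalars("loss_disc_all")[-5:]:
    print(f"step {event.step}: loss_disc_all = {event.value:.4f}")
```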

### 7. ONNX Export

```bash
python3 -m piper_train.export_onnx checkpoint.ckpt output.onnx.unoptimized
onnxsim output.onnx.unoptimized output.onnx
```

Create the voice metadata file `output.onnx.json` by copying the training `config.json` alongside the exported model.
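Since the runtime metadata is essentially the training config, a straight copy is enough; a minimal sketch with assumed paths:

```python
# Copy the training config alongside the exported model so Piper can load it
# as <model>.onnx.json at runtime. Paths are placeholders.
import shutil

shutil.copyfile("piper_training_dir/config.json", "output.onnx.json")
```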

## Localisation for Australian, New Zealand and UK English

Piper uses espeak-ng for phonemisation. American pronunciations in training data cause accent drift.

**Corpus preparation:**
- Run `scripts/convert_spelling.py` on corpus text before training
- Use `en-gb` or `en-au` espeak-ng voice for phonemisation
- Review generated phonemes for Americanisms

**Common spelling conversions:**
| American | Australian/UK |
|----------|---------------|
| -ize | -ise |
| -or | -our |
| -er | -re |
| -og | -ogue |
| -ense | -ence |
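The table describes suffix patterns, but applying them blindly breaks words such as "doctor" or "error". The skill's `scripts/convert_spelling.py` is not reproduced here; as an illustration of the idea, the sketch below uses an explicit word map with whole-word matching. The entries are an illustrative subset, not the skill's actual list.

```python
# Convert a handful of common American spellings to Australian/UK forms
# using whole-word replacement. Illustrative subset only.
import re

SPELLING_MAP = {
    "color": "colour",
    "flavor": "flavour",
    "center": "centre",
    "theater": "theatre",
    "organize": "organise",
    "recognize": "recognise",
    "catalog": "catalogue",
    "dialog": "dialogue",
    "defense": "defence",
    "license": "licence",  # noun form only
}

def convert_spelling(text: str) -> str:
    def repl(match: re.Match) -> str:
        word = match.group(0)
        replacement = SPELLING_MAP[word.lower()]
        return replacement.capitalize() if word[0].isupper() else replacement
    pattern = r"\b(" + "|".join(SPELLING_MAP) + r")\b"
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)

print(convert_spelling("The theater uses color-coded dialog boxes."))
# -> "The theatre uses colour-coded dialogue boxes."
```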

**Phoneme considerations:**
- /r/ linking and intrusion patterns differ
- Vowel sounds in words like "dance", "bath", "castle"
- Final -ile pronunciation (hostile, missile)

For complete word lists and phonetic details, see [references/localisation.md](references/localisation.md).

**Validation:** Use Whisper with `language="en"` and verify transcriptions match expected regional forms.

## Dependencies

Pin versions to avoid API breakage:
```
pytorch-lightning==1.9.3
torch<2.6.0
piper-phonemize
onnxruntime-gpu
onnxsim
```

Docker containerisation is recommended for reproducibility.

## Hardware Requirements

**Minimum (fine-tuning):**
- 8GB VRAM GPU (Pascal or newer)
- 8GB system RAM
- ~5 days for 1,000 epochs on Tesla P4

**From scratch:** Expect weeks to months rather than days (roughly 10x the data and at least double the epochs; see Overview).

## Troubleshooting

| Issue | Solution |
|-------|----------|
| CUDA OOM | Reduce batch-size (try 8 or 4) |
| Checkpoint won't load | Check pytorch-lightning version matches checkpoint |
| Garbled output | Insufficient training epochs or dataset too small |
| Wrong accent | Check espeak-ng language code and corpus spelling |

Overview

This skill trains custom text-to-speech voices for Piper and exports optimized ONNX models for fast, offline runtime on devices like Raspberry Pi and Home Assistant. It supports fine-tuning existing Piper checkpoints or training from scratch and includes tools for dataset preparation, Whisper-based validation, training orchestration, and ONNX export. The workflow is designed for reproducible results on modest GPUs and embedded deployments.

How this skill works

Prepare a phoneme-complete corpus and 22.05 kHz mono WAV audio, validate samples automatically with Whisper transcription and phoneme comparison, then preprocess into a PyTorch-friendly dataset. Train by fine-tuning a high-quality checkpoint (recommended) or training from scratch, monitor convergence with TensorBoard, and export the final model to ONNX, simplifying it with onnxsim. Create a JSON metadata file for runtime configuration and deploy the ONNX model to target devices.

When to use it

  • Creating a new synthetic voice for a product or assistant.
  • Fine-tuning an existing Piper checkpoint to match a target speaker or style.
  • Preparing and validating an audio corpus for TTS training.
  • Exporting and deploying compact TTS models to Raspberry Pi or Home Assistant.
  • Adapting voice models to regional English variants (UK/AU/NZ).

Best practices

  • Prefer fine-tuning with ~1,300+ phrases for practical training time; reserve from-scratch training for very large corpora (13,000+ phrases).
  • Record or generate 22.05 kHz mono WAV; if source differs, resample with sox using -v 0.95 to avoid clipping.
  • Automate quality checks with Whisper and phonemisation to target 95%+ validated samples.
  • Use espeak-ng language codes (en-gb/en-au) and run spelling conversions to avoid accent drift.
  • Pin dependency versions and use a Docker container for reproducible environments.

Example use cases

  • Fine-tune a neutral TTS voice to sound like a brand narrator for an embedded kiosk.
  • Train a regional English voice (UK/AU/NZ) by converting spelling and using en-gb phonemisation.
  • Validate and clean a mixed-quality recording dataset using Whisper before training.
  • Export a compact ONNX model for offline Home Assistant voice responses on Raspberry Pi.

FAQ

How many phrases do I need to fine-tune vs train from scratch?

Fine-tuning works well with ~1,300–1,500 phrases; training from scratch typically needs 13,000+ phrases.

What sample rate and format are required?

Use 22050 Hz mono WAV for training. If your source is different, resample to 22.05 kHz before preprocessing.

How do I avoid accent drift for UK/AU/NZ English?

Convert American spellings, use en-gb or en-au espeak-ng phonemisation, and validate with Whisper phoneme comparisons.