
piper-tts-training skill


This skill helps you train and export Piper TTS voices for offline deployment, covering data prep, validation, fine-tuning, and ONNX export.

npx playbooks add skill sammcj/agentic-coding --skill piper-tts-training

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: piper-tts-training
description: Train custom TTS voices for Piper (ONNX format) using fine-tuning or from-scratch approaches. Use when creating new synthetic voices, fine-tuning existing Piper checkpoints, preparing audio datasets for TTS training, or deploying voice models to devices like Raspberry Pi or Home Assistant. Covers dataset preparation, Whisper-based validation, training configuration, and ONNX export.
# model: inherit
# allowed-tools: Read,Write,Bash,Grep
---

# Piper TTS Voice Training

Train custom text-to-speech voices compatible with Piper's lightweight ONNX runtime.

## Overview

Piper produces fast, offline TTS suitable for embedded devices. Training involves:
1. Corpus preparation (text covering phonetic range)
2. Audio generation or recording
3. Quality validation via Whisper transcription
4. Fine-tuning from existing checkpoint (recommended) or training from scratch
5. ONNX export for deployment

**Fine-tuning vs from-scratch:**
- Fine-tuning: ~1,300 phrases + 1,000 epochs (days on modest GPU)
- From scratch: ~13,000+ phrases + 2,000+ epochs (weeks/months)

## Workflow

### 1. Corpus Preparation

Gather 1,300-1,500+ phrases covering a broad phonetic range:
- Use piper-recording-studio corpus as base
- Add domain-specific phrases for your use case
- Include varied sentence structures and lengths

**Critical for non-US English:** Ensure the corpus uses correct regional spelling. See [Localisation](#localisation-for-australian-new-zealand-and-uk-english).
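As an illustration of corpus assembly, the sketch below merges a base corpus with domain phrases, removes duplicates, and reports simple length statistics so you can confirm sentence variety. The file names are assumptions, not part of the skill.

```python
# Merge the base corpus with domain-specific phrases, dedupe, and report
# simple length statistics. File names here are placeholders for illustration.
from pathlib import Path

base = Path("base_corpus.txt").read_text(encoding="utf-8").splitlines()
domain = Path("domain_phrases.txt").read_text(encoding="utf-8").splitlines()

phrases = []
seen = set()
for line in base + domain:
    text = line.strip()
    if text and text.lower() not in seen:
        seen.add(text.lower())
        phrases.append(text)

lengths = [len(p.split()) for p in phrases]
print(f"{len(phrases)} unique phrases, "
      f"{min(lengths)}-{max(lengths)} words per phrase")

Path("corpus.txt").write_text("\n".join(phrases) + "\n", encoding="utf-8")
```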

### 2. Audio Generation

Generate or record training audio as 22050 Hz mono WAV.

**If using voice cloning (e.g., Chatterbox TTS):**
- Generate at source sample rate (often 24kHz)
- Convert to 22050Hz: `sox -v 0.95 input.wav -r 22050 -t wav output.wav`
- The `-v 0.95` prevents clipping during resampling
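If you have a whole directory of cloned audio to convert, the same sox invocation can be scripted; this is a minimal sketch in which the directory names are placeholders and sox is assumed to be on your PATH:

```python
# Batch-resample cloned audio to 22050 Hz mono WAV using the sox command
# shown above (-c 1 forces mono output). Directory names are placeholders.
import subprocess
from pathlib import Path

src_dir = Path("cloned_24khz")
dst_dir = Path("wavs_22050")
dst_dir.mkdir(exist_ok=True)

for wav in sorted(src_dir.glob("*.wav")):
    subprocess.run(
        ["sox", "-v", "0.95", str(wav),
         "-r", "22050", "-c", "1", "-t", "wav", str(dst_dir / wav.name)],
        check=True,
    )
```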

**Recording requirements:**
- Consistent microphone position and room acoustics
- Minimal background noise
- Natural speaking pace (not reading voice)

### 3. Quality Validation with Whisper

Automate quality checks instead of listening to every sample manually:

```python
import whisper
from piper_phonemize import phonemize_text

model = whisper.load_model("base")

def validate_sample(audio_path, expected_text):
    result = model.transcribe(audio_path)
    transcribed = result["text"].strip()

    # Compare phonemically to handle spelling/punctuation differences
    expected_phonemes = phonemize_text(expected_text, "en-gb")
    transcribed_phonemes = phonemize_text(transcribed, "en-gb")

    return expected_phonemes == transcribed_phonemes
```

Retry failed samples up to 3 times. Target 95%+ dataset coverage.
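Building on `validate_sample` above, a validation loop with retries might look like the sketch below. `generate_audio()` is a hypothetical placeholder for your own synthesis or re-recording step and is not part of Piper.

```python
# Validate every sample, regenerating failures up to 3 times, then report
# dataset coverage. generate_audio() is a placeholder for your own
# synthesis/recording step.
from pathlib import Path

MAX_RETRIES = 3

def validate_dataset(samples, wav_dir):
    """samples: dict mapping sample id -> expected text."""
    passed = 0
    for sample_id, text in samples.items():
        wav_path = Path(wav_dir) / f"{sample_id}.wav"
        ok = False
        for _ in range(MAX_RETRIES):
            if validate_sample(str(wav_path), text):
                ok = True
                break
            generate_audio(text, wav_path)  # hypothetical: re-synthesise and retry
        passed += ok
    coverage = passed / len(samples)
    print(f"Coverage: {coverage:.1%} (target: 95%+)")
    return coverage
```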

### 4. Dataset Format (LJSpeech)

Structure your dataset:
```
dataset/
├── metadata.csv
└── wavs/
    ├── sample_0001.wav
    ├── sample_0002.wav
    └── ...
```

**metadata.csv format:** `{id}|{text}` (pipe-separated, no headers)
```
sample_0001|The quick brown fox jumps over the lazy dog.
sample_0002|Pack my box with five dozen liquor jugs.
```

### 5. Preprocessing

Convert to PyTorch tensors:
```bash
python3 -m piper_train.preprocess \
    --language en-gb \
    --input-dir dataset/ \
    --output-dir piper_training_dir/ \
    --dataset-format ljspeech
```

Use `en-gb` for Australian/NZ/UK voices (espeak-ng phoneme set).
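Preprocessing writes a `config.json` and a `dataset.jsonl` into the output directory. A quick sanity check is sketched below; the exact keys inside `config.json` are assumed from typical piper_train output, so they are read defensively.

```python
# Sanity-check the preprocessing output: report the configured sample rate,
# espeak voice, and the number of preprocessed utterances.
import json
from pathlib import Path

out_dir = Path("piper_training_dir")

config = json.loads((out_dir / "config.json").read_text())
print("sample rate:", config.get("audio", {}).get("sample_rate"))
print("espeak voice:", config.get("espeak", {}).get("voice"))

num_utterances = sum(1 for _ in (out_dir / "dataset.jsonl").open())
print("utterances preprocessed:", num_utterances)
```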

### 6. Training

**Fine-tuning (recommended):**
```bash
python3 -m piper_train \
    --dataset-dir piper_training_dir/ \
    --accelerator gpu \
    --devices 1 \
    --batch-size 12 \
    --max_epochs 3000 \
    --resume_from_checkpoint ljspeech-2000.ckpt \
    --checkpoint-epochs 100 \
    --quality high \
    --precision 32
```

**Key parameters:**
- `--batch-size`: Reduce if VRAM limited (12 works on 8GB)
- `--resume_from_checkpoint`: Start from LJSpeech high-quality checkpoint
- `--precision 32`: More stable than mixed precision
- `--validation-split 0.0 --num-test-examples 0`: Skip validation for small datasets

Monitor with TensorBoard: watch `loss_disc_all` for convergence.
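If you prefer to check convergence from a script rather than the TensorBoard UI, the event files can be read directly. The log directory layout (`lightning_logs/version_0` under the dataset dir) and the scalar tag name are assumptions based on PyTorch Lightning defaults and the metric named above; adjust to match your run.

```python
# Print the most recent loss_disc_all values from the Lightning event files.
# Log path and tag name are assumptions; adjust to match your run.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

log_dir = "piper_training_dir/lightning_logs/version_0"
acc = EventAccumulator(log_dir)
acc.Reload()

for event in acc.Scalars("loss_disc_all")[-5:]:
    print(f"step {event.step}: loss_disc_all = {event.value:.4f}")
```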

### 7. ONNX Export

```bash
python3 -m piper_train.export_onnx checkpoint.ckpt output.onnx.unoptimized
onnxsim output.onnx.unoptimized output.onnx
```

Create the voice metadata file `output.onnx.json` by copying the training `config.json` alongside the exported model.
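Since the runtime metadata is essentially the training config, a straight copy is enough; a minimal sketch with assumed paths:

```python
# Copy the training config alongside the exported model so Piper can load it
# as <model>.onnx.json at runtime. Paths are placeholders.
import shutil

shutil.copyfile("piper_training_dir/config.json", "output.onnx.json")
```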

## Localisation for Australian, New Zealand and UK English

Piper uses espeak-ng for phonemisation. American pronunciations in training data cause accent drift.

**Corpus preparation:**
- Run `scripts/convert_spelling.py` on corpus text before training
- Use `en-gb` or `en-au` espeak-ng voice for phonemisation
- Review generated phonemes for Americanisms

**Common spelling conversions:**
| American | Australian/UK |
|----------|---------------|
| -ize | -ise |
| -or | -our |
| -er | -re |
| -og | -ogue |
| -ense | -ence |
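The table describes suffix patterns, but applying them blindly breaks words such as "doctor" or "error". The skill's `scripts/convert_spelling.py` is not reproduced here; as an illustration of the idea, the sketch below uses an explicit word map with whole-word matching. The entries are an illustrative subset, not the skill's actual list.

```python
# Convert a handful of common American spellings to Australian/UK forms
# using whole-word replacement. Illustrative subset only.
import re

SPELLING_MAP = {
    "color": "colour",
    "flavor": "flavour",
    "center": "centre",
    "theater": "theatre",
    "organize": "organise",
    "recognize": "recognise",
    "catalog": "catalogue",
    "dialog": "dialogue",
    "defense": "defence",
    "license": "licence",  # noun form only
}

def convert_spelling(text: str) -> str:
    def repl(match: re.Match) -> str:
        word = match.group(0)
        replacement = SPELLING_MAP[word.lower()]
        return replacement.capitalize() if word[0].isupper() else replacement
    pattern = r"\b(" + "|".join(SPELLING_MAP) + r")\b"
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)

print(convert_spelling("The theater uses color-coded dialog boxes."))
# -> "The theatre uses colour-coded dialogue boxes."
```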

**Phoneme considerations:**
- /r/ linking and intrusion patterns differ
- Vowel sounds in words like "dance", "bath", "castle"
- Final -ile pronunciation (hostile, missile)

For complete word lists and phonetic details, see [references/localisation.md](references/localisation.md).

**Validation:** Use Whisper with `language="en"` and verify transcriptions match expected regional forms.

## Dependencies

Pin versions to avoid API breakage:
```
pytorch-lightning==1.9.3
torch<2.6.0
piper-phonemize
onnxruntime-gpu
onnxsim
```

Docker containerisation is recommended for reproducibility.

## Hardware Requirements

**Minimum (fine-tuning):**
- 8GB VRAM GPU (Pascal or newer)
- 8GB system RAM
- ~5 days for 1,000 epochs on Tesla P4

**From scratch:** Expect weeks to months rather than days (roughly 10x the data and at least double the epochs; see Overview).

## Troubleshooting

| Issue | Solution |
|-------|----------|
| CUDA OOM | Reduce batch-size (try 8 or 4) |
| Checkpoint won't load | Check pytorch-lightning version matches checkpoint |
| Garbled output | Insufficient training epochs or dataset too small |
| Wrong accent | Check espeak-ng language code and corpus spelling |

Overview

This skill trains custom text-to-speech voices for Piper and exports optimized ONNX models for fast, offline runtime on devices like Raspberry Pi and Home Assistant. It supports fine-tuning existing Piper checkpoints or training from scratch and includes tools for dataset preparation, Whisper-based validation, training orchestration, and ONNX export. The workflow is designed for reproducible results on modest GPUs and embedded deployments.

How this skill works

Prepare a phoneme-complete corpus and 22.05 kHz mono WAV audio, validate samples automatically with Whisper transcription and phoneme comparison, then preprocess into a PyTorch-friendly dataset. Train by fine-tuning a high-quality checkpoint (recommended) or training from scratch, monitor convergence with TensorBoard, and export the final model to ONNX, simplifying it with onnxsim. Create a JSON metadata file for runtime configuration and deploy the ONNX model to target devices.

When to use it

  • Creating a new synthetic voice for a product or assistant.
  • Fine-tuning an existing Piper checkpoint to match a target speaker or style.
  • Preparing and validating an audio corpus for TTS training.
  • Exporting and deploying compact TTS models to Raspberry Pi or Home Assistant.
  • Adapting voice models to regional English variants (UK/AU/NZ).

Best practices

  • Prefer fine-tuning with ~1,300+ phrases for practical training time; reserve from-scratch training for very large corpora (13,000+ phrases).
  • Record or generate 22.05 kHz mono WAV; if source differs, resample with sox using -v 0.95 to avoid clipping.
  • Automate quality checks with Whisper and phonemisation to target 95%+ validated samples.
  • Use espeak-ng language codes (en-gb/en-au) and run spelling conversions to avoid accent drift.
  • Pin dependency versions and use a Docker container for reproducible environments.

Example use cases

  • Fine-tune a neutral TTS voice to sound like a brand narrator for an embedded kiosk.
  • Train a regional English voice (UK/AU/NZ) by converting spelling and using en-gb phonemisation.
  • Validate and clean a mixed-quality recording dataset using Whisper before training.
  • Export a compact ONNX model for offline Home Assistant voice responses on Raspberry Pi.

FAQ

How many phrases do I need to fine-tune vs train from scratch?

Fine-tuning works well with ~1,300–1,500 phrases; training from scratch typically needs 13,000+ phrases.

What sample rate and format are required?

Use 22050 Hz mono WAV for training. If your source is different, resample to 22.05 kHz before preprocessing.

How do I avoid accent drift for UK/AU/NZ English?

Convert American spellings, use en-gb or en-au espeak-ng phonemisation, and validate with Whisper phoneme comparisons.