
This skill guides TTS audio mastering from cleanup to delivery specs, ensuring loudness consistency, proper timing, and export readiness.

npx playbooks add skill benchflow-ai/skillsbench --skill text-to-speech

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
2.3 KB
---
name: "TTS Audio Mastering"
description: "Practical mastering steps for TTS audio: cleanup, loudness normalization, alignment, and delivery specs."
---

# SKILL: TTS Audio Mastering

This skill focuses on producing clean, consistent, and delivery-ready TTS audio for video tasks. It covers speech cleanup, loudness normalization, segment boundaries, and export specs.

## 1. TTS Engine & Output Basics

Choose a TTS engine based on deployment constraints and quality needs:

* **Neural offline** (e.g., Kokoro): stable, high quality, no network dependency.
* **Cloud TTS** (e.g., Edge-TTS / OpenAI TTS): convenient and often more natural, but network-dependent.
* **Formant TTS** (e.g., espeak-ng): for prototyping only; often less natural.

**Key rule:** Always confirm the **native sample rate** of the generated audio before resampling for video delivery.
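
A quick way to confirm this is to probe the file before any resampling; a minimal sketch with FFmpeg's `ffprobe` (`segment.wav` is a placeholder filename):

```bash
# Print the native sample rate and channel count of a TTS segment
# (segment.wav is a placeholder filename).
ffprobe -v error -select_streams a:0 \
  -show_entries stream=sample_rate,channels \
  -of default=noprint_wrappers=1 segment.wav
```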

---

## 2. Speech Cleanup (Per Segment)

Apply lightweight processing to avoid common artifacts:

* **Rumble/DC removal:** high-pass filter around **20 Hz**
* **Harshness control:** optional low-pass around **16 kHz** (helps remove digital fizz)
* **Click/pop prevention:** short fades at boundaries (e.g., **50 ms** fade-in and fade-out)

Recommended FFmpeg pattern (example):

* Add filters in a single chain, and keep them consistent across segments.
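
A minimal sketch of such a chain, assuming FFmpeg and placeholder filenames (`in.wav` / `out.wav`); the fade-out start time must be set to the segment duration minus 0.05 s (a 4.0 s segment is assumed here):

```bash
# High-pass at 20 Hz, low-pass at 16 kHz, 50 ms fade-in and fade-out.
# The 3.95 fade-out start assumes a 4.0 s segment; compute it per segment.
ffmpeg -i in.wav \
  -af "highpass=f=20,lowpass=f=16000,afade=t=in:st=0:d=0.05,afade=t=out:st=3.95:d=0.05" \
  out.wav
```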

---

## 3. Loudness Normalization

Target loudness depends on the benchmark/task spec. A common broadcast-style target, measured per ITU-R BS.1770 (as used by EBU R128):

* **Integrated loudness:** **-23 LUFS**
* **True peak:** around **-1.5 dBTP**
* **LRA:** around **11 LU** (optional)

Recommended workflow:

1. **Measure loudness** using FFmpeg `ebur128` (or equivalent meter).
2. **Apply normalization** (e.g., `loudnorm`) as the final step after cleanup and timing edits.
3. If you adjust tempo or duration after normalization, measure and normalize again.
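
A sketch of this measure-then-normalize pattern with FFmpeg (`mix.wav` and `normalized.wav` are placeholder filenames; a two-pass `loudnorm` run that feeds the measured values back in is more accurate than the single pass shown):

```bash
# 1) Measure integrated loudness, LRA, and true peak (no output file written).
ffmpeg -i mix.wav -af ebur128=peak=true -f null -

# 2) Normalize as the final step. For a two-pass run, first use
#    loudnorm=...:print_format=json and feed the measured_* values
#    into this second invocation.
ffmpeg -i mix.wav -af loudnorm=I=-23:TP=-1.5:LRA=11 -ar 48000 normalized.wav
```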

---

## 4. Timing & Segment Boundary Handling

When stitching segment-level TTS into a full track:

* Match each segment to its target window as closely as possible.
* If a segment is shorter than its window, pad with silence.
* If a segment is longer, use gentle duration control (small speed change) or truncate carefully.
* Always apply boundary fades after padding/trimming to avoid clicks.

**Sync guideline:** keep end-to-end drift small (e.g., **<= 0.2s**) unless the task states otherwise.
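
A minimal sketch of the padding, duration-control, and stitching steps (all filenames, the 3.20 s window, and `segments.txt` are placeholders; `apad=whole_dur` pads with silence up to the target window):

```bash
# Pad a short segment with silence up to its 3.20 s target window,
# then re-apply the short boundary fades.
ffmpeg -i seg_007.wav \
  -af "apad=whole_dur=3.20,afade=t=in:st=0:d=0.05,afade=t=out:st=3.15:d=0.05" \
  seg_007_fit.wav

# Gently speed up a segment that runs a few percent long (keep atempo near 1.0).
ffmpeg -i seg_012.wav -af "atempo=1.05" seg_012_fit.wav

# Stitch the fitted segments in timeline order; segments.txt contains one
# line per segment in the form: file 'seg_007_fit.wav'
ffmpeg -f concat -safe 0 -i segments.txt -c copy full_track.wav
```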

Overview

This skill describes practical mastering steps to produce clean, consistent, and delivery-ready TTS audio for video tasks. It covers engine selection, per-segment cleanup, loudness normalization, timing alignment, and final export specifications. The guidance is concise and focused on repeatable, minimal-processing workflows that avoid common artifacts.

How this skill works

Inspect TTS output for native sample rate and quality characteristics, then apply a minimal chain of filters per segment: rumble/DC removal, optional high-frequency taming, and short fades to prevent clicks. Measure integrated loudness and true-peak, apply loudness normalization as the final processing step, and handle segment timing with padding or gentle duration control before final fades and export.

When to use it

  • Preparing TTS tracks for video voiceover delivery where consistent loudness and timing are required.
  • Batch-processing many short TTS segments that must be stitched into a single timeline.
  • When you need reproducible audio specs (LUFS, true peak, sample rate) for broadcast or streaming.
  • Prototyping delivery-ready TTS with minimal artifacts for client review or automated pipelines.

Best practices

  • Confirm the TTS engine's native sample rate before any resampling to avoid unnecessary quality loss.
  • Keep per-segment processing lightweight: high-pass ~20 Hz, optional low-pass ~16 kHz, and 50 ms fades at boundaries.
  • Measure loudness first, then apply loudness normalization (e.g., loudnorm) as the final step; re-normalize after any timing edits.
  • Pad short segments with silence or gently speed-control long segments; always apply fades after padding/trimming.
  • Aim for integrated loudness around -23 LUFS and true peak near -1.5 dBTP unless a different spec is required.

Example use cases

  • Stitching hundreds of short TTS segments into a narration track for an explainer video while keeping timing tight.
  • Delivering TTS voiceovers to a post team with strict LUFS and true-peak requirements for broadcast.
  • Automating a pipeline that generates, cleans, and exports TTS audio with consistent loudness and fade settings.
  • Quickly prototyping alternative TTS engines and comparing output that has been normalized and aligned to timeline windows.

FAQ

What loudness target should I use?

Use the task or platform spec. A common default is -23 LUFS integrated with a true peak around -1.5 dBTP. Adapt if your delivery platform requires different targets.

When should I re-normalize loudness?

Re-normalize after any edit that changes duration or dynamics, such as trimming, padding, or time-stretching segments.