
This skill guides TTS audio mastering from cleanup to delivery specs, ensuring loudness consistency, proper timing, and export readiness.

npx playbooks add skill benchflow-ai/skillsbench --skill text-to-speech

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
2.3 KB
---
name: "TTS Audio Mastering"
description: "Practical mastering steps for TTS audio: cleanup, loudness normalization, alignment, and delivery specs."
---

# SKILL: TTS Audio Mastering

This skill focuses on producing clean, consistent, and delivery-ready TTS audio for video tasks. It covers speech cleanup, loudness normalization, segment boundaries, and export specs.

## 1. TTS Engine & Output Basics

Choose a TTS engine based on deployment constraints and quality needs:

* **Neural offline** (e.g., Kokoro): stable, high quality, no network dependency.
* **Cloud TTS** (e.g., Edge-TTS / OpenAI TTS): convenient and often more natural, but network-dependent.
* **Formant TTS** (e.g., espeak-ng): for prototyping only; often less natural.

**Key rule:** Always confirm the **native sample rate** of the generated audio before resampling for video delivery.
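
A quick way to confirm this is to probe the file before any resampling; a minimal sketch with FFmpeg's `ffprobe` (`segment.wav` is a placeholder filename):

```bash
# Print the native sample rate and channel count of a TTS segment
# (segment.wav is a placeholder filename).
ffprobe -v error -select_streams a:0 \
  -show_entries stream=sample_rate,channels \
  -of default=noprint_wrappers=1 segment.wav
```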

---

## 2. Speech Cleanup (Per Segment)

Apply lightweight processing to avoid common artifacts:

* **Rumble/DC removal:** high-pass filter around **20 Hz**
* **Harshness control:** optional low-pass around **16 kHz** (helps remove digital fizz)
* **Click/pop prevention:** short fades at boundaries (e.g., **50 ms** fade-in and fade-out)

Recommended FFmpeg pattern (example):

* Add filters in a single chain, and keep them consistent across segments.
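
A minimal sketch of such a chain, assuming FFmpeg and placeholder filenames (`in.wav` / `out.wav`); the fade-out start time must be set to the segment duration minus 0.05 s (a 4.0 s segment is assumed here):

```bash
# High-pass at 20 Hz, low-pass at 16 kHz, 50 ms fade-in and fade-out.
# The 3.95 fade-out start assumes a 4.0 s segment; compute it per segment.
ffmpeg -i in.wav \
  -af "highpass=f=20,lowpass=f=16000,afade=t=in:st=0:d=0.05,afade=t=out:st=3.95:d=0.05" \
  out.wav
```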

---

## 3. Loudness Normalization

Target loudness depends on the benchmark/task spec. A common broadcast-style target, measured per ITU-R BS.1770 (as used by EBU R128):

* **Integrated loudness:** **-23 LUFS**
* **True peak:** around **-1.5 dBTP**
* **LRA:** around **11 LU** (optional)

Recommended workflow:

1. **Measure loudness** using FFmpeg `ebur128` (or equivalent meter).
2. **Apply normalization** (e.g., `loudnorm`) as the final step after cleanup and timing edits.
3. If you adjust tempo or duration after normalization, measure and normalize again.
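
A sketch of this measure-then-normalize pattern with FFmpeg (`mix.wav` and `normalized.wav` are placeholder filenames; a two-pass `loudnorm` run that feeds the measured values back in is more accurate than the single pass shown):

```bash
# 1) Measure integrated loudness, LRA, and true peak (no output file written).
ffmpeg -i mix.wav -af ebur128=peak=true -f null -

# 2) Normalize as the final step. For a two-pass run, first use
#    loudnorm=...:print_format=json and feed the measured_* values
#    into this second invocation.
ffmpeg -i mix.wav -af loudnorm=I=-23:TP=-1.5:LRA=11 -ar 48000 normalized.wav
```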

---

## 4. Timing & Segment Boundary Handling

When stitching segment-level TTS into a full track:

* Match each segment to its target window as closely as possible.
* If a segment is shorter than its window, pad with silence.
* If a segment is longer, use gentle duration control (small speed change) or truncate carefully.
* Always apply boundary fades after padding/trimming to avoid clicks.

**Sync guideline:** keep end-to-end drift small (e.g., **<= 0.2s**) unless the task states otherwise.
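
A minimal sketch of the padding, duration-control, and stitching steps (all filenames, the 3.20 s window, and `segments.txt` are placeholders; `apad=whole_dur` pads with silence up to the target window):

```bash
# Pad a short segment with silence up to its 3.20 s target window,
# then re-apply the short boundary fades.
ffmpeg -i seg_007.wav \
  -af "apad=whole_dur=3.20,afade=t=in:st=0:d=0.05,afade=t=out:st=3.15:d=0.05" \
  seg_007_fit.wav

# Gently speed up a segment that runs a few percent long (keep atempo near 1.0).
ffmpeg -i seg_012.wav -af "atempo=1.05" seg_012_fit.wav

# Stitch the fitted segments in timeline order; segments.txt contains one
# line per segment in the form: file 'seg_007_fit.wav'
ffmpeg -f concat -safe 0 -i segments.txt -c copy full_track.wav
```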

Overview

This skill describes practical mastering steps to produce clean, consistent, and delivery-ready TTS audio for video tasks. It covers engine selection, per-segment cleanup, loudness normalization, timing alignment, and final export specifications. The guidance is concise and focused on repeatable, minimal-processing workflows that avoid common artifacts.

How this skill works

Inspect TTS output for native sample rate and quality characteristics, then apply a minimal chain of filters per segment: rumble/DC removal, optional high-frequency taming, and short fades to prevent clicks. Measure integrated loudness and true-peak, apply loudness normalization as the final processing step, and handle segment timing with padding or gentle duration control before final fades and export.

When to use it

  • Preparing TTS tracks for video voiceover delivery where consistent loudness and timing are required.
  • Batch-processing many short TTS segments that must be stitched into a single timeline.
  • When you need reproducible audio specs (LUFS, true peak, sample rate) for broadcast or streaming.
  • Prototyping delivery-ready TTS with minimal artifacts for client review or automated pipelines.

Best practices

  • Confirm the TTS engine's native sample rate before any resampling to avoid unnecessary quality loss.
  • Keep per-segment processing lightweight: high-pass ~20 Hz, optional low-pass ~16 kHz, and 50 ms fades at boundaries.
  • Measure loudness first, then apply loudness normalization (e.g., loudnorm) as the final step; re-normalize after any timing edits.
  • Pad short segments with silence or gently speed-control long segments; always apply fades after padding/trimming.
  • Aim for integrated loudness around -23 LUFS and true peak near -1.5 dBTP unless a different spec is required.

Example use cases

  • Stitching hundreds of short TTS segments into a narration track for an explainer video while keeping timing tight.
  • Delivering TTS voiceovers to a post team with strict LUFS and true-peak requirements for broadcast.
  • Automating a pipeline that generates, cleans, and exports TTS audio with consistent loudness and fade settings.
  • Quickly prototyping alternative TTS engines and comparing output that has been normalized and aligned to timeline windows.

FAQ

What loudness target should I use?

Use the task or platform spec. A common default is -23 LUFS integrated with a true peak around -1.5 dBTP. Adapt if your delivery platform requires different targets.

When should I re-normalize loudness?

Re-normalize after any edit that changes duration or dynamics, such as trimming, padding, or time-stretching segments.