This skill converts text to speech using gTTS, handling long texts by chunking and concatenating audio for audiobooks and podcasts.

npx playbooks add skill benchflow-ai/skillsbench --skill gtts

---
name: gtts
description: "Google Text-to-Speech (gTTS) for converting text to audio. Use when creating audiobooks, podcasts, or speech synthesis from text. Handles long text by chunking at sentence boundaries and concatenating audio segments with pydub."
---

# Google Text-to-Speech (gTTS)

gTTS is a Python library that converts text to speech through Google Translate's text-to-speech API. It is free to use and does not require an API key.

## Installation

```bash
pip install gtts pydub
```

pydub is used to load, manipulate, and concatenate audio files. It needs ffmpeg (or libav) installed to read and write MP3.

## Basic Usage

```python
from gtts import gTTS

# Create speech
tts = gTTS(text="Hello, world!", lang='en')

# Save to file
tts.save("output.mp3")
```

## Language Options

```python
# US English (default)
tts = gTTS(text="Hello", lang='en')

# British English
tts = gTTS(text="Hello", lang='en', tld='co.uk')

# Slow speech
tts = gTTS(text="Hello", lang='en', slow=True)
```
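The `tld` argument selects which Google domain serves the request, which is what changes the accent. A minimal sketch of a lookup helper follows. The `ACCENTS` table and the `accent_kwargs` name are illustrative and not part of gTTS, and the domain values beyond `co.uk` are assumptions based on the accents listed in the gTTS documentation, so check them against your installed version.

```python
# Hypothetical helper: maps a friendly accent name to gTTS keyword arguments.
# The tld values other than 'co.uk' are assumptions to verify locally.
ACCENTS = {
    "en-us": {"lang": "en", "tld": "com"},
    "en-uk": {"lang": "en", "tld": "co.uk"},
    "en-au": {"lang": "en", "tld": "com.au"},
    "en-in": {"lang": "en", "tld": "co.in"},
}


def accent_kwargs(name, slow=False):
    """Return keyword arguments for gTTS(...) for a named accent."""
    try:
        kwargs = dict(ACCENTS[name.lower()])  # copy so the table is not mutated
    except KeyError:
        raise ValueError(
            f"unknown accent {name!r}; choose from {sorted(ACCENTS)}"
        ) from None
    kwargs["slow"] = slow
    return kwargs


# Usage (needs network access):
#   from gtts import gTTS
#   gTTS(text="Hello", **accent_kwargs("en-uk")).save("hello_uk.mp3")
```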

## Python Example for Long Text

```python
from gtts import gTTS
from pydub import AudioSegment
import tempfile
import os
import re

def chunk_text(text, max_chars=4500):
    """Split text into chunks at sentence boundaries."""
    sentences = re.split(r'(?<=[.!?])\s+', text)
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        if len(current_chunk) + len(sentence) < max_chars:
            current_chunk += sentence + " "
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence + " "

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks


def text_to_audiobook(text, output_path):
    """Convert long text to a single audio file."""
    # Drop empty chunks: gTTS refuses to synthesize empty text
    chunks = [chunk for chunk in chunk_text(text) if chunk]
    if not chunks:
        raise ValueError("text is empty; nothing to convert")

    combined = AudioSegment.empty()

    for chunk in chunks:
        # Reserve a temp file name for this chunk
        with tempfile.NamedTemporaryFile(suffix='.mp3', delete=False) as tmp:
            tmp_path = tmp.name

        try:
            # Generate speech
            tts = gTTS(text=chunk, lang='en', slow=False)
            tts.save(tmp_path)

            # Load and append
            combined += AudioSegment.from_mp3(tmp_path)
        finally:
            # Cleanup, even if gTTS or pydub raised
            os.unlink(tmp_path)

    # Export
    combined.export(output_path, format="mp3")
```
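`chunk_text` keeps whole sentences together, so a single sentence longer than `max_chars` (a long list, or text with no punctuation) still comes out as one oversized chunk. A minimal sketch of a fallback that splits such chunks at word boundaries, assuming a hypothetical helper named `split_oversized`:

```python
def split_oversized(chunks, max_chars=4500):
    """Split any chunk longer than max_chars at word boundaries."""
    result = []
    for chunk in chunks:
        if len(chunk) <= max_chars:
            result.append(chunk)
            continue

        current = ""
        for word in chunk.split():
            # A single word that is itself too long gets a hard split.
            while len(word) > max_chars:
                if current:
                    result.append(current)
                    current = ""
                result.append(word[:max_chars])
                word = word[max_chars:]

            candidate = f"{current} {word}" if current else word
            if len(candidate) <= max_chars:
                current = candidate
            else:
                result.append(current)
                current = word

        if current:
            result.append(current)
    return result
```

It can be applied directly to the chunker's output, for example `chunks = split_oversized(chunk_text(text))`.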

## Handling Large Documents

Treat roughly 5,000 characters as the working limit for a single gTTS call; the example above uses `max_chars=4500` to leave headroom. For long documents:

1. Split text into chunks at sentence boundaries
2. Generate audio for each chunk using gTTS
3. Use pydub to concatenate the chunks
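
The steps above can also be sketched with the speech step passed in as a function, so finished chunks stay on disk and a failed run can resume where it stopped. `synthesize_chunks` and its `synthesize(text, path)` callback are illustrative names, not part of gTTS. With gTTS the callback could be something like `lambda text, path: gTTS(text=text, lang='en').save(path)`.

```python
import os


def synthesize_chunks(chunks, out_dir, synthesize):
    """Write one MP3 per chunk into out_dir and return the paths in order.

    Chunks whose file already exists and is non-empty are skipped, so a
    rerun only regenerates what is missing.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for index, chunk in enumerate(chunks, start=1):
        path = os.path.join(out_dir, f"chunk_{index:04d}.mp3")
        if not (os.path.exists(path) and os.path.getsize(path) > 0):
            partial = path + ".part"
            synthesize(chunk, partial)  # may raise; the final file is not created
            os.replace(partial, path)   # publish only a finished file
        paths.append(path)
    return paths
```

The returned paths are in reading order and can be handed to pydub for concatenation.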

## Alternative: Using ffmpeg for Concatenation

If you prefer ffmpeg over pydub:

```bash
# Create file list
echo "file 'chunk1.mp3'" > files.txt
echo "file 'chunk2.mp3'" >> files.txt

# Concatenate
ffmpeg -f concat -safe 0 -i files.txt -c copy output.mp3
```
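
When the chunk files are produced from Python, the file list and the command can be built programmatically. The sketch below assumes `ffmpeg` is on the `PATH`. The function names are hypothetical, and the quote escaping follows the concat list convention of closing the quote, adding an escaped quote, and reopening.

```python
import subprocess


def write_concat_list(mp3_paths, list_path="files.txt"):
    """Write an ffmpeg concat list, escaping single quotes in file names."""
    with open(list_path, "w", encoding="utf-8") as f:
        for path in mp3_paths:
            escaped = str(path).replace("'", "'\\''")
            f.write(f"file '{escaped}'\n")
    return list_path


def concat_command(list_path, output_path):
    """Build the same ffmpeg invocation as the shell example, as a list."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", output_path]


def concat_with_ffmpeg(mp3_paths, output_path, list_path="files.txt"):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits non-zero."""
    write_concat_list(mp3_paths, list_path)
    subprocess.run(concat_command(list_path, output_path), check=True)
```

Passing the command as a list avoids shell quoting issues with unusual file names.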

## Best Practices

- Split at sentence boundaries so the audio never breaks mid-sentence
- Use `slow=False` for natural speech speed
- Handle network errors gracefully (gTTS requires internet)
- Consider adding brief pauses between chapters/sections
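
The last point, pauses between sections, can be sketched with a small helper that relies only on the `+` operator, which pydub's `AudioSegment` supports for concatenation. `join_with_pauses` is a hypothetical name, and the 700 ms pause in the usage comment is an arbitrary illustrative value.

```python
def join_with_pauses(segments, pause):
    """Concatenate segments with `pause` inserted between neighbours."""
    if not segments:
        raise ValueError("no segments to join")
    combined = segments[0]
    for segment in segments[1:]:
        combined = combined + pause + segment
    return combined


# Usage with pydub (chapter_paths is a list of MP3 files you already have):
#   from pydub import AudioSegment
#   chapters = [AudioSegment.from_mp3(p) for p in chapter_paths]
#   book = join_with_pauses(chapters, AudioSegment.silent(duration=700))
#   book.export("book.mp3", format="mp3")
```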

## Limitations

- Requires internet connection (uses Google's servers)
- Voice quality is good but not as natural as paid services
- Limited voice customization options
- May have rate limits for very heavy usage
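
Since every request goes over the network and may be rate-limited, a retry wrapper with exponential backoff is a common mitigation. This is a minimal sketch with a hypothetical `with_retries` helper. The attempt count and delays are arbitrary defaults, and `gTTSError` in the usage comment is the exception gTTS raises for failed requests.

```python
import time


def with_retries(action, attempts=4, base_delay=1.0,
                 retry_on=(Exception,), sleep=time.sleep):
    """Call action() until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return action()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Usage (needs network access):
#   from gtts import gTTS, gTTSError
#   with_retries(lambda: gTTS(text=chunk, lang="en").save("chunk.mp3"),
#                retry_on=(gTTSError, OSError))
```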

Overview

This skill provides Google Text-to-Speech (gTTS) integration to convert text into spoken audio files. It handles both short snippets and long documents by chunking text at sentence boundaries and producing a single concatenated MP3. Use it to create audiobooks, narration for videos, or quick voice previews without needing an API key. The implementation relies on pydub (or ffmpeg) for audio assembly and cleanup.

How this skill works

The skill sends text to Google's TTS endpoint via the gTTS Python library and saves each response as an MP3. For long inputs it splits text into sentence-based chunks under the character limit, generates an MP3 per chunk, then concatenates segments using pydub (or optionally ffmpeg). A temporary file is created per chunk and deleted as soon as its audio has been loaded, which keeps disk use low. Because gTTS calls Google's servers, callers should handle network errors and rate limits.

When to use it

  • Generating audiobooks or long-form narrated content from plain text.
  • Producing voiceovers for videos, tutorials, or presentations quickly.
  • Creating podcasts or spoken previews from articles or blog posts.
  • Automating accessibility features like reading UI text or documents aloud.
  • Prototyping voice interactions when paid TTS services are not available.

Best practices

  • Chunk text at sentence boundaries to avoid mid-sentence cuts and stay under the per-request character limit.
  • Use pydub or ffmpeg to concatenate segments and add optional pauses between sections.
  • Set slow=False for natural pace; adjust language and tld for regional accents.
  • Implement retries and backoff to handle transient network or rate-limit errors.
  • Clean up temporary files immediately after processing to conserve disk space.

Example use cases

  • Convert a novel or long article into a single MP3 audiobook by chunking and concatenating.
  • Produce short narrated clips for social media posts from text summaries.
  • Batch-generate spoken versions of product descriptions for an e-commerce site.
  • Create accessibility audio for documentation or help centers on demand.
  • Prototype voice output for a chatbot or in-app assistant without a paid TTS API.

FAQ

Does this require an API key?

No. gTTS uses Google Translate's public text-to-speech endpoint and does not require an API key.

How are long texts handled?

Text is split into sentence-based chunks under the character limit, each chunk is converted to MP3, and segments are concatenated into one file.

Can I change voice or pitch?

gTTS offers limited customization (language and tld for regional variants); advanced voice controls are not available.