
speech-build skill


This skill enables commercial-grade speech synthesis and transcription using Gemini-TTS and Chirp 3, supporting multi-speaker voices and diarization.

npx playbooks add skill cnemri/google-genai-skills --skill speech-build

Review the files below or copy the command above to add this skill to your agents.

Files (6)
SKILL.md
---
name: speech-build
description: Generate and transcribe speech using Google's Gemini-TTS and Chirp 3 models. Supports Text-to-Speech (Single/Multi-speaker), Instant Custom Voice, and Speech-to-Text (Transcription/Diarization).
---

# Speech Skill (TTS & STT)

Use this skill to implement audio generation and transcription workflows using the `google-genai` and `google-cloud-speech` SDKs.

## Quick Start Setup

```python
from google import genai
from google.genai import types
# For STT: from google.cloud import speech_v2

client = genai.Client()
```

## Reference Materials

- **[Text-to-Speech (TTS)](references/tts.md)**: Gemini-TTS, Chirp 3 HD, Instant Custom Voice.
- **[Speech-to-Text (STT)](references/stt.md)**: Chirp 3 Transcription, Diarization, Streaming.
- **[Voices & Locales](references/voices.md)**: Available voices (`Aoede`, `Puck`...) and languages.
- **[Prompting Guide](references/prompting.md)**: How to control style, accent, and pacing in Gemini-TTS.
- **[Source Code](references/source_code.md)**: Deep inspection of SDK internals.

## Common Workflows

### 1. Generate Speech (Gemini-TTS)
```python
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents="Hello, world!",
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name='Kore')
            )
        )
    )
)
```
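The response carries raw PCM rather than a playable file. A minimal sketch of wrapping those bytes in a WAV container with the standard-library `wave` module, assuming Gemini-TTS's documented 24 kHz, 16-bit, mono output format (the helper name and response-traversal path are illustrative):

```python
import wave

def save_pcm_as_wav(pcm_bytes: bytes, path: str) -> None:
    """Wrap raw PCM from Gemini-TTS in a WAV container."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)       # mono output
        wf.setsampwidth(2)       # 16-bit samples
        wf.setframerate(24000)   # 24 kHz sample rate
        wf.writeframes(pcm_bytes)

# The PCM bytes live in the inline-data part of the response:
# pcm = response.candidates[0].content.parts[0].inline_data.data
# save_pcm_as_wav(pcm, "hello.wav")
```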

### 2. Transcribe Audio (Chirp 3)
```python
# Requires google-cloud-speech
from google.cloud import speech_v2

speech_client = speech_v2.SpeechClient()
# ... (See stt.md for the recognizer and RecognitionConfig setup)
response = speech_client.recognize(...)
```

Overview

This skill provides audio generation and transcription using Google's Gemini-TTS and Chirp 3 models. It supports single- and multi-speaker Text-to-Speech, Instant Custom Voice creation, and Speech-to-Text with optional diarization. The implementation relies on the google-genai and google-cloud-speech SDKs for robust, production-ready workflows.

How this skill works

The skill calls Gemini-TTS to generate audio from text, allowing voice selection, style control, and multi-speaker mixes. For transcription, it uses Chirp 3 (via google-cloud-speech) to produce transcripts with timestamps and optional speaker diarization. Typical flows instantiate a genai.Client for TTS and a speech_v2 client for STT, then send configuration objects that control response modality, voice config, and diarization.
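For the multi-speaker case, the flow is the same as single-speaker TTS but the speech config maps named speakers in the transcript to prebuilt voices. A hedged sketch following the google-genai multi-speaker pattern (the transcript, speaker names, and voice choices are illustrative; speaker labels in the text must match the `speaker` fields):

```python
from google import genai
from google.genai import types

client = genai.Client()

# Speaker labels in the prompt must match the `speaker` fields below.
transcript = "Joe: How's it going today, Jane?\nJane: Not too bad, how about you?"

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents=transcript,
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                speaker_voice_configs=[
                    types.SpeakerVoiceConfig(
                        speaker="Joe",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
                        ),
                    ),
                    types.SpeakerVoiceConfig(
                        speaker="Jane",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Puck")
                        ),
                    ),
                ]
            )
        ),
    ),
)
```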

When to use it

  • Generate natural-sounding audio for UI narration, podcasts, or automated messages.
  • Create custom or branded voices quickly using Instant Custom Voice features.
  • Transcribe meetings, interviews, or call recordings with speaker diarization.
  • Build multimodal experiences that need synchronized text and audio outputs.
  • Prototype voice assistants or accessibility features requiring high-quality TTS/STT.

Best practices

  • Choose an appropriate prebuilt voice and locale to match your audience and content tone.
  • Control speaking style and pacing through the prompting guide and speech_config parameters.
  • Use diarization for multi-speaker audio; split very long audio into segments to improve accuracy.
  • Normalize audio input (sample rate, channels) and prefer WAV/FLAC for STT quality.
  • Cache generated audio when reusing the same content to reduce cost and latency.
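As a concrete example of the normalization step, a stdlib-only sketch that downmixes a 16-bit stereo WAV to mono by averaging channels (the function name is illustrative; production pipelines typically use ffmpeg or a DSP library, which also handle resampling):

```python
import array
import wave

def stereo_wav_to_mono(src: str, dst: str) -> None:
    """Average left/right channels of a 16-bit stereo WAV into a mono WAV."""
    with wave.open(src, "rb") as wf:
        assert wf.getnchannels() == 2 and wf.getsampwidth() == 2
        rate = wf.getframerate()
        samples = array.array("h", wf.readframes(wf.getnframes()))
    # Interleaved L/R samples: average each adjacent pair into one mono sample.
    mono = array.array("h", ((samples[i] + samples[i + 1]) // 2
                             for i in range(0, len(samples), 2)))
    with wave.open(dst, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(mono.tobytes())
```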

Example use cases

  • Generate narrated product tours with a consistent branded voice using Gemini-TTS.
  • Transcribe and diarize webinar recordings to produce searchable meeting notes.
  • Create multi-character dialogue audio for games or interactive stories with multi-speaker TTS.
  • Build a voice-over pipeline that converts blog posts into downloadable podcasts.
  • Validate user-submitted audio by transcribing it and aligning captions with timestamps.

FAQ

Which clients do I instantiate for TTS and STT?

Use genai.Client() for Gemini-TTS and speech_v2.SpeechClient() from google.cloud.speech_v2 for Chirp 3 transcription.

Can I create a custom voice?

Yes. The Instant Custom Voice option lets you generate branded voices; follow the voice creation and prompt guidelines to control style.

How do I improve diarization accuracy?

Provide clear channel-separated audio when possible, segment long files, and tune diarization settings in the speech client configuration.
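A sketch of such a configuration with the speech_v2 client, assuming diarization is available for your chosen model and region; the model id, recognizer path, and file name are placeholders (see references/stt.md for the exact values):

```python
from google.cloud import speech_v2

speech_client = speech_v2.SpeechClient()

config = speech_v2.RecognitionConfig(
    auto_decoding_config=speech_v2.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="chirp_3",  # assumed model id; confirm in references/stt.md
    features=speech_v2.RecognitionFeatures(
        diarization_config=speech_v2.SpeakerDiarizationConfig(
            min_speaker_count=2,  # tune to the expected number of speakers
            max_speaker_count=4,
        ),
    ),
)

# Recognizer path and audio bytes are placeholders for your project setup.
with open("meeting.wav", "rb") as f:
    response = speech_client.recognize(
        request=speech_v2.RecognizeRequest(
            recognizer="projects/PROJECT_ID/locations/global/recognizers/_",
            config=config,
            content=f.read(),
        )
    )
```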