
speech-build skill


This skill enables commercial-grade speech synthesis and transcription using Gemini-TTS and Chirp 3, supporting multi-speaker voices and diarization.

npx playbooks add skill cnemri/google-genai-skills --skill speech-build

Review the files below or copy the command above to add this skill to your agents.

Files (6)
SKILL.md
---
name: speech-build
description: Generate and transcribe speech using Google's Gemini-TTS and Chirp 3 models. Supports Text-to-Speech (Single/Multi-speaker), Instant Custom Voice, and Speech-to-Text (Transcription/Diarization).
---

# Speech Skill (TTS & STT)

Use this skill to implement audio generation and transcription workflows using the `google-genai` and `google-cloud-speech` SDKs.

## Quick Start Setup

```python
from google import genai
from google.genai import types
# For STT: from google.cloud import speech_v2

client = genai.Client()
```

## Reference Materials

- **[Text-to-Speech (TTS)](references/tts.md)**: Gemini-TTS, Chirp 3 HD, Instant Custom Voice.
- **[Speech-to-Text (STT)](references/stt.md)**: Chirp 3 Transcription, Diarization, Streaming.
- **[Voices & Locales](references/voices.md)**: Available voices (`Aoede`, `Puck`...) and languages.
- **[Prompting Guide](references/prompting.md)**: How to control style, accent, and pacing in Gemini-TTS.
- **[Source Code](references/source_code.md)**: Deep inspection of SDK internals.

## Common Workflows

### 1. Generate Speech (Gemini-TTS)
```python
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents="Hello, world!",
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name='Kore')
            )
        )
    )
)
```
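The response carries raw PCM rather than a playable file. A minimal sketch of wrapping those bytes in a WAV container with the standard-library `wave` module, assuming Gemini-TTS's documented 24 kHz, 16-bit, mono output format (the helper name and response-traversal path are illustrative):

```python
import wave

def save_pcm_as_wav(pcm_bytes: bytes, path: str) -> None:
    """Wrap raw PCM from Gemini-TTS in a WAV container."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)       # mono output
        wf.setsampwidth(2)       # 16-bit samples
        wf.setframerate(24000)   # 24 kHz sample rate
        wf.writeframes(pcm_bytes)

# The PCM bytes live in the inline-data part of the response:
# pcm = response.candidates[0].content.parts[0].inline_data.data
# save_pcm_as_wav(pcm, "hello.wav")
```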

### 2. Transcribe Audio (Chirp 3)
```python
# Requires google-cloud-speech
from google.cloud import speech_v2

speech_client = speech_v2.SpeechClient()
# ... (See stt.md for the recognizer and RecognitionConfig setup)
response = speech_client.recognize(...)
```

Overview

This skill provides audio generation and transcription using Google's Gemini-TTS and Chirp 3 models. It supports single- and multi-speaker Text-to-Speech, Instant Custom Voice creation, and Speech-to-Text with optional diarization. The implementation relies on the google-genai and google-cloud-speech SDKs for robust, production-ready workflows.

How this skill works

The skill calls Gemini-TTS to generate audio from text, allowing voice selection, style control, and multi-speaker mixes. For transcription, it uses Chirp 3 (via google-cloud-speech) to produce transcripts with timestamps and optional speaker diarization. Typical flows instantiate a genai.Client for TTS and a speech_v2 client for STT, then send configuration objects that control response modality, voice config, and diarization.
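For the multi-speaker case, the flow is the same as single-speaker TTS but the speech config maps named speakers in the transcript to prebuilt voices. A hedged sketch following the google-genai multi-speaker pattern (the transcript, speaker names, and voice choices are illustrative; speaker labels in the text must match the `speaker` fields):

```python
from google import genai
from google.genai import types

client = genai.Client()

# Speaker labels in the prompt must match the `speaker` fields below.
transcript = "Joe: How's it going today, Jane?\nJane: Not too bad, how about you?"

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents=transcript,
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            multi_speaker_voice_config=types.MultiSpeakerVoiceConfig(
                speaker_voice_configs=[
                    types.SpeakerVoiceConfig(
                        speaker="Joe",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
                        ),
                    ),
                    types.SpeakerVoiceConfig(
                        speaker="Jane",
                        voice_config=types.VoiceConfig(
                            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Puck")
                        ),
                    ),
                ]
            )
        ),
    ),
)
```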

When to use it

  • Generate natural-sounding audio for UI narration, podcasts, or automated messages.
  • Create custom or branded voices quickly using Instant Custom Voice features.
  • Transcribe meetings, interviews, or call recordings with speaker diarization.
  • Build multimodal experiences that need synchronized text and audio outputs.
  • Prototype voice assistants or accessibility features requiring high-quality TTS/STT.

Best practices

  • Choose an appropriate prebuilt voice and locale to match your audience and content tone.
  • Control speaking style and pacing through the prompting guide and speech_config parameters.
  • Use diarization for multi-speaker audio; split very long audio into segments to improve accuracy.
  • Normalize audio input (sample rate, channels) and prefer WAV/FLAC for STT quality.
  • Cache generated audio when reusing the same content to reduce cost and latency.
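As a concrete example of the normalization step, a stdlib-only sketch that downmixes a 16-bit stereo WAV to mono by averaging channels (the function name is illustrative; production pipelines typically use ffmpeg or a DSP library, which also handle resampling):

```python
import array
import wave

def stereo_wav_to_mono(src: str, dst: str) -> None:
    """Average left/right channels of a 16-bit stereo WAV into a mono WAV."""
    with wave.open(src, "rb") as wf:
        assert wf.getnchannels() == 2 and wf.getsampwidth() == 2
        rate = wf.getframerate()
        samples = array.array("h", wf.readframes(wf.getnframes()))
    # Interleaved L/R samples: average each adjacent pair into one mono sample.
    mono = array.array("h", ((samples[i] + samples[i + 1]) // 2
                             for i in range(0, len(samples), 2)))
    with wave.open(dst, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(mono.tobytes())
```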

Example use cases

  • Generate narrated product tours with a consistent branded voice using Gemini-TTS.
  • Transcribe and diarize webinar recordings to produce searchable meeting notes.
  • Create multi-character dialogue audio for games or interactive stories with multi-speaker TTS.
  • Build a voice-over pipeline that converts blog posts into downloadable podcasts.
  • Validate user-submitted audio by transcribing it and aligning captions with timestamps.

FAQ

Which clients do I instantiate for TTS and STT?

Use genai.Client() for Gemini-TTS and speech_v2.SpeechClient() from google.cloud.speech_v2 for Chirp 3 transcription.

Can I create a custom voice?

Yes. The Instant Custom Voice option lets you generate branded voices; follow the voice creation and prompt guidelines to control style.

How do I improve diarization accuracy?

Provide clear channel-separated audio when possible, segment long files, and tune diarization settings in the speech client configuration.
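A sketch of such a configuration with the speech_v2 client, assuming diarization is available for your chosen model and region; the model id, recognizer path, and file name are placeholders (see references/stt.md for the exact values):

```python
from google.cloud import speech_v2

speech_client = speech_v2.SpeechClient()

config = speech_v2.RecognitionConfig(
    auto_decoding_config=speech_v2.AutoDetectDecodingConfig(),
    language_codes=["en-US"],
    model="chirp_3",  # assumed model id; confirm in references/stt.md
    features=speech_v2.RecognitionFeatures(
        diarization_config=speech_v2.SpeakerDiarizationConfig(
            min_speaker_count=2,  # tune to the expected number of speakers
            max_speaker_count=4,
        ),
    ),
)

# Recognizer path and audio bytes are placeholders for your project setup.
with open("meeting.wav", "rb") as f:
    response = speech_client.recognize(
        request=speech_v2.RecognizeRequest(
            recognizer="projects/PROJECT_ID/locations/global/recognizers/_",
            config=config,
            content=f.read(),
        )
    )
```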