speech-use skill

This skill enables speech synthesis, transcription, and voice cloning using Gemini-TTS, Chirp 3, and custom voices via Google's GenAI and Cloud SDKs.

npx playbooks add skill cnemri/google-genai-skills --skill speech-use

SKILL.md

---
name: speech-use
description: "Generate (TTS), Transcribe (STT), and Clone voices using Google's GenAI and Cloud Speech SDKs. Supports Gemini-TTS, Chirp 3, and Instant Custom Voice."
---

# Speech Use

Use this skill to perform Text-to-Speech (TTS), Speech-to-Text (STT), and Voice Cloning operations.

This skill uses portable Python scripts managed by `uv`.

## Prerequisites

1.  **Environment Variables**:
    *   `GOOGLE_API_KEY` (required for TTS via Gemini)
    *   `GOOGLE_CLOUD_PROJECT` (required for STT and voice cloning)
    *   `GOOGLE_APPLICATION_CREDENTIALS` (recommended for STT and voice cloning)

2.  **APIs Enabled**:
    *   Text-to-Speech API (`texttospeech.googleapis.com`)
    *   Speech-to-Text API (`speech.googleapis.com`)
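
A quick preflight check can catch a missing variable before any script runs. The sketch below is not part of the skill: the variable names come from the list above, while the task labels (`tts`, `stt`, `clone`) and the `check_env` helper are hypothetical names chosen for illustration.

```python
import os

# Variable names are taken from the Prerequisites list; the task labels
# ("tts", "stt", "clone") are hypothetical and exist only in this sketch.
REQUIRED = {
    "tts": ["GOOGLE_API_KEY"],
    "stt": ["GOOGLE_CLOUD_PROJECT"],
    "clone": ["GOOGLE_CLOUD_PROJECT"],
}
RECOMMENDED = {
    "tts": [],
    "stt": ["GOOGLE_APPLICATION_CREDENTIALS"],
    "clone": ["GOOGLE_APPLICATION_CREDENTIALS"],
}


def check_env(task, env=None):
    """Return (missing_required, missing_recommended) for one task."""
    env = os.environ if env is None else env
    # An empty string counts as unset, since it cannot authenticate anything.
    missing = [name for name in REQUIRED[task] if not env.get(name)]
    advised = [name for name in RECOMMENDED[task] if not env.get(name)]
    return missing, advised
```

Calling `check_env("stt")` before a transcription run returns two lists; an empty first list means every required variable is set.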

## Usage

### 1. Generate Speech (TTS)

Generate audio from text using Gemini-TTS.

**Standard Voice:**
```bash
uv run skills/speech-use/scripts/generate_speech.py "Hello world, this is a test." --voice Puck --output hello.wav
```

**Custom Voice (Cloned):**
```bash
uv run skills/speech-use/scripts/generate_speech.py "This is my custom voice speaking." --voice-cloning-key "YOUR_KEY_HERE" --output custom.wav
```
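
For readers who want to see roughly what a standard-voice request involves, here is a minimal sketch using the `google-genai` SDK. It is an illustration, not the contents of `generate_speech.py`, which may be implemented differently. It assumes the model returns raw 16-bit mono PCM at 24 kHz, so the bytes are wrapped in a WAV container before saving; `synthesize` and `save_pcm_as_wav` are hypothetical helper names.

```python
import wave


def save_pcm_as_wav(path, pcm, rate=24000, channels=1, sample_width=2):
    """Wrap raw PCM bytes in a WAV container.

    The defaults (24 kHz, mono, 16-bit) are an assumption about the
    audio format Gemini-TTS returns.
    """
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(pcm)


def synthesize(text, voice="Puck", model="gemini-2.5-flash-preview-tts"):
    """Request speech for `text` and return the raw audio bytes."""
    # Imported lazily so the WAV helper works without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client()  # picks up the API key from the environment
    response = client.models.generate_content(
        model=model,
        contents=text,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name=voice
                    )
                )
            ),
        ),
    )
    return response.candidates[0].content.parts[0].inline_data.data
```

With the SDK installed and `GOOGLE_API_KEY` set, `save_pcm_as_wav("hello.wav", synthesize("Hello world, this is a test."))` should produce a file comparable to the standard-voice command above. The custom-voice path is not covered here.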

### 2. Create Custom Voice (Voice Cloning)

Generate a `voiceCloningKey` from a reference audio file and a consent file.

**Requirements:**
*   `reference.wav`: 10-30s of clear speech (the voice to clone).
*   `consent.wav`: The speaker saying: *"I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model."*

```bash
uv run skills/speech-use/scripts/create_custom_voice.py --reference-audio reference.wav --consent-audio consent.wav
```
*Save the output key and pass it to `generate_speech.py` with `--voice-cloning-key`.*
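
Before submitting a recording, it can help to confirm that the reference file falls inside the 10-30 second range listed in the requirements. The following sketch uses only the standard-library `wave` module, so it assumes an uncompressed PCM `.wav` file; the helper names are hypothetical and not part of the skill.

```python
import wave

# Range stated in the requirements for reference.wav.
MIN_SECONDS, MAX_SECONDS = 10.0, 30.0


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def reference_length_ok(path):
    """True if the file length falls within the 10-30 second range."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

This checks length only. It says nothing about background noise or whether a single speaker is present, which still need a listen.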

### 3. Transcribe Audio (STT)

Transcribe audio files using Chirp 3.

```bash
uv run skills/speech-use/scripts/transcribe_audio.py audio.wav --language en-US --output transcript.txt
```
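
As a rough picture of what a Chirp 3 request looks like, the sketch below uses the Speech-to-Text v2 client from `google-cloud-speech`. It is an assumption about a reasonable approach, not the actual contents of `transcribe_audio.py`. The helpers `speech_endpoint`, `recognizer_path`, and `transcribe` are hypothetical names, and the synchronous `recognize` call used here is intended for short clips.

```python
def speech_endpoint(location="us"):
    """Return the Speech-to-Text host for a location."""
    if location == "global":
        return "speech.googleapis.com"
    return f"{location}-speech.googleapis.com"


def recognizer_path(project, location="us"):
    """Resource name of the implicit default recognizer ("_")."""
    return f"projects/{project}/locations/{location}/recognizers/_"


def transcribe(path, project, location="us", language="en-US", model="chirp_3"):
    """Send one short audio file for synchronous recognition."""
    # Imported lazily so the pure helpers work without the SDK installed.
    from google.api_core.client_options import ClientOptions
    from google.cloud.speech_v2 import SpeechClient
    from google.cloud.speech_v2.types import cloud_speech

    client = SpeechClient(
        client_options=ClientOptions(api_endpoint=speech_endpoint(location))
    )
    with open(path, "rb") as f:
        audio = f.read()
    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        language_codes=[language],
        model=model,
    )
    request = cloud_speech.RecognizeRequest(
        recognizer=recognizer_path(project, location),
        config=config,
        content=audio,
    )
    response = client.recognize(request=request)
    # Join the top alternative of each result into one transcript string.
    return " ".join(
        r.alternatives[0].transcript for r in response.results if r.alternatives
    )
```

The endpoint must match the location in the recognizer path, which is why both are derived from the same `location` argument.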

## Options

**generate_speech.py**
*   `--voice`: Prebuilt voice (e.g., `Kore`, `Puck`, `Fenrir`, `Aoede`).
*   `--voice-cloning-key`: Key from `create_custom_voice.py`.
*   `--model`: Default `gemini-2.5-flash-preview-tts`.

**transcribe_audio.py**
*   `--model`: Default `chirp_3`.
*   `--language`: Default `auto`.
*   `--location`: Cloud region (default `us`).
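
To transcribe many files with the same options, the command line can be assembled programmatically. This sketch only builds and runs the `uv run` command shown in the Usage section with the flags listed above; `build_transcribe_cmd` and `transcribe_folder` are hypothetical helper names, and naming each transcript after its audio file is an arbitrary choice.

```python
import subprocess
from pathlib import Path

# Script path as used in the Usage section.
SCRIPT = "skills/speech-use/scripts/transcribe_audio.py"


def build_transcribe_cmd(audio, language="en-US", model=None, location=None):
    """Build the argv list for one transcription run."""
    audio = Path(audio)
    cmd = [
        "uv", "run", SCRIPT, str(audio),
        "--language", language,
        "--output", str(audio.with_suffix(".txt")),
    ]
    # Optional flags are omitted so the script's own defaults apply.
    if model:
        cmd += ["--model", model]
    if location:
        cmd += ["--location", location]
    return cmd


def transcribe_folder(folder, **options):
    """Run the script once per .wav file in `folder`, in sorted order."""
    for audio in sorted(Path(folder).glob("*.wav")):
        subprocess.run(build_transcribe_cmd(audio, **options), check=True)
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces, and `check=True` stops the batch at the first failing file.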

## References

> **Before running scripts**, review the reference guides for available voices and options.

-   [Voices Guide](references/voices.md) - 30+ voice options with styles (Puck, Kore, Fenrir, Aoede, etc.)

Overview

This skill provides portable Python tools to generate speech (TTS), transcribe audio (STT), and create cloned/custom voices using Google GenAI and Cloud Speech SDKs. It supports Gemini-TTS for high-quality synthesis, Chirp 3 for transcription, and Instant Custom Voice workflows for cloning. The scripts are lightweight, command-line friendly, and designed for integration into pipelines or quick experiments.

How this skill works

The scripts call Google APIs using credentials and project settings read from environment variables. generate_speech.py sends text to Gemini-TTS with either a prebuilt voice name or a voiceCloningKey and writes the resulting audio to a file. transcribe_audio.py sends an audio file to Chirp 3 and saves the transcript. create_custom_voice.py produces a voiceCloningKey from a reference recording and a consent recording for later use with generate_speech.py.

When to use it

Create spoken audio from text for apps, demos, or narration.
Transcribe recorded meetings, interviews, or voice notes to text.
Create a custom synthetic voice for brand or accessibility use.
Prototype voice features in chatbots, IVR, or multimedia projects.
Batch-process audio files or integrate TTS/STT into pipelines.

Best practices

Set GOOGLE_API_KEY before running TTS, and GOOGLE_CLOUD_PROJECT before running STT or voice cloning; setting GOOGLE_APPLICATION_CREDENTIALS is recommended for the latter two.
Use 10–30 seconds of clear, single-speaker audio for reliable voice cloning.
Record the consent statement word for word as given in the requirements when creating a custom voice.
Choose appropriate models (gemini-2.5-flash-preview-tts for TTS, chirp_3 for STT) and adjust language/region flags.
Validate output audio and transcripts, then refine prompts or audio quality if results are noisy.

Example use cases

Generate narrated product walkthroughs or marketing clips with Gemini-TTS.
Transcribe customer service calls and export text for analytics.
Clone a presenter’s voice for consistent automated announcements.
Convert lecture recordings to searchable transcripts for study tools.
Produce localized voiceovers by swapping TTS voice/style parameters.

FAQ

What environment variables are required?

Set GOOGLE_API_KEY for Gemini-TTS. STT and voice cloning require GOOGLE_CLOUD_PROJECT, and GOOGLE_APPLICATION_CREDENTIALS is recommended for both.

How long should reference audio be for cloning?

Provide 10–30 seconds of clear, single-speaker speech for best cloning results.