This skill enables speech synthesis, transcription, and voice cloning using Gemini-TTS, Chirp 3, and custom voices via Google's GenAI and Cloud SDKs.
Install with `npx playbooks add skill cnemri/google-genai-skills --skill speech-use`.
---
name: speech-use
description: "Generate (TTS), Transcribe (STT), and Clone voices using Google's GenAI and Cloud Speech SDKs. Supports Gemini-TTS, Chirp 3, and Instant Custom Voice."
---
# Speech Use
Use this skill to perform Text-to-Speech (TTS), Speech-to-Text (STT), and Voice Cloning operations.
This skill uses portable Python scripts managed by `uv`.
## Prerequisites
1. **Environment Variables**:
   * `GOOGLE_API_KEY` (for TTS via Gemini)
   * `GOOGLE_CLOUD_PROJECT` (required for STT and voice cloning)
   * `GOOGLE_APPLICATION_CREDENTIALS` (recommended for STT and voice cloning)
2. **APIs Enabled**:
   * Text-to-Speech API (`texttospeech.googleapis.com`)
   * Speech-to-Text API (`speech.googleapis.com`)
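The variable requirements above can be checked before any script is run. The following is a minimal sketch, not part of the skill itself: the task names (`tts`, `stt`, `clone`) and the `check_env` helper are hypothetical, and the grouping simply mirrors the list above.

```python
import os

# Variables named in the Prerequisites list, grouped by task.
# The task keys are hypothetical labels chosen for this sketch.
REQUIRED = {
    "tts": ["GOOGLE_API_KEY"],
    "stt": ["GOOGLE_CLOUD_PROJECT"],
    "clone": ["GOOGLE_CLOUD_PROJECT"],
}
RECOMMENDED = {
    "stt": ["GOOGLE_APPLICATION_CREDENTIALS"],
    "clone": ["GOOGLE_APPLICATION_CREDENTIALS"],
}


def check_env(task, env=None):
    """Return (missing_required, missing_recommended) for a task."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED.get(task, []) if not env.get(name)]
    advised = [name for name in RECOMMENDED.get(task, []) if not env.get(name)]
    return missing, advised


if __name__ == "__main__":
    for task in ("tts", "stt", "clone"):
        missing, advised = check_env(task)
        print(task, "missing:", missing, "recommended but unset:", advised)
```

An empty first list means the required variables for that task are set; the second list only flags recommended variables and need not block a run.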
## Usage
### 1. Generate Speech (TTS)
Generate audio from text using Gemini-TTS.
**Standard Voice:**
```bash
uv run skills/speech-use/scripts/generate_speech.py "Hello world, this is a test." --voice Puck --output hello.wav
```
**Custom Voice (Cloned):**
```bash
uv run skills/speech-use/scripts/generate_speech.py "This is my custom voice speaking." --voice-cloning-key "YOUR_KEY_HERE" --output custom.wav
```
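For orientation, a standard-voice request to Gemini-TTS through the `google-genai` SDK might look roughly like the sketch below. This is not the source of `generate_speech.py`; the helper names `synthesize` and `write_wav` are hypothetical, and the audio format (24 kHz, 16-bit, mono PCM) is an assumption based on Google's published Gemini speech-generation examples. Custom-voice synthesis is left out because the way the script passes a `voiceCloningKey` is not documented here.

```python
import wave


def write_wav(path, pcm, rate=24000, channels=1, sample_width=2):
    """Wrap raw PCM bytes in a WAV container.

    The defaults assume 24 kHz, 16-bit mono output from Gemini-TTS.
    """
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(pcm)


def synthesize(text, voice="Puck", model="gemini-2.5-flash-preview-tts"):
    """Request speech from Gemini-TTS and return the raw audio bytes."""
    # Imported lazily so write_wav stays usable without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client()  # reads GOOGLE_API_KEY from the environment
    response = client.models.generate_content(
        model=model,
        contents=text,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name=voice
                    )
                )
            ),
        ),
    )
    # The audio arrives as inline data on the first part of the first candidate.
    return response.candidates[0].content.parts[0].inline_data.data
```

Under these assumptions, `write_wav("hello.wav", synthesize("Hello world, this is a test."))` would roughly mirror the Standard Voice command above.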
### 2. Create Custom Voice (Voice Cloning)
Generate a `voiceCloningKey` from a reference audio file and a consent file.
**Requirements:**
* `reference.wav`: 10-30s of clear speech (the voice to clone).
* `consent.wav`: The speaker saying: *"I am the owner of this voice and I consent to Google using this voice to create a synthetic voice model."*
```bash
uv run skills/speech-use/scripts/create_custom_voice.py --reference-audio reference.wav --consent-audio consent.wav
```
*Save the output key to use with `generate_speech.py`.*
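Because the reference clip should run 10-30s, it can help to measure it before calling `create_custom_voice.py`. The sketch below uses only the standard-library `wave` module, so it assumes an uncompressed PCM WAV file; `check_reference` is a hypothetical helper and is not part of the skill.

```python
import wave

# Bounds taken from the requirement above (10-30 seconds of speech).
MIN_SECONDS = 10.0
MAX_SECONDS = 30.0


def wav_duration(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def check_reference(path):
    """Return (ok, seconds), where ok means the clip is within bounds."""
    seconds = wav_duration(path)
    return MIN_SECONDS <= seconds <= MAX_SECONDS, seconds


if __name__ == "__main__":
    import sys

    ok, seconds = check_reference(sys.argv[1])
    print(f"{seconds:.1f}s", "OK" if ok else "outside the 10-30s range")
```

The check covers length only; it says nothing about audio clarity or whether the consent recording matches the required statement.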
### 3. Transcribe Audio (STT)
Transcribe audio files using Chirp 3.
```bash
uv run skills/speech-use/scripts/transcribe_audio.py audio.wav --language en-US --output transcript.txt
```
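A transcription call of this kind, made through the Cloud Speech-to-Text v2 client, could be sketched as follows. This is an illustration rather than the contents of `transcribe_audio.py`: the helpers `recognizer_path`, `api_endpoint`, and `transcribe` are hypothetical, the regional endpoint pattern is an assumption drawn from Google's v2 examples, and the synchronous `recognize` call is generally suited to short clips only.

```python
def recognizer_path(project, location="us"):
    """Build the resource name of the default ("_") recognizer."""
    return f"projects/{project}/locations/{location}/recognizers/_"


def api_endpoint(location="us"):
    """Pick the API host; non-global locations use a prefixed endpoint."""
    if location == "global":
        return "speech.googleapis.com"
    return f"{location}-speech.googleapis.com"


def transcribe(path, project, language="en-US", location="us", model="chirp_3"):
    """Send a local audio file to Speech-to-Text v2 and return the transcript."""
    # Imported lazily so the two helpers above work without the SDK installed.
    from google.api_core.client_options import ClientOptions
    from google.cloud.speech_v2 import SpeechClient
    from google.cloud.speech_v2.types import cloud_speech

    client = SpeechClient(
        client_options=ClientOptions(api_endpoint=api_endpoint(location))
    )
    config = cloud_speech.RecognitionConfig(
        auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
        language_codes=[language],
        model=model,
    )
    with open(path, "rb") as f:
        content = f.read()
    request = cloud_speech.RecognizeRequest(
        recognizer=recognizer_path(project, location),
        config=config,
        content=content,
    )
    response = client.recognize(request=request)
    # Join the top alternative of each result into one transcript string.
    return " ".join(
        r.alternatives[0].transcript for r in response.results if r.alternatives
    )
```

Here `project` would come from `GOOGLE_CLOUD_PROJECT`, and authentication would rely on `GOOGLE_APPLICATION_CREDENTIALS` or other application default credentials.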
## Options
**generate_speech.py**
* `--voice`: Prebuilt voice (e.g., `Kore`, `Puck`, `Fenrir`, `Aoede`).
* `--voice-cloning-key`: Key from `create_custom_voice.py`.
* `--model`: Default `gemini-2.5-flash-preview-tts`.
**transcribe_audio.py**
* `--model`: Default `chirp_3`.
* `--language`: Default `auto`.
* `--location`: Cloud region (default `us`).
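To make the documented defaults concrete, here is a hypothetical `argparse` reconstruction of the `transcribe_audio.py` interface. The flag names and defaults come from the list and the usage example above; the positional argument name `audio`, the help strings, and `build_parser` itself are assumptions, and the real script may be organized differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented transcribe_audio.py options."""
    parser = argparse.ArgumentParser(prog="transcribe_audio.py")
    # "audio" is a hypothetical name for the positional input file.
    parser.add_argument("audio", help="Path to the audio file to transcribe")
    parser.add_argument("--model", default="chirp_3", help="STT model")
    parser.add_argument("--language", default="auto", help="Language code")
    parser.add_argument("--location", default="us", help="Cloud region")
    parser.add_argument("--output", help="Optional path for the transcript")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Calling `build_parser().parse_args(["audio.wav"])` yields the defaults listed above, so only the input file has to be supplied explicitly.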
## References
> **Before running scripts**, review the reference guides for available voices and options.
- [Voices Guide](references/voices.md) - 30+ voice options with styles (Puck, Kore, Fenrir, Aoede, etc.)

This skill provides portable Python tools to generate speech (TTS), transcribe audio (STT), and create cloned/custom voices using Google GenAI and Cloud Speech SDKs. It supports Gemini-TTS for high-quality synthesis, Chirp 3 for transcription, and Instant Custom Voice workflows for cloning. The scripts are lightweight, command-line friendly, and designed for integration into pipelines or quick experiments.
The scripts call Google APIs using credentials from environment variables and project settings. `generate_speech.py` sends text to Gemini-TTS together with either a prebuilt voice name or an Instant Custom Voice key, and writes the resulting audio to a file. `transcribe_audio.py` uploads audio to Chirp 3 for automatic transcription. `create_custom_voice.py` produces a `voiceCloningKey` from a reference audio file and a consent recording for later TTS use.
**What environment variables are required?**
Set `GOOGLE_API_KEY` for Gemini-TTS. For STT and voice cloning workflows, `GOOGLE_CLOUD_PROJECT` is required and `GOOGLE_APPLICATION_CREDENTIALS` is recommended.

**How long should reference audio be for cloning?**
Provide 10-30 seconds of clear, single-speaker speech for best cloning results.