home / skills / openclaw / skills / gemini-stt

gemini-stt skill

/skills/araa47/gemini-stt

This skill transcribes audio files using Google's Gemini API or Vertex AI, delivering fast, accurate captions for media and logs.

npx playbooks add skill openclaw/skills --skill gemini-stt

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
3.5 KB
---
name: gemini-stt
description: Transcribe audio files using Google's Gemini API or Vertex AI
metadata: {"clawdbot":{"emoji":"🎤","os":["linux","darwin"]}}
---

# Gemini Speech-to-Text Skill

Transcribe audio files using Google's Gemini API or Vertex AI. Default model is `gemini-2.0-flash-lite` for fastest transcription.

## Authentication (choose one)

### Option 1: Vertex AI with Application Default Credentials (Recommended)

```bash
gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID
```

The script will automatically detect and use ADC when available.

### Option 2: Direct Gemini API Key

Set `GEMINI_API_KEY` in environment (e.g., `~/.env` or `~/.clawdbot/.env`)

## Requirements

- Python 3.10+ (no external dependencies)
- Either GEMINI_API_KEY or gcloud CLI with ADC configured

## Supported Formats

- `.ogg` / `.opus` (Telegram voice messages)
- `.mp3`
- `.wav`
- `.m4a`

## Usage

```bash
# Auto-detect auth (tries ADC first, then GEMINI_API_KEY)
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg

# Force Vertex AI
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --vertex

# With a specific model
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --model gemini-2.5-pro

# Vertex AI with specific project and region
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --vertex --project my-project --region us-central1

# With Clawdbot media
python ~/.claude/skills/gemini-stt/transcribe.py ~/.clawdbot/media/inbound/voice-message.ogg
```

## Options

| Option | Description |
|--------|-------------|
| `<audio_file>` | Path to the audio file (required) |
| `--model`, `-m` | Gemini model to use (default: `gemini-2.0-flash-lite`) |
| `--vertex`, `-v` | Force use of Vertex AI with ADC |
| `--project`, `-p` | GCP project ID (for Vertex, defaults to gcloud config) |
| `--region`, `-r` | GCP region (for Vertex, default: `us-central1`) |

## Supported Models

Any Gemini model that supports audio input can be used. Recommended models:

| Model | Notes |
|-------|-------|
| `gemini-2.0-flash-lite` | **Default.** Fastest transcription speed. |
| `gemini-2.0-flash` | Fast and cost-effective. |
| `gemini-2.5-flash-lite` | Lightweight 2.5 model. |
| `gemini-2.5-flash` | Balanced speed and quality. |
| `gemini-2.5-pro` | Higher quality, slower. |
| `gemini-3-flash-preview` | Latest flash model. |
| `gemini-3-pro-preview` | Latest pro model, best quality. |

See [Gemini API Models](https://ai.google.dev/gemini-api/docs/models) for the latest list.

## How It Works

1. Reads the audio file and base64 encodes it
2. Auto-detects authentication:
   - If ADC is available (gcloud), uses Vertex AI endpoint
   - Otherwise, uses GEMINI_API_KEY with direct Gemini API
3. Sends to the selected Gemini model with transcription prompt
4. Returns the transcribed text

## Example Integration

For Clawdbot voice message handling:

```bash
# Transcribe incoming voice message
TRANSCRIPT=$(python ~/.claude/skills/gemini-stt/transcribe.py "$AUDIO_PATH")
echo "User said: $TRANSCRIPT"
```

## Error Handling

The script exits with code 1 and prints to stderr on:
- No authentication available (neither ADC nor GEMINI_API_KEY)
- File not found
- API errors
- Missing GCP project (when using Vertex)

## Notes

- Uses Gemini 2.0 Flash Lite by default for fastest transcription
- No external Python dependencies (uses stdlib only)
- Automatically detects MIME type from file extension
- Prefers Vertex AI with ADC when available (no API key management needed)

Overview

This skill transcribes audio files using Google’s Gemini API or Vertex AI, defaulting to gemini-2.0-flash-lite for fast results. It auto-detects authentication via Application Default Credentials (Vertex) or a GEMINI_API_KEY and supports common audio formats like OGG, MP3, WAV, and M4A. The tool is lightweight, requires Python 3.10+, and has no external dependencies.

How this skill works

The script reads the audio file, base64-encodes the content, and determines authentication: it prefers Vertex AI via ADC when available, otherwise it falls back to the direct Gemini API with GEMINI_API_KEY. It sends the encoded audio to the chosen Gemini model with a transcription prompt and returns the transcribed text. Command-line options allow forcing Vertex, selecting model, and specifying GCP project and region.

When to use it

  • Transcribing single audio files or voice messages from a local workflow.
  • Automating transcription in scripts or small bots without external dependencies.
  • Using Vertex AI when you want to avoid managing API keys and rely on gcloud ADC.
  • Testing different Gemini speech-capable models for speed vs. quality trade-offs.
  • Processing Telegram voice notes or other common formats (OGG/OPUS, MP3, WAV, M4A).

Best practices

  • Use gcloud application-default login and set project to prefer Vertex and avoid API keys.
  • Choose gemini-2.0-flash-lite for fast, low-cost transcriptions and higher-tier models for quality when needed.
  • Validate the audio format and ensure file paths are correct to avoid file-not-found errors.
  • Provide a GCP project and region when forcing Vertex to prevent missing-project errors.
  • Handle non-zero exit codes and stderr output to catch authentication or API errors in automation.

Example use cases

  • Quickly transcribe an incoming Clawdbot voice message for chat logs or moderation.
  • Batch-transcribe a directory of meeting recordings using a shell loop and capture transcripts.
  • Integrate into a lightweight ingestion pipeline where you prefer no third-party Python packages.
  • Switch between Vertex and direct Gemini API depending on environment: local dev vs. CI.
  • Experiment with model choices to find the optimal balance between latency and transcript accuracy.

FAQ

What authentication options are supported?

The skill supports Vertex AI via Application Default Credentials (recommended) or a direct GEMINI_API_KEY environment variable.

Which audio formats are accepted?

Supported formats include .ogg/.opus, .mp3, .wav, and .m4a; MIME type is inferred from the file extension.