This skill helps you generate a verbatim YouTube transcript using Google Gemini, delivering speaker labels and clean formatting without timestamps.
Add this skill to your agents with `npx playbooks add skill openclaw/skills --skill gemini-yt-video-transcript`, or review the files below.
---
name: gemini-yt-video-transcript
description: "Create a verbatim transcript for a YouTube URL using Google Gemini (speaker labels, paragraph breaks; no time codes). Use when the user asks to transcribe a YouTube video or wants a clean transcript (no timestamps)."
summary: "Generate a verbatim YouTube transcript via Google Gemini (speaker labels, no time codes)."
version: 1.0.2
homepage: https://github.com/odrobnik/gemini-yt-video-transcript-skill
metadata: {"clawdbot":{"emoji":"📝","requires":{"env":["GEMINI_API_KEY"],"bins":["python3"]}}}
---
# Gemini YouTube Video Transcript
Create a **verbatim transcript** for a YouTube URL using **Google Gemini**.
**Output format**
- First line: YouTube video title
- Then transcript lines only in the form:
```
Speaker: text
```
**Requirements**
- No time codes
- No extra headings / lists / commentary
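The format rules above can be checked mechanically. The following is a minimal sketch, not part of the skill: the function name `check_transcript` and both regular expressions are assumptions made for illustration. The time-code pattern is deliberately simple, so it also flags clock times that a speaker says aloud (for example "10:30").

```python
import re

# Assumed patterns (hypothetical, not taken from the skill's script):
# a clock-style time code such as 1:23 or 01:02:03, and a "Speaker: text"
# line whose label is everything before the first colon.
TIME_CODE = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")
SPEAKER_LINE = re.compile(r"^[^:\n]{1,60}: \S.*$")


def check_transcript(text: str) -> list[str]:
    """Return a list of format problems; an empty list means the text conforms."""
    lines = text.splitlines()
    if not lines or not lines[0].strip():
        return ["missing title on the first line"]
    problems = []
    for number, line in enumerate(lines[1:], start=2):
        if not line.strip():
            continue  # blank lines are treated as paragraph breaks
        if TIME_CODE.search(line):
            problems.append(f"line {number}: contains a time code")
        if not SPEAKER_LINE.match(line):
            problems.append(f"line {number}: not in 'Speaker: text' form")
    return problems
```

An empty result means the title is present and every non-blank body line looks like `Speaker: text` with no time code.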
## Usage
```bash
python3 {baseDir}/scripts/youtube_transcript.py "https://www.youtube.com/watch?v=..."
```
Options:
- `--out <path>` Write transcript to a specific file (default: auto-named in the workspace `out/` folder).
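If the script is called from another Python program, the command line above can be assembled as an argument list. This is a sketch under assumptions: `build_command` is a hypothetical helper, and `base_dir` stands in for whatever `{baseDir}` resolves to in your setup. Only the `--out` option documented above is used.

```python
from pathlib import Path


def build_command(base_dir, url, out=None):
    """Build the argument list for scripts/youtube_transcript.py.

    base_dir is a stand-in for {baseDir}; out maps to the --out option.
    """
    script = Path(base_dir) / "scripts" / "youtube_transcript.py"
    command = ["python3", str(script), url]
    if out is not None:
        command += ["--out", str(out)]
    return command
```

Passing the result to `subprocess.run(command, check=True)` would run the script; `GEMINI_API_KEY` must be set in the environment, as the skill metadata requires.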
## Delivery
When chatting: send the resulting transcript as a document/attachment.
This skill creates a verbatim transcript of the video at a YouTube URL using Google Gemini. The transcript includes speaker labels and paragraph breaks, and deliberately omits time codes and extra headings. The output starts with the YouTube video title on the first line, followed by lines in the form `Speaker: text`.
You provide a YouTube watch URL, and the skill fetches the video audio, sends it to Google Gemini for speech-to-text transcription, and formats the response into a clean, document-style transcript. The output preserves speaker turns and paragraph breaks while removing timestamps, stage directions, and supplementary headers. The result is a plain transcript suitable for saving to a file or returning as an attachment in a chat.
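The clean-up step described above might look roughly like the sketch below. The skill's actual post-processing is not shown in this document, so the function name `clean_body` and all three patterns are assumptions. The function is meant for transcript body lines only; applying it to the title line could strip a legitimate parenthetical.

```python
import re

# Assumed shapes of the residue to strip (hypothetical, for illustration only).
TIMESTAMP = re.compile(r"[\[(]?\b\d{1,2}:\d{2}(?::\d{2})?\b[\])]?\s*")
STAGE_DIRECTION = re.compile(r"\s*[\[(][^\])]*[\])]")
HEADING = re.compile(r"^\s*#{1,6}\s")


def clean_body(raw: str) -> str:
    """Strip headers, time codes, and bracketed stage directions from body lines."""
    cleaned = []
    for line in raw.splitlines():
        if HEADING.match(line):
            continue  # drop markdown-style headers entirely
        line = TIMESTAMP.sub("", line)        # e.g. "[00:05] " or "1:02:03"
        line = STAGE_DIRECTION.sub("", line)  # e.g. " (laughs)" or " [music]"
        cleaned.append(line.strip())
    # Blank lines are kept so paragraph breaks survive.
    return "\n".join(cleaned).strip()
```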
**Can the transcript include timestamps?**
No. The skill intentionally omits time codes; it produces verbatim text with speaker labels and paragraph breaks only.
**How is speaker labeling handled?**
Speaker turns are detected and labeled in-line (e.g., "Speaker: text"). Labels are derived from Gemini's speaker separation; manual review is recommended for accuracy.
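To support the manual review, a quick tally of the labels in a finished transcript can help. The sketch below assumes the output format described earlier (title on the first line, then `Speaker: text` lines); `speaker_counts` is a hypothetical helper, not part of the skill.

```python
from collections import Counter


def speaker_counts(transcript: str) -> Counter:
    """Count body lines per speaker label, skipping the title on the first line."""
    counts = Counter()
    for line in transcript.splitlines()[1:]:
        # Split at the first ": " so colons inside the spoken text are ignored.
        label, separator, text = line.partition(": ")
        if separator and label.strip() and text.strip():
            counts[label.strip()] += 1
    return counts
```

Labels that appear only once, or near-duplicate labels for what seems to be the same person, are reasonable places to start checking attribution by hand.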