
gemini-yt-video-transcript skill

This skill helps you generate a verbatim YouTube transcript using Google Gemini, delivering speaker labels and clean formatting without timestamps.

npx playbooks add skill openclaw/skills --skill gemini-yt-video-transcript

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: gemini-yt-video-transcript
description: "Create a verbatim transcript for a YouTube URL using Google Gemini (speaker labels, paragraph breaks; no time codes). Use when the user asks to transcribe a YouTube video or wants a clean transcript (no timestamps)."
summary: "Generate a verbatim YouTube transcript via Google Gemini (speaker labels, no time codes)."
version: 1.0.2
homepage: https://github.com/odrobnik/gemini-yt-video-transcript-skill
metadata: {"clawdbot":{"emoji":"📝","requires":{"env":["GEMINI_API_KEY"],"bins":["python3"]}}}
---

# Gemini YouTube Video Transcript

Create a **verbatim transcript** for a YouTube URL using **Google Gemini**.

**Output format**
- First line: YouTube video title
- Then transcript lines only in the form:

```
Speaker: text
```

**Requirements**
- No time codes
- No extra headings / lists / commentary
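Because the format is this strict, it can be checked mechanically. The sketch below splits a saved transcript into its title and speaker turns and reports any line that does not fit the `Speaker: text` shape. The function name `parse_transcript` is hypothetical, as are two rules it assumes: a speaker label never starts with a digit, and blank lines are paragraph breaks.

```python
import re
from typing import List, Tuple

# "Speaker: text": the label is everything before the first colon.
# Assumption: labels do not start with a digit, so a leaked "00:12 ..." time code
# is reported as a bad line rather than parsed as a speaker called "00".
TURN_RE = re.compile(r"^([^\d:][^:]*):\s+(\S.*)$")


def parse_transcript(doc: str) -> Tuple[str, List[Tuple[str, str]], List[int]]:
    """Return (title, [(speaker, text), ...], line numbers that break the format)."""
    lines = doc.splitlines()
    if not lines or not lines[0].strip():
        raise ValueError("first line must be the video title")
    title = lines[0].strip()
    turns: List[Tuple[str, str]] = []
    bad: List[int] = []
    for lineno, line in enumerate(lines[1:], start=2):
        if not line.strip():
            continue  # blank lines are treated as paragraph breaks
        match = TURN_RE.match(line)
        if match:
            turns.append((match.group(1).strip(), match.group(2).strip()))
        else:
            bad.append(lineno)
    return title, turns, bad
```

An empty list of bad line numbers means every transcript line matched the expected shape.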

## Usage

```bash
python3 {baseDir}/scripts/youtube_transcript.py "https://www.youtube.com/watch?v=..."
```

Options:
- `--out <path>` Write transcript to a specific file (default: auto-named in the workspace `out/` folder).
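The usage line shows a full `watch?v=` URL. If you start from a share link (`youtu.be/...`) or a Shorts URL, a small helper can normalize it first. `canonical_watch_url` is a hypothetical helper, not part of the skill's script. It only rewrites the URL and does not check that the video exists.

```python
from urllib.parse import parse_qs, urlparse


def canonical_watch_url(url: str) -> str:
    """Return https://www.youtube.com/watch?v=<id> for common YouTube URL forms.

    Hypothetical helper: the skill's script may accept other forms on its own.
    """
    parsed = urlparse(url.strip())
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host == "youtu.be":
        # Share links carry the ID in the path: youtu.be/<id>?t=...
        video_id = parsed.path.lstrip("/").split("/")[0]
    elif host in ("youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            video_id = parse_qs(parsed.query).get("v", [""])[0]
        elif parsed.path.startswith(("/shorts/", "/live/", "/embed/")):
            video_id = parsed.path.split("/")[2]
        else:
            video_id = ""
    else:
        raise ValueError(f"not a YouTube URL: {url!r}")
    if not video_id:
        raise ValueError(f"no video ID found in {url!r}")
    return f"https://www.youtube.com/watch?v={video_id}"
```

Pass the returned URL to `youtube_transcript.py` as shown in the usage line.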

## Delivery

When chatting: send the resulting transcript as a document/attachment.

Overview

This skill creates a verbatim transcript of a YouTube video from its URL using Google Gemini. The transcript includes speaker labels and paragraph breaks, and it deliberately omits time codes and extra headings. The first line is the YouTube video title, and the remaining lines take the form "Speaker: text".

How this skill works

You provide a YouTube watch URL, the skill has Google Gemini transcribe the speech in the video, and it formats the response into a clean, document-style transcript. The output preserves speaker turns and paragraph breaks and leaves out timestamps, stage directions, and extra headings. The result is a plain-text transcript that you can save to a file or send as an attachment in a chat.
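This page does not say exactly how the script calls Gemini. One option is the Gemini API's support for referencing a public YouTube video by URL in a `file_data` part, which avoids downloading any audio. The sketch below assumes that approach and the REST `generateContent` endpoint. The model name and prompt are illustrative assumptions, not the skill's own, and the sketch does not fetch the video title. `transcribe` needs network access and a valid `GEMINI_API_KEY`.

```python
import json
import os
import urllib.request

# Assumptions: public Gemini REST endpoint, a model name chosen for illustration,
# and a prompt written for this sketch (not the skill's actual prompt).
API_ROOT = "https://generativelanguage.googleapis.com/v1beta/models"
MODEL = "gemini-2.5-flash"
PROMPT = (
    "Transcribe the speech in this video verbatim as 'Speaker: text' lines, "
    "with a blank line between paragraphs. Do not include time codes, "
    "headings, or commentary."
)


def build_request(video_url: str, api_key: str, model: str = MODEL) -> urllib.request.Request:
    """Build (but do not send) a generateContent request that references the video by URL."""
    body = {
        "contents": [{
            "parts": [
                {"file_data": {"file_uri": video_url}},
                {"text": PROMPT},
            ]
        }]
    }
    return urllib.request.Request(
        f"{API_ROOT}/{model}:generateContent",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )


def transcribe(video_url: str) -> str:
    """Send the request and join the text parts of the first candidate."""
    req = build_request(video_url, os.environ["GEMINI_API_KEY"])
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    parts = reply["candidates"][0]["content"]["parts"]
    return "".join(part.get("text", "") for part in parts)
```

Keeping request construction separate from sending makes it possible to inspect the payload without spending API quota.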

When to use it

  • You need a clean, readable transcript of a YouTube video without timestamps.
  • Preparing written content from video interviews, talks, or podcasts where speaker labels are required.
  • Archiving or backing up spoken content in text form for search and accessibility.
  • Creating notes or reading copies where timestamps would be distracting.
  • Delivering a document-ready transcript to clients or teammates.

Best practices

  • Provide the exact YouTube watch URL so there is no ambiguity about which video is transcribed.
  • Prefer videos with clear, high-quality audio; they give more accurate speaker detection and verbatim text.
  • Review the transcript for proper nouns and domain-specific terms, and correct them before publication.
  • When transcribing several videos, pass --out for each run so transcripts are saved to consistent, predictable paths.
  • Treat the result as a raw, verbatim transcript; do not assume punctuation or capitalization has been fully corrected.
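For a batch of videos, you can give each run its own --out path. The sketch below is a minimal example built on several assumptions: a hypothetical install path for the script (replace it with the real {baseDir}), `.txt` file names taken from each watch URL's `v=` parameter, and that --out may follow the URL on the command line.

```python
import subprocess
import sys
from pathlib import Path
from urllib.parse import parse_qs, urlparse

# Hypothetical install location; substitute the skill's real {baseDir}.
SCRIPT = Path("skills/gemini-yt-video-transcript/scripts/youtube_transcript.py")


def out_path_for(url: str, out_dir: Path) -> Path:
    """Name each transcript after the video ID in a watch URL's v= parameter."""
    video_id = parse_qs(urlparse(url).query).get("v", ["unknown"])[0]
    return out_dir / f"{video_id}.txt"


def build_command(url: str, out_dir: Path) -> list:
    """Command line for one video, using the documented --out option."""
    return [sys.executable, str(SCRIPT), url, "--out", str(out_path_for(url, out_dir))]


def transcribe_all(urls: list, out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    for url in urls:
        # check=True stops the batch at the first failing video.
        subprocess.run(build_command(url, out_dir), check=True)
```

Naming files by video ID means that re-running the batch overwrites earlier transcripts rather than creating duplicates.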

Example use cases

  • Transcribe an interview posted on YouTube to produce quotes and speaker-attributed text for an article.
  • Archive talks and presentations from a channel into searchable text files for backup.
  • Deliver client-ready transcripts of webinars without timestamps for PR or documentation.
  • Generate readable transcripts of panel discussions where multiple speakers need clear labels.
  • Create a clean source document for editing into blog posts, summaries, or social media clips.

FAQ

Can the transcript include timestamps?

No. The skill intentionally omits time codes; it produces verbatim text with speaker labels and paragraph breaks only.
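Because the skill already asks for no time codes, a clean-up pass is only a safeguard. If a stray code ever appears, for example in output you have merged or edited by hand, the hypothetical `strip_time_codes` below removes bracketed codes anywhere in a line and bare codes at the start of a line. It leaves times that belong to the speech itself ("we met at 10:30") unchanged. The patterns are assumptions about what a leaked code might look like.

```python
import re

# Hypothetical patterns: "[00:01:23]" or "(1:02)" anywhere, or "12:05 " at line start.
BRACKETED_TC = re.compile(r"\s*[\[(]\d{1,2}(?::\d{2}){1,2}[\])]\s*")
LEADING_TC = re.compile(r"^\d{1,2}(?::\d{2}){1,2}\s+")


def strip_time_codes(line: str) -> str:
    """Remove leaked time codes from one transcript line."""
    line = LEADING_TC.sub("", line)
    # Replace a bracketed code and its surrounding spaces with a single space.
    return BRACKETED_TC.sub(" ", line).strip()
```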

How is speaker labeling handled?

Speaker turns are detected and labeled in-line (e.g., "Speaker: text"). Labels are derived from Gemini's speaker separation; manual review is recommended for accuracy.
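Because labels come from Gemini's speaker separation, they may be generic (for example "Speaker 1") or inconsistent. Two hypothetical helpers can support the recommended manual review. The first counts the turns under each label. The second replaces labels with real names once you have identified the speakers. Both leave the title line alone and assume labels do not start with a digit.

```python
import re
from collections import Counter

# "Label: text" at the start of a line; the label must not start with a digit.
LABEL_RE = re.compile(r"^([^\d:][^:]*):\s")


def speaker_counts(transcript: str) -> Counter:
    """Count turns per speaker label, skipping the title on the first line."""
    counts: Counter = Counter()
    for line in transcript.splitlines()[1:]:
        match = LABEL_RE.match(line)
        if match:
            counts[match.group(1).strip()] += 1
    return counts


def rename_speakers(transcript: str, names: dict) -> str:
    """Replace labels found in `names` with the mapped name, keeping the title as-is."""
    title, *body = transcript.split("\n")
    renamed = []
    for line in body:
        match = LABEL_RE.match(line)
        if match and match.group(1).strip() in names:
            # Keep everything from the colon onward; swap only the label.
            line = names[match.group(1).strip()] + line[len(match.group(1)):]
        renamed.append(line)
    return "\n".join([title, *renamed])
```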