
a6-gemini-video-analyzer skill


This skill analyzes videos using Google Gemini to describe scenes, extract UI text, answer questions, and provide transcripts.

npx playbooks add skill openclaw/skills --skill a6-gemini-video-analyzer


SKILL.md
---
name: gemini-video-analyzer
description: |
  Native video analysis using Google Gemini API. Upload and analyze video files — describe scenes, extract text/UI, answer questions about content, transcribe speech, identify objects and actions. Use when: (1) User sends a video file and wants it analyzed, (2) Video summarization or description needed, (3) Extracting text, UI elements, or information from screen recordings, (4) Answering questions about video content, (5) Comparing multiple videos, (6) Analyzing tutorials, demos, or walkthroughs.
homepage: https://www.agxntsix.ai
metadata:
  {
    "openclaw":
      {
        "emoji": "🎬",
        "requires": { "bins": ["python3", "curl"], "env": ["GOOGLE_AI_API_KEY"] },
        "primaryEnv": "GOOGLE_AI_API_KEY",
      },
  }
---

# Gemini Video Analyzer

Analyze videos natively using Google Gemini's multimodal API. No frame extraction needed — Gemini processes video at 1 FPS with full motion, audio, and visual understanding.

## Quick Start

```bash
# Analyze a video with default prompt (full description)
GOOGLE_AI_API_KEY=$GOOGLE_AI_API_KEY python3 {baseDir}/scripts/analyze.py /path/to/video.mp4

# Ask a specific question
GOOGLE_AI_API_KEY=$GOOGLE_AI_API_KEY python3 {baseDir}/scripts/analyze.py /path/to/video.mp4 "What text is visible on screen?"

# Manage uploaded files
GOOGLE_AI_API_KEY=$GOOGLE_AI_API_KEY python3 {baseDir}/scripts/manage_files.py list
GOOGLE_AI_API_KEY=$GOOGLE_AI_API_KEY python3 {baseDir}/scripts/manage_files.py cleanup
```

## Supported Formats

MP4, AVI, MOV, MKV, WebM, FLV, MPEG, MPG, WMV, 3GP — up to 2GB per file.
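A quick local check against these constraints avoids failed uploads. A minimal sketch — the extension set and 2GB cap mirror the list above, but the function name is illustrative and not part of the skill's scripts:

```python
import os

# Formats and size cap from the list above
SUPPORTED_EXTS = {".mp4", ".avi", ".mov", ".mkv", ".webm",
                  ".flv", ".mpeg", ".mpg", ".wmv", ".3gp"}
MAX_BYTES = 2 * 1024 ** 3  # 2GB per-file limit

def validate_video(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if not os.path.exists(path):
        problems.append("file not found")
    elif os.path.getsize(path) > MAX_BYTES:
        problems.append("file exceeds the 2GB limit")
    return problems
```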

## How It Works

1. The video is uploaded to Google's Files API (temporary storage, auto-deleted after 48h)
2. Gemini samples the video at 1 frame/sec while retaining motion, transitions, and audio context
3. The model generates a response based on your prompt

Because temporal context is preserved, this approach is far more accurate than manual frame extraction for understanding motion and sequences.
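The upload-then-generate flow above can be sketched directly against the Gemini REST API. The endpoint paths and resumable-upload headers follow the public Files API documentation, but treat the exact header set as an assumption to verify against current docs; the helper names are illustrative:

```python
# Sketch of the two HTTP requests behind steps 1-3. Endpoints and headers are
# based on the public Gemini REST API; verify against current documentation.
BASE = "https://generativelanguage.googleapis.com"

def start_upload_request(api_key: str, num_bytes: int, mime: str = "video/mp4"):
    """Build the resumable-upload 'start' request for the Files API."""
    url = f"{BASE}/upload/v1beta/files?key={api_key}"
    headers = {
        "X-Goog-Upload-Protocol": "resumable",
        "X-Goog-Upload-Command": "start",
        "X-Goog-Upload-Header-Content-Length": str(num_bytes),
        "X-Goog-Upload-Header-Content-Type": mime,
        "Content-Type": "application/json",
    }
    return url, headers

def generate_request(api_key: str, file_uri: str, prompt: str,
                     model: str = "gemini-2.5-flash"):
    """Build the generateContent payload referencing the uploaded file."""
    url = f"{BASE}/v1beta/models/{model}:generateContent?key={api_key}"
    body = {"contents": [{"parts": [
        {"file_data": {"file_uri": file_uri, "mime_type": "video/mp4"}},
        {"text": prompt},
    ]}]}
    return url, body
```

After the start request, the actual bytes are sent to the upload URL returned in the response headers, and the resulting `file_uri` is polled until the file reaches an ACTIVE state before calling generate.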

## Use Cases

| Task | Example Prompt |
|------|---------------|
| General description | *(default — no prompt needed)* |
| UI/text extraction | `"What text and UI elements are visible?"` |
| Tutorial summary | `"Summarize the steps shown in this tutorial"` |
| Bug report from video | `"Describe what went wrong in this screen recording"` |
| Meeting notes | `"Summarize the key points discussed"` |
| Content comparison | Upload 2 videos, ask for differences |

## Configuration

Set `GOOGLE_AI_API_KEY` in your environment or `.env` file. Get a free key at [aistudio.google.com](https://aistudio.google.com/apikey).

Default model: `gemini-2.5-flash` (fast, cheap, excellent vision). Override with `--model gemini-2.5-pro` for complex analysis.
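The precedence described here — key from the environment, model defaulting to `gemini-2.5-flash` unless `--model` overrides it — can be sketched as follows. The real `analyze.py` may parse flags differently; this function is illustrative:

```python
import os

DEFAULT_MODEL = "gemini-2.5-flash"

def resolve_config(argv: list[str], env=os.environ) -> dict:
    """Resolve the API key from the environment and an optional --model override."""
    key = env.get("GOOGLE_AI_API_KEY")
    if not key:
        raise SystemExit("GOOGLE_AI_API_KEY is not set; "
                         "get a free key at https://aistudio.google.com/apikey")
    model = DEFAULT_MODEL
    if "--model" in argv:
        model = argv[argv.index("--model") + 1]
    return {"api_key": key, "model": model}
```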

## API Reference

See [references/gemini-files-api.md](references/gemini-files-api.md) for file upload limits, processing details, and advanced options.

Overview

This skill performs native video analysis using the Google Gemini multimodal API. Upload a video and get scene descriptions, speech transcription, object and action recognition, text/UI extraction, and direct answers about content. It handles temporal context without manual frame extraction and supports common video formats up to 2GB.

How this skill works

The skill uploads videos to Google Files (temporary storage) and sends them to Gemini, which processes visual and audio signals at roughly 1 frame per second while preserving motion and audio context. By default Gemini returns a full description; given a specific prompt, it can instead answer questions, extract on-screen text and UI elements, transcribe speech, or compare multiple videos when more than one is provided.

When to use it

  • Generating a detailed description or summary of a video file.
  • Extracting text, UI components, or captions from screen recordings or demos.
  • Transcribing speech or generating meeting notes from video recordings.
  • Answering specific questions about actions, objects, or events in a clip.
  • Comparing two or more videos to identify differences or changes.

Best practices

  • Provide a concise prompt when you need specific output (e.g., 'List UI elements and visible text').
  • Keep files under 2GB and prefer MP4 with common codecs (e.g., H.264 video and AAC audio) for best compatibility.
  • Use the default model for fast, cost-effective results and upgrade for deeper analysis.
  • Batch similar videos together when comparing content to reduce back-and-forth prompts.
  • Store your API key in environment variables and remove uploads after analysis if privacy is a concern.

Example use cases

  • Summarize a 10-minute tutorial into step-by-step instructions and timestamps.
  • Extract all visible text and UI controls from a mobile app screen recording for bug reports.
  • Transcribe a recorded interview and produce action items and speaker cues.
  • Compare two product demos to list differences in features shown and user flows.
  • Identify and describe objects and actions in security or inspection footage.

FAQ

What video formats and size limits are supported?

Common formats like MP4, AVI, MOV, MKV, WebM, FLV, MPEG, MPG, WMV, and 3GP are supported, with a per-file limit of 2GB.

How long are uploaded videos stored?

Uploads use temporary storage and are auto-deleted after 48 hours by the Files API; remove files sooner if needed for privacy.