gemini-reader skill

This skill analyzes local PDF, video, and audio files via Gemini API to read, summarize, or transcribe content.

npx playbooks add skill openclaw/skills --skill gemini-reader

Run the command above to add this skill to your agents.

---
name: gemini-reader
description: Understand local non-text files (PDF, video, audio) using Gemini API. Use when the user asks to read, summarize, or analyze a PDF document, video file (mp4/mov/webm), or audio file (mp3/wav/m4a/ogg), including audio transcription. NOT for images — the main model already has vision capabilities, prefer using it directly for image understanding.
metadata:
  {
    "openclaw":
      {
        "emoji": "📄",
        "requires": { "env": ["GEMINI_API_KEY"], "pip": ["google-genai"] },
      },
  }
---

# Gemini Reader

Analyze local PDF, video, and audio files via Gemini API (Python SDK `google-genai`).

## Prerequisites

- `google-genai` Python package installed (`pip install google-genai`)
- `GEMINI_API_KEY` environment variable set
- Supported: PDF, video (mp4/webm/mov/avi/mkv), audio (mp3/wav/m4a/ogg)

## Usage

```bash
python3 scripts/gemini_read.py <file> "<prompt>" [--model MODEL] [--output PATH]
```

### Examples

```bash
# Summarize a PDF
python3 scripts/gemini_read.py paper.pdf "Summarize the key findings of this paper"

# Analyze a video
python3 scripts/gemini_read.py lecture.mp4 "List the main topics covered in this video"

# Transcribe audio
python3 scripts/gemini_read.py recording.m4a "Transcribe this audio verbatim"

# Save output to file
python3 scripts/gemini_read.py report.pdf "Extract all data tables" --output tables.txt
```
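
The command line above could be parsed along the following lines. This is a minimal sketch, not the actual contents of `scripts/gemini_read.py`: the function name `build_parser`, the help strings, and the `3-flash` default are assumptions drawn from the usage line and the model table in this document.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented usage line."""
    parser = argparse.ArgumentParser(
        prog="gemini_read.py",
        description="Read a local PDF, video, or audio file with Gemini.",
    )
    parser.add_argument("file", help="path to a local PDF, video, or audio file")
    parser.add_argument("prompt", help="instruction for the model")
    # Assumption: -m is the short form of --model and 3-flash is the default alias.
    parser.add_argument("-m", "--model", default="3-flash", help="model alias or full name")
    parser.add_argument("--output", metavar="PATH", default=None, help="write the result to PATH")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is easy to try.
args = build_parser().parse_args(
    ["report.pdf", "Extract all data tables", "--output", "tables.txt"]
)
print(args.file, args.model, args.output)  # prints: report.pdf 3-flash tables.txt
```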

### Model selection

| Alias | Full name | Best for |
|-------|-----------|----------|
| `3-flash` (default) | gemini-3-flash-preview | Fast, cheap, everyday use |
| `2.5-flash` | gemini-2.5-flash | Stable, good balance |
| `2.5-pro` | gemini-2.5-pro | Deep analysis, long docs |
| `3-pro` | gemini-3-pro-preview | Advanced reasoning |
| `3.1-pro` | gemini-3.1-pro-preview | Latest pro capabilities |

Pass an alias with `--model` (or `-m`): `python3 scripts/gemini_read.py file.pdf "prompt" -m 2.5-pro`
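
One plausible way to resolve the aliases in the table is a plain dictionary lookup. The sketch below is illustrative only: `MODEL_ALIASES`, `DEFAULT_ALIAS`, and `resolve_model` are hypothetical names, and passing unknown names through unchanged is an assumption, not documented behavior of the script.

```python
# Alias table copied from the "Model selection" section.
MODEL_ALIASES = {
    "3-flash": "gemini-3-flash-preview",
    "2.5-flash": "gemini-2.5-flash",
    "2.5-pro": "gemini-2.5-pro",
    "3-pro": "gemini-3-pro-preview",
    "3.1-pro": "gemini-3.1-pro-preview",
}
DEFAULT_ALIAS = "3-flash"


def resolve_model(name=None):
    """Map an alias to its full model name; fall back to the default alias."""
    if name is None:
        return MODEL_ALIASES[DEFAULT_ALIAS]
    # Assumption: anything that is not a known alias is treated as a full model name.
    return MODEL_ALIASES.get(name, name)


print(resolve_model())           # gemini-3-flash-preview
print(resolve_model("2.5-pro"))  # gemini-2.5-pro
```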

## Notes

- Files are uploaded to Google's Gemini API for processing and deleted after use. Do not use with confidential or sensitive files.
- The script enforces a file extension whitelist (PDF/video/audio only), blocks known sensitive paths, and rejects symlinks.
- All files go through the File Upload API (upload -> generate -> cleanup); the flow is the same regardless of file size.
- For files on remote nodes (e.g. a Mac), transfer them to the VM first using Tailscale or scp.
- The script auto-detects the MIME type from the file extension.
- API calls are direct: no sandbox restrictions and no CLI overhead.
- Requires the `GEMINI_API_KEY` env var or configured `google-genai` auth.
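
The safety checks and MIME detection described in these notes could look roughly like this. It is a sketch under stated assumptions: the function name `validate_and_detect`, the exact MIME strings, and the `SENSITIVE_PARTS` list are hypothetical, since this document does not say which paths the script blocks or which MIME types it sends.

```python
from pathlib import Path

# Assumed extension -> MIME mapping for the formats listed under Prerequisites.
MIME_BY_EXTENSION = {
    ".pdf": "application/pdf",
    ".mp4": "video/mp4",
    ".webm": "video/webm",
    ".mov": "video/quicktime",
    ".avi": "video/x-msvideo",
    ".mkv": "video/x-matroska",
    ".mp3": "audio/mpeg",
    ".wav": "audio/wav",
    ".m4a": "audio/mp4",
    ".ogg": "audio/ogg",
}

# Hypothetical examples of sensitive directories; the real list is not documented here.
SENSITIVE_PARTS = {".ssh", ".aws", ".gnupg"}


def validate_and_detect(path_str: str) -> str:
    """Return the MIME type for an allowed file, or raise if a check fails."""
    path = Path(path_str).expanduser()
    # Reject a path whose final component is a symlink.
    if path.is_symlink():
        raise ValueError(f"symlinks are not allowed: {path}")
    # Whitelist by extension (case-insensitive).
    mime = MIME_BY_EXTENSION.get(path.suffix.lower())
    if mime is None:
        raise ValueError(f"unsupported file type: {path.suffix or '(none)'}")
    # Block files that live under a sensitive directory.
    if SENSITIVE_PARTS & set(path.resolve().parts):
        raise ValueError(f"refusing to read from a sensitive path: {path}")
    if not path.is_file():
        raise FileNotFoundError(path)
    return mime
```

Keeping the mapping explicit, rather than relying on the standard `mimetypes` module, avoids platform differences for extensions such as `.m4a` and `.mkv`.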

Overview

This skill reads, summarizes, transcribes, and analyzes local PDF, video, and audio files using the Gemini API via the google-genai Python SDK. It supports PDF, video (mp4/webm/mov/avi/mkv), and audio (mp3/wav/m4a/ogg), and turns non-text media into text output such as summaries, transcripts, and extracted tables.

How this skill works

The tool uploads a local file to the Gemini File Upload API, issues a generation request with your prompt and chosen model alias, retrieves the generated text, and cleans up the temporary upload. It auto-detects MIME type from the file extension and supports model selection for speed or deeper analysis. Typical flows include summarization, transcription, table extraction, and key-point listing.
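
The upload, generate, and cleanup steps described above can be sketched with the google-genai client as follows. This is an illustration rather than the script's actual code: `read_with_gemini` is a hypothetical helper, the default model is an arbitrary choice from the table, and the client is passed in so the function itself has no import-time dependency.

```python
def read_with_gemini(client, path, prompt, model="gemini-2.5-flash"):
    """Upload a local file, ask the model about it, then delete the upload."""
    # 1. Upload the file through the Files API.
    uploaded = client.files.upload(file=path)
    try:
        # 2. Generate text from the uploaded file plus the user's prompt.
        response = client.models.generate_content(
            model=model,
            contents=[uploaded, prompt],
        )
        return response.text
    finally:
        # 3. Always remove the uploaded copy, even if generation failed.
        client.files.delete(name=uploaded.name)


# Example call (needs `pip install google-genai` and GEMINI_API_KEY set):
#   from google import genai
#   client = genai.Client()  # picks up the API key from the environment
#   print(read_with_gemini(client, "paper.pdf", "Summarize the key findings"))
```

The `try`/`finally` ensures the cleanup step runs on every path. Large videos may still be processing right after upload, so a fuller version would likely poll `client.files.get` until the file is ready before generating.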

When to use it

  • You need a summary or analysis of a PDF whose text cannot easily be copied.
  • You want a transcript or highlights from a lecture, meeting, or interview audio/video file.
  • You need to extract tables, figures, or structured data from reports or scanned PDFs.
  • You want to test different Gemini model flavors for speed vs. depth on a local media file.

Best practices

  • Set GEMINI_API_KEY in your environment or configure google-genai auth before running the script.
  • Choose model aliases by trade-off: 3-flash for speed, 2.5-pro/3-pro for deeper parsing or long documents.
  • Transfer remote files to the processing machine (scp or Tailscale) before running; the script reads local files only.
  • Provide a clear, specific prompt (task, format, length) to get focused outputs (e.g., "Summarize in 5 bullet points").
  • Use the --output option to save results when processing long transcripts or extracted tables.

Example use cases

  • Summarize a 50-page PDF into a two-paragraph executive summary and three action items.
  • Transcribe a 90-minute lecture mp4 into timestamped notes and a short highlights list.
  • Extract all tables from a financial report PDF and save them to a text file for downstream parsing.
  • Analyze webinar audio to list key claims, speakers, and follow-up questions.
  • Compare outputs from 3-flash and 2.5-pro to balance cost and depth on long research papers.

FAQ

Which file formats are supported?

PDF, video (mp4, webm, mov, avi, mkv), and audio (mp3, wav, m4a, ogg) are supported. Images are out of scope: the main model already has vision capabilities and should handle them directly.

How do I change models?

Use the model alias flag (e.g., -m 2.5-pro). Aliases map to Gemini models for speed vs. deeper reasoning.

Do I need an API key?

Yes. Set GEMINI_API_KEY in your environment or configure authentication for the google-genai SDK.