
vision-language-models skill

/plugins/ork/skills/vision-language-models

This skill provides vision-language integration patterns for image captioning, visual question answering (VQA), and document analysis across multimodal tasks.

npx playbooks add skill yonatangross/orchestkit --skill vision-language-models


SKILL.md
---
name: vision-language-models
description: GPT-5/4o, Claude 4.5, Gemini 2.5/3, Grok 4 vision patterns for image analysis, document understanding, and visual QA. Use when implementing image captioning, document/chart analysis, or multi-image comparison.
context: fork
agent: multimodal-specialist
version: 1.0.0
author: OrchestKit
user-invocable: false
tags: [vision, multimodal, image, gpt-5, claude-4, gemini, grok, vlm, 2026]
---

# Vision Language Models (2026)

Integrate vision capabilities from leading multimodal models for image understanding, document analysis, and visual reasoning.

## Overview

- Image captioning and description generation
- Visual question answering (VQA)
- Document/chart/diagram analysis with OCR
- Multi-image comparison and reasoning
- Bounding box detection and region analysis
- Video frame analysis

## Model Comparison (January 2026)

| Model | Context | Strengths | Vision Input |
|-------|---------|-----------|--------------|
| **GPT-5.2** | 128K | Best general reasoning, multimodal | Up to 10 images |
| **Claude Opus 4.5** | 200K | Best coding, sustained agent tasks | Up to 100 images |
| **Gemini 2.5 Pro** | 1M+ | Longest context, video analysis | 3,600 images max |
| **Gemini 3 Pro** | 1M | Deep Think, 100% AIME 2025 | Enhanced segmentation |
| **Grok 4** | 2M | Real-time X integration, DeepSearch | Images + upcoming video |

## Image Input Methods

### Base64 Encoding (All Providers)

```python
import base64
import mimetypes

def encode_image_base64(image_path: str) -> tuple[str, str]:
    """Encode local image to base64 with MIME type."""
    mime_type, _ = mimetypes.guess_type(image_path)
    mime_type = mime_type or "image/png"

    with open(image_path, "rb") as f:
        base64_data = base64.standard_b64encode(f.read()).decode("utf-8")

    return base64_data, mime_type
```

### OpenAI GPT-5/4o Vision

```python
from openai import OpenAI

client = OpenAI()

def analyze_image_openai(image_path: str, prompt: str) -> str:
    """Analyze image using GPT-5 or GPT-4o."""
    base64_data, mime_type = encode_image_base64(image_path)

    response = client.chat.completions.create(
        model="gpt-5",  # or "gpt-4o", "gpt-4.1"
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:{mime_type};base64,{base64_data}",
                    "detail": "high"  # low, high, or auto
                }}
            ]
        }],
        max_tokens=4096  # Prevents truncated vision responses
    )
    return response.choices[0].message.content
```
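Remote images can also be passed by URL instead of base64: OpenAI-style APIs accept an `https` URL in the same `image_url` field and fetch the image themselves. A minimal payload builder, sketched as a hypothetical helper:

```python
def build_openai_image_message(image_url: str, prompt: str, detail: str = "auto") -> dict:
    """Build an OpenAI-style vision message for a remote image URL.

    Skips base64 encoding entirely; the API downloads the image itself.
    """
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": image_url, "detail": detail},
            },
        ],
    }
```

Pass the result as an element of `messages=[...]` in `client.chat.completions.create(...)`.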

### Claude 4.5 Vision (Anthropic)

```python
import anthropic

client = anthropic.Anthropic()

def analyze_image_claude(image_path: str, prompt: str) -> str:
    """Analyze image using Claude Opus 4.5 or Sonnet 4.5."""
    base64_data, media_type = encode_image_base64(image_path)

    response = client.messages.create(
        model="claude-opus-4-5-20251124",  # or claude-sonnet-4-5
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": base64_data
                    }
                },
                {"type": "text", "text": prompt}
            ]
        }]
    )
    return response.content[0].text
```

### Gemini 2.5/3 Vision (Google)

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")

def analyze_image_gemini(image_path: str, prompt: str) -> str:
    """Analyze image using Gemini 2.5 Pro or Gemini 3."""
    model = genai.GenerativeModel("gemini-2.5-pro")  # or gemini-3-pro

    image = Image.open(image_path)

    response = model.generate_content([prompt, image])
    return response.text

# For video analysis (Gemini excels here)
def analyze_video_gemini(video_path: str, prompt: str) -> str:
    """Analyze video using Gemini's native video support."""
    import time

    model = genai.GenerativeModel("gemini-2.5-pro")

    # Uploaded videos are processed asynchronously; poll until ready
    video_file = genai.upload_file(video_path)
    while video_file.state.name == "PROCESSING":
        time.sleep(2)
        video_file = genai.get_file(video_file.name)
    if video_file.state.name == "FAILED":
        raise RuntimeError(f"Video processing failed for {video_path}")

    response = model.generate_content([prompt, video_file])
    return response.text
```

### Grok 4 Vision (xAI)

```python
from openai import OpenAI  # Grok uses OpenAI-compatible API

client = OpenAI(
    api_key="YOUR_XAI_API_KEY",
    base_url="https://api.x.ai/v1"
)

def analyze_image_grok(image_path: str, prompt: str) -> str:
    """Analyze image using Grok 4 with real-time capabilities."""
    base64_data, mime_type = encode_image_base64(image_path)

    response = client.chat.completions.create(
        model="grok-4",  # or grok-2-vision-1212
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:{mime_type};base64,{base64_data}"
                }}
            ]
        }]
    )
    return response.choices[0].message.content
```

## Multi-Image Analysis

```python
def compare_images(images: list[str], prompt: str) -> str:
    """Compare multiple images (Claude supports up to 100 per request)."""
    content = []

    for img_path in images:
        base64_data, media_type = encode_image_base64(img_path)
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": base64_data
            }
        })

    content.append({"type": "text", "text": prompt})

    response = client.messages.create(
        model="claude-opus-4-5-20251124",
        max_tokens=8192,
        messages=[{"role": "user", "content": content}]
    )
    return response.content[0].text
```
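Per-request image caps vary by provider (see the size-limits table), so batches larger than the cap must be split first. A small hypothetical helper:

```python
def chunk_images(paths: list[str], max_per_request: int = 100) -> list[list[str]]:
    """Split image paths into request-sized batches (cap is provider-specific)."""
    if max_per_request < 1:
        raise ValueError("max_per_request must be >= 1")
    return [paths[i:i + max_per_request] for i in range(0, len(paths), max_per_request)]
```

Each chunk can then be sent through `compare_images` (or an equivalent call) in turn.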

## Object Detection (Gemini 2.5+)

```python
def detect_objects_gemini(image_path: str) -> list[dict]:
    """Detect objects with bounding boxes using Gemini 2.5+."""
    model = genai.GenerativeModel("gemini-2.5-pro")
    image = Image.open(image_path)

    response = model.generate_content([
        "Detect all objects in this image. Return bounding boxes "
        "as JSON with format: {objects: [{label, box: [x1,y1,x2,y2]}]}",
        image
    ])

    import json

    # Models often wrap JSON in markdown fences; strip them before parsing
    text = response.text.strip()
    if text.startswith("```"):
        text = text.strip("`").removeprefix("json").strip()
    return json.loads(text)
```

## Token Cost Optimization

| Provider | Detail Level | Cost Impact |
|----------|-------------|-------------|
| OpenAI | `low` (65 tokens) | Use for classification |
| OpenAI | `high` (129+ tokens/tile) | Use for OCR/charts |
| Gemini | 258 tokens base | Scales with resolution |
| Claude | Per-image pricing | Batch for efficiency |

```python
# Cost-optimized simple classification
response = client.chat.completions.create(
    model="gpt-4o-mini",  # Cheaper for simple tasks
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Is there a person? Reply: yes/no"},
            {"type": "image_url", "image_url": {
                "url": image_url,
                "detail": "low"  # Minimal tokens
            }}
        ]
    }]
)
```

## Image Size Limits (2026)

| Provider | Max Size | Max Images | Notes |
|----------|----------|------------|-------|
| OpenAI | 20MB | 10/request | GPT-5 series |
| Claude | 8000x8000 px | 100/request | 2000px if >20 images |
| Gemini | 20MB | 3,600/request | Best for batch |
| Grok | 20MB | Limited | Grok 5 expands this |

## Key Decisions

| Decision | Recommendation |
|----------|----------------|
| High accuracy | Claude Opus 4.5 or GPT-5 |
| Long documents | Gemini 2.5 Pro (1M context) |
| Cost efficiency | Gemini 2.5 Flash ($0.15/M tokens) |
| Real-time/X data | Grok 4 with DeepSearch |
| Video analysis | Gemini 2.5/3 Pro (native) |
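The decision table above can be encoded as a simple routing map. The model identifiers below are taken from this document's own examples and are assumptions, not verified API names:

```python
# Task-to-model routing sketch; model IDs are assumptions from this document.
VISION_MODEL_ROUTING = {
    "high_accuracy": "claude-opus-4-5-20251124",
    "long_documents": "gemini-2.5-pro",
    "cost_efficiency": "gemini-2.5-flash",
    "realtime_x_data": "grok-4",
    "video": "gemini-2.5-pro",
}

def pick_vision_model(task: str) -> str:
    """Map a task category from the decision table to a model ID."""
    try:
        return VISION_MODEL_ROUTING[task]
    except KeyError:
        valid = ", ".join(sorted(VISION_MODEL_ROUTING))
        raise ValueError(f"Unknown task {task!r}; expected one of: {valid}") from None
```

Centralizing the mapping keeps model upgrades to a one-line change.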

## Common Mistakes

- Not setting `max_tokens` (responses truncated)
- Sending oversized images (resize to 2048px max)
- Using `high` detail for yes/no questions
- Not validating image format before encoding
- Ignoring rate limits on vision endpoints
- Using deprecated models (GPT-4V retired)
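Two of the mistakes above (oversized images, unvalidated formats) can be handled in one preprocessing step. A sketch using Pillow (already used in the Gemini examples); the format allowlist and 2048px cap are this document's recommendations, not provider requirements:

```python
from PIL import Image

ALLOWED_FORMATS = {"JPEG", "PNG", "WEBP", "GIF"}
MAX_EDGE = 2048  # recommended max dimension before encoding

def prepare_image(image_path: str, out_path: str) -> str:
    """Validate format and downscale oversized images before base64 encoding.

    Returns the path of the image to encode (original if already small enough).
    """
    with Image.open(image_path) as img:
        fmt = img.format
        if fmt not in ALLOWED_FORMATS:
            raise ValueError(f"Unsupported image format: {fmt}")
        if max(img.size) <= MAX_EDGE:
            return image_path  # already within limits
        img.thumbnail((MAX_EDGE, MAX_EDGE))  # in-place, preserves aspect ratio
        img.save(out_path, format=fmt)
    return out_path
```

Run this before `encode_image_base64` so every request stays within size limits.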

## Limitations

- Cannot identify specific people (privacy restriction)
- May hallucinate on low-quality, rotated, or very small (<200px) images
- GPT-4o: struggles with non-Latin text, precise spatial reasoning
- No real-time video input; extract frames for per-frame analysis (Gemini accepts uploaded video natively)

## Related Skills

- `audio-language-models` - Audio/speech processing
- `multimodal-rag` - Image + text retrieval
- `llm-streaming` - Streaming vision responses

## Capability Details

### image-captioning
**Keywords:** caption, describe, image description, alt text, accessibility
**Solves:**
- Generate descriptive captions for images
- Create accessibility alt text
- Extract visual content summary

### visual-qa
**Keywords:** VQA, visual question, image question, analyze image
**Solves:**
- Answer questions about image content
- Extract specific information from visuals
- Reason about image elements

### document-vision
**Keywords:** document, PDF, chart, diagram, OCR, extract, table
**Solves:**
- Extract text from documents and charts
- Analyze diagrams and flowcharts
- Process forms and tables with structure

### multi-image-analysis
**Keywords:** compare images, multiple images, image comparison, batch
**Solves:**
- Compare visual elements across images
- Track changes between versions
- Analyze image sequences

### object-detection
**Keywords:** bounding box, detect objects, locate, segmentation
**Solves:**
- Detect and locate objects in images
- Generate bounding box coordinates
- Segment image regions (Gemini 2.5+)

Overview

This skill integrates vision capabilities from leading multimodal models (GPT-5/4o, Claude 4.5, Gemini 2.5/3, Grok 4) for image understanding, document analysis, and visual question answering. It provides production-ready patterns for image captioning, OCR/chart extraction, object detection, multi-image comparison, and video-frame workflows. Use it to choose models, optimize token and image costs, and implement robust encoding/ingestion patterns in Python-based agents and services.

How this skill works

The skill supplies concrete input patterns (base64/image objects, file uploads, and model-specific API payloads) and code examples for calling vision endpoints across providers. It covers image size limits, batching constraints, and token cost tradeoffs, and maps capabilities to tasks like captioning, VQA, document OCR, object detection, and multi-image comparison. It also recommends models for accuracy, context length, video support, and real-time use cases, plus practical pitfalls to avoid.

When to use it

  • Build image captioning or accessibility alt-text generation
  • Implement visual question answering or image-guided assistants
  • Extract text/tables from PDFs, charts, or diagrams with OCR
  • Compare multiple images or track visual changes across versions
  • Detect objects, bounding boxes, or run segmentation on images or frames

Best practices

  • Encode images as base64 with MIME type and validate formats before upload
  • Set max_tokens appropriately for vision tasks to avoid truncation
  • Resize oversized images (recommend ~2048px) and choose detail level based on task cost
  • Batch images where supported (Claude, Gemini) to reduce per-image costs and latency
  • Choose model by task: high accuracy (Claude Opus 4.5/GPT-5), long documents/video (Gemini 2.5+), real-time/X data (Grok 4)

Example use cases

  • Generate descriptive captions and alt text for a content management system
  • Answer product or scene questions from user-uploaded photos (VQA)
  • Extract tables, fields, and charts from financial reports and PDFs
  • Detect objects and bounding boxes for inventory or surveillance pipelines
  • Compare design mockups or before/after images for change detection

FAQ

Which model should I pick for large document analysis?

Use Gemini 2.5 Pro for the longest context and native video support; Gemini handles large batches and multi-page documents well.

How do I reduce cost for simple classification?

Use a cheaper vision-capable model or low-detail image option (e.g., gpt-4o-mini or low detail in OpenAI requests) and batch images when possible to lower per-image pricing.