
multimodal-llm skill

/plugins/ork/skills/multimodal-llm

This skill helps you build and orchestrate multimodal pipelines by integrating vision and audio processing with LLMs for image, audio, and document tasks.

npx playbooks add skill yonatangross/orchestkit --skill multimodal-llm

Review the files below or copy the command above to add this skill to your agents.

Files (9)
SKILL.md
4.8 KB
---
name: multimodal-llm
license: MIT
compatibility: "Claude Code 2.1.34+."
author: OrchestKit
version: 1.0.0
description: Vision, audio, and multimodal LLM integration patterns. Use when processing images, transcribing audio, generating speech, or building multimodal AI pipelines.
tags: [vision, audio, multimodal, image, speech, transcription, tts]
user-invocable: false
context: fork
complexity: high
metadata:
  category: mcp-enhancement
---

# Multimodal LLM Patterns

Integrate vision and audio capabilities from leading multimodal models. Covers image analysis, document understanding, real-time voice agents, speech-to-text, and text-to-speech.

## Quick Reference

| Category | Rules | Impact | When to Use |
|----------|-------|--------|-------------|
| [Vision: Image Analysis](#vision-image-analysis) | 1 | HIGH | Image captioning, VQA, multi-image comparison, object detection |
| [Vision: Document Understanding](#vision-document-understanding) | 1 | HIGH | OCR, chart/diagram analysis, PDF processing, table extraction |
| [Vision: Model Selection](#vision-model-selection) | 1 | MEDIUM | Choosing provider, cost optimization, image size limits |
| [Audio: Speech-to-Text](#audio-speech-to-text) | 1 | HIGH | Transcription, speaker diarization, long-form audio |
| [Audio: Text-to-Speech](#audio-text-to-speech) | 1 | MEDIUM | Voice synthesis, expressive TTS, multi-speaker dialogue |
| [Audio: Model Selection](#audio-model-selection) | 1 | MEDIUM | Real-time voice agents, provider comparison, pricing |

**Total: 6 rules across 2 categories (Vision, Audio)**

## Vision: Image Analysis

Send images to multimodal LLMs for captioning, visual QA, and object detection. Always set `max_tokens` and resize images before encoding.

| Rule | File | Key Pattern |
|------|------|-------------|
| Image Analysis | `rules/vision-image-analysis.md` | Base64 encoding, multi-image, bounding boxes |
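
A minimal sketch of the resize-before-encode step, assuming Pillow is available (`MAX_DIM` and `encode_image` are illustrative names, not part of the rule file):

```python
import base64
import io

from PIL import Image  # assumes Pillow is installed

MAX_DIM = 2048  # keep the longest side under common vision-model size limits

def encode_image(path: str) -> str:
    """Downscale to MAX_DIM on the longest side, then base64-encode as JPEG."""
    img = Image.open(path)
    img.thumbnail((MAX_DIM, MAX_DIM))  # in-place; preserves aspect ratio, never upscales
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=90)
    return base64.standard_b64encode(buf.getvalue()).decode("utf-8")
```

Because this helper re-encodes as JPEG, the request's `media_type` should be `image/jpeg` rather than `image/png`.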

## Vision: Document Understanding

Extract structured data from documents, charts, and PDFs using vision models.

| Rule | File | Key Pattern |
|------|------|-------------|
| Document Vision | `rules/vision-document.md` | PDF page ranges, detail levels, OCR strategies |
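
A hedged sketch of the PDF pattern, assuming the Anthropic SDK's base64 `document` content block (file name and prompt are illustrative):

```python
import anthropic, base64

client = anthropic.Anthropic()
with open("report.pdf", "rb") as f:
    pdf_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-6",  # model name taken from the Key Decisions table below
    max_tokens=2048,
    messages=[{"role": "user", "content": [
        # PDF pages go in a base64 document block alongside the instruction
        {"type": "document", "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_b64}},
        {"type": "text", "text": "Extract every table as JSON, keeping column headers."}
    ]}]
)
print(response.content[0].text)
```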

## Vision: Model Selection

Choose the right vision provider based on accuracy, cost, and context window needs.

| Rule | File | Key Pattern |
|------|------|-------------|
| Vision Models | `rules/vision-models.md` | Provider comparison, token costs, image limits |

## Audio: Speech-to-Text

Convert audio to text with speaker diarization, timestamps, and sentiment analysis.

| Rule | File | Key Pattern |
|------|------|-------------|
| Speech-to-Text | `rules/audio-speech-to-text.md` | Gemini long-form, GPT-4o-Transcribe, AssemblyAI features |
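
A minimal transcription sketch, assuming the OpenAI Python SDK and the `gpt-4o-transcribe` model named in the rule (file name is illustrative); see `rules/audio-speech-to-text.md` for diarization and long-form options:

```python
from openai import OpenAI

client = OpenAI()
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # per the rule; swap in your provider's STT model
        file=audio_file,
    )
print(transcript.text)
```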

## Audio: Text-to-Speech

Generate natural speech from text with voice selection and expressive cues.

| Rule | File | Key Pattern |
|------|------|-------------|
| Text-to-Speech | `rules/audio-text-to-speech.md` | Gemini TTS, voice config, auditory cues |
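
The rule file covers Gemini TTS specifics; as a generic, hedged sketch of the synthesis step using the OpenAI speech endpoint instead (model, voice, and output path are assumptions):

```python
from openai import OpenAI

client = OpenAI()
speech = client.audio.speech.create(
    model="gpt-4o-mini-tts",  # assumed TTS-capable model name
    voice="alloy",
    input="Thanks for calling. Your order shipped this morning.",
)
with open("reply.mp3", "wb") as f:
    f.write(speech.content)  # write the returned audio bytes to disk
```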

## Audio: Model Selection

Select the right audio/voice provider for real-time, transcription, or TTS use cases.

| Rule | File | Key Pattern |
|------|------|-------------|
| Audio Models | `rules/audio-models.md` | Real-time voice comparison, STT benchmarks, pricing |

## Key Decisions

| Decision | Recommendation |
|----------|----------------|
| High accuracy vision | Claude Opus 4.6 or GPT-5 |
| Long documents | Gemini 2.5 Pro (1M context) |
| Cost-efficient vision | Gemini 2.5 Flash ($0.15/M tokens) |
| Video analysis | Gemini 2.5/3 Pro (native video) |
| Voice assistant | Grok Voice Agent (fastest, <1s) |
| Emotional voice AI | Gemini Live API |
| Long audio transcription | Gemini 2.5 Pro (9.5hr) |
| Speaker diarization | AssemblyAI or Gemini |
| Self-hosted STT | Whisper Large V3 |
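
These decisions can be encoded as a small routing table. A hypothetical helper (task keys and the default are illustrative; model strings come from the table above):

```python
# Hypothetical task-to-model routing based on the Key Decisions table
MODEL_BY_TASK = {
    "high_accuracy_vision": "claude-opus-4-6",
    "long_document": "gemini-2.5-pro",
    "cost_efficient_vision": "gemini-2.5-flash",
    "long_audio_transcription": "gemini-2.5-pro",
    "self_hosted_stt": "whisper-large-v3",
}

def pick_model(task: str) -> str:
    """Return a model name for a task, defaulting to the cost-efficient option."""
    return MODEL_BY_TASK.get(task, "gemini-2.5-flash")
```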

## Example

```python
import anthropic, base64

client = anthropic.Anthropic()

# Base64-encode the image; resize oversized files first (see Common Mistakes)
with open("image.png", "rb") as f:
    b64 = base64.standard_b64encode(f.read()).decode("utf-8")

# Always set max_tokens on vision requests to avoid truncated responses
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
        {"type": "text", "text": "Describe this image"}
    ]}]
)
print(response.content[0].text)
```

## Common Mistakes

1. Not setting `max_tokens` on vision requests (responses truncated)
2. Sending oversized images without resizing (>2048px)
3. Using `high` detail level for simple yes/no classification
4. Using a stitched STT+LLM+TTS pipeline instead of native speech-to-speech
5. Not leveraging barge-in support for natural voice conversations
6. Using deprecated models (GPT-4V, Whisper-1)
7. Ignoring rate limits on vision and audio endpoints

## Related Skills

- `rag-retrieval` - Multimodal RAG with image + text retrieval
- `llm-integration` - General LLM function calling patterns
- `streaming-api-patterns` - WebSocket patterns for real-time audio

Overview

This skill provides production-ready patterns for integrating vision and audio capabilities with multimodal LLMs. It covers image analysis, document understanding, speech-to-text, text-to-speech, and end-to-end multimodal pipelines for real-time and batch workflows. Use these patterns to build reliable, cost-aware multimodal agents in TypeScript, React, FastAPI, and LangGraph environments.

How this skill works

The skill defines clear rules and example patterns for sending images and audio to multimodal models, including resizing, base64 encoding, and token limits. It recommends model selection based on accuracy, latency, and context window, and shows how to combine STT, LLM reasoning, and TTS while avoiding common pitfalls like oversized media and missing max_tokens. Included patterns cover diarization, timestamps, OCR, chart extraction, and real-time voice agents.

When to use it

  • Image captioning, visual question answering, or multi-image comparison tasks
  • Extracting structured data from PDFs, charts, invoices, or diagrams
  • Long-form transcription, speaker diarization, or timestamped transcripts
  • Building real-time voice agents or voice-driven web apps with low latency
  • Prototyping multimodal RAG pipelines that combine image/text retrieval
  • Choosing providers and models for cost, latency, or large context needs

Best practices

  • Always set max_tokens on vision-enabled requests to avoid truncated responses
  • Resize images to sensible limits (e.g., <= 2048px) before encoding to base64
  • Pick models by capability: high-accuracy vision for sensitive tasks, Pro/large-context models for long documents
  • Prefer native speech-to-speech or integrated voice APIs over stitched STT+LLM+TTS when available
  • Use diarization and timestamps for multi-speaker audio and long recordings
  • Monitor rate limits and avoid deprecated models to maintain stability

Example use cases

  • Automated invoice processing: OCR -> table extraction -> structured JSON output
  • Customer support voice agent: real-time STT -> LLM intent + context -> expressive TTS
  • Academic paper ingest: PDF page-range OCR -> chart interpretation -> searchable RAG index
  • Multimodal chatbot: image + text context -> VQA and image-based recommendations
  • Podcast transcription with speaker labels and sentiment annotations

FAQ

Which model should I pick for high-accuracy vision tasks?

Choose high-accuracy multimodal models (e.g., Claude Opus or GPT-5 equivalents) when visual fidelity matters; consider cost vs. accuracy trade-offs.

How do I handle very long audio or documents?

Use large-context or Pro-tier models that support extended context windows, or split input into chunks with stitching and RAG where appropriate.
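
For long audio, a chunk-and-stitch sketch (assuming pydub with ffmpeg is available; transcribe() is a placeholder for whichever STT call you use):

```python
from pydub import AudioSegment  # assumes pydub + ffmpeg are installed

CHUNK_MS = 10 * 60 * 1000  # 10-minute chunks keep each request comfortably within limits

audio = AudioSegment.from_file("lecture.mp3")
parts = []
for n, start in enumerate(range(0, len(audio), CHUNK_MS)):
    chunk_path = f"chunk_{n}.mp3"
    audio[start:start + CHUNK_MS].export(chunk_path, format="mp3")
    parts.append(transcribe(chunk_path))  # transcribe() is a placeholder for your STT call

full_transcript = "\n".join(parts)
```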