
multimodal-models skill

/plugins/ltk-data/skills/multimodal-models

This skill helps you orchestrate multimodal models for vision, audio, and text tasks, enabling zero-shot classification, transcription, and image generation.

npx playbooks add skill eyadsibai/ltk --skill multimodal-models

SKILL.md
---
name: multimodal-models
description: Use when working with "CLIP", "Whisper", "Stable Diffusion", "SDXL", "speech-to-text", "text-to-image", "image generation", "transcription", "zero-shot classification", "image-text similarity", "inpainting", or "ControlNet"
version: 1.0.0
---

# Multimodal Models

Pre-trained models for vision, audio, and cross-modal tasks.

---

## Model Overview

| Model | Modality | Task |
|-------|----------|------|
| **CLIP** | Image + Text | Zero-shot classification, similarity |
| **Whisper** | Audio → Text | Transcription, translation |
| **Stable Diffusion** | Text → Image | Image generation, editing |

---

## CLIP (Vision-Language)

Zero-shot image classification without training on specific labels.

### CLIP Use Cases

| Task | How |
|------|-----|
| Zero-shot classification | Compare image to text label embeddings |
| Image search | Find images matching text query |
| Content moderation | Classify against safety categories |
| Image similarity | Compare image embeddings |

### CLIP Models

| Model | Parameters | Trade-off |
|-------|------------|-----------|
| ViT-B/32 | 151M | Recommended balance |
| ViT-L/14 | 428M | Best quality, slower |
| RN50 | 102M | Fastest, lower quality |

### CLIP Concepts

| Concept | Description |
|---------|-------------|
| **Dual encoder** | Separate encoders for image and text |
| **Contrastive learning** | Trained to match image-text pairs |
| **Normalization** | Always normalize embeddings before similarity |
| **Descriptive labels** | Better labels = better zero-shot accuracy |

**Key concept**: CLIP embeds images and text in the same space; classification is just finding the nearest text label embedding.
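
A minimal zero-shot classification sketch using the Hugging Face `transformers` CLIP wrapper; the image path and label set are placeholders:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image = Image.open("example.jpg")  # placeholder image path

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image is the scaled cosine similarity between the (normalized)
# image embedding and each text embedding; softmax turns it into label scores
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for label, p in zip(labels, probs):
    print(f"{label}: {p:.3f}")
```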

### CLIP Limitations

- Not for fine-grained classification
- No spatial understanding (whole image only)
- May reflect training data biases

---

## Whisper (Speech Recognition)

Robust multilingual transcription supporting 99 languages.

### Whisper Use Cases

| Task | Configuration |
|------|---------------|
| Transcription | Default `transcribe` task |
| Translation to English | `task="translate"` |
| Subtitles | Output format SRT/VTT |
| Word timestamps | `word_timestamps=True` |
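
A sketch of the subtitle path from the table above: transcribe with `task="translate"` and write a minimal SRT file by hand. The audio file name is a placeholder, and the openai-whisper package also ships its own SRT/VTT writers if you prefer those.

```python
import whisper

def fmt(t):
    # seconds -> SRT timestamp "HH:MM:SS,mmm"
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02}:{m:02}:{s:02},{int((t % 1) * 1000):03}"

model = whisper.load_model("turbo")
result = model.transcribe("interview.mp3", task="translate")  # placeholder audio file

with open("interview.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(result["segments"], start=1):
        f.write(f"{i}\n{fmt(seg['start'])} --> {fmt(seg['end'])}\n{seg['text'].strip()}\n\n")
```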

### Whisper Models

| Model | Size | Speed | Recommendation |
|-------|------|-------|----------------|
| turbo | 809M | Fast | **Recommended** |
| large | 1550M | Slow | Maximum quality |
| small | 244M | Medium | Good balance |
| base | 74M | Fast | Quick tests |
| tiny | 39M | Fastest | Prototyping only |

### Whisper Concepts

| Concept | Description |
|---------|-------------|
| **Language detection** | Auto-detects, or specify for speed |
| **Initial prompt** | Improves technical terms accuracy |
| **Timestamps** | Segment-level or word-level |
| **faster-whisper** | 4× faster alternative implementation |

**Key concept**: Specify language when known—auto-detection adds latency.
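
A minimal transcription sketch with the openai-whisper Python API, assuming a recent release that includes the turbo model; the audio path and the terms in `initial_prompt` are placeholders:

```python
import whisper

model = whisper.load_model("turbo")

result = model.transcribe(
    "standup.wav",                                 # placeholder audio file
    language="en",                                 # skip auto-detection when the language is known
    initial_prompt="Kubernetes, Terraform, gRPC",  # bias decoding toward domain vocabulary
    word_timestamps=True,
)

print(result["text"])
for seg in result["segments"]:
    print(f"[{seg['start']:6.1f}s - {seg['end']:6.1f}s] {seg['text'].strip()}")
```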

### Whisper Limitations

- May hallucinate on silence/noise
- No speaker diarization (who said what)
- Accuracy degrades on >30 min audio
- Not suitable for real-time captioning

---

## Stable Diffusion (Image Generation)

Text-to-image generation with various control methods.

### SD Use Cases

| Task | Pipeline |
|------|----------|
| Text-to-image | `DiffusionPipeline` |
| Style transfer | `Image2Image` |
| Fill regions | `Inpainting` |
| Guided generation | `ControlNet` |
| Custom styles | LoRA adapters |

### SD Models

| Model | Resolution | Quality |
|-------|------------|---------|
| SDXL | 1024×1024 | Best |
| SD 1.5 | 512×512 | Good, faster |
| SD 2.1 | 768×768 | Middle ground |

### Key Parameters

| Parameter | Effect | Typical Value |
|-----------|--------|---------------|
| **num_inference_steps** | Quality vs speed | 20-50 |
| **guidance_scale** | Prompt adherence | 7-12 |
| **negative_prompt** | Avoid artifacts | "blurry, low quality" |
| **strength** (img2img) | How much to change | 0.5-0.8 |
| **seed** | Reproducibility | Fixed number |
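
These parameters map directly onto the Hugging Face `diffusers` call signature. A minimal text-to-image sketch, assuming the SDXL base checkpoint and a CUDA GPU; the prompt and output path are placeholders:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

generator = torch.Generator("cuda").manual_seed(42)  # fixed seed for reproducibility

image = pipe(
    prompt="a watercolor painting of a lighthouse at dawn",
    negative_prompt="blurry, low quality",
    num_inference_steps=30,
    guidance_scale=7.5,
    generator=generator,
).images[0]
image.save("lighthouse.png")
```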

### Control Methods

| Method | Input | Use Case |
|--------|-------|----------|
| **ControlNet** | Edge/depth/pose | Structural guidance |
| **LoRA** | Trained weights | Custom styles |
| **Img2Img** | Source image | Style transfer |
| **Inpainting** | Image + mask | Fill regions |
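
As one example of these control paths, an img2img sketch with `diffusers`; the checkpoint id (any SD 1.5 checkpoint works), source image, and prompt are placeholders:

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",  # placeholder: any SD 1.5 checkpoint
    torch_dtype=torch.float16,
).to("cuda")

init_image = load_image("photo.jpg").resize((512, 512))  # placeholder source image

image = pipe(
    prompt="the same scene as an impressionist oil painting",
    image=init_image,
    strength=0.6,        # 0.5-0.8: how far to move away from the source
    guidance_scale=7.5,
).images[0]
image.save("styled.png")
```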

### Memory Optimization

| Technique | Effect |
|-----------|--------|
| CPU offload | Reduces VRAM usage |
| Attention slicing | Trades speed for memory |
| VAE tiling | Large image support |
| xFormers | Faster attention |
| DPM scheduler | Fewer steps needed |

**Key concept**: Use SDXL for quality, SD 1.5 for speed. Always use negative prompts.
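
A memory-optimization sketch for VRAM-constrained setups, assuming `diffusers` with `accelerate` installed (CPU offload requires it) and an SD 1.5 checkpoint; the model id and prompt are placeholders:

```python
import torch
from diffusers import DPMSolverMultistepScheduler, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",  # placeholder: any SD 1.5 checkpoint
    torch_dtype=torch.float16,
)

# Trade some speed for much lower peak VRAM (no .to("cuda") when offloading)
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_tiling()

# DPM-Solver++ reaches comparable quality in fewer steps
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "a foggy forest at sunrise",
    negative_prompt="blurry, low quality",
    num_inference_steps=25,
).images[0]
```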

### SD Limitations

- GPU strongly recommended (CPU very slow)
- Large VRAM requirements for SDXL
- May generate anatomical errors
- Prompt engineering matters

---

## Common Patterns

### Embedding and Similarity

All three models build on learned embeddings:

- CLIP: image and text embeddings compared in a shared space for similarity and retrieval
- Whisper: audio encoder features decoded into text
- SD: text embeddings (from CLIP-family text encoders) conditioning the denoising process
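
A sketch of the "cache embeddings" pattern for CLIP-based image search, assuming a precomputed, L2-normalized image embedding matrix saved to disk; the cache file and query text are placeholders:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Precomputed, L2-normalized image embeddings, built once offline and reused
image_embeds = torch.load("image_embeds.pt")  # placeholder cache, shape (num_images, 512)

inputs = processor(text=["a red bicycle"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_embed = model.get_text_features(**inputs)
text_embed = text_embed / text_embed.norm(dim=-1, keepdim=True)  # normalize before similarity

scores = image_embeds @ text_embed.T          # cosine similarity against the cache
top5 = scores.squeeze(1).topk(5).indices      # indices of the best-matching images
print(top5.tolist())
```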

### GPU Acceleration

| Model | VRAM Needed |
|-------|-------------|
| CLIP ViT-B/32 | ~2 GB |
| Whisper turbo | ~6 GB |
| SD 1.5 | ~6 GB |
| SDXL | ~10 GB |

### Best Practices

| Practice | Why |
|----------|-----|
| Use recommended model sizes | Best quality/speed balance |
| Cache embeddings (CLIP) | Expensive to recompute |
| Specify language (Whisper) | Faster than auto-detect |
| Use negative prompts (SD) | Avoid common artifacts |
| Set seeds for reproducibility | Consistent results |

## Resources

- CLIP: <https://github.com/openai/CLIP>
- Whisper: <https://github.com/openai/whisper>
- Diffusers: <https://huggingface.co/docs/diffusers>

## Overview

This skill exposes practical guidance and recommended configurations for working with multimodal pre-trained models: CLIP (vision-language), Whisper (speech-to-text), and Stable Diffusion (text-to-image). It focuses on real-world tasks like zero-shot classification, transcription, image generation, inpainting, and control-guided synthesis. The goal is to help you choose the right model, tune key parameters, and apply memory and performance optimizations.

## How this skill works

The skill summarizes each model's modality and primary tasks: CLIP embeds images and text into a shared space for similarity and zero-shot classification; Whisper converts audio to text with configurable transcription or translation modes; Stable Diffusion generates and edits images from text and image inputs, with options like ControlNet and inpainting for structural guidance. It highlights model-size trade-offs, essential parameters (e.g., guidance_scale, num_inference_steps, transcription settings), and practical techniques for GPU/memory management.

## When to use it

- Zero-shot image classification, image search, or content moderation without custom training (use CLIP).
- Batch or offline transcription, translation, and subtitle generation from audio (use Whisper).
- Text-to-image creation, style transfer, inpainting, or guided generation with structural controls (use Stable Diffusion / SDXL).
- When you need fast prototyping with lower VRAM use (choose smaller models like CLIP RN50, Whisper tiny/base, SD 1.5).
- When quality matters and you have ample GPU VRAM (choose CLIP ViT-L/14, Whisper large, or SDXL).

## Best practices

- Pick recommended model sizes for your quality/speed needs and test a smaller model first.
- Normalize and cache embeddings for CLIP-based pipelines to avoid repeated expensive computation.
- Specify the language and provide an initial prompt for Whisper to reduce latency and improve domain accuracy.
- Use negative prompts, set seeds, and tune guidance_scale and num_inference_steps for Stable Diffusion reproducibility and prompt adherence.
- Apply memory optimizations (CPU offload, attention slicing, xFormers, VAE tiling) when VRAM is limited.
- Be aware of each model's limitations: bias and lack of spatial detail for CLIP, no diarization for Whisper, and anatomy/artifact risks for Stable Diffusion.

## Example use cases

- Search and retrieval: index image embeddings with CLIP to enable semantic image search for an app.
- Automated captioning and subtitles: transcribe meeting audio to SRT with Whisper, with optional English translation.
- Creative asset generation: produce campaign visuals with SDXL, using ControlNet to lock poses or layouts.
- Image editing workflow: inpaint product photos or apply style transfer with img2img pipelines and LoRA adapters.
- Content safety: run CLIP zero-shot checks against safety label templates before publishing images.

## FAQ

**Which model should I choose for best quality vs. speed?**

Use SDXL, CLIP ViT-L/14, and Whisper large for top quality if you have sufficient GPU VRAM; choose SD 1.5, CLIP ViT-B/32 or RN50, and Whisper turbo or smaller for faster, lower-cost inference.

**How do I reduce VRAM usage for Stable Diffusion?**

Enable CPU offload, attention slicing, and VAE tiling, and consider lower-resolution renders or SD 1.5 instead of SDXL; xFormers attention and a DPM scheduler further cut memory use and the number of steps required.