home / skills / cdeistopened / opened-vault / video-generator

video-generator skill

This skill helps you generate short-form AI videos with synchronized audio using Veo 3.1, enabling fast social clips and promos.

npx playbooks add skill cdeistopened/opened-vault --skill video-generator

Review the files below or copy the command above to add this skill to your agents.

Files (3)

SKILL.md

15.4 KB

---
name: video-generator
description: Generate AI videos using Google VEO 3.1 or OpenAI Sora. Use this skill when content needs short-form video — social clips, promotional teasers, ambient backgrounds, or creative visual storytelling. Two providers for different strengths - VEO for native audio, Sora for visual quality and longer clips.
---

# Video Generator

Generate professional short-form videos using Google VEO 3.1 (with native audio) or OpenAI Sora (high visual quality, up to 12s).

## Prerequisites & Setup

### API Keys

**VEO (Google):** Uses the same key as the image-prompt-generator skill. If you already have `GEMINI_API_KEY` set, you're ready.

**Sora (OpenAI):** Requires `OPENAI_API_KEY` (already in `OpenEd Vault/.env`).

```bash
export GEMINI_API_KEY=your_gemini_key_here
export OPENAI_API_KEY=your_openai_key_here
```

### Install Dependencies

```bash
pip install google-genai requests
```

### Available Models

| Provider | Model | CLI `--model` | Best For |
|----------|-------|---------------|----------|
| **VEO** | Veo 3.1 Standard | `standard` (default) | Quality, audio fidelity, final assets |
| **VEO** | Veo 3.1 Fast | `fast` | Drafts, iteration, quick previews |
| **Sora** | Sora 2 | `sora-2` (default) | Visual quality, creative motion |
| **Sora** | Sora 2 Pro | `sora-2-pro` | Highest Sora quality, slower |

### When to Use Which

| Need | Use |
|------|-----|
| Native synchronized audio (dialogue, SFX) | **VEO** - Sora has no audio |
| Longer clips (12 seconds) | **Sora** - VEO maxes at 8s |
| Higher visual fidelity / artistic styles | **Sora** - stronger on visual aesthetics |
| Fast iteration / drafts | **VEO Fast** - quickest turnaround |
| 4K resolution | **VEO** - Sora uses fixed sizes |
| Negative prompts (exclude elements) | **VEO** - Sora doesn't support them |

### Video Parameters

| Parameter | VEO Options | Sora Options | Default |
|-----------|-------------|--------------|---------|
| **Duration** | 4, 6, or 8 seconds | 4, 8, or 12 seconds | 8s |
| **Resolution** | 720p, 1080p, 4K | Fixed (from aspect ratio) | 720p |
| **Aspect Ratio** | 16:9, 9:16 | 16:9 → 1280x720, 9:16 → 720x1280 | 16:9 |
| **Count** | 1-4 variations | 1-4 variations | 1 |
| **Negative Prompt** | Supported | Not supported (ignored) | none |

### Latency

Video generation is async — expect 11 seconds to 6 minutes depending on server load and provider. The script polls automatically and saves when ready.

---

## Workflow Overview

1. **Define the Concept** — What story does the video tell in 4-8 seconds?
2. **Storyboard the Shot** — Camera, motion, subject, environment
3. **Add Audio Direction** — Dialogue, sound effects, ambient sound
4. **Generate Video** — Run via API
5. **Iterate** — Adjust prompt based on results

## Prompt Rules (From Research)

Before diving into the workflow, internalize these rules from extensive testing across Veo, Runway, and Sora:

1. **150-300 characters is the sweet spot.** Under 100 = generic. Over 400 = the model drops elements unpredictably.
2. **One shot = one action.** Don't pack multiple scene changes or style shifts into one prompt. One camera move + one subject action.
3. **Describe what you want, not what you don't want.** Use the `--negative` flag for exclusions, not the main prompt.
4. **Treat audio as a separate layer.** Write audio cues in their own sentences, not mixed into visual descriptions.
5. **Use colon syntax for dialogue.** `A man says: "Hello!"` prevents subtitle artifacts. Without the colon, text may appear on screen.
6. **Keep dialogue under 7 words per line.** Longer speech causes lip-sync drift or rushed garbling.
7. **Start simple, then layer.** Begin with a basic prompt, evaluate, then add one variable at a time.
8. **Slow camera movements win.** Fast pans and spins break output. Use tight framing for perceived speed.

See `references/prompt-engineering-research.md` for the complete research.

---

## Step 1: Define the Concept

A good video prompt answers three questions:

- **What's happening?** (action/motion)
- **Where?** (environment/setting)
- **What does it sound like?** (audio landscape)

Unlike images, video is *temporal*. Think in terms of movement and change, not a static composition.

**Good concepts for 8-second clips:**

| Use Case | Example Concept |
|----------|----------------|
| Social teaser | A hand flipping through pages of a book, stopping on a highlighted passage |
| Ambient background | Rain falling on a window with city lights blurring behind it |
| Product reveal | Camera slowly orbits a product on a table, warm studio lighting |
| Podcast promo | A microphone in a cozy studio, coffee steam rising, morning light |
| Newsletter visual | A typewriter striking keys, with the sound of each keystroke |

## Step 2: Storyboard the Shot

Structure your prompt with cinematic language. Veo responds well to film terminology:

### Camera Language

| Term | Effect |
|------|--------|
| **Wide shot** | Shows full environment, establishes context |
| **Close-up** | Tight on a subject, emphasizes detail |
| **Tracking shot** | Camera follows subject movement |
| **Dolly in/out** | Camera moves toward or away from subject |
| **Static shot** | Locked camera, subject moves within frame |
| **Slow pan** | Camera rotates horizontally across scene |
| **Overhead / bird's eye** | Looking straight down |
| **Low angle** | Looking up at subject, adds drama |

### Motion Description

Be explicit about what moves and how:

| Don't | Do |
|-------|-----|
| "A dog in a park" | "A golden retriever runs toward camera through tall grass, ears bouncing" |
| "City at night" | "Camera slowly dollies through a neon-lit Tokyo alley as rain puddles reflect signs" |
| "Ocean" | "A single wave forms, curls, and crashes onto wet sand in slow motion" |

### Lighting & Atmosphere

| Term | Mood |
|------|------|
| Golden hour | Warm, nostalgic, cinematic |
| Overcast | Soft, even, contemplative |
| Neon / artificial | Urban, energetic, modern |
| Candlelight | Intimate, quiet |
| Hard shadows | Dramatic, high contrast |

## Step 3: Add Audio Direction

Veo 3.1 generates synchronized audio natively. This is a major differentiator — use it.

### Three Types of Audio Cues

**1. Dialogue** — Use colon syntax before quotes (prevents subtitle artifacts):
```
A barista says: "Here you go!" as she slides a latte across the counter.
```
Keep lines under 7 words for clean lip-sync. One sentence max per 8-second clip.

**2. Sound Effects** — Describe specific sounds:
```
The sound of a match striking, then a candle flame flickering to life.
```

**3. Ambient Sound** — Set the sonic environment:
```
Birds chirping in the background, distant traffic hum, morning atmosphere.
```

### Audio Tips

- Be specific: "the crunch of gravel underfoot" beats "footstep sounds"
- Layer audio: combine ambient + specific sounds for depth
- Match audio to motion: "a door creaks open" timed with the visual action
- Dialogue should be short — 1-2 sentences max for 8 seconds

## Step 4: Craft the Full Prompt

### Prompt Structure (5-Element Priority)

Structure prompts in this order of priority — you don't need all five every time:

1. **Shot Specification** — camera work, framing, movement
2. **Setting & Atmosphere** — location, time, weather, lighting
3. **Subject & Action** — who/what, described in beats
4. **Audio Layer** — dialogue, SFX, ambient (separate sentences)
5. **Style/Grade** — artistic treatment, lens, color

```
[Shot type + camera movement]. [Setting and lighting]. [Subject doing action].
[Audio: what you hear]. [Style/grade].
```

### Example Prompts

**Podcast promo (16:9):**
```
A close-up tracking shot of a vintage microphone in a warmly lit podcast studio.
Steam rises slowly from a coffee mug beside it. Morning sunlight filters through
blinds, casting soft stripes across the desk. The sound of a quiet room — a clock
ticking, the faint hum of equipment. Cinematic, intimate, inviting.
```

**Social teaser — vertical (9:16):**
```
A hand reaches into frame and opens a leather-bound journal on a wooden desk.
The pages flutter briefly before settling on a page covered in handwritten notes.
A pen is set down beside the book. The sound of pages rustling, a pen clicking,
and soft ambient music. Warm overhead lighting, shallow depth of field.
```

**Newsletter header — ambient loop (16:9):**
```
A static wide shot of rain falling on a large window. Behind the glass, a blurred
cityscape with warm lights. Water droplets slide slowly down the pane. The sound of
steady rain and distant muffled city noise. Moody, contemplative, cozy.
```

**Product reveal (16:9):**
```
Camera slowly orbits a pair of wireless headphones placed on a dark marble surface.
Dramatic studio lighting with a single warm key light from the left. The headphones
cast a sharp shadow. Subtle electronic ambient music. Premium, minimal, modern.
```

## Step 5: Generate via API

### Running the Script

```bash
# VEO: Basic generation (8s, 720p, 16:9)
python scripts/generate_video.py "Your prompt here"

# VEO: Fast draft for iteration
python scripts/generate_video.py "Your prompt here" --model fast

# VEO: High quality vertical video for social
python scripts/generate_video.py "Your prompt" --aspect 9:16 --resolution 1080p

# VEO: Multiple variations to choose from
python scripts/generate_video.py "Your prompt" --count 2 --output ./videos

# VEO: Short clip with specific settings
python scripts/generate_video.py "Your prompt" --duration 4 --resolution 4k --name "hero-clip"

# VEO: Exclude unwanted elements
python scripts/generate_video.py "Your prompt" --negative "text overlays, watermarks, blurry"

# Sora: Basic generation
python scripts/generate_video.py "A cat on a windowsill, warm light" --provider sora

# Sora: 12-second clip (longer than VEO allows)
python scripts/generate_video.py "A dog running through a meadow" --provider sora --duration 12

# Sora: Pro model, vertical
python scripts/generate_video.py "Latte art being poured" --provider sora --model sora-2-pro --aspect 9:16

# Sora: Multiple variations
python scripts/generate_video.py "Ocean waves at sunset" --provider sora --count 2 --output ./videos
```

**Options:**

| Flag | Values | Default | Notes |
|------|--------|---------|-------|
| `--provider` | `veo`, `sora` | `veo` | VEO for audio, Sora for visual quality |
| `--model` | VEO: `standard`, `fast` / Sora: `sora-2`, `sora-2-pro` | `standard` / `sora-2` | Provider-specific models |
| `--aspect` | `16:9`, `9:16` | `16:9` | Vertical for Reels/TikTok/Shorts |
| `--resolution` | `720p`, `1080p`, `4k` | `720p` | VEO only (ignored by Sora) |
| `--duration` | VEO: `4`, `6`, `8` / Sora: `4`, `8`, `12` | `8` | Sora supports 12s |
| `--negative` | text | none | VEO only (ignored by Sora) |
| `--count` | `1`-`4` | `1` | Generate variations |
| `--output` | path | `.` | Save directory |
| `--name` | text | none | Filename prefix |

**Output:** MP4 files with timestamp-based filenames.

## Step 6: Iterate

After reviewing generated video:

- **Motion wrong?** Be more explicit about direction, speed, and sequence
- **Audio off?** Add or refine audio cues — the model needs clear direction
- **Too much happening?** Simplify. One clear action per clip works best
- **Style drift?** Add a negative prompt to exclude unwanted aesthetics
- **Wrong mood?** Adjust lighting and atmosphere descriptors

### Iteration Strategy

1. Start with `--model fast` and `--duration 4` for quick drafts
2. Refine the prompt through 2-3 fast iterations
3. Switch to `--model standard` with full duration/resolution for the final take
4. Generate 2 variations of the final prompt and pick the best

---

## Negative Prompt Guide

Use `--negative` to steer away from common problems:

| Problem | Negative Prompt |
|---------|-----------------|
| Text/watermarks appearing | "text, watermarks, logos, subtitles" |
| Uncanny faces | "distorted faces, morphing features" |
| Jittery motion | "jerky motion, flickering, stuttering" |
| Over-saturated look | "oversaturated, HDR, neon colors" |
| Stock footage feel | "generic, corporate, stock footage aesthetic" |

---

## Prompting Principles

### Think in Shots, Not Scenes

8 seconds is one shot. Don't try to cram a narrative arc — describe a single continuous moment.

| Don't | Do |
|-------|-----|
| "A chef makes a meal from scratch and serves it" | "A chef's hands julienne carrots on a wooden cutting board, knife moving rhythmically" |
| "A day at the beach from sunrise to sunset" | "Waves gently lap at bare feet on sand, golden hour light, camera at ground level" |

### Be Specific About Motion

Vague motion descriptions produce vague results. Describe *what moves*, *how fast*, and *in which direction*.

### Layer Your Audio

Don't just describe one sound — create a soundscape:
```
The crackling of a vinyl record playing soft jazz,
a distant car horn outside the window,
the quiet clink of an ice cube in a glass.
```

### Use Negative Prompts Proactively

Always include `--negative "text, watermarks"` at minimum. The model occasionally generates unwanted text overlays.

---

## Use Cases by Content Type

### Social Media (9:16, 4-8s)

Short, punchy, loop-friendly. Favor close-ups and strong motion.

```bash
python scripts/generate_video.py "Close-up of coffee being poured into a ceramic mug, steam rising, warm morning light. The sound of liquid pouring and a soft sigh." \
  --aspect 9:16 --duration 4 --resolution 1080p
```

### Podcast/Newsletter Headers (16:9, 8s)

Ambient, atmospheric. Favor wide shots and subtle motion.

```bash
python scripts/generate_video.py "A vintage radio on a wooden shelf, dial slowly turning. Warm tungsten light. Soft static transitioning into faint music." \
  --resolution 1080p --name "podcast-header"
```

### Product/Brand (16:9, 6-8s)

Clean, controlled, premium feel. Studio lighting, slow orbits.

```bash
python scripts/generate_video.py "Camera slowly orbits a leather notebook on a dark wood desk. Single warm key light. The sound of pages turning gently." \
  --resolution 4k --duration 6 --negative "text, watermarks, busy background"
```

---

## Multi-Clip Consistency

When generating multiple clips for a project (e.g. a social series, product launch, or multi-shot sequence):

### Lock Your Constants

Create a consistency block and repeat it verbatim across all prompts:

```
CHARACTER: A woman in her thirties with short silver hair and a black turtleneck
PALETTE: amber, cream, walnut brown, deep olive
LIGHTING: Soft key light from camera right, warm tungsten
STYLE: Cinematic, shallow depth of field, warm film grain
NEGATIVE: no subtitles, no on-screen text, no watermarks
```

### Frame Chaining

For sequential shots, use the last frame of clip N as the reference image for clip N+1. This preserves subject orientation, lighting continuity, and motion vectors.

### Consistency Checklist

- [ ] Same character description, word for word — never paraphrase between shots
- [ ] Same palette anchors (3-5 named colors)
- [ ] Same lighting direction and quality
- [ ] Same aspect ratio and resolution
- [ ] Same style/grade language
- [ ] "No subtitles, no on-screen text" included
- [ ] Simple wardrobe — solid colors and notable anchors (red jacket, silver pendant) are more consistent than busy patterns

---

## Related Skills

- **image-prompt-generator** — Static images with the same API key and similar workflow
- **youtube-title-creator** — Pair video content with optimized titles
- **social-content-creation** — Use videos in platform-optimized posts

---

*For prompt engineering research and advanced techniques, see `references/prompt-engineering-research.md`*

Overview

This skill generates short-form AI videos with synchronized native audio using Google's Veo 3.1 API. It follows an iterative workflow: storyboard the shot, craft a concise prompt with separate audio cues, generate via the API, then refine until the result matches the concept. The tool supports fast drafts and high-quality final renders with configurable duration, aspect, and resolution.

How this skill works

You provide a single-shot prompt that specifies camera work, setting, subject action, audio layers, and optional style notes. The script calls Veo 3.1 (standard or fast) using your GEMINI_API_KEY, polls for the asynchronous result, and saves MP4 assets. Iteration is encouraged: run fast drafts to refine prompts and switch to the standard model for final outputs.

When to use it

Create social clips, reels, or TikToks that need tight motion and loop-friendly content.
Produce podcast or newsletter headers with ambient motion and synchronized soundscapes.
Make short product reveals or brand teasers with cinematic lighting and slow camera moves.
Generate ambient background loops for websites or presentations.
Rapidly prototype concepts with fast drafts before finalizing high-resolution assets.

Best practices

Keep prompts 150–300 characters for reliable, detailed output.
Describe one continuous shot and one clear action; avoid scene changes within an 8-second clip.
Write audio cues in separate sentences and keep dialogue under seven words per line for clean lip-sync.
Start simple, iterate with the fast model, then finalize with the standard model at higher resolution.
Always include a negative prompt to exclude text, watermarks, or unwanted artifacts.

Example use cases

Social teaser — close-up hand action in 9:16, 4–6s for Reels or Shorts.
Podcast header — 16:9, 8s ambient studio shot with subtle room sounds.
Product reveal — slow orbit around an object, dramatic lighting, 6–8s at 4K for hero assets.
Newsletter visual — static window rain loop with city hum and soft music.
Rapid prototyping — use fast model, short duration, generate 2–4 variations to pick the best.

FAQ

What API key do I need?

Use the GEMINI_API_KEY from Google AI Studio. The same key used for image generation works for Veo 3.1.

How long does generation take?

Generation is asynchronous and ranges from ~11 seconds (fast drafts) to several minutes depending on model, duration, and server load.

How do I avoid text or watermarks appearing?

Include a negative prompt such as "text, watermarks, logos, subtitles" and keep the visual prompt focused on a single action.