ai-video-gen-tools skill

This skill helps you generate complete AI-powered videos from text prompts by coordinating image, video, voice-over, and editing pipelines.

This is most likely a fork of the ai-video-gen skill from openclaw

npx playbooks add skill openclaw/skills --skill ai-video-gen-tools

SKILL.md

---
name: ai-video-gen
description: End-to-end AI video generation - create videos from text prompts using image generation, video synthesis, voice-over, and editing. Supports OpenAI DALL-E, Replicate models, LumaAI, Runway, and FFmpeg editing.
---

# AI Video Generation Skill

Generate complete videos from text descriptions using AI.

## Capabilities

1. **Image Generation** - DALL-E 3, Stable Diffusion, Flux
2. **Video Generation** - LumaAI, Runway, Replicate models
3. **Voice-over** - OpenAI TTS, ElevenLabs
4. **Video Editing** - FFmpeg assembly, transitions, overlays

## Quick Start

```bash
# Generate a complete video
python skills/ai-video-gen/generate_video.py --prompt "A sunset over mountains" --output sunset.mp4

# Just images to video
python skills/ai-video-gen/images_to_video.py --images img1.png img2.png --output result.mp4

# Add voiceover
python skills/ai-video-gen/add_voiceover.py --video input.mp4 --text "Your narration" --output final.mp4
```

## Setup

### Required API Keys

Add to your environment or `.env` file:

```bash
# Image generation (pick one)
OPENAI_API_KEY=sk-...              # DALL-E 3
REPLICATE_API_TOKEN=r8_...         # Stable Diffusion, Flux

# Video generation (pick one)
LUMAAI_API_KEY=luma_...            # LumaAI Dream Machine
RUNWAY_API_KEY=...                 # Runway ML
# REPLICATE_API_TOKEN (set above) also covers the Replicate video models

# Voice (optional)
# OPENAI_API_KEY (set above) also covers OpenAI TTS
ELEVENLABS_API_KEY=...             # ElevenLabs

# Or skip the keys and use free local options (no API needed)
```
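
As a rough illustration of how the "pick one" choices above could be resolved at runtime, the sketch below reports which providers are usable for each stage based on the keys that are set. The function name `available_providers` and the provider labels are hypothetical; the skill's own scripts may handle this differently. If you keep the keys in a `.env` file, calling `load_dotenv()` from python-dotenv first copies them into `os.environ`.

```python
import os

# Hypothetical mapping of pipeline stage -> (provider label, env var it needs).
# The variable names come from the list above; the labels are illustrative.
PROVIDERS = {
    "image": [("dalle3", "OPENAI_API_KEY"), ("replicate", "REPLICATE_API_TOKEN")],
    "video": [
        ("lumaai", "LUMAAI_API_KEY"),
        ("runway", "RUNWAY_API_KEY"),
        ("replicate", "REPLICATE_API_TOKEN"),
    ],
    "voice": [("openai_tts", "OPENAI_API_KEY"), ("elevenlabs", "ELEVENLABS_API_KEY")],
}


def available_providers(env=None):
    """Return {stage: [provider, ...]} for every provider whose key is set."""
    env = os.environ if env is None else env
    return {
        stage: [name for name, var in options if env.get(var)]
        for stage, options in PROVIDERS.items()
    }


if __name__ == "__main__":
    for stage, names in available_providers().items():
        print(f"{stage}: {', '.join(names) or 'no API key set (local options only)'}")
```

Passing a plain dict instead of `os.environ` makes the check easy to try without touching your real environment.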

### Install Dependencies

```bash
pip install openai requests pillow replicate python-dotenv
```

### FFmpeg

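FFmpeg must be installed and available on your `PATH`. The original setup installed it via winget on Windows; run `ffmpeg -version` to confirm it is available.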
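
A minimal sketch of a pre-flight check, assuming the scripts call an `ffmpeg` binary found on `PATH` (the helper name `ffmpeg_version` is hypothetical and not part of the skill):

```python
import shutil
import subprocess


def ffmpeg_version(binary="ffmpeg"):
    """Return the first line of `ffmpeg -version`, or None if it is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    print(ffmpeg_version() or "FFmpeg not found on PATH")
```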

## Usage Examples

### 1. Text to Video (Full Pipeline)

```bash
python skills/ai-video-gen/generate_video.py \
  --prompt "A futuristic city at night with flying cars" \
  --duration 5 \
  --voiceover "Welcome to the future" \
  --output future_city.mp4
```

### 2. Multiple Scenes

```bash
python skills/ai-video-gen/multi_scene.py \
  --scenes "Morning sunrise" "Busy city street" "Peaceful night" \
  --duration 3 \
  --output day_in_life.mp4
```

### 3. Image Sequence to Video

```bash
python skills/ai-video-gen/images_to_video.py \
  --images frame1.png frame2.png frame3.png \
  --fps 24 \
  --output animation.mp4
```
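
The internals of `images_to_video.py` are not shown here, so the following is only a sketch of one common way to encode a numbered image sequence with FFmpeg. The helper `build_sequence_command` is hypothetical, and it assumes the frames follow a numbered pattern such as `frame1.png`, `frame2.png`, and so on.

```python
import shlex


def build_sequence_command(pattern, fps, output):
    """Build an FFmpeg command that encodes a numbered image sequence.

    `pattern` is a printf-style name such as "frame%d.png".
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-framerate", str(fps),    # input rate: one image per frame
        "-start_number", "1",      # the example files start at frame1.png
        "-i", pattern,
        "-c:v", "libx264",         # H.264 video
        "-pix_fmt", "yuv420p",     # pixel format most players accept
        output,
    ]


cmd = build_sequence_command("frame%d.png", 24, "animation.mp4")
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(cmd, check=True) to encode
```

Note that libx264 with `yuv420p` needs even image dimensions, so frames with an odd width or height have to be scaled or padded first.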

## Workflow Options

### Budget Mode (free or low-cost)
- Image: Stable Diffusion (local or free API)
- Video: Open source models
- Voice: OpenAI TTS (cheap) or free TTS
- Edit: FFmpeg

### Quality Mode (Paid)
- Image: DALL-E 3 or Midjourney
- Video: Runway Gen-3 or LumaAI
- Voice: ElevenLabs
- Edit: FFmpeg + effects

## Scripts Reference

- `generate_video.py` - Main end-to-end generator
- `images_to_video.py` - Convert image sequence to video
- `add_voiceover.py` - Add narration to existing video
- `multi_scene.py` - Create multi-scene videos
- `edit_video.py` - Apply effects, transitions, overlays

## API Cost Estimates

- **DALL-E 3**: ~$0.04-0.08 per image
- **Replicate**: ~$0.01-0.10 per generation
- **LumaAI**: ~$0-0.50 per 5-second clip (free tier available)
- **Runway**: ~$0.05 per second
- **OpenAI TTS**: ~$0.015 per 1K characters
- **ElevenLabs**: ~$0.30 per 1K characters (better quality)
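
To show how these figures combine, here is a small worked estimate for a run like the 5-second "futuristic city" example. It assumes one DALL-E 3 image, Runway for video, and OpenAI TTS for narration, priced at the lower end of the ranges above. The function `estimate_cost` and the one-image-per-clip assumption are illustrative, not something the skill defines.

```python
# Per-unit prices in USD, taken from the lower end of the estimates above.
PRICE_PER_IMAGE_DALLE3 = 0.04
PRICE_PER_SECOND_RUNWAY = 0.05
PRICE_PER_1K_CHARS_OPENAI_TTS = 0.015


def estimate_cost(num_images, video_seconds, narration):
    """Estimate the cost of one run: images + video seconds + narration."""
    image_cost = num_images * PRICE_PER_IMAGE_DALLE3
    video_cost = video_seconds * PRICE_PER_SECOND_RUNWAY
    voice_cost = len(narration) / 1000 * PRICE_PER_1K_CHARS_OPENAI_TTS
    return image_cost + video_cost + voice_cost


total = estimate_cost(num_images=1, video_seconds=5, narration="Welcome to the future")
print(f"${total:.2f}")  # 0.04 (image) + 0.25 (video) + ~0.0003 (voice) -> $0.29
```

At these rates the video seconds dominate, which is why testing with a short clip first keeps iteration cheap.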

## Examples

See `examples/` folder for sample outputs and prompts.

Overview

This skill provides an end-to-end AI video generation pipeline that builds videos from text prompts, image sequences, and optional voice-over. It integrates image models (DALL·E, Stable Diffusion), video synthesis (LumaAI, Runway, Replicate), TTS options, and FFmpeg-based editing to produce finished MP4 output. It supports both budget (local/open-source) and quality (commercial APIs) workflows.

How this skill works

The skill first generates or accepts images for scenes via selected image models, then assembles those frames into motion clips using video synthesis models or image-to-video tools. Optional text-to-speech produces narration tracks, which are mixed and timed to scenes. Final composition, transitions, overlays, and encoding are handled with FFmpeg scripts to produce a single deliverable video.
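
The flow described above can be sketched as a simple stage planner. The `Job` class, the `plan_stages` function, and the stage names are all hypothetical labels for the steps in the paragraph; they are not the skill's actual module layout.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical description of one video job."""
    prompts: list = field(default_factory=list)   # scene prompts, if generating
    images: list = field(default_factory=list)    # existing image paths, if any
    narration: str = ""                           # optional voice-over text


def plan_stages(job):
    """Return the ordered stage names a job would pass through."""
    if not job.prompts and not job.images:
        raise ValueError("a job needs prompts or images")
    stages = []
    if not job.images:
        stages.append("generate_images")   # images come from the prompts
    stages.append("synthesize_clips")      # images -> motion clips
    if job.narration:
        stages.append("synthesize_voice")  # text -> narration track
        stages.append("mix_audio")         # align narration with the scenes
    stages.append("ffmpeg_assemble")       # transitions, overlays, encoding
    return stages


print(plan_stages(Job(prompts=["A sunset over mountains"])))
```

Supplying `images` skips generation, and leaving `narration` empty skips both audio stages, which mirrors the "generates or accepts images" and "optional text-to-speech" wording.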

When to use it

- Create short promotional or concept videos directly from text prompts.
- Prototype storyboards and animatics from image sequences.
- Add professional or quick TTS narration to existing footage.
- Combine multiple scene prompts into a cohesive multi-scene video.
- Produce rapid iterations using low-cost local models before upgrading to paid APIs.

Best practices

- Define scene-by-scene prompts with durations and orientation to avoid re-renders.
- Use local/open-source models for fast iteration and switch to paid APIs for final quality renders.
- Render voice-over separately and synchronize timing before final FFmpeg assembly.
- Keep asset resolution consistent across images to prevent scaling artifacts during encoding.
- Monitor API costs and test a short clip (1–3s) before committing to long renders.
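
One way to apply the first and fourth practices is to write the scene plan down as data and check it before spending any API credits. The sketch below is an assumption about how such a plan could look; the `Scene` class and `check_scenes` function are hypothetical and not part of the skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scene:
    prompt: str
    duration: float   # seconds
    width: int
    height: int

    @property
    def orientation(self):
        if self.width == self.height:
            return "square"
        return "landscape" if self.width > self.height else "portrait"


def check_scenes(scenes):
    """Return a list of problems; an empty list means the plan is consistent."""
    if not scenes:
        return ["no scenes defined"]
    problems = []
    sizes = {(s.width, s.height) for s in scenes}
    if len(sizes) > 1:
        problems.append(f"mixed resolutions: {sorted(sizes)}")
    for i, s in enumerate(scenes, start=1):
        if s.duration <= 0:
            problems.append(f"scene {i} has a non-positive duration")
    return problems


scenes = [
    Scene("Morning sunrise", 3, 1280, 720),
    Scene("Busy city street", 3, 1280, 720),
    Scene("Peaceful night", 3, 720, 1280),  # portrait by mistake
]
print(check_scenes(scenes))  # reports the mixed resolutions
```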

Example use cases

- Text-to-video concept: generate a 5s cinematic clip of a futuristic city with narration.
- Multi-scene short: assemble morning, daytime, and night scenes into a 30s social clip.
- Image sequence conversion: turn hand-drawn frames into a 24fps animation MP4.
- Voiceover addition: add OpenAI or ElevenLabs narration to a product demo video.
- Budget prototyping: use local Stable Diffusion + open-source video tools to iterate quickly.

FAQ

Which APIs are required to run the pipeline?

No single API is mandatory. You can run with local/open-source models for image and video. Paid APIs (DALL·E, LumaAI, Runway, ElevenLabs) are optional for higher quality.

How do I control final video length and pacing?

Specify per-scene duration or overall duration flags. The scripts map image frames and TTS timing to scene durations before FFmpeg assembly.
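
The exact mapping the scripts use is not documented here, but the arithmetic behind it can be sketched as follows. It assumes each scene gets `duration * fps` frames and that scenes play back to back; `scene_timeline` and `narration_fits` are hypothetical helpers.

```python
def scene_timeline(durations, fps):
    """Map scene durations (seconds) to start times and frame counts."""
    timeline = []
    start = 0.0
    for seconds in durations:
        timeline.append(
            {"start": start, "frames": round(seconds * fps), "seconds": seconds}
        )
        start += seconds
    return timeline


def narration_fits(narration_seconds, durations):
    """True if a narration track is no longer than the combined scenes."""
    return narration_seconds <= sum(durations)


timeline = scene_timeline([3, 3, 3], fps=24)
print(timeline[-1])                    # last scene starts at 6.0 s, 72 frames
print(narration_fits(7.5, [3, 3, 3]))  # True: 7.5 s fits inside 9 s
```

If the narration does not fit, you would lengthen a scene or trim the text before the final assembly step.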