home / skills / benchflow-ai / skillsbench / extract-moves-from-video

This skill guides extracting text commands from video recordings using OCR, helping you reliably capture terminal inputs and game commands.

npx playbooks add skill benchflow-ai/skillsbench --skill extract-moves-from-video

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
7.7 KB
---
name: extract-moves-from-video
description: This skill provides guidance for extracting text commands, moves, or typed input from video recordings using OCR. It applies when extracting gameplay commands (e.g., Zork), terminal sessions, or any text-based interactions captured in video format. Use this skill when processing videos of text-based games, terminal recordings, or any scenario requiring OCR-based command extraction from screen recordings.
---

# Extract Moves From Video

## Overview

This skill provides a systematic approach for extracting text commands from video recordings. Common use cases include extracting gameplay commands from text adventure games (like Zork), capturing terminal commands from screen recordings, or transcribing any typed input visible in video content.

## Workflow

### Step 1: Analyze the Source Video

Before processing, understand the video characteristics:

1. **Determine video properties**: Resolution, duration, frame rate
2. **Identify text regions**: Where commands appear on screen (e.g., after a prompt character like `>`)
3. **Assess text style**: Font type, color, background contrast (terminal text on dark backgrounds requires specific handling)
4. **Check for audio**: Determine if audio transcription could supplement OCR (verify audio contains relevant content before installing large packages like Whisper)
5. **Understand typing patterns**: Estimate how frequently new commands appear to inform frame sampling rate

### Step 2: Download and Prepare Video

1. **Download video** using appropriate tools (`yt-dlp`, `youtube-dl`, or direct download)
2. **Verify download integrity** before proceeding
3. **Extract video metadata** to confirm properties match expectations

### Step 3: Extract Frames Strategically

Frame extraction requires balancing coverage against processing time:

1. **Analyze command frequency first**: Manually review a sample of the video to understand how often new commands appear
2. **Choose appropriate sampling rate**:
   - Fast typing: 0.5-1 second intervals
   - Slow typing: 2-3 second intervals
   - When uncertain, extract at higher frequency and subsample later (avoids re-extraction)
3. **Use FFmpeg for extraction**:
   ```bash
   ffmpeg -i video.mp4 -vf "fps=1" frames/frame_%04d.png
   ```
4. **Focus on relevant screen regions**: If commands appear in a specific area, crop frames to that region to improve OCR accuracy

### Step 4: Optimize OCR Configuration

OCR accuracy depends heavily on proper configuration for the specific video type:

1. **Test on sample frames first**: Before processing all frames, tune OCR settings on 5-10 representative frames
2. **Configure Tesseract page segmentation mode (`--psm`)**:
   - `--psm 6`: Assume uniform block of text
   - `--psm 7`: Single text line
   - `--psm 13`: Raw line (treat as single line, no analysis)
3. **Preprocess images for better OCR**:
   - **Binarization**: Convert to black/white with appropriate threshold
   - **Invert colors** if text is light on dark background
   - **Increase contrast** for low-contrast videos
   - **Scale up** small text (2x-3x enlargement often helps)
4. **Test multiple threshold values**: Common values (127, 150, 180) work differently depending on video; empirically test which produces best results

Example preprocessing with Python/OpenCV:
```python
import cv2
img = cv2.imread('frame.png', cv2.IMREAD_GRAYSCALE)
# Invert if light text on dark background
img = cv2.bitwise_not(img)
# Binarize with tested threshold
_, img = cv2.threshold(img, 150, 255, cv2.THRESH_BINARY)
# Scale up for better OCR
img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
```

### Step 5: Extract and Parse Commands

1. **Run OCR on preprocessed frames**: Use Python bindings (`pytesseract`) for efficiency over subprocess calls
2. **Identify command patterns**: Look for prompt markers (e.g., `>`, `$`, `>>>`) that precede commands
3. **Handle OCR output carefully**:
   - Do not assume commands start at line beginning (OCR introduces whitespace)
   - Account for partial prompt character recognition (e.g., `>` may become `›` or `»`)
4. **Use flexible pattern matching**:
   ```python
   # More robust than grep "^>"
   import re
   command_pattern = re.compile(r'[>›»]\s*(.+)')
   ```

### Step 6: Clean and Deduplicate Results

**Critical**: Understand the data domain before cleaning:

1. **Preserve legitimate duplicates**: In many contexts (games, shell sessions), the same command can appear multiple times intentionally
2. **Use temporal deduplication**: Only remove duplicates from consecutive frames showing the same command, not all duplicates globally
3. **Handle partial commands**: Commands being typed appear partially; only capture complete commands
4. **Validate corrections**: When fixing OCR errors, verify corrections are contextually appropriate

Temporal deduplication approach:
```python
def temporal_dedupe(commands):
    """Remove only consecutive duplicates, preserving repeated commands."""
    result = []
    prev = None
    for cmd in commands:
        if cmd != prev:
            result.append(cmd)
            prev = cmd
    return result
```

### Step 7: Verify Results

Verification is essential for accuracy:

1. **Sample verification**: Manually compare extracted commands against source frames for a random sample
2. **Domain validation**: If extracting game commands, verify they are valid commands for that game
3. **Sequence logic check**: Verify the command sequence makes logical sense (e.g., movement commands follow plausible paths)
4. **Count verification**: Compare total extracted commands against expected count based on video length and typing speed

## Common Pitfalls

### OCR Quality Issues
- **Mistake**: Using default OCR settings without optimization
- **Solution**: Always tune `--psm` mode and image preprocessing on sample frames first

### Incorrect Deduplication
- **Mistake**: Using global deduplication (e.g., `awk '!seen[$0]++'`) which removes all repeated commands
- **Solution**: Use temporal deduplication that only removes consecutive duplicates

### Prompt Detection Failures
- **Mistake**: Using rigid patterns like `grep "^>"` that assume specific formatting
- **Solution**: Use flexible regex that accounts for OCR variations and whitespace

### Wasted Tool Installation
- **Mistake**: Installing large packages (Whisper for audio) without verifying they're needed
- **Solution**: Check if audio contains useful content before installing audio processing tools

### No Intermediate Checkpointing
- **Mistake**: Processing all frames without saving intermediate results, losing progress on timeouts
- **Solution**: Save results after each processing stage; implement progress checkpoints

### Abandoned Verification
- **Mistake**: Not validating extracted commands against source material
- **Solution**: Always verify a sample of extractions and validate overall sequence logic

## Verification Checklist

Before finalizing extracted commands:

- [ ] Sample of extracted commands verified against source frames
- [ ] Command count is reasonable for video duration
- [ ] No obvious OCR artifacts remain (random characters, split words)
- [ ] Legitimate repeated commands are preserved (not incorrectly deduplicated)
- [ ] Command sequence follows logical order
- [ ] Domain-specific validation performed (e.g., commands are valid for the game/application)

## Tool Selection Guide

| Task | Recommended Tool | Notes |
|------|-----------------|-------|
| Video download | `yt-dlp` | More maintained than `youtube-dl` |
| Frame extraction | `ffmpeg` | Industry standard, reliable |
| OCR | `tesseract` via `pytesseract` | Use Python bindings for efficiency |
| Image preprocessing | OpenCV (`cv2`) | Flexible, well-documented |
| Pattern matching | Python `re` module | More flexible than grep |

Overview

This skill provides a practical workflow for extracting typed commands and text moves from video recordings using OCR. It focuses on videos of text-based games, terminal sessions, and any screen recordings where commands appear visually. The guidance covers video analysis, frame extraction, OCR tuning, parsing, cleaning, and verification to produce accurate command sequences.

How this skill works

The skill inspects video properties, identifies where commands appear on screen, and extracts frames at a sampling rate matched to typing speed. Frames are preprocessed (crop, binarize, scale, invert) to maximize OCR accuracy, then processed with Tesseract (via pytesseract). Extracted text is parsed with flexible regex for prompt markers, temporally deduplicated, and verified against source frames and domain logic.

When to use it

  • Extracting commands from text-adventure gameplay recordings (e.g., Zork).
  • Transcribing terminal sessions or shell command videos.
  • Recovering typed inputs from screen recordings for documentation or debugging.
  • Processing videos where audio is absent or unreliable for command capture.

Best practices

  • Analyze a short sample of the video first to determine text region, font, and typing frequency.
  • Crop to the region containing commands to reduce OCR noise and speed processing.
  • Preprocess images: invert if needed, binarize with tested thresholds, and scale up small text before OCR.
  • Tune Tesseract page segmentation mode (psm 6/7/13) on representative frames.
  • Apply temporal deduplication to remove consecutive duplicates but preserve intentional repeats.
  • Save intermediate outputs and checkpoints to avoid reprocessing on timeouts or errors.

Example use cases

  • Extracting a sequence of player commands from a recorded text-adventure playthrough for replay or analysis.
  • Transcribing commands from an online coding demo or terminal screencast to create reproducible scripts.
  • Recovering typed entries from archival videos where source logs are missing.
  • Generating step-by-step documentation from recorded terminal walkthroughs.

FAQ

How often should I extract frames?

Match sampling to typing speed: 0.5–1s for fast typing, 2–3s for slow typing; when unsure, sample more frequently and subsample later.

Which OCR settings work best for terminal text?

Test psm modes 6, 7, and 13 on sample frames. Preprocess by inverting light-on-dark text, binarizing with empirically chosen thresholds, and scaling up small text.