home / skills / eddiebe147 / claude-settings / omni-vu

omni-vu skill

safe

/skills/omni-vu

This skill helps you see, understand, and automate macOS UI tasks by capturing the screen, analyzing UI state, and driving mouse and keyboard actions.

npx playbooks add skill eddiebe147/claude-settings --skill omni-vu

Review the files below or copy the command above to add this skill to your agents.

Files (1)

SKILL.md

8.9 KB

---
name: omni-vu
description: Screen capture, AI vision analysis, and GUI automation for macOS. Use when you need to see what's on screen, analyze UI state, detect changes, or automate mouse/keyboard actions.
---

# omni.vu - Visual Understanding & Automation

## Overview

omni.vu gives you eyes and hands on the user's macOS screen. Use it to:
- **See** what the user sees (screen capture)
- **Understand** UI state with AI vision
- **Detect** changes and wait for events
- **Act** with mouse and keyboard automation

## When to Use This Skill

### Proactively Use omni.vu When:
1. **Debugging UI issues** - "The button isn't working" → capture and analyze
2. **Verifying changes** - After modifying UI code, check if it rendered correctly
3. **Waiting for operations** - Build finishing, deployment completing, tests running
4. **Understanding context** - User describes something on screen you can't see
5. **Automating repetitive tasks** - Clicking through UI flows, filling forms
6. **Documentation** - Capturing screenshots of features

### Do NOT Use When:
- Reading/writing files (use Read/Write tools)
- Running terminal commands (use Bash)
- Making API calls (use appropriate tools)
- User hasn't granted screen recording permission

## Tool Reference

### Capture Tools

#### `vu` - Full Screen Capture
```
vu(monitor=0, save_to_history=True)
```
Captures the entire screen. Returns base64 image.

**Use when**: You need to see everything on screen.

#### `vu_window` - Window Capture
```
vu_window(window_id=None, title="VS Code", include_frame=False)
```
Captures a specific window by ID or title (partial match).

**Use when**: You only need one application's content.

**Workflow**:
1. Call `vu_list_windows` to see available windows
2. Find the window_id or use title matching
3. Call `vu_window` with that ID/title

#### `vu_region` - Region Capture
```
vu_region(x=100, y=100, width=500, height=300)
```
Captures a specific rectangle of the screen.

**Use when**: You need a precise area (error message, specific component).

#### `vu_list_windows` - List Windows
```
vu_list_windows(filter_app="Chrome")
```
Lists all visible windows with metadata.

**Returns**: List of `{window_id, title, owner, x, y, width, height}`

#### `vu_list_monitors` - List Displays
```
vu_list_monitors()
```
Lists connected displays with resolution and scale factor.

### Vision Tools

#### `vu_describe` - AI Vision Analysis
```
vu_describe(
    prompt="What errors are visible?",
    provider="claude",  # or openai, gemini, ollama
    max_tokens=1024,
    capture_first=True
)
```
Captures screen and analyzes with AI.

**Best prompts**:
- "Describe what you see on screen"
- "Are there any error messages visible?"
- "What is the state of the build/test output?"
- "Is the login form filled correctly?"
- "What color is the status indicator?"

**Providers**:
| Provider | Speed | Quality | Cost |
|----------|-------|---------|------|
| claude | Medium | Excellent | $$ |
| openai | Fast | Very Good | $$ |
| gemini | Fast | Good | $ |
| ollama | Slow | Varies | Free |

### Utility Tools

#### `vu_diff` - Change Detection
```
vu_diff(threshold=0.02, monitor=0)
```
Detects if screen changed since last capture.

**Returns**: `{changed: bool, diff_percentage: float}`

**Use for polling patterns**:
```python
# Wait for build to finish
while True:
    result = vu_diff(threshold=0.05)
    if result["changed"]:
        # Screen changed, check what happened
        analysis = vu_describe(prompt="Did the build succeed or fail?")
        break
    # Wait before checking again
```

#### `vu_history` - Capture History
```
vu_history(limit=10, capture_type="full_screen")
```
Gets recent capture metadata.

#### `vu_last` - Last Capture
```
vu_last()
```
Returns the most recent capture with image data.

**Use when**: You need to re-analyze without re-capturing.

#### `vu_status` - System Status
```
vu_status()
```
Returns system info: monitors, providers, safety settings.

### Automation Tools

#### `vu_click` - Mouse Click
```
vu_click(x=500, y=300, button="left", count=1)
```
Clicks at screen coordinates.

**Buttons**: `left`, `right`, `middle`
**Count**: 1 (single), 2 (double), 3 (triple)

**Safety**: Coordinates validated against screen bounds.

#### `vu_move` - Move Cursor
```
vu_move(x=500, y=300, duration_ms=0)
```
Moves cursor to position. Use `duration_ms` for animated movement.

#### `vu_drag` - Drag Operation
```
vu_drag(start_x=100, start_y=100, end_x=500, end_y=300, duration_ms=500)
```
Drags from start to end position.

**Safety**: Maximum 2000px drag distance.

#### `vu_scroll` - Scroll
```
vu_scroll(direction="down", amount=3, x=None, y=None)
```
Scrolls at current or specified position.

**Directions**: `up`, `down`, `left`, `right`
**Amount**: Lines to scroll (1-20)

#### `vu_type` - Type Text
```
vu_type(text="Hello World", delay_between_ms=0)
```
Types text character by character.

**Note**: Click on target field first!

#### `vu_hotkey` - Keyboard Shortcut
```
vu_hotkey(keys="cmd+s")
```
Executes keyboard shortcuts.

**Format**: `modifier+key` (e.g., `cmd+c`, `ctrl+shift+s`, `cmd+opt+i`)
**Modifiers**: `cmd`, `ctrl`, `alt`/`opt`, `shift`

**Blocked hotkeys** (for safety):
- `cmd+q` (quit)
- `cmd+shift+q` (logout)
- `cmd+opt+esc` (force quit)

## Common Workflows

### 1. Debug UI Issue
```
User: "The submit button doesn't do anything when I click it"

1. vu_describe(prompt="Describe the form and submit button state")
2. Analyze: Is button disabled? Is there validation error?
3. If needed: vu_window(title="Chrome") for focused capture
4. Check console: vu_describe(prompt="Are there any errors in the developer console?")
```

### 2. Verify Build/Deploy
```
User: "Run the build and let me know when it's done"

1. Run: npm run build (via Bash)
2. Loop: vu_diff(threshold=0.05)
3. When changed: vu_describe(prompt="What is the build status? Did it succeed or fail?")
4. Report result
```

### 3. Fill a Form
```
User: "Fill out the registration form with test data"

1. vu_describe(prompt="What form fields are visible?")
2. vu_click(x=field_x, y=field_y)  # Click first field
3. vu_type(text="[email protected]")
4. vu_hotkey(keys="tab")  # Move to next field
5. vu_type(text="Test User")
6. Continue for each field...
7. vu_click(x=submit_x, y=submit_y)  # Submit
8. vu_describe(prompt="Was the form submitted successfully?")
```

### 4. Navigate UI
```
User: "Open the settings panel in VS Code"

1. vu_list_windows(filter_app="Code")
2. vu_window(title="Code")  # Capture VS Code
3. vu_hotkey(keys="cmd+,")  # Open settings
4. vu_diff()  # Wait for settings to open
5. vu_describe(prompt="Are the VS Code settings now visible?")
```

### 5. Monitor for Changes
```
User: "Watch the dashboard and tell me when the status changes"

1. vu()  # Initial capture
2. Loop every 5 seconds:
   - vu_diff(threshold=0.02)
   - If changed: vu_describe(prompt="What changed on the dashboard?")
3. Report changes to user
```

## Best Practices

### 1. Capture Before Acting
Always capture and analyze before clicking/typing:
```
❌ Bad: vu_click(x=500, y=300)  # Hope it's the right spot

✅ Good:
1. vu_describe(prompt="Where is the submit button?")
2. Parse coordinates from description
3. vu_click(x=parsed_x, y=parsed_y)
```

### 2. Use Appropriate Capture Scope
```
Full screen → General context, multi-app workflows
Window → Single app focus, cleaner output
Region → Specific element, faster/smaller
```

### 3. Specific Vision Prompts
```
❌ Vague: "What do you see?"

✅ Specific:
- "Is there an error message visible? If so, what does it say?"
- "What is the status of the test runner? Passing, failing, or running?"
- "List all form fields visible and their current values"
```

### 4. Rate Limit Automation
Don't spam clicks. Wait between actions:
```
vu_click(x=100, y=100)
# Wait for UI response
vu_diff(threshold=0.01)
vu_click(x=200, y=200)
```

### 5. Verify After Actions
Always verify automation succeeded:
```
vu_type(text="[email protected]")
vu_describe(prompt="What text is now in the email field?")
```

## Troubleshooting

### "Permission denied" or blank captures
→ Grant Screen Recording permission: System Settings > Privacy > Screen Recording

### Automation not working
→ Grant Accessibility permission: System Settings > Privacy > Accessibility

### Vision analysis failing
→ Check API keys in environment variables
→ Try different provider: `vu_describe(provider="openai")`

### Coordinates seem off
→ Retina display? Coordinates are in logical points, not pixels
→ Multi-monitor? Check `vu_list_monitors()` for display arrangement

### Hotkey blocked
→ Some dangerous hotkeys (cmd+q) are blocked for safety
→ Unblock in config if absolutely necessary

## Configuration

Environment variables:
```bash
OMNI_VU_ANTHROPIC_API_KEY=sk-ant-...    # For Claude vision
OMNI_VU_OPENAI_API_KEY=sk-...           # For GPT-4 vision
OMNI_VU_DEFAULT_PROVIDER=claude          # Default AI provider
OMNI_VU_SAFETY_LEVEL=medium              # low/medium/high
```

History stored at: `~/.omni.vu/captures/`

Overview

This skill provides screen capture, AI vision analysis, and GUI automation for macOS to inspect UI state, detect changes, and drive mouse/keyboard actions. It gives an agent the ability to see the user’s screen, understand visible UI elements with AI, wait for events, and perform safe automated interactions. Use it when you must diagnose UI problems, verify renders, or automate repetitive on-screen tasks.

How this skill works

The skill captures full screen, windows, or precise regions and returns image data for AI analysis. Vision calls annotate screen content, answer focused prompts, or extract coordinates; change-detection tools compare captures to detect updates. Automation routines convert coordinates and actions into safe mouse moves, clicks, drags, scrolls, and typed input while enforcing permission and safety checks.

When to use it

Debugging UI behavior that requires visual inspection (broken button, missing element)
Verifying build, deploy, or rendering results by watching logs or screens
Waiting for or polling UI-driven events where no API exists
Automating repetitive GUI tasks like filling forms or navigating menus
Capturing screenshots or sequences for documentation or bug reports

Best practices

Capture and analyze before acting — detect target elements and parse coordinates first
Choose the narrowest capture scope possible: region for single elements, window for one app, full screen for context
Use specific vision prompts (errors, status, form field values) rather than vague requests
Rate-limit clicks and inputs; insert diffs or delays to confirm UI responses
Always verify outcome after automation with a follow-up capture/vision call

Example use cases

Reproduce a user-reported UI bug: capture form state, check for disabled controls, inspect console area
Monitor a long-running build by polling screen diffs and asking AI whether the build succeeded
Automate filling and submitting a registration form by locating fields, typing values, and verifying submission
Open and navigate VS Code settings by listing windows, sending hotkeys, and confirming with vision analysis
Watch a dashboard for status changes and produce a description of what changed whenever a diff is detected

FAQ

What permissions are required to run this skill?

Screen Recording and Accessibility permissions must be granted in macOS System Settings for captures and automation to work.

Which AI provider should I choose for vision analysis?

Providers trade cost, speed, and quality; try Claude for high-quality descriptions, OpenAI for speed, Gemini for lower cost, or Ollama for a free/local option.