home / skills / openclaw / skills / computer-use

computer-use skill

/skills/ram-raghav-s/computer-use

This skill enables headless desktop control on Linux servers by automating an XFCE/Xvfb environment with noVNC and xdotool.

npx playbooks add skill openclaw/skills --skill computer-use

Review the files below or copy the command above to add this skill to your agents.

Files (19)
SKILL.md
6.1 KB
---
name: computer-use
description: Full desktop computer use for headless Linux servers. Xvfb + XFCE virtual desktop with xdotool automation. 17 actions (click, type, scroll, screenshot, drag, etc). Unlike OpenClaw's browser tool, operates at the X11 level so websites cannot detect automation. Includes VNC for live viewing.
version: 1.2.1
---

# Computer Use Skill

Full desktop GUI control for headless Linux servers. Creates a virtual display (Xvfb + XFCE) so you can run and control desktop applications on VPS/cloud instances without a physical monitor.

## Environment

- **Display**: `:99`
- **Resolution**: 1024x768 (XGA, Anthropic recommended)
- **Desktop**: XFCE4 (minimal — xfwm4 + panel only)

## Quick Setup

Run the setup script to install everything (systemd services, flicker-free VNC):

```bash
./scripts/setup-vnc.sh
```

This installs:
- Xvfb virtual display on `:99`
- Minimal XFCE desktop (xfwm4 + panel, no xfdesktop)
- x11vnc with stability flags
- noVNC for browser access

All services auto-start on boot and auto-restart on crash.

## Actions Reference

| Action | Script | Arguments | Description |
|--------|--------|-----------|-------------|
| screenshot | `screenshot.sh` | — | Capture screen → base64 PNG |
| cursor_position | `cursor_position.sh` | — | Get current mouse X,Y |
| mouse_move | `mouse_move.sh` | x y | Move mouse to coordinates |
| left_click | `click.sh` | x y left | Left click at coordinates |
| right_click | `click.sh` | x y right | Right click |
| middle_click | `click.sh` | x y middle | Middle click |
| double_click | `click.sh` | x y double | Double click |
| triple_click | `click.sh` | x y triple | Triple click (select line) |
| left_click_drag | `drag.sh` | x1 y1 x2 y2 | Drag from start to end |
| left_mouse_down | `mouse_down.sh` | — | Press mouse button |
| left_mouse_up | `mouse_up.sh` | — | Release mouse button |
| type | `type_text.sh` | "text" | Type text (50 char chunks, 12ms delay) |
| key | `key.sh` | "combo" | Press key (Return, ctrl+c, alt+F4) |
| hold_key | `hold_key.sh` | "key" secs | Hold key for duration |
| scroll | `scroll.sh` | dir amt [x y] | Scroll up/down/left/right |
| wait | `wait.sh` | seconds | Wait then screenshot |
| zoom | `zoom.sh` | x1 y1 x2 y2 | Cropped region screenshot |

## Usage Examples

```bash
export DISPLAY=:99

# Take screenshot
./scripts/screenshot.sh

# Click at coordinates
./scripts/click.sh 512 384 left

# Type text
./scripts/type_text.sh "Hello world"

# Press key combo
./scripts/key.sh "ctrl+s"

# Scroll down
./scripts/scroll.sh down 5
```

## Workflow Pattern

1. **Screenshot** — Always start by seeing the screen
2. **Analyze** — Identify UI elements and coordinates
3. **Act** — Click, type, scroll
4. **Screenshot** — Verify result
5. **Repeat**

## Tips

- Screen is 1024x768, origin (0,0) at top-left
- Click to focus before typing in text fields
- Use `ctrl+End` to jump to page bottom in browsers
- Most actions auto-screenshot after 2 sec delay
- Long text is chunked (50 chars) with 12ms keystroke delay

## Live Desktop Viewing (VNC)

Watch the desktop in real-time via browser or VNC client.

### Connect via Browser

```bash
# SSH tunnel (run on your local machine)
ssh -L 6080:localhost:6080 your-server

# Open in browser
http://localhost:6080/vnc.html
```

### Connect via VNC Client

```bash
# SSH tunnel
ssh -L 5900:localhost:5900 your-server

# Connect VNC client to localhost:5900
```

### SSH Config (recommended)

Add to `~/.ssh/config` for automatic tunneling:

```
Host your-server
  HostName your.server.ip
  User your-user
  LocalForward 6080 127.0.0.1:6080
  LocalForward 5900 127.0.0.1:5900
```

Then just `ssh your-server` and VNC is available.

## System Services

```bash
# Check status
systemctl status xvfb xfce-minimal x11vnc novnc

# Restart if needed
sudo systemctl restart xvfb xfce-minimal x11vnc novnc
```

### Service Chain

```
xvfb → xfce-minimal → x11vnc → novnc
```

- **xvfb**: Virtual display :99 (1024x768x24)
- **xfce-minimal**: Watchdog that runs xfwm4+panel, kills xfdesktop
- **x11vnc**: VNC server with `-noxdamage` for stability
- **novnc**: WebSocket proxy with heartbeat for connection stability

## Opening Applications

```bash
export DISPLAY=:99

# Chrome — only use --no-sandbox if the kernel lacks user namespace support.
# Check: cat /proc/sys/kernel/unprivileged_userns_clone
#   1 = sandbox works, do NOT use --no-sandbox
#   0 = sandbox fails, --no-sandbox required as fallback
# Using --no-sandbox when unnecessary causes instability and crashes.
if [ "$(cat /proc/sys/kernel/unprivileged_userns_clone 2>/dev/null)" = "0" ]; then
    google-chrome --no-sandbox &
else
    google-chrome &
fi

xfce4-terminal &                # Terminal
thunar &                        # File manager
```

**Note**: Snap browsers (Firefox, Chromium) have sandbox issues on headless servers. Use Chrome `.deb` instead:

```bash
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome-stable_current_amd64.deb
sudo apt-get install -f
```

## Manual Setup

If you prefer manual setup instead of `setup-vnc.sh`:

```bash
# Install packages
sudo apt install -y xvfb xfce4 xfce4-terminal xdotool scrot imagemagick dbus-x11 x11vnc novnc websockify

# Run the setup script (generates systemd services, masks xfdesktop, starts everything)
./scripts/setup-vnc.sh
```

If you prefer fully manual setup, the `setup-vnc.sh` script generates all systemd service files inline -- read it for the exact service definitions.

## Troubleshooting

### VNC shows black screen
- Check if xfwm4 is running: `pgrep xfwm4`
- Restart desktop: `sudo systemctl restart xfce-minimal`

### VNC flickering/flashing
- Ensure xfdesktop is masked (check `/usr/bin/xfdesktop`)
- xfdesktop causes flicker due to clear→draw cycles on Xvfb

### VNC disconnects frequently
- Check noVNC has `--heartbeat 30` flag
- Check x11vnc has `-noxdamage` flag

### x11vnc crashes (SIGSEGV)
- Add `-noxdamage -noxfixes` flags
- The DAMAGE extension causes crashes on Xvfb

## Requirements

Installed by `setup-vnc.sh`:
```bash
xvfb xfce4 xfce4-terminal xdotool scrot imagemagick dbus-x11 x11vnc novnc websockify
```

Overview

This skill provides full desktop GUI control for headless Linux servers by creating a virtual X11 display (Xvfb) running a minimal XFCE desktop. It exposes 17 scriptable actions (click, type, scroll, screenshot, drag, key hold, etc.) and includes VNC and noVNC for live viewing and debugging. Use it to automate desktop apps and browsers at the X11 level so automation is not detectable by websites.

How this skill works

The skill starts an Xvfb virtual display on :99 with a lightweight XFCE session and runs x11vnc + noVNC for remote viewing. Actions are implemented as small shell scripts that call xdotool, scrot and related tools to move the mouse, click, type text in chunks, press keys, scroll, drag, and capture screenshots. Most actions optionally return base64 PNG screenshots and follow a screenshot→analyze→act→verify workflow.

When to use it

  • Automating GUI-only applications on headless VPS/cloud instances
  • Running browser automation where webpage bot-detection must not see browser-level automation hooks
  • Testing GUI workflows that require a real desktop environment (menus, dialogs, file pickers)
  • Remote troubleshooting and live demos via VNC/noVNC
  • Batch interactions with legacy desktop apps that lack CLI or API controls

Best practices

  • Always export DISPLAY=:99 before invoking scripts
  • Start with a screenshot to determine coordinates, then act and re-screenshot to verify
  • Click to focus input fields before typing; long text is split into 50-char chunks with small delays
  • Use SSH tunnels for secure noVNC and VNC access (forward 6080 and 5900)
  • Prefer Chrome .deb on headless servers; avoid snap builds due to sandbox issues
  • Monitor systemd services (xvfb, xfce-minimal, x11vnc, novnc) and restart if needed

Example use cases

  • Automate form-filling and file uploads in a browser running on a headless server without browser automation flags showing
  • Take periodic full-desktop screenshots of a GUI process for monitoring or archival
  • Drive legacy desktop-only applications to perform batch conversion or export tasks
  • Interact with desktop installers or configuration dialogs during automated server provisioning
  • Provide a live desktop for remote debugging and manual intervention via VNC/noVNC

FAQ

How do I view the desktop remotely?

SSH-tunnel ports 6080 and/or 5900 to the server then open http://localhost:6080/vnc.html or connect a VNC client to localhost:5900.

Why use this instead of browser-level automation tools?

This operates at the X11 level so websites cannot detect automation signals injected by browser automation APIs; it also supports non-browser desktop apps.