gemini-computer-use skill

This skill automates web browser tasks using Gemini Computer Use with Playwright, providing an agent loop and optional safety confirmations.

npx playbooks add skill openclaw/skills --skill gemini-computer-use

SKILL.md
---
name: gemini-computer-use
description: Build and run Gemini 2.5 Computer Use browser-control agents with Playwright. Use when a user wants to automate web browser tasks via the Gemini Computer Use model, needs an agent loop (screenshot → function_call → action → function_response), or asks to integrate safety confirmation for risky UI actions.
---

# Gemini Computer Use

## Quick start

1. Copy the env template, set your API key in it, and source it:

   ```bash
   cp env.example env.sh
   $EDITOR env.sh
   source env.sh
   ```

2. Create a virtual environment and install dependencies:

   ```bash
   python -m venv .venv
   source .venv/bin/activate
   pip install google-genai playwright
   playwright install chromium
   ```

3. Run the agent script with a prompt:

   ```bash
   python scripts/computer_use_agent.py \
     --prompt "Find the latest blog post title on example.com" \
     --start-url "https://example.com" \
     --turn-limit 6
   ```

## Browser selection

- Default: Playwright's bundled Chromium (no env vars required).
- Choose a channel (Chrome/Edge) with `COMPUTER_USE_BROWSER_CHANNEL`.
- Use a custom Chromium-based executable (e.g., Brave) with `COMPUTER_USE_BROWSER_EXECUTABLE`.

If both are set, `COMPUTER_USE_BROWSER_EXECUTABLE` takes precedence.
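
The precedence rule above can be sketched as a small helper that turns the two environment variables into keyword arguments for Playwright's `chromium.launch()`. This is a minimal sketch, not the script's actual code: the function name `browser_launch_kwargs` is hypothetical, and only the variable names and the precedence come from this document.

```python
import os


def browser_launch_kwargs(env=None):
    """Build keyword arguments for Playwright's chromium.launch().

    Hypothetical helper: the real script may resolve these differently.
    """
    env = os.environ if env is None else env
    executable = env.get("COMPUTER_USE_BROWSER_EXECUTABLE", "").strip()
    channel = env.get("COMPUTER_USE_BROWSER_CHANNEL", "").strip()
    if executable:
        # A custom executable takes precedence when both variables are set.
        return {"executable_path": executable}
    if channel:
        # For example "chrome" or "msedge".
        return {"channel": channel}
    # No overrides: Playwright falls back to its bundled Chromium.
    return {}
```

With Playwright installed, the result could be passed straight through, for example `playwright.chromium.launch(**browser_launch_kwargs())`.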

## Core workflow (agent loop)

1. Capture a screenshot and send the user goal + screenshot to the model.
2. Parse `function_call` actions in the response.
3. For each action, check its `safety_decision`; if it is `require_confirmation`, prompt the user before executing.
4. Execute each approved action in Playwright.
5. Send `function_response` objects containing the latest URL + screenshot.
6. Repeat until the model returns only text (no actions) or you hit the turn limit.
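
The steps above can be sketched as a loop with the model and the browser injected as callables, which keeps the control flow visible without depending on `google-genai` or Playwright. Everything named here (`Action`, `run_agent_loop`, and the four callables) is hypothetical and not taken from `scripts/computer_use_agent.py`. Reporting a declined action back as skipped is also an assumption; the real script may stop instead.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Action:
    """Hypothetical stand-in for one function_call returned by the model."""
    name: str
    args: dict = field(default_factory=dict)
    safety_decision: Optional[str] = None


def run_agent_loop(ask_model, execute, observe, confirm, turn_limit=6):
    """Loop until the model returns no actions or the turn limit is hit.

    ask_model(observation, responses) -> (text, list of Action)
    execute(action)  -> performs the action in the browser
    observe()        -> dict such as {"url": ..., "screenshot": ...}
    confirm(action)  -> True if the user approves a risky action
    """
    responses = []
    text = ""
    for _ in range(turn_limit):
        text, actions = ask_model(observe(), responses)
        if not actions:
            return text  # text-only reply: the model considers the task done
        responses = []
        for action in actions:
            risky = action.safety_decision == "require_confirmation"
            if risky and not confirm(action):
                # Assumption: tell the model the action was skipped.
                responses.append({"name": action.name, "skipped": True, **observe()})
                continue
            execute(action)
            # Each function_response carries the latest URL and screenshot.
            responses.append({"name": action.name, **observe()})
    return text
```

Injecting `confirm` keeps the safety prompt separate from the loop, so it can be an interactive prompt during manual runs and a fixed policy in tests.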

## Operational guidance

- Run in a sandboxed browser profile or container.
- Use `--exclude` to block risky actions you do not want the model to take.
- Keep the viewport at 1440x900 unless you have a reason to change it.
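
One reason the viewport matters is that click coordinates have to be mapped onto actual pixels. The sketch below assumes the model reports coordinates on a normalized 0-999 grid, as Google's Computer Use documentation describes; treat that as an assumption and check `references/google-computer-use.md` before relying on it. The helper name `denormalize` is hypothetical.

```python
VIEWPORT_WIDTH = 1440
VIEWPORT_HEIGHT = 900


def denormalize(x, y, width=VIEWPORT_WIDTH, height=VIEWPORT_HEIGHT):
    """Map normalized coordinates (assumed 0-999) to viewport pixels."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return x * width // 1000, y * height // 1000
```

If you change the viewport, pass the new width and height so that clicks still land where the model intended.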

## Resources

- Script: `scripts/computer_use_agent.py`
- Reference notes: `references/google-computer-use.md`
- Env template: `env.example`

Overview

This skill builds and runs Gemini 2.5 Computer Use browser-control agents using Playwright to automate web tasks. It orchestrates a loop of screenshots, model function calls, browser actions, and function responses. Use it to prototype safe, interactive browser automation driven by the Gemini Computer Use model.

How this skill works

The agent captures a screenshot and sends the user goal plus the image to the model. The model returns function_call actions, which the script parses and executes in Playwright. If the model marks an action with safety_decision=require_confirmation, the agent can pause for explicit user confirmation before proceeding. After each action, the agent returns a function_response with the current URL and a fresh screenshot, then repeats until the model returns no actions or the turn limit is reached.

When to use it

  • Automating multi-step web interactions where visual context matters (screenshots).
  • Building agents that must confirm high-risk UI actions before execution.
  • Prototyping browser workflows driven by Gemini Computer Use model function calls.
  • Creating repeatable scraping, form-filling, or navigation tasks with human oversight.
  • Testing model-driven UI automation behaviors in a sandboxed environment.

Best practices

  • Run the browser in a sandboxed profile or container to limit exposure.
  • Keep the viewport at 1440x900 unless a different layout is required.
  • Block or exclude risky actions via configuration to prevent undesired operations.
  • Prefer the bundled Playwright Chromium by default; set a browser channel or executable only when needed.
  • Set a conservative turn limit to avoid runaway loops and monitor logs closely.

Example use cases

  • Automatically find and report the latest blog post title from a site, returning a screenshot for verification.
  • Navigate a multi-page form, request user confirmation before final submission, and return the filled form screenshot.
  • Collect visible product details across category pages while avoiding checkout or account changes.
  • Audit a site visually by following the model-suggested navigation and capturing step-by-step screenshots.

FAQ

Which browser will the agent use by default?

By default the agent uses Playwright's bundled Chromium; you can override with a browser channel or a custom Chromium-based executable.

How does the agent handle dangerous actions?

If the model marks an action as require_confirmation, the agent can pause and ask for explicit user approval before executing that action.
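
A confirmation prompt of this kind could look like the sketch below. It is an illustration only: `confirm_action` and its parameters are hypothetical names, and the script's actual prompt may differ. Defaulting to "no" means an empty or unclear answer never runs the action.

```python
def confirm_action(name, explanation="", ask=input):
    """Ask the user to approve a risky action; True only on an explicit yes."""
    prompt = f"Model wants to run '{name}'"
    if explanation:
        prompt += f" ({explanation})"
    # `ask` defaults to input() and can be swapped out for tests.
    answer = ask(prompt + ". Proceed? [y/N] ")
    return answer.strip().lower() in {"y", "yes"}
```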