gemini-computer-use skill

This skill helps you automate browser tasks using Gemini Computer Use with Playwright, enabling goal-driven actions, safety prompts, and action loops.

This is most likely a fork of the gemini-computer-use skill from openclaw.
npx playbooks add skill am-will/codex-skills --skill gemini-computer-use

Run the command above to add this skill to your agents.

SKILL.md
---
name: gemini-computer-use
description: Build and run Gemini 2.5 Computer Use browser-control agents with Playwright. Use when a user wants to automate web browser tasks via the Gemini Computer Use model, needs an agent loop (screenshot → function_call → action → function_response), or asks to integrate safety confirmation for risky UI actions.
---

# Gemini Computer Use

## Quick start

1. Copy the env template to `env.sh`, set your API key in it, then source it:

   ```bash
   cp env.example env.sh
   $EDITOR env.sh
   source env.sh
   ```

2. Create a virtual environment and install dependencies:

   ```bash
   python -m venv .venv
   source .venv/bin/activate
   pip install google-genai playwright
   playwright install chromium
   ```

3. Run the agent script with a prompt:

   ```bash
   python scripts/computer_use_agent.py \
     --prompt "Find the latest blog post title on example.com" \
     --start-url "https://example.com" \
     --turn-limit 6
   ```

## Browser selection

- Default: Playwright's bundled Chromium (no env vars required).
- Choose a channel (Chrome/Edge) with `COMPUTER_USE_BROWSER_CHANNEL`.
- Use a custom Chromium-based executable (e.g., Brave) with `COMPUTER_USE_BROWSER_EXECUTABLE`.

If both are set, `COMPUTER_USE_BROWSER_EXECUTABLE` takes precedence.
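
The precedence rule above can be sketched as a small helper that turns the two environment variables into keyword arguments for Playwright's `chromium.launch()`. This is an illustrative sketch, not the code in `scripts/computer_use_agent.py`; the function name `resolve_launch_options` is hypothetical, while `executable_path` and `channel` are existing Playwright launch options.

```python
import os
from typing import Mapping, Optional


def resolve_launch_options(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build kwargs for chromium.launch() from the two env vars (hypothetical helper)."""
    env = os.environ if env is None else env
    executable = env.get("COMPUTER_USE_BROWSER_EXECUTABLE", "").strip()
    channel = env.get("COMPUTER_USE_BROWSER_CHANNEL", "").strip()
    if executable:
        # A custom executable wins when both variables are set.
        return {"executable_path": executable}
    if channel:
        # e.g. "chrome" or "msedge"
        return {"channel": channel}
    # Neither variable set: fall back to Playwright's bundled Chromium.
    return {}
```

With Playwright installed, the result would be passed along as `playwright.chromium.launch(**resolve_launch_options())`.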

## Core workflow (agent loop)

1. Capture a screenshot and send the user goal + screenshot to the model.
2. Parse `function_call` actions in the response.
3. If an action carries a `safety_decision` of `require_confirmation`, prompt the user before executing it.
4. Execute each approved action in Playwright.
5. Send `function_response` objects containing the latest URL + screenshot.
6. Repeat until the model returns only text (no actions) or you hit the turn limit.
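
The control flow of this loop can be sketched without any model or browser attached. In the sketch below, `Action`, `ModelTurn`, and `run_agent_loop` are hypothetical names, and the model call, Playwright execution, screenshot capture, and user prompt are all passed in as callables, so it shows only the loop structure and is not the implementation in `scripts/computer_use_agent.py`. Stopping the run when the user declines is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Action:
    """One parsed function_call from the model (hypothetical container)."""
    name: str
    args: dict = field(default_factory=dict)
    safety_decision: Optional[str] = None


@dataclass
class ModelTurn:
    """One model response: free text plus zero or more actions."""
    text: str = ""
    actions: List[Action] = field(default_factory=list)


def run_agent_loop(
    goal: str,
    ask_model: Callable[[str, dict, list], ModelTurn],
    execute: Callable[[Action], None],
    observe: Callable[[], dict],
    confirm: Callable[[Action], bool],
    turn_limit: int = 6,
) -> str:
    observation = observe()   # initial URL + screenshot
    responses: list = []      # function_response payloads for the next turn
    for _ in range(turn_limit):
        turn = ask_model(goal, observation, responses)
        if not turn.actions:
            # Text only, no actions: the model considers the task finished.
            return turn.text
        responses = []
        for action in turn.actions:
            needs_ok = action.safety_decision == "require_confirmation"
            if needs_ok and not confirm(action):
                # Assumption: a declined confirmation ends the run.
                return f"stopped: user declined {action.name}"
            execute(action)
            observation = observe()
            responses.append({"name": action.name, "response": observation})
    return "stopped: turn limit reached"
```

In a real agent, `ask_model` would wrap the Gemini API call, `execute` would dispatch to Playwright, and `observe` would return the page URL and a fresh screenshot.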

## Operational guidance

- Run in a sandboxed browser profile or container.
- Use `--exclude` to block risky actions you do not want the model to take.
- Keep the viewport at 1440x900 unless you have a reason to change it.
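
One reason the viewport size matters: assuming, as Google's Computer Use documentation describes, that the model reports click coordinates on a normalized 0-999 grid, the agent has to scale them to real pixels before acting. A minimal sketch of that scaling, with a hypothetical `denormalize` helper and the 1440x900 default from above:

```python
VIEWPORT_WIDTH = 1440
VIEWPORT_HEIGHT = 900


def denormalize(x: int, y: int,
                width: int = VIEWPORT_WIDTH,
                height: int = VIEWPORT_HEIGHT) -> tuple:
    """Map model coordinates (assumed 0-999 grid) to viewport pixels."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return x * width // 1000, y * height // 1000


print(denormalize(500, 500))  # (720, 450)
```

If you change the viewport, the same dimensions need to be used both when launching the browser and when scaling coordinates.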

## Resources

- Script: `scripts/computer_use_agent.py`
- Reference notes: `references/google-computer-use.md`
- Env template: `env.example`

Overview

This skill builds and runs Gemini 2.5 Computer Use browser-control agents using Playwright to automate web tasks. It implements an agent loop that captures screenshots, interprets model function_calls, performs browser actions, and returns function_responses with updated screenshots and URLs. It also integrates optional safety confirmations for risky UI actions and supports configurable browser selection and sandboxed profiles.

How this skill works

The agent captures a screenshot and sends the user goal plus image to the Gemini Computer Use model. It parses function_call actions from model responses and executes them in Playwright, returning function_response objects that include the current URL and screenshot. If the model signals safety_decision=require_confirmation, the agent prompts for human confirmation before executing potentially risky actions. The loop repeats until the model stops issuing actions or a turn limit is reached.

When to use it

  • Automating multi-step web tasks that require visual context (screenshots) and browser control.
  • Building agents that need to parse model function_calls to drive Playwright actions.
  • Scenarios where human confirmation is required before executing risky UI operations.
  • Rapid prototyping of browser automation workflows that rely on model reasoning.
  • Controlled testing of web workflows in sandboxed or containerized environments.

Best practices

  • Run the agent in a sandboxed browser profile or container to limit risk.
  • Use the --exclude option to block specific actions you do not want the model to take.
  • Keep the viewport at 1440x900 for consistent visual context unless you need a different layout.
  • Prefer the bundled Chromium for quick setup; set COMPUTER_USE_BROWSER_EXECUTABLE only for custom browser needs.
  • Set a sensible turn limit to avoid long uncontrolled loops and monitor logs for unexpected behavior.

Example use cases

  • Find and extract the latest blog post title from a website by navigating and reading page content.
  • Automate form submission flows while requiring human confirmation before sensitive submissions.
  • Scrape structured information visible on pages that require step-by-step navigation and interaction.
  • Prototype an assistant that navigates web dashboards and reports back screenshots and status updates.
  • Test UI flows by having the model execute sequences of clicks, scrolls, and inputs while capturing evidence.

FAQ

Which browser does the agent use by default?

It uses Playwright's bundled Chromium by default; you can set COMPUTER_USE_BROWSER_CHANNEL or COMPUTER_USE_BROWSER_EXECUTABLE to change it.

How does safety confirmation work?

When the model returns safety_decision=require_confirmation, the agent pauses and prompts for human approval before performing the flagged action.
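
A confirmation gate of this kind can be sketched as follows. The helper name confirm_if_required is hypothetical, and the shape of the safety_decision payload (either a plain string, or a mapping with "decision" and "explanation" keys inside the function_call arguments) is an assumption to check against the reference notes, not a statement of what the bundled script does.

```python
from typing import Callable, Mapping


def confirm_if_required(args: Mapping, ask: Callable[[str], str] = input) -> bool:
    """Return True if the action may run, prompting the user only when flagged."""
    raw = args.get("safety_decision")
    if isinstance(raw, Mapping):
        # Assumed shape: {"decision": "...", "explanation": "..."}
        decision = raw.get("decision")
        explanation = raw.get("explanation", "")
    else:
        decision, explanation = raw, ""
    if decision != "require_confirmation":
        return True  # nothing flagged, no prompt needed
    reason = explanation or "no explanation provided"
    answer = ask(f"Model flagged this action: {reason}\nProceed? [y/N] ")
    # Anything other than an explicit yes is treated as a refusal.
    return answer.strip().lower() in {"y", "yes"}
```

Passing the prompt function as a parameter keeps the default interactive behavior while letting tests or a non-interactive wrapper supply their own answer.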