This skill helps you automate browser tasks with a CLI, enabling navigation, element interaction, and snapshots for AI-assisted workflows.
To add this skill to your agents:
```
npx playbooks add skill richardanaya/agent-skills --skill interact-with-browser
```
---
name: interact-with-browser
description: a CLI for interacting with a browser
license: MIT
compatibility: opencode
metadata:
  audience: testing
---
BE SURE TO CLEAN UP SCREENSHOTS AFTER YOU ARE DONE WITH EVERYTHING.
IF THIS NEEDS TO BE INSTALLED:
```
npm install -g agent-browser
agent-browser install # to get chromium downloaded
```
```
agent-browser open example.com
agent-browser snapshot # Get accessibility tree with refs
agent-browser click @e2 # Click by ref from snapshot
agent-browser fill @e3 "test@example.com" # Fill by ref
agent-browser get text @e1 # Get text by ref
agent-browser screenshot page.png
agent-browser close
```
Traditional Selectors (also supported)
```
agent-browser click "#submit"
agent-browser fill "#email" "test@example.com"
agent-browser find role button click --name "Submit"
```
Commands
Core Commands
```
agent-browser open <url> # Navigate to URL (aliases: goto, navigate)
agent-browser click <sel> # Click element
agent-browser dblclick <sel> # Double-click element
agent-browser focus <sel> # Focus element
agent-browser type <sel> <text> # Type into element
agent-browser fill <sel> <text> # Clear and fill
agent-browser press <key> # Press key (Enter, Tab, Control+a) (alias: key)
agent-browser keydown <key> # Hold key down
agent-browser keyup <key> # Release key
agent-browser hover <sel> # Hover element
agent-browser select <sel> <val> # Select dropdown option
agent-browser check <sel> # Check checkbox
agent-browser uncheck <sel> # Uncheck checkbox
agent-browser scroll <dir> [px] # Scroll (up/down/left/right)
agent-browser scrollintoview <sel> # Scroll element into view (alias: scrollinto)
agent-browser drag <src> <tgt> # Drag and drop
agent-browser upload <sel> <files> # Upload files
agent-browser screenshot [path] # Take screenshot (--full for full page)
agent-browser pdf <path> # Save as PDF
agent-browser snapshot # Accessibility tree with refs (best for AI)
agent-browser eval <js> # Run JavaScript
agent-browser close # Close browser (aliases: quit, exit)
```
Get Info
```
agent-browser get text <sel> # Get text content
agent-browser get html <sel> # Get innerHTML
agent-browser get value <sel> # Get input value
agent-browser get attr <sel> <attr> # Get attribute
agent-browser get title # Get page title
agent-browser get url # Get current URL
agent-browser get count <sel> # Count matching elements
agent-browser get box <sel> # Get bounding box
```
Check State
```
agent-browser is visible <sel> # Check if visible
agent-browser is enabled <sel> # Check if enabled
agent-browser is checked <sel> # Check if checked
```
Find Elements (Semantic Locators)
```
agent-browser find role <role> <action> [value] # By ARIA role
agent-browser find text <text> <action> # By text content
agent-browser find label <label> <action> [value] # By label
agent-browser find placeholder <ph> <action> [value] # By placeholder
agent-browser find alt <text> <action> # By alt text
agent-browser find title <text> <action> # By title attr
agent-browser find testid <id> <action> [value] # By data-testid
agent-browser find first <sel> <action> [value] # First match
agent-browser find last <sel> <action> [value] # Last match
agent-browser find nth <n> <sel> <action> [value] # Nth match
```
Actions: click, fill, check, hover, text
Examples:
```
agent-browser find role button click --name "Submit"
agent-browser find text "Sign In" click
agent-browser find label "Email" fill "test@example.com"
agent-browser find first ".item" click
agent-browser find nth 2 "a" text
```
Wait
```
agent-browser wait <selector> # Wait for element to be visible
agent-browser wait <ms> # Wait for time (milliseconds)
agent-browser wait --text "Welcome" # Wait for text to appear
agent-browser wait --url "**/dash" # Wait for URL pattern
agent-browser wait --load networkidle # Wait for load state
agent-browser wait --fn "window.ready === true" # Wait for JS condition
```
Load states: load, domcontentloaded, networkidle
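The `--url` pattern `**/dash` is a glob, not a regular expression. Assuming agent-browser follows Playwright-style URL globs (an assumption; this page does not spell out the syntax), `**` matches any run of characters including `/`, and `*` matches any run except `/`. A minimal sketch of that matching rule, using a hypothetical helper `globToRegExp` that deliberately ignores other glob features such as `?` and `{a,b}`:

```typescript
// Sketch of Playwright-style URL glob matching (an assumption about agent-browser's --url syntax).
// "**" matches any characters including "/", "*" matches any characters except "/".
// Other glob features such as "?" and "{a,b}" are not handled here.
function globToRegExp(glob: string): RegExp {
  let source = "^";
  for (let i = 0; i < glob.length; i++) {
    const ch = glob[i];
    if (ch === "*") {
      if (glob[i + 1] === "*") {
        source += ".*"; // "**": may cross path segments
        i++; // skip the second "*"
      } else {
        source += "[^/]*"; // "*": stays within one path segment
      }
    } else {
      // Escape regex metacharacters so every other character matches literally.
      source += ch.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
    }
  }
  return new RegExp(source + "$");
}

const dash = globToRegExp("**/dash");
console.log(dash.test("https://app.example.com/dash")); // true
console.log(dash.test("https://app.example.com/dashboard")); // false
```

This can help check, before calling `wait --url`, that a pattern actually matches the URL you expect; a pattern that never matches leaves the wait unsatisfied.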
Mouse Control
```
agent-browser mouse move <x> <y> # Move mouse
agent-browser mouse down [button] # Press button (left/right/middle)
agent-browser mouse up [button] # Release button
agent-browser mouse wheel <dy> [dx] # Scroll wheel
```
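The low-level mouse commands can reproduce a drag step by step (press, move, release), which may help when a page needs finer control than `drag <src> <tgt>`. The sketch below only builds the argument lists. It assumes you already have each element's bounding box in page pixels as `{x, y, width, height}` (for example from `get box <sel>`, though the exact output shape of that command is an assumption here); `Box`, `center`, and `dragCommands` are hypothetical names.

```typescript
// Hypothetical bounding-box shape in page pixels (assumed, not taken from agent-browser docs).
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Center of a box, rounded to whole pixels for the CLI arguments.
function center(box: Box): [number, number] {
  return [Math.round(box.x + box.width / 2), Math.round(box.y + box.height / 2)];
}

// Arguments (after "agent-browser") for a press-move-release drag between two boxes.
function dragCommands(from: Box, to: Box): string[][] {
  const [fx, fy] = center(from);
  const [tx, ty] = center(to);
  return [
    ["mouse", "move", String(fx), String(fy)],
    ["mouse", "down", "left"],
    ["mouse", "move", String(tx), String(ty)],
    ["mouse", "up", "left"],
  ];
}

const cmds = dragCommands(
  { x: 10, y: 20, width: 100, height: 40 },
  { x: 300, y: 200, width: 50, height: 50 },
);
for (const c of cmds) console.log(["agent-browser", ...c].join(" "));
// agent-browser mouse move 60 40
// agent-browser mouse down left
// agent-browser mouse move 325 225
// agent-browser mouse up left
```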
Browser Settings
```
agent-browser set viewport <w> <h> # Set viewport size
agent-browser set device <name> # Emulate device ("iPhone 14")
agent-browser set geo <lat> <lng> # Set geolocation
agent-browser set offline [on|off] # Toggle offline mode
agent-browser set headers <json> # Extra HTTP headers
agent-browser set credentials <u> <p> # HTTP basic auth
agent-browser set media [dark|light] # Emulate color scheme
```
Cookies & Storage
```
agent-browser cookies # Get all cookies
agent-browser cookies set <name> <val> # Set cookie
agent-browser cookies clear # Clear cookies
agent-browser storage local # Get all localStorage
agent-browser storage local <key> # Get specific key
agent-browser storage local set <k> <v> # Set value
agent-browser storage local clear # Clear all
agent-browser storage session # Same for sessionStorage
```
Network
```
agent-browser network route <url> # Intercept requests
agent-browser network route <url> --abort # Block requests
agent-browser network route <url> --body <json> # Mock response
agent-browser network unroute [url] # Remove routes
agent-browser network requests # View tracked requests
agent-browser network requests --filter api # Filter requests
```
Tabs & Windows
```
agent-browser tab # List tabs
agent-browser tab new [url] # New tab (optionally with URL)
agent-browser tab <n> # Switch to tab n
agent-browser tab close [n] # Close tab
agent-browser window new # New window
```
Frames
```
agent-browser frame <sel> # Switch to iframe
agent-browser frame main # Back to main frame
```
Dialogs
```
agent-browser dialog accept [text] # Accept (with optional prompt text)
agent-browser dialog dismiss # Dismiss
```
Debug
```
agent-browser trace start [path] # Start recording trace
agent-browser trace stop [path] # Stop and save trace
agent-browser console # View console messages
agent-browser console --clear # Clear console
agent-browser errors # View page errors
agent-browser errors --clear # Clear errors
agent-browser highlight <sel> # Highlight element
agent-browser state save <path> # Save auth state
agent-browser state load <path> # Load auth state
```
Navigation
```
agent-browser back # Go back
agent-browser forward # Go forward
agent-browser reload # Reload page
```
Setup
```
agent-browser install # Download Chromium browser
agent-browser install --with-deps # Also install system deps (Linux)
```
Sessions
Run multiple isolated browser instances:
```
# Different sessions
agent-browser --session agent1 open site-a.com
agent-browser --session agent2 open site-b.com

# Or via environment variable
AGENT_BROWSER_SESSION=agent1 agent-browser click "#btn"

# List active sessions
agent-browser session list
# Output:
# Active sessions:
# -> default
#    agent1

# Show current session
agent-browser session
```
Each session has its own:
- Browser instance
- Cookies and storage
- Navigation history
- Authentication state
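When an agent manages several sessions, it can help to read `session list` programmatically. A minimal parser for the text format in the sample output above, assuming a header line `Active sessions:` followed by one session per line, with `->` marking the current session; the real output may differ between versions, so treat this as a sketch. `parseSessionList` and `SessionList` are hypothetical names.

```typescript
// Parses the text format of `agent-browser session list` as shown in the sample output:
// a header line, then one session per line, with "->" marking the current session.
interface SessionList {
  current: string | undefined;
  sessions: string[];
}

function parseSessionList(output: string): SessionList {
  const result: SessionList = { current: undefined, sessions: [] };
  for (const raw of output.split("\n")) {
    const line = raw.trim();
    if (line === "" || line === "Active sessions:") continue; // skip blanks and the header
    if (line.slice(0, 2) === "->") {
      const session = line.slice(2).trim();
      result.current = session;
      result.sessions.push(session);
    } else {
      result.sessions.push(line);
    }
  }
  return result;
}

const parsed = parseSessionList("Active sessions:\n-> default\n   agent1\n");
console.log(parsed.current, parsed.sessions.join(",")); // default default,agent1
```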
Snapshot Options
The snapshot command supports filtering to reduce output size:
```
agent-browser snapshot # Full accessibility tree
agent-browser snapshot -i # Interactive elements only (buttons, inputs, links)
agent-browser snapshot -c # Compact (remove empty structural elements)
agent-browser snapshot -d 3 # Limit depth to 3 levels
agent-browser snapshot -s "#main" # Scope to CSS selector
agent-browser snapshot -i -c -d 5 # Combine options
```
| Option | Description |
|---|---|
| `-i`, `--interactive` | Only show interactive elements (buttons, links, inputs) |
| `-c`, `--compact` | Remove empty structural elements |
| `-d`, `--depth <n>` | Limit tree depth |
| `-s`, `--selector <sel>` | Scope to CSS selector |
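When a program rather than a shell invokes the CLI, building the snapshot arguments as an array sidesteps quoting problems: in a shell, an unquoted `#main` would start a comment. A small sketch that maps an options object onto the flags in the table above; the `SnapshotOptions` field names and the `snapshotArgs` helper are our own invention, and only the flags come from this page.

```typescript
// Option names are our own; each maps onto a flag documented in the snapshot options table.
interface SnapshotOptions {
  interactive?: boolean; // -i
  compact?: boolean; // -c
  depth?: number; // -d <n>
  selector?: string; // -s <sel>
  json?: boolean; // --json
}

// Build the argument vector that follows "agent-browser" for a snapshot call.
function snapshotArgs(opts: SnapshotOptions = {}): string[] {
  const args = ["snapshot"];
  if (opts.interactive) args.push("-i");
  if (opts.compact) args.push("-c");
  if (opts.depth !== undefined) args.push("-d", String(opts.depth));
  if (opts.selector !== undefined) args.push("-s", opts.selector);
  if (opts.json) args.push("--json");
  return args;
}

console.log(snapshotArgs({ interactive: true, compact: true, depth: 5 }).join(" ")); // snapshot -i -c -d 5
console.log(JSON.stringify(snapshotArgs({ selector: "#main" }))); // ["snapshot","-s","#main"]
```

Passing such an array to a process-spawning API that takes an argument vector, rather than a single command string, means no shell ever parses `#main`.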
Options
| Option | Description |
|---|---|
| `--session <name>` | Use isolated session (or `AGENT_BROWSER_SESSION` env) |
| `--headers <json>` | Set HTTP headers scoped to the URL's origin |
| `--executable-path <path>` | Custom browser executable (or `AGENT_BROWSER_EXECUTABLE_PATH` env) |
| `--json` | JSON output (for agents) |
| `--full`, `-f` | Full page screenshot |
| `--name`, `-n` | Locator name filter |
| `--exact` | Exact text match |
| `--headed` | Show browser window (not headless) |
| `--cdp <port>` | Connect via Chrome DevTools Protocol |
| `--debug` | Debug output |
Selectors
Refs (Recommended for AI)
Refs provide deterministic element selection from snapshots:
```
# 1. Get snapshot with refs
agent-browser snapshot
# Output:
# - heading "Example Domain" [ref=e1] [level=1]
# - button "Submit" [ref=e2]
# - textbox "Email" [ref=e3]
# - link "Learn more" [ref=e4]

# 2. Use refs to interact
agent-browser click @e2 # Click the button
agent-browser fill @e3 "test@example.com" # Fill the textbox
agent-browser get text @e1 # Get heading text
agent-browser hover @e4 # Hover the link
```
Why use refs?
- Deterministic: a ref points to the exact element from the snapshot
- Fast: no DOM re-query is needed
- AI-friendly: the snapshot + ref workflow is well suited to LLMs
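Without `--json`, the snapshot is plain text in the shape of the sample output above. A sketch that turns those lines into a ref lookup, assuming every element line looks like `- role "name" [ref=eN]`, optionally with other bracketed attributes. Lines that do not fit the pattern (or names containing escaped quotes) are skipped, which is one reason to prefer `--json` for agents. `parseSnapshotText` and `SnapshotEntry` are hypothetical names.

```typescript
interface SnapshotEntry {
  ref: string;
  role: string;
  name: string; // empty when the line has no quoted name
}

// Matches lines like:  - button "Submit" [ref=e2]   (other [attr=...] tokens are ignored)
const ENTRY_LINE = /^\s*-\s+([\w-]+)(?:\s+"([^"]*)")?.*?\[ref=(\w+)\]/;

function parseSnapshotText(text: string): Record<string, SnapshotEntry> {
  const byRef: Record<string, SnapshotEntry> = {};
  for (const line of text.split("\n")) {
    const m = ENTRY_LINE.exec(line);
    if (m) {
      byRef[m[3]] = { ref: m[3], role: m[1], name: m[2] ?? "" };
    }
  }
  return byRef;
}

const snapshot = [
  '- heading "Example Domain" [ref=e1] [level=1]',
  '- button "Submit" [ref=e2]',
  '- textbox "Email" [ref=e3]',
  '- link "Learn more" [ref=e4]',
].join("\n");

const refs = parseSnapshotText(snapshot);
console.log(refs["e2"].role, refs["e4"].name); // button Learn more
```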
CSS Selectors
```
agent-browser click "#id"
agent-browser click ".class"
agent-browser click "div > button"
```
Text & XPath
```
agent-browser click "text=Submit"
agent-browser click "xpath=//button"
```
Semantic Locators
```
agent-browser find role button click --name "Submit"
agent-browser find label "Email" fill "test@example.com"
```
Agent Mode
Use --json for machine-readable output:
```
agent-browser snapshot --json
# Returns: {"success":true,"data":{"snapshot":"...","refs":{"e1":{"role":"heading","name":"Title"},...}}}
agent-browser get text @e1 --json
agent-browser is visible @e2 --json
```
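The `--json` envelope makes the "identify the target ref" step mechanical. The sketch below types only the fields visible in the sample output above (`success`, `data.snapshot`, and `data.refs` entries with `role` and `name`); any other fields, and the shape of an error response, are not documented here and are left out. `parseSnapshot` and `findRef` are hypothetical helpers.

```typescript
// Fields below are the ones visible in the sample `snapshot --json` output; anything else is omitted.
interface RefInfo {
  role: string;
  name?: string;
}

interface SnapshotResult {
  success: boolean;
  data?: {
    snapshot: string;
    refs: Record<string, RefInfo>;
  };
}

// Parse CLI stdout and return the refs map, failing loudly if the command did not succeed.
function parseSnapshot(stdout: string): Record<string, RefInfo> {
  const result = JSON.parse(stdout) as SnapshotResult;
  if (!result.success || result.data === undefined) {
    throw new Error("snapshot did not succeed");
  }
  return result.data.refs;
}

// Return the first ref (as "@eN", ready for the CLI) whose role and name match.
function findRef(refs: Record<string, RefInfo>, role: string, name: string): string | undefined {
  for (const key of Object.keys(refs)) {
    if (refs[key].role === role && refs[key].name === name) {
      return "@" + key;
    }
  }
  return undefined;
}

const stdout =
  '{"success":true,"data":{"snapshot":"...","refs":{' +
  '"e1":{"role":"heading","name":"Example Domain"},' +
  '"e2":{"role":"button","name":"Submit"}}}}';

console.log(findRef(parseSnapshot(stdout), "button", "Submit")); // @e2
```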
Optimal AI Workflow
```
# 1. Navigate and get snapshot
agent-browser open example.com
agent-browser snapshot -i --json # AI parses tree and refs

# 2. AI identifies target refs from snapshot

# 3. Execute actions using refs
agent-browser click @e2
agent-browser fill @e3 "input text"

# 4. Get new snapshot if page changed
agent-browser snapshot -i --json
```
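The four steps can be expressed as a small loop. This is a sketch, not part of agent-browser: the CLI call is abstracted behind a hypothetical `Runner` function that takes the arguments after `agent-browser` and returns stdout (in Node it could wrap `child_process.execFileSync`), which also lets the flow be exercised with a fake runner. The task itself (fill an `Email` textbox, click `Submit`) and the helper names are illustrative.

```typescript
// Hypothetical abstraction: runs `agent-browser <args...>` and returns its stdout.
type Runner = (args: string[]) => string;

interface RefInfo {
  role: string;
  name?: string;
}

// Steps 1 and 4: take an interactive-only JSON snapshot and return its refs.
function snapshotRefs(run: Runner): Record<string, RefInfo> {
  const out = JSON.parse(run(["snapshot", "-i", "--json"]));
  if (!out.success) throw new Error("snapshot did not succeed");
  return out.data.refs;
}

// Step 2: pick the ref for a given role and accessible name.
function refFor(refs: Record<string, RefInfo>, role: string, name: string): string {
  for (const key of Object.keys(refs)) {
    if (refs[key].role === role && refs[key].name === name) return "@" + key;
  }
  throw new Error(`no ${role} named "${name}" in snapshot`);
}

// Hypothetical task: fill the "Email" textbox, then click "Submit".
function submitEmail(run: Runner, url: string, email: string): Record<string, RefInfo> {
  run(["open", url]);
  const refs = snapshotRefs(run);
  run(["fill", refFor(refs, "textbox", "Email"), email]); // step 3
  run(["click", refFor(refs, "button", "Submit")]);
  return snapshotRefs(run); // step 4: the page may have changed, so take fresh refs
}

// Fake runner that records calls and returns a canned snapshot, for trying the flow offline.
const calls: string[][] = [];
const fakeRun: Runner = (args) => {
  calls.push(args);
  if (args[0] !== "snapshot") return "";
  return JSON.stringify({
    success: true,
    data: {
      snapshot: "",
      refs: { e2: { role: "button", name: "Submit" }, e3: { role: "textbox", name: "Email" } },
    },
  });
};

submitEmail(fakeRun, "example.com", "test@example.com");
calls.forEach((c) => console.log(c.join(" ")));
// open example.com
// snapshot -i --json
// fill @e3 test@example.com
// click @e2
// snapshot -i --json
```

Step 4 of the workflow asks for a new snapshot once the page changes, so the sketch returns fresh refs instead of reusing the ones it clicked with.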
Headed Mode
Show the browser window for debugging:
```
agent-browser open example.com --headed
```
This opens a visible browser window instead of running headless.
Authenticated Sessions
Use --headers to set HTTP headers for a specific origin, enabling authentication without login flows:
```
# Headers are scoped to api.example.com only
agent-browser open api.example.com --headers '{"Authorization": "Bearer <token>"}'

# Requests to api.example.com include the auth header
agent-browser snapshot -i --json
agent-browser click @e2

# Navigate to another domain - headers are NOT sent (safe!)
agent-browser open other-site.com
```
This is useful for:
- Skipping login flows: authenticate via headers instead of the UI
- Switching users: start new sessions with different auth tokens
- API testing: access protected endpoints directly
- Security: headers are scoped to the origin, not leaked to other domains
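The scoping rule can be modeled as an origin comparison. This conceptual sketch is not agent-browser's code: it attaches the headers only when a request's origin (scheme, host, and port) equals the origin of the URL passed to `open`. It assumes a bare host such as `api.example.com` is treated as `https://`; `originOf` and `headersFor` are hypothetical helpers.

```typescript
// Conceptual model of origin-scoped headers; not agent-browser's implementation.
// Assumption: a bare host such as "api.example.com" is opened as https.
function originOf(target: string): string {
  const hasScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(target);
  return new URL(hasScheme ? target : "https://" + target).origin;
}

// Attach the scoped headers only when the request's origin matches exactly.
function headersFor(
  requestUrl: string,
  scopedOrigin: string,
  headers: Record<string, string>,
): Record<string, string> {
  return new URL(requestUrl).origin === scopedOrigin ? headers : {};
}

const apiOrigin = originOf("api.example.com"); // "https://api.example.com"
const auth = { Authorization: "Bearer <token>" };

console.log(Object.keys(headersFor("https://api.example.com/v1/me", apiOrigin, auth)).length); // 1
console.log(Object.keys(headersFor("https://other-site.com/", apiOrigin, auth)).length); // 0
```

Comparing whole origins, rather than matching domain suffixes, keeps `example.com` and `api.example.com` separate, which is consistent with "scoped to api.example.com only" in the example above.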
To set headers for multiple origins, use --headers with each open command:
```
agent-browser open api.example.com --headers '{"Authorization": "Bearer token1"}'
agent-browser open api.acme.com --headers '{"Authorization": "Bearer token2"}'
```
For global headers (all domains), use set headers:
```
agent-browser set headers '{"X-Custom-Header": "value"}'
```
Custom Browser Executable
Use a custom browser executable instead of the bundled Chromium. This is useful for:
- Serverless deployment: use lightweight Chromium builds like @sparticuz/chromium (~50MB vs ~684MB)
- System browsers: use an existing Chrome/Chromium installation
- Custom builds: use modified browser builds
CLI Usage
```
# Via flag
agent-browser --executable-path /path/to/chromium open example.com

# Via environment variable
AGENT_BROWSER_EXECUTABLE_PATH=/path/to/chromium agent-browser open example.com
```
Serverless Example (Vercel/AWS Lambda)
```typescript
import chromium from '@sparticuz/chromium';
import { BrowserManager } from 'agent-browser';

export async function handler() {
  const browser = new BrowserManager();
  await browser.launch({
    // Use the serverless-friendly Chromium binary instead of the bundled one
    executablePath: await chromium.executablePath(),
    headless: true,
  });
  // ... use browser
}
```
This skill provides a CLI for interacting with a headless or headed browser to automate browsing, inspect accessibility trees, and run UI actions. It exposes deterministic refs, traditional selectors, semantic locators, and a rich set of commands for navigation, input, network control, and session management. The tool is optimized for AI-driven workflows by offering JSON output and snapshots that include stable refs for reliable automation. It also supports sessions, custom executables, and serverless usage patterns.
You drive the browser with simple CLI commands like open, click, fill, snapshot, and screenshot. Use snapshot to obtain an accessibility tree with refs; then reference elements by ref (e.g., @e2) for deterministic actions. For dynamic selection you can also use CSS, text, XPath, or semantic locators (find role/label/etc.). Many commands support --json for machine-readable output and options to scope, filter, or limit snapshot output.
How do refs improve reliability?
Refs map directly to nodes in a snapshot, giving deterministic element references that avoid repeated DOM queries and brittle selectors.
Can I run multiple isolated browsers?
Yes. Use --session or AGENT_BROWSER_SESSION to create isolated browser instances with separate cookies, storage, and history.