
This skill retrieves the current OSWorld observation, including a screenshot and the accessibility tree, for debugging and UI analysis.

npx playbooks add skill bdambrosio/cognitive_workbench --skill osworld-observe

Review the files below or copy the command above to add this skill to your agents.

Skill.md
---
name: osworld-observe
type: python
description: "Get the current observation from the OSWorld environment. Returns screenshot (base64 PNG) and accessibility tree (JSON)."
schema_hint:
  value: "ignored"
  include_screenshot: "bool (default: true)"
  include_a11y: "bool (default: true)"
  out: "$variable"
examples:
  - '{"type":"osworld-observe","out":"$obs"}'
  - '{"type":"osworld-observe","include_screenshot":false,"out":"$a11y_only"}'
---

# OSWorld Observe Tool (Level 4)

## Input
- `include_screenshot`: bool (default: true) - include the screenshot in the observation
- `include_a11y`: bool (default: true) - include the accessibility tree in the observation
- `out`: variable that receives the resulting note ID (for example `$obs`)
- `value`: ignored

## Output
- Note ID (bound to `out` variable) containing:
  - `text`: formatted observation summary
  - `format`: "json"
  - `metadata`: observation data including:
    - `timestamp`: observation timestamp
    - `step_counter`: current step counter
    - `observation.screenshot`: dict with `encoding` ("png") and `data_base64` (base64-encoded PNG)
    - `observation.accessibility_tree`: raw accessibility tree JSON
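
As an illustration of consuming this output, the sketch below decodes the screenshot from a note's metadata. It assumes the metadata is available as a Python dict in which `observation` is a nested key holding `screenshot`; the helper name `extract_screenshot_bytes` is hypothetical and not part of the skill.

```python
import base64
from typing import Optional

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def extract_screenshot_bytes(metadata: dict) -> Optional[bytes]:
    """Return decoded PNG bytes, or None if the observation has no screenshot.

    Assumption: metadata looks like
    {"observation": {"screenshot": {"encoding": "png", "data_base64": "..."}}}.
    """
    screenshot = (metadata.get("observation") or {}).get("screenshot")
    if not screenshot:
        return None  # e.g. the observation was taken with include_screenshot=false
    if screenshot.get("encoding") != "png":
        raise ValueError(f"unexpected encoding: {screenshot.get('encoding')!r}")
    data = base64.b64decode(screenshot["data_base64"])
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not start with the PNG signature")
    return data
```

Checking the signature is a cheap way to catch truncated or mislabeled payloads before writing them to disk.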

## Configuration
- `OSWORLD_URL` environment variable (defaults to `http://localhost:3002`)
- Or pass `osworld_url` in character config's `osworld_config` section
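
The lookup described above could be sketched roughly as follows. The precedence (character config first, then the environment variable, then the default) is an assumption, since the skill does not state which source wins; `resolve_osworld_url` is a hypothetical helper name.

```python
import os
from typing import Mapping, Optional

DEFAULT_OSWORLD_URL = "http://localhost:3002"


def resolve_osworld_url(
    character_config: Optional[Mapping] = None,
    env: Optional[Mapping] = None,
) -> str:
    """Pick the OSWorld URL from config, environment, or the default.

    Assumption: a value in character_config["osworld_config"]["osworld_url"]
    overrides the OSWORLD_URL environment variable.
    """
    env = os.environ if env is None else env
    osworld_config = (character_config or {}).get("osworld_config") or {}
    return (
        osworld_config.get("osworld_url")
        or env.get("OSWORLD_URL")
        or DEFAULT_OSWORLD_URL
    )
```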

## Common Workflow
```json
{"type":"osworld-observe","out":"$obs"}
{"type":"osworld-execute","python":"pyautogui.click(100,200)","out":"$result"}
{"type":"osworld-observe","out":"$obs2"}
```

## Notes
- Screenshot is returned as base64-encoded PNG data
- Accessibility tree is raw JSON from OSWorld
- No interpretation or filtering is performed - raw observation data only

Overview

This skill captures the current observation from an OSWorld environment and, by default, returns both a screenshot (base64 PNG) and the raw accessibility tree (JSON). It provides a timestamped, unfiltered snapshot of the UI state for automated agents to inspect, store, or process downstream. The OSWorld endpoint is configured via an environment variable or the character config.

How this skill works

When invoked, the skill requests an observation from OSWorld and packages the response into a note whose ID is bound to the configured output variable. The observation includes a base64-encoded PNG screenshot and the raw accessibility tree JSON, along with metadata such as the timestamp and step counter. Either the screenshot or the accessibility tree can be omitted by setting include_screenshot or include_a11y to false.
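
A minimal sketch of composing such an invocation programmatically is shown below. The field names come from the skill's schema; the helper `build_observe_action` is hypothetical, and flags left at their default are simply omitted from the emitted JSON.

```python
import json


def build_observe_action(
    out: str,
    include_screenshot: bool = True,
    include_a11y: bool = True,
) -> str:
    """Build the JSON action line for an osworld-observe call."""
    if not out.startswith("$"):
        raise ValueError("out must be a variable name such as '$obs'")
    action = {"type": "osworld-observe"}
    # Both flags default to true, so only emit them when disabling something.
    if not include_screenshot:
        action["include_screenshot"] = False
    if not include_a11y:
        action["include_a11y"] = False
    action["out"] = out
    return json.dumps(action, separators=(",", ":"))


print(build_observe_action("$obs"))
# {"type":"osworld-observe","out":"$obs"}
print(build_observe_action("$a11y_only", include_screenshot=False))
# {"type":"osworld-observe","include_screenshot":false,"out":"$a11y_only"}
```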

When to use it

  • Capture a visual and accessibility snapshot before or after an automated action.
  • Gather raw UI state for debugging, test verification, or training data collection.
  • Archive reproducible observations for audit trails or issue reports.
  • Compare pre- and post-action UI state to validate the effect of automation steps.
  • Feed the raw accessibility tree into downstream analysis or accessibility checks.

Best practices

  • Enable only the data you need: set include_screenshot or include_a11y to false to reduce payload size.
  • Store the returned base64 PNG and JSON separately to simplify downstream processing.
  • Use the timestamp and step_counter metadata to reliably correlate observations with actions.
  • Avoid interpreting or modifying the raw accessibility tree inside this step; do analysis in dedicated processing steps.
  • Keep OSWORLD_URL configured consistently across environments to avoid pointing to the wrong instance.
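
One way to follow the storage and correlation advice above is sketched here: the screenshot and accessibility tree go to separate files whose names carry the step counter. The metadata layout (a dict with `step_counter` and a nested `observation` key), the file naming scheme, and the helper `save_observation` are all assumptions for illustration.

```python
import base64
import json
from pathlib import Path


def save_observation(metadata: dict, out_dir) -> dict:
    """Write the screenshot and accessibility tree to separate files.

    Returns a dict mapping "screenshot" / "accessibility_tree" to the
    paths that were written; parts missing from the observation are skipped.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    step = metadata.get("step_counter", "unknown")
    observation = metadata.get("observation") or {}
    written = {}

    screenshot = observation.get("screenshot")
    if screenshot:
        png_path = out_dir / f"step_{step}.png"
        png_path.write_bytes(base64.b64decode(screenshot["data_base64"]))
        written["screenshot"] = png_path

    tree = observation.get("accessibility_tree")
    if tree is not None:
        json_path = out_dir / f"step_{step}_a11y.json"
        json_path.write_text(json.dumps(tree, indent=2), encoding="utf-8")
        written["accessibility_tree"] = json_path

    return written
```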

Example use cases

  • Capture screenshot + accessibility tree immediately after launching an app to verify initial layout.
  • Take observations before and after a click to confirm UI changes and generate a diff.
  • Collect accessibility trees across screens to build a map of interactive elements for automated testing.
  • Produce a visual and structural record for bug reports, attaching the base64 PNG and raw JSON.
  • Use observations as labeled inputs for training models that predict next actions from UI state.
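
The before/after comparison mentioned above could be approximated with a generic JSON diff, as in the sketch below. It makes no assumption about the accessibility tree's schema beyond it being JSON (dicts, lists, scalars); the sample trees and the helper names `flatten` and `diff_trees` are hypothetical. List items are compared by position, so an inserted sibling shows up as changes to everything after it.

```python
def flatten(node, prefix=""):
    """Flatten nested JSON into a {path: leaf_value} mapping."""
    if isinstance(node, dict):
        children = node.items()
    elif isinstance(node, list):
        children = enumerate(node)
    else:
        return {prefix or "/": node}
    flat = {}
    for key, value in children:
        flat.update(flatten(value, f"{prefix}/{key}"))
    return flat


def diff_trees(before, after):
    """Report leaf paths that were added, removed, or changed."""
    a, b = flatten(before), flatten(after)
    return {
        "added": sorted(b.keys() - a.keys()),
        "removed": sorted(a.keys() - b.keys()),
        "changed": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }


# Made-up trees standing in for two observations' accessibility_tree values.
before = {"role": "window", "children": [{"role": "button", "name": "OK"}]}
after = {
    "role": "window",
    "children": [
        {"role": "button", "name": "Cancel"},
        {"role": "label", "name": "Done"},
    ],
}
print(diff_trees(before, after))
# {'added': ['/children/1/name', '/children/1/role'], 'removed': [], 'changed': ['/children/0/name']}
```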

FAQ

Can I disable the screenshot or accessibility tree?

Yes. Set include_screenshot or include_a11y to false to omit either item from the returned observation.

How do I point the skill to a non-default OSWorld instance?

Set the OSWORLD_URL environment variable or provide osworld_url in the character config's osworld_config section.