gui-act

Execute GUI actions — click, type, send messages. Includes detection, memory matching, component saving, execution, diff, and transition recording as one unified flow.

3,891 stars

Best use case

gui-act is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Execute GUI actions — click, type, send messages. Includes detection, memory matching, component saving, execution, diff, and transition recording as one unified flow.

Teams using gui-act should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/gui-act/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/alfredjamesli/gui-claw/skills/gui-act/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/gui-act/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How gui-act Compares

Feature / Agentgui-actStandard Approach
Platform SupportNot specifiedLimited / Varies
Context Awareness High Baseline
Installation ComplexityUnknownN/A

Frequently Asked Questions

What does this skill do?

Execute GUI actions — click, type, send messages. Includes detection, memory matching, component saving, execution, diff, and transition recording as one unified flow.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

SKILL.md Source

# Act — Detect, Match, Save, Execute, Diff, Record

This is the core action loop. Every action follows this flow. Do not skip any part.

---

## The Complete Action Flow

```
┌──────────────────────────────────────────────────────────────────┐
│ 1. DETECT:  Screenshot → OCR + GPA-GUI-Detector                 │
│ 2. MATCH:   Compare detected elements against saved memory       │
│ 3. SAVE COMPONENTS: New elements → crop + save + label           │
│             ↑ Save BEFORE clicking — even if click fails,        │
│               components are in memory for next time             │
│ 4. DECIDE & EXECUTE: Pick target → click/type at coordinates     │
│ 5. DETECT AGAIN: Screenshot → OCR (only if action might fail)    │
│ 6. DIFF:    Compare before vs after OCR texts                    │
│ 7. SAVE TRANSITION: Record state change to transitions.json      │
└──────────────────────────────────────────────────────────────────┘
```

**Key change from previous version:** Component saving (step 3) happens BEFORE execution (step 4), not after. This means:
- Even if the click fails, you've already saved what you learned about the current page
- The next visit to this page can use template matching immediately
- You never "lose" detected components by skipping saves after action

---

## Automation API

Two platform-independent functions handle ALL saving automatically.
They work on any screenshot (local Mac, remote VM, downloaded image).
**The LLM does NOT manually crop or write JSON files** — call these functions instead.

### `learn_from_screenshot(img_path, domain, app_name, page_name)`
Runs GPA-GUI-Detector + OCR on a screenshot, crops all components, saves to memory.
Call this ONCE per page state you observe (step 3).

```python
from scripts.app_memory import learn_from_screenshot

# After taking a screenshot and before clicking anything:
result = learn_from_screenshot(
    img_path="/path/to/screenshot.png",
    domain="united.com",           # None for non-browser apps
    app_name="chromium",           # Browser or app name
    page_name="homepage",          # Human-readable page label
)
# Note: Scale (detection→click) is computed dynamically by detect_all()
# via refresh_screen_info(). No manual retina flag needed.
# result = {"saved": 42, "new": 38, "components": ["Booking", "Travel_info", ...]}
```

### `record_page_transition(before_img, after_img, click_label, click_pos, domain, app_name)`
Runs OCR on before/after screenshots, computes diff, saves state transition.
Call this ONCE per click (step 7).

```python
from scripts.app_memory import record_page_transition

# After clicking and taking a new screenshot:
result = record_page_transition(
    before_img_path="/path/to/before.png",
    after_img_path="/path/to/after.png",
    click_label="Travel_info",     # What was clicked
    click_pos=(779, 187),          # Where it was clicked (click space)
    domain="united.com",
    app_name="chromium",
)
# result = {"appeared": [...], "disappeared": [...], "from": "...", "to": "..."}
```

---

## Step-by-Step Walkthrough

### Step 1: DETECT (before action)

Take a screenshot. Run OCR + GPA-GUI-Detector on it:

```python
from scripts.ui_detector import detect_text, detect_icons

ocr_results = detect_text(screenshot_path)
# [{"label": "Travel info", "cx": 779, "cy": 187, ...}, ...]

icon_results = detect_icons(screenshot_path)
# [{"cx": 849, "cy": 783, "confidence": 0.85, "label": null, ...}, ...]
```

For remote VMs: download screenshot to Mac first, then run detection locally.

### Step 2: MATCH against saved memory

Check if components are already in memory:

```python
from scripts.app_memory import match_all_components

matched = match_all_components(app_name, img=screenshot_path, threshold=0.8)
# {"travel_info_btn": (661, 188, 0.95), "book_btn": (490, 283, 0.92), ...}
```

**If components match:** coordinates come from template matching (most precise). Skip to step 4.
**If components are NEW:** coordinates come from OCR/GPA-GUI-Detector. Continue to step 3.

### Step 3: SAVE COMPONENTS (before clicking!)

**Call `learn_from_screenshot()` to save all detected components automatically.**

```python
from scripts.app_memory import learn_from_screenshot

learn_from_screenshot(
    img_path=screenshot_path,
    domain="united.com",
    page_name="homepage",
)
```

This is automated — no manual cropping, no manual JSON editing.
The function handles: detection, filtering, naming, dedup, cropping, saving.

### Step 4: DECIDE & EXECUTE

Pick the target element, get coordinates from detection (step 1) or memory (step 2), click.

**Local Mac apps:**
```python
from scripts.app_memory import click_and_record, click_component

# Known component (template matched):
click_component(app_name, component_name)

# New element (detected coordinates):
click_and_record(app_name, "Travel_info", 779, 187)
```

**Remote VMs (OSWorld):**
```python
# Send click via VM API
import pyautogui
pyautogui.click(779, 187)
```

**CRITICAL:** Always use `gui_action.py click` (with appropriate --remote if needed), never raw platform-specific calls.

### Step 5: DETECT AGAIN (if needed)

Take another screenshot after the action. Run OCR to verify the result.

This step is needed when:
- You need to verify the click worked (page changed)
- You need to find the next element to click
- The action might have failed (wrong element, popup appeared)

For simple keyboard shortcuts (Ctrl+L, typing text), you can skip this step.

### Step 6: DIFF

Compare OCR texts from before and after screenshots:
- **Appeared:** new text = new page/state
- **Disappeared:** gone text = left previous state
- **Persisted:** unchanged text = persistent UI (nav bar, etc.)

This is done automatically by `record_page_transition()` in step 7.

### Step 7: SAVE TRANSITION

**Call `record_page_transition()` to save the state change automatically.**

```python
from scripts.app_memory import record_page_transition

record_page_transition(
    before_img_path=before_screenshot,
    after_img_path=after_screenshot,
    click_label="Travel_info",
    click_pos=(779, 187),
    domain="united.com",
)
```

This automatically: runs OCR on both images, diffs them, saves states + transition to `states.json` / `transitions.json`.

---

## Concrete Example: OSWorld Task

```python
from scripts.ui_detector import ImageContext, detect_all

# Step 1: Screenshot + detect (returns IMAGE PIXEL coords)
# Download VM screenshot to Mac, then detect locally
elements = detect_all("screenshot.png")  # returns image pixel coords

# Step 2: Match — first visit, no memory yet → all new

# Step 3: Save components BEFORE clicking
learn_from_screenshot("screenshot.png", domain="united.com", page_name="homepage")
# → 42 components saved automatically (crops use pixel coords directly)

# Step 4: Click — convert pixel coords to click-space
ctx = ImageContext.remote()  # VM screenshot = 1:1
click_x, click_y = ctx.image_to_click(779, 187)  # → (779, 187) for remote
pyautogui.click(click_x, click_y)  # via VM API

# Step 5: Detect again
new_elements = detect_all("new_screenshot.png")

# Step 6+7: Diff + save transition
record_page_transition("screenshot.png", "new_screenshot.png",
                       click_label="Travel_info", click_pos=(779, 187),
                       domain="united.com")
# → appeared: ["Bags", "United app", ...], disappeared: [...]
# → transition saved to transitions.json
```

---

## The Payoff

**First visit to united.com:**
```
Screenshot → GPA-GUI-Detector + OCR → learn_from_screenshot() saves everything
→ click → record_page_transition() saves state change
→ Total: ~5 seconds of detection, everything in memory
```

**Second visit to united.com:**
```
Screenshot → template match against saved components → instant recognition
→ "I see Travel_info at (661, 188), Bags at (485, 324)"
→ click directly. No GPA. No image tool. Fast.
```

---

## How Coordinates Work

`detect_all()` returns **image pixel coordinates**. Use `ImageContext` to convert to click-space:

```python
from scripts.ui_detector import ImageContext

# Choose context based on screenshot source:
ctx = ImageContext.remote()              # VM / remote screenshots
ctx = ImageContext.mac_fullscreen()      # Mac fullscreen
ctx = ImageContext.mac_window(wx, wy)    # Mac window crop

# Convert for clicking:
click_x, click_y = ctx.image_to_click(el["cx"], el["cy"])
```

| Source | Method | Returns |
|---|---|---|
| Saved component | Template matching (`match_all_components`) | Click-space (already converted) |
| Text element | OCR via `detect_all()` | **Image pixels** → use `ctx.image_to_click()` |
| UI component | GPA via `detect_all()` | **Image pixels** → use `ctx.image_to_click()` |
| **image tool** | **NEVER for coordinates** | **Understanding only** |

## Not Found?

Component not matching (conf < 0.8) = not on screen in its saved form.
**Don't lower threshold.** Run `learn_from_screenshot()` on current page to discover what IS on screen.

## Input Methods (gui_action.py)

All GUI operations go through `gui_action.py`. Add `--remote URL` for remote targets.

```bash
gui_action.py click X Y                   # Left click
gui_action.py right_click X Y             # Right click
gui_action.py type "text"                 # Type text (handles special chars)
gui_action.py key enter                   # Single key (enter/tab/escape...)
gui_action.py shortcut ctrl+s             # Key combination
gui_action.py screenshot /tmp/s.png       # Screenshot
gui_action.py focus "window title"        # Focus window
gui_action.py close "window title"        # Close window
gui_action.py list_windows                # List all windows
# Remote: add --remote http://IP:PORT
```

---

## REPORT — Track Task Performance

Call gui-report at the START and END of every gui-agent workflow (not per-click, per-task).

```bash
TRACKER="python3 ~/.openclaw/workspace/skills/gui-agent/skills/gui-report/scripts/tracker.py"

# At task start (get context from session_status):
$TRACKER start --task "OSWorld Task 25: United Airlines baggage calculator" --context 94000

# During task: image_calls need manual tick (clicks/screenshots auto-tick)
$TRACKER tick image_calls

# At task end (get context again from session_status):
$TRACKER report --context 120000
```

See `gui-report/SKILL.md` for details.

## ⛔ ABSOLUTE RULES — Coordinate Sources

```
✅ ALLOWED coordinate sources:
   1. GPA-GUI-Detector (detect_icons) → bounding box center
   2. OCR (detect_text) → text bounding box center
   3. Template matching → saved component position

❌ FORBIDDEN:
   - LLM/vision model guessing coordinates
   - Hardcoded pixel positions from memory or documentation
   - Coordinates from image tool analysis (image tool = understanding ONLY)
```

Every click: screenshot → detect → get coordinates from detection → click. No exceptions.

## Key Principles

1. **Vision-driven** — screenshot → detect → match → click
2. **Coordinates from detection only** — image tool is for understanding, NOT coordinates
3. **Not found = not on screen** — re-learn, don't guess
4. **State graph drives navigation** — each click records a transition
5. **First time: screenshot + image. Repeat: detection only** — saves tokens
6. **Paste > Type** for CJK text
7. **Integer logical coordinates** — use detect_to_click() for Retina
8. **ALWAYS save to memory** — every GUI operation saves to memory/apps/

Related Skills

---

3891
from openclaw/skills

name: article-factory-wechat

Content & Documentation

humanizer

3891
from openclaw/skills

Remove signs of AI-generated writing from text. Use when editing or reviewing text to make it sound more natural and human-written. Based on Wikipedia's comprehensive "Signs of AI writing" guide. Detects and fixes patterns including: inflated symbolism, promotional language, superficial -ing analyses, vague attributions, em dash overuse, rule of three, AI vocabulary words, negative parallelisms, and excessive conjunctive phrases.

Content & Documentation

find-skills

3891
from openclaw/skills

Helps users discover and install agent skills when they ask questions like "how do I do X", "find a skill for X", "is there a skill that can...", or express interest in extending capabilities. This skill should be used when the user is looking for functionality that might exist as an installable skill.

General Utilities

tavily-search

3891
from openclaw/skills

Use Tavily API for real-time web search and content extraction. Use when: user needs real-time web search results, research, or current information from the web. Requires Tavily API key.

Data & Research

baidu-search

3891
from openclaw/skills

Search the web using Baidu AI Search Engine (BDSE). Use for live information, documentation, or research topics.

Data & Research

agent-autonomy-kit

3891
from openclaw/skills

Stop waiting for prompts. Keep working.

Workflow & Productivity

Meeting Prep

3891
from openclaw/skills

Never walk into a meeting unprepared again. Your agent researches all attendees before calendar events—pulling LinkedIn profiles, recent company news, mutual connections, and conversation starters. Generates a briefing doc with talking points, icebreakers, and context so you show up informed and confident. Triggered automatically before meetings or on-demand. Configure research depth, advance timing, and output format. Walking into meetings blind is amateur hour—missed connections, generic small talk, zero leverage. Use when setting up meeting intelligence, researching specific attendees, generating pre-meeting briefs, or automating your prep workflow.

Workflow & Productivity

self-improvement

3891
from openclaw/skills

Captures learnings, errors, and corrections to enable continuous improvement. Use when: (1) A command or operation fails unexpectedly, (2) User corrects Claude ('No, that's wrong...', 'Actually...'), (3) User requests a capability that doesn't exist, (4) An external API or tool fails, (5) Claude realizes its knowledge is outdated or incorrect, (6) A better approach is discovered for a recurring task. Also review learnings before major tasks.

Agent Intelligence & Learning

botlearn-healthcheck

3891
from openclaw/skills

botlearn-healthcheck — BotLearn autonomous health inspector for OpenClaw instances across 5 domains (hardware, config, security, skills, autonomy); triggers on system check, health report, diagnostics, or scheduled heartbeat inspection.

DevOps & Infrastructure

linkedin-cli

3891
from openclaw/skills

A bird-like LinkedIn CLI for searching profiles, checking messages, and summarizing your feed using session cookies.

Content & Documentation

notebooklm

3891
from openclaw/skills

Google NotebookLM 非官方 Python API 的 OpenClaw Skill。支持内容生成(播客、视频、幻灯片、测验、思维导图等)、文档管理和研究自动化。当用户需要使用 NotebookLM 生成音频概述、视频、学习材料或管理知识库时触发。

Data & Research

小红书长图文发布 Skill

3891
from openclaw/skills

## 概述

Content & Documentation