gemini-computer-use

Build and run Gemini 2.5 Computer Use browser-control agents with Playwright. Use when a user wants to automate web browser tasks via the Gemini Computer Use model, needs an agent loop (screenshot → function_call → action → function_response), or asks to integrate safety confirmation for risky UI actions.

3,891 stars

Best use case

gemini-computer-use is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using gemini-computer-use should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/gemini-computer-use/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/am-will/gemini-computer-use/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/gemini-computer-use/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How gemini-computer-use Compares

| Feature / Agent | gemini-computer-use | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

SKILL.md Source

# Gemini Computer Use

## Quick start

1. Copy the env template, set your API key in it, and source it:

   ```bash
   cp env.example env.sh
   $EDITOR env.sh
   source env.sh
   ```

2. Create a virtual environment and install dependencies:

   ```bash
   python -m venv .venv
   source .venv/bin/activate
   pip install google-genai playwright
   playwright install chromium
   ```

3. Run the agent script with a prompt:

   ```bash
   python scripts/computer_use_agent.py \
     --prompt "Find the latest blog post title on example.com" \
     --start-url "https://example.com" \
     --turn-limit 6
   ```

## Browser selection

- Default: Playwright's bundled Chromium (no env vars required).
- Choose a channel (Chrome/Edge) with `COMPUTER_USE_BROWSER_CHANNEL`.
- Use a custom Chromium-based executable (e.g., Brave) with `COMPUTER_USE_BROWSER_EXECUTABLE`.

If both are set, `COMPUTER_USE_BROWSER_EXECUTABLE` takes precedence.
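
The precedence rule above can be sketched as a small helper. This is a minimal illustration, not the code in `scripts/computer_use_agent.py`: the function name `browser_launch_kwargs` is hypothetical, and it assumes the script passes these values to Playwright's `chromium.launch()`, which accepts `channel` and `executable_path` keyword arguments.

```python
import os
from typing import Mapping, Optional


def browser_launch_kwargs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build keyword arguments for Playwright's chromium.launch().

    Hypothetical helper: the real script may structure this differently.
    """
    env = os.environ if env is None else env
    executable = env.get("COMPUTER_USE_BROWSER_EXECUTABLE", "").strip()
    channel = env.get("COMPUTER_USE_BROWSER_CHANNEL", "").strip()

    if executable:
        # A custom executable takes precedence when both variables are set.
        return {"executable_path": executable}
    if channel:
        # For example "chrome" or "msedge".
        return {"channel": channel}
    # Neither variable set: fall back to Playwright's bundled Chromium.
    return {}
```

With Playwright installed, the result could be passed along as `playwright.chromium.launch(**browser_launch_kwargs())`.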

## Core workflow (agent loop)

1. Capture a screenshot and send the user goal + screenshot to the model.
2. Parse `function_call` actions in the response.
3. Execute each action in Playwright.
4. If a `safety_decision` is `require_confirmation`, prompt the user before executing.
5. Send `function_response` objects containing the latest URL + screenshot.
6. Repeat until the model returns only text (no actions) or you hit the turn limit.
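
The loop above can be sketched in plain Python with the model call and the browser injected as callables. Everything here is an assumption made for illustration: `Action`, `Observation`, `run_agent_loop`, and the callable signatures are hypothetical stand-ins, not the google-genai types or the code in `scripts/computer_use_agent.py`.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Action:
    """Hypothetical stand-in for one function_call returned by the model."""
    name: str
    args: dict = field(default_factory=dict)
    safety_decision: Optional[str] = None  # e.g. "require_confirmation"


@dataclass
class Observation:
    """Hypothetical stand-in for the payload of one function_response."""
    action_name: str
    url: str
    screenshot: bytes
    skipped: bool = False


def run_agent_loop(
    goal: str,
    ask_model: Callable[[str, bytes, list], tuple],  # -> (text, actions)
    execute: Callable[[Action], None],               # runs one action in the browser
    observe: Callable[[], tuple],                    # -> (url, screenshot_bytes)
    confirm: Callable[[Action], bool],               # asks the user to approve
    turn_limit: int = 6,
) -> Optional[str]:
    """Run until the model replies with text only or the turn limit is hit."""
    url, screenshot = observe()
    responses: list = []
    for _ in range(turn_limit):
        text, actions = ask_model(goal, screenshot, responses)
        if not actions:
            return text  # text-only reply: the model considers the task done
        responses = []
        for action in actions:
            needs_ok = action.safety_decision == "require_confirmation"
            skipped = needs_ok and not confirm(action)
            if not skipped:
                execute(action)
            # Report the latest URL and screenshot back for every action.
            url, screenshot = observe()
            responses.append(Observation(action.name, url, screenshot, skipped))
    return None  # turn limit reached without a final text answer
```

Injecting the callables keeps the control flow testable without a browser or an API key; a real agent would back `ask_model` with the Gemini client and `execute`/`observe` with a Playwright page.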

## Operational guidance

- Run in a sandboxed browser profile or container.
- Use `--exclude` to block risky actions you do not want the model to take.
- Keep the viewport at 1440x900 unless you have a reason to change it.
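
One reason the viewport size matters is coordinate mapping. The sketch below assumes the model reports click positions on a normalized 0-999 grid that must be scaled to pixels, as Google's Computer Use documentation describes; that assumption is not stated in this skill, and `denormalize` and the constants are hypothetical names.

```python
from typing import Tuple

VIEWPORT_WIDTH = 1440   # viewport recommended above
VIEWPORT_HEIGHT = 900
GRID = 1000             # assumed size of the model's normalized coordinate grid


def denormalize(
    x: int,
    y: int,
    width: int = VIEWPORT_WIDTH,
    height: int = VIEWPORT_HEIGHT,
) -> Tuple[int, int]:
    """Scale normalized model coordinates to pixel coordinates."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return x * width // GRID, y * height // GRID


print(denormalize(500, 500))  # centre of the grid -> (720, 450)
```

If the viewport changes, the same width and height must be used both for the screenshot and for this scaling, otherwise clicks land in the wrong place.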

## Resources

- Script: `scripts/computer_use_agent.py`
- Reference notes: `references/google-computer-use.md`
- Env template: `env.example`

Related Skills

enable-chrome-gemini

3891
from openclaw/skills

Set up or repair Gemini in Chrome (Glic) on Windows, macOS, or Linux when enabling it for the first time outside the US or when the sidebar, floating panel, Alt+G shortcut, or top-bar entry disappears. Back up and patch Chrome Local State, restore region/eligibility fields, and check the required Glic flags and Chrome language.

PDF OCR using Gemini LLM

3891
from openclaw/skills

Extract text from PDFs using Google Gemini OCR. Use when extracting text from PDFs, performing OCR on scanned documents, or processing image-based PDFs.

gemini-deep-research

3891
from openclaw/skills

Perform complex, long-running research tasks using Gemini Deep Research Agent. Use when asked to research topics requiring multi-source synthesis, competitive analysis, market research, or comprehensive technical investigations that benefit from systematic web search and analysis.

gemini-stt

3891
from openclaw/skills

Transcribe audio files using Google's Gemini API or Vertex AI

senior-computer-vision

3891
from openclaw/skills

Computer vision engineering skill for object detection, image segmentation, and visual AI systems. Covers CNN and Vision Transformer architectures, YOLO/Faster R-CNN/DETR detection, Mask R-CNN/SAM segmentation, and production deployment with ONNX/TensorRT. Includes PyTorch, torchvision, Ultralytics, Detectron2, and MMDetection frameworks. Use when building detection pipelines, training custom models, optimizing inference, or deploying vision systems.

gemini-voice-assistant

3891
from openclaw/skills

Voice-to-voice AI assistant using Gemini Live API. Speak to the AI and get spoken responses. Use when you want to have natural voice conversations with an AI assistant powered by Google's Gemini models.

gemini-assistant

3891
from openclaw/skills

General-purpose AI assistant using Gemini API with voice and text support. Use when you need a smart AI assistant that can answer questions, have conversations, or help with general tasks using Google's Gemini models with audio/text capabilities.

gemini-video-analyzer

3891
from openclaw/skills

Native video analysis using Google Gemini API. Upload and analyze video files — describe scenes, extract text/UI, answer questions about content, transcribe speech, identify objects and actions. Use when: (1) User sends a video file and wants it analyzed, (2) Video summarization or description needed, (3) Extracting text, UI elements, or information from screen recordings, (4) Answering questions about video content, (5) Comparing multiple videos, (6) Analyzing tutorials, demos, or walkthroughs.

gemini-nano-banana

3891
from openclaw/skills

Auto-generated skill for gemini tools via OneKey Gateway.

lybic cloud-computer skill

3891
from openclaw/skills

Lybic Sandbox is a cloud sandbox built for agents and automation workflows. Think of it as a disposable cloud computer you can spin up on demand. Agents can perform GUI actions like seeing the screen, clicking, typing, and handling pop ups, which makes it a great fit for legacy apps and complex flows where APIs are missing or incomplete. It is designed for control and observability. You can monitor execution in real time, stop it when needed, and use logs and replay to debug, reproduce runs, and evaluate reliability. For long running tasks, iterative experimentation, or sensitive environments, sandboxed execution helps reduce risk and operational overhead.

gemini Models for vwu.ai

3891
from openclaw/skills

Skill for calling gemini models on the vwu.ai platform.
