Vision Sandbox
Agentic Vision via Gemini's native Code Execution sandbox. Use for spatial grounding, visual math, and UI auditing.
Best use case
Vision Sandbox is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using Vision Sandbox should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/vision-sandbox/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How Vision Sandbox Compares
| Feature / Agent | Vision Sandbox | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
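It lets Gemini write and run Python code in Google's hosted Code Execution sandbox to analyze images. Typical uses are spatial grounding (locating elements and returning coordinates), visual math (counting or calculating from what is visible), and UI auditing (checking layout problems such as overlapping elements).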
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Vision Sandbox 🔭

Leverage Gemini's native code execution to analyze images with high precision. The model writes and runs Python code in a Google-hosted sandbox to verify visual data, perfect for UI auditing, spatial grounding, and visual reasoning.

## Installation

```bash
clawhub install vision-sandbox
```

## Usage

```bash
uv run vision-sandbox --image "path/to/image.png" --prompt "Identify all buttons and provide [x, y] coordinates."
```

## Pattern Library

### 📍 Spatial Grounding

Ask the model to find specific items and return coordinates.

* **Prompt:** "Locate the 'Submit' button in this screenshot. Use code execution to verify its center point and return the [x, y] coordinates on a [0, 1000] scale."

### 🧮 Visual Math

Ask the model to count or calculate based on the image.

* **Prompt:** "Count the number of items in the list. Use Python to sum their values if prices are visible."

### 🖥️ UI Audit

Check layout and readability.

* **Prompt:** "Check if the header text overlaps with any icons. Use the sandbox to calculate the bounding box intersections."

### 🖐️ Counting & Logic

Solve visual counting tasks with code verification.

* **Prompt:** "Count the number of fingers on this hand. Use code execution to identify the bounding box for each finger and return the total count."

## Integration with OpenCode

This skill is designed to provide **Visual Grounding** for automated coding agents like OpenCode.

- **Step 1:** Use `vision-sandbox` to extract UI metadata (coordinates, sizes, colors).
- **Step 2:** Pass the JSON output to OpenCode to generate or fix CSS/HTML.

## Configuration

- **GEMINI_API_KEY**: Required environment variable.
- **Model**: Defaults to `gemini-3-flash-preview`.
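The Spatial Grounding prompt asks for coordinates on a [0, 1000] scale, so they must be mapped back to pixels before another tool can click or crop at that point. A minimal Python sketch, assuming the model returns points in the [x, y] order the prompt requests (Gemini's object-detection examples typically list y first, so check the order you actually receive); `to_pixels` is a hypothetical helper, not part of the skill:

```python
def to_pixels(point, width, height, scale=1000):
    """Map an [x, y] point on a [0, scale] grid to integer pixel coordinates."""
    x, y = point
    if not (0 <= x <= scale and 0 <= y <= scale):
        raise ValueError(f"point {point} lies outside the [0, {scale}] range")
    # Scale each axis independently, round to the nearest pixel, and clamp so
    # that a value of exactly `scale` still lands on the last valid pixel.
    px = min(round(x / scale * width), width - 1)
    py = min(round(y / scale * height), height - 1)
    return px, py


# Usage: a normalized point returned for a 1920x1080 screenshot.
print(to_pixels([500, 250], 1920, 1080))  # → (960, 270)
```

Clamping matters because a normalized value of 1000 would otherwise map one pixel past the right or bottom edge.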
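For the Visual Math pattern, the code the model runs in the sandbox is ordinary Python. The sketch below illustrates the kind of summation the prompt asks for, assuming the list items have already been transcribed to strings and that prices use a `$` prefix; the item strings and `total_price` are illustrative only. `Decimal` is used so currency values do not pick up binary floating-point rounding errors.

```python
import re
from decimal import Decimal

# Matches "$3.50", "$ 12", or "$1,200.00" and captures the numeric part.
PRICE = re.compile(r"\$\s*(\d+(?:,\d{3})*(?:\.\d{2})?)")


def total_price(lines):
    """Return (item count, sum of every $-prefixed price found in the items)."""
    total = Decimal("0")
    for line in lines:
        for match in PRICE.finditer(line):
            # Drop thousands separators before converting to Decimal.
            total += Decimal(match.group(1).replace(",", ""))
    return len(lines), total


items = ["Coffee $3.50", "Bagel $2.25", "Catering tray $1,200.00"]
count, total = total_price(items)
print(count, total)  # → 3 1205.75
```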
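The UI Audit prompt reduces to an axis-aligned rectangle intersection test. A minimal sketch, assuming boxes are already in pixel coordinates as (left, top, right, bottom); the `Box` type, element names, and coordinates are hypothetical sample data:

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    """Axis-aligned box in pixel coordinates."""
    left: float
    top: float
    right: float
    bottom: float


def intersection(a: Box, b: Box) -> Optional[Box]:
    """Return the overlapping region of two boxes, or None if they do not overlap."""
    left, top = max(a.left, b.left), max(a.top, b.top)
    right, bottom = min(a.right, b.right), min(a.bottom, b.bottom)
    if left >= right or top >= bottom:
        return None  # disjoint, or only touching along an edge
    return Box(left, top, right, bottom)


# Hypothetical boxes, as the model might extract them from a header screenshot.
header_text = Box(20, 10, 300, 40)
icons = {"search": Box(280, 12, 310, 38), "menu": Box(900, 12, 930, 38)}

for name, icon in icons.items():
    overlap = intersection(header_text, icon)
    if overlap is not None:
        area = (overlap.right - overlap.left) * (overlap.bottom - overlap.top)
        print(f"header text overlaps '{name}' icon ({area} sq px)")
```

Treating edge contact as "no overlap" is a design choice; use `>` instead of `>=` if touching elements should also be flagged.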
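Counting by bounding boxes, as in the Counting & Logic pattern, can double-count when the model returns two boxes for the same object. One common mitigation, swapped in here as an illustration rather than something the skill documents, is to drop boxes whose intersection-over-union (IoU) with an already-kept box exceeds a threshold. This simplified greedy version keeps boxes in input order; standard non-maximum suppression would sort by confidence first. The finger boxes are made-up sample data.

```python
def iou(a, b):
    """Intersection-over-union of two (left, top, right, bottom) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_distinct(boxes, threshold=0.5):
    """Greedily keep boxes that do not overlap an already-kept box above `threshold` IoU."""
    kept = []
    for box in boxes:
        if all(iou(box, k) < threshold for k in kept):
            kept.append(box)
    return len(kept)


# The first two boxes cover the same finger, so only one of them is counted.
fingers = [(10, 0, 30, 80), (12, 2, 31, 79), (40, 5, 60, 90),
           (70, 10, 88, 85), (95, 20, 112, 80), (120, 40, 150, 90)]
print(count_distinct(fingers))  # → 5
```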
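The Integration with OpenCode steps pass JSON from `vision-sandbox` to the coding agent, but the skill does not document that JSON's schema. The sketch below assumes a hypothetical `elements` list carrying a CSS selector, pixel position, size, and optional color, and turns it into draft CSS that a coding agent could then refine:

```python
import json

# Hypothetical shape of the extracted UI metadata; the real schema is not documented.
raw = """{
  "elements": [
    {"selector": "#submit", "x": 860, "y": 540, "width": 120, "height": 40, "color": "#1a73e8"},
    {"selector": ".logo", "x": 24, "y": 16, "width": 96, "height": 32}
  ]
}"""


def to_css(metadata):
    """Turn extracted UI metadata into absolute-positioning CSS rules."""
    rules = []
    for el in metadata["elements"]:
        decls = [
            "position: absolute;",
            f"left: {el['x']}px;",
            f"top: {el['y']}px;",
            f"width: {el['width']}px;",
            f"height: {el['height']}px;",
        ]
        if "color" in el:  # color is optional in this hypothetical schema
            decls.append(f"background-color: {el['color']};")
        body = "\n".join(f"  {d}" for d in decls)
        rules.append(f"{el['selector']} {{\n{body}\n}}")
    return "\n\n".join(rules)


print(to_css(json.loads(raw)))
```

Absolute positioning copied from a screenshot is only a rough starting point; the coding agent would normally translate it into the page's existing layout system.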
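Because `GEMINI_API_KEY` is required, a wrapper script can fail fast when it is missing instead of erroring partway through a run. A minimal pre-flight check, assuming only the variable name and default model listed in the Configuration section; `load_config` is a hypothetical helper, not part of the skill:

```python
import os
from typing import Mapping, Optional, Tuple

DEFAULT_MODEL = "gemini-3-flash-preview"  # default listed under Configuration


def load_config(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return (api_key, model), failing fast if GEMINI_API_KEY is missing or empty."""
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY", "")
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is not set; export it before running vision-sandbox")
    return api_key, DEFAULT_MODEL


if __name__ == "__main__":
    try:
        _, model = load_config()
        print(f"Ready to call {model}")
    except RuntimeError as err:
        print(err)
```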