Vision Sandbox

Agentic Vision via Gemini's native Code Execution sandbox. Use for spatial grounding, visual math, and UI auditing.

by Demerzels-lab

View on GitHub

Best use case

Vision Sandbox is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using Vision Sandbox can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/vision-sandbox/SKILL.md --create-dirs "https://raw.githubusercontent.com/Demerzels-lab/elsamultiskillagent/main/public/skills/johanesalxd/vision-sandbox/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/vision-sandbox/SKILL.md inside your project.
3. Restart your AI agent. It will auto-discover the skill.

How Vision Sandbox Compares

| Feature / Agent | Vision Sandbox | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Agentic Vision via Gemini's native Code Execution sandbox. Use for spatial grounding, visual math, and UI auditing.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Vision Sandbox 🔭

Leverage Gemini's native code execution to analyze images with high precision. The model writes and runs Python code in a Google-hosted sandbox to verify visual data, which makes it well suited to UI auditing, spatial grounding, and visual reasoning.

## Installation

```bash
clawhub install vision-sandbox
```

## Usage

```bash
uv run vision-sandbox --image "path/to/image.png" --prompt "Identify all buttons and provide [x, y] coordinates."
```
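
For scripted use, the command above can be assembled and run from Python. This is a minimal sketch, assuming `uv` is on the PATH and that the CLI takes only the `--image` and `--prompt` flags shown above. `build_command` and `run_vision_sandbox` are hypothetical helper names and are not part of the skill.

```python
import subprocess


def build_command(image: str, prompt: str) -> list[str]:
    """Build the argv list for the CLI call shown in the Usage section."""
    return ["uv", "run", "vision-sandbox", "--image", image, "--prompt", prompt]


def run_vision_sandbox(image: str, prompt: str) -> str:
    """Run the CLI and return its stdout.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    result = subprocess.run(
        build_command(image, prompt),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains quotes or brackets.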

## Pattern Library

### 📍 Spatial Grounding
Ask the model to find specific items and return coordinates.
* **Prompt:** "Locate the 'Submit' button in this screenshot. Use code execution to verify its center point and return the [x, y] coordinates in a [0, 1000] scale."
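
Coordinates on a [0, 1000] scale have to be mapped back to pixels before they can be used against the original screenshot. A minimal sketch of that conversion, assuming the model returns the point in the `[x, y]` order the prompt asks for; `to_pixels` is a hypothetical helper, not part of the skill.

```python
def to_pixels(point, width, height, scale=1000):
    """Convert an [x, y] point on a [0, scale] grid to pixel coordinates."""
    x, y = point
    if not (0 <= x <= scale and 0 <= y <= scale):
        raise ValueError(f"point {point!r} is outside the [0, {scale}] range")
    # Scale each axis independently, since the image is rarely square.
    return round(x / scale * width), round(y / scale * height)


# A point at the horizontal center and a quarter of the way down a 1920x1080 screenshot.
print(to_pixels([500, 250], width=1920, height=1080))  # (960, 270)
```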

### 🧮 Visual Math
Ask the model to count or calculate based on the image.
* **Prompt:** "Count the number of items in the list. Use Python to sum their values if prices are visible."

### 🖥️ UI Audit
Check layout and readability.
* **Prompt:** "Check if the header text overlaps with any icons. Use the sandbox to calculate the bounding box intersections."
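
The overlap check this prompt asks the sandbox to perform comes down to intersecting axis-aligned rectangles. The sketch below shows one way that calculation might look; it is illustrative only and not the code the model actually generates. The `Box` type and its field order are assumptions.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box; the field order is an assumption."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def intersection_area(a: Box, b: Box) -> float:
    """Return the overlapping area of two boxes, or 0 if they do not overlap."""
    width = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    height = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    return max(width, 0) * max(height, 0)


header = Box(0, 0, 100, 20)
icon = Box(90, 10, 110, 30)
print(intersection_area(header, icon))  # 100
```

Boxes that only touch along an edge yield an area of 0, so `intersection_area(a, b) > 0` works as an overlap test.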

### 🖐️ Counting & Logic
Solve visual counting tasks with code verification.
* **Prompt:** "Count the number of fingers on this hand. Use code execution to identify the bounding box for each finger and return the total count."

## Integration with OpenCode
This skill is designed to provide **Visual Grounding** for automated coding agents like OpenCode.
- **Step 1:** Use `vision-sandbox` to extract UI metadata (coordinates, sizes, colors).
- **Step 2:** Pass the JSON output to OpenCode to generate or fix CSS/HTML.
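
The two steps above can be sketched as a small hand-off script. The skill's JSON schema is not documented here, so the `elements`, `name`, `box`, and `color` fields below are hypothetical, as is the `summarize_elements` helper; adapt them to whatever the tool actually returns.

```python
import json


def summarize_elements(raw_json: str) -> list[str]:
    """Turn UI metadata into one line per element for a coding agent's prompt.

    Assumes a hypothetical schema:
    {"elements": [{"name": str, "box": [x_min, y_min, x_max, y_max], "color": str}]}
    """
    data = json.loads(raw_json)
    lines = []
    for element in data["elements"]:
        x_min, y_min, x_max, y_max = element["box"]
        lines.append(
            f'{element["name"]}: {x_max - x_min}x{y_max - y_min} '
            f'at ({x_min}, {y_min}), color {element["color"]}'
        )
    return lines


sample = '{"elements": [{"name": "submit", "box": [100, 200, 220, 240], "color": "#0055ff"}]}'
print(summarize_elements(sample))  # ['submit: 120x40 at (100, 200), color #0055ff']
```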

## Configuration
- **GEMINI_API_KEY**: Required environment variable.
- **Model**: Defaults to `gemini-3-flash-preview`.
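
A wrapper script may want to fail early when the key is missing rather than partway through a run. A minimal sketch, assuming only the two settings listed above; `load_config` is a hypothetical helper, and how the skill itself reads its configuration is not shown in this document.

```python
import os

DEFAULT_MODEL = "gemini-3-flash-preview"  # default named in the Configuration section


def load_config(env=None) -> dict:
    """Return the API key and model name, raising if the key is absent."""
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY")
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is not set")
    return {"api_key": api_key, "model": DEFAULT_MODEL}
```

Accepting an optional mapping instead of always reading `os.environ` keeps the helper easy to test.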

Related Skills

Vision Analyze (Google)

from Demerzels-lab/elsamultiskillagent

Analyze images using **Google Cloud Vision API**.

docker-sandbox

from Demerzels-lab/elsamultiskillagent

Create and manage Docker sandboxed VM environments for safe agent execution. Use when running untrusted code, exploring packages, or isolating agent workloads. Supports Claude, Codex, Copilot, Gemini, and Kiro agents with network proxy controls.

anthrovision-telegram-body-scan

from Demerzels-lab/elsamultiskillagent

Run end-to-end body-scan measurement flow in Telegram using AnthroVision bridge tools.

sandboxer

from Demerzels-lab/elsamultiskillagent

Manage Claude Code terminal sessions via Sandboxer web dashboard. Use when: (1) listing running Claude Code sessions, (2) checking what a Claude session is doing, (3) sending commands to a Claude session, (4) creating or killing sessions, (5) user mentions 'sandboxer' or 'session'.

sandboxer-tmux

from Demerzels-lab/elsamultiskillagent

Dispatch coding tasks to tmux sessions via Sandboxer.

desktop-sandbox

from Demerzels-lab/elsamultiskillagent

A desktop sandbox lets OpenClaw run as natively as on a real OS, ensuring full functionality with safe.

senior-computer-vision

from Demerzels-lab/elsamultiskillagent

Computer vision engineering skill for object detection, image segmentation, and visual AI systems. Covers CNN and Vision Transformer architectures, YOLO/Faster R-CNN/DETR detection, Mask R-CNN/SAM segmentation, and production deployment with ONNX/TensorRT. Includes PyTorch, torchvision, Ultralytics, Detectron2, and MMDetection frameworks. Use when building detection pipelines, training custom models, optimizing inference, or deploying vision systems.

lybic-sandbox

from Demerzels-lab/elsamultiskillagent

Lybic Sandbox is a cloud sandbox built for agents and automation workflows.

menuvision

from Demerzels-lab/elsamultiskillagent

Build beautiful HTML photo menus from restaurant URLs, PDFs, or photos using Gemini Vision and AI image generation.

paylock

from Demerzels-lab/elsamultiskillagent

Non-custodial SOL escrow for AI agent deals.

agent-reputation

from Demerzels-lab/elsamultiskillagent

Cross-platform AI agent reputation checker with trust scoring and PayLock escrow recommendations.

Telecom Agent Skill

from Demerzels-lab/elsamultiskillagent

Turn your AI Agent into a Telecom Operator. Bulk calling, ChatOps, and Field Monitoring.