vision-llm skill
Patterns and reference for using vision LLMs to convert screenshots to code within the screenshot-to-code system.
Best use case
vision-llm skill is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using the vision-llm skill should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/vision-llm/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How vision-llm skill Compares
| Feature / Agent | vision-llm skill | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Patterns and reference for using vision LLMs to convert screenshots to code within the screenshot-to-code system.
SKILL.md Source
# vision-llm skill
Patterns and reference for using vision LLMs to convert screenshots to code within the screenshot-to-code system.
## Supported models
| Model | Provider | Notes |
|---|---|---|
| `gpt-4o` | OpenAI | Recommended. Best quality, fast. |
| `gpt-4-turbo` | OpenAI | Slower and more expensive than `gpt-4o`. |
| `claude-3-5-sonnet-20241022` | Anthropic | Excellent for complex UIs. |
| `claude-3-opus-20240229` | Anthropic | Highest quality, slowest. |
| Any vision-capable model | Custom | Must accept OpenAI-compatible `/chat/completions`. |
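The system uses the OpenAI `/chat/completions` endpoint format. Anthropic models are accessed through an OpenAI-compatible proxy, or directly through `@anthropic-ai/sdk` after translating the messages: the Anthropic Messages API takes the system prompt as a top-level `system` parameter and images as `image` content blocks with a base64 `source`, not as `image_url` parts.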
## System prompt
The default system prompt is stored in settings and editable by the user. The production default:
```
You are an expert frontend developer. Convert the provided UI screenshot into clean, well-structured code. Output only the code block, no explanations or markdown fences around the block (the block itself is fine). Produce idiomatic code for the target framework.
```
For each framework the system prompt is augmented with:
- **html-css**: "Output a single self-contained HTML file with a <style> block. Use CSS custom properties for colors and spacing."
- **react-tailwind**: "Output a single .tsx file. Use functional components, TypeScript prop types, and Tailwind CSS classes. Do not import external component libraries."
- **vue**: "Output a single Vue 3 SFC (.vue) with <script setup lang='ts'>, <template>, and <style scoped>."
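The augmentation step can be sketched as a lookup plus concatenation. This is a minimal illustration, not the system's actual code: `buildSystemPrompt` and `FRAMEWORK_ADDITIONS` are hypothetical names, and joining the two parts with a single space is an assumption, since the separator is not documented here.

```typescript
// Framework identifiers as used in this document.
type Framework = "html-css" | "react-tailwind" | "vue";

// Production default prompt, copied from the text above.
const BASE_SYSTEM_PROMPT =
  "You are an expert frontend developer. Convert the provided UI screenshot into clean, well-structured code. " +
  "Output only the code block, no explanations or markdown fences around the block (the block itself is fine). " +
  "Produce idiomatic code for the target framework.";

// Per-framework additions, copied from the list above.
const FRAMEWORK_ADDITIONS: Record<Framework, string> = {
  "html-css":
    "Output a single self-contained HTML file with a <style> block. Use CSS custom properties for colors and spacing.",
  "react-tailwind":
    "Output a single .tsx file. Use functional components, TypeScript prop types, and Tailwind CSS classes. Do not import external component libraries.",
  vue: "Output a single Vue 3 SFC (.vue) with <script setup lang='ts'>, <template>, and <style scoped>.",
};

// Hypothetical helper: appends the framework-specific text to the base prompt.
// `base` is a parameter because the default prompt is user-editable in settings.
function buildSystemPrompt(framework: Framework, base: string = BASE_SYSTEM_PROMPT): string {
  return `${base.trim()} ${FRAMEWORK_ADDITIONS[framework]}`;
}
```

Taking the base prompt as a parameter keeps the sketch compatible with a user-edited prompt from settings.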
## Message structure
```typescript
interface VisionMessage {
  role: "user";
  content: [
    {
      type: "text";
      text: string; // framework-specific prompt + optional user instructions
    },
    {
      type: "image_url";
      image_url: {
        url: string; // "data:image/jpeg;base64,<base64>"
        detail: "high"; // always "high" for best code accuracy
      };
    }
  ];
}
```
```
The messages array sent to the LLM is always `[systemMessage, visionMessage]`.
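As a rough sketch, the two-element array could be assembled as follows. `buildMessages` and the local type names are hypothetical, and the system message shape (`role: "system"` with a string `content`) is assumed from the OpenAI chat-completions format rather than taken from the project source.

```typescript
// Local copies of the shapes described above, so the sketch is self-contained.
type TextPart = { type: "text"; text: string };
type ImagePart = { type: "image_url"; image_url: { url: string; detail: "high" } };
type SystemMessage = { role: "system"; content: string }; // assumed shape
type VisionMessage = { role: "user"; content: [TextPart, ImagePart] };

// Hypothetical helper: wraps the base64 JPEG in a data URL and returns
// the [systemMessage, visionMessage] pair.
function buildMessages(
  systemPrompt: string,
  userText: string,
  jpegBase64: string
): [SystemMessage, VisionMessage] {
  return [
    { role: "system", content: systemPrompt },
    {
      role: "user",
      content: [
        { type: "text", text: userText },
        {
          type: "image_url",
          image_url: { url: `data:image/jpeg;base64,${jpegBase64}`, detail: "high" },
        },
      ],
    },
  ];
}
```

The base64 argument would typically come from a preprocessed image buffer, for example `data.toString("base64")` on a Node `Buffer`.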
## Image preprocessing with sharp
Before sending to the LLM, images are preprocessed in `src/lib/image.ts`:
```typescript
import sharp from 'sharp';

export async function preprocessImage(
  buffer: Buffer,
  opts: { maxWidth?: number; maxHeight?: number; quality?: number } = {}
): Promise<{ data: Buffer; mime: 'image/jpeg'; originalSize: number; processedSize: number }> {
  const { maxWidth = 1920, maxHeight = 1080, quality = 85 } = opts;
  const originalSize = buffer.length;

  const data = await sharp(buffer)
    .resize(maxWidth, maxHeight, { fit: 'inside', withoutEnlargement: true })
    .jpeg({ quality })
    .toBuffer();

  return { data, mime: 'image/jpeg', originalSize, processedSize: data.length };
}
```
Always set `detail: "high"` in the image_url object - lower detail significantly degrades code quality.
## Iteration prompt
When the user iterates, the existing generated code is included in the message:
```typescript
const iterationUserText = `
Here is the current version of the code (v${iteration.version}):
\`\`\`${frameworkToLang(framework)}
${existingCode}
\`\`\`
The user wants the following change:
${userPrompt}
Output the complete updated code. Do not add explanations.
`.trim();
```
The same image is re-attached so the LLM can reference the original design.
## Code extraction
The LLM output is expected to contain a single fenced code block. Extraction:
```typescript
export function extractCodeBlock(text: string): string | null {
  // Match ```lang\n...code...\n``` or just ```\n...code...\n```
  const match = text.match(/```(?:\w+)?\n([\s\S]+?)```/);
  if (match) return match[1].trim();

  // Fallback: if no fence, check if output starts with a tag or import
  const trimmed = text.trim();
  if (trimmed.startsWith('<') || trimmed.startsWith('import ') || trimmed.startsWith('export ')) {
    return trimmed;
  }

  return null;
}
```
If extraction returns `null`, the conversion is marked `error` with reason `no_code_extracted`.
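The mapping from extraction result to conversion status might look like the sketch below. `ConversionResult` and `toConversionResult` are hypothetical names. The extractor is passed in as a parameter so the example stays self-contained; in the system it would presumably be `extractCodeBlock`.

```typescript
// Hypothetical result shape, using the status and reason strings named in this document.
type ConversionResult =
  | { status: "done"; code: string }
  | { status: "error"; reason: "no_code_extracted" };

// Runs the extractor over the raw LLM output and maps a null result to the error state.
function toConversionResult(
  rawOutput: string,
  extract: (text: string) => string | null
): ConversionResult {
  const code = extract(rawOutput);
  return code === null
    ? { status: "error", reason: "no_code_extracted" }
    : { status: "done", code };
}
```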
## Framework language tags
| Framework | Code fence language |
|---|---|
| `html-css` | `html` |
| `react-tailwind` | `tsx` |
| `vue` | `vue` |
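The iteration prompt calls `frameworkToLang`, whose implementation is not shown in this document. A minimal sketch consistent with the table above, assuming it is a plain lookup:

```typescript
type Framework = "html-css" | "react-tailwind" | "vue";

// Fence language per framework, copied from the table above.
const FENCE_LANG: Record<Framework, string> = {
  "html-css": "html",
  "react-tailwind": "tsx",
  vue: "vue",
};

// Assumed implementation: the real helper may differ.
function frameworkToLang(framework: Framework): string {
  return FENCE_LANG[framework];
}
```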
## Token cost estimation
Approximate token usage per conversion (1920x1080 JPEG at 85% quality):
| Component | Tokens (approx) |
|---|---|
| System prompt | 80-120 |
| User text prompt | 40-80 |
| Image (high detail) | 765-1105 on OpenAI models (depends on image dimensions; 1105 for 1920x1080) |
| Generated output (html-css) | 800-2000 |
| Generated output (react-tailwind) | 1000-3000 |
Total per conversion: approximately 2000-5000 tokens. Iterations re-send the same image, so each iteration incurs the image tokens again in addition to the tokens for the existing code.
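The image range in the table is consistent with the tile-based accounting OpenAI documents for high-detail images: the image is scaled to fit within 2048x2048, scaled down so its shortest side is 768px, then billed at 170 tokens per 512px tile plus an 85-token base. The sketch below assumes that rule and assumes small images are not upscaled. `estimateHighDetailImageTokens` is a hypothetical helper, and other providers count image tokens differently.

```typescript
// Estimates image tokens under the assumed OpenAI high-detail rule.
function estimateHighDetailImageTokens(width: number, height: number): number {
  // Step 1: fit within a 2048 x 2048 square, preserving aspect ratio.
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = width * fit;
  let h = height * fit;

  // Step 2: shrink so the shortest side is at most 768px (no upscaling assumed).
  const shrink = Math.min(1, 768 / Math.min(w, h));
  w = Math.round(w * shrink);
  h = Math.round(h * shrink);

  // Step 3: count 512px tiles; each costs 170 tokens, plus an 85-token base.
  const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
  return 85 + 170 * tiles;
}

console.log(estimateHighDetailImageTokens(1920, 1080)); // 1105 (6 tiles)
console.log(estimateHighDetailImageTokens(1024, 1024)); // 765 (4 tiles)
```

Under this rule the cost is driven by pixel dimensions after scaling, not by what the screenshot contains.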
## Streaming
The API uses streaming (`stream: true`) from the LLM to provide real-time status updates. The server buffers the stream internally and writes the final code to the database once complete. The client polls `GET /convert/:id` for status.
For iteration, the same streaming approach applies - status transitions from `generating` to `done` once the full response is buffered and code is extracted.
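The server-side buffering could be sketched as below. `bufferStream` and `StreamChunk` are hypothetical names. The chunk type is a minimal structural subset of an OpenAI-style streamed chunk (`choices[0].delta.content`), assumed here instead of importing the SDK's types. The optional `onProgress` callback stands in for whatever status update the server records while generating.

```typescript
// Minimal structural type for a streamed chat-completions chunk (assumed subset).
type StreamChunk = { choices: Array<{ delta: { content?: string | null } }> };

// Accumulates streamed text deltas into one string.
// Chunks with no choices or no content are skipped.
async function bufferStream(
  stream: AsyncIterable<StreamChunk>,
  onProgress?: (charsSoFar: number) => void
): Promise<string> {
  let text = "";
  for await (const chunk of stream) {
    const piece = chunk.choices[0]?.delta?.content ?? "";
    if (piece.length === 0) continue;
    text += piece;
    if (onProgress) onProgress(text.length);
  }
  return text;
}
```

Once the returned promise resolves, the full text would go through code extraction before the status moves from `generating` to `done`.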
## Adding a custom OpenAI-compatible provider
Set `S2C_BASE_URL` in `.env` to the provider's base URL, e.g.:
```
S2C_BASE_URL=https://openrouter.ai/api/v1
S2C_OPENAI_KEY=sk-or-...
S2C_VISION_MODEL=openai/gpt-4o
```
The system uses the standard OpenAI SDK with `baseURL` override. Any provider that supports the `/chat/completions` endpoint with `image_url` content parts will work.
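Reading these variables into client options might look like the following sketch. `resolveProviderConfig` and `ProviderConfig` are hypothetical names, and the fallback to `gpt-4o` when `S2C_VISION_MODEL` is unset is an assumption based on the recommended model above. The resulting `apiKey` and `baseURL` could then be passed to the OpenAI SDK constructor, which accepts both options.

```typescript
interface ProviderConfig {
  baseURL: string | undefined; // undefined lets the SDK use its default base URL
  apiKey: string;
  model: string;
}

// Hypothetical helper: reads the S2C_* variables from an env-like object.
function resolveProviderConfig(env: Record<string, string | undefined>): ProviderConfig {
  const apiKey = env["S2C_OPENAI_KEY"];
  if (!apiKey) {
    throw new Error("S2C_OPENAI_KEY is not set");
  }
  return {
    baseURL: env["S2C_BASE_URL"] || undefined,
    apiKey,
    model: env["S2C_VISION_MODEL"] ?? "gpt-4o", // assumed default
  };
}
```

In a Node process this would typically be called as `resolveProviderConfig(process.env)` after the `.env` file has been loaded.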
## Common LLM prompt improvements
Add these to the optional instructions field to improve output:
| Goal | Instruction |
|---|---|
| TypeScript types | "Add TypeScript prop types for all components and data." |
| Accessibility | "Include aria-labels on interactive elements and semantic HTML." |
| Mock data | "Include realistic mock data arrays for tables and lists." |
| Hover states | "Add Tailwind hover: and focus: states for all interactive elements." |
| Dark mode | "Support dark mode using Tailwind dark: variants." |
| Responsive | "Make the layout responsive with Tailwind sm:/md:/lg: breakpoints." |
| No Tailwind | "Use plain CSS with BEM class names instead of Tailwind." |
## Troubleshooting model output quality
| Symptom | Cause | Fix |
|---|---|---|
| Generic placeholder code | Image too small or low contrast | Increase upload resolution, use PNG |
| Wrong colors | JPEG compression artifacts | Upload a PNG to avoid a second round of JPEG compression, or raise the preprocessing `quality` to 95 |
| Missing sections | Long UI, model truncated | Add "output the complete full page code, do not truncate" to prompt |
| Tailwind classes not applied | Model used arbitrary values | Add "use standard Tailwind utility classes only, no arbitrary values" |
| Vue options API instead of composition | Default model behavior | Add "use Vue 3 Composition API with <script setup>" to prompt |