kimicode-vision-bridge
Use when the current Agent LLM cannot process images directly and visual analysis is needed — bridges images through KimiCode CLI print mode to a multimodal Kimi model for text description
Best use case
kimicode-vision-bridge is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using kimicode-vision-bridge should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/kimicode-vision-bridge/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How kimicode-vision-bridge Compares
| Feature / Agent | kimicode-vision-bridge | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Use when the current Agent LLM cannot process images directly and visual analysis is needed — bridges images through KimiCode CLI print mode to a multimodal Kimi model for text description
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# KimiCode Vision Bridge
## Overview
Routes images through KimiCode CLI's print mode (`kimi --print`) to give a non-vision Agent LLM visual understanding. Kimi is a multimodal model; print mode provides a non-interactive, programmatic pipe — the Agent submits an image + question via CLI, Kimi returns a text description on stdout.
```
Agent (no vision) KimiCode CLI Kimi multimodal model
│ │ │
│ image + question │ │
├───────────────────────────────→│ │
│ kimi --print -p "..." │ │
│ ├─────────────────────────────→│
│ │ API call with image │
│ │←─────────────────────────────┤
│ │ text response │
│←───────────────────────────────┤ │
│ text on stdout │ │
▼ ▼ ▼
```
Print mode docs: https://www.kimi-cli.com/en/customization/print-mode.html
## When to Use
Use this skill when:
- The current Agent LLM cannot process images directly (no vision capability)
- An image needs to be read or analyzed — screenshots, photos, diagrams, charts, document scans, UI mockups
- KimiCode CLI is installed and configured with a vision-capable model on the system
Do NOT use when:
- The Agent already has native image input — this skill adds an unnecessary hop
- The task is pure text analysis with no visual component
- KimiCode is not installed or lacks a vision model
## Prerequisites (MUST verify before proceeding)
Both checks must pass. If either fails, stop and report the failure.
### Check 1: KimiCode CLI (`kimi`) is installed
**Windows (PowerShell):**
```powershell
Get-Command kimi -ErrorAction SilentlyContinue | Select-Object -ExpandProperty Source
```
**macOS / Linux:**
```bash
# command -v is the POSIX lookup and works even where `which` is not installed
command -v kimi
```
| Result | Action |
|--------|--------|
| Found | Pass |
| Not found | Fail. "KimiCode CLI not found. Install from https://www.kimi-cli.com" |
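
Check 1 can be wrapped in a small function so a script can stop early with the failure message from the table. This is a minimal sketch for macOS / Linux shells; `require_cli` is a hypothetical helper name, not something KimiCode ships.

```bash
# Hypothetical helper: pass if the named command is on PATH, fail otherwise.
require_cli() {
  # $1 = command name to look for (defaults to kimi)
  cmd=${1:-kimi}
  if found=$(command -v "$cmd"); then
    echo "Pass: $cmd found at $found"
  else
    echo "Fail: KimiCode CLI not found. Install from https://www.kimi-cli.com" >&2
    return 1
  fi
}
```

Calling `require_cli || exit 1` at the top of a bridge script enforces the check before any image is sent.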
### Check 2: A multimodal (vision-capable) model is configured
Locate the config file and verify an API key and a vision model are set.
**Config locations (try in order):**
| Windows | macOS | Linux |
|---------|-------|-------|
| `%APPDATA%\kimi\config.json` | `~/Library/Application Support/kimi/config.json` | `~/.config/kimi/config.json` |
| `%USERPROFILE%\.kimi\config.json` | `~/.kimi/config.json` | `~/.kimi/config.json` |
**Verify:**
1. A non-empty API key (fields: `apiKey`, `token`, `api_key`, or under `providers`)
2. A model name (fields: `model`, `defaultModel`) — must be vision-capable
3. If config uses env vars (e.g., `$KIMI_API_KEY`), verify those are set
| Result | Action |
|--------|--------|
| API key + vision model set | Pass |
| API key missing | Fail. "No API key configured. Run `kimi config set` or edit the config file." |
| Model missing or text-only | Fail. "No vision model configured. Set a multimodal model via `kimi config set model <name>` (e.g., moonshot-v1-vision, gpt-4o, claude-3.5-sonnet)." |
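
The lookup order above can be sketched as a pair of shell functions for macOS and Linux. Both `find_kimi_config` and `has_api_key` are hypothetical helper names, and the key check is only a rough text match on the field names listed above: it does not parse JSON, and it treats an environment-variable reference such as `$KIMI_API_KEY` as a non-empty value, so that case still needs a separate check.

```bash
# Hypothetical helper: print the first config file that exists,
# trying the macOS / Linux locations in the order given in the table.
find_kimi_config() {
  home=${1:-$HOME}
  for candidate in \
    "$home/Library/Application Support/kimi/config.json" \
    "$home/.config/kimi/config.json" \
    "$home/.kimi/config.json"
  do
    if [ -f "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Hypothetical helper: succeed if the file has a non-empty string value
# for apiKey, token, or api_key (at any nesting level, e.g. under providers).
has_api_key() {
  grep -Eq '"(apiKey|token|api_key)"[[:space:]]*:[[:space:]]*"[^"]+"' "$1"
}
```

A script might combine them as `config=$(find_kimi_config) && has_api_key "$config"` and report the matching failure message when either step fails.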
## The Pipe: Core Workflow
Once prerequisites pass, the bridge has two steps.
### Step A: Submit image + question to Kimi via print mode
The Agent already has an image (from the user, from a file, from a prior tool call). Combine it with a clear instruction and pipe it through `kimi --print`.
**Approach 1 — File reference (recommended)**
Point Kimi at the image file on disk. KimiCode's file-read tool loads and analyzes it.
```bash
kimi --quiet -p "Describe every visible element in this image in detail: /path/to/image.png"
```
**Approach 2 — Inline base64 via JSONL stdin (fully programmatic)**
Pipe the image as base64 directly — no temp file needed.
```bash
# Strip line wraps with tr; not every base64 implementation accepts GNU's `-w 0`.
BASE64=$(base64 < /path/to/image.png | tr -d '\n')
echo "{\"role\":\"user\",\"content\":[{\"type\":\"image_url\",\"image_url\":{\"url\":\"data:image/png;base64,$BASE64\"}},{\"type\":\"text\",\"text\":\"Describe every visible element in this image in detail.\"}]}" \
| kimi --print --input-format=stream-json --output-format=stream-json
```
Windows PowerShell equivalent:
```powershell
$base64 = [Convert]::ToBase64String([IO.File]::ReadAllBytes("C:\path\to\image.png"))
$msg = '{"role":"user","content":[{"type":"image_url","image_url":{"url":"data:image/png;base64,' + $base64 + '"}},{"type":"text","text":"Describe every visible element in this image in detail."}]}'
$msg | kimi --print --input-format=stream-json --output-format=stream-json
```
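
Hand-written JSON like the message in Approach 2 breaks as soon as the question contains a double quote or the image is not a PNG. The sketch below builds the same message shape with those two cases handled. It assumes the message structure shown above is what print mode expects; `mime_for`, `json_escape`, and `build_image_message` are hypothetical helper names, and `json_escape` does not handle newlines or control characters.

```bash
# Hypothetical helper: map a lowercase file extension to a MIME type.
mime_for() {
  case "$1" in
    *.png) echo "image/png" ;;
    *.jpg|*.jpeg) echo "image/jpeg" ;;
    *.gif) echo "image/gif" ;;
    *.webp) echo "image/webp" ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: escape backslashes and double quotes for a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: emit one JSONL user message.
# $1 = MIME type, $2 = base64 payload, $3 = question text
build_image_message() {
  printf '{"role":"user","content":[{"type":"image_url","image_url":{"url":"data:%s;base64,%s"}},{"type":"text","text":"%s"}]}\n' \
    "$1" "$2" "$(json_escape "$3")"
}
```

The output of `build_image_message "$(mime_for "$IMAGE")" "$BASE64" "$QUESTION"` can then be piped to `kimi --print --input-format=stream-json --output-format=stream-json` as in the example above.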
**Effective prompts for the image question:**
| Scenario | Prompt |
|----------|--------|
| Generic | "Describe every visible element — text, layout, colors, positions, any errors or warnings." |
| UI/dialog | "Read every label, button text, input field value, and error message in this screenshot." |
| Code | "Read all visible code including line numbers, syntax highlighting, and any squiggly/error indicators." |
| Chart/graph | "Describe the chart type, axes labels, data ranges, trend direction, and any annotations." |
| Document | "Transcribe the visible text exactly. Note any formatting (bold, headers, tables)." |
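
For scripted use, the table above can be turned into a lookup so callers pass a scenario name instead of retyping the prompt. This is only a sketch: `prompt_for` and its scenario keywords (`ui`, `code`, `chart`, `document`) are hypothetical names chosen for illustration.

```bash
# Hypothetical helper: return the prompt for a scenario keyword.
# Anything unrecognized falls back to the generic prompt.
prompt_for() {
  case "$1" in
    ui)       echo 'Read every label, button text, input field value, and error message in this screenshot.' ;;
    code)     echo 'Read all visible code including line numbers, syntax highlighting, and any squiggly/error indicators.' ;;
    chart)    echo 'Describe the chart type, axes labels, data ranges, trend direction, and any annotations.' ;;
    document) echo 'Transcribe the visible text exactly. Note any formatting (bold, headers, tables).' ;;
    *)        echo 'Describe every visible element: text, layout, colors, positions, any errors or warnings.' ;;
  esac
}
```

A call such as `kimi --quiet -p "$(prompt_for chart) /path/to/image.png"` follows the file-reference form from Approach 1.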
### Step B: Read the text result and feed into Agent context
| Flags | Output format | How to parse |
|-------|--------------|--------------|
| `--quiet` | Plain text, final message only | Read directly |
| `--output-format=stream-json` | JSONL, one JSON per line | `grep '"role":"assistant"' \| tail -1 \| jq -r '.content'` |
**Inject into Agent context:**
```
[Vision bridge: KimiCode print mode]
<text from stdout>
[End vision bridge]
```
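
The injection step can be written as a filter that reads Kimi's text from stdin and adds the markers shown above. A minimal sketch, with `wrap_vision_output` as a hypothetical helper name; it also refuses to emit an empty block, since empty output usually means the call did not work.

```bash
# Hypothetical helper: wrap stdin in the vision-bridge markers.
# Returns 1 (and prints nothing to stdout) when the input is empty.
wrap_vision_output() {
  body=$(cat)
  if [ -z "$body" ]; then
    echo "vision bridge: empty response" >&2
    return 1
  fi
  printf '[Vision bridge: KimiCode print mode]\n%s\n[End vision bridge]\n' "$body"
}
```

Typical use is `kimi --quiet -p "..." | wrap_vision_output`.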
### Complete pipeline example
```bash
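IMAGE="/path/to/image.png"
# Capture stdout only; stop with kimi's exit code if the call fails.
RESULT=$(kimi --quiet -p "Describe every visible element in $IMAGE in detail" 2>/dev/null) || exit $?
# Wrap the description in markers so the Agent can tell where it starts and ends.
echo "[Vision bridge: KimiCode print mode]"
echo "$RESULT"
echo "[End vision bridge]"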
```
## Image Sources
The bridge works with any image the Agent can reference. Common sources:
- **User-provided file path**: `kimi --quiet -p "Describe $HOME/Downloads/screenshot.png"` (the shell does not expand `~` inside double quotes, so use `$HOME`)
- **Screenshot captured on-the-fly**: capture with an OS tool first, then feed the saved file (macOS: `screencapture`, Windows: PowerShell GDI+, Linux: ImageMagick/gnome-screenshot)
- **Image from a prior tool call**: pass the path from a previous download/generation step
- **Clipboard image**: save to a temp file first, then bridge
Screenshot capture is not part of the bridge skill itself — use platform-native tools.
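
Whatever the source, it helps to confirm the file is readable and to copy it to a temp path with a simple name before bridging, so that spaces or special characters in the original path cannot confuse the prompt. A minimal sketch; `stage_image` and the `vision-input` file name are hypothetical choices, not part of KimiCode.

```bash
# Hypothetical helper: copy an image to a fresh temp directory under a
# simple name and print the new path. Fails if the source is unreadable.
stage_image() {
  src=$1
  if [ ! -r "$src" ]; then
    echo "vision bridge: cannot read image: $src" >&2
    return 1
  fi
  name=${src##*/}            # file name without directories
  case "$name" in
    *.*) ext=${name##*.} ;;  # keep the original extension
    *)   ext=img ;;          # placeholder when there is none
  esac
  dir=$(mktemp -d) || return 1
  cp "$src" "$dir/vision-input.$ext" || return 1
  printf '%s\n' "$dir/vision-input.$ext"
}
```

The staged path can be used directly in the prompt, for example `IMAGE=$(stage_image "$HOME/Downloads/my shot (1).png")`.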
## Iterative Refinement
1. **Narrow the question**: `"Focus only on reading every text string in the dialog box."`
2. **Compare states**: Submit two images and ask `"Compare these two screenshots and describe what changed."`
3. **Crop and retry**: Crop to a sub-region with OS tools and re-submit
## Common Mistakes
| Mistake | Fix |
|---------|-----|
| Using a text-only model | Run `kimi config set model <vision-model>` to switch |
| Image path contains spaces or special chars | Wrap path in quotes or use a temp file with a simple name |
| Expecting KimiCode to take screenshots | Use OS tools; KimiCode is the analysis pipe, not the capture tool |
| Forgetting to verify prerequisites first | Always run Check 1 and Check 2 before attempting the bridge |
| Not checking exit codes | 0=success, 1=permanent error (auth/config), 75=transient (rate limit, retry) |
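
The exit codes in the last row lend themselves to a small retry wrapper: retry only on 75, and return any other status immediately. A minimal sketch, assuming the exit-code meanings listed in the table; `retry_transient` is a hypothetical helper name.

```bash
# Hypothetical helper: run a command, retrying while it exits with 75.
# $1 = maximum attempts, $2 = seconds to wait between attempts,
# remaining arguments = the command to run.
retry_transient() {
  max_attempts=$1
  delay_seconds=$2
  shift 2
  attempt=1
  while :; do
    status=0
    "$@" || status=$?
    # Success and permanent errors are returned as-is; so is 75
    # once the attempt budget is used up.
    if [ "$status" -ne 75 ] || [ "$attempt" -ge "$max_attempts" ]; then
      return "$status"
    fi
    attempt=$((attempt + 1))
    sleep "$delay_seconds"
  done
}
```

For example, `retry_transient 3 10 kimi --quiet -p "Describe /path/to/image.png"` makes up to three attempts with a 10-second pause between them.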
## Error Recovery
| Symptom | Likely cause | Action |
|---------|-------------|--------|
| `kimi: command not found` | CLI not on PATH | Install from https://www.kimi-cli.com |
| Exit code 75 | Rate limit / transient | Wait 10s, retry |
| Exit code 1 | Auth or config error | Run `kimi config` to check API key and model |
| Empty or nonsensical output | Model may not be vision-capable | Verify the model supports multimodal input |
| "Image input not supported" in output | Model is text-only | Switch to a vision model in KimiCode config |
| Base64 too large for stdin | Image too big | Use file-reference approach (Approach 1) |
| Output appears truncated | Long response | Use `--output-format=stream-json` for reliable capture |