ai-multimodal

Image/vision analysis, generation prompt crafting, and multimodal AI workflow orchestration

by InugamiDev

Best use case

ai-multimodal is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using ai-multimodal can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/ai-multimodal/SKILL.md --create-dirs "https://raw.githubusercontent.com/InugamiDev/ultrathink-oss/main/.claude/skills/ai-multimodal/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/ai-multimodal/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How ai-multimodal Compares

Feature / Agent	ai-multimodal	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Image/vision analysis, generation prompt crafting, and multimodal AI workflow orchestration

Where can I find the source code?

The source code is on GitHub in the InugamiDev/ultrathink-oss repository, under .claude/skills/ai-multimodal/.

SKILL.md Source

# AI Multimodal

## Purpose

This skill handles all interactions involving visual content — analyzing screenshots, interpreting diagrams, extracting information from images, crafting image generation prompts, and orchestrating multimodal AI workflows. It bridges the gap between visual and textual reasoning.

## Key Concepts

### Vision Analysis Modes

| Mode | Use Case | Output |
|------|----------|--------|
| **Descriptive** | "What is in this image?" | Detailed natural language description |
| **Extractive** | "Read the text/data from this image" | Structured data extraction |
| **Diagnostic** | "What's wrong with this UI?" | Issue identification with recommendations |
| **Comparative** | "How do these two designs differ?" | Structured comparison |
| **Interpretive** | "What does this diagram mean?" | Semantic interpretation of visual information |

### Image Understanding Framework

When analyzing any image, systematically assess:

```
LAYER 1 — COMPOSITION:
  - Type: screenshot / photo / diagram / chart / illustration / icon
  - Dimensions: aspect ratio and resolution implications
  - Layout: grid / freeform / hierarchical / sequential

LAYER 2 — CONTENT:
  - Primary subject(s): What dominates the image
  - Text content: Any readable text (OCR-level extraction)
  - Data content: Numbers, charts, graphs, tables
  - UI elements: Buttons, forms, navigation, cards (if screenshot)

LAYER 3 — CONTEXT:
  - Purpose: What this image is trying to communicate
  - Audience: Who this is designed for
  - Quality: Resolution, clarity, artifacts, compression

LAYER 4 — SEMANTICS:
  - Meaning: What information does this convey beyond literal content
  - Relationships: How elements relate to each other
  - Flow: What sequence or hierarchy is implied
```
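
One way to keep an analysis honest about which layers it has covered is to record findings in a small structure. The sketch below is illustrative only: the `ImageAnalysis` class, its field names, and `missing_layers` are assumptions made for this example and are not part of the skill.

```python
from dataclasses import dataclass, field


@dataclass
class ImageAnalysis:
    """Hypothetical record of findings, grouped by the four layers above."""

    # Layer 1: composition (always required)
    image_type: str  # e.g. "screenshot", "diagram"
    layout: str      # e.g. "grid", "hierarchical"
    # Layer 2: content
    primary_subjects: list[str] = field(default_factory=list)
    text_content: list[str] = field(default_factory=list)
    # Layer 3: context
    purpose: str = ""
    quality_notes: list[str] = field(default_factory=list)
    # Layer 4: semantics
    relationships: list[str] = field(default_factory=list)

    def missing_layers(self) -> list[str]:
        """Return the names of layers with no findings recorded yet."""
        missing = []
        if not (self.primary_subjects or self.text_content):
            missing.append("content")
        if not (self.purpose or self.quality_notes):
            missing.append("context")
        if not self.relationships:
            missing.append("semantics")
        return missing


analysis = ImageAnalysis(
    image_type="screenshot",
    layout="grid",
    primary_subjects=["login form"],
)
print(analysis.missing_layers())
```

Checking `missing_layers()` before writing the final answer is a cheap guard against stopping at a literal description and skipping context or semantics.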

## Workflows

### Workflow 1: Screenshot Analysis (UI Review)

```
INPUT: Screenshot of a UI

STEP 1: Identify the application type
  - Web app / mobile app / desktop app / CLI
  - Platform: browser, iOS, Android, desktop OS
  - Framework hints: React DevTools icon, specific component patterns

STEP 2: Catalog UI elements
  - Navigation: header, sidebar, tabs, breadcrumbs
  - Content: cards, lists, tables, forms
  - Actions: buttons, links, toggles, dropdowns
  - Feedback: alerts, toasts, loading states, empty states

STEP 3: Assess design quality
  - Spacing: consistent padding and margins
  - Typography: hierarchy, readability, font choices
  - Color: contrast ratios, palette consistency, accessibility
  - Alignment: grid adherence, visual balance
  - Depth: shadows, elevation, layering

STEP 4: Identify issues
  - Accessibility: contrast failures, missing labels, touch target sizes
  - Usability: unclear CTAs, information overload, hidden actions
  - Consistency: style deviations, mixed patterns
  - Responsiveness: overflow, truncation, broken layouts

OUTPUT FORMAT:
  SUMMARY: [1-2 sentence overview]
  POSITIVE: [What works well]
  ISSUES:
    - [SEVERITY] [Issue description] → [Recommendation]
  ACCESSIBILITY:
    - [WCAG criterion] [Pass/Fail] [Details]
```
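
The contrast check in Step 3 and the ACCESSIBILITY line in the output format can be made concrete with the WCAG 2.x contrast-ratio formula. The sketch below assumes foreground and background colors have already been sampled from the screenshot as 8-bit sRGB triples; sampled pixels only approximate the rendered colors (anti-aliasing, compression), so treat borderline results as uncertain. The function names are hypothetical.

```python
def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    """Relative luminance of an sRGB color, from 0.0 (black) to 1.0 (white)."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    """WCAG contrast ratio between two colors, from 1.0 to 21.0."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def accessibility_line(fg, bg, threshold: float = 4.5) -> str:
    """Format one ACCESSIBILITY entry; 4.5:1 is the WCAG 1.4.3 minimum for normal text."""
    ratio = contrast_ratio(fg, bg)
    status = "Pass" if ratio >= threshold else "Fail"
    return f"- [WCAG 1.4.3] [{status}] contrast ratio {ratio:.2f}:1"


# Mid-gray text sampled on a white background
print(accessibility_line((119, 119, 119), (255, 255, 255)))
```

Large text has a lower threshold (3:1) under the same criterion, so pass `threshold=3.0` when the sampled text is clearly a heading.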

### Workflow 2: Diagram Interpretation

```
INPUT: Architecture diagram, flowchart, ER diagram, or sequence diagram

STEP 1: Identify diagram type
  - Flowchart: process flow with decisions
  - Sequence: temporal message passing between actors
  - ER: entity relationships with cardinality
  - Architecture: system components and connections
  - Class: object-oriented structure
  - State: state machine with transitions

STEP 2: Extract elements
  - Nodes/entities: name, type, attributes
  - Connections: direction, labels, cardinality
  - Groupings: boundaries, clusters, swimlanes
  - Annotations: notes, constraints, legends

STEP 3: Interpret semantics
  - Data flow: where does data originate and terminate
  - Control flow: what drives decisions and transitions
  - Dependencies: what depends on what
  - Bottlenecks: single points of failure, high-fan-in nodes

STEP 4: Generate machine-readable representation
  - Convert to Mermaid syntax (hand off to mermaid skill)
  - Or convert to structured text description
```
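
Step 4 can be sketched as a small serializer, assuming the nodes and connections from Step 2 have already been extracted into plain Python structures. The `to_mermaid` helper and its input shapes are assumptions for illustration; the skill itself hands this step to the mermaid skill.

```python
def to_mermaid(nodes: dict[str, str], edges: list[tuple[str, str, str]]) -> str:
    """Render extracted nodes and labeled edges as a Mermaid flowchart.

    nodes maps a node id to its display label.
    edges is a list of (source id, target id, edge label); the label may be "".
    """
    lines = ["flowchart TD"]
    for node_id, label in nodes.items():
        lines.append(f'    {node_id}["{label}"]')
    for src, dst, label in edges:
        arrow = f"-->|{label}|" if label else "-->"
        lines.append(f"    {src} {arrow} {dst}")
    return "\n".join(lines)


diagram = to_mermaid(
    nodes={"client": "Browser", "api": "API Gateway", "db": "Postgres"},
    edges=[("client", "api", "HTTPS"), ("api", "db", "")],
)
print(diagram)
```

This sketch does no escaping, so labels containing double quotes or pipe characters would need sanitizing before use.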

### Workflow 3: Image Generation Prompt Engineering

```
INPUT: Description of desired image

STEP 1: Structure the prompt
  SUBJECT: [What is the main subject]
  ACTION: [What is the subject doing]
  SETTING: [Where / background]
  STYLE: [Art style, medium, aesthetic]
  MOOD: [Emotional tone, lighting, atmosphere]
  TECHNICAL: [Aspect ratio, quality, camera angle]

STEP 2: Apply prompt engineering principles
  - Front-load important elements (subject first)
  - Use specific, concrete descriptors over vague ones
  - Include negative prompts for unwanted elements
  - Specify style references when possible
  - Use weight/emphasis syntax for the target platform

STEP 3: Optimize for the target model
  DALL-E 3:
    - Natural language descriptions work best
    - Be descriptive and narrative
    - Specify "digital art", "photograph", "illustration" etc.

  Stable Diffusion:
    - Comma-separated tags work best
    - Include quality boosters: "masterpiece, best quality, highly detailed"
    - Use negative prompts extensively
    - Specify model-specific tags (e.g., "8k uhd, dslr")

  Midjourney:
    - Use /imagine with concise, evocative language
    - Append parameters: --ar 16:9 --v 6 --q 2
    - Reference artists or styles with "in the style of"
    - Use --no for negative prompts
```
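
The structure from Step 1 and the per-model advice from Step 3 can be combined in a small formatter. This is a minimal sketch: `PromptSpec`, `build_prompt`, and the target names `"dalle3"`, `"stable-diffusion"`, and `"midjourney"` are hypothetical, and it produces prompt text only, without calling any image-generation API.

```python
from dataclasses import dataclass


@dataclass
class PromptSpec:
    """Hypothetical container for the Step 1 fields."""

    subject: str
    action: str
    setting: str
    style: str
    mood: str
    aspect_ratio: str = "16:9"


def build_prompt(spec: PromptSpec, target: str) -> str:
    """Format one spec for a target model, subject first in every case."""
    if target == "dalle3":
        # Natural-language, narrative description
        return (
            f"A {spec.style} of {spec.subject} {spec.action}, "
            f"set in {spec.setting}. The mood is {spec.mood}."
        )
    if target == "stable-diffusion":
        # Comma-separated tags
        tags = [spec.subject, spec.action, spec.setting, spec.style, spec.mood]
        return ", ".join(tags)
    if target == "midjourney":
        # Concise phrase with the aspect ratio appended as a parameter
        phrase = f"{spec.subject} {spec.action}, {spec.setting}, {spec.style}, {spec.mood}"
        return f"{phrase} --ar {spec.aspect_ratio}"
    raise ValueError(f"unknown target model: {target}")


spec = PromptSpec(
    subject="a red fox",
    action="crossing a frozen river",
    setting="a pine forest at dawn",
    style="watercolor illustration",
    mood="calm and quiet",
)
for target in ("dalle3", "stable-diffusion", "midjourney"):
    print(build_prompt(spec, target))
```

Negative prompts and emphasis syntax are left out here because their format differs per platform; they would be added as separate fields per target.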

### Workflow 4: Data Extraction from Images

```
INPUT: Image containing structured data (chart, table, form)

STEP 1: Identify data type
  - Table: rows and columns of data
  - Chart: bar, line, pie, scatter, etc.
  - Form: filled form fields
  - Document: structured text document
  - Receipt/Invoice: financial data

STEP 2: Extract raw data
  For tables:
    | Header 1 | Header 2 | Header 3 |
    |----------|----------|----------|
    | value    | value    | value    |

  For charts:
    CHART TYPE: [bar/line/pie/etc.]
    X-AXIS: [label and unit]
    Y-AXIS: [label and unit]
    DATA POINTS: [extracted values]
    TREND: [observed pattern]

  For forms:
    FIELD: [label] = [value]

STEP 3: Validate extraction
  - Cross-check totals if available
  - Verify units and scales
  - Flag uncertain readings with confidence levels
  - Note any obscured or illegible portions

OUTPUT FORMAT:
  FORMAT: [table/csv/json — most appropriate for the data]
  DATA: [Extracted data in chosen format]
  CONFIDENCE: [high/medium/low for each data point]
  NOTES: [Anything uncertain or partially readable]
```
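
The "cross-check totals" step in Step 3 can be sketched for the receipt or invoice case, assuming the line items and the printed total have already been read from the image. The `validate_total` function and its parameters are assumptions for illustration, and it assigns one overall confidence rather than one per data point.

```python
def validate_total(
    rows: list[dict],
    value_key: str,
    stated_total: float,
    tolerance: float = 0.01,
) -> dict:
    """Compare the sum of extracted line items against the total printed in the image."""
    extracted_sum = sum(row[value_key] for row in rows)
    matches = abs(extracted_sum - stated_total) <= tolerance
    notes = "" if matches else (
        f"Line items sum to {extracted_sum:.2f}, "
        f"but the stated total is {stated_total:.2f}"
    )
    return {
        "FORMAT": "json",
        "DATA": rows,
        "CONFIDENCE": "high" if matches else "low",
        "NOTES": notes,
    }


rows = [
    {"item": "Coffee", "amount": 12.50},
    {"item": "Bagel", "amount": 7.25},
]
report = validate_total(rows, "amount", stated_total=19.95)
print(report["CONFIDENCE"], "-", report["NOTES"])
```

A mismatch does not say which reading is wrong, so the low-confidence result should prompt a second look at both the line items and the total.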

## Prompt Templates

### UI Screenshot Review Prompt

```
Analyze this UI screenshot and provide:
1. A brief description of what the screen shows
2. UI element inventory (navigation, content areas, actions)
3. Design assessment (spacing, typography, color, alignment)
4. Accessibility issues (contrast, labels, touch targets)
5. Top 3 improvement recommendations with specific CSS/design fixes
```

### Architecture Diagram Interpretation Prompt

```
Interpret this architecture diagram:
1. List all system components and their roles
2. Map all connections with direction and purpose
3. Identify the data flow from user request to response
4. Note any potential single points of failure
5. Convert to Mermaid syntax for version control
```

### Error Screenshot Diagnosis Prompt

```
Analyze this error screenshot:
1. Read and transcribe the exact error message
2. Identify the error type (runtime, build, network, UI)
3. Identify the source (file path, line number if visible)
4. Suggest probable causes based on the error
5. Provide fix steps in order of likelihood
```

## Quality Guidelines

### For Image Analysis

- Always describe what you **see**, not what you **assume**
- Flag uncertain readings explicitly: "This appears to be X, but the resolution makes it difficult to confirm"
- When analyzing UI, reference specific coordinates or regions: "top-left navigation area", "the third card in the grid"
- Provide actionable output — descriptions alone are not useful without recommendations

### For Prompt Generation

- Test prompts mentally before delivering — would this produce the desired result?
- Include aspect ratio specifications — default square outputs are rarely what users want
- Always ask about the intended use (web, print, social media) to set appropriate quality parameters
- Provide 2-3 prompt variants so the user can iterate

## Anti-Patterns

1. **Over-interpreting**: Making assumptions about image content that are not visually supported. State only what is visible.
2. **Generic descriptions**: "This is a nice-looking website" provides zero value. Be specific about what works and what does not.
3. **Ignoring context**: A login screen for a banking app has different requirements than one for a gaming platform. Consider the domain.
4. **Platform-agnostic prompts**: Image generation prompts must be tailored to the specific model being used. DALL-E and Stable Diffusion require different approaches.
5. **Missing accessibility**: Every UI analysis must include accessibility assessment. It is not optional.

## Integration Notes

- Hand off to **ui-ux-pro** when screenshot analysis reveals design system issues.
- Hand off to **media-processing** when images need transformation (resize, format conversion, optimization).
- Hand off to **mermaid** when a diagram needs to be recreated in version-controllable format.
- Use **chrome-devtools** when a screenshot analysis suggests the need for live browser inspection.
