ai-multimodal

Process and generate multimedia content using Google Gemini API. Capabilities include analyze audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis up to 9.5 hours), understand images (captioning, object detection, OCR, visual Q&A, segmentation), process videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extract from documents (PDF tables, forms, charts, diagrams, multi-page), generate images (text-to-image, editing, composition, refinement). Use when working with audio/video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images from text prompts, or implementing multimodal AI features. Supports multiple models (Gemini 2.5/2.0) with context windows up to 2M tokens.

153 stars

byMicrock

View on GitHub Installation ↓

Best use case

ai-multimodal is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using ai-multimodal should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/ai-multimodal/SKILL.md --create-dirs "https://raw.githubusercontent.com/Microck/ordinary-claude-skills/main/skills_all/ai-multimodal/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/ai-multimodal/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How ai-multimodal Compares

Feature / Agent	ai-multimodal	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

Related Guides

AI Agent for YouTube Script Writing

Find AI agent skills for YouTube script writing, video research, content outlining, and repeatable channel production workflows.

Best AI Skills for ChatGPT

Find the best AI skills to adapt into ChatGPT workflows for research, writing, summarization, planning, and repeatable assistant tasks.

ChatGPT vs Claude for Agent Skills

Compare ChatGPT and Claude for AI agent skills across coding, writing, research, and reusable workflow execution.

SKILL.md Source

# AI Multimodal Processing Skill

Process audio, images, videos, documents, and generate images using Google Gemini's multimodal API. Unified interface for all multimedia content understanding and generation.

## Core Capabilities

### Audio Processing
- Transcription with timestamps (up to 9.5 hours)
- Audio summarization and analysis
- Speech understanding and speaker identification
- Music and environmental sound analysis
- Text-to-speech generation with controllable voice

### Image Understanding
- Image captioning and description
- Object detection with bounding boxes (2.0+)
- Pixel-level segmentation (2.5+)
- Visual question answering
- Multi-image comparison (up to 3,600 images)
- OCR and text extraction

### Video Analysis
- Scene detection and summarization
- Video Q&A with temporal understanding
- Transcription with visual descriptions
- YouTube URL support
- Long video processing (up to 6 hours)
- Frame-level analysis

### Document Extraction
- Native PDF vision processing (up to 1,000 pages)
- Table and form extraction
- Chart and diagram analysis
- Multi-page document understanding
- Structured data output (JSON schema)
- Format conversion (PDF to HTML/JSON)

### Image Generation
- Text-to-image generation
- Image editing and modification
- Multi-image composition (up to 3 images)
- Iterative refinement
- Multiple aspect ratios (1:1, 16:9, 9:16, 4:3, 3:4)
- Controllable style and quality

## Capability Matrix

| Task | Audio | Image | Video | Document | Generation |
|------|:-----:|:-----:|:-----:|:--------:|:----------:|
| Transcription | ✓ | - | ✓ | - | - |
| Summarization | ✓ | ✓ | ✓ | ✓ | - |
| Q&A | ✓ | ✓ | ✓ | ✓ | - |
| Object Detection | - | ✓ | ✓ | - | - |
| Text Extraction | - | ✓ | - | ✓ | - |
| Structured Output | ✓ | ✓ | ✓ | ✓ | - |
| Creation | TTS | - | - | - | ✓ |
| Timestamps | ✓ | - | ✓ | - | - |
| Segmentation | - | ✓ | - | - | - |

## Model Selection Guide

### Gemini 2.5 Series (Recommended)
- **gemini-2.5-pro**: Highest quality, all features, 1M-2M context
- **gemini-2.5-flash**: Best balance, all features, 1M-2M context
- **gemini-2.5-flash-lite**: Lightweight, segmentation support
- **gemini-2.5-flash-image**: Image generation only

### Gemini 2.0 Series
- **gemini-2.0-flash**: Fast processing, object detection
- **gemini-2.0-flash-lite**: Lightweight option

### Feature Requirements
- **Segmentation**: Requires 2.5+ models
- **Object Detection**: Requires 2.0+ models
- **Multi-video**: Requires 2.5+ models
- **Image Generation**: Requires flash-image model

### Context Windows
- **2M tokens**: ~6 hours video (low-res) or ~2 hours (default)
- **1M tokens**: ~3 hours video (low-res) or ~1 hour (default)
- **Audio**: 32 tokens/second (1 min = 1,920 tokens)
- **PDF**: 258 tokens/page (fixed)
- **Image**: 258-1,548 tokens based on size

## Quick Start

### Prerequisites

**API Key Setup**: Supports both Google AI Studio and Vertex AI.

The skill checks for `GEMINI_API_KEY` in this order:
1. Process environment: `export GEMINI_API_KEY="your-key"`
2. Project root: `.env`
3. `.claude/.env`
4. `.claude/skills/.env`
5. `.claude/skills/ai-multimodal/.env`

**Get API key**: https://aistudio.google.com/apikey

**For Vertex AI**:
```bash
export GEMINI_USE_VERTEX=true
export VERTEX_PROJECT_ID=your-gcp-project-id
export VERTEX_LOCATION=us-central1  # Optional
```

**Install SDK**:
```bash
pip install google-genai python-dotenv pillow
```

### Common Patterns

**Transcribe Audio**:
```bash
python scripts/gemini_batch_process.py \
  --files audio.mp3 \
  --task transcribe \
  --model gemini-2.5-flash
```

**Analyze Image**:
```bash
python scripts/gemini_batch_process.py \
  --files image.jpg \
  --task analyze \
  --prompt "Describe this image" \
  --output docs/assets/<output-name>.md \
  --model gemini-2.5-flash
```

**Process Video**:
```bash
python scripts/gemini_batch_process.py \
  --files video.mp4 \
  --task analyze \
  --prompt "Summarize key points with timestamps" \
  --output docs/assets/<output-name>.md \
  --model gemini-2.5-flash
```

**Extract from PDF**:
```bash
python scripts/gemini_batch_process.py \
  --files document.pdf \
  --task extract \
  --prompt "Extract table data as JSON" \
  --output docs/assets/<output-name>.md \
  --format json
```

**Generate Image**:
```bash
python scripts/gemini_batch_process.py \
  --task generate \
  --prompt "A futuristic city at sunset" \
  --output docs/assets/<output-file-name> \
  --model gemini-2.5-flash-image \
  --aspect-ratio 16:9
```

**Optimize Media**:
```bash
# Prepare large video for processing
python scripts/media_optimizer.py \
  --input large-video.mp4 \
  --output docs/assets/<output-file-name> \
  --target-size 100MB

# Batch optimize multiple files
python scripts/media_optimizer.py \
  --input-dir ./videos \
  --output-dir docs/assets/optimized \
  --quality 85
```

**Convert Documents to Markdown**:
```bash
# Convert to PDF
python scripts/document_converter.py \
  --input document.docx \
  --output docs/assets/document.md

# Extract pages
python scripts/document_converter.py \
  --input large.pdf \
  --output docs/assets/chapter1.md \
  --pages 1-20
```

## Supported Formats

### Audio
- WAV, MP3, AAC, FLAC, OGG Vorbis, AIFF
- Max 9.5 hours per request
- Auto-downsampled to 16 Kbps mono

### Images
- PNG, JPEG, WEBP, HEIC, HEIF
- Max 3,600 images per request
- Resolution: ≤384px = 258 tokens, larger = tiled

### Video
- MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP
- Max 6 hours (low-res) or 2 hours (default)
- YouTube URLs supported (public only)

### Documents
- PDF only for vision processing
- Max 1,000 pages
- TXT, HTML, Markdown supported (text-only)

### Size Limits
- **Inline**: <20MB total request
- **File API**: 2GB per file, 20GB project quota
- **Retention**: 48 hours auto-delete

## Reference Navigation

For detailed implementation guidance, see:

### Audio Processing
- `references/audio-processing.md` - Transcription, analysis, TTS
  - Timestamp handling and segment analysis
  - Multi-speaker identification
  - Non-speech audio analysis
  - Text-to-speech generation

### Image Understanding
- `references/vision-understanding.md` - Captioning, detection, OCR
  - Object detection and localization
  - Pixel-level segmentation
  - Visual question answering
  - Multi-image comparison

### Video Analysis
- `references/video-analysis.md` - Scene detection, temporal understanding
  - YouTube URL processing
  - Timestamp-based queries
  - Video clipping and FPS control
  - Long video optimization

### Document Extraction
- `references/document-extraction.md` - PDF processing, structured output
  - Table and form extraction
  - Chart and diagram analysis
  - JSON schema validation
  - Multi-page handling

### Image Generation
- `references/image-generation.md` - Text-to-image, editing
  - Prompt engineering strategies
  - Image editing and composition
  - Aspect ratio selection
  - Safety settings

## Cost Optimization

### Token Costs
**Input Pricing**:
- Gemini 2.5 Flash: $1.00/1M input, $0.10/1M output
- Gemini 2.5 Pro: $3.00/1M input, $12.00/1M output
- Gemini 1.5 Flash: $0.70/1M input, $0.175/1M output

**Token Rates**:
- Audio: 32 tokens/second (1 min = 1,920 tokens)
- Video: ~300 tokens/second (default) or ~100 (low-res)
- PDF: 258 tokens/page (fixed)
- Image: 258-1,548 tokens based on size

**TTS Pricing**:
- Flash TTS: $10/1M tokens
- Pro TTS: $20/1M tokens

### Best Practices
1. Use `gemini-2.5-flash` for most tasks (best price/performance)
2. Use File API for files >20MB or repeated queries
3. Optimize media before upload (see `media_optimizer.py`)
4. Process specific segments instead of full videos
5. Use lower FPS for static content
6. Implement context caching for repeated queries
7. Batch process multiple files in parallel

## Rate Limits

**Free Tier**:
- 10-15 RPM (requests per minute)
- 1M-4M TPM (tokens per minute)
- 1,500 RPD (requests per day)

**YouTube Limits**:
- Free tier: 8 hours/day
- Paid tier: No length limits
- Public videos only

**Storage Limits**:
- 20GB per project
- 2GB per file
- 48-hour retention

## Error Handling

Common errors and solutions:
- **400**: Invalid format/size - validate before upload
- **401**: Invalid API key - check configuration
- **403**: Permission denied - verify API key restrictions
- **404**: File not found - ensure file uploaded and active
- **429**: Rate limit exceeded - implement exponential backoff
- **500**: Server error - retry with backoff

## Scripts Overview

All scripts support unified API key detection and error handling:

**gemini_batch_process.py**: Batch process multiple media files
- Supports all modalities (audio, image, video, PDF)
- Progress tracking and error recovery
- Output formats: JSON, Markdown, CSV
- Rate limiting and retry logic
- Dry-run mode

**media_optimizer.py**: Prepare media for Gemini API
- Compress videos/audio for size limits
- Resize images appropriately
- Split long videos into chunks
- Format conversion
- Quality vs size optimization

**document_converter.py**: Convert documents to PDF
- Convert DOCX, XLSX, PPTX to PDF
- Extract page ranges
- Optimize PDFs for Gemini
- Extract images from PDFs
- Batch conversion support

Run any script with `--help` for detailed usage.

## Resources

- [Audio API Docs](https://ai.google.dev/gemini-api/docs/audio)
- [Image API Docs](https://ai.google.dev/gemini-api/docs/image-understanding)
- [Video API Docs](https://ai.google.dev/gemini-api/docs/video-understanding)
- [Document API Docs](https://ai.google.dev/gemini-api/docs/document-processing)
- [Image Gen Docs](https://ai.google.dev/gemini-api/docs/image-generation)
- [Get API Key](https://aistudio.google.com/apikey)
- [Pricing](https://ai.google.dev/pricing)

Related Skills

zapier-workflows

153

from Microck/ordinary-claude-skills

Manage and trigger pre-built Zapier workflows and MCP tool orchestration. Use when user mentions workflows, Zaps, automations, daily digest, research, search, lead tracking, expenses, or asks to "run" any process. Also handles Perplexity-based research and Google Sheets data tracking.

writing-skills

153

from Microck/ordinary-claude-skills

Create and manage Claude Code skills in HASH repository following Anthropic best practices. Use when creating new skills, modifying skill-rules.json, understanding trigger patterns, working with hooks, debugging skill activation, or implementing progressive disclosure. Covers skill structure, YAML frontmatter, trigger types (keywords, intent patterns), UserPromptSubmit hook, and the 500-line rule. Includes validation and debugging with SKILL_DEBUG. Examples include rust-error-stack, cargo-dependencies, and rust-documentation skills.

writing-plans

153

from Microck/ordinary-claude-skills

Use when design is complete and you need detailed implementation tasks for engineers with zero codebase context - creates comprehensive implementation plans with exact file paths, complete code examples, and verification steps assuming engineer has minimal domain knowledge

workflow-orchestration-patterns

153

from Microck/ordinary-claude-skills

Design durable workflows with Temporal for distributed systems. Covers workflow vs activity separation, saga patterns, state management, and determinism constraints. Use when building long-running processes, distributed transactions, or microservice orchestration.

workflow-management

153

from Microck/ordinary-claude-skills

Create, debug, or modify QStash workflows for data updates and social media posting in the API service. Use when adding new automated jobs, fixing workflow errors, or updating scheduling logic.

workflow-interactive-dev

153

from Microck/ordinary-claude-skills

用于开发 FastGPT 工作流中的交互响应。详细说明了交互节点的架构、开发流程和需要修改的文件。

woocommerce-dev-cycle

153

from Microck/ordinary-claude-skills

Run tests, linting, and quality checks for WooCommerce development. Use when running tests, fixing code style, or following the development workflow.

woocommerce-code-review

153

from Microck/ordinary-claude-skills

Review WooCommerce code changes for coding standards compliance. Use when reviewing code locally, performing automated PR reviews, or checking code quality.

Wheels Migration Generator

153

from Microck/ordinary-claude-skills

Generate database-agnostic Wheels migrations for creating tables, altering schemas, and managing database changes. Use when creating or modifying database schema, adding tables, columns, indexes, or foreign keys. Prevents database-specific SQL and ensures cross-database compatibility.

webapp-testing

153

from Microck/ordinary-claude-skills

Toolkit for interacting with and testing local web applications using Playwright. Supports verifying frontend functionality, debugging UI behavior, capturing browser screenshots, and viewing browser logs.

web3-testing

153

from Microck/ordinary-claude-skills

Test smart contracts comprehensively using Hardhat and Foundry with unit tests, integration tests, and mainnet forking. Use when testing Solidity contracts, setting up blockchain test suites, or validating DeFi protocols.

web-research

153

from Microck/ordinary-claude-skills

Use this skill for requests related to web research; it provides a structured approach to conducting comprehensive web research