media-generation
Generate images, videos, and audio using Google's Gemini APIs. Use for image generation/editing (Gemini 3 Pro Image), video generation (Veo 3), and speech (TBD). Trigger words - images: generate, create, draw, design, make, edit, modify image/picture. Video: generate video, create video, animate, make a video. Supports text-to-image, image-to-image editing, text-to-video, and image-to-video.
Best use case
media-generation is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using media-generation should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/media-generation/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How media-generation Compares
| Feature / Agent | media-generation | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
It generates images, videos, and audio using Google's Gemini APIs: image generation and editing (Gemini 3 Pro Image), video generation (Veo 3), and speech (script TBD). It supports text-to-image, image-to-image editing, text-to-video, and image-to-video.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Media Generation
## Image Generation
```bash
uv run ~/.claude/skills/media-generation/scripts/generate_image.py \
--prompt "description or editing instructions" \
--filename "output.png" \
[--input-image "source.png"] \
[--resolution 1K|2K|4K]
```
### Resolution
- `1K` (default) — also for: "low res", "1080p"
- `2K` — also for: "medium", "2048"
- `4K` — also for: "high res", "hi-res", "ultra"
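
The alias list above can be normalized before the value is passed to `--resolution`. The following is a minimal sketch, not part of the skill's scripts: `normalize_resolution` and `RESOLUTION_ALIASES` are hypothetical names, and falling back to `1K` for unrecognized phrases is an assumption based on `1K` being the stated default.

```python
# Hypothetical helper; the alias table mirrors the list above.
RESOLUTION_ALIASES = {
    "1K": ("1k", "low res", "1080p"),
    "2K": ("2k", "medium", "2048"),
    "4K": ("4k", "high res", "hi-res", "ultra"),
}


def normalize_resolution(text: str, default: str = "1K") -> str:
    """Map a user phrase to the value passed to --resolution."""
    phrase = text.strip().lower()
    for resolution, aliases in RESOLUTION_ALIASES.items():
        if phrase in aliases:
            return resolution
    return default  # assumption: unknown phrases use the documented default


print(normalize_resolution("hi-res"))
```

Matching is case-insensitive and exact, so a phrase such as "Medium" resolves to `2K`, while free-form text that merely contains an alias falls through to the default.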
## Video Generation
```bash
uv run ~/.claude/skills/media-generation/scripts/generate_video.py \
--prompt "video description" \
--filename "output.mp4" \
[--model veo-3.0-generate-001] \
[--negative "things to avoid"] \
[--input-image "first-frame.png"]
```
### Models
- `veo-3.0-generate-001` (default) — stable, video only
- `veo-3.0-fast-generate-001` — faster, lower cost
- `veo-3.1-generate-preview` — supports video extend, audio sync
- `veo-3.1-fast-generate-preview` — fast with extend support
### Prompting Tips
- Specify camera movements: `"slow zoom in", "pan left", "close-up"`
- Add `"no talking, no dialogue"` if character shouldn't speak
- Describe atmosphere: `"rain outside", "purple mystical energy"`
**Note:** Veo requires paid tier. ~$0.40/sec standard, ~$0.15/sec fast.
## Music Video from Image + Audio
### Overview
1. Start with character image + audio track (e.g., from Suno)
2. Transcribe audio to get timestamps
3. Generate clip 1 from image (veo-3.1)
4. Extend each subsequent clip from previous (maintains continuity)
5. Stitch clips + overlay audio with ffmpeg
### Step 1: Transcribe audio for timing
```bash
whisper-ctranslate2 "song.mp3" --model large-v3 --output_dir /tmp --output_format srt
```
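
The SRT file carries a start and end time for each transcribed line, which later steps can use to plan clip boundaries. The sketch below reads those timestamps with the standard library only. `parse_srt_times` is a hypothetical helper, and it assumes the usual SRT timing line of the form `HH:MM:SS,mmm --> HH:MM:SS,mmm`.

```python
import re

# Matches SRT timing lines such as "00:00:01,500 --> 00:00:04,000".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def parse_srt_times(srt_text: str) -> list[tuple[float, float]]:
    """Return (start, end) in seconds for every cue in an SRT document."""
    cues = []
    for match in TIMING.finditer(srt_text):
        h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        cues.append((start, end))
    return cues


# Illustrative sample; a real run would read the .srt file written by Step 1.
sample = """1
00:00:01,500 --> 00:00:04,000
first line

2
00:00:08,000 --> 00:00:11,250
second line
"""
print(parse_srt_times(sample))
```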
### Step 2: Generate first clip from image
```python
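import time
from google import genai
from google.genai import types
client = genai.Client()  # reads GEMINI_API_KEY from the environment
# Use veo-3.1 (required for extend feature); img_data holds the source image bytes
operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",
    image=types.Image(image_bytes=img_data, mime_type="image/jpeg"),
    prompt="character description, scene action, no talking",
)
while not operation.done:  # generation is asynchronous; poll until it finishes
    time.sleep(10)
    operation = client.operations.get(operation)
video1 = operation.result.generated_videos[0]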
```
### Step 3: Extend from previous clip
```python
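# Assumes `client` and `time` are already set up, and `video1` comes from Step 2
operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",
    video=video1.video,  # pass the previous clip's video object
    prompt="next scene description, continuous action, no talking",
)
while not operation.done:  # poll until the extension finishes
    time.sleep(10)
    operation = client.operations.get(operation)
video2 = operation.result.generated_videos[0]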
```
### Step 4: Stitch clips + add audio
```bash
# Create concat list (one "file '...'" line per clip, in name order)
printf "file '%s'\n" clip_*.mp4 > concat.txt
# Stitch video clips
ffmpeg -f concat -safe 0 -i concat.txt -c copy combined.mp4
# Add audio track
ffmpeg -i combined.mp4 -i song.mp3 -c:v copy -c:a aac -map 0:v -map 1:a final.mp4
```
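
With around 30 clips, the concat list may be easier to build programmatically. The sketch below assumes clips follow the zero-padded `clip_NN.mp4` naming used above, so that sorting by name gives playback order. `concat_list_text` and `write_concat_list` are hypothetical helpers, and file names containing single quotes are not handled.

```python
from pathlib import Path


def concat_list_text(clip_names: list[str]) -> str:
    """Build the body of an ffmpeg concat list file, sorted by clip name."""
    return "".join(f"file '{name}'\n" for name in sorted(clip_names))


def write_concat_list(clip_dir: str, out_name: str = "concat.txt") -> Path:
    """Write a concat list for every clip_*.mp4 found in clip_dir."""
    directory = Path(clip_dir)
    names = [p.name for p in directory.glob("clip_*.mp4")]
    out_path = directory / out_name
    out_path.write_text(concat_list_text(names))
    return out_path


print(concat_list_text(["clip_02.mp4", "clip_01.mp4"]), end="")
```

The list is written next to the clips because the entries are bare file names, which only resolve if the list and the clips share a directory.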
### Cost estimate
- ~8 sec per clip × $0.40/sec = $3.20/clip
- 4-min song ≈ 30 clips ≈ $96
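
The same arithmetic can be run for other song lengths or for the fast tier. This is a rough sketch: `estimate_cost` is a hypothetical helper, the per-second prices are the approximate figures from the pricing note in the Video Generation section, and every clip is assumed to be billed at the full clip length.

```python
import math

# Approximate per-second prices from the pricing note; treat as assumptions.
PRICE_PER_SECOND = {"standard": 0.40, "fast": 0.15}


def estimate_cost(song_seconds: float, clip_seconds: float = 8.0,
                  tier: str = "standard") -> tuple[int, float]:
    """Return (clip count, estimated USD) for covering a song with clips."""
    clips = math.ceil(song_seconds / clip_seconds)
    cost = clips * clip_seconds * PRICE_PER_SECOND[tier]
    return clips, round(cost, 2)


print(estimate_cost(240))               # 4-minute song, standard tier
print(estimate_cost(240, tier="fast"))  # same song, fast tier
```

For a 4-minute song this reproduces the figures above: 30 clips at about $96 on the standard tier, and about $36 on the fast tier.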
## Audio Generation
- **Music:** Use Suno (external service)
- **Speech:** Gemini 2.5 TTS (Flash or Pro); script TBD
## API Key
Uses the `GEMINI_API_KEY` env var, or pass `--api-key KEY`.
Related Skills
generational-agent-succession
Parallel agent swarms with generational succession. Combines agent-architect's multi-agent parallelism with automatic succession when agents degrade. Each parallel agent gets fresh context through controlled handoffs while maintaining accumulated wisdom.
social-media-scheduler
Generate a full week of social media content for any topic. Outputs platform-optimized posts for Twitter/X, LinkedIn, and Instagram with hashtags and posting times.
social-media-manager
Specialist social media agent for multiple companies (multi-tenant). Creates weekly strategies, manages brand profiles, and generates content (text and images) in bulk for manual publishing.
Media Uploader - R2/S3 with video download
Upload files or download videos from popular platforms (YouTube, Vimeo, Bilibili, etc.) and upload to Cloudflare R2, AWS S3, or any S3-compatible storage with secure presigned download links.
marketing-social-media
Sustainable social media marketing and paid social: content systems, community management, influencer ops, social commerce, and attribution (2026).
instagram-social-media
Acts as a social media specialist for Instagram, creating content closely aligned with the brand's identity. Use this skill whenever the user wants to create posts, stories, captions, or strategies for Instagram.
Image Generation
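AI image generation and editing built on Nano Banana (Gemini Image), covering text-to-image, image-to-image, and image editing. Suited to creative design, marketing assets, social media content, presentation illustrations, and similar scenarios. Supports multiple styles, high-resolution output (up to 4K), text rendering, and character consistency.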
ffmpeg-media
FFmpeg media processing. Video/audio transcoding, stream manipulation, and filter graphs.
cv181x-media
Expert guide for CV181X/CV182X/CV180X (SG200X) multimedia development using CVI MPI API. Use this skill when working with: VI (video input/camera/ISP), VPSS (video processing/scaling/crop), VENC (H.264/H.265/JPEG encoding), VDEC (decoding), VB (video buffer pools), SYS binding, or any CVI_* API calls. Covers camera pipeline setup, offline VPSS processing, VB pool planning, and error diagnosis (ERR_VPSS_NOBUF, ERR_VB_NOBUF). API details in references/.
athlete-social-media-manager
Create brand-safe content for athletes. Personal branding strategy, engagement optimization, crisis communication, sponsor integration.
ai-video-generation
Generate AI videos with Google Veo, Seedance, Wan, Grok and 40+ models via inference.sh CLI. Models: Veo 3.1, Veo 3, Seedance 1.5 Pro, Wan 2.5, Grok Imagine Video, OmniHuman, Fabric, HunyuanVideo. Capabilities: text-to-video, image-to-video, lipsync, avatar animation, video upscaling, foley sound. Use for: social media videos, marketing content, explainer videos, product demos, AI avatars. Triggers: video generation, ai video, text to video, image to video, veo, animate image, video from image, ai animation, video generator, generate video, t2v, i2v, ai video maker, create video with ai, runway alternative, pika alternative, sora alternative, kling alternative
ai-generation-client
External AI API integration with retry logic, rate limiting, content safety detection, and multi-turn conversation support for image generation.