media-generation

Generate images, videos, and audio using Google's Gemini APIs. Use for image generation/editing (Gemini 3 Pro Image), video generation (Veo 3), and speech (TBD). Trigger words - images: generate, create, draw, design, make, edit, modify image/picture. Video: generate video, create video, animate, make a video. Supports text-to-image, image-to-image editing, text-to-video, and image-to-video.

by diegosouzapw

Best use case

media-generation is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using media-generation can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/media-generation/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/content-media/media-generation/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/media-generation/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How media-generation Compares

Feature / Agent	media-generation	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

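What does this skill do?

It generates and edits images with Gemini 3 Pro Image and generates video with Veo 3 through Google's Gemini APIs. Speech generation is still marked TBD.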

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Media Generation

## Image Generation

```bash
uv run ~/.claude/skills/media-generation/scripts/generate_image.py \
  --prompt "description or editing instructions" \
  --filename "output.png" \
  [--input-image "source.png"] \
  [--resolution 1K|2K|4K]
```
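
When the script is driven from other Python code, the command line above can be assembled as an argument list. This is a minimal sketch: the script path and flag names are copied from the usage block, while the helper name `build_image_command` and its validation of the resolution value are assumptions for illustration.

```python
import os

# Script path and flags come from the usage block above; the helper is hypothetical.
SCRIPT = "~/.claude/skills/media-generation/scripts/generate_image.py"


def build_image_command(prompt, filename, input_image=None, resolution=None):
    """Return the argv list for one generate_image.py run."""
    cmd = [
        "uv", "run", os.path.expanduser(SCRIPT),  # expand ~ because no shell is involved
        "--prompt", prompt,
        "--filename", filename,
    ]
    if input_image:
        # Supplying a source image switches the run to image-to-image editing.
        cmd += ["--input-image", input_image]
    if resolution:
        if resolution not in ("1K", "2K", "4K"):
            raise ValueError(f"unsupported resolution: {resolution}")
        cmd += ["--resolution", resolution]
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems in prompts that contain quotes or spaces.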

### Resolution
- `1K` (default) — also for: "low res", "1080p"
- `2K` — also for: "medium", "2048"
- `4K` — also for: "high res", "hi-res", "ultra"
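
The alias list above amounts to a small lookup from user wording to a `--resolution` value. A sketch of that lookup follows, assuming simple case-insensitive substring matching; the names `RESOLUTION_ALIASES` and `pick_resolution` are hypothetical, and how the skill resolves aliases in practice is not specified in the source.

```python
# Alias table mirrors the bullets above; the matching strategy is an assumption.
RESOLUTION_ALIASES = {
    "1K": ("1k", "low res", "1080p"),
    "2K": ("2k", "medium", "2048"),
    "4K": ("4k", "high res", "hi-res", "ultra"),
}


def pick_resolution(request: str, default: str = "1K") -> str:
    """Return the resolution whose alias appears in the request, else the default."""
    text = request.lower()
    for resolution, aliases in RESOLUTION_ALIASES.items():
        if any(alias in text for alias in aliases):
            return resolution
    return default
```

For example, `pick_resolution("make it hi-res")` selects `4K`, and a request with no resolution wording falls back to `1K`.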

## Video Generation

```bash
uv run ~/.claude/skills/media-generation/scripts/generate_video.py \
  --prompt "video description" \
  --filename "output.mp4" \
  [--model veo-3.0-generate-001] \
  [--negative "things to avoid"] \
  [--input-image "first-frame.png"]
```

### Models
- `veo-3.0-generate-001` (default) — stable, video only
- `veo-3.0-fast-generate-001` — faster, lower cost
- `veo-3.1-generate-preview` — supports video extend, audio sync
- `veo-3.1-fast-generate-preview` — fast with extend support
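
Choosing among the four models comes down to two questions: whether the clip will later be extended, and whether speed and cost matter more than quality. A minimal sketch of that decision follows; the function name `pick_veo_model` is hypothetical, and the model IDs are taken from the list above.

```python
def pick_veo_model(needs_extend: bool = False, fast: bool = False) -> str:
    """Map the two trade-offs from the list above onto a model ID."""
    if needs_extend:
        # Only the 3.1 preview models are listed with extend support.
        return "veo-3.1-fast-generate-preview" if fast else "veo-3.1-generate-preview"
    return "veo-3.0-fast-generate-001" if fast else "veo-3.0-generate-001"
```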

### Prompting Tips
- Specify camera movements: `"slow zoom in", "pan left", "close-up"`
- Add `"no talking, no dialogue"` if character shouldn't speak
- Describe atmosphere: `"rain outside", "purple mystical energy"`
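
These tips can be folded into a small prompt builder so every clip gets the same structure. The sketch below is illustrative only: `build_video_prompt` and its parameters are hypothetical names, and joining the parts with commas is an assumption about what reads well, not a Veo requirement.

```python
def build_video_prompt(subject, camera=None, atmosphere=None, silent=False):
    """Join the subject with optional camera, atmosphere, and no-dialogue hints."""
    parts = [subject]
    if camera:
        parts.append(camera)  # e.g. "slow zoom in", "pan left", "close-up"
    if atmosphere:
        parts.append(atmosphere)  # e.g. "rain outside"
    if silent:
        parts.append("no talking, no dialogue")
    return ", ".join(parts)
```

Calling it with a subject, `camera="slow zoom in"`, `atmosphere="rain outside"`, and `silent=True` yields one comma-separated prompt string ready for `--prompt`.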

**Note:** Veo requires a paid tier. ~$0.40/sec standard, ~$0.15/sec fast.

## Music Video from Image + Audio

### Overview
1. Start with character image + audio track (e.g., from Suno)
2. Transcribe audio to get timestamps
3. Generate clip 1 from image (veo-3.1)
4. Extend each subsequent clip from previous (maintains continuity)
5. Stitch clips + overlay audio with ffmpeg

### Step 1: Transcribe audio for timing
```bash
whisper-ctranslate2 "song.mp3" --model large-v3 --output_dir /tmp --output_format srt
```
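
The SRT file written by the command above carries one `start --> end` timestamp pair per cue. A minimal sketch of reading those pairs into seconds follows, so clip boundaries can be lined up with the lyrics; the function name `srt_cue_times` is hypothetical, and the parser assumes the standard `HH:MM:SS,mmm` SRT timestamp layout.

```python
import re

# Matches one SRT timestamp such as 00:01:02,250 (a dot separator is tolerated too).
SRT_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def srt_cue_times(srt_text: str) -> list[tuple[float, float]]:
    """Return (start_seconds, end_seconds) for each cue in an SRT document."""
    cues = []
    for line in srt_text.splitlines():
        if "-->" not in line:
            continue  # skip cue numbers, text lines, and blank lines
        stamps = SRT_TIME.findall(line)
        if len(stamps) != 2:
            continue  # malformed timing line
        start, end = (
            int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000
            for h, m, s, ms in stamps
        )
        cues.append((start, end))
    return cues
```

Reading `/tmp/song.srt` and passing its text to this function gives a list of cue intervals that can be grouped into roughly 8-second scenes.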

### Step 2: Generate first clip from image
```python
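import time
from google import genai
from google.genai import types
client = genai.Client()  # picks up the API key from the environment
# Use veo-3.1 (required for extend feature); img_data is the character image as bytes
operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",
    image=types.Image(image_bytes=img_data, mime_type="image/jpeg"),
    prompt="character description, scene action, no talking",
)
while not operation.done:  # generation is a long-running operation: poll until done
    time.sleep(10)
    operation = client.operations.get(operation)
video1 = operation.response.generated_videos[0]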
```

### Step 3: Extend from previous clip
```python
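import time
operation = client.models.generate_videos(
    model="veo-3.1-generate-preview",
    video=previous_video.video,  # Pass previous video object
    prompt="next scene description, continuous action, no talking",
)
while not operation.done:  # poll the long-running operation until it completes
    time.sleep(10)
    operation = client.operations.get(operation)
previous_video = operation.response.generated_videos[0]  # input for the next extension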
```

### Step 4: Stitch clips + add audio
```bash
# Create concat list: one "file '<name>'" line per clip, in filename order
for f in clip_*.mp4; do printf "file '%s'\n" "$f"; done > concat.txt

# Stitch video clips
ffmpeg -f concat -safe 0 -i concat.txt -c copy combined.mp4

# Add audio track
ffmpeg -i combined.mp4 -i song.mp3 -c:v copy -c:a aac -map 0:v -map 1:a final.mp4
```
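
If the clips are produced from Python, the concat list can be built there as well. The sketch below generates the text that the ffmpeg concat demuxer expects, one `file '<path>'` line per clip; the helper name `concat_list_text` is hypothetical, and the quote escaping follows the ffmpeg convention of closing the quote, adding an escaped quote, and reopening.

```python
def concat_list_text(clips):
    """Return concat-demuxer list text with one line per clip, in the given order."""
    lines = []
    for clip in clips:
        # A single quote inside a quoted path is written as '\'' for ffmpeg.
        escaped = str(clip).replace("'", "'\\''")
        lines.append(f"file '{escaped}'\n")
    return "".join(lines)
```

Writing the result with `pathlib.Path("concat.txt").write_text(concat_list_text(sorted(clips)))` produces the input file for the stitch command above.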

### Cost estimate
- ~8 sec per clip × $0.40/sec = $3.20/clip
- 4-min song ≈ 30 clips ≈ $96
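
The arithmetic above generalizes to any song length and pricing tier. A minimal sketch follows, assuming every clip is billed for its full length and using the approximate per-second prices quoted earlier as defaults; `estimate_cost` is a hypothetical name.

```python
import math


def estimate_cost(song_seconds, clip_seconds=8.0, price_per_second=0.40):
    """Return (clip_count, total_dollars) for covering a song with fixed-length clips."""
    clips = math.ceil(song_seconds / clip_seconds)  # a partial final clip still counts
    total = round(clips * clip_seconds * price_per_second, 2)
    return clips, total
```

For a 4-minute song, `estimate_cost(240)` gives 30 clips and 96.0 dollars, matching the estimate above; passing `price_per_second=0.15` gives the fast-tier figure of 36.0 dollars.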

## Audio Generation

- **Music:** Use Suno (external service)
- **Speech:** Gemini 2.5 TTS (Flash or Pro) - TBD script

## API Key

Uses `GEMINI_API_KEY` env var, or pass `--api-key KEY`.
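
A sketch of how a script might resolve the key from those two sources is shown below. The precedence (flag first, then environment variable) and the function name `resolve_api_key` are assumptions; the source does not say which one the bundled scripts prefer.

```python
import argparse
import os


def resolve_api_key(argv=None, env=None):
    """Return the key from --api-key if given, otherwise from GEMINI_API_KEY."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--api-key", default=None)
    args, _unknown = parser.parse_known_args(argv)  # ignore the script's other flags
    key = args.api_key or env.get("GEMINI_API_KEY")
    if not key:
        raise SystemExit("Set GEMINI_API_KEY or pass --api-key KEY")
    return key
```

Preferring the environment variable keeps the key out of shell history and process listings, so the flag is best reserved for one-off runs.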

Related Skills

generational-agent-succession

from diegosouzapw/awesome-omni-skill

Parallel agent swarms with generational succession. Combines agent-architect's multi-agent parallelism with automatic succession when agents degrade. Each parallel agent gets fresh context through controlled handoffs while maintaining accumulated wisdom.

social-media-scheduler

from diegosouzapw/awesome-omni-skill

Generate a full week of social media content for any topic. Outputs platform-optimized posts for Twitter/X, LinkedIn, and Instagram with hashtags and posting times.

social-media-manager

from diegosouzapw/awesome-omni-skill

Specialist social media agent for multiple companies (multi-tenant). Creates weekly strategies, manages brand profiles, and generates content (text and image) in bulk for manual publishing.

Media Uploader - R2/S3 with video download

from diegosouzapw/awesome-omni-skill

Upload files or download videos from popular platforms (YouTube, Vimeo, Bilibili, etc.) and upload to Cloudflare R2, AWS S3, or any S3-compatible storage with secure presigned download links.

marketing-social-media

from diegosouzapw/awesome-omni-skill

Sustainable social media marketing and paid social: content systems, community management, influencer ops, social commerce, and attribution (2026).

instagram-social-media

from diegosouzapw/awesome-omni-skill

Acts as an Instagram social media specialist, creating content closely aligned with the brand's identity. Use this skill whenever the user wants to create posts, stories, captions, or strategies for Instagram.

Image Generation

from diegosouzapw/awesome-omni-skill

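AI image generation and editing built on Nano Banana (Gemini Image), covering text-to-image, image-to-image, and image editing. Suited to creative design, marketing assets, social media content, presentation illustrations, and similar scenarios. Supports multiple styles, high-resolution output (up to 4K), text rendering, and character consistency.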

ffmpeg-media

from diegosouzapw/awesome-omni-skill

FFmpeg media processing. Video/audio transcoding, stream manipulation, and filter graphs.

cv181x-media

from diegosouzapw/awesome-omni-skill

Expert guide for CV181X/CV182X/CV180X (SG200X) multimedia development using CVI MPI API. Use this skill when working with: VI (video input/camera/ISP), VPSS (video processing/scaling/crop), VENC (H.264/H.265/JPEG encoding), VDEC (decoding), VB (video buffer pools), SYS binding, or any CVI_* API calls. Covers camera pipeline setup, offline VPSS processing, VB pool planning, and error diagnosis (ERR_VPSS_NOBUF, ERR_VB_NOBUF). API details in references/.

athlete-social-media-manager

from diegosouzapw/awesome-omni-skill

Create brand-safe content for athletes. Personal branding strategy, engagement optimization, crisis communication, sponsor integration.

ai-video-generation

from diegosouzapw/awesome-omni-skill

Generate AI videos with Google Veo, Seedance, Wan, Grok and 40+ models via inference.sh CLI. Models: Veo 3.1, Veo 3, Seedance 1.5 Pro, Wan 2.5, Grok Imagine Video, OmniHuman, Fabric, HunyuanVideo. Capabilities: text-to-video, image-to-video, lipsync, avatar animation, video upscaling, foley sound. Use for: social media videos, marketing content, explainer videos, product demos, AI avatars. Triggers: video generation, ai video, text to video, image to video, veo, animate image, video from image, ai animation, video generator, generate video, t2v, i2v, ai video maker, create video with ai, runway alternative, pika alternative, sora alternative, kling alternative

ai-generation-client

from diegosouzapw/awesome-omni-skill

External AI API integration with retry logic, rate limiting, content safety detection, and multi-turn conversation support for image generation.