capy-video-gen-skill
Multi-shot AI video generation pipeline with face identity consistency. Converts scripts or ideas into complete videos using character extraction, storyboarding, frame generation, and video assembly. 300 experiments validated, 70% face distance improvement. Use when the user asks to create a video from a script, story, idea, or wants multi-shot video with consistent characters.
Best use case
capy-video-gen-skill is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using capy-video-gen-skill should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it at `.claude/skills/capy-video-gen-skill/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How capy-video-gen-skill Compares
| Feature / Agent | capy-video-gen-skill | Standard Approach |
|---|---|---|
| Platform Support | Claude Code / Cursor / Codex | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Low (single SKILL.md file) | N/A |
Frequently Asked Questions
What does this skill do?
It converts scripts or ideas into complete multi-shot videos with consistent character faces across scenes, running a pipeline of character extraction, storyboarding, frame generation, and video assembly. The approach is validated by 300 experiments with a 70% face distance improvement.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Capy Video Gen Skill - Script-to-Video Pipeline
Generate complete multi-shot videos from scripts or ideas with consistent character faces across all scenes. Built for HappyCapy AI Gateway. 300 experiments validated, 70% face distance improvement.
## Overview
ViMax converts text scripts into full videos through an automated pipeline (sketched below):
1. Extract characters from script with detailed physical features
2. Generate front/side/back character portraits
3. Design shot-by-shot storyboard
4. Decompose each shot into first_frame, last_frame, and motion descriptions
5. Build camera tree for shot relationships
6. Generate frames with reference image selection (face identity as top priority)
7. Generate video clips from frames
8. Concatenate into final video
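A hedged sketch of that stage order in Python. Every helper below is a runnable placeholder standing in for the corresponding ViMax agent; none of these names are the real ViMax API.
```python
# Illustrative stage order only; the real implementation lives in
# pipelines/script2video_pipeline.py and the agents/ package.
import asyncio

async def extract_characters(script): return ["Alice"]                             # step 1 (placeholder)
async def generate_portraits(chars): return {c: {} for c in chars}                 # step 2
async def design_storyboard(script, chars): return [{"shot": 0}]                   # step 3
async def decompose_shot(shot): return {**shot, "ff": "", "lf": "", "motion": ""}  # step 4
def build_camera_tree(shots): return {"root": shots}                               # step 5
async def generate_frames(shots, portraits, tree): return ["frame0.png"]           # step 6
async def generate_clip(frame): return frame.replace(".png", ".mp4")               # step 7
def concatenate(clips): return "final_video.mp4"                                   # step 8

async def script_to_video(script: str) -> str:
    characters = await extract_characters(script)
    portraits = await generate_portraits(characters)
    storyboard = await design_storyboard(script, characters)
    shots = [await decompose_shot(s) for s in storyboard]
    camera_tree = build_camera_tree(shots)
    frames = await generate_frames(shots, portraits, camera_tree)
    clips = [await generate_clip(f) for f in frames]
    return concatenate(clips)

print(asyncio.run(script_to_video("INT. KITCHEN - MORNING ...")))
```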
## Installation Location
The ViMax pipeline code is at: `/home/node/a0/workspace/527fb591-1439-4b5b-ad5d-90f972773f95/workspace/tmp/ViMax/`
All commands must be run from this directory using the venv:
```bash
cd /home/node/a0/workspace/527fb591-1439-4b5b-ad5d-90f972773f95/workspace/tmp/ViMax
```
## Prerequisites
- `AI_GATEWAY_API_KEY` environment variable (auto-configured in HappyCapy)
- Python venv at `.venv/` (already set up)
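A quick way to verify both prerequisites before a run (plain shell commands, nothing ViMax-specific):
```bash
cd /home/node/a0/workspace/527fb591-1439-4b5b-ad5d-90f972773f95/workspace/tmp/ViMax
# Check the gateway key is set and the venv interpreter works
test -n "$AI_GATEWAY_API_KEY" && echo "API key: present" || echo "API key: MISSING"
.venv/bin/python --version
```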
## Quick Start
### Script-to-Video
Edit the script, requirements, and style in the entry script (illustrative values below), then run:
```bash
cd /home/node/a0/workspace/527fb591-1439-4b5b-ad5d-90f972773f95/workspace/tmp/ViMax
.venv/bin/python main_happycapy_script2video.py
```
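The entry script's exact contents are not reproduced here, but given the pipeline arguments shown under Programmatic Usage below, the values you edit are plausibly along these lines (illustrative variable names, not the file's actual layout):
```python
# Illustrative only: the real variable names in
# main_happycapy_script2video.py may differ.
script = """
INT. KITCHEN - MORNING
ALICE, early 30s, short auburn hair, green eyes, a small scar above her
left eyebrow, pours coffee and glances out the window.
"""
user_requirement = "No more than 8 shots total."
style = "Cinematic, warm lighting"
```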
### Idea-to-Video
For generating from a brief idea (auto-generates script first):
```bash
cd /home/node/a0/workspace/527fb591-1439-4b5b-ad5d-90f972773f95/workspace/tmp/ViMax
.venv/bin/python main_happycapy_idea2video.py
```
## Programmatic Usage
```python
import asyncio

from langchain.chat_models import init_chat_model

from tools.render_backend import RenderBackend
from utils.config_loader import load_config
from pipelines.script2video_pipeline import Script2VideoPipeline

config = load_config("configs/happycapy_script2video.yaml")
chat_model = init_chat_model(**config["chat_model"]["init_args"])
backend = RenderBackend.from_config(config)

pipeline = Script2VideoPipeline(
    chat_model=chat_model,
    image_generator=backend.image_generator,
    video_generator=backend.video_generator,
    working_dir=config["working_dir"],
)

# Run the pipeline
asyncio.run(pipeline(
    script="Your script here...",
    user_requirement="No more than 8 shots total.",
    style="Cinematic, warm lighting",
))
```
## Pipelines
### Script2VideoPipeline
- Input: A formatted screenplay/script with character dialogue and scene descriptions
- Output: Concatenated video at `{working_dir}/final_video.mp4`
- Config: `configs/happycapy_script2video.yaml`
### Idea2VideoPipeline
- Input: A brief idea/concept (1-3 paragraphs)
- Output: Auto-generates a script, then produces video
- Config: `configs/happycapy_idea2video.yaml`
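For programmatic use of the idea pipeline, a hedged sketch by symmetry with the Script2VideoPipeline example above; the module path and the `idea` keyword are assumptions, not a verified API:
```python
import asyncio
from langchain.chat_models import init_chat_model
from tools.render_backend import RenderBackend
from utils.config_loader import load_config
from pipelines.idea2video_pipeline import Idea2VideoPipeline  # assumed module path

config = load_config("configs/happycapy_idea2video.yaml")
chat_model = init_chat_model(**config["chat_model"]["init_args"])
backend = RenderBackend.from_config(config)

pipeline = Idea2VideoPipeline(
    chat_model=chat_model,
    image_generator=backend.image_generator,
    video_generator=backend.video_generator,
    working_dir=config["working_dir"],
)

asyncio.run(pipeline(
    idea="A capybara barista opens a riverside coffee stand at dawn.",  # assumed kwarg
    user_requirement="No more than 6 shots.",
    style="Cozy, golden-hour lighting",
))
```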
## Configuration
HappyCapy configs at `configs/happycapy_script2video.yaml`:
```yaml
chat_model:
  init_args:
    model: gpt-4.1
    model_provider: openai
    api_key: ${AI_GATEWAY_API_KEY}
    base_url: https://ai-gateway.happycapy.ai/api/v1/openai/v1

image_generator:
  class_path: tools.ImageGeneratorHappyCapyAPI
  init_args:
    api_key: ${AI_GATEWAY_API_KEY}
    model: google/gemini-3.1-flash-image-preview

video_generator:
  class_path: tools.VideoGeneratorHappyCapyAPI
  init_args:
    api_key: ${AI_GATEWAY_API_KEY}
    model: google/veo-3.1-generate-preview

working_dir: .working_dir/script2video
```
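The `${AI_GATEWAY_API_KEY}` placeholders are expanded from the environment at load time. A minimal sketch of that substitution, assuming `load_config` works roughly like this (the real implementation in `utils/config_loader.py` is not reproduced here):
```python
import os
import re

import yaml

def load_config_sketch(path: str) -> dict:
    """Load YAML after replacing every ${NAME} with os.environ['NAME']."""
    text = open(path, encoding="utf-8").read()
    text = re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), text)
    return yaml.safe_load(text)
```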
## Key Components
### Agents (AI Processing)
| Agent | File | Purpose |
|-------|------|---------|
| CharacterExtractor | `agents/character_extractor.py` | Extract characters with static/dynamic features from script |
| CharacterPortraitsGenerator | `agents/character_portraits_generator.py` | Generate front/side/back portraits for each character |
| StoryboardArtist | `agents/storyboard_artist.py` | Design shot-by-shot storyboard with first/last frames and motion |
| ReferenceImageSelector | `agents/reference_image_selector.py` | Select best reference images for each frame (face identity #1 priority) |
| CameraImageGenerator | `agents/camera_image_generator.py` | Build camera trees and generate transition videos |
| BestImageSelector | `agents/best_image_selector.py` | Select best generated image from candidates |
| Screenwriter | `agents/screenwriter.py` | Generate scripts from ideas |
### Tools (Generation Backends)
| Tool | File | Purpose |
|------|------|---------|
| ImageGeneratorHappyCapyAPI | `tools/image_generator_happycapy_api.py` | Image generation via HappyCapy Gateway (Gemini) |
| VideoGeneratorHappyCapyAPI | `tools/video_generator_happycapy_api.py` | Video generation via HappyCapy Gateway (Veo) |
| RenderBackend | `tools/render_backend.py` | Factory for instantiating generators from config |
### Interfaces (Data Models)
- `CharacterInScene` - Character with identifier, static_features, dynamic_features
- `ShotDescription` - Shot with ff_desc, lf_desc, motion_desc, variation_type
- `Camera` - Camera with parent-child relationships
- `Frame` - Frame with shot_idx, frame_type, visible characters
- `ImageOutput` / `VideoOutput` - Generation outputs with save methods
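A rough sketch of these shapes as Python dataclasses, inferred from the descriptions above; the field names come from this list, but types and defaults are assumptions (the authoritative definitions live in the ViMax interfaces package):
```python
from dataclasses import dataclass, field

@dataclass
class CharacterInScene:
    identifier: str
    static_features: str        # permanent appearance (face, build, marks)
    dynamic_features: str       # per-scene details (clothing, expression)

@dataclass
class ShotDescription:
    ff_desc: str                # first-frame description
    lf_desc: str                # last-frame description
    motion_desc: str            # motion between the two frames
    variation_type: str         # assumed values like "small"/"medium"/"large"

@dataclass
class Frame:
    shot_idx: int
    frame_type: str             # e.g. "first" or "last" (assumed)
    visible_characters: list[str] = field(default_factory=list)
```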
## Face Identity Consistency (CRITICAL)
This pipeline includes face identity improvements validated through 257 experiments (70% improvement in face distance, from 0.74 to 0.22):
### Built-In Protections
1. **Reference Image Selector**: Face identity is the #1 priority when selecting reference images. The front-view portrait is always included when a character's face is visible.
2. **Character Portraits**: Enhanced prompts generate identity-critical details (exact nose shape, eye spacing, jawline, distinguishing marks) for cross-scene recognition.
3. **Video Prompt Face Lock**: Every video generation prompt is prepended with a face identity instruction requiring the character's face to remain identical to the starting frame throughout the clip (illustrated below).
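As an illustration of the face lock, a prefix of this shape would be prepended to the motion prompt; the wording below is a sketch, not the exact string ViMax uses:
```python
# Illustrative wording; the actual face-lock instruction is defined in the
# pipeline code.
FACE_LOCK_PREFIX = (
    "The character's face must remain identical to the starting frame "
    "throughout the clip: same facial structure, features, and proportions. "
)

def build_video_prompt(motion_desc: str) -> str:
    # Prepend the identity instruction to every video generation prompt.
    return FACE_LOCK_PREFIX + motion_desc
```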
### Best Practices When Using ViMax
- **Hyper-detailed character descriptions**: Include ethnicity, age, hair texture/style/color, eye shape, facial hair, glasses, skin tone, build, and distinguishing marks in your script's character introductions
- **Extreme close-up shots**: Include at least one extreme close-up per character to anchor identity
- **Consistent lighting**: Specify similar lighting across scenes to prevent face drift
- **User-provided reference photos**: Place photos in the working directory and pass them as `character_portraits_registry` to skip AI portrait generation
### What Does NOT Work
- Complex prompt engineering (viseme morphing, phoneme anchoring) does not improve face identity
- Clever prompts in general: simple, direct prompts with detailed physical descriptions consistently outperform them
- Lip-sync to external audio is NOT possible (Veo generates its own internal audio)
See `FACE_IDENTITY_GUIDE.md` in the ViMax directory for full details.
## Output Structure
After a run, the working directory contains:
```
.working_dir/script2video/
    characters.json                     # Extracted characters
    character_portraits_registry.json   # Portrait paths registry
    character_portraits/                # Generated portraits
        0_CharacterName/
            front.png
            side.png
            back.png
    storyboard.json                     # Shot descriptions
    camera_tree.json                    # Camera relationships
    shots/
        0/
            shot_description.json
            first_frame.png
            last_frame.png              # (if medium/large variation)
            video.mp4
        1/
            ...
    final_video.mp4                     # Final concatenated output
```
## Customization
### Using Your Own Reference Photos
To use real photos instead of AI-generated portraits:
```python
# Build a portrait registry pointing to your photos
character_portraits_registry = {
    "Alice": {
        "front": {"path": "/path/to/alice_front.png", "description": "Front view of Alice"},
        "side": {"path": "/path/to/alice_side.png", "description": "Side view of Alice"},
        "back": {"path": "/path/to/alice_back.png", "description": "Back view of Alice"},
    },
}

# Pass to the pipeline (skips portrait generation)
await pipeline(
    script=script,
    user_requirement=user_requirement,
    style=style,
    character_portraits_registry=character_portraits_registry,
)
```
### Changing Models
Edit the YAML config to use different models:
- Image: `google/gemini-3.1-flash-image-preview` (recommended for face identity)
- Video: `google/veo-3.1-generate-preview` (recommended) or `openai/sora-2`
- Chat: `gpt-4.1` (recommended) or any OpenAI-compatible model
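For example, switching the video backend to Sora only requires changing the model id in the `video_generator` block (all other keys as in the sample config above):
```yaml
video_generator:
  class_path: tools.VideoGeneratorHappyCapyAPI
  init_args:
    api_key: ${AI_GATEWAY_API_KEY}
    model: openai/sora-2   # was google/veo-3.1-generate-preview
```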
## Troubleshooting
### "No module named 'tools'" or similar import errors
Run from the ViMax root directory:
```bash
cd /home/node/a0/workspace/527fb591-1439-4b5b-ad5d-90f972773f95/workspace/tmp/ViMax
.venv/bin/python main_happycapy_script2video.py
```
### API rate limit errors
Reduce `max_requests_per_minute` in the YAML config.
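Where this key sits is not shown in the sample config above, so the placement below is hypothetical; only the value matters:
```yaml
# Hypothetical placement; locate max_requests_per_minute in your actual config.
max_requests_per_minute: 10   # lower until rate-limit errors stop
```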
### Face identity drift in generated videos
- Add more physical detail to character descriptions in your script
- Use user-provided reference photos instead of AI-generated portraits
- Include extreme close-up shots for important characters
- Keep lighting consistent across scenes
Related Skills
happycapy-skill-creator
Automate HappyCapy skill creation by finding and adapting existing skills from anthropics/skills repository. Handles environment constraints (Python 3.11, Node.js 24, no Docker). Use when user wants to create or adapt skills for specific tasks.
happycapy-feishu
Installs and authorizes the Feishu (Lark) MCP for HappyCapy, letting Claude directly operate Feishu messages, documents, Bitable tables, calendars, and more. Use this skill whenever the user mentions installing the Feishu MCP, configuring Feishu, connecting Feishu, Feishu MCP setup, connect feishu/lark, re-authorizing Feishu, Feishu token expiry, lark mcp failures, and similar scenarios.
capy-cortex
Autonomous learning system - learns from mistakes, reflects on sessions, and gets smarter over time. The AI brain.
video-comparer
This skill should be used when comparing two videos to analyze compression results or quality differences. Generates interactive HTML reports with quality metrics (PSNR, SSIM) and frame-by-frame visual comparisons. Triggers when users mention "compare videos", "video quality", "compression analysis", "before/after compression", or request quality assessment of compressed videos.
video-enhancement
AI Video Enhancement - Upscale video resolution, improve quality, denoise, sharpen, enhance low-quality videos to HD/4K. Supports local video files, remote URLs (YouTube, Bilibili), auto-download, real-time progress tracking.
ai-avatar-video
Create AI avatar and talking head videos with OmniHuman, Fabric, PixVerse via inference.sh CLI. Models: OmniHuman 1.5, OmniHuman 1.0, Fabric 1.0, PixVerse Lipsync. Capabilities: audio-driven avatars, lipsync videos, talking head generation, virtual presenters. Use for: AI presenters, explainer videos, virtual influencers, dubbing, marketing videos. Triggers: ai avatar, talking head, lipsync, avatar video, virtual presenter, ai spokesperson, audio driven video, heygen alternative, synthesia alternative, talking avatar, lip sync, video avatar, ai presenter, digital human
video-prompting-guide
Best practices and techniques for writing effective AI video generation prompts. Covers: Veo, Seedance, Wan, Grok, Kling, Runway, Pika, Sora prompting strategies. Learn: shot types, camera movements, lighting, pacing, style keywords, negative prompts. Use for: improving video quality, getting consistent results, professional video prompts. Triggers: video prompt, how to prompt video, veo prompts, video generation tips, better ai video, video prompt engineering, video prompt guide, video prompt template, ai video tips, video prompt best practices, video prompt examples, cinematography prompts
image-to-video
Still-to-video conversion guide: model selection, motion prompting, and camera movement. Covers Wan 2.5 i2v, Seedance, Fabric, Grok Video with when to use each. Use for: animating images, creating video from stills, adding motion, product animations. Triggers: image to video, i2v, animate image, still to video, add motion to image, image animation, photo to video, animate still, wan i2v, image2video, bring image to life, animate photo, motion from image
ai-marketing-videos
Create AI marketing videos for ads, promos, product launches, and brand content. Models: Veo, Seedance, Wan, FLUX for visuals, Kokoro for voiceover. Types: product demos, testimonials, explainers, social ads, brand videos. Use for: Facebook ads, YouTube ads, product launches, brand awareness. Triggers: marketing video, ad video, promo video, commercial, brand video, product video, explainer video, ad creative, video ad, facebook ad video, youtube ad, instagram ad, tiktok ad, promotional video, launch video
p-video
Generate videos with Pruna P-Video and WAN models via inference.sh CLI. Models: P-Video, WAN-T2V, WAN-I2V. Capabilities: text-to-video, image-to-video, audio support, 720p/1080p, fast inference. Pruna optimizes models for speed without quality loss. Triggers: pruna video, p-video, pruna ai video, fast video generation, optimized video, wan t2v, wan i2v, economic video generation, cheap video generation, pruna text to video, pruna image to video
ai-video-generation
Generate AI videos with Google Veo, Seedance, Wan, Grok and 40+ models via inference.sh CLI. Models: Veo 3.1, Veo 3, Seedance 1.5 Pro, Wan 2.5, Grok Imagine Video, OmniHuman, Fabric, HunyuanVideo. Capabilities: text-to-video, image-to-video, lipsync, avatar animation, video upscaling, foley sound. Use for: social media videos, marketing content, explainer videos, product demos, AI avatars. Triggers: video generation, ai video, text to video, image to video, veo, animate image, video from image, ai animation, video generator, generate video, t2v, i2v, ai video maker, create video with ai, runway alternative, pika alternative, sora alternative, kling alternative
video-processor
Process video files with audio extraction, format conversion (mp4, webm), and Whisper transcription. Use when user mentions video conversion, audio extraction, transcription, mp4, webm, ffmpeg, or whisper transcription.