capy-video-gen-skill
Multi-shot AI video generation pipeline with face identity consistency. Converts scripts or ideas into complete videos using character extraction, storyboarding, frame generation, and video assembly. 300 experiments validated, 70% face distance improvement. Use when the user asks to create a video from a script, story, idea, or wants multi-shot video with consistent characters.
Best use case
capy-video-gen-skill is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using capy-video-gen-skill should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it at `.claude/skills/capy-video-gen-skill/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How capy-video-gen-skill Compares
| Feature / Agent | capy-video-gen-skill | Standard Approach |
|---|---|---|
| Platform Support | Claude Code / Cursor / Codex | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Low (single SKILL.md file) | N/A |
Frequently Asked Questions
What does this skill do?
It converts scripts or ideas into complete multi-shot videos with consistent character faces across scenes, running a pipeline of character extraction, storyboarding, frame generation, and video assembly. The approach is validated by 300 experiments with a 70% face distance improvement.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Capy Video Gen Skill - Script-to-Video Pipeline
Generate complete multi-shot videos from scripts or ideas with consistent character faces across all scenes. Built for HappyCapy AI Gateway. 300 experiments validated, 70% face distance improvement.
## Overview
ViMax converts text scripts into full videos through an automated pipeline (sketched below):
1. Extract characters from script with detailed physical features
2. Generate front/side/back character portraits
3. Design shot-by-shot storyboard
4. Decompose each shot into first_frame, last_frame, and motion descriptions
5. Build camera tree for shot relationships
6. Generate frames with reference image selection (face identity as top priority)
7. Generate video clips from frames
8. Concatenate into final video
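A hedged sketch of that stage order in Python. Every helper below is a runnable placeholder standing in for the corresponding ViMax agent; none of these names are the real ViMax API.
```python
# Illustrative stage order only; the real implementation lives in
# pipelines/script2video_pipeline.py and the agents/ package.
import asyncio

async def extract_characters(script): return ["Alice"]                             # step 1 (placeholder)
async def generate_portraits(chars): return {c: {} for c in chars}                 # step 2
async def design_storyboard(script, chars): return [{"shot": 0}]                   # step 3
async def decompose_shot(shot): return {**shot, "ff": "", "lf": "", "motion": ""}  # step 4
def build_camera_tree(shots): return {"root": shots}                               # step 5
async def generate_frames(shots, portraits, tree): return ["frame0.png"]           # step 6
async def generate_clip(frame): return frame.replace(".png", ".mp4")               # step 7
def concatenate(clips): return "final_video.mp4"                                   # step 8

async def script_to_video(script: str) -> str:
    characters = await extract_characters(script)
    portraits = await generate_portraits(characters)
    storyboard = await design_storyboard(script, characters)
    shots = [await decompose_shot(s) for s in storyboard]
    camera_tree = build_camera_tree(shots)
    frames = await generate_frames(shots, portraits, camera_tree)
    clips = [await generate_clip(f) for f in frames]
    return concatenate(clips)

print(asyncio.run(script_to_video("INT. KITCHEN - MORNING ...")))
```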
## Installation Location
The ViMax pipeline code is at: `/home/node/a0/workspace/527fb591-1439-4b5b-ad5d-90f972773f95/workspace/tmp/ViMax/`
All commands must be run from this directory using the venv:
```bash
cd /home/node/a0/workspace/527fb591-1439-4b5b-ad5d-90f972773f95/workspace/tmp/ViMax
```
## Prerequisites
- `AI_GATEWAY_API_KEY` environment variable (auto-configured in HappyCapy)
- Python venv at `.venv/` (already set up)
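A quick way to verify both prerequisites before a run (plain shell commands, nothing ViMax-specific):
```bash
cd /home/node/a0/workspace/527fb591-1439-4b5b-ad5d-90f972773f95/workspace/tmp/ViMax
# Check the gateway key is set and the venv interpreter works
test -n "$AI_GATEWAY_API_KEY" && echo "API key: present" || echo "API key: MISSING"
.venv/bin/python --version
```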
## Quick Start
### Script-to-Video
Edit the script, requirements, and style in the entry script (illustrative values below), then run:
```bash
cd /home/node/a0/workspace/527fb591-1439-4b5b-ad5d-90f972773f95/workspace/tmp/ViMax
.venv/bin/python main_happycapy_script2video.py
```
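The entry script's exact contents are not reproduced here, but given the pipeline arguments shown under Programmatic Usage below, the values you edit are plausibly along these lines (illustrative variable names, not the file's actual layout):
```python
# Illustrative only: the real variable names in
# main_happycapy_script2video.py may differ.
script = """
INT. KITCHEN - MORNING
ALICE, early 30s, short auburn hair, green eyes, a small scar above her
left eyebrow, pours coffee and glances out the window.
"""
user_requirement = "No more than 8 shots total."
style = "Cinematic, warm lighting"
```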
### Idea-to-Video
For generating from a brief idea (auto-generates script first):
```bash
cd /home/node/a0/workspace/527fb591-1439-4b5b-ad5d-90f972773f95/workspace/tmp/ViMax
.venv/bin/python main_happycapy_idea2video.py
```
## Programmatic Usage
```python
import asyncio

from langchain.chat_models import init_chat_model

from tools.render_backend import RenderBackend
from utils.config_loader import load_config
from pipelines.script2video_pipeline import Script2VideoPipeline

config = load_config("configs/happycapy_script2video.yaml")
chat_model = init_chat_model(**config["chat_model"]["init_args"])
backend = RenderBackend.from_config(config)

pipeline = Script2VideoPipeline(
    chat_model=chat_model,
    image_generator=backend.image_generator,
    video_generator=backend.video_generator,
    working_dir=config["working_dir"],
)

# Run the pipeline
asyncio.run(pipeline(
    script="Your script here...",
    user_requirement="No more than 8 shots total.",
    style="Cinematic, warm lighting",
))
```
## Pipelines
### Script2VideoPipeline
- Input: A formatted screenplay/script with character dialogue and scene descriptions
- Output: Concatenated video at `{working_dir}/final_video.mp4`
- Config: `configs/happycapy_script2video.yaml`
### Idea2VideoPipeline
- Input: A brief idea/concept (1-3 paragraphs)
- Output: Auto-generates a script, then produces video
- Config: `configs/happycapy_idea2video.yaml`
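For programmatic use of the idea pipeline, a hedged sketch by symmetry with the Script2VideoPipeline example above; the module path and the `idea` keyword are assumptions, not a verified API:
```python
import asyncio
from langchain.chat_models import init_chat_model
from tools.render_backend import RenderBackend
from utils.config_loader import load_config
from pipelines.idea2video_pipeline import Idea2VideoPipeline  # assumed module path

config = load_config("configs/happycapy_idea2video.yaml")
chat_model = init_chat_model(**config["chat_model"]["init_args"])
backend = RenderBackend.from_config(config)

pipeline = Idea2VideoPipeline(
    chat_model=chat_model,
    image_generator=backend.image_generator,
    video_generator=backend.video_generator,
    working_dir=config["working_dir"],
)

asyncio.run(pipeline(
    idea="A capybara barista opens a riverside coffee stand at dawn.",  # assumed kwarg
    user_requirement="No more than 6 shots.",
    style="Cozy, golden-hour lighting",
))
```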
## Configuration
HappyCapy configs at `configs/happycapy_script2video.yaml`:
```yaml
chat_model:
  init_args:
    model: gpt-4.1
    model_provider: openai
    api_key: ${AI_GATEWAY_API_KEY}
    base_url: https://ai-gateway.happycapy.ai/api/v1/openai/v1

image_generator:
  class_path: tools.ImageGeneratorHappyCapyAPI
  init_args:
    api_key: ${AI_GATEWAY_API_KEY}
    model: google/gemini-3.1-flash-image-preview

video_generator:
  class_path: tools.VideoGeneratorHappyCapyAPI
  init_args:
    api_key: ${AI_GATEWAY_API_KEY}
    model: google/veo-3.1-generate-preview

working_dir: .working_dir/script2video
```
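The `${AI_GATEWAY_API_KEY}` placeholders are expanded from the environment at load time. A minimal sketch of that substitution, assuming `load_config` works roughly like this (the real implementation in `utils/config_loader.py` is not reproduced here):
```python
import os
import re

import yaml

def load_config_sketch(path: str) -> dict:
    """Load YAML after replacing every ${NAME} with os.environ['NAME']."""
    text = open(path, encoding="utf-8").read()
    text = re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), text)
    return yaml.safe_load(text)
```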
## Key Components
### Agents (AI Processing)
| Agent | File | Purpose |
|-------|------|---------|
| CharacterExtractor | `agents/character_extractor.py` | Extract characters with static/dynamic features from script |
| CharacterPortraitsGenerator | `agents/character_portraits_generator.py` | Generate front/side/back portraits for each character |
| StoryboardArtist | `agents/storyboard_artist.py` | Design shot-by-shot storyboard with first/last frames and motion |
| ReferenceImageSelector | `agents/reference_image_selector.py` | Select best reference images for each frame (face identity #1 priority) |
| CameraImageGenerator | `agents/camera_image_generator.py` | Build camera trees and generate transition videos |
| BestImageSelector | `agents/best_image_selector.py` | Select best generated image from candidates |
| Screenwriter | `agents/screenwriter.py` | Generate scripts from ideas |
### Tools (Generation Backends)
| Tool | File | Purpose |
|------|------|---------|
| ImageGeneratorHappyCapyAPI | `tools/image_generator_happycapy_api.py` | Image generation via HappyCapy Gateway (Gemini) |
| VideoGeneratorHappyCapyAPI | `tools/video_generator_happycapy_api.py` | Video generation via HappyCapy Gateway (Veo) |
| RenderBackend | `tools/render_backend.py` | Factory for instantiating generators from config |
### Interfaces (Data Models)
- `CharacterInScene` - Character with identifier, static_features, dynamic_features
- `ShotDescription` - Shot with ff_desc, lf_desc, motion_desc, variation_type
- `Camera` - Camera with parent-child relationships
- `Frame` - Frame with shot_idx, frame_type, visible characters
- `ImageOutput` / `VideoOutput` - Generation outputs with save methods
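A rough sketch of these shapes as Python dataclasses, inferred from the descriptions above; the field names come from this list, but types and defaults are assumptions (the authoritative definitions live in the ViMax interfaces package):
```python
from dataclasses import dataclass, field

@dataclass
class CharacterInScene:
    identifier: str
    static_features: str        # permanent appearance (face, build, marks)
    dynamic_features: str       # per-scene details (clothing, expression)

@dataclass
class ShotDescription:
    ff_desc: str                # first-frame description
    lf_desc: str                # last-frame description
    motion_desc: str            # motion between the two frames
    variation_type: str         # assumed values like "small"/"medium"/"large"

@dataclass
class Frame:
    shot_idx: int
    frame_type: str             # e.g. "first" or "last" (assumed)
    visible_characters: list[str] = field(default_factory=list)
```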
## Face Identity Consistency (CRITICAL)
This pipeline includes face identity improvements validated through 257 experiments (70% improvement in face distance, from 0.74 to 0.22):
### Built-In Protections
1. **Reference Image Selector**: Face identity is the #1 priority when selecting reference images. The front-view portrait is always included when a character's face is visible.
2. **Character Portraits**: Enhanced prompts generate identity-critical details (exact nose shape, eye spacing, jawline, distinguishing marks) for cross-scene recognition.
3. **Video Prompt Face Lock**: Every video generation prompt is prepended with a face identity instruction requiring the character's face to remain identical to the starting frame throughout the clip (illustrated below).
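As an illustration of the face lock, a prefix of this shape would be prepended to the motion prompt; the wording below is a sketch, not the exact string ViMax uses:
```python
# Illustrative wording; the actual face-lock instruction is defined in the
# pipeline code.
FACE_LOCK_PREFIX = (
    "The character's face must remain identical to the starting frame "
    "throughout the clip: same facial structure, features, and proportions. "
)

def build_video_prompt(motion_desc: str) -> str:
    # Prepend the identity instruction to every video generation prompt.
    return FACE_LOCK_PREFIX + motion_desc
```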
### Best Practices When Using ViMax
- **Hyper-detailed character descriptions**: Include ethnicity, age, hair texture/style/color, eye shape, facial hair, glasses, skin tone, build, and distinguishing marks in your script's character introductions
- **Extreme close-up shots**: Include at least one extreme close-up per character to anchor identity
- **Consistent lighting**: Specify similar lighting across scenes to prevent face drift
- **User-provided reference photos**: Place photos in the working directory and pass them as `character_portraits_registry` to skip AI portrait generation
### What Does NOT Work
- Complex prompt engineering (viseme morphing, phoneme anchoring) does not improve face identity
- Clever prompts in general: simple, direct prompts with detailed physical descriptions consistently outperform them
- Lip-sync to external audio is NOT possible (Veo generates its own internal audio)
See `FACE_IDENTITY_GUIDE.md` in the ViMax directory for full details.
## Output Structure
After a run, the working directory contains:
```
.working_dir/script2video/
    characters.json                     # Extracted characters
    character_portraits_registry.json   # Portrait paths registry
    character_portraits/                # Generated portraits
        0_CharacterName/
            front.png
            side.png
            back.png
    storyboard.json                     # Shot descriptions
    camera_tree.json                    # Camera relationships
    shots/
        0/
            shot_description.json
            first_frame.png
            last_frame.png              # (if medium/large variation)
            video.mp4
        1/
            ...
    final_video.mp4                     # Final concatenated output
```
## Customization
### Using Your Own Reference Photos
To use real photos instead of AI-generated portraits:
```python
# Build a portrait registry pointing to your photos
character_portraits_registry = {
    "Alice": {
        "front": {"path": "/path/to/alice_front.png", "description": "Front view of Alice"},
        "side": {"path": "/path/to/alice_side.png", "description": "Side view of Alice"},
        "back": {"path": "/path/to/alice_back.png", "description": "Back view of Alice"},
    },
}

# Pass to the pipeline (skips portrait generation)
await pipeline(
    script=script,
    user_requirement=user_requirement,
    style=style,
    character_portraits_registry=character_portraits_registry,
)
```
### Changing Models
Edit the YAML config to use different models:
- Image: `google/gemini-3.1-flash-image-preview` (recommended for face identity)
- Video: `google/veo-3.1-generate-preview` (recommended) or `openai/sora-2`
- Chat: `gpt-4.1` (recommended) or any OpenAI-compatible model
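For example, switching the video backend to Sora only requires changing the model id in the `video_generator` block (all other keys as in the sample config above):
```yaml
video_generator:
  class_path: tools.VideoGeneratorHappyCapyAPI
  init_args:
    api_key: ${AI_GATEWAY_API_KEY}
    model: openai/sora-2   # was google/veo-3.1-generate-preview
```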
## Troubleshooting
### "No module named 'tools'" or similar import errors
Run from the ViMax root directory:
```bash
cd /home/node/a0/workspace/527fb591-1439-4b5b-ad5d-90f972773f95/workspace/tmp/ViMax
.venv/bin/python main_happycapy_script2video.py
```
### API rate limit errors
Reduce `max_requests_per_minute` in the YAML config.
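Where this key sits is not shown in the sample config above, so the placement below is hypothetical; only the value matters:
```yaml
# Hypothetical placement; locate max_requests_per_minute in your actual config.
max_requests_per_minute: 10   # lower until rate-limit errors stop
```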
### Face identity drift in generated videos
- Add more physical detail to character descriptions in your script
- Use user-provided reference photos instead of AI-generated portraits
- Include extreme close-up shots for important characters
- Keep lighting consistent across scenes
Related Skills
happycapy-skill-creator
Automate HappyCapy skill creation by finding and adapting existing skills from anthropics/skills repository. Handles environment constraints (Python 3.11, Node.js 24, no Docker). Use when user wants to create or adapt skills for specific tasks.
happycapy-feishu
Installs and authorizes the Feishu (Lark) MCP for HappyCapy, letting Claude directly operate Feishu messages, documents, Bitable tables, calendars, and more. Use this skill whenever the user mentions installing the Feishu MCP, configuring Feishu, connecting Feishu, Feishu MCP setup, connect feishu/lark, re-authorizing Feishu, Feishu token expiry, lark mcp failures, and similar scenarios.
capy-cortex
Autonomous learning system - learns from mistakes, reflects on sessions, and gets smarter over time. The AI brain.
video-comparer
This skill should be used when comparing two videos to analyze compression results or quality differences. Generates interactive HTML reports with quality metrics (PSNR, SSIM) and frame-by-frame visual comparisons. Triggers when users mention "compare videos", "video quality", "compression analysis", "before/after compression", or request quality assessment of compressed videos.
video-enhancement
AI Video Enhancement - Upscale video resolution, improve quality, denoise, sharpen, enhance low-quality videos to HD/4K. Supports local video files, remote URLs (YouTube, Bilibili), auto-download, real-time progress tracking.
ai-avatar-video
Create AI avatar and talking head videos with OmniHuman, Fabric, PixVerse via inference.sh CLI. Models: OmniHuman 1.5, OmniHuman 1.0, Fabric 1.0, PixVerse Lipsync. Capabilities: audio-driven avatars, lipsync videos, talking head generation, virtual presenters. Use for: AI presenters, explainer videos, virtual influencers, dubbing, marketing videos. Triggers: ai avatar, talking head, lipsync, avatar video, virtual presenter, ai spokesperson, audio driven video, heygen alternative, synthesia alternative, talking avatar, lip sync, video avatar, ai presenter, digital human
video-prompting-guide
Best practices and techniques for writing effective AI video generation prompts. Covers: Veo, Seedance, Wan, Grok, Kling, Runway, Pika, Sora prompting strategies. Learn: shot types, camera movements, lighting, pacing, style keywords, negative prompts. Use for: improving video quality, getting consistent results, professional video prompts. Triggers: video prompt, how to prompt video, veo prompts, video generation tips, better ai video, video prompt engineering, video prompt guide, video prompt template, ai video tips, video prompt best practices, video prompt examples, cinematography prompts
image-to-video
Still-to-video conversion guide: model selection, motion prompting, and camera movement. Covers Wan 2.5 i2v, Seedance, Fabric, Grok Video with when to use each. Use for: animating images, creating video from stills, adding motion, product animations. Triggers: image to video, i2v, animate image, still to video, add motion to image, image animation, photo to video, animate still, wan i2v, image2video, bring image to life, animate photo, motion from image
ai-marketing-videos
Create AI marketing videos for ads, promos, product launches, and brand content. Models: Veo, Seedance, Wan, FLUX for visuals, Kokoro for voiceover. Types: product demos, testimonials, explainers, social ads, brand videos. Use for: Facebook ads, YouTube ads, product launches, brand awareness. Triggers: marketing video, ad video, promo video, commercial, brand video, product video, explainer video, ad creative, video ad, facebook ad video, youtube ad, instagram ad, tiktok ad, promotional video, launch video
p-video
Generate videos with Pruna P-Video and WAN models via inference.sh CLI. Models: P-Video, WAN-T2V, WAN-I2V. Capabilities: text-to-video, image-to-video, audio support, 720p/1080p, fast inference. Pruna optimizes models for speed without quality loss. Triggers: pruna video, p-video, pruna ai video, fast video generation, optimized video, wan t2v, wan i2v, economic video generation, cheap video generation, pruna text to video, pruna image to video
ai-video-generation
Generate AI videos with Google Veo, Seedance, Wan, Grok and 40+ models via inference.sh CLI. Models: Veo 3.1, Veo 3, Seedance 1.5 Pro, Wan 2.5, Grok Imagine Video, OmniHuman, Fabric, HunyuanVideo. Capabilities: text-to-video, image-to-video, lipsync, avatar animation, video upscaling, foley sound. Use for: social media videos, marketing content, explainer videos, product demos, AI avatars. Triggers: video generation, ai video, text to video, image to video, veo, animate image, video from image, ai animation, video generator, generate video, t2v, i2v, ai video maker, create video with ai, runway alternative, pika alternative, sora alternative, kling alternative
video-processor
Process video files with audio extraction, format conversion (mp4, webm), and Whisper transcription. Use when user mentions video conversion, audio extraction, transcription, mp4, webm, ffmpeg, or whisper transcription.