ue-benchmark

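UE Agent Benchmark evaluation framework. Defines a general scoring system, evaluation workflow, and quality tiers, with support for multi-scenario Benchmarks. Trigger: activates when the user mentions keywords such as Benchmark, evaluation (评测), benchmark test (基准测试), or benchmark scoring (跑分).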

25 stars

by ComeOnOliver


Best use case

ue-benchmark is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using ue-benchmark can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/ue-benchmark/SKILL.md --create-dirs "https://raw.githubusercontent.com/ComeOnOliver/skillshub/main/skills/blackplume233/UnrealMCPHub/ue-benchmark/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/ue-benchmark/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How ue-benchmark Compares

Feature / Agent	ue-benchmark	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

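What does this skill do?

It provides the UE Agent Benchmark evaluation framework: a general scoring system, an evaluation workflow, and quality tiers for benchmarking AI agents across multiple scenarios.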

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

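# UE Agent Benchmark: Evaluation Framework

> **Prerequisite skill**: This skill depends on the UE toolchain knowledge provided by the `use-unrealhub` skill.
> Before running a Benchmark, the Agent should already have the `use-unrealhub` skill loaded.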

---

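## 1. Overview

UE Agent Benchmark measures an AI coding agent's overall ability to deliver a complete game prototype end to end in Unreal Engine through the MCP toolchain, **with no human-written code**.

The evaluation covers the full lifecycle: project creation, C++ coding, level building, game system design, PIE self-testing, and iterative fixing.

### 1.1 Multi-Scenario Architecture

The Benchmark supports multiple evaluation scenarios. Each scenario defines one game type and its own scoring rubric: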

```
skills/ue-benchmark/
├── SKILL.md              ← This file: general framework (scoring formula, evaluation workflow, quality tiers)
└── scenarios/
    ├── vampire-survivors-v1.md  ← Scenario A: 3D Vampire Survivors
    └── (future scenarios)       ← Scenarios B, C, ...
```

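**When starting a Benchmark**:
1. Read this file to understand the general framework
2. Read `scenarios/<scenario-name>.md` for the standard Prompt, execution phases, and game content specification
3. Execute as the scenario requires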

---

## 2. General Scoring System

### 2.1 Total Score Formula

```
TotalScore = PackageGate × (UserScore × 0.55 + AIReviewScore × 0.30 + ContentScore × 0.05 + TokenScore × 0.10)
```

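| Factor | Description |
|------|------|
| **PackageGate** | Binary gate (0 or 1). If Cook packaging fails, or the packaged build crashes or is unplayable → **the whole run scores zero** |
| **UserScore** | User playtest score (weight 55%) |
| **AIReviewScore** | AI code review score (weight 30%) |
| **ContentScore** | Content richness score (weight 5%, see Section 2.5) |
| **TokenScore** | Token efficiency score (weight 10%) |

**The total score has no upper limit.** An Agent can keep raising each component score through iterative enhancement.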
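
As a minimal sketch of the formula above, assuming the four component scores have already been determined by the evaluator, the weighted sum can be written as follows. The names `ComponentScores` and `total_score` are illustrative and not part of the framework.

```python
from dataclasses import dataclass

# Weights taken from the TotalScore formula in Section 2.1.
WEIGHTS = {"user": 0.55, "ai_review": 0.30, "content": 0.05, "token": 0.10}


@dataclass(frozen=True)
class ComponentScores:
    """Raw component scores (hypothetical container, not defined by the framework)."""
    user: float
    ai_review: float
    content: float
    token: float


def total_score(package_gate: int, scores: ComponentScores) -> float:
    """Apply TotalScore = PackageGate x weighted sum of the four components."""
    if package_gate not in (0, 1):
        raise ValueError("PackageGate must be 0 or 1")
    weighted = (
        scores.user * WEIGHTS["user"]
        + scores.ai_review * WEIGHTS["ai_review"]
        + scores.content * WEIGHTS["content"]
        + scores.token * WEIGHTS["token"]
    )
    return package_gate * weighted


example = ComponentScores(user=100, ai_review=80, content=60, token=100)
print(round(total_score(1, example), 2))  # 55 + 24 + 3 + 10 = 92.0
print(total_score(0, example))            # gate failed, so the run scores zero
```

Because PackageGate multiplies the whole sum, a failed gate zeroes the result regardless of how high the component scores are.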

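### 2.1.1 PackageGate: Packaging Gate (Hard Prerequisite)

PackageGate is a **qualifying round**: a run that fails it scores 0 overall and does not proceed to later scoring.

#### Pass Conditions (all must hold for a value of 1)

| # | Condition | How to verify |
|---|------|---------|
| G1 | Project directory fully preserved | The directory structure opens normally in UE, with no core files missing (.uproject, Source/, Config/) |
| G2 | Cook succeeds | `RunUAT BuildCookRun` or an equivalent command exits with code 0 and no Fatal Error |
| G3 | Packaged build launches | Double-clicking the .exe reaches the game's main screen without an immediate crash |
| G4 | Packaged build is playable | A game session can be started, the character can be moved, and at least one enemy wave is encountered |

#### Failure Criteria

| Situation | PackageGate | Consequence |
|------|-------------|------|
| Cook reports a Fatal Error / non-zero exit code | 0 | TotalScore = 0 |
| Packaged build crashes on launch (< 10 seconds) | 0 | TotalScore = 0 |
| Packaged build launches but cannot enter gameplay (stuck on main menu, black screen, etc.) | 0 | TotalScore = 0 |
| Packaged build enters gameplay but reproducibly crashes within 30 seconds | 0 | TotalScore = 0 |
| Packaged build plays normally (occasional crashes are deducted via stability penalties) | 1 | Scored normally |

> **Project directory requirement**: When the Agent finishes, the project directory should be clean and sensibly structured. The evaluator must be able to open the project directly in the UE editor and run Cook. Engine built-in assets are allowed; external asset packs are not required.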
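
The gate logic can be sketched as a conjunction of the four conditions G1 to G4. This is only an illustration: it assumes the evaluator records each condition by hand as a boolean, and the `GateChecks` type and its field names are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GateChecks:
    """Evaluator-recorded outcome of conditions G1-G4 (field names are illustrative)."""
    project_intact: bool   # G1: project opens in UE, core files present
    cook_succeeded: bool   # G2: Cook exit code 0, no Fatal Error
    build_launches: bool   # G3: packaged build reaches the main screen
    build_playable: bool   # G4: a session starts, character moves, enemies appear


def package_gate(checks: GateChecks) -> int:
    """Return 1 only when every condition holds, otherwise 0."""
    return int(all(getattr(checks, f.name) for f in fields(checks)))


def failed_conditions(checks: GateChecks) -> list[str]:
    """List the names of the conditions that did not hold, for the evaluation log."""
    return [f.name for f in fields(checks) if not getattr(checks, f.name)]


run = GateChecks(project_intact=True, cook_succeeded=True,
                 build_launches=True, build_playable=False)
print(package_gate(run))        # 0
print(failed_conditions(run))   # ['build_playable']
```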

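### 2.1.2 Playability Verification Protocol

> **Why is structured verification needed?** Merely "running" does not mean "playable". Control feel and playability are the lifeblood of a game,
> and must be assessed through a **reproducible test procedure** rather than subjective impression alone.

Once PackageGate is passed, the **Playability Verification Protocol (PVP)** must be carried out before the user assigns UserScore.
PVP does not produce a score directly, but it provides objective evidence for UserScore and can uncover problems that PackageGate does not cover.

#### Structured Playtest (4 rounds)

| Round | Duration | Test focus | Assessment target |
|------|------|---------|---------|
| **S1: First impression** | 2 minutes | Launch the game, try every control | Input responsiveness, first impression, whether there is any guidance |
| **S2: Core loop** | 5 minutes | Clear the first few waves, level up 1-2 times | Combat feel, enemy threat, upgrade experience |
| **S3: Full session** | 10+ minutes | Survive as long as possible or finish the run | Difficulty curve, content variety, sustained fun |
| **S4: Second playthrough** | 5 minutes | Restart and make different choices | Replayability, how different the upgrade paths feel |

After each round, record observation notes to serve as the basis for each UserScore item.

#### Control Verification Checklist (scenario-specific, see the scenario file)

Each scenario defines a **control verification checklist** made up of concrete "input → expected response" test items.
The evaluator must run every item and record pass/fail. The checklist is also used to:
- Uncover playability defects that PackageGate does not cover
- Provide objective evidence for the "control feel score" in UserScore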
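
One possible way to record such a checklist is a list of "input → expected response" items with a pass/fail flag. The sketch below is an assumption about how an evaluator might log results; the `ChecklistItem` type and the three sample items are hypothetical, since the real items are defined in the scenario file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    """One "input -> expected response" test item and its recorded outcome."""
    player_input: str
    expected_response: str
    passed: bool


def summarize(items: list[ChecklistItem]) -> tuple[int, int, list[str]]:
    """Return (passed count, total count, inputs of the failed items)."""
    failed = [item.player_input for item in items if not item.passed]
    return len(items) - len(failed), len(items), failed


# Hypothetical items; the real checklist comes from the scenario file.
results = [
    ChecklistItem("Press W", "Character moves forward", True),
    ChecklistItem("Move mouse", "Camera follows", True),
    ChecklistItem("Press Esc", "Pause menu opens", False),
]
print(summarize(results))  # (2, 3, ['Press Esc'])
```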

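#### How PVP Relates to Scoring

```
PackageGate ─── pass ───→ PVP (structured testing) ───→ UserScore rating
     │                    │                             │
     │                    ├── Control checklist         ├── Base functionality score
     │                    ├── 4 playtest rounds         ├── Control feel score ← PVP provides evidence
     │                    └── Observation notes         ├── Content depth score
     │                                                  ├── Experience bonus
   fail → 0 points                                      └── Stability penalties
```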

---

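### 2.2 UserScore: Overall User Score (weight 55%)

The user scores the game after actually playing it in both the **packaged build** and **PIE**. Scoring is **additive** (not a percentage scale).

> **Core principle: the richer and more polished the content, the higher the score; control feel is the foundation of the game experience.**
> A game without good control feel will struggle to score high no matter how much content it has.

- **Base functionality score (0-100)**: the scenario file defines each module's point value and scoring criteria
- **Control feel score (0-30)**: an independent dimension assessing input responsiveness, movement feel, combat feedback, and UI responsiveness (see the scenario file)
- **Content depth score (0-60)**: additional points for the implementation depth of each base module (see the scenario file)
- **Experience bonus (no upper limit)**: bonus items are defined by the scenario file
- **Stability penalties**:

| Issue | Deduction |
|------|------|
| PIE crash | -10 each |
| Crash in the packaged build (occasional crashes after passing PackageGate) | -15 each |
| Compile failure not fixed by the Agent itself | -15 |
| Game hang (infinite loop, etc.) | -10 each |
| Obvious clipping / physics anomaly | -5 per instance |
| Unresponsive input (key press has no effect / controls break) | -10 per instance |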
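
The additive scheme can be sketched as below. Two points are assumptions rather than statements from the framework: the unfixed compile failure is treated as a one-off deduction because the table gives it no per-occurrence unit, and the result is not clamped at zero because no floor is specified. The function and key names are illustrative.

```python
# Per-occurrence deductions from the stability penalty table.
PENALTY_PER_OCCURRENCE = {
    "pie_crash": 10,
    "packaged_crash": 15,
    "hang": 10,
    "clipping_or_physics": 5,
    "unresponsive_input": 10,
}
# Assumed to be a one-off deduction, since the table gives no per-occurrence unit.
UNFIXED_COMPILE_FAILURE = 15


def user_score(base, control_feel, content_depth, bonus=0,
               issues=None, unfixed_compile_failure=False):
    """Add the three capped parts and the bonus, then subtract stability penalties."""
    for name, value, cap in (("base", base, 100),
                             ("control_feel", control_feel, 30),
                             ("content_depth", content_depth, 60)):
        if not 0 <= value <= cap:
            raise ValueError(f"{name} must be within 0-{cap}")
    if bonus < 0:
        raise ValueError("bonus cannot be negative")
    penalty = sum(PENALTY_PER_OCCURRENCE[kind] * count
                  for kind, count in (issues or {}).items())
    if unfixed_compile_failure:
        penalty += UNFIXED_COMPILE_FAILURE
    return base + control_feel + content_depth + bonus - penalty


# 60 + 18 + 20 + 5 = 103, minus one PIE crash (10) and two clipping issues (10).
print(user_score(60, 18, 20, bonus=5,
                 issues={"pie_crash": 1, "clipping_or_physics": 2}))  # 83
```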

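### 2.3 AIReviewScore: AI Review Score (weight 30%)

The produced code and project are reviewed by **a separate, independent AI** (not the executing Agent).

#### Code Quality (0-40)

| Dimension | Max | Criteria |
|------|------|------|
| Architecture | 10 | Separation of responsibilities, modularity, extensibility |
| Code style | 10 | UE naming conventions, comment quality, header organization |
| Memory safety | 10 | UPROPERTY markup, weak references, lifetime management |
| Error handling | 10 | Null-pointer checks, bounds checks, graceful degradation |

#### Project Completeness (0-30)

| Dimension | Max | Criteria |
|------|------|------|
| Class hierarchy | 10 | Sensible inheritance, clear interfaces |
| Data-driven design | 10 | Values configurable (UPROPERTY), no hard-coding |
| Build system | 10 | Correct Build.cs dependencies, module partitioning |

#### Game Design Quality (0-30)

| Dimension | Max | Criteria |
|------|------|------|
| System completeness | 10 | Core loop closed, no broken flows |
| Numerical balance | 10 | Difficulty curve, differentiated weapon DPS |
| Level space | 10 | Sensible player flow, use of space, pacing |

#### Bonus Items (no upper limit)

| Bonus item | Points |
|--------|------|
| Use of design patterns | +5 per pattern |
| Performance optimizations | +10 |
| Unit tests / automated tests | +10 |
| Hot-reload compatibility | +5 |
| Thorough documentation / comments | +5 |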
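
A reviewer's ratings could be tallied along these lines. The sketch assumes the reviewing AI returns one rating per rubric dimension; the dictionary keys and function names are hypothetical shorthand for the dimensions in the tables above.

```python
# Rubric groups and their dimensions; each dimension is rated 0-10.
RUBRIC = {
    "code_quality": ("architecture", "code_style", "memory_safety", "error_handling"),
    "project_completeness": ("class_hierarchy", "data_driven", "build_system"),
    "game_design": ("system_completeness", "numerical_balance", "level_space"),
}
MAX_PER_DIMENSION = 10


def review_bonus(design_patterns=0, performance=False, tests=False,
                 hot_reload=False, docs=False):
    """Sum the bonus items: +5 per design pattern, fixed values for the rest."""
    return (5 * design_patterns + 10 * performance + 10 * tests
            + 5 * hot_reload + 5 * docs)


def ai_review_score(ratings, bonus=0):
    """Validate that every dimension is rated within range, then add the bonus."""
    expected = {dim for dims in RUBRIC.values() for dim in dims}
    if set(ratings) != expected:
        raise ValueError("ratings must cover exactly the rubric dimensions")
    for dim, value in ratings.items():
        if not 0 <= value <= MAX_PER_DIMENSION:
            raise ValueError(f"{dim} must be within 0-{MAX_PER_DIMENSION}")
    return sum(ratings.values()) + bonus


ratings = {dim: 7 for dims in RUBRIC.values() for dim in dims}
print(ai_review_score(ratings, review_bonus(design_patterns=2, tests=True)))  # 70 + 20 = 90
```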

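### 2.5 ContentScore: Content Richness Score (weight 5%)

An independent dimension measuring the **breadth and depth** of game content, assessed jointly by the user and the AI reviewer.

#### Scoring Criteria (0-100, plus uncapped bonus points)

| Dimension | Max | Criteria |
|------|------|------|
| System count | 20 | Number of independent game systems implemented (weapons, enemies, upgrades, pickups, Boss, passive skills, achievements, ...) |
| Depth per system | 20 | Implementation depth of each system (e.g. weapons not only fire but also have upgrade paths, visual differences, and audio differences) |
| Data richness | 20 | Amount of configured data (weapon types, enemy types, level elements, number of skills) |
| Audiovisual presentation | 20 | Richness of VFX, sound effects, UI animation, and materials |
| Gameplay completeness | 20 | A complete experience loop from start to finish (main menu → gameplay → results → restart) |

#### Bonus Items (no upper limit)

| Bonus item | Points |
|--------|------|
| Multiple levels / multiple scenes | +15 |
| Meta-progression system (upgrades that persist across runs) | +15 |
| Multiple selectable characters | +10 |
| Achievement / statistics system | +10 |
| Settings screen (volume, graphics quality, etc.) | +5 |
| Onboarding / tutorial | +10 |
| Save system | +10 |
| Localization support | +5 |

> **Assessment method**: ContentScore is assessed alongside UserScore, and the AI review also gives its own ContentScore.
> The final value is `ContentScore = (UserContentScore + AIContentScore) / 2`.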
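
The averaging rule is simple enough to state directly in code. The function name below is illustrative, and the inputs in the example are made-up values.

```python
def content_score(user_content: float, ai_content: float) -> float:
    """Average the user's and the AI reviewer's ContentScore."""
    return (user_content + ai_content) / 2


print(content_score(80, 70))  # 75.0
```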

---

### 2.6 TokenScore: Token Efficiency Score (weight 10%)

```
TokenScore = BaseEfficiency + QualityBonus

BaseEfficiency = 100 × (ReferenceTokens / ActualTokens)
QualityBonus   = UserScore × 0.1
```

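| Parameter | Value | Description |
|------|----|------|
| ReferenceTokens | 1,000,000,000 (1B) | Reference Token count |
| ActualTokens | Actual consumption | Recorded by the Agent and independently by the user; use the agreed value |

> **Token recording requirement**: The Agent must record Token consumption at each stage via `add_note`. The user also records it independently.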
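
As a quick check of the TokenScore formula, the sketch below recomputes the TokenScore values used for cases A, B, and C in Section 6. The function name is illustrative; the constant comes from the parameter table.

```python
REFERENCE_TOKENS = 1_000_000_000  # 1B, from the parameter table


def token_score(actual_tokens: int, user_score: float) -> float:
    """TokenScore = BaseEfficiency + QualityBonus."""
    if actual_tokens <= 0:
        raise ValueError("actual_tokens must be positive")
    base_efficiency = 100 * (REFERENCE_TOKENS / actual_tokens)
    quality_bonus = user_score * 0.1
    return base_efficiency + quality_bonus


print(round(token_score(1_000_000_000, 72), 1))   # 107.2 (case A)
print(round(token_score(800_000_000, 145), 1))    # 139.5 (case B)
print(round(token_score(2_000_000_000, 230), 1))  # 73.0  (case C)
```

Note that using fewer Tokens than the reference pushes BaseEfficiency above 100, while QualityBonus ties part of the score to UserScore.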

---

## 3. Evaluation Workflow

All scenarios share this general workflow:

```
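┌──────────────────────────────────────────────────────────────────┐
│  1. Preparation phase                                            │
│     - Prepare a clean UE engine environment                      │
│     - Start UnrealMCPHub                                         │
│     - Record the starting Token count                            │
│     - Read the scenario file to get the standard Prompt          │
├──────────────────────────────────────────────────────────────────┤
│  2. Execution phase                                              │
│     - Send the scenario's standard Prompt to the Agent           │
│     - The Agent carries out the entire workflow autonomously     │
│     - No human intervention of any kind                          │
│     - Log all MCP calls                                          │
├──────────────────────────────────────────────────────────────────┤
│  3. Cook & Package phase ⚠️ GATE                                 │
│     - After the Agent finishes, run Cook packaging in the        │
│       project directory                                          │
│     - Verify the packaged build launches, enters gameplay,       │
│       and is playable                                            │
│     - If packaging fails or the build is unplayable              │
│       → PackageGate=0 → TotalScore=0, workflow ends              │
│     - On a pass, keep the project directory and packaged build   │
├──────────────────────────────────────────────────────────────────┤
│  4. Playability verification phase (PVP)                         │
│     - Run the control checklist (test each input → response)     │
│     - Complete the 4 structured playtest rounds (S1-S4)          │
│     - Record observation notes for each round                    │
├──────────────────────────────────────────────────────────────────┤
│  5. Acceptance scoring phase                                     │
│     - Assign UserScore based on the PVP results                  │
│       (base + control feel + content depth + bonus)              │
│     - Assess ContentScore at the same time                       │
├──────────────────────────────────────────────────────────────────┤
│  6. AI review phase                                              │
│     - Submit all C++ source code to an independent AI            │
│     - Use the review Prompt template from the scenario           │
│     - The AI assigns AIReviewScore + ContentScore per the rubric │
├──────────────────────────────────────────────────────────────────┤
│  7. Aggregation phase                                            │
│     - Compute TokenScore (Section 2.6)                           │
│     - Compute the ContentScore average (Section 2.5)             │
│     - Plug into the total score formula (incl. PackageGate)      │
│     - Record on the leaderboard                                  │
└──────────────────────────────────────────────────────────────────┘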
```

---

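## 4. Quality Tiers

| Tier | Score range | Characteristics |
|------|--------|------|
| **Failed** | **0** | **Cook fails, or the packaged build is unplayable/crashes** |
| Tier 1 | 40-70 | Core playable: runs and fights, basic loop closed, packageable |
| Tier 2 | 70-120 | Feature complete: multiple weapons/enemies, upgrade system, HUD, sound effects |
| Tier 3 | 120-180 | Polished experience: hit feel, Boss, VFX, numerical balance, rich content |
| Tier 4 | 180+ | Beyond expectations: creative gameplay, meta-progression, performance optimization, massive content |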
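
Mapping a total score to a tier could look like the sketch below. Two details are assumptions, because the table leaves them open: each shared boundary (70, 120, 180) is assigned to the higher tier, and scores above 0 but below 40 get the hypothetical label "Below Tier 1".

```python
def quality_tier(total: float, package_gate: int = 1) -> str:
    """Classify a TotalScore; boundary handling is an assumption, see the note above."""
    if package_gate == 0 or total == 0:
        return "Failed"
    if total >= 180:
        return "Tier 4"
    if total >= 120:
        return "Tier 3"
    if total >= 70:
        return "Tier 2"
    if total >= 40:
        return "Tier 1"
    return "Below Tier 1"  # range not covered by the tier table


for score in (0, 35, 71.8, 130.6, 185.3):
    print(score, quality_tier(score))
```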

---

## 5. Leaderboard Format

```
| Rank | Agent | Model | Scenario | Total | PackageGate | User | AIReview | Content | Token | Tokens Used | Date |
|------|-------|------|------|------|-------------|------|----------|---------|-------|----------|------|
| 1 | Agent-X | claude-4-opus | vampire-survivors-v1 | 198.1 | ✅ | 230 | 150 | 85 | 223 | 500M | 2026-03-07 |
| - | Agent-F | model-y | vampire-survivors-v1 | 0 | ❌ Cook failed | - | - | - | - | 300M | 2026-03-08 |
```
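
A row in this format could be generated with a small helper such as the one below. It is a sketch only: the function names and signature are hypothetical, and rendering Token counts in millions is an assumption based on the sample rows.

```python
from typing import Optional


def format_tokens(count: int) -> str:
    """Render a token count in millions, e.g. 300_000_000 -> '300M'."""
    return f"{count / 1_000_000:.0f}M"


def leaderboard_row(agent: str, model: str, scenario: str, tokens_used: int, date: str,
                    rank: Optional[int] = None, total: float = 0.0,
                    components: Optional[tuple] = None,
                    failure_reason: Optional[str] = None) -> str:
    """Build one Markdown table row; failed runs show '-' for rank and component scores."""
    if failure_reason is not None:
        cells = ["-", agent, model, scenario, "0", f"❌ {failure_reason}",
                 "-", "-", "-", "-"]
    else:
        user, ai_review, content, token = components
        cells = [str(rank), agent, model, scenario, f"{total:.1f}", "✅",
                 f"{user:g}", f"{ai_review:g}", f"{content:g}", f"{token:g}"]
    cells += [format_tokens(tokens_used), date]
    return "| " + " | ".join(cells) + " |"


print(leaderboard_row("Agent-F", "model-y", "vampire-survivors-v1",
                      300_000_000, "2026-03-08", failure_reason="Cook failed"))
```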

---

## 6. Scoring Examples

### Case A: Basic completion (1B Tokens, packaging passed)

| Item | Raw score | Weighted |
|------|--------|--------|
| PackageGate | ✅ (1) | ×1 |
| UserScore | 72 | 72 × 0.55 = 39.6 |
| AIReviewScore | 65 | 65 × 0.30 = 19.5 |
| ContentScore | 40 | 40 × 0.05 = 2.0 |
| TokenScore | 107.2 | 107.2 × 0.10 = 10.7 |
| **Total** | | **71.8** |

### Case B: Excellent completion (800M Tokens, packaging passed)

| Item | Raw score | Weighted |
|------|--------|--------|
| PackageGate | ✅ (1) | ×1 |
| UserScore | 145 | 145 × 0.55 = 79.8 |
| AIReviewScore | 110 | 110 × 0.30 = 33.0 |
| ContentScore | 75 | 75 × 0.05 = 3.8 |
| TokenScore | 139.5 | 139.5 × 0.10 = 14.0 |
| **Total** | | **130.6** |

### Case C: Extreme polish (2B Tokens, packaging passed, extremely rich content)

| Item | Raw score | Weighted |
|------|--------|--------|
| PackageGate | ✅ (1) | ×1 |
| UserScore | 230 | 230 × 0.55 = 126.5 |
| AIReviewScore | 150 | 150 × 0.30 = 45.0 |
| ContentScore | 130 | 130 × 0.05 = 6.5 |
| TokenScore | 73 | 73 × 0.10 = 7.3 |
| **Total** | | **185.3** |

### Case D: Packaging failed

| Item | Raw score | Weighted |
|------|--------|--------|
| PackageGate | ❌ (0) | ×0 |
| UserScore | (not scored) | - |
| AIReviewScore | (not scored) | - |
| ContentScore | (not scored) | - |
| TokenScore | - | - |
| **Total** | | **0** |

---

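## 7. General Prerequisites

All scenarios share the following prerequisites:

| Item | Requirement |
|------|------|
| Engine | Unreal Engine 5.7+ |
| Compiler | MSVC 2022+ |
| MCP connection | UnrealMCPHub connected |
| Starting state | **A brand-new, blank UE C++ project** |
| Human actions | Limited to starting the Agent session, final scoring, and running Cook (or the Agent triggers it automatically) |
| Input instruction | A single Prompt (defined by the scenario file) |
| Token recording | Recorded by the Agent and independently by the user |
| Packaging verification | After the evaluation, Cook packaging must be run and the build verified as playable |
| Project retention | After the Agent finishes, the project directory must be fully preserved and open directly in the UE editor |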

---

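## 8. Available Scenarios

| Scenario ID | File | Game type | Version |
|---------|------|---------|------|
| `vampire-survivors-v1` | [`scenarios/vampire-survivors-v1.md`](./scenarios/vampire-survivors-v1.md) | 3D Vampire Survivors | v1.3 |

> New scenarios are named `scenarios/<game-type>-v<N>.md`. Each scenario file defines: the standard Prompt, execution phases, game content specification, scenario-specific scoring rubric, and AI review Prompt template.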
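
The naming convention can be checked mechanically. The sketch below assumes that `<game-type>` is lowercase kebab-case, which is inferred from the single existing scenario rather than stated by the framework; the pattern and function name are illustrative.

```python
import re

# Assumed shape: scenarios/<lowercase-kebab-case game type>-v<integer>.md
SCENARIO_FILE = re.compile(
    r"^scenarios/(?P<game_type>[a-z0-9]+(?:-[a-z0-9]+)*)-v(?P<version>\d+)\.md$"
)


def parse_scenario_path(path: str) -> tuple[str, int]:
    """Split a scenario path into (game type, version number) or raise ValueError."""
    match = SCENARIO_FILE.match(path)
    if match is None:
        raise ValueError(f"not a valid scenario path: {path!r}")
    return match["game_type"], int(match["version"])


print(parse_scenario_path("scenarios/vampire-survivors-v1.md"))  # ('vampire-survivors', 1)
```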
