ml-model-eval-benchmark
Compare model candidates using weighted metrics and deterministic ranking outputs. Use for benchmark leaderboards and model promotion decisions.
3,556 stars
by openclaw
Installation
Claude Code / Cursor / Codex
$ curl -o ~/.claude/skills/ml-model-eval-benchmark/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/0x-professor/ml-model-eval-benchmark/SKILL.md"
Manual Installation
- Download SKILL.md from GitHub
- Place it at `.claude/skills/ml-model-eval-benchmark/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How ml-model-eval-benchmark Compares
| Feature / Agent | ml-model-eval-benchmark | Standard Approach |
|---|---|---|
| Platform Support | Multiple agents (Claude Code, Cursor, Codex) | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Single file (one `curl` command) | N/A |
Frequently Asked Questions
What does this skill do?
Compare model candidates using weighted metrics and deterministic ranking outputs. Use for benchmark leaderboards and model promotion decisions.
Which AI agents support this skill?
This skill is compatible with multiple AI agents, including Claude Code, Cursor, and Codex.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# ML Model Eval Benchmark

## Overview
Produce consistent model ranking outputs from metric-weighted evaluation inputs.

## Workflow
1. Define metric weights and accepted metric ranges.
2. Ingest model metrics for each candidate.
3. Compute weighted score and ranking.
4. Export leaderboard and promotion recommendation.

## Use Bundled Resources
- Run `scripts/benchmark_models.py` to generate benchmark outputs.
- Read `references/benchmarking-guide.md` for weighting and tie-break guidance.

## Guardrails
- Keep metric names and scales consistent across candidates.
- Record weighting assumptions in the output.
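The weighted-scoring and deterministic-ranking workflow above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual contents of `scripts/benchmark_models.py`; the function and metric names here are hypothetical, and ties are broken by model name as one example of a deterministic tie-break rule.

```python
def rank_candidates(candidates, weights):
    """Score each candidate as a weighted sum of its metrics, then rank.

    candidates: {model_name: {metric_name: value}}
    weights:    {metric_name: weight}
    Ties are broken deterministically by model name (ascending).
    """
    scores = {
        name: sum(weights[m] * metrics[m] for m in weights)
        for name, metrics in candidates.items()
    }
    # Sort by score descending, then by name ascending for deterministic ties.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Example leaderboard with two hypothetical candidates; higher is better
# for both metrics, so both are expressed on a 0-1 scale (guardrail:
# consistent metric names and scales across candidates).
leaderboard = rank_candidates(
    {
        "model-a": {"accuracy": 0.91, "latency_score": 0.70},
        "model-b": {"accuracy": 0.89, "latency_score": 0.95},
    },
    weights={"accuracy": 0.7, "latency_score": 0.3},
)
# model-b scores 0.7*0.89 + 0.3*0.95 = 0.908 and ranks first;
# model-a scores 0.7*0.91 + 0.3*0.70 = 0.847.
```

Recording the `weights` dict alongside the exported leaderboard satisfies the "record weighting assumptions in the output" guardrail.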