ln-840-benchmark-compare
Runs built-in vs hex-line benchmark with scenario manifests, activation checks, and diff-based correctness. Use when measuring hex-line MCP performance against built-in tools.
Best use case
ln-840-benchmark-compare is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using ln-840-benchmark-compare should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/ln-840-benchmark-compare/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
SKILL.md Source
> **Paths:** File paths (`shared/`, `references/`) are relative to the skills repo root. To find the repo root, locate the directory containing this SKILL.md and go up one level.
# Benchmark Compare
**Type:** L3 Worker
**Category:** 8XX Optimization -> 840 Benchmark
Run a clean A/B benchmark in Claude Code: one session with built-in tools only, one with `hex-line`. The benchmark is scenario-based, diff-validated, and manifest-driven. It measures activation, correctness, time, cost, and tokens.
---
## Input / Output
| Direction | Content |
|-----------|----------|
| **Input** | Repo checkout containing `mcp/hex-line-mcp/`, optional `references/goals.md`, optional `references/expectations.json` |
| **Output** | Comparison report in `skills-catalog/ln-840-benchmark-compare/results/{date}-comparison.md` |
---
## Prerequisites
- `claude --version` succeeds
- `git` succeeds
- `mcp/hex-line-mcp/server.mjs` exists
- `mcp/hex-line-mcp/hook.mjs` exists
- `skills-catalog/ln-840-benchmark-compare/references/goals.md` exists
- `skills-catalog/ln-840-benchmark-compare/references/expectations.json` exists
- `skills-catalog/ln-840-benchmark-compare/references/mcp-bench.json` exists
---
## Quick Run
```bash
# Both path arguments are optional; they are shown here with the canonical suite files.
bash skills-catalog/ln-840-benchmark-compare/scripts/run-benchmark.sh \
  skills-catalog/ln-840-benchmark-compare/references/goals.md \
  skills-catalog/ln-840-benchmark-compare/references/expectations.json
```
The runner handles:
- syntax preflight
- SessionStart preflight
- scenario extraction from `goals.md`
- isolated worktrees per scenario/session
- per-scenario diffs
- final comparison report
---
## Workflow
### Phase 1: Define The Canonical Suite
Use one canonical pair owned by this skill:
- `skills-catalog/ln-840-benchmark-compare/references/goals.md`
- `skills-catalog/ln-840-benchmark-compare/references/expectations.json`
Rules:
- The suite must be a balanced mix of common engineering scenarios.
- Do not design the suite to favor `hex-line`.
- Every scenario in `goals.md` must have a matching entry in `expectations.json`.
- `expectations.json` is the source of truth for correctness.
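
The one-to-one rule between `goals.md` scenarios and `expectations.json` entries can be checked mechanically before a run. The following is a minimal sketch, not the skill's actual code: `checkSuiteCoverage` is a hypothetical helper, and it assumes both files have already been reduced to plain arrays of scenario id strings. The ids shown are illustrative.

```javascript
// Hypothetical helper: report scenario ids that lack a counterpart on either side.
// Assumes both inputs are plain arrays of id strings.
function checkSuiteCoverage(scenarioIds, expectationIds) {
  const scenarios = new Set(scenarioIds);
  const expectations = new Set(expectationIds);
  return {
    // Scenarios in goals.md with no correctness contract.
    missingExpectations: scenarioIds.filter((id) => !expectations.has(id)),
    // Expectation entries that no scenario refers to.
    orphanExpectations: expectationIds.filter((id) => !scenarios.has(id)),
  };
}

// Illustrative ids only.
const coverage = checkSuiteCoverage(
  ["rename-symbol", "fix-failing-test"],
  ["rename-symbol", "update-readme"]
);
console.log(coverage);
```

A non-empty `missingExpectations` list would mean the suite violates the matching rule and should be fixed before benchmarking.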
Supported expectation fields per scenario:
| Field | Meaning |
|-------|---------|
| `id` | Scenario identifier used in result filenames |
| `expectedChangedFiles` | Files that must change |
| `forbiddenChangedFiles` | Files that must not change |
| `requiredDiffPatterns` | Regex patterns required in the saved diff |
| `forbiddenDiffPatterns` | Regex patterns that must not appear in the diff |
| `requiredResultPatterns` | Regex patterns required in the final assistant result text |
| `requiredCommands` | Regex patterns that must match at least one Bash command |
| `exactChangedFiles` | If `true`, no extra changed files are allowed |
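
To make the field table concrete, here is a hypothetical expectation entry with a small shape check. This is a sketch under stated assumptions: the scenario id, file paths, and patterns are invented for illustration, `validateExpectation` is not part of the skill, and treating every field except `id` as optional is an assumption. The top-level layout of `expectations.json` is not described here, so only a single entry is shown.

```javascript
// Hypothetical expectation entry; id, paths, and patterns are illustrative only.
const exampleExpectation = {
  id: "rename-symbol",
  expectedChangedFiles: ["src/utils/format.js"],
  forbiddenChangedFiles: ["package.json"],
  requiredDiffPatterns: ["^\\+.*formatDate"],
  forbiddenDiffPatterns: ["^\\+.*console\\.log"],
  requiredResultPatterns: ["renamed"],
  requiredCommands: ["^npm test"],
  exactChangedFiles: true,
};

const FILE_FIELDS = ["expectedChangedFiles", "forbiddenChangedFiles"];
const PATTERN_FIELDS = [
  "requiredDiffPatterns",
  "forbiddenDiffPatterns",
  "requiredResultPatterns",
  "requiredCommands",
];

// Returns a list of human-readable problems; an empty list means the shape is fine.
function validateExpectation(entry) {
  const problems = [];
  if (typeof entry.id !== "string" || entry.id.length === 0) {
    problems.push("id must be a non-empty string");
  }
  for (const field of [...FILE_FIELDS, ...PATTERN_FIELDS]) {
    const value = entry[field];
    if (value === undefined) continue; // assumed optional in this sketch
    if (!Array.isArray(value) || value.some((item) => typeof item !== "string")) {
      problems.push(`${field} must be an array of strings`);
      continue;
    }
    if (!PATTERN_FIELDS.includes(field)) continue;
    for (const source of value) {
      try {
        new RegExp(source); // pattern fields must compile as regular expressions
      } catch {
        problems.push(`${field} contains an invalid regex: ${source}`);
      }
    }
  }
  if (entry.exactChangedFiles !== undefined && typeof entry.exactChangedFiles !== "boolean") {
    problems.push("exactChangedFiles must be a boolean");
  }
  return problems;
}

console.log(validateExpectation(exampleExpectation));
```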
### Phase 2: Preflight
The runner must pass:
- `node --check server.mjs`
- `node --check hook.mjs`
- `node --check extract-scenarios.mjs`
- `node --check parse-results.mjs`
- SessionStart smoke check from `hook.mjs`
If preflight fails, the benchmark is invalid and must stop before scenarios run.
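
The stop-on-first-failure behaviour of the syntax checks could look roughly like the sketch below. It is illustrative only: `runPreflight` and the injected `run` callback are hypothetical, the bare file names stand in for whatever paths the runner actually resolves, and the SessionStart smoke check is left out because its mechanics are not described here.

```javascript
// The four syntax checks listed above, as argv arrays for `node --check`.
// Bare file names are placeholders for the real resolved paths.
const SYNTAX_CHECKS = [
  "server.mjs",
  "hook.mjs",
  "extract-scenarios.mjs",
  "parse-results.mjs",
].map((file) => ["node", "--check", file]);

// `run` is a hypothetical callback that executes an argv array and returns its exit code.
function runPreflight(checks, run) {
  for (const argv of checks) {
    const exitCode = run(argv);
    if (exitCode !== 0) {
      // Stop immediately: a failed preflight invalidates the whole benchmark.
      return { ok: false, failed: argv.join(" ") };
    }
  }
  return { ok: true, failed: null };
}

// Stub runner for illustration: pretend hook.mjs has a syntax error.
const stubRun = (argv) => (argv[2] === "hook.mjs" ? 1 : 0);
console.log(runPreflight(SYNTAX_CHECKS, stubRun));
```

Injecting `run` keeps the sketch free of process-spawning details; a real runner would supply a function that actually invokes the command.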
### Phase 3: Execute Per Scenario
For each `##` scenario in `goals.md`:
1. generate a standalone prompt file
2. create two clean worktrees from the same commit
3. run built-in Claude session
4. run hex-line Claude session
5. save `.jsonl` logs and `.diff.txt` artifacts
6. remove both worktrees
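
As an illustration of step 1, the sketch below splits a `goals.md`-style document into one prompt per `##` heading. It is not the contents of `extract-scenarios.mjs`: the function name, the returned shape, and deriving the scenario `id` by slugifying the heading are all assumptions, and the sample text is invented.

```javascript
// Split a goals.md-style document into scenarios, one per `## ` heading.
// Deriving `id` by slugifying the heading is an assumption of this sketch.
function extractScenarios(markdown) {
  const scenarios = [];
  let current = null;
  for (const line of markdown.split(/\r?\n/)) {
    // Matches level-2 headings only; `#` and `###` lines do not match.
    const heading = /^##\s+(.+?)\s*$/.exec(line);
    if (heading) {
      current = { title: heading[1], lines: [] };
      scenarios.push(current);
    } else if (current) {
      current.lines.push(line);
    }
    // Lines before the first `##` heading are ignored.
  }
  return scenarios.map(({ title, lines }) => ({
    id: title.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, ""),
    title,
    prompt: lines.join("\n").trim(),
  }));
}

// Invented sample content.
const sampleGoals = [
  "# Goals",
  "",
  "## Rename a symbol",
  "Rename `fmt` to `formatDate` everywhere.",
  "",
  "## Fix failing test",
  "Make the date parser test pass.",
].join("\n");

console.log(extractScenarios(sampleGoals));
```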
Built-in session:
- no MCP
- hooks disabled
Hex-line session:
- resolved MCP config pointing to `server.mjs`
- `outputStyle: "hex-line"`
- `PreToolUse` hook through `hook.mjs`
### Phase 4: Parse Results
`parse-results.mjs` evaluates each scenario for both sessions.
Scenario pass requires:
- valid run
- successful session completion
- changed files match expectations
- diff patterns match expectations
- result text patterns match expectations
- required commands were actually executed
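
The pass criteria above can be read as one function over an expectation entry and a parsed run. The sketch below is an assumption-laden illustration, not the logic of `parse-results.mjs`: the shape of `run` (`valid`, `completed`, `changedFiles`, `diff`, `resultText`, `commands`) is hypothetical, as is the choice to apply patterns with the multiline flag so `^` can anchor on individual diff lines.

```javascript
// Hypothetical run record: { valid, completed, changedFiles, diff, resultText, commands }.
// `expectation` uses the field names from the Phase 1 table.
function evaluateScenario(expectation, run) {
  // An invalid run is a setup/adoption failure, so it is reported separately from FAIL.
  if (!run.valid) return { status: "INVALID", failures: ["invalid run"] };

  const failures = [];
  const matches = (source, text) => new RegExp(source, "m").test(text);
  if (!run.completed) failures.push("session did not complete");

  // Changed-file checks.
  const changed = new Set(run.changedFiles);
  const expected = expectation.expectedChangedFiles ?? [];
  for (const file of expected) {
    if (!changed.has(file)) failures.push(`expected change missing: ${file}`);
  }
  for (const file of expectation.forbiddenChangedFiles ?? []) {
    if (changed.has(file)) failures.push(`forbidden change: ${file}`);
  }
  if (expectation.exactChangedFiles) {
    for (const file of changed) {
      if (!expected.includes(file)) failures.push(`unexpected change: ${file}`);
    }
  }

  // A missing diff artifact counts as a correctness failure.
  if (typeof run.diff !== "string") failures.push("diff artifact missing");
  const diff = typeof run.diff === "string" ? run.diff : "";
  for (const pattern of expectation.requiredDiffPatterns ?? []) {
    if (!matches(pattern, diff)) failures.push(`diff pattern not found: ${pattern}`);
  }
  for (const pattern of expectation.forbiddenDiffPatterns ?? []) {
    if (matches(pattern, diff)) failures.push(`forbidden diff pattern found: ${pattern}`);
  }

  // Result text and executed Bash commands.
  for (const pattern of expectation.requiredResultPatterns ?? []) {
    if (!matches(pattern, run.resultText ?? "")) failures.push(`result pattern not found: ${pattern}`);
  }
  for (const pattern of expectation.requiredCommands ?? []) {
    const ran = (run.commands ?? []).some((command) => matches(pattern, command));
    if (!ran) failures.push(`required command not executed: ${pattern}`);
  }

  return { status: failures.length === 0 ? "PASS" : "FAIL", failures };
}

// Invented example data.
const expectationExample = {
  id: "rename-symbol",
  expectedChangedFiles: ["src/format.js"],
  requiredDiffPatterns: ["^\\+.*formatDate"],
  requiredCommands: ["^npm test"],
  exactChangedFiles: true,
};
const runExample = {
  valid: true,
  completed: true,
  changedFiles: ["src/format.js"],
  diff: "--- a/src/format.js\n+++ b/src/format.js\n-export function fmt(d) {}\n+export function formatDate(d) {}\n",
  resultText: "Renamed fmt to formatDate.",
  commands: ["npm test"],
};
console.log(evaluateScenario(expectationExample, runExample));
```

Collecting every failure rather than stopping at the first one makes the per-scenario report more useful when a session misses several parts of the contract at once.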
### Phase 5: Read The Report
The final report has these sections:
- Scenario Outcomes
- Activation
- Time
- Cost
- Tokens
- Tool Totals
- Validity
Interpretation rules:
- `invalid run` means setup/adoption failure, not product performance
- scenario `FAIL` means correctness contract was not met
- activation is part of product quality for `hex-line`, not external noise
---
## Report Contract
`skills-catalog/ln-840-benchmark-compare/results/{date}-comparison.md` must answer:
- Did each scenario complete correctly?
- Did `hex-line` activate cleanly without discovery drift?
- What changed in wall time, API time, cost, output tokens, and total tool calls?
- Was the run valid?
Do not treat raw time/cost as sufficient without scenario correctness.
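
One way to enforce that rule is to compute metric deltas only when both sessions passed the scenario. The sketch below assumes a hypothetical per-session record of the form `{ status, metrics }`; the metric key names are placeholders for whatever the logs actually expose, and the numbers are invented.

```javascript
// Placeholder metric names; the real log fields are not specified here.
const METRICS = ["wallTimeMs", "apiTimeMs", "costUsd", "outputTokens", "toolCalls"];

// Compare hex-line against built-in, but only when both sessions were correct.
function compareSessions(builtin, hexLine) {
  if (builtin.status !== "PASS" || hexLine.status !== "PASS") {
    return { comparable: false, deltas: null };
  }
  const deltas = {};
  for (const name of METRICS) {
    const base = builtin.metrics[name];
    const other = hexLine.metrics[name];
    deltas[name] = {
      absolute: other - base,
      // Negative percent means hex-line used less than built-in.
      percent: base === 0 ? null : ((other - base) * 100) / base,
    };
  }
  return { comparable: true, deltas };
}

// Invented numbers for illustration.
const builtinSession = {
  status: "PASS",
  metrics: { wallTimeMs: 1000, apiTimeMs: 800, costUsd: 0.5, outputTokens: 2000, toolCalls: 10 },
};
const hexLineSession = {
  status: "PASS",
  metrics: { wallTimeMs: 800, apiTimeMs: 600, costUsd: 0.25, outputTokens: 1500, toolCalls: 5 },
};
console.log(compareSessions(builtinSession, hexLineSession));
```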
---
## Known Pitfalls
| Pitfall | Solution |
|---------|----------|
| SessionStart not present in hex-line run | Fail preflight and stop |
| Agent drifts into `ToolSearch` before hex-line use | Treat as activation problem and capture in report |
| Worktree already exists from prior crash | Remove it before adding a new one |
| Diff artifacts missing | Treat scenario correctness as failed |
| Simple scenario favors built-ins | Keep it in the suite if it is common; honesty beats cherry-picking |
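
The `ToolSearch` drift pitfall can be detected from the ordered list of tool names in a session log. This is a rough sketch with loud assumptions: it relies on Claude Code's `mcp__<server>__<tool>` naming for MCP tools, assumes the server is registered as `hex-line`, and uses an invented tool name (`read_file`) in the example. `classifyActivation` is hypothetical.

```javascript
// Assumed prefix for hex-line MCP tools, following the `mcp__<server>__<tool>` convention.
const HEX_LINE_PREFIX = "mcp__hex-line__";

// `toolCalls` is the ordered list of tool names used in one session.
function classifyActivation(toolCalls) {
  const firstHexLine = toolCalls.findIndex((name) => name.startsWith(HEX_LINE_PREFIX));
  if (firstHexLine === -1) {
    // hex-line was never used; any ToolSearch call still counts as drift.
    return { activated: false, drift: toolCalls.includes("ToolSearch") };
  }
  // Drift: the agent searched for tools before its first hex-line call.
  const drift = toolCalls.slice(0, firstHexLine).includes("ToolSearch");
  return { activated: true, drift };
}

// Invented tool sequences.
console.log(classifyActivation(["ToolSearch", "mcp__hex-line__read_file", "Bash"]));
console.log(classifyActivation(["mcp__hex-line__read_file", "Bash"]));
```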
---
## Definition of Done
- [ ] `goals.md` defines the canonical balanced suite
- [ ] `expectations.json` fully describes scenario correctness
- [ ] Runner passes syntax and SessionStart preflight
- [ ] Each scenario runs in two clean worktrees from the same commit
- [ ] Parser evaluates activation and scenario correctness from logs plus diffs
- [ ] Final report is saved to `skills-catalog/ln-840-benchmark-compare/results/`
- [ ] Temporary worktrees are removed
---
**Version:** 2.0.0
**Last Updated:** 2026-03-24
Related Skills
ln-914-community-responder
Responds to unanswered GitHub discussions and issues with codebase-informed replies. Use when clearing community question backlog.
ln-913-community-debater
Launches RFC and debate discussions on GitHub. Use when proposing changes that need community input or voting.
ln-912-community-announcer
Composes and publishes announcements to GitHub Discussions. Use when sharing releases, updates, or news with the community.
ln-911-github-triager
Produces prioritized triage report from open GitHub issues, PRs, and discussions. Use when reviewing community backlog.
ln-910-community-engagement
Analyzes community health and delegates engagement tasks. Use when managing GitHub issues, discussions, and announcements.
ln-832-bundle-optimizer
Reduces JS/TS bundle size via tree-shaking, code splitting, and unused dependency removal. Use when optimizing frontend bundle size.
ln-831-oss-replacer
Replaces custom modules with OSS packages using atomic keep/discard testing. Use when migrating custom code to established libraries.
ln-830-code-modernization-coordinator
Modernizes codebase via OSS replacement and bundle optimization. Use when acting on audit findings to reduce custom code.
ln-823-pip-upgrader
Upgrades Python pip/poetry/pipenv dependencies with breaking change handling. Use when updating Python dependencies.
ln-822-nuget-upgrader
Upgrades .NET NuGet packages with breaking change handling. Use when updating .NET dependencies.
ln-821-npm-upgrader
Upgrades npm/yarn/pnpm dependencies with breaking change handling. Use when updating JavaScript/TypeScript dependencies.
ln-820-dependency-optimization-coordinator
Upgrades dependencies across all detected package managers. Use when updating npm, NuGet, or pip packages project-wide.