ln-840-benchmark-compare

Runs built-in vs hex-line benchmark with scenario manifests, activation checks, and diff-based correctness. Use when measuring hex-line MCP performance against built-in tools.

310 stars

by levnikolaevich


Best use case

ln-840-benchmark-compare is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using ln-840-benchmark-compare can expect more consistent output, faster repeated runs, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/ln-840-benchmark-compare/SKILL.md --create-dirs "https://raw.githubusercontent.com/levnikolaevich/claude-code-skills/main/skills-catalog/ln-840-benchmark-compare/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/ln-840-benchmark-compare/SKILL.md inside your project.
3. Restart your AI agent so that it auto-discovers the skill.

How ln-840-benchmark-compare Compares

| Feature / Agent | ln-840-benchmark-compare | Standard Approach |
|-----------------|--------------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

Where can I find the source code?

The source code is in the levnikolaevich/claude-code-skills repository on GitHub, under skills-catalog/ln-840-benchmark-compare/.

SKILL.md Source

> **Paths:** File paths (`shared/`, `references/`) are relative to skills repo root. Locate this SKILL.md directory and go up one level for repo root.

# Benchmark Compare

**Type:** L3 Worker
**Category:** 8XX Optimization -> 840 Benchmark

Run a clean A/B benchmark in Claude Code: one session with built-in tools only, one with `hex-line`. The benchmark is scenario-based, diff-validated, and manifest-driven. It measures activation, correctness, time, cost, and tokens.

---

## Input / Output

| Direction | Content |
|-----------|----------|
| **Input** | Repo checkout containing `mcp/hex-line-mcp/`, optional `references/goals.md`, optional `references/expectations.json` |
| **Output** | Comparison report in `skills-catalog/ln-840-benchmark-compare/results/{date}-comparison.md` |

---

## Prerequisites

- `claude --version` succeeds
- `git --version` succeeds
- `mcp/hex-line-mcp/server.mjs` exists
- `mcp/hex-line-mcp/hook.mjs` exists
- `skills-catalog/ln-840-benchmark-compare/references/goals.md` exists
- `skills-catalog/ln-840-benchmark-compare/references/expectations.json` exists
- `skills-catalog/ln-840-benchmark-compare/references/mcp-bench.json` exists

---

## Quick Run

```bash
# Both arguments are optional; the square brackets mark them as optional
# and are not typed literally.
bash skills-catalog/ln-840-benchmark-compare/scripts/run-benchmark.sh \
  [skills-catalog/ln-840-benchmark-compare/references/goals.md] \
  [skills-catalog/ln-840-benchmark-compare/references/expectations.json]
```

The runner handles:
- syntax preflight
- SessionStart preflight
- scenario extraction from `goals.md`
- isolated worktrees per scenario/session
- per-scenario diffs
- final comparison report
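
As a rough illustration of the scenario-extraction step, the sketch below splits a goals document into one scenario per `##` heading. It is not the real `extract-scenarios.mjs`: the function names, the returned shape, and the slug rule used for `id` are all assumptions made for this example.

```javascript
// Hypothetical helper: turn a heading title into a filename-safe id.
function slugify(title) {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Hypothetical extractor: every `##` heading starts a new scenario, and the
// lines up to the next `##` heading become that scenario's prompt.
function extractScenarios(markdown) {
  const scenarios = [];
  let current = null;
  for (const line of markdown.split(/\r?\n/)) {
    const heading = /^##\s+(.+?)\s*$/.exec(line);
    if (heading) {
      current = { id: slugify(heading[1]), title: heading[1], lines: [] };
      scenarios.push(current);
    } else if (current) {
      // `###` and deeper headings stay inside the current scenario.
      current.lines.push(line);
    }
  }
  return scenarios.map(({ id, title, lines }) => ({
    id,
    title,
    prompt: lines.join("\n").trim(),
  }));
}
```

Text before the first `##` heading is ignored. A line starting with `## ` inside a fenced code block would be misread as a heading, so a real extractor would need to track fences.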

---

## Workflow

### Phase 1: Define The Canonical Suite

Use one canonical pair owned by this skill:
- `skills-catalog/ln-840-benchmark-compare/references/goals.md`
- `skills-catalog/ln-840-benchmark-compare/references/expectations.json`

Rules:
- The suite must be a balanced mix of common engineering scenarios.
- Do not design the suite to favor `hex-line`.
- Every scenario in `goals.md` must have a matching entry in `expectations.json`.
- `expectations.json` is the source of truth for correctness.
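
The one-to-one rule between `goals.md` and `expectations.json` can be checked mechanically before any session runs. The sketch below assumes `expectations.json` parses to an array of objects with an `id` field. That layout and the function name are assumptions, not something the skill documents.

```javascript
// Hypothetical coverage check between scenario ids and expectation entries.
function checkSuiteCoverage(scenarioIds, expectations) {
  const expected = new Set(expectations.map((entry) => entry.id));
  const defined = new Set(scenarioIds);
  return {
    // Scenarios in goals.md with no expectation entry (violates the rule above).
    missingExpectations: scenarioIds.filter((id) => !expected.has(id)),
    // Expectation entries that no scenario refers to (likely stale).
    orphanExpectations: [...expected].filter((id) => !defined.has(id)),
  };
}

const coverage = checkSuiteCoverage(
  ["rename-function", "fix-bug"],
  [{ id: "rename-function" }, { id: "old-scenario" }]
);
console.log(coverage);
// missingExpectations: ["fix-bug"], orphanExpectations: ["old-scenario"]
```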

Supported expectation fields per scenario:

| Field | Meaning |
|-------|---------|
| `id` | Scenario identifier used in result filenames |
| `expectedChangedFiles` | Files that must change |
| `forbiddenChangedFiles` | Files that must not change |
| `requiredDiffPatterns` | Regex patterns required in the saved diff |
| `forbiddenDiffPatterns` | Regex patterns that must not appear in the diff |
| `requiredResultPatterns` | Regex patterns required in the final assistant result text |
| `requiredCommands` | Regex patterns that must match at least one Bash command |
| `exactChangedFiles` | If `true`, no extra changed files are allowed |
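
To make the three file-related fields concrete, here is a minimal sketch of how they might be applied to the list of files a session changed. The example entry, its file names, and the `checkChangedFiles` helper are hypothetical. Only the field names come from the table above.

```javascript
// Hypothetical expectation entry using the documented field names.
const exampleExpectation = {
  id: "rename-function",
  expectedChangedFiles: ["src/util.js"],
  forbiddenChangedFiles: ["package.json"],
  exactChangedFiles: true,
};

// Returns a list of human-readable problems; an empty list means the
// changed-file part of the contract is satisfied.
function checkChangedFiles(expectation, changedFiles) {
  const changed = new Set(changedFiles);
  const expected = expectation.expectedChangedFiles ?? [];
  const forbidden = expectation.forbiddenChangedFiles ?? [];
  const problems = [];
  for (const file of expected) {
    if (!changed.has(file)) problems.push(`missing change: ${file}`);
  }
  for (const file of forbidden) {
    if (changed.has(file)) problems.push(`forbidden change: ${file}`);
  }
  if (expectation.exactChangedFiles === true) {
    const allowed = new Set(expected);
    for (const file of changedFiles) {
      if (!allowed.has(file)) problems.push(`unexpected change: ${file}`);
    }
  }
  return problems;
}

console.log(checkChangedFiles(exampleExpectation, ["src/util.js", "README.md"]));
// → [ 'unexpected change: README.md' ]
```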

### Phase 2: Preflight

The runner must pass:
- `node --check server.mjs`
- `node --check hook.mjs`
- `node --check extract-scenarios.mjs`
- `node --check parse-results.mjs`
- SessionStart smoke check from `hook.mjs`

If preflight fails, the benchmark is invalid and must stop before scenarios run.
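
The stop-on-first-failure behaviour can be sketched as below. The four syntax checks come from the list above. The `run` callback (which would execute a command and return its exit code) and the result shape are assumptions, and the SessionStart smoke check would be appended as one more entry whose exact command is not documented here.

```javascript
// The four syntax checks as argv arrays, with file names written as in the
// list above (the real runner presumably resolves full paths).
const SYNTAX_CHECKS = [
  "server.mjs",
  "hook.mjs",
  "extract-scenarios.mjs",
  "parse-results.mjs",
].map((file) => ["node", "--check", file]);

// `run` is a hypothetical callback: (argv) => exit code.
// Preflight stops at the first non-zero exit code so no scenario runs afterwards.
function runPreflight(checks, run) {
  for (const argv of checks) {
    const exitCode = run(argv);
    if (exitCode !== 0) {
      return { ok: false, failed: argv.join(" ") };
    }
  }
  return { ok: true, failed: null };
}

// Usage with a fake runner that pretends hook.mjs has a syntax error.
const fakeRun = (argv) => (argv[2] === "hook.mjs" ? 1 : 0);
console.log(runPreflight(SYNTAX_CHECKS, fakeRun));
// → { ok: false, failed: 'node --check hook.mjs' }
```

Injecting `run` keeps the sketch free of process spawning. In a real script it would wrap something like Node's `child_process.spawnSync`.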

### Phase 3: Execute Per Scenario

For each `##` scenario in `goals.md`:
1. generate a standalone prompt file
2. create two clean worktrees from the same commit
3. run built-in Claude session
4. run hex-line Claude session
5. save `.jsonl` logs and `.diff.txt` artifacts
6. remove both worktrees
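
Steps 2 and 6 map onto `git worktree` commands. The sketch below only builds the command lists and does not run them. The directory layout, the session labels, and the function name are assumptions. The `git worktree add --detach` and `git worktree remove --force` subcommands are standard Git.

```javascript
// Hypothetical planner: two detached worktrees per scenario, both at the
// same commit, plus the commands that clean them up afterwards.
function worktreePlan(scenarioId, commit, baseDir) {
  const sessions = ["builtin", "hexline"];
  const paths = sessions.map((session) => `${baseDir}/${scenarioId}-${session}`);
  return {
    paths,
    setup: paths.flatMap((path) => [
      // Clear a leftover worktree from a prior crash. This command fails when
      // nothing is registered at that path, so the caller should ignore its
      // exit code.
      ["git", "worktree", "remove", "--force", path],
      ["git", "worktree", "add", "--detach", path, commit],
    ]),
    teardown: paths.map((path) => ["git", "worktree", "remove", "--force", path]),
  };
}

const plan = worktreePlan("rename-function", "abc1234", "/tmp/bench");
console.log(plan.setup.map((argv) => argv.join(" ")).join("\n"));
```

Creating both worktrees from one commit hash, rather than from a branch name, keeps the two sessions on identical starting trees even if the branch moves during the run.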

Built-in session:
- no MCP
- hooks disabled

Hex-line session:
- resolved MCP config pointing to `server.mjs`
- `outputStyle: "hex-line"`
- `PreToolUse` hook through `hook.mjs`
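
The difference between the two sessions can be pictured as two configuration objects. This is a sketch under stated assumptions: the key names (`mcpServers`, `hooks.PreToolUse`, `outputStyle`) follow Claude Code configuration conventions as best understood here, the `"*"` matcher and the `node <path>` commands are guesses, and the empty `hooks` map merely stands in for "hooks disabled", since the source does not say how the runner disables them.

```javascript
// Hypothetical builder for the per-session configuration described above.
function sessionConfigs(serverPath, hookPath) {
  return {
    builtin: {
      mcpConfig: null, // no MCP servers at all
      settings: { hooks: {} }, // stand-in for "hooks disabled" (assumption)
    },
    hexline: {
      mcpConfig: {
        mcpServers: {
          "hex-line": { command: "node", args: [serverPath] },
        },
      },
      settings: {
        outputStyle: "hex-line",
        hooks: {
          PreToolUse: [
            {
              matcher: "*", // assumed: apply the hook to every tool
              hooks: [{ type: "command", command: `node ${hookPath}` }],
            },
          ],
        },
      },
    },
  };
}

const configs = sessionConfigs(
  "/repo/mcp/hex-line-mcp/server.mjs",
  "/repo/mcp/hex-line-mcp/hook.mjs"
);
console.log(JSON.stringify(configs.hexline, null, 2));
```

Check these shapes against the actual runner script before reusing them.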

### Phase 4: Parse Results

`parse-results.mjs` evaluates each scenario for both sessions.

Scenario pass requires:
- valid run
- successful session completion
- changed files match expectations
- diff patterns match expectations
- result text patterns match expectations
- required commands were actually executed
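
A minimal sketch of the pattern and command checks is shown below. It is not the real `parse-results.mjs`. The `run` object (`valid`, `completed`, `diff`, `resultText`, `bashCommands`) is a hypothetical shape, patterns are compiled with `new RegExp` and no flags (an assumption), and changed-file matching is left out to keep the example short.

```javascript
// Hypothetical evaluator for one session of one scenario.
function evaluateScenario(expectation, run) {
  const failures = [];
  if (!run.valid) failures.push("invalid run");
  if (!run.completed) failures.push("session did not complete");
  if (run.diff == null) failures.push("diff artifact missing");

  const diff = run.diff ?? "";
  const resultText = run.resultText ?? "";
  const commands = run.bashCommands ?? [];

  for (const pattern of expectation.requiredDiffPatterns ?? []) {
    if (!new RegExp(pattern).test(diff)) failures.push(`diff lacks /${pattern}/`);
  }
  for (const pattern of expectation.forbiddenDiffPatterns ?? []) {
    if (new RegExp(pattern).test(diff)) failures.push(`diff contains /${pattern}/`);
  }
  for (const pattern of expectation.requiredResultPatterns ?? []) {
    if (!new RegExp(pattern).test(resultText)) failures.push(`result lacks /${pattern}/`);
  }
  for (const pattern of expectation.requiredCommands ?? []) {
    // At least one executed Bash command must match each pattern.
    const matched = commands.some((command) => new RegExp(pattern).test(command));
    if (!matched) failures.push(`no command matches /${pattern}/`);
  }
  return { pass: failures.length === 0, failures };
}

const verdict = evaluateScenario(
  {
    requiredDiffPatterns: ["\\+.*bar\\("],
    forbiddenDiffPatterns: ["console\\.log"],
    requiredResultPatterns: ["tests? pass"],
    requiredCommands: ["^npm test"],
  },
  {
    valid: true,
    completed: true,
    diff: "-foo()\n+bar()\n",
    resultText: "All tests pass.",
    bashCommands: ["npm test -- --run"],
  }
);
console.log(verdict); // → { pass: true, failures: [] }
```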

### Phase 5: Read The Report

The final report has these sections:
- Scenario Outcomes
- Activation
- Time
- Cost
- Tokens
- Tool Totals
- Validity

Interpretation rules:
- `invalid run` means setup/adoption failure, not product performance
- scenario `FAIL` means correctness contract was not met
- activation is part of product quality for `hex-line`, not external noise
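
One way to keep invalid runs out of the correctness numbers is to classify them separately, as in this sketch. The labels, field names, and the choice to compute the pass rate over valid runs only are assumptions made for illustration. The real report may aggregate differently.

```javascript
// Hypothetical three-way classification of a single session result.
function classifyOutcome({ valid, pass }) {
  if (!valid) return "INVALID"; // setup/adoption failure, not product performance
  return pass ? "PASS" : "FAIL"; // FAIL: correctness contract was not met
}

// Pass rate is computed over valid runs only, so invalid runs cannot
// masquerade as correctness failures.
function summarize(runs) {
  const counts = { PASS: 0, FAIL: 0, INVALID: 0 };
  for (const run of runs) counts[classifyOutcome(run)] += 1;
  const scored = counts.PASS + counts.FAIL;
  return { ...counts, passRate: scored === 0 ? null : counts.PASS / scored };
}

console.log(
  summarize([
    { valid: true, pass: true },
    { valid: true, pass: false },
    { valid: false, pass: false },
  ])
);
// → { PASS: 1, FAIL: 1, INVALID: 1, passRate: 0.5 }
```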

---

## Report Contract

`skills-catalog/ln-840-benchmark-compare/results/{date}-comparison.md` must answer:
- Did each scenario complete correctly?
- Did `hex-line` activate cleanly without discovery drift?
- What changed in wall time, API time, cost, output tokens, and total tool calls?
- Was the run valid?

Do not treat raw time/cost as sufficient without scenario correctness.
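
The rule above suggests gating metric comparison on correctness. The sketch below computes deltas only when both sessions passed the scenario. The metric key names and the result shape are hypothetical and do not describe the real report's schema.

```javascript
// Hypothetical metric names mirroring the report contract.
const METRICS = ["wallTimeMs", "apiTimeMs", "costUsd", "outputTokens", "toolCalls"];

// Deltas are hex-line minus built-in; negative values mean hex-line used less.
function compareMetrics(builtin, hexline) {
  if (!builtin.pass || !hexline.pass) {
    // Raw time/cost numbers are not meaningful without correctness.
    return { comparable: false, deltas: null };
  }
  const deltas = {};
  for (const key of METRICS) {
    const base = builtin[key];
    const diff = hexline[key] - base;
    deltas[key] = { diff, pct: base === 0 ? null : (diff / base) * 100 };
  }
  return { comparable: true, deltas };
}

const comparison = compareMetrics(
  { pass: true, wallTimeMs: 1000, apiTimeMs: 800, costUsd: 0.5, outputTokens: 2000, toolCalls: 10 },
  { pass: true, wallTimeMs: 750, apiTimeMs: 600, costUsd: 0.25, outputTokens: 1500, toolCalls: 8 }
);
console.log(comparison.deltas.wallTimeMs); // → { diff: -250, pct: -25 }
```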

---

## Known Pitfalls

| Pitfall | Solution |
|---------|----------|
| SessionStart not present in hex-line run | Fail preflight and stop |
| Agent drifts into `ToolSearch` before hex-line use | Treat as activation problem and capture in report |
| Worktree already exists from prior crash | Remove it before adding a new one |
| Diff artifacts missing | Treat scenario correctness as failed |
| Simple scenario favors built-ins | Keep it in the suite if it is common; honesty beats cherry-picking |
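
The `ToolSearch` drift pitfall can be detected from the ordered list of tool names in a session log. This sketch assumes hex-line tools carry an `mcp__hex-line__` name prefix and that the log has already been reduced to an array of tool names. Both are assumptions, and `read_file` is a made-up tool name.

```javascript
// Hypothetical activation check over an ordered list of tool names.
function detectDiscoveryDrift(toolCalls, hexPrefix = "mcp__hex-line__") {
  const firstHex = toolCalls.findIndex((name) => name.startsWith(hexPrefix));
  const firstSearch = toolCalls.indexOf("ToolSearch");
  return {
    // Did the session ever call a hex-line tool?
    activated: firstHex !== -1,
    // Drift: ToolSearch was used before any hex-line tool (or instead of one).
    drifted: firstSearch !== -1 && (firstHex === -1 || firstSearch < firstHex),
  };
}

console.log(detectDiscoveryDrift(["ToolSearch", "mcp__hex-line__read_file", "Bash"]));
// → { activated: true, drifted: true }
```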

---

## Definition of Done

- [ ] `goals.md` defines the canonical balanced suite
- [ ] `expectations.json` fully describes scenario correctness
- [ ] Runner passes syntax and SessionStart preflight
- [ ] Each scenario runs in two clean worktrees from the same commit
- [ ] Parser evaluates activation and scenario correctness from logs plus diffs
- [ ] Final report is saved to `skills-catalog/ln-840-benchmark-compare/results/`
- [ ] Temporary worktrees are removed

---

**Version:** 2.0.0
**Last Updated:** 2026-03-24

Related Skills

All from levnikolaevich/claude-code-skills (310 stars).

- ln-914-community-responder: Responds to unanswered GitHub discussions and issues with codebase-informed replies. Use when clearing community question backlog.
- ln-913-community-debater: Launches RFC and debate discussions on GitHub. Use when proposing changes that need community input or voting.
- ln-912-community-announcer: Composes and publishes announcements to GitHub Discussions. Use when sharing releases, updates, or news with the community.
- ln-911-github-triager: Produces prioritized triage report from open GitHub issues, PRs, and discussions. Use when reviewing community backlog.
- ln-910-community-engagement: Analyzes community health and delegates engagement tasks. Use when managing GitHub issues, discussions, and announcements.
- ln-832-bundle-optimizer: Reduces JS/TS bundle size via tree-shaking, code splitting, and unused dependency removal. Use when optimizing frontend bundle size.
- ln-831-oss-replacer: Replaces custom modules with OSS packages using atomic keep/discard testing. Use when migrating custom code to established libraries.
- ln-830-code-modernization-coordinator: Modernizes codebase via OSS replacement and bundle optimization. Use when acting on audit findings to reduce custom code.
- ln-823-pip-upgrader: Upgrades Python pip/poetry/pipenv dependencies with breaking change handling. Use when updating Python dependencies.
- ln-822-nuget-upgrader: Upgrades .NET NuGet packages with breaking change handling. Use when updating .NET dependencies.
- ln-821-npm-upgrader: Upgrades npm/yarn/pnpm dependencies with breaking change handling. Use when updating JavaScript/TypeScript dependencies.
- ln-820-dependency-optimization-coordinator: Upgrades dependencies across all detected package managers. Use when updating npm, NuGet, or pip packages project-wide.