Codex

metrics-tokens

Analyze token usage efficiency against the MetaGPT baseline and surface per-step optimization opportunities

104 stars

by jmagly

Best use case

metrics-tokens is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

It is a strong fit for teams already working in Codex.

Teams using metrics-tokens should expect more consistent output, faster repeated runs, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/metrics-tokens/SKILL.md --create-dirs "https://raw.githubusercontent.com/jmagly/aiwg/main/.agents/skills/metrics-tokens/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/metrics-tokens/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How metrics-tokens Compares

| Feature / Agent | metrics-tokens | Standard Approach |
|-----------------|----------------|-------------------|
| Platform Support | Codex | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Analyze token usage efficiency against the MetaGPT baseline and surface per-step optimization opportunities

Which AI agents support this skill?

This skill is designed for Codex.

Where can I find the source code?

The source code is in the jmagly/aiwg repository on GitHub, under .agents/skills/metrics-tokens/.

SKILL.md Source

# metrics-tokens

You perform deep analysis of token usage efficiency. You compare AIWG workflow token consumption against the MetaGPT 124 tokens/line benchmark (REF-013), identify high-cost operations, and surface optimization opportunities.

## Triggers

Alternate expressions and non-obvious activations (primary phrases are matched automatically from the skill description):

- "how efficient are my tokens" → efficiency ratio vs MetaGPT baseline
- "am I above the baseline" → threshold status check
- "where are tokens being wasted" → per-step breakdown with recommendations
- "token ratio" → tokens/line ratio calculation

## Trigger Patterns Reference

| Pattern | Example | Action |
|---------|---------|--------|
| Efficiency report | "token efficiency" | `aiwg metrics-tokens` |
| Session analysis | "analyze tokens for this session" | `aiwg metrics-tokens --session current` |
| Threshold check | "are we at green" | `aiwg metrics-tokens --threshold` |
| Per-step breakdown | "which step used the most tokens" | `aiwg metrics-tokens --by-step` |
| Optimization hints | "suggest token optimizations" | `aiwg metrics-tokens --optimize` |

## Behavior

When triggered:

1. **Determine scope**:
   - Default: current or most recent session
   - `--session <name>`: named session
   - `--all`: aggregate across all sessions

2. **Load token data**:
   - Read `.aiwg/ralph/sessions/*/metrics.json` for raw token counts
   - Apply estimation heuristic: 4 chars per token (aligned with `src/metrics/token-counter.ts`)

3. **Compute efficiency metrics**:
   - Tokens/line ratio for session output
   - `vsBenchmark`: percentage difference from the MetaGPT 124 tokens/line benchmark (negative = better)
   - `vsBaseline`: percentage difference from the typical LLM baseline of 200 tokens/line (negative = better)
   - Threshold status: green (≤124), yellow (125–150), red (>150)

4. **Run the command**:

   ```bash
   # Default efficiency report
   aiwg metrics-tokens

   # Current session
   aiwg metrics-tokens --session current

   # Per-step breakdown
   aiwg metrics-tokens --by-step

   # With optimization suggestions
   aiwg metrics-tokens --optimize

   # JSON output
   aiwg metrics-tokens --json
   ```
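
The scope rules in step 1 can be sketched as a small argument resolver. This is a minimal illustration, not the actual CLI handler: `Scope` and `resolveScope` are hypothetical names, and the choice to let `--all` take precedence over `--session` is an assumption.

```typescript
// Hypothetical scope type mirroring the three cases in step 1.
type Scope =
  | { kind: "default" } // current or most recent session
  | { kind: "session"; name: string } // --session <name>
  | { kind: "all" }; // --all

// Resolve the analysis scope from raw CLI arguments.
// Assumption: --all wins if both --all and --session are present.
function resolveScope(args: string[]): Scope {
  if (args.indexOf("--all") !== -1) {
    return { kind: "all" };
  }
  const i = args.indexOf("--session");
  if (i !== -1) {
    const name = args[i + 1];
    // A missing value or another flag in its place is treated as an error.
    if (name === undefined || name.indexOf("--") === 0) {
      throw new Error("--session requires a session name");
    }
    return { kind: "session", name };
  }
  return { kind: "default" };
}

console.log(resolveScope(["--session", "current"]));
```

Modelling the scope as a tagged union keeps the "default", "named session", and "all sessions" cases explicit for whatever code loads the metrics afterwards.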

## Benchmark Reference

The MetaGPT 124 tokens/line benchmark comes from REF-013 (research corpus). It represents a validated efficiency target for AI-assisted software workflows. AIWG tracks against this benchmark to make token costs legible and comparable across sessions.

| Threshold | Tokens/Line | Status | Action |
|-----------|-------------|--------|--------|
| At or below benchmark | ≤ 124 | green | No action needed |
| Above benchmark | 125–150 | yellow | Flag for review |
| Well above benchmark | > 150 | red | Generate optimization recommendations |
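
The threshold table above can be expressed as a small classifier. This is a sketch under stated assumptions: `classifyThreshold` and `THRESHOLD_ACTIONS` are hypothetical names, and fractional ratios between 124 and 125 (which the table does not cover) are treated as yellow.

```typescript
type ThresholdStatus = "green" | "yellow" | "red";

// Actions copied from the threshold table.
const THRESHOLD_ACTIONS: Record<ThresholdStatus, string> = {
  green: "No action needed",
  yellow: "Flag for review",
  red: "Generate optimization recommendations",
};

// Boundaries from the table: <= 124 green, up to 150 yellow, above 150 red.
// Assumption: values strictly between 124 and 125 count as yellow.
function classifyThreshold(tokensPerLine: number): ThresholdStatus {
  if (tokensPerLine <= 124) return "green";
  if (tokensPerLine <= 150) return "yellow";
  return "red";
}

const status = classifyThreshold(112);
console.log(status, "->", THRESHOLD_ACTIONS[status]);
```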

Comparison points:

| Baseline | Tokens/Line |
|----------|-------------|
| MetaGPT benchmark (REF-013) | 124 |
| Typical LLM baseline | ~200 |
| AIWG target | ≤ 124 |

## Report Format

### Standard Efficiency Report

```
Token Efficiency — Session: sdlc-review-20260401-143022
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Token Counts
  Input:    42,310 tokens
  Output:   18,940 tokens
  Total:    61,250 tokens

Content Metrics
  Characters:     245,000
  Non-blank lines:    548
  Total lines:        621

Efficiency
  Tokens/line:    112
  vs MetaGPT:     -9.7%  (better than 124 tokens/line benchmark)
  vs LLM baseline: -44%  (well below 200 tokens/line typical)
  Status:         green

Threshold: green — at or below MetaGPT benchmark
```

### Per-Step Breakdown (`--by-step`)

```
Token Efficiency by Step
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Step                    Tokens    Lines  Tokens/Line  Status
──────────────────────  ────────  ─────  ───────────  ──────
architecture-designer   18,200    168    108          green
security-architect      14,600    132    111          green
test-architect          13,100    119    110          green
technical-writer        15,350    129    119          green  ← highest tokens/line
                        ──────────────────────────────────
Total                   61,250    548    112          green
```
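
The per-step table can be reproduced from per-step token and line counts. The sketch below is illustrative only: `StepUsage`, `StepRow`, and `breakdownBySteps` are hypothetical names, and the input shape is an assumption rather than the schema of `metrics.json`.

```typescript
// Hypothetical input shape: one entry per workflow step.
interface StepUsage {
  step: string;
  tokens: number;
  lines: number; // non-blank lines produced by the step
}

interface StepRow extends StepUsage {
  tokensPerLine: number; // rounded for display
}

// Rounded tokens/line, guarding against steps with no output lines.
function ratio(tokens: number, lines: number): number {
  return lines > 0 ? Math.round(tokens / lines) : 0;
}

function breakdownBySteps(steps: StepUsage[]): { rows: StepRow[]; total: StepRow } {
  const rows = steps.map((s) => ({ ...s, tokensPerLine: ratio(s.tokens, s.lines) }));
  const tokens = steps.reduce((sum, s) => sum + s.tokens, 0);
  const lines = steps.reduce((sum, s) => sum + s.lines, 0);
  // The total ratio comes from summed tokens and lines, not from averaging step ratios.
  return { rows, total: { step: "Total", tokens, lines, tokensPerLine: ratio(tokens, lines) } };
}

// Values taken from the sample table above.
const sampleSteps: StepUsage[] = [
  { step: "architecture-designer", tokens: 18200, lines: 168 },
  { step: "security-architect", tokens: 14600, lines: 132 },
  { step: "test-architect", tokens: 13100, lines: 119 },
  { step: "technical-writer", tokens: 15350, lines: 129 },
];

console.log(breakdownBySteps(sampleSteps).total);
```

Computing the total from the summed counts keeps it consistent with the session-level tokens/line figure, which a plain average of the step ratios would not guarantee.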

### Optimization Report (`--optimize`)

```
Optimization Suggestions
━━━━━━━━━━━━━━━━━━━━━━━━

Status: green — no critical optimizations needed.

Opportunities (optional):
  1. technical-writer (119 tok/line) — near benchmark ceiling.
     Consider: scope the synthesis prompt to final merge only,
     avoid re-reading full drafts.

  2. architecture-designer (18,200 tokens) — highest absolute cost.
     Consider: pass only the relevant SAD section, not the full doc.
```
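
The two opportunities in the sample report correspond to the step with the highest tokens/line and the step with the highest absolute token count. How `--optimize` actually ranks steps is not documented here, so the selection rule below is an assumption, and `maxBy` and `pickOpportunities` are hypothetical names.

```typescript
// Hypothetical input shape: one entry per workflow step.
interface StepUsage {
  step: string;
  tokens: number;
  lines: number;
}

// Return the item with the largest score, or undefined for an empty list.
function maxBy<T>(items: T[], score: (item: T) => number): T | undefined {
  let best: T | undefined;
  let bestScore = -Infinity;
  for (const item of items) {
    const s = score(item);
    if (s > bestScore) {
      best = item;
      bestScore = s;
    }
  }
  return best;
}

// Assumed rule: surface the step closest to the benchmark ceiling
// and the step with the highest absolute cost.
function pickOpportunities(steps: StepUsage[]) {
  const withOutput = steps.filter((s) => s.lines > 0);
  return {
    highestRatio: maxBy(withOutput, (s) => s.tokens / s.lines),
    highestCost: maxBy(steps, (s) => s.tokens),
  };
}

// Values taken from the per-step sample table.
const sampleSteps: StepUsage[] = [
  { step: "architecture-designer", tokens: 18200, lines: 168 },
  { step: "security-architect", tokens: 14600, lines: 132 },
  { step: "test-architect", tokens: 13100, lines: 119 },
  { step: "technical-writer", tokens: 15350, lines: 129 },
];

console.log(pickOpportunities(sampleSteps));
```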

## Efficiency Calculation

Token efficiency uses the estimation and comparison logic from `src/metrics/token-counter.ts`:

```
tokens          = ceil(characters / 4)
tokensPerLine   = tokens / nonBlankLines
vsBenchmark     = (tokensPerLine - 124) / 124 * 100   (negative = better)
vsBaseline      = (tokensPerLine - 200) / 200 * 100   (negative = better)
```
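
The formulas above translate directly into a short TypeScript sketch. It is an illustration only, not the contents of `src/metrics/token-counter.ts`; `computeEfficiency` and `EfficiencyMetrics` are hypothetical names.

```typescript
// Constants taken from the benchmark tables in this document.
const METAGPT_TOKENS_PER_LINE = 124; // REF-013 benchmark
const TYPICAL_LLM_TOKENS_PER_LINE = 200; // typical LLM baseline
const CHARS_PER_TOKEN = 4; // estimation heuristic

// Hypothetical result shape, named for this example only.
interface EfficiencyMetrics {
  tokens: number;
  tokensPerLine: number; // unrounded
  vsBenchmark: number; // percent; negative = better
  vsBaseline: number; // percent; negative = better
}

function computeEfficiency(characters: number, nonBlankLines: number): EfficiencyMetrics {
  if (nonBlankLines <= 0) {
    throw new RangeError("nonBlankLines must be positive");
  }
  const tokens = Math.ceil(characters / CHARS_PER_TOKEN);
  const tokensPerLine = tokens / nonBlankLines;
  return {
    tokens,
    tokensPerLine,
    vsBenchmark: ((tokensPerLine - METAGPT_TOKENS_PER_LINE) / METAGPT_TOKENS_PER_LINE) * 100,
    vsBaseline: ((tokensPerLine - TYPICAL_LLM_TOKENS_PER_LINE) / TYPICAL_LLM_TOKENS_PER_LINE) * 100,
  };
}

// Inputs from the sample report: 245,000 characters, 548 non-blank lines.
const sample = computeEfficiency(245000, 548);
console.log(sample.tokens, sample.tokensPerLine.toFixed(1), sample.vsBenchmark.toFixed(1));
```

For the sample report's inputs this gives 61,250 tokens and roughly 111.8 tokens/line. The unrounded ratio yields about -9.9% against the benchmark; the -9.7% shown in the sample report matches the ratio rounded to 112 before the comparison.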

## Examples

### Example 1: Quick efficiency check

**User**: "Token efficiency for this session"

**Action**:
```bash
aiwg metrics-tokens
```

**Response**: Efficiency report with tokens/line ratio, benchmark comparison, and green/yellow/red status.

### Example 2: Identify expensive steps

**User**: "Which step used the most tokens?"

**Action**:
```bash
aiwg metrics-tokens --by-step
```

**Response**: Per-step table showing token counts, line counts, tokens/line ratio, and threshold status for each workflow step.

### Example 3: Optimization pass

**User**: "Suggest ways to reduce token usage"

**Action**:
```bash
aiwg metrics-tokens --optimize
```

**Response**: Optimization suggestions targeted at steps above the green threshold, with specific prompt-scoping recommendations.

### Example 4: Are we at green?

**User**: "Are we at green on token efficiency?"

**Extraction**: Threshold check

**Action**:
```bash
aiwg metrics-tokens --threshold
```

**Response**: "Threshold status: **green** — 112 tokens/line, 9.7% below the MetaGPT 124 tokens/line benchmark (REF-013)."

## Clarification Prompts

If the session scope is unclear:

- "Should I analyze the current running session or the most recent completed session?"

## References

- @$AIWG_ROOT/src/cli/handlers/subcommands.ts — Metrics tokens handler
- @$AIWG_ROOT/src/metrics/token-counter.ts — Token counting, MetaGPT baseline constants (REF-013)
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/schemas/flows/token-efficiency.yaml — Token efficiency schema
- @$AIWG_ROOT/docs/cli-reference.md — CLI reference

Related Skills

All from jmagly/aiwg:

regression-metrics: Track and analyze regression statistics, trends, hotspots, and health indicators across test suites.

aiwg-orchestrate: Route structured artifact work to AIWG workflows via MCP with zero parent context cost.

venv-manager: Create, manage, and validate Python virtual environments. Use for project isolation and dependency management.

pytest-runner: Execute Python tests with pytest, supporting fixtures, markers, coverage, and parallel execution. Use for Python test automation.

vitest-runner: Execute JavaScript/TypeScript tests with Vitest, supporting coverage, watch mode, and parallel execution. Use for JS/TS test automation.

eslint-checker: Run ESLint for JavaScript/TypeScript code quality and style enforcement. Use for static analysis and auto-fixing.

repo-analyzer: Analyze GitHub repositories for structure, documentation, dependencies, and contribution patterns. Use for codebase understanding and health assessment.