ct-grade

CLEO session grading and A/B behavioral analysis with token tracking. Evaluates agent session quality via a 5-dimension rubric (S1 session discipline, S2 discovery efficiency, S3 task hygiene, S4 error protocol, S5 progressive disclosure). Supports three modes: (1) scenario — run playbook scenarios S1-S5 via CLI; (2) ab — blind A/B comparison of different CLI configurations for same domain operations with token cost measurement; (3) blind — spawn two agents with different configurations, blind-comparator picks winner, analyzer produces recommendation. Use when grading agent sessions, running grade playbook scenarios, comparing behavioral differences, measuring token usage across configurations, or performing multi-run blind A/B evaluation with statistical analysis and comparative report. Triggers on: grade session, evaluate agent behavior, A/B test CLEO configurations, run grade scenario, token usage analysis, behavioral rubric, protocol compliance scoring.

141 stars

Best use case

ct-grade is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using ct-grade should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/ct-grade/SKILL.md --create-dirs "https://raw.githubusercontent.com/kryptobaseddev/cleo/main/packages/skills/skills/ct-grade/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/ct-grade/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How ct-grade Compares

| Feature / Agent | ct-grade | Standard Approach |
|-----------------|----------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

ct-grade grades CLEO agent sessions against a 5-dimension behavioral rubric (session discipline, discovery efficiency, task hygiene, error protocol, progressive disclosure) and runs A/B comparisons of CLI configurations with token tracking. It supports three modes: scenario, ab, and blind.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Session Grading Guide

Session grading evaluates agent behavioral patterns against the CLEO protocol. It reads the audit log for a completed session and applies a 5-dimension rubric to produce a score (0-100), letter grade (A-F), and diagnostic flags.

## When to Use Grade Mode

Use grading when you need to:
- Evaluate how well an agent followed CLEO protocol during a session
- Identify behavioral anti-patterns (skipped discovery, missing session.end, etc.)
- Track improvement over time across multiple sessions
- Validate that orchestrated subagents followed protocol

Grading requires audit data. Sessions must be started with the `--grade` flag to enable audit log capture.

## Starting a Grade Session

### CLI

```bash
# Start a session with grading enabled
ct session start --scope epic:T001 --name "Feature work" --grade

# The --grade flag enables detailed audit logging
# All CLI operations are recorded for later analysis
```

## Running Scenarios

The grading rubric evaluates 5 behavioral scenarios that map to protocol compliance:

### 1. Fresh Discovery
Tests whether the agent checks existing sessions and tasks before starting work. Evaluates `session.list` and `tasks.find` calls at session start.

### 2. Task Hygiene
Tests whether task creation follows protocol: descriptions provided, parent existence verified before subtask creation, no duplicate tasks.

### 3. Error Recovery
Tests whether the agent handles errors correctly: follows up `E_NOT_FOUND` with recovery lookups (`tasks.find`), avoids duplicate creates after failures.

### 4. Full Lifecycle
Tests session discipline end-to-end: session listed before task ops, session properly ended, CLI usage patterns.

### 5. Multi-Domain Analysis
Tests progressive disclosure: whether the agent calls `admin.help` or performs skill lookups, and whether it uses progressive disclosure for programmatic access.

## Evaluating Results

### CLI

```bash
# Grade a specific session
ct grade <sessionId>

# List all past grade results
ct grade --list
```

## Understanding the 5 Dimensions

Each dimension scores 0-20 points, totaling 0-100.

### S1: Session Discipline (20 pts)

| Points | Criteria |
|--------|----------|
| 10 | `session.list` called before first task operation |
| 10 | `session.end` called when work is complete |

**What it measures**: Does the agent check existing sessions before starting, and properly close sessions when done?
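
The S1 criteria can be sketched as a small function. This is an illustration only, not the grader's actual implementation. It assumes the audit log has been reduced to an ordered list of operation names, and that a "task operation" is any name starting with `tasks.`; both the function name and that input shape are hypothetical.

```python
def score_session_discipline(ops: list[str]) -> int:
    """Return an S1 score (0-20) for an ordered list of operation names.

    Assumption: `ops` holds names such as "session.list" or "tasks.find",
    in the order the audit log recorded them.
    """
    score = 0
    # Index of the first task operation; len(ops) if there is none.
    first_task = next(
        (i for i, op in enumerate(ops) if op.startswith("tasks.")),
        len(ops),
    )
    # 10 points: session.list appears before the first task operation.
    if "session.list" in ops[:first_task]:
        score += 10
    # 10 points: session.end was called at some point in the session.
    if "session.end" in ops:
        score += 10
    return score
```

For example, a session that lists sessions, finds a task, and then ends the session would earn the full 20 under these assumptions, while one that starts with `tasks.find` and never ends the session would earn 0.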

### S2: Discovery Efficiency (20 pts)

| Points | Criteria |
|--------|----------|
| 0-15 | `find:list` ratio >= 80% earns the full 15; scales linearly below 80% |
| 5 | `tasks.show` used for detail retrieval |

**What it measures**: Does the agent prefer `tasks.find` (low context cost) over `tasks.list` (high context cost) for discovery?
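
A minimal sketch of the S2 arithmetic follows. The source does not define the `find:list` ratio precisely, so this assumes it means `tasks.find` calls as a share of all `tasks.find` plus `tasks.list` calls, that "scales linearly" means 15 × ratio / 0.8 below the threshold, and that a session with no discovery calls earns none of the 15 points. The function name is hypothetical.

```python
def score_discovery_efficiency(ops: list[str]) -> float:
    """Return an S2 score (0-20) for an ordered list of operation names."""
    finds = ops.count("tasks.find")
    lists = ops.count("tasks.list")
    total = finds + lists
    # Assumed definition: share of discovery calls that used tasks.find.
    ratio = finds / total if total else 0.0
    # Full 15 points at >= 80%; linear scaling below that threshold.
    ratio_points = 15.0 if ratio >= 0.8 else 15.0 * ratio / 0.8
    # 5 points if tasks.show was used for detail retrieval at least once.
    show_points = 5.0 if "tasks.show" in ops else 0.0
    return ratio_points + show_points
```

Under these assumptions, four `tasks.find` calls, one `tasks.list` call, and one `tasks.show` call give a ratio of exactly 80% and the full 20 points.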

### S3: Task Hygiene (20 pts)

Starts at 20 and deducts for violations:

| Deduction | Violation |
|-----------|-----------|
| -5 each | `tasks.add` without a description |
| -3 | Subtasks created without `tasks.find {exact:true}` parent check |

**What it measures**: Does the agent create well-formed tasks with descriptions and verify parents before creating subtasks?
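
The deduction model above can be illustrated as follows. This sketch assumes the violations have already been counted from the audit log, that the subtask deduction is applied once per session (the table does not say "each"), and that the score is floored at 0. The function and parameter names are hypothetical.

```python
def score_task_hygiene(adds_without_description: int,
                       unchecked_subtasks: bool) -> int:
    """Return an S3 score (0-20) from pre-counted violations."""
    score = 20
    # -5 for every tasks.add call that lacked a description.
    score -= 5 * adds_without_description
    # -3 once if any subtask was created without a parent existence check.
    if unchecked_subtasks:
        score -= 3
    # Assumed floor: a dimension score never drops below 0.
    return max(score, 0)
```

One missing description plus an unchecked subtask would leave 12 of 20 points under these assumptions.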

### S4: Error Protocol (20 pts)

Starts at 20 and deducts for violations:

| Deduction | Violation |
|-----------|-----------|
| -5 each | `E_NOT_FOUND` error not followed by recovery lookup within 5 ops |
| -5 | Duplicate task creates detected (same title in session) |

**What it measures**: Does the agent recover gracefully from errors and avoid creating duplicate tasks?
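
The "within 5 ops" recovery window can be sketched like this. It is an illustration under stated assumptions: each audit entry is a dict with an `op` key and optional `error` and `title` keys (a hypothetical shape, not the real audit log format), a recovery lookup is `tasks.find` or `tasks.exists`, the duplicate deduction applies once per session, and the score is floored at 0.

```python
RECOVERY_OPS = {"tasks.find", "tasks.exists"}

def score_error_protocol(entries: list[dict]) -> int:
    """Return an S4 score (0-20) for an ordered list of audit entries."""
    score = 20
    for i, entry in enumerate(entries):
        if entry.get("error") == "E_NOT_FOUND":
            # Look at the next 5 operations for a recovery lookup.
            window = entries[i + 1 : i + 6]
            if not any(e.get("op") in RECOVERY_OPS for e in window):
                score -= 5
    # -5 once if any two tasks.add calls in the session share a title.
    titles = [
        e["title"] for e in entries
        if e.get("op") == "tasks.add" and "title" in e
    ]
    if len(titles) != len(set(titles)):
        score -= 5
    return max(score, 0)
```

A `tasks.find` issued as the sixth operation after an `E_NOT_FOUND` falls outside the window in this sketch, so the error still costs 5 points.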

### S5: Progressive Disclosure Use (20 pts)

| Points | Criteria |
|--------|----------|
| 10 | `admin.help` or skill lookup calls made |
| 10 | Progressive disclosure used for programmatic access |

**What it measures**: Does the agent use progressive disclosure (help/skills) for efficient protocol access?

## Interpreting Scores

### Letter Grades

| Grade | Score Range | Meaning |
|-------|-----------|---------|
| **A** | 90-100 | Excellent protocol adherence. Agent follows all best practices. |
| **B** | 75-89 | Good. Minor gaps in one or two dimensions. |
| **C** | 60-74 | Acceptable. Several protocol violations need attention. |
| **D** | 45-59 | Below expectations. Significant anti-patterns present. |
| **F** | 0-44 | Failing. Major protocol violations across multiple dimensions. |
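
The grade boundaries in the table map directly to a threshold lookup. A minimal sketch, with a hypothetical function name:

```python
def letter_grade(score: float) -> str:
    """Map a 0-100 score to a letter grade using the table's ranges."""
    if score >= 90:
        return "A"
    if score >= 75:
        return "B"
    if score >= 60:
        return "C"
    if score >= 45:
        return "D"
    return "F"
```

For instance, `letter_grade(85)` falls in the 75-89 band and returns `"B"`.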

### Reading the Output

The grade result includes:
- **score/maxScore**: Raw numeric score (e.g., `85/100`)
- **percent**: Percentage score
- **grade**: Letter grade (A-F)
- **dimensions**: Per-dimension breakdown with score, max, and evidence
- **flags**: Specific violations or improvement suggestions
- **entryCount**: Number of audit entries analyzed

### Flags

Flags are actionable diagnostic messages. Each flag identifies a specific behavioral issue:

- `session.list never called` -- Check existing sessions before starting new ones
- `session.end never called` -- Always end sessions when done
- `tasks.list used Nx` -- Prefer `tasks.find` for discovery
- `tasks.add without description` -- Always provide task descriptions
- `Subtasks created without parent existence check` -- Verify parent exists first
- `E_NOT_FOUND not followed by recovery lookup` -- Follow errors with `tasks.find`
- `No admin.help or skill lookup calls` -- Load `ct-cleo` for protocol guidance
- `No progressive disclosure calls` -- Use `admin.help` or skill lookups

## Common Anti-patterns

| Anti-pattern | Impact | Fix |
|-------------|--------|-----|
| Skipping `session.list` at start | -10 S1 | Always check existing sessions first |
| Forgetting `session.end` | -10 S1 | End sessions when work is complete |
| Using `tasks.list` instead of `tasks.find` | up to -15 S2 | Use `find` for discovery, `list` only for known parent children |
| Creating tasks without descriptions | -5 each S3 | Always provide a description with `tasks.add` |
| Ignoring `E_NOT_FOUND` errors | -5 each S4 | Follow up with `tasks.find` or `tasks.exists` |
| Creating duplicate tasks | -5 S4 | Check for existing tasks before creating new ones |
| Never using `admin.help` | -10 S5 | Use progressive disclosure for protocol guidance |
| No progressive disclosure calls | -10 S5 | Use `admin.help` or skill lookups for protocol guidance |

## Grade Result Schema

Grade results are stored in `.cleo/metrics/GRADES.jsonl` as append-only JSONL. Each entry conforms to `schemas/grade.schema.json` with these fields:

- `sessionId` (string, required) -- Session that was graded
- `taskId` (string, optional) -- Associated task ID
- `totalScore` (number, 0-100) -- Aggregate score
- `maxScore` (number, default 100) -- Maximum possible score
- `dimensions` (object) -- Per-dimension `{ score, max, evidence[] }`
- `flags` (string[]) -- Specific violations or suggestions
- `timestamp` (ISO 8601) -- When the grade was computed
- `entryCount` (number) -- Audit entries analyzed
- `evaluator` (`auto` | `manual`) -- How the grade was computed
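
Because the file is append-only JSONL, each line can be parsed independently. The sketch below reads JSONL text and picks the most recent grade for a session. The helper names are hypothetical, the field names come from the list above, and the timestamp comparison assumes all entries use the same ISO 8601 format and timezone so that string ordering matches chronological ordering.

```python
import json

def parse_grades(jsonl_text: str) -> list[dict]:
    """Parse JSONL text into a list of grade entries."""
    entries = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # Skip blank lines between records.
        entry = json.loads(line)
        # maxScore defaults to 100 when the entry omits it.
        entry.setdefault("maxScore", 100)
        entries.append(entry)
    return entries

def latest_for_session(entries: list[dict], session_id: str):
    """Return the newest entry for a session, or None if there is none."""
    matches = [e for e in entries if e["sessionId"] == session_id]
    # Assumes uniformly formatted ISO 8601 timestamps (string-sortable).
    return max(matches, key=lambda e: e["timestamp"]) if matches else None
```

In practice the text would come from something like `open(".cleo/metrics/GRADES.jsonl").read()`, using the path given above.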

## CLI Grade Operations

| Command | Description |
|---------|-------------|
| `ct grade <sessionId>` | Grade a specific session |
| `ct grade --list` | List past grade results |

Related Skills

signaldock-connect

141
from kryptobaseddev/cleo

Connect any AI agent to SignalDock for agent-to-agent messaging. Use when an agent needs to: (1) register on api.signaldock.io, (2) install the signaldock runtime CLI, (3) send/receive messages to other agents, (4) set up SSE real-time streaming, (5) poll for messages, (6) check inbox, or (7) connect to the SignalDock platform. Triggers on: "connect to signaldock", "register agent", "send message to agent", "agent messaging", "signaldock setup", "install signaldock", "agent-to-agent".

ct-validator

141
from kryptobaseddev/cleo

Compliance validation for verifying systems, documents, or code against requirements, schemas, or standards. Performs schema validation, code compliance checks, document validation, and protocol compliance verification with detailed pass/fail reporting. Use when validating compliance, checking schemas, verifying code standards, or auditing protocol implementations. Triggers on validation tasks, compliance checks, or quality verification needs.

ct-task-executor

141
from kryptobaseddev/cleo

General implementation task execution for completing assigned CLEO tasks by following instructions and producing concrete deliverables. Handles coding, configuration, documentation work with quality verification against acceptance criteria and progress reporting. Use when executing implementation tasks, completing assigned work, or producing task deliverables. Triggers on implementation tasks, general execution needs, or task completion work.

ct-stickynote

141
from kryptobaseddev/cleo

Quick ephemeral sticky notes for project-wide capture before formal classification

ct-spec-writer

141
from kryptobaseddev/cleo

Technical specification writing using RFC 2119 language for clear, unambiguous requirements. Creates protocol specifications, technical requirements, API specifications, and architecture documents with testable requirements and compliance criteria. Use when writing specifications, defining protocols, documenting requirements, or creating API contracts. Triggers on specification tasks, protocol definition needs, or requirement documentation.

ct-skill-validator

141
from kryptobaseddev/cleo

Validates an existing skill folder against the full CLEO standard and ecosystem. Use when auditing skills for structural compliance, verifying a skill fits into the CLEO ecosystem and constitution, running quality A/B evals, or preparing a skill for distribution. Runs a 3-phase validation loop — structural, ecosystem fit, and quality eval — then presents all findings as an HTML report opened in the user's browser. Iterates until all required phases pass.

ct-skill-creator

141
from kryptobaseddev/cleo

Guide for creating effective skills. This skill should be used when users want to create a new skill (or update an existing skill) that extends Claude's capabilities with specialized knowledge, workflows, or tool integrations.

ct-research-agent

141
from kryptobaseddev/cleo

Multi-source research and investigation combining web search, documentation lookup via Context7, and codebase analysis. Synthesizes findings into actionable recommendations with proper citation and task traceability. Use when conducting research, investigating best practices, gathering technical information, or analyzing existing implementations. Triggers on research tasks, investigation needs, or information discovery requests.

ct-release-orchestrator

141
from kryptobaseddev/cleo

Orchestrates the full release pipeline: version bump, then changelog, then commit, then tag, then conditionally forks to artifact-publish and provenance based on release config. Parent protocol that composes ct-artifact-publisher and ct-provenance-keeper as sub-protocols: not every release publishes artifacts (source-only releases skip it), and artifact publishers delegate signing and attestation to provenance. Use when shipping a new version, running cleo release ship, or promoting a completed epic to released status.

ct-provenance-keeper

141
from kryptobaseddev/cleo

Generates in-toto v1 attestations, SLSA-level provenance records, SBOMs (CycloneDX or SPDX), and sigstore/cosign signatures for published artifacts. Invoked by ct-artifact-publisher as a delegation for signing and attestation. Records the full commit, then build, then artifact, then attestation, then registry chain in .cleo/releases.json and rejects publishes whose digest does not match the attestation. Triggers when artifact-publish reaches the provenance step or when a release needs SLSA L2+ attestation.

ct-orchestrator

141
from kryptobaseddev/cleo

Pipeline-aware orchestration skill for managing complex workflows through subagent delegation. Use when the user asks to "orchestrate", "orchestrator mode", "run as orchestrator", "delegate to subagents", "coordinate agents", "spawn subagents", "multi-agent workflow", "context-protected workflow", "agent farm", "HITL orchestration", "pipeline management", or needs to manage complex workflows by delegating work to subagents while protecting the main context window. Enforces ORC-001 through ORC-009 constraints. Provider-neutral — works with any AI agent runtime.

ct-memory

141
from kryptobaseddev/cleo

Brain memory protocol with progressive disclosure for anti-hallucination and context recall