## Best use case
Prompt Tester is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
## Overview
Teams using Prompt Tester should expect more consistent output, faster repeated execution, and less prompt rewriting.
## When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
## When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
## Installation
### Claude Code / Cursor / Codex
### Manual Installation
- Download SKILL.md from GitHub
- Place it at `.claude/skills/prompt-tester/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
## How Prompt Tester Compares
| Feature | Prompt Tester | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
## Frequently Asked Questions
### What does this skill do?
Prompt Tester gives you a systematic approach to prompt engineering: design test cases, define evaluation rubrics, run prompt variants against edge cases, and compare results to find the best-performing prompt for your use case.
### Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
## SKILL.md Source
# Prompt Tester
## Overview
Build a systematic approach to prompt engineering. Design test cases, define evaluation rubrics, run prompt variants against edge cases, and compare results to find the best-performing prompt for your use case.
## Instructions
### 1. Define the evaluation criteria
Before testing prompts, establish what "good" looks like:
```
## Evaluation Rubric: Customer Support Classifier
| Criterion | Weight | Description |
|---------------|--------|------------------------------------------|
| Accuracy | 40% | Correct category assigned |
| Consistency | 25% | Same input → same output across runs |
| Latency | 15% | Response time under threshold |
| Format | 10% | Output matches expected JSON schema |
| Edge cases | 10% | Handles ambiguous/unusual inputs |
```
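To make scoring mechanical rather than impressionistic, the rubric can also be expressed as weights in code. A minimal sketch; the dictionary keys, function name, and the 0-to-1 per-criterion scale are illustrative, not part of the skill:

```python
# Rubric weights mirroring the table above; each criterion is scored 0.0-1.0.
RUBRIC_WEIGHTS = {
    "accuracy": 0.40,
    "consistency": 0.25,
    "latency": 0.15,
    "format": 0.10,
    "edge_cases": 0.10,
}

def weighted_score(scores: dict[str, float]) -> float:
    """Collapse per-criterion scores into one number for comparing variants."""
    return sum(weight * scores.get(name, 0.0)
               for name, weight in RUBRIC_WEIGHTS.items())

# Example: strong on accuracy and format, weak on edge cases.
print(weighted_score({"accuracy": 0.9, "consistency": 0.85, "latency": 1.0,
                      "format": 1.0, "edge_cases": 0.4}))  # 0.8625
```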
### 2. Create test cases
Build a test suite covering normal cases, edge cases, and adversarial inputs:
```yaml
test_cases:
  - id: TC-001
    input: "My order hasn't arrived and it's been 2 weeks"
    expected_category: "shipping_delay"
    expected_priority: "high"
    tags: [normal, shipping]
  - id: TC-002
    input: "I love your product! Also my payment failed"
    expected_category: "billing"
    expected_priority: "high"
    tags: [mixed-intent, edge-case]
  - id: TC-003
    input: "asdf jkl; 12345"
    expected_category: "unclassifiable"
    expected_priority: "low"
    tags: [adversarial, garbage-input]
  - id: TC-004
    input: ""
    expected_category: "unclassifiable"
    expected_priority: "low"
    tags: [adversarial, empty-input]
```
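If the suite lives in a YAML file, it can be loaded and filtered programmatically. A sketch, assuming the suite above is saved as `test_cases.yaml` and PyYAML is installed:

```python
import yaml  # PyYAML

with open("test_cases.yaml") as f:
    suite = yaml.safe_load(f)

all_cases = suite["test_cases"]
# Slice the suite by tag, e.g. to rerun only the adversarial inputs.
adversarial = [c for c in all_cases if "adversarial" in c["tags"]]
print(f"{len(all_cases)} cases total, {len(adversarial)} adversarial")
```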
### 3. Design prompt variants
Create 2-3 prompt variants to compare:
**Variant A (Concise):**
```
Classify this support ticket into one category: billing, shipping_delay,
product_defect, account_access, feature_request, unclassifiable.
Return JSON: {"category": "...", "priority": "high|medium|low"}
```
**Variant B (Detailed with examples):**
```
You are a support ticket classifier. Analyze the customer message and
assign exactly one category and priority level.
Categories: billing, shipping_delay, product_defect, account_access,
feature_request, unclassifiable
Rules:
- If the message contains multiple issues, classify by the most urgent
- If the message is gibberish or empty, use "unclassifiable"
- Priority is "high" for payment/shipping issues, "medium" for product
issues, "low" for feature requests
Examples:
Input: "I was charged twice for my subscription"
Output: {"category": "billing", "priority": "high"}
Input: "It would be nice to have dark mode"
Output: {"category": "feature_request", "priority": "low"}
Now classify this message:
```
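One way to keep variants comparable is to store them side by side and build the final prompt the same way for each. A sketch; the names and the trailing "Message:" format are assumptions to adapt to however your model API takes input:

```python
# Variant texts correspond to Variant A and Variant B above
# (Variant B is elided here for brevity).
PROMPT_VARIANTS = {
    "A": (
        "Classify this support ticket into one category: billing, shipping_delay, "
        "product_defect, account_access, feature_request, unclassifiable.\n"
        'Return JSON: {"category": "...", "priority": "high|medium|low"}'
    ),
    "B": "...",  # the detailed prompt with rules and examples
}

def build_prompt(variant: str, message: str) -> str:
    """Append the customer message to the chosen variant."""
    return f"{PROMPT_VARIANTS[variant]}\n\nMessage: {message}"
```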
### 4. Run the evaluation
Execute each prompt variant against all test cases and score:
```
## Evaluation Results
| Metric | Variant A | Variant B |
|--------------|-----------|-----------|
| Accuracy | 72% | 91% |
| Consistency | 85% | 94% |
| Format match | 100% | 100% |
| Edge cases | 40% | 80% |
| Avg tokens | 12 | 18 |
### Detailed Results
| Test Case | Variant A | Variant B | Expected |
|-----------|---------------------|---------------------|---------------------|
| TC-001 | ✅ shipping_delay | ✅ shipping_delay | shipping_delay |
| TC-002    | ❌ general_inquiry  | ✅ billing          | billing             |
| TC-003 | ❌ feature_request | ✅ unclassifiable | unclassifiable |
| TC-004 | ❌ (error) | ✅ unclassifiable | unclassifiable |
Winner: Variant B (+19% accuracy, +40% edge case handling)
Tradeoff: ~50% more tokens per request
```
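The scoring loop itself only needs a few lines. A skeleton, where `classify()` is a placeholder for your model call and the expected output is the JSON schema from the variants above:

```python
import json

def classify(prompt: str) -> str:
    """Placeholder: call your model API here and return the raw text output."""
    raise NotImplementedError

def run_variant(prompt_text: str, cases: list[dict]) -> dict:
    """Score one prompt variant against a list of test cases."""
    correct, format_ok, failures = 0, 0, []
    for case in cases:
        raw = classify(f"{prompt_text}\n\nMessage: {case['input']}")
        try:
            out = json.loads(raw)
            format_ok += 1
        except json.JSONDecodeError:
            failures.append((case["id"], "invalid JSON"))
            continue
        if out.get("category") == case["expected_category"]:
            correct += 1
        else:
            failures.append((case["id"], out.get("category")))
    return {"accuracy": correct / len(cases),
            "format_match": format_ok / len(cases),
            "failures": failures}
```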
### 5. Iterate and refine
Based on failures, improve the winning variant:
- Analyze which test cases failed and why
- Add specific rules for failure patterns
- Rerun only the failed cases to verify fixes (see the sketch after this list)
- Document the final prompt with its test results
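Building on the `run_variant()` skeleton above, rerunning only the failed cases might look like the sketch below; the function name and arguments are illustrative:

```python
def rerun_failures(results: dict, all_cases: list[dict], prompt_text: str) -> dict:
    """Re-evaluate only the cases that failed in a previous run_variant() result."""
    failed_ids = {case_id for case_id, _ in results["failures"]}
    retest = [c for c in all_cases if c["id"] in failed_ids]
    return run_variant(prompt_text, retest)  # run_variant from the earlier sketch
```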
## Examples
### Example 1: Summarization Prompt
**Input:** "I need a prompt that summarizes meeting transcripts into action items. Help me test it."
**Output:**
```
## Test Suite: Meeting Summary Prompt
Test cases created: 8
- 3 normal meetings (standup, planning, retrospective)
- 2 edge cases (meeting with no action items, all-chat meeting)
- 2 long transcripts (>5000 words)
- 1 adversarial (transcript in mixed languages)
Rubric:
- Action item extraction: 40% (did it find all action items?)
- Owner assignment: 25% (did it identify who owns each item?)
- Brevity: 20% (summary under 200 words?)
- No hallucination: 15% (no invented action items?)
Running 3 variants... Results ready.
```
### Example 2: Code Review Prompt
**Input:** "Compare these two system prompts for our AI code reviewer"
**Output:**
```
## Comparison: Code Review Prompts
Test cases: 6 code snippets (SQL injection, race condition, clean code,
style-only issues, empty file, 500-line file)
| Metric | Prompt A | Prompt B |
|---------------------|----------|----------|
| Bug detection | 4/6 | 6/6 |
| False positives | 3 | 1 |
| Actionable feedback | 60% | 90% |
| Handles large files | ❌ | ✅ |
Prompt B is better: fewer false positives, catches all bugs,
and handles edge cases. Main improvement: explicit severity levels
and "only report issues you are confident about" instruction.
```
## Guidelines
- Always define evaluation criteria BEFORE testing — prevents post-hoc rationalization
- Test at least 8-10 cases: 50% normal, 30% edge cases, 20% adversarial
- Run each variant 3 times to check consistency, since LLMs are non-deterministic (see the sketch after this list)
- Track token usage alongside quality — cost matters at scale
- Keep a prompt changelog: version, date, changes, test results
- The winning prompt isn't always the longest — sometimes concise prompts outperform
- Document failure modes: knowing when a prompt breaks is as valuable as knowing when it works
- For production prompts, add regression tests and rerun them when updating the model version
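For the consistency guideline, repeated runs can be scored as simple agreement. A sketch that reuses the `classify()` placeholder from the evaluation skeleton above:

```python
from collections import Counter

def consistency(prompt_text: str, message: str, runs: int = 3) -> float:
    """Fraction of runs agreeing with the most common output (1.0 = fully stable)."""
    outputs = [classify(f"{prompt_text}\n\nMessage: {message}") for _ in range(runs)]
    return Counter(outputs).most_common(1)[0][1] / runs
```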
## Related Skills
optimizing-prompts
Optimizes prompts for large language models (LLMs) to reduce token usage, lower costs, and improve performance. It analyzes the prompt, identifies areas for simplification and redundancy removal, and rewrites the prompt to be more concise... Use when optimizing performance. Trigger with phrases like 'optimize', 'performance', or 'speed up'.
network-latency-tester
Network Latency Tester - Auto-activating skill for Performance Testing. Triggers on: network latency tester. Part of the Performance Testing skill category.
keyboard-navigation-tester
Keyboard Navigation Tester - Auto-activating skill for Frontend Development. Triggers on: keyboard navigation tester. Part of the Frontend Development skill category.
hypothesis-tester
Structured hypothesis formulation, experiment design, and results interpretation for Product Managers. Use when the user needs to validate an assumption, design an A/B test, evaluate experiment results, or decide whether to ship based on data. Triggers include "hypothesis", "A/B test", "experiment", "validate assumption", "test this", "should we ship", or when making a decision that should be data-informed.
cursor-custom-prompts
Create effective custom prompts for Cursor AI using project rules, prompt engineering patterns, and reusable templates. Triggers on "cursor prompts", "prompt engineering cursor", "better cursor prompts", "cursor instructions", "cursor prompt templates".
promptify
Transform user requests into detailed, precise prompts for AI models. Use when users say "promptify", "promptify this", or explicitly request prompt engineering or improvement of their request for better AI responses.
prompt-improver
Optimize prompts for better AI responses. Use when user asks to improve a prompt, refine a prompt, make a prompt better, optimize prompting, review their prompt, or says "/improve-prompt". Transforms vague requests into clear, specific, actionable prompts.
gws-modelarmor-sanitize-prompt
Google Model Armor: Sanitize a user prompt through a Model Armor template.
tldr-prompt
Create tldr summaries for GitHub Copilot files (prompts, agents, instructions, collections), MCP servers, or documentation from URLs and queries.
prompt-builder
Guide users through creating high-quality GitHub Copilot prompts with proper structure, tools, and best practices.
promptfoo-evaluation
Configures and runs LLM evaluation using Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".
prompt-injection-test
A test skill with prompt injection patterns