## Best use case
Prompt Tester is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
## Overview
Teams using Prompt Tester should expect more consistent output, faster repeated execution, and less prompt rewriting.
## When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
## When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
## Installation
### Claude Code / Cursor / Codex
### Manual Installation
- Download SKILL.md from GitHub
- Place it at `.claude/skills/prompt-tester/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
## How Prompt Tester Compares
| Feature | Prompt Tester | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
## Frequently Asked Questions
### What does this skill do?
Prompt Tester gives you a systematic approach to prompt engineering: design test cases, define evaluation rubrics, run prompt variants against edge cases, and compare results to find the best-performing prompt for your use case.
### Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
## SKILL.md Source
# Prompt Tester
## Overview
Build a systematic approach to prompt engineering. Design test cases, define evaluation rubrics, run prompt variants against edge cases, and compare results to find the best-performing prompt for your use case.
## Instructions
### 1. Define the evaluation criteria
Before testing prompts, establish what "good" looks like:
```
## Evaluation Rubric: Customer Support Classifier
| Criterion | Weight | Description |
|---------------|--------|------------------------------------------|
| Accuracy | 40% | Correct category assigned |
| Consistency | 25% | Same input → same output across runs |
| Latency | 15% | Response time under threshold |
| Format | 10% | Output matches expected JSON schema |
| Edge cases | 10% | Handles ambiguous/unusual inputs |
```
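To make scoring mechanical rather than impressionistic, the rubric can also be expressed as weights in code. A minimal sketch; the dictionary keys, function name, and the 0-to-1 per-criterion scale are illustrative, not part of the skill:

```python
# Rubric weights mirroring the table above; each criterion is scored 0.0-1.0.
RUBRIC_WEIGHTS = {
    "accuracy": 0.40,
    "consistency": 0.25,
    "latency": 0.15,
    "format": 0.10,
    "edge_cases": 0.10,
}

def weighted_score(scores: dict[str, float]) -> float:
    """Collapse per-criterion scores into one number for comparing variants."""
    return sum(weight * scores.get(name, 0.0)
               for name, weight in RUBRIC_WEIGHTS.items())

# Example: strong on accuracy and format, weak on edge cases.
print(weighted_score({"accuracy": 0.9, "consistency": 0.85, "latency": 1.0,
                      "format": 1.0, "edge_cases": 0.4}))  # 0.8625
```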
### 2. Create test cases
Build a test suite covering normal cases, edge cases, and adversarial inputs:
```yaml
test_cases:
  - id: TC-001
    input: "My order hasn't arrived and it's been 2 weeks"
    expected_category: "shipping_delay"
    expected_priority: "high"
    tags: [normal, shipping]
  - id: TC-002
    input: "I love your product! Also my payment failed"
    expected_category: "billing"
    expected_priority: "high"
    tags: [mixed-intent, edge-case]
  - id: TC-003
    input: "asdf jkl; 12345"
    expected_category: "unclassifiable"
    expected_priority: "low"
    tags: [adversarial, garbage-input]
  - id: TC-004
    input: ""
    expected_category: "unclassifiable"
    expected_priority: "low"
    tags: [adversarial, empty-input]
```
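If the suite lives in a YAML file, it can be loaded and filtered programmatically. A sketch, assuming the suite above is saved as `test_cases.yaml` and PyYAML is installed:

```python
import yaml  # PyYAML

with open("test_cases.yaml") as f:
    suite = yaml.safe_load(f)

all_cases = suite["test_cases"]
# Slice the suite by tag, e.g. to rerun only the adversarial inputs.
adversarial = [c for c in all_cases if "adversarial" in c["tags"]]
print(f"{len(all_cases)} cases total, {len(adversarial)} adversarial")
```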
### 3. Design prompt variants
Create 2-3 prompt variants to compare:
**Variant A (Concise):**
```
Classify this support ticket into one category: billing, shipping_delay,
product_defect, account_access, feature_request, unclassifiable.
Return JSON: {"category": "...", "priority": "high|medium|low"}
```
**Variant B (Detailed with examples):**
```
You are a support ticket classifier. Analyze the customer message and
assign exactly one category and priority level.
Categories: billing, shipping_delay, product_defect, account_access,
feature_request, unclassifiable
Rules:
- If the message contains multiple issues, classify by the most urgent
- If the message is gibberish or empty, use "unclassifiable"
- Priority is "high" for payment/shipping issues, "medium" for product
issues, "low" for feature requests
Examples:
Input: "I was charged twice for my subscription"
Output: {"category": "billing", "priority": "high"}
Input: "It would be nice to have dark mode"
Output: {"category": "feature_request", "priority": "low"}
Now classify this message:
```
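One way to keep variants comparable is to store them side by side and build the final prompt the same way for each. A sketch; the names and the trailing "Message:" format are assumptions to adapt to however your model API takes input:

```python
# Variant texts correspond to Variant A and Variant B above
# (Variant B is elided here for brevity).
PROMPT_VARIANTS = {
    "A": (
        "Classify this support ticket into one category: billing, shipping_delay, "
        "product_defect, account_access, feature_request, unclassifiable.\n"
        'Return JSON: {"category": "...", "priority": "high|medium|low"}'
    ),
    "B": "...",  # the detailed prompt with rules and examples
}

def build_prompt(variant: str, message: str) -> str:
    """Append the customer message to the chosen variant."""
    return f"{PROMPT_VARIANTS[variant]}\n\nMessage: {message}"
```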
### 4. Run the evaluation
Execute each prompt variant against all test cases and score:
```
## Evaluation Results
| Metric | Variant A | Variant B |
|--------------|-----------|-----------|
| Accuracy | 72% | 91% |
| Consistency | 85% | 94% |
| Format match | 100% | 100% |
| Edge cases | 40% | 80% |
| Avg tokens | 12 | 18 |
### Detailed Results
| Test Case | Variant A | Variant B | Expected |
|-----------|---------------------|---------------------|---------------------|
| TC-001 | ✅ shipping_delay | ✅ shipping_delay | shipping_delay |
| TC-002    | ❌ general_inquiry  | ✅ billing          | billing             |
| TC-003 | ❌ feature_request | ✅ unclassifiable | unclassifiable |
| TC-004 | ❌ (error) | ✅ unclassifiable | unclassifiable |
Winner: Variant B (+19% accuracy, +40% edge case handling)
Tradeoff: ~50% more tokens per request
```
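The scoring loop itself only needs a few lines. A skeleton, where `classify()` is a placeholder for your model call and the expected output is the JSON schema from the variants above:

```python
import json

def classify(prompt: str) -> str:
    """Placeholder: call your model API here and return the raw text output."""
    raise NotImplementedError

def run_variant(prompt_text: str, cases: list[dict]) -> dict:
    """Score one prompt variant against a list of test cases."""
    correct, format_ok, failures = 0, 0, []
    for case in cases:
        raw = classify(f"{prompt_text}\n\nMessage: {case['input']}")
        try:
            out = json.loads(raw)
            format_ok += 1
        except json.JSONDecodeError:
            failures.append((case["id"], "invalid JSON"))
            continue
        if out.get("category") == case["expected_category"]:
            correct += 1
        else:
            failures.append((case["id"], out.get("category")))
    return {"accuracy": correct / len(cases),
            "format_match": format_ok / len(cases),
            "failures": failures}
```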
### 5. Iterate and refine
Based on failures, improve the winning variant:
- Analyze which test cases failed and why
- Add specific rules for failure patterns
- Rerun only the failed cases to verify fixes (see the sketch after this list)
- Document the final prompt with its test results
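Building on the `run_variant()` skeleton above, rerunning only the failed cases might look like the sketch below; the function name and arguments are illustrative:

```python
def rerun_failures(results: dict, all_cases: list[dict], prompt_text: str) -> dict:
    """Re-evaluate only the cases that failed in a previous run_variant() result."""
    failed_ids = {case_id for case_id, _ in results["failures"]}
    retest = [c for c in all_cases if c["id"] in failed_ids]
    return run_variant(prompt_text, retest)  # run_variant from the earlier sketch
```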
## Examples
### Example 1: Summarization Prompt
**Input:** "I need a prompt that summarizes meeting transcripts into action items. Help me test it."
**Output:**
```
## Test Suite: Meeting Summary Prompt
Test cases created: 8
- 3 normal meetings (standup, planning, retrospective)
- 2 edge cases (meeting with no action items, all-chat meeting)
- 2 long transcripts (>5000 words)
- 1 adversarial (transcript in mixed languages)
Rubric:
- Action item extraction: 40% (did it find all action items?)
- Owner assignment: 25% (did it identify who owns each item?)
- Brevity: 20% (summary under 200 words?)
- No hallucination: 15% (no invented action items?)
Running 3 variants... Results ready.
```
### Example 2: Code Review Prompt
**Input:** "Compare these two system prompts for our AI code reviewer"
**Output:**
```
## Comparison: Code Review Prompts
Test cases: 6 code snippets (SQL injection, race condition, clean code,
style-only issues, empty file, 500-line file)
| Metric | Prompt A | Prompt B |
|---------------------|----------|----------|
| Bug detection | 4/6 | 6/6 |
| False positives | 3 | 1 |
| Actionable feedback | 60% | 90% |
| Handles large files | ❌ | ✅ |
Prompt B is better: fewer false positives, catches all bugs,
and handles edge cases. Main improvement: explicit severity levels
and "only report issues you are confident about" instruction.
```
## Guidelines
- Always define evaluation criteria BEFORE testing — prevents post-hoc rationalization
- Test at least 8-10 cases: 50% normal, 30% edge cases, 20% adversarial
- Run each variant 3 times to check consistency, since LLMs are non-deterministic (see the sketch after this list)
- Track token usage alongside quality — cost matters at scale
- Keep a prompt changelog: version, date, changes, test results
- The winning prompt isn't always the longest — sometimes concise prompts outperform
- Document failure modes: knowing when a prompt breaks is as valuable as knowing when it works
- For production prompts, add regression tests and rerun them when updating the model version
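For the consistency guideline, repeated runs can be scored as simple agreement. A sketch that reuses the `classify()` placeholder from the evaluation skeleton above:

```python
from collections import Counter

def consistency(prompt_text: str, message: str, runs: int = 3) -> float:
    """Fraction of runs agreeing with the most common output (1.0 = fully stable)."""
    outputs = [classify(f"{prompt_text}\n\nMessage: {message}") for _ in range(runs)]
    return Counter(outputs).most_common(1)[0][1] / runs
```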
## Related Skills
optimizing-prompts
Optimizes prompts for large language models (LLMs) to reduce token usage, lower costs, and improve performance. It analyzes the prompt, identifies areas for simplification and redundancy removal, and rewrites the prompt to be more concise... Use when optimizing performance. Trigger with phrases like 'optimize', 'performance', or 'speed up'.
network-latency-tester
Network Latency Tester - Auto-activating skill for Performance Testing. Triggers on: network latency tester. Part of the Performance Testing skill category.
keyboard-navigation-tester
Keyboard Navigation Tester - Auto-activating skill for Frontend Development. Triggers on: keyboard navigation tester. Part of the Frontend Development skill category.
hypothesis-tester
Structured hypothesis formulation, experiment design, and results interpretation for Product Managers. Use when the user needs to validate an assumption, design an A/B test, evaluate experiment results, or decide whether to ship based on data. Triggers include "hypothesis", "A/B test", "experiment", "validate assumption", "test this", "should we ship", or when making a decision that should be data-informed.
cursor-custom-prompts
Create effective custom prompts for Cursor AI using project rules, prompt engineering patterns, and reusable templates. Triggers on "cursor prompts", "prompt engineering cursor", "better cursor prompts", "cursor instructions", "cursor prompt templates".
promptify
Transform user requests into detailed, precise prompts for AI models. Use when users say "promptify", "promptify this", or explicitly request prompt engineering or improvement of their request for better AI responses.
prompt-improver
Optimize prompts for better AI responses. Use when user asks to improve a prompt, refine a prompt, make a prompt better, optimize prompting, review their prompt, or says "/improve-prompt". Transforms vague requests into clear, specific, actionable prompts.
gws-modelarmor-sanitize-prompt
Google Model Armor: Sanitize a user prompt through a Model Armor template.
tldr-prompt
Create tldr summaries for GitHub Copilot files (prompts, agents, instructions, collections), MCP servers, or documentation from URLs and queries.
prompt-builder
Guide users through creating high-quality GitHub Copilot prompts with proper structure, tools, and best practices.
promptfoo-evaluation
Configures and runs LLM evaluation using Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".
prompt-injection-test
A test skill with prompt injection patterns