DeepEval — LLM Testing & Evaluation Framework
## Overview
Best use case
DeepEval — LLM Testing & Evaluation Framework is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
## Overview
Teams using DeepEval — LLM Testing & Evaluation Framework should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/deepeval/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How DeepEval — LLM Testing & Evaluation Framework Compares
| Feature / Agent | DeepEval — LLM Testing & Evaluation Framework | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
## Overview
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# DeepEval — LLM Testing & Evaluation Framework
## Overview
DeepEval, the open-source framework for unit testing LLM applications. Helps developers write test cases, define custom metrics, and integrate LLM quality checks into CI/CD pipelines using a pytest-like interface.
## Instructions
### Basic Test Cases
Write unit tests for LLM outputs using built-in metrics:
```python
# tests/test_chatbot.py — Unit tests for a customer support chatbot
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
AnswerRelevancyMetric,
FaithfulnessMetric,
HallucinationMetric,
ToxicityMetric,
)
def test_answer_is_relevant():
"""Verify the chatbot answers the actual question asked."""
test_case = LLMTestCase(
input="How do I cancel my subscription?",
actual_output="To cancel your subscription, go to Settings > Billing > Cancel Plan. Your access continues until the end of the billing period.",
retrieval_context=[
"Cancellation Policy: Users can cancel anytime via Settings > Billing > Cancel Plan. Access remains active until the current billing period ends.",
"Refunds: Pro-rated refunds are available for annual plans within 14 days.",
],
)
metric = AnswerRelevancyMetric(
threshold=0.7, # Minimum score to pass (0.0 to 1.0)
model="gpt-4o", # Judge model for evaluation
)
assert_test(test_case, [metric])
def test_answer_is_faithful_to_context():
"""Ensure the chatbot doesn't hallucinate beyond retrieved documents."""
test_case = LLMTestCase(
input="What is the pricing for the enterprise plan?",
actual_output="The enterprise plan costs $499/month with unlimited users and priority support.",
retrieval_context=[
"Enterprise Plan: $499/month. Includes unlimited users, priority support, SSO, and custom integrations.",
],
)
faithfulness = FaithfulnessMetric(threshold=0.8)
hallucination = HallucinationMetric(threshold=0.5) # Lower = less hallucination
assert_test(test_case, [faithfulness, hallucination])
def test_response_is_not_toxic():
"""Guard against toxic or inappropriate responses."""
test_case = LLMTestCase(
input="Your product is terrible and I hate it",
actual_output="I'm sorry to hear about your frustration. Let me help resolve your issue. Could you describe what went wrong?",
)
toxicity = ToxicityMetric(threshold=0.5)
assert_test(test_case, [toxicity])
```
### Conversational Testing
Test multi-turn conversations for coherence and context retention:
```python
# tests/test_conversation.py — Multi-turn conversation quality
from deepeval.test_case import ConversationalTestCase, LLMTestCase
from deepeval.metrics import ConversationRelevancyMetric, ConversationCompletenessMetric
def test_multi_turn_conversation():
"""Verify the assistant maintains context across turns."""
conversation = ConversationalTestCase(
turns=[
LLMTestCase(
input="I want to upgrade to the Pro plan",
actual_output="I'd be happy to help you upgrade to Pro! Your current plan is Basic. The Pro plan is $29/month with advanced analytics and priority support. Shall I proceed?",
),
LLMTestCase(
input="Yes, please proceed",
actual_output="Done! Your account has been upgraded to Pro ($29/month). The new features are available immediately. Your next billing date is April 1st.",
),
LLMTestCase(
input="What features did I just get?",
actual_output="With your new Pro plan, you now have access to: advanced analytics dashboards, priority email support (< 4hr response), custom integrations via API, and team collaboration tools for up to 10 members.",
retrieval_context=[
"Pro Plan Features: Advanced analytics, priority support (4hr SLA), API access, team collaboration (10 seats).",
],
),
],
)
relevancy = ConversationRelevancyMetric(threshold=0.7)
completeness = ConversationCompletenessMetric(threshold=0.7)
assert_test(conversation, [relevancy, completeness])
```
### Custom Metrics
Define evaluation criteria specific to your domain:
```python
# metrics/brand_voice.py — Custom metric for brand consistency
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase
class BrandVoiceMetric(BaseMetric):
"""Evaluate if responses match the company's brand voice guidelines.
Scores how well the output follows the defined tone, vocabulary,
and communication style of the brand.
"""
def __init__(self, brand_guidelines: str, threshold: float = 0.7):
self.threshold = threshold
self.brand_guidelines = brand_guidelines
def measure(self, test_case: LLMTestCase) -> float:
# Use an LLM to judge brand voice adherence
from deepeval.models import GPTModel
judge = GPTModel(model="gpt-4o")
prompt = f"""Evaluate how well this response follows the brand voice guidelines.
Brand Guidelines:
{self.brand_guidelines}
User Input: {test_case.input}
Response: {test_case.actual_output}
Score from 0.0 (completely off-brand) to 1.0 (perfectly on-brand).
Explain your reasoning, then provide the score on the last line as just a number."""
result = judge.generate(prompt)
# Extract score from last line
lines = result.strip().split('\n')
self.score = float(lines[-1].strip())
self.reason = '\n'.join(lines[:-1])
self.success = self.score >= self.threshold
return self.score
def is_successful(self) -> bool:
return self.success
@property
def __name__(self):
return "Brand Voice"
# Usage in tests
brand_metric = BrandVoiceMetric(
brand_guidelines="""
- Friendly but professional tone
- Use 'we' not 'I'
- Avoid jargon; explain technical terms
- Maximum 3 sentences per paragraph
- Always offer next steps
""",
threshold=0.75,
)
```
### Bulk Evaluation with Datasets
Run evaluations at scale:
```python
# eval/run_benchmark.py — Evaluate across a full test dataset
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.dataset import EvaluationDataset
import json
# Load test cases from a JSON file
with open("eval/test_cases.json") as f:
raw_cases = json.load(f)
test_cases = [
LLMTestCase(
input=case["question"],
actual_output=case["answer"],
expected_output=case.get("expected_answer"),
retrieval_context=case.get("contexts", []),
)
for case in raw_cases
]
dataset = EvaluationDataset(test_cases=test_cases)
# Run evaluation — results are displayed in a table and optionally
# pushed to the DeepEval dashboard (Confident AI)
results = evaluate(
test_cases=dataset,
metrics=[
AnswerRelevancyMetric(threshold=0.7),
FaithfulnessMetric(threshold=0.8),
],
print_results=True, # Show results table in terminal
)
# Access individual results programmatically
for result in results.test_results:
if not result.success:
print(f"FAILED: {result.input[:50]}...")
for metric_result in result.metrics_data:
if not metric_result.success:
print(f" {metric_result.name}: {metric_result.score:.2f} (reason: {metric_result.reason})")
```
### Red Teaming and Safety
Test LLMs against adversarial inputs:
```python
# tests/test_safety.py — Adversarial testing for LLM safety
from deepeval.metrics import (
ToxicityMetric,
BiasMetric,
)
from deepeval.red_teaming import RedTeamer
# Automated red teaming — generates adversarial prompts
red_teamer = RedTeamer(
target_model="gpt-4o",
attacks=[
"prompt-injection", # Attempts to override system prompt
"jailbreak", # Tries to bypass safety guardrails
"pii-extraction", # Attempts to extract personal data
"harmful-content", # Requests for dangerous information
],
attack_count=20, # Generate 20 attack attempts per category
)
results = red_teamer.scan()
# Check vulnerability scores
for vulnerability in results.vulnerabilities:
print(f"{vulnerability.type}: {vulnerability.score:.2f} "
f"({vulnerability.attacks_succeeded}/{vulnerability.attacks_total} succeeded)")
```
## Installation & CLI
```bash
# Install DeepEval
pip install deepeval
# Run tests with pytest (deepeval is a pytest plugin)
deepeval test run tests/test_chatbot.py
# Run with verbose output showing per-metric scores
deepeval test run tests/ -v
# Login to Confident AI dashboard (optional, for tracking)
deepeval login
```
## Examples
### Example 1: Setting up an evaluation pipeline for a RAG application
**User request:**
```
I have a RAG chatbot that answers questions from our docs. Set up Deepeval to evaluate answer quality.
```
The agent creates an evaluation suite with appropriate metrics (faithfulness, relevance, answer correctness), configures test datasets from real user questions, runs baseline evaluations, and sets up CI integration so evaluations run on every prompt or retrieval change.
### Example 2: Comparing model performance across prompts
**User request:**
```
We're testing GPT-4o vs Claude on our customer support prompts. Set up a comparison with Deepeval.
```
The agent creates a structured experiment with the existing prompt set, configures both model providers, defines scoring criteria specific to customer support (accuracy, tone, completeness), runs the comparison, and generates a summary report with statistical significance indicators.
## Guidelines
1. **Test the full pipeline** — Don't just test the LLM; test retrieval + generation + post-processing together
2. **Threshold tuning** — Start with low thresholds (0.5), measure baseline, then raise gradually
3. **CI/CD integration** — Run `deepeval test run` in your CI pipeline; fail builds on quality regressions
4. **Adversarial testing** — Red team your LLM before production; focus on prompt injection and PII leaks
5. **Version test sets** — Track test cases in git; add new cases when you find production failures
6. **Multiple metrics per test** — Combine faithfulness + relevancy + toxicity for comprehensive coverage
7. **Custom metrics for business** — Standard metrics miss domain needs (brand voice, compliance, format)
8. **Judge model selection** — Use GPT-4o or Claude as judge; cheaper models produce unreliable evaluationsRelated Skills
performing-visual-regression-testing
This skill enables Claude to execute visual regression tests using tools like Percy, Chromatic, and BackstopJS. It captures screenshots, compares them against baselines, and analyzes visual differences to identify unintended UI changes. Use this skill when the user requests visual testing, UI change verification, or regression testing for a web application or component. Trigger phrases include "visual test," "UI regression," "check visual changes," or "/visual-test".
performing-security-testing
This skill automates security vulnerability testing. It is triggered when the user requests security assessments, penetration tests, or vulnerability scans. The skill covers OWASP Top 10 vulnerabilities, SQL injection, XSS, CSRF, authentication issues, and authorization flaws. Use this skill when the user mentions "security test", "vulnerability scan", "OWASP", "SQL injection", "XSS", "CSRF", "authentication", or "authorization" in the context of application or API testing.
performance-testing
This skill enables Claude to design, execute, and analyze performance tests using the performance-test-suite plugin. It is activated when the user requests load testing, stress testing, spike testing, or endurance testing, and when discussing performance metrics such as response time, throughput, and error rates. It identifies performance bottlenecks related to CPU, memory, database, or network issues. The plugin provides comprehensive reporting, including percentiles, graphs, and recommendations.
performing-penetration-testing
This skill enables automated penetration testing of web applications. It uses the penetration-tester plugin to identify vulnerabilities, including OWASP Top 10 threats, and suggests exploitation techniques. Use this skill when the user requests a "penetration test", "pentest", "vulnerability assessment", or asks to "exploit" a web application. It provides comprehensive reporting on identified security flaws.
model-evaluation-metrics
Model Evaluation Metrics - Auto-activating skill for ML Training. Triggers on: model evaluation metrics, model evaluation metrics Part of the ML Training skill category.
automating-mobile-app-testing
This skill enables automated testing of mobile applications on iOS and Android platforms using frameworks like Appium, Detox, XCUITest, and Espresso. It generates end-to-end tests, sets up page object models, and handles platform-specific elements. Use this skill when the user requests mobile app testing, test automation for iOS or Android, or needs assistance with setting up device farms and simulators. The skill is triggered by terms like "mobile testing", "appium", "detox", "xcuitest", "espresso", "android test", "ios test".
load-testing-apis
Execute comprehensive load and stress testing to validate API performance and scalability. Use when validating API performance under load. Trigger with phrases like "load test the API", "stress test API", or "benchmark API performance".
testing-load-balancers
This skill enables Claude to test load balancing strategies. It validates traffic distribution across backend servers, tests failover scenarios when servers become unavailable, verifies sticky sessions, and assesses health check functionality. Use this skill when the user asks to "test load balancer", "validate traffic distribution", "test failover", "verify sticky sessions", or "test health checks". It is specifically designed for testing load balancing configurations using the `load-balancer-tester` plugin.
managing-database-testing
This skill manages database testing by generating test data, wrapping tests in transactions, and validating database schemas. It is used to create robust and reliable database interactions. Claude uses this skill when the user requests database testing utilities, including test data generation, transaction management, schema validation, or migration testing. Trigger this skill by mentioning "database testing," "test data factories," "transaction rollback," "schema validation," or using the `/db-test` or `/dbt` commands.
backtesting-trading-strategies
Backtest crypto and traditional trading strategies against historical data. Calculates performance metrics (Sharpe, Sortino, max drawdown), generates equity curves, and optimizes strategy parameters. Use when user wants to test a trading strategy, validate signals, or compare approaches. Trigger with phrases like "backtest strategy", "test trading strategy", "historical performance", "simulate trades", "optimize parameters", or "validate signals".
api-testing-helper
Api Testing Helper - Auto-activating skill for API Development. Triggers on: api testing helper, api testing helper Part of the API Development skill category.
automating-api-testing
This skill automates API endpoint testing, including request generation, validation, and comprehensive test coverage for REST and GraphQL APIs. It is used when the user requests API testing, contract testing, or validation against OpenAPI specifications. The skill analyzes API endpoints and generates test suites covering CRUD operations, authentication flows, and security aspects. It also validates response status codes, headers, and body structure. Use this skill when the user mentions "API testing", "REST API tests", "GraphQL API tests", "contract tests", or "OpenAPI validation".