healthcare-eval-harness
Patient safety evaluation harness for healthcare application deployments. Automated test suites for CDSS accuracy, PHI exposure, clinical workflow integrity, and integration compliance. Blocks deployments on safety failures.
About this skill
The healthcare-eval-harness is an automated verification system for patient safety and compliance in healthcare application deployments. Its test suites cover Clinical Decision Support System (CDSS) accuracy, Protected Health Information (PHI) exposure, clinical data integrity, clinical workflow integrity, and integration compliance. It is part of the 'everything-claude-code' repository. The core principle is that patient safety is non-negotiable: a single CRITICAL failure blocks deployment. The examples use Jest, but the test categories and pass thresholds are framework-agnostic and can be adapted to other test runners (e.g., Vitest, pytest, PHPUnit).
Best use case
Ensuring the safety, accuracy, and compliance of new or updated healthcare software before it impacts patient care. It helps prevent critical errors in clinical decision support, protects sensitive patient data from exposure, maintains the reliability of clinical workflows, and ensures regulatory adherence in healthcare application deployments.
The intended outcome is the deployment of secure, compliant, and accurate healthcare applications. The harness is designed to catch critical patient safety defects before release, reduce the risk of PHI breaches, and keep clinical workflows intact. Any CRITICAL safety failure blocks the deployment automatically, so only builds that pass the safety gates reach patients.
Practical example
Example input
The AI agent would likely invoke this skill with minimal parameters, assuming the context (e.g., target application, deployment environment) is already established or configured for the harness to run. The skill acts as an orchestrator for the tests. Example: `run_healthcare_safety_eval(target_application_id='EMR_System_v3.1', deployment_environment='staging')` Or, if context is managed by the agent: `execute_patient_safety_harness()`
Example output
The skill would return a detailed report on the evaluation status, indicating whether deployment is safe to proceed or if critical failures were detected.
Example (Success):
```json
{
  "status": "SUCCESS",
  "reason": "All patient safety evaluations passed without critical issues.",
  "summary": {
    "overall": "PASS",
    "cdss_accuracy": "PASS",
    "phi_exposure": "PASS",
    "clinical_workflow_integrity": "PASS",
    "integration_compliance": "PASS"
  },
  "deployment_blocked": false,
  "report_link": "https://eval-reports.example.com/EMR_v3.1_staging_report_20240101.pdf"
}
```
Example (Failure):
```json
{
  "status": "FAILURE",
  "reason": "Critical patient safety violations detected. Deployment blocked.",
  "summary": {
    "overall": "FAILURE",
    "cdss_accuracy": "PASS",
    "phi_exposure": "FAILURE",
    "clinical_workflow_integrity": "FAILURE",
    "integration_compliance": "PASS"
  },
  "details": [
    {
      "category": "PHI Exposure",
      "test": "Unencrypted_Sensitive_Data_Transmission",
      "result": "FAIL",
      "severity": "CRITICAL",
      "message": "Detected unencrypted transmission of patient demographics to a third-party analytics service."
    },
    {
      "category": "Clinical Workflow Integrity",
      "test": "Incorrect_Dosage_Calculation_Pediatrics",
      "result": "FAIL",
      "severity": "CRITICAL",
      "message": "Calculation error for 'Drug Z' dosage in pediatric module leading to potential overdose."
    }
  ],
  "deployment_blocked": true
}
```
When to use this skill
- Before deploying or updating any healthcare application (e.g., CDSS, EMR, telehealth platform, medical device software) to a production or patient-facing environment.
- As a critical safety gate within a CI/CD pipeline for healthcare software development.
- When validating adherence to patient safety standards, data privacy regulations (like HIPAA for PHI), and integration compliance requirements.
- To automate the verification of application integrity and functionality in complex healthcare IT ecosystems.
When not to use this skill
- For applications outside the healthcare domain where patient safety and PHI compliance are not relevant concerns.
- During very early-stage development or prototyping where rapid iteration is prioritized over formal safety verification, although a scaled-down version might still be beneficial.
- If the application is not intended for deployment or doesn't involve critical patient data or clinical workflows.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/healthcare-eval-harness/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How healthcare-eval-harness Compares
| Feature / Agent | healthcare-eval-harness | Standard Approach |
|---|---|---|
| Platform Support | Claude | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | medium | N/A |
Frequently Asked Questions
What does this skill do?
Patient safety evaluation harness for healthcare application deployments. Automated test suites for CDSS accuracy, PHI exposure, clinical workflow integrity, and integration compliance. Blocks deployments on safety failures.
Which AI agents support this skill?
This skill is designed for Claude.
How difficult is it to install?
The installation complexity is rated as medium. You can find the installation instructions above.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
Best AI Skills for Claude
Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.
ChatGPT vs Claude for Agent Skills
Compare ChatGPT and Claude for AI agent skills across coding, writing, research, and reusable workflow execution.
SKILL.md Source
# Healthcare Eval Harness — Patient Safety Verification
Automated verification system for healthcare application deployments. A single CRITICAL failure blocks deployment. Patient safety is non-negotiable.
> **Note:** Examples use Jest as the reference test runner. Adapt commands for your framework (Vitest, pytest, PHPUnit, etc.) — the test categories and pass thresholds are framework-agnostic.
## When to Use
- Before any deployment of EMR/EHR applications
- After modifying CDSS logic (drug interactions, dose validation, scoring)
- After changing database schemas that touch patient data
- After modifying authentication or access control
- During CI/CD pipeline configuration for healthcare apps
- After resolving merge conflicts in clinical modules
## How It Works
The eval harness runs five test categories in order. The first three (CDSS Accuracy, PHI Exposure, Data Integrity) are CRITICAL gates that require a 100% pass rate: a single failure blocks deployment. The remaining two (Clinical Workflow, Integration) are HIGH gates that require a 95%+ pass rate; falling below that threshold raises a warning and the deployment is allowed only with review.
Each category maps to a Jest test path pattern. The CI pipeline runs CRITICAL gates with `--bail` (stop after the first failing test suite) and, for the CDSS suite, enforces coverage thresholds with `--coverage --coverageThreshold`.
### Eval Categories
**1. CDSS Accuracy (CRITICAL — 100% required)**
Tests all clinical decision support logic: drug interaction pairs (both directions), dose validation rules, clinical scoring vs published specs, no false negatives, no silent failures.
```bash
npx jest --testPathPattern='tests/cdss' --bail --ci --coverage
```
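The "both directions" requirement can be checked mechanically. The following is a minimal sketch in plain JavaScript, assuming a hypothetical `checkInteraction(first, second)` function and placeholder drug names; neither comes from this skill, and the in-memory table only stands in for a real CDSS engine. Inside a Jest suite, the final check would typically live in a `test()` block that expects the returned array to be empty.

```javascript
// Hypothetical stand-in for the real CDSS engine (illustration only).
// Pairs are stored under a sorted key so lookup order does not matter.
const pairKey = (a, b) => [a, b].sort().join('|');
const KNOWN_INTERACTIONS = new Set([
  pairKey('drug-a', 'drug-b'),
  pairKey('drug-c', 'drug-a'),
]);

function checkInteraction(first, second) {
  return KNOWN_INTERACTIONS.has(pairKey(first, second));
}

// Returns every expected pair that is missed in at least one direction.
// An empty array means no directional false negatives for the listed pairs.
function findAsymmetricPairs(expectedPairs, check) {
  return expectedPairs.filter(([a, b]) => !(check(a, b) && check(b, a)));
}

const expectedPairs = [['drug-a', 'drug-b'], ['drug-a', 'drug-c']];
const missed = findAsymmetricPairs(expectedPairs, checkInteraction);
console.log(missed.length === 0 ? 'symmetric' : `missed: ${JSON.stringify(missed)}`);
```

Passing the checker in as an argument keeps the helper reusable against the real engine, which matters because the anti-patterns below rule out mocking CDSS logic in integration tests.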
**2. PHI Exposure (CRITICAL — 100% required)**
Tests for protected health information leaks: API error responses, console output, URL parameters, browser storage, cross-facility isolation, unauthenticated access, service role key absence.
```bash
npx jest --testPathPattern='tests/security/phi' --bail --ci
```
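One way to test leak vectors such as API error responses and console output is to scan the captured text for PHI-shaped values. This is a rough sketch under stated assumptions: the `findPhiLeaks` helper, the pattern set, and the MRN format are all hypothetical and not part of this skill; a real suite would need patterns matched to its own data model.

```javascript
// Illustrative patterns only (assumptions, not an exhaustive PHI definition).
const PHI_PATTERNS = {
  ssn: /\b\d{3}-\d{2}-\d{4}\b/, // US SSN-shaped number
  mrn: /\bMRN[:#\s]*\d{6,}\b/i, // hypothetical medical record number format
  dob: /\b(?:dob|date of birth)\b[^\n]{0,20}\d{4}-\d{2}-\d{2}/i, // labelled ISO date
};

// Returns the names of all patterns found in a string,
// for example an API error body or a captured log line.
function findPhiLeaks(text) {
  return Object.entries(PHI_PATTERNS)
    .filter(([, pattern]) => pattern.test(text))
    .map(([name]) => name);
}

const safeError = JSON.stringify({ error: 'Encounter not found', code: 'E404' });
const leakyError = JSON.stringify({ error: 'No record for MRN: 00123456, DOB 1980-02-14' });

console.log(findPhiLeaks(safeError)); // []
console.log(findPhiLeaks(leakyError)); // [ 'mrn', 'dob' ]
```

Pattern matching of this kind can only catch values that look like PHI; it complements, rather than replaces, the isolation and access-control tests listed above.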
**3. Data Integrity (CRITICAL — 100% required)**
Tests clinical data safety: locked encounters, audit trail entries, cascade delete protection, concurrent edit handling, no orphaned records.
```bash
npx jest --testPathPattern='tests/data-integrity' --bail --ci
```
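Two of these checks, locked encounters and orphaned records, can be illustrated with a small in-memory model. The sketch below is hypothetical throughout: the `updateEncounter` and `findOrphans` helpers, the record shapes, and the choice to audit rejected edits are assumptions made for illustration, not this skill's schema.

```javascript
// Hypothetical in-memory model (illustration only).
// Rejects edits to locked encounters and records every attempt in the audit log.
function updateEncounter(encounter, patch, userId, auditLog) {
  if (encounter.locked) {
    // Assumption: rejected attempts are audited as well as successful ones.
    auditLog.push({ encounterId: encounter.id, userId, action: 'update_rejected' });
    throw new Error(`Encounter ${encounter.id} is locked`);
  }
  auditLog.push({ encounterId: encounter.id, userId, action: 'update' });
  return { ...encounter, ...patch };
}

// Children whose foreign key does not match any parent id are orphans.
function findOrphans(parents, children, foreignKey) {
  const parentIds = new Set(parents.map((parent) => parent.id));
  return children.filter((child) => !parentIds.has(child[foreignKey]));
}

const auditLog = [];
const lockedEncounter = { id: 'enc-1', locked: true, notes: 'signed' };
let rejected = false;
try {
  updateEncounter(lockedEncounter, { notes: 'edited' }, 'user-7', auditLog);
} catch (err) {
  rejected = true;
}

const orphans = findOrphans(
  [{ id: 'enc-1' }],
  [{ id: 'rx-1', encounterId: 'enc-1' }, { id: 'rx-2', encounterId: 'enc-9' }],
  'encounterId',
);
console.log({ rejected, auditEntries: auditLog.length, orphanIds: orphans.map((o) => o.id) });
```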
**4. Clinical Workflow (HIGH — 95%+ required)**
Tests end-to-end flows: encounter lifecycle, template rendering, medication sets, drug/diagnosis search, prescription PDF, red flag alerts.
```bash
tmp_json=$(mktemp)
npx jest --testPathPattern='tests/clinical' --ci --json --outputFile="$tmp_json" || true
total=$(jq '.numTotalTests // 0' "$tmp_json")
passed=$(jq '.numPassedTests // 0' "$tmp_json")
if [ "$total" -eq 0 ]; then
  echo "No clinical tests found" >&2
  exit 1
fi
rate=$(echo "scale=2; $passed * 100 / $total" | bc)
echo "Clinical pass rate: ${rate}% ($passed/$total)"
```
**5. Integration Compliance (HIGH — 95%+ required)**
Tests external systems: HL7 message parsing (v2.x), FHIR validation, lab result mapping, malformed message handling.
```bash
tmp_json=$(mktemp)
npx jest --testPathPattern='tests/integration' --ci --json --outputFile="$tmp_json" || true
total=$(jq '.numTotalTests // 0' "$tmp_json")
passed=$(jq '.numPassedTests // 0' "$tmp_json")
if [ "$total" -eq 0 ]; then
  echo "No integration tests found" >&2
  exit 1
fi
rate=$(echo "scale=2; $passed * 100 / $total" | bc)
echo "Integration pass rate: ${rate}% ($passed/$total)"
```
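To make "HL7 message parsing" and "malformed message handling" concrete, here is a minimal sketch of a segment splitter of the kind such tests might exercise. It relies on two properties of HL7 v2.x: segments are separated by carriage returns, and the field separator is the fourth character of the MSH segment. The `parseHl7` function and the sample message are made up for illustration; this is not a conformant parser and not this skill's implementation.

```javascript
// Minimal HL7 v2.x segment splitter (a sketch, not a conformant parser).
// Throws on input that does not begin with an MSH header.
function parseHl7(message) {
  if (typeof message !== 'string' || !message.startsWith('MSH') || message.length < 8) {
    throw new Error('Malformed HL7 message: missing MSH header');
  }
  const fieldSeparator = message[3]; // MSH-1, usually "|"
  return message
    .split(/\r\n?|\n/) // tolerate LF and CRLF, although the standard uses CR
    .filter((line) => line.length > 0)
    .map((line) => {
      const fields = line.split(fieldSeparator);
      // For non-MSH segments, fields[n - 1] holds field n (e.g. OBX-5 is fields[4]).
      return { id: fields[0], fields: fields.slice(1) };
    });
}

// Made-up sample message for illustration.
const sample =
  'MSH|^~\\&|LAB|FAC|EMR|FAC|202401010830||ORU^R01|MSG0001|P|2.5\r' +
  'OBX|1|NM|GLU^Glucose||5.4|mmol/L';

const segments = parseHl7(sample);
console.log(segments.map((segment) => segment.id)); // [ 'MSH', 'OBX' ]
```

A malformed-message test would then assert that input such as `'PID|1'` throws instead of returning partial data, which lines up with the "no silent failures" expectation stated for the CDSS suite.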
### Pass/Fail Matrix
| Category | Threshold | On Failure |
|----------|-----------|------------|
| CDSS Accuracy | 100% | **BLOCK deployment** |
| PHI Exposure | 100% | **BLOCK deployment** |
| Data Integrity | 100% | **BLOCK deployment** |
| Clinical Workflow | 95%+ | WARN, allow with review |
| Integration | 95%+ | WARN, allow with review |
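The matrix can be read as a small decision function. The sketch below is one possible encoding, assuming per-category pass counts are already available; the `decideVerdict` name and the result shape are hypothetical. Treating an empty or missing suite as a blocking failure is an assumption chosen to mirror the `exit 1` in the shell snippets above.

```javascript
// Thresholds mirror the Pass/Fail Matrix; names and shapes are assumptions.
const GATES = {
  'CDSS Accuracy': { threshold: 100, onFailure: 'BLOCK' },
  'PHI Exposure': { threshold: 100, onFailure: 'BLOCK' },
  'Data Integrity': { threshold: 100, onFailure: 'BLOCK' },
  'Clinical Workflow': { threshold: 95, onFailure: 'WARN' },
  'Integration': { threshold: 95, onFailure: 'WARN' },
};

// results: { [category]: { passed: number, total: number } }
function decideVerdict(results) {
  const outcomes = Object.entries(GATES).map(([category, gate]) => {
    const { passed = 0, total = 0 } = results[category] ?? {};
    if (total === 0) {
      // A missing or empty suite is never treated as a pass.
      return { category, rate: 0, status: 'BLOCK' };
    }
    const rate = (passed / total) * 100;
    return { category, rate, status: rate >= gate.threshold ? 'PASS' : gate.onFailure };
  });
  const statuses = outcomes.map((outcome) => outcome.status);
  let verdict = 'PASS';
  if (statuses.includes('BLOCK')) verdict = 'BLOCK';
  else if (statuses.includes('WARN')) verdict = 'WARN';
  return { verdict, outcomes };
}

// Illustrative counts: every suite passes except one clinical workflow test.
const exampleCounts = {
  'CDSS Accuracy': { passed: 39, total: 39 },
  'PHI Exposure': { passed: 8, total: 8 },
  'Data Integrity': { passed: 12, total: 12 },
  'Clinical Workflow': { passed: 21, total: 22 },
  'Integration': { passed: 6, total: 6 },
};
console.log(decideVerdict(exampleCounts).verdict); // PASS
```

With 21 of 22 clinical workflow tests passing, the rate is about 95.45%, which clears the 95% threshold, so the overall verdict stays PASS.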
### CI/CD Integration
```yaml
name: Healthcare Safety Gate
on: [push, pull_request]
jobs:
  safety-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci
      # CRITICAL gates: 100% required, bail on first failure
      - name: CDSS Accuracy
        run: npx jest --testPathPattern='tests/cdss' --bail --ci --coverage --coverageThreshold='{"global":{"branches":80,"functions":80,"lines":80}}'
      - name: PHI Exposure Check
        run: npx jest --testPathPattern='tests/security/phi' --bail --ci
      - name: Data Integrity
        run: npx jest --testPathPattern='tests/data-integrity' --bail --ci
      # HIGH gates: 95%+ required, custom threshold check
      - name: Clinical Workflows
        run: |
          TMP_JSON=$(mktemp)
          npx jest --testPathPattern='tests/clinical' --ci --json --outputFile="$TMP_JSON" || true
          TOTAL=$(jq '.numTotalTests // 0' "$TMP_JSON")
          PASSED=$(jq '.numPassedTests // 0' "$TMP_JSON")
          if [ "$TOTAL" -eq 0 ]; then
            echo "::error::No clinical tests found"; exit 1
          fi
          RATE=$(echo "scale=2; $PASSED * 100 / $TOTAL" | bc)
          echo "Pass rate: ${RATE}% ($PASSED/$TOTAL)"
          if (( $(echo "$RATE < 95" | bc -l) )); then
            echo "::warning::Clinical pass rate ${RATE}% below 95%"
          fi
      - name: Integration Compliance
        run: |
          TMP_JSON=$(mktemp)
          npx jest --testPathPattern='tests/integration' --ci --json --outputFile="$TMP_JSON" || true
          TOTAL=$(jq '.numTotalTests // 0' "$TMP_JSON")
          PASSED=$(jq '.numPassedTests // 0' "$TMP_JSON")
          if [ "$TOTAL" -eq 0 ]; then
            echo "::error::No integration tests found"; exit 1
          fi
          RATE=$(echo "scale=2; $PASSED * 100 / $TOTAL" | bc)
          echo "Pass rate: ${RATE}% ($PASSED/$TOTAL)"
          if (( $(echo "$RATE < 95" | bc -l) )); then
            echo "::warning::Integration pass rate ${RATE}% below 95%"
          fi
```
### Anti-Patterns
- Skipping CDSS tests "because they passed last time"
- Setting CRITICAL thresholds below 100%
- Using `--no-bail` on CRITICAL test suites
- Mocking the CDSS engine in integration tests (must test real logic)
- Allowing deployments when safety gate is red
- Running tests without `--coverage` on CDSS suites
## Examples
### Example 1: Run All Critical Gates Locally
```bash
npx jest --testPathPattern='tests/cdss' --bail --ci --coverage && \
npx jest --testPathPattern='tests/security/phi' --bail --ci && \
npx jest --testPathPattern='tests/data-integrity' --bail --ci
```
### Example 2: Check HIGH Gate Pass Rate
```bash
tmp_json=$(mktemp)
npx jest --testPathPattern='tests/clinical' --ci --json --outputFile="$tmp_json" || true
# Round the rate to two decimals so it matches the reported figure
jq '{
  passed: (.numPassedTests // 0),
  total: (.numTotalTests // 0),
  rate: (if (.numTotalTests // 0) == 0 then 0
         else (((.numPassedTests // 0) / .numTotalTests * 10000 | round) / 100) end)
}' "$tmp_json"
# Expected (21 of 22 passing): { "passed": 21, "total": 22, "rate": 95.45 }
```
### Example 3: Eval Report
```
## Healthcare Eval: 2026-03-27 [commit abc1234]
### Patient Safety: PASS
| Category | Tests | Pass | Fail | Status |
|----------|-------|------|------|--------|
| CDSS Accuracy | 39 | 39 | 0 | PASS |
| PHI Exposure | 8 | 8 | 0 | PASS |
| Data Integrity | 12 | 12 | 0 | PASS |
| Clinical Workflow | 22 | 21 | 1 | 95.5% PASS |
| Integration | 6 | 6 | 0 | PASS |
### Coverage: 84% (target: 80%+)
### Verdict: SAFE TO DEPLOY
```
Related Skills
healthcare-phi-compliance
Protected Health Information (PHI) and Personally Identifiable Information (PII) compliance patterns for healthcare applications. Covers data classification, access control, audit trails, encryption, and common leak vectors.
healthcare-emr-patterns
EMR/EHR development patterns for healthcare applications. Clinical safety, encounter workflows, prescription generation, clinical decision support integration, and accessibility-first UI for medical data entry.
healthcare-cdss-patterns
Clinical Decision Support System (CDSS) development patterns. Drug interaction checking, dose validation, clinical scoring (NEWS2, qSOFA), alert severity classification, and integration into EMR workflows.
agent-harness-construction
Design and optimize AI agent action spaces, tool definitions, and observation formats to improve completion rates.
agent-eval
Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks, with pass rate, cost, time, and consistency metrics.
iterative-retrieval
A pattern that progressively refines context retrieval to solve the subagent context problem.
eval-harness
Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles
workspace-surface-audit
Audit the active repo, MCP servers, plugins, connectors, env surfaces, and harness setup, then recommend the highest-value ECC-native skills, hooks, agents, and operator workflows. Use when the user wants help setting up Claude Code or understanding what capabilities are actually available in their environment.
ui-demo
Record polished UI demo videos using Playwright. Use when the user asks to create a demo, walkthrough, screen recording, or tutorial video of a web application. Produces WebM videos with visible cursor, natural pacing, and professional feel.
token-budget-advisor
Offers the user an informed choice about how much response depth to consume before answering. Use this skill when the user explicitly wants to control response length, depth, or token budget. TRIGGER when: "token budget", "token count", "token usage", "token limit", "response length", "answer depth", "short version", "brief answer", "detailed answer", "exhaustive answer", "respuesta corta vs larga", "cuántos tokens", "ahorrar tokens", "responde al 50%", "dame la versión corta", "quiero controlar cuánto usas", or clear variants where the user is explicitly asking to control answer size or depth. DO NOT TRIGGER when: user has already specified a level in the current session (maintain it), the request is clearly a one-word answer, or "token" refers to auth/session/payment tokens rather than response size.
skill-comply
Visualize whether skills, rules, and agent definitions are actually followed — auto-generates scenarios at 3 prompt strictness levels, runs agents, classifies behavioral sequences, and reports compliance rates with full tool call timelines
santa-method
Multi-agent adversarial verification with convergence loop. Two independent review agents must both pass before output ships.