healthcare-eval-harness

Patient safety evaluation harness for healthcare application deployments. Automated test suites for CDSS accuracy, PHI exposure, clinical workflow integrity, and integration compliance. Blocks deployments on safety failures.

144,923 stars
Complexity: medium

About this skill

The healthcare-eval-harness is an automated verification system for healthcare application deployments. Its test suites cover Clinical Decision Support System (CDSS) accuracy, Protected Health Information (PHI) exposure, data integrity, clinical workflow integrity, and integration compliance. The skill is part of the 'everything-claude-code' repository. Its core principle is that patient safety is non-negotiable: a single failure in a CRITICAL category blocks deployment. The examples use Jest, but the test categories and pass thresholds are framework-agnostic and can be adapted to other runners (e.g., Vitest, pytest, PHPUnit).

Best use case

Ensuring the safety, accuracy, and compliance of new or updated healthcare software before it impacts patient care. It helps prevent critical errors in clinical decision support, protects sensitive patient data from exposure, maintains the reliability of clinical workflows, and ensures regulatory adherence in healthcare application deployments.

The expected outcome is a deployment that has passed every CRITICAL gate. The harness lowers the risk of patient safety incidents and PHI breaches, and it checks that clinical workflows still behave as intended. Any critical safety failure blocks the deployment automatically, so only builds that pass the CRITICAL gates reach patients.

Practical example

Example input

The AI agent would likely invoke this skill with minimal parameters, assuming the context (e.g., target application, deployment environment) is already established or configured for the harness to run. The skill acts as an orchestrator for the tests.

Example:
`run_healthcare_safety_eval(target_application_id='EMR_System_v3.1', deployment_environment='staging')`

Or, if context is managed by the agent:
`execute_patient_safety_harness()`

Example output

The skill would return a detailed report on the evaluation status, indicating whether deployment is safe to proceed or if critical failures were detected.

Example (Success):
```json
{
  "status": "SUCCESS",
  "reason": "All patient safety evaluations passed without critical issues.",
  "summary": {
    "overall": "PASS",
    "cdss_accuracy": "PASS",
    "phi_exposure": "PASS",
    "clinical_workflow_integrity": "PASS",
    "integration_compliance": "PASS"
  },
  "deployment_blocked": false,
  "report_link": "https://eval-reports.example.com/EMR_v3.1_staging_report_20240101.pdf"
}
```

Example (Failure):
```json
{
  "status": "FAILURE",
  "reason": "Critical patient safety violations detected. Deployment blocked.",
  "summary": {
    "overall": "FAILURE",
    "cdss_accuracy": "FAILURE",
    "phi_exposure": "FAILURE",
    "clinical_workflow_integrity": "PASS",
    "integration_compliance": "PASS"
  },
  "details": [
    {
      "category": "PHI Exposure",
      "test": "Unencrypted_Sensitive_Data_Transmission",
      "result": "FAIL",
      "severity": "CRITICAL",
      "message": "Detected unencrypted transmission of patient demographics to a third-party analytics service."
    },
    {
      "category": "CDSS Accuracy",
      "test": "Incorrect_Dosage_Calculation_Pediatrics",
      "result": "FAIL",
      "severity": "CRITICAL",
      "message": "Calculation error for 'Drug Z' dosage in pediatric module leading to potential overdose."
    }
  ],
  "deployment_blocked": true
}
```

When to use this skill

  • Before deploying or updating any healthcare application (e.g., CDSS, EMR, telehealth platform, medical device software) to a production or patient-facing environment.
  • As a critical safety gate within a CI/CD pipeline for healthcare software development.
  • When validating adherence to patient safety standards, data privacy regulations (like HIPAA for PHI), and integration compliance requirements.
  • To automate the verification of application integrity and functionality in complex healthcare IT ecosystems.

When not to use this skill

  • For applications outside the healthcare domain where patient safety and PHI compliance are not relevant concerns.
  • During very early-stage development or prototyping where rapid iteration is prioritized over formal safety verification, although a scaled-down version might still be beneficial.
  • If the application is not intended for deployment or doesn't involve critical patient data or clinical workflows.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/healthcare-eval-harness/SKILL.md --create-dirs "https://raw.githubusercontent.com/affaan-m/everything-claude-code/main/skills/healthcare-eval-harness/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/healthcare-eval-harness/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill
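
The manual steps above can also be scripted. The following is a minimal sketch that assumes SKILL.md has already been downloaded; the function name `install_skill` and its two arguments are hypothetical, while the destination path is the one given in step 2.

```bash
# Hypothetical helper: copy a downloaded SKILL.md into a project's skill directory.
# Usage: install_skill <path-to-SKILL.md> <project-root>
install_skill() {
  src=$1
  project_root=$2
  dest_dir="$project_root/.claude/skills/healthcare-eval-harness"

  # Refuse to continue if the downloaded file is missing.
  if [ ! -f "$src" ]; then
    echo "SKILL.md not found at: $src" >&2
    return 1
  fi

  # Create the skill directory if needed, then place the file.
  mkdir -p "$dest_dir" && cp "$src" "$dest_dir/SKILL.md"
}
```

For example, `install_skill ~/Downloads/SKILL.md .` would place the file under the current project. The agent still needs a restart afterwards to discover the skill.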

How healthcare-eval-harness Compares

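| Feature / Agent | healthcare-eval-harness | Standard Approach |
|-----------------|-------------------------|-------------------|
| Platform Support | Claude | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | medium | N/A |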

Frequently Asked Questions

What does this skill do?

Patient safety evaluation harness for healthcare application deployments. Automated test suites for CDSS accuracy, PHI exposure, clinical workflow integrity, and integration compliance. Blocks deployments on safety failures.

Which AI agents support this skill?

This skill is designed for Claude.

How difficult is it to install?

The installation complexity is rated as medium. You can find the installation instructions above.

Where can I find the source code?

The source is in the affaan-m/everything-claude-code repository on GitHub, at skills/healthcare-eval-harness/SKILL.md.

SKILL.md Source

# Healthcare Eval Harness — Patient Safety Verification

Automated verification system for healthcare application deployments. A single CRITICAL failure blocks deployment. Patient safety is non-negotiable.

> **Note:** Examples use Jest as the reference test runner. Adapt commands for your framework (Vitest, pytest, PHPUnit, etc.) — the test categories and pass thresholds are framework-agnostic.

## When to Use

- Before any deployment of EMR/EHR applications
- After modifying CDSS logic (drug interactions, dose validation, scoring)
- After changing database schemas that touch patient data
- After modifying authentication or access control
- During CI/CD pipeline configuration for healthcare apps
- After resolving merge conflicts in clinical modules

## How It Works

The eval harness runs five test categories in order. The first three (CDSS Accuracy, PHI Exposure, Data Integrity) are CRITICAL gates requiring 100% pass rate — a single failure blocks deployment. The remaining two (Clinical Workflow, Integration) are HIGH gates requiring 95%+ pass rate.

Each category maps to a Jest test path pattern. The CI pipeline runs CRITICAL gates with `--bail` (stop on first failure) and enforces coverage thresholds with `--coverage --coverageThreshold`.
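
As a rough illustration of that mapping, the sketch below prints the Jest command for each category without running anything. The function name `gate_command` and the short category keys are hypothetical labels for this example; the path patterns and flags are the ones used in this document. The `--json` and `--outputFile` flags that the HIGH gates add for pass-rate calculation are left out for brevity.

```bash
# Print the Jest command for one eval category (dry run; nothing is executed).
# The category keys (cdss, phi, ...) are hypothetical labels for this sketch.
gate_command() {
  case "$1" in
    cdss)           echo "npx jest --testPathPattern='tests/cdss' --bail --ci --coverage" ;;
    phi)            echo "npx jest --testPathPattern='tests/security/phi' --bail --ci" ;;
    data-integrity) echo "npx jest --testPathPattern='tests/data-integrity' --bail --ci" ;;
    clinical)       echo "npx jest --testPathPattern='tests/clinical' --ci" ;;
    integration)    echo "npx jest --testPathPattern='tests/integration' --ci" ;;
    *)              echo "unknown category: $1" >&2; return 1 ;;
  esac
}

# List the five commands in the order the harness runs them:
# the three CRITICAL gates first, then the two HIGH gates.
for category in cdss phi data-integrity clinical integration; do
  gate_command "$category"
done
```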

### Eval Categories

**1. CDSS Accuracy (CRITICAL — 100% required)**

Tests all clinical decision support logic: drug interaction pairs (both directions), dose validation rules, clinical scoring vs published specs, no false negatives, no silent failures.

```bash
npx jest --testPathPattern='tests/cdss' --bail --ci --coverage
```

**2. PHI Exposure (CRITICAL — 100% required)**

Tests for protected health information leaks: API error responses, console output, URL parameters, browser storage, cross-facility isolation, unauthenticated access, service role key absence.

```bash
npx jest --testPathPattern='tests/security/phi' --bail --ci
```
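
To give a feel for what a leak check on console output or API error text might look like, here is a minimal sketch that scans captured text for strings shaped like a US Social Security number. This is an assumption made for illustration only: the function name `assert_no_ssn_like` is hypothetical, the pattern covers a single identifier format, and a real PHI suite would check many more vectors than this.

```bash
# Hypothetical check: read captured output on stdin and fail if it contains
# an SSN-shaped string (three digits, two digits, four digits, dash-separated).
# Returns 0 when the text is clean, 1 when a match is found.
assert_no_ssn_like() {
  # The surrounding groups stop longer digit runs from matching by accident.
  if grep -Eq '(^|[^0-9])[0-9]{3}-[0-9]{2}-[0-9]{4}([^0-9]|$)'; then
    echo "possible PHI leak: SSN-shaped string found" >&2
    return 1
  fi
  return 0
}
```

A pipeline step could feed it a captured log, for example `assert_no_ssn_like < app-output.log`, where the log file name is again only a placeholder.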

**3. Data Integrity (CRITICAL — 100% required)**

Tests clinical data safety: locked encounters, audit trail entries, cascade delete protection, concurrent edit handling, no orphaned records.

```bash
npx jest --testPathPattern='tests/data-integrity' --bail --ci
```

**4. Clinical Workflow (HIGH — 95%+ required)**

Tests end-to-end flows: encounter lifecycle, template rendering, medication sets, drug/diagnosis search, prescription PDF, red flag alerts.

```bash
tmp_json=$(mktemp)
npx jest --testPathPattern='tests/clinical' --ci --json --outputFile="$tmp_json" || true
total=$(jq '.numTotalTests // 0' "$tmp_json")
passed=$(jq '.numPassedTests // 0' "$tmp_json")
if [ "$total" -eq 0 ]; then
  echo "No clinical tests found" >&2
  exit 1
fi
rate=$(echo "scale=2; $passed * 100 / $total" | bc)
echo "Clinical pass rate: ${rate}% ($passed/$total)"
```
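
The script above prints the pass rate but does not compare it with the 95% threshold. One way to add that comparison without `bc` is sketched below using integer arithmetic; the function name `meets_threshold` and its arguments are hypothetical.

```bash
# Return 0 when passed/total meets the minimum percentage, 1 otherwise.
# Integer arithmetic avoids rounding issues:
#   passed / total >= min / 100   is equivalent to   passed * 100 >= total * min
meets_threshold() {
  passed=$1
  total=$2
  min_percent=${3:-95}

  # An empty suite never satisfies the gate.
  [ "$total" -gt 0 ] || return 1
  [ $((passed * 100)) -ge $((total * min_percent)) ]
}

if meets_threshold 21 22; then
  echo "HIGH gate met"
else
  echo "HIGH gate below 95%, review required"
fi
```

In the script above, the call would become `meets_threshold "$passed" "$total"`.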

**5. Integration Compliance (HIGH — 95%+ required)**

Tests external systems: HL7 message parsing (v2.x), FHIR validation, lab result mapping, malformed message handling.

```bash
tmp_json=$(mktemp)
npx jest --testPathPattern='tests/integration' --ci --json --outputFile="$tmp_json" || true
total=$(jq '.numTotalTests // 0' "$tmp_json")
passed=$(jq '.numPassedTests // 0' "$tmp_json")
if [ "$total" -eq 0 ]; then
  echo "No integration tests found" >&2
  exit 1
fi
rate=$(echo "scale=2; $passed * 100 / $total" | bc)
echo "Integration pass rate: ${rate}% ($passed/$total)"
```
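
As an illustration of malformed-message handling for HL7 v2.x, the sketch below checks that a message starts with an MSH segment and reads the version from field MSH-12. It relies on two properties of HL7 v2: segments are separated by carriage returns, and MSH-1 is the field separator itself, so splitting the header on `|` puts MSH-12 in the twelfth position. The function name `hl7_v2_version` is hypothetical, and the sketch assumes the conventional `|` separator.

```bash
# Hypothetical pre-check: print the HL7 version from MSH-12, or fail if the
# message has no MSH header or does not declare a 2.x version.
hl7_v2_version() {
  # Keep only the first segment (segments are separated by carriage returns).
  header=$(printf '%s' "$1" | tr '\r' '\n' | head -n 1)

  case "$header" in
    'MSH|'*) ;;
    *) echo "malformed: missing MSH header" >&2; return 1 ;;
  esac

  # "MSH" is token 1 and MSH-1 is the separator, so token 12 is MSH-12.
  version=$(printf '%s\n' "$header" | cut -d'|' -f12)

  case "$version" in
    2.*) printf '%s\n' "$version" ;;
    *) echo "malformed: missing or non-2.x version in MSH-12" >&2; return 1 ;;
  esac
}
```

A test can then assert that a well-formed lab result message yields its version and that truncated or headerless input is rejected rather than silently accepted.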

### Pass/Fail Matrix

| Category | Threshold | On Failure |
|----------|-----------|------------|
| CDSS Accuracy | 100% | **BLOCK deployment** |
| PHI Exposure | 100% | **BLOCK deployment** |
| Data Integrity | 100% | **BLOCK deployment** |
| Clinical Workflow | 95%+ | WARN, allow with review |
| Integration | 95%+ | WARN, allow with review |

### CI/CD Integration

```yaml
name: Healthcare Safety Gate
on: [push, pull_request]

jobs:
  safety-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci

      # CRITICAL gates — 100% required, bail on first failure
      - name: CDSS Accuracy
        run: npx jest --testPathPattern='tests/cdss' --bail --ci --coverage --coverageThreshold='{"global":{"branches":80,"functions":80,"lines":80}}'

      - name: PHI Exposure Check
        run: npx jest --testPathPattern='tests/security/phi' --bail --ci

      - name: Data Integrity
        run: npx jest --testPathPattern='tests/data-integrity' --bail --ci

      # HIGH gates — 95%+ required, custom threshold check
      - name: Clinical Workflows
        run: |
          TMP_JSON=$(mktemp)
          npx jest --testPathPattern='tests/clinical' --ci --json --outputFile="$TMP_JSON" || true
          TOTAL=$(jq '.numTotalTests // 0' "$TMP_JSON")
          PASSED=$(jq '.numPassedTests // 0' "$TMP_JSON")
          if [ "$TOTAL" -eq 0 ]; then
            echo "::error::No clinical tests found"; exit 1
          fi
          RATE=$(echo "scale=2; $PASSED * 100 / $TOTAL" | bc)
          echo "Pass rate: ${RATE}% ($PASSED/$TOTAL)"
          if (( $(echo "$RATE < 95" | bc -l) )); then
            echo "::warning::Clinical pass rate ${RATE}% below 95%"
          fi

      - name: Integration Compliance
        run: |
          TMP_JSON=$(mktemp)
          npx jest --testPathPattern='tests/integration' --ci --json --outputFile="$TMP_JSON" || true
          TOTAL=$(jq '.numTotalTests // 0' "$TMP_JSON")
          PASSED=$(jq '.numPassedTests // 0' "$TMP_JSON")
          if [ "$TOTAL" -eq 0 ]; then
            echo "::error::No integration tests found"; exit 1
          fi
          RATE=$(echo "scale=2; $PASSED * 100 / $TOTAL" | bc)
          echo "Pass rate: ${RATE}% ($PASSED/$TOTAL)"
          if (( $(echo "$RATE < 95" | bc -l) )); then
            echo "::warning::Integration pass rate ${RATE}% below 95%"
          fi
```

### Anti-Patterns

- Skipping CDSS tests "because they passed last time"
- Setting CRITICAL thresholds below 100%
- Using `--no-bail` on CRITICAL test suites
- Mocking the CDSS engine in integration tests (must test real logic)
- Allowing deployments when safety gate is red
- Running tests without `--coverage` on CDSS suites

## Examples

### Example 1: Run All Critical Gates Locally

```bash
npx jest --testPathPattern='tests/cdss' --bail --ci --coverage && \
npx jest --testPathPattern='tests/security/phi' --bail --ci && \
npx jest --testPathPattern='tests/data-integrity' --bail --ci
```

### Example 2: Check HIGH Gate Pass Rate

```bash
tmp_json=$(mktemp)
npx jest --testPathPattern='tests/clinical' --ci --json --outputFile="$tmp_json" || true
# Round the rate to two decimals so it matches the expected output below
jq '{
  passed: (.numPassedTests // 0),
  total: (.numTotalTests // 0),
  rate: (if (.numTotalTests // 0) == 0 then 0
         else ((.numPassedTests // 0) * 10000 / .numTotalTests | round) / 100 end)
}' "$tmp_json"
# Expected for 21 of 22 tests passing: { "passed": 21, "total": 22, "rate": 95.45 }
```

### Example 3: Eval Report

```
## Healthcare Eval: 2026-03-27 [commit abc1234]

### Patient Safety: PASS

| Category | Tests | Pass | Fail | Status |
|----------|-------|------|------|--------|
| CDSS Accuracy | 39 | 39 | 0 | PASS |
| PHI Exposure | 8 | 8 | 0 | PASS |
| Data Integrity | 12 | 12 | 0 | PASS |
| Clinical Workflow | 22 | 21 | 1 | 95.5% PASS |
| Integration | 6 | 6 | 0 | PASS |

### Coverage: 84% (target: 80%+)
### Verdict: SAFE TO DEPLOY
```
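
Rows like the ones in the report table can be produced from raw counts. The sketch below is one possible formatter, not the harness's own: the function name `report_row` and its argument order are hypothetical, and it shows a percentage in the Status column only when at least one test failed, as the Clinical Workflow row above does.

```bash
# Format one row of the report table: | Category | Tests | Pass | Fail | Status |
# Arguments: category name, total tests, passed tests, minimum pass percentage.
report_row() {
  category=$1; total=$2; passed=$3; min_percent=$4
  failed=$((total - passed))

  # Same integer comparison as the threshold: passed*100 >= total*min.
  if [ "$total" -gt 0 ] && [ $((passed * 100)) -ge $((total * min_percent)) ]; then
    verdict=PASS
  else
    verdict=FAIL
  fi

  # Show the rate, rounded to one decimal, only when something failed.
  if [ "$failed" -gt 0 ] && [ "$total" -gt 0 ]; then
    rate=$(awk -v p="$passed" -v t="$total" 'BEGIN { printf "%.1f", p * 100 / t }')
    status="${rate}% $verdict"
  else
    status=$verdict
  fi

  echo "| $category | $total | $passed | $failed | $status |"
}

report_row "Clinical Workflow" 22 21 95   # | Clinical Workflow | 22 | 21 | 1 | 95.5% PASS |
```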

Related Skills

healthcare-phi-compliance

144923
from affaan-m/everything-claude-code

Protected Health Information (PHI) and Personally Identifiable Information (PII) compliance patterns for healthcare applications. Covers data classification, access control, audit trails, encryption, and common leak vectors.

Regulatory Compliance · Claude

healthcare-emr-patterns

144923
from affaan-m/everything-claude-code

EMR/EHR development patterns for healthcare applications. Clinical safety, encounter workflows, prescription generation, clinical decision support integration, and accessibility-first UI for medical data entry.

Healthcare · Claude

healthcare-cdss-patterns

144923
from affaan-m/everything-claude-code

Clinical Decision Support System (CDSS) development patterns. Drug interaction checking, dose validation, clinical scoring (NEWS2, qSOFA), alert severity classification, and integration into EMR workflows.

Healthcare · Claude

agent-harness-construction

144923
from affaan-m/everything-claude-code

Design and optimize AI agent action spaces, tool definitions, and observation formats to improve completion rates.

Development · Claude

agent-eval

144923
from affaan-m/everything-claude-code

Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks, with pass rate, cost, time, and consistency metrics.

Development · Claude

iterative-retrieval

144923
from affaan-m/everything-claude-code

Pattern for progressively refining context retrieval to solve the subagent context problem.

Development · Claude

eval-harness

144923
from affaan-m/everything-claude-code

Formal evaluation framework for Claude Code sessions implementing eval-driven development (EDD) principles

Development · Claude

workspace-surface-audit

144923
from affaan-m/everything-claude-code

Audit the active repo, MCP servers, plugins, connectors, env surfaces, and harness setup, then recommend the highest-value ECC-native skills, hooks, agents, and operator workflows. Use when the user wants help setting up Claude Code or understanding what capabilities are actually available in their environment.

Development · Claude

ui-demo

144923
from affaan-m/everything-claude-code

Record polished UI demo videos using Playwright. Use when the user asks to create a demo, walkthrough, screen recording, or tutorial video of a web application. Produces WebM videos with visible cursor, natural pacing, and professional feel.

Developer Tools · Claude

token-budget-advisor

144923
from affaan-m/everything-claude-code

Offers the user an informed choice about how much response depth to consume before answering. Use this skill when the user explicitly wants to control response length, depth, or token budget. TRIGGER when: "token budget", "token count", "token usage", "token limit", "response length", "answer depth", "short version", "brief answer", "detailed answer", "exhaustive answer", "respuesta corta vs larga", "cuántos tokens", "ahorrar tokens", "responde al 50%", "dame la versión corta", "quiero controlar cuánto usas", or clear variants where the user is explicitly asking to control answer size or depth. DO NOT TRIGGER when: user has already specified a level in the current session (maintain it), the request is clearly a one-word answer, or "token" refers to auth/session/payment tokens rather than response size.

Productivity & Content Creation · Claude

skill-comply

144923
from affaan-m/everything-claude-code

Visualize whether skills, rules, and agent definitions are actually followed — auto-generates scenarios at 3 prompt strictness levels, runs agents, classifies behavioral sequences, and reports compliance rates with full tool call timelines

Development · Claude

santa-method

144923
from affaan-m/everything-claude-code

Multi-agent adversarial verification with convergence loop. Two independent review agents must both pass before output ships.

Quality Assurance · Claude