audit-agents-skills
Audit Claude Code agents, skills, and commands for quality and production readiness. Use when evaluating skill quality, checking production readiness scores, or comparing agents against best-practice templates.
Best use case
audit-agents-skills is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Audit Claude Code agents, skills, and commands for quality and production readiness. Use when evaluating skill quality, checking production readiness scores, or comparing agents against best-practice templates.
Teams using audit-agents-skills should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/audit-agents-skills/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How audit-agents-skills Compares
| Feature / Agent | audit-agents-skills | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Audit Claude Code agents, skills, and commands for quality and production readiness. Use when evaluating skill quality, checking production readiness scores, or comparing agents against best-practice templates.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
Best AI Skills for Claude
Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
ChatGPT vs Claude for Agent Skills
Compare ChatGPT and Claude for AI agent skills across coding, writing, research, and reusable workflow execution.
SKILL.md Source
# Audit Agents/Skills/Commands (Advanced Skill)
Comprehensive quality audit system for Claude Code agents, skills, and commands. Provides quantitative scoring, comparative analysis, and production readiness grading based on industry best practices.
## Purpose
**Problem**: Manual validation of agents/skills is error-prone and inconsistent. According to the LangChain Agent Report 2026, 29.5% of organizations deploy agents without systematic evaluation, leading to "agent bugs" as the top challenge (18% of teams).
**Solution**: Automated quality scoring across 16 weighted criteria with production readiness thresholds (80% = Grade B minimum for production deployment).
**Key Features**:
- Quantitative scoring (32 points for agents/skills, 20 for commands)
- Weighted criteria (Identity 3x, Prompt 2x, Validation 1x, Design 2x)
- Production readiness grading (A-F scale with 80% threshold)
- Comparative analysis vs reference templates
- JSON/Markdown dual output for programmatic integration
- Fix suggestions for failing criteria
---
## Modes
| Mode | Usage | Output |
|------|-------|--------|
| **Quick Audit** | Top-5 critical criteria only | Fast pass/fail (3-5 min for 20 files) |
| **Full Audit** | All 16 criteria per file | Detailed scores + recommendations (10-15 min) |
| **Comparative** | Full + benchmark vs templates | Analysis + gap identification (15-20 min) |
**Default**: Full Audit (recommended for first run)
---
## Methodology
### Why These Criteria?
The 16-criteria framework is derived from:
1. **Claude Code Best Practices** (Ultimate Guide line 4921: Agent Validation Checklist)
2. **Industry Data** (LangChain Agent Report 2026: evaluation gaps)
3. **Production Failures** (Community feedback on hardcoded paths, missing error handling)
4. **Composition Patterns** (Skills should reference other skills, agents should be modular)
### Scoring Philosophy
**Weight Rationale**:
- **Identity (3x)**: If users can't find/invoke the agent, quality is irrelevant (discoverability > quality)
- **Prompt (2x)**: Determines reliability and accuracy of outputs
- **Validation (1x)**: Improves robustness but is secondary to core functionality
- **Design (2x)**: Impacts long-term maintainability and scalability
**Grade Standards**:
- **A (90-100%)**: Production-ready, minimal risk
- **B (80-89%)**: Good, meets production threshold
- **C (70-79%)**: Needs improvement before production
- **D (60-69%)**: Significant gaps, not production-ready
- **F (<60%)**: Critical issues, requires major refactoring
**Industry Alignment**: The 80% threshold aligns with software engineering best practices for production deployment (e.g., code coverage >80%, security scan pass rates).
---
## Workflow
### Phase 1: Discovery
1. **Scan directories**:
```
.claude/agents/
.claude/skills/
.claude/commands/
examples/agents/ (if exists)
examples/skills/ (if exists)
examples/commands/ (if exists)
```
2. **Classify files** by type (agent/skill/command)
3. **Load reference templates** (for Comparative mode):
```
guide/examples/agents/ (benchmark files)
guide/examples/skills/ (benchmark files)
guide/examples/commands/ (benchmark files)
```
### Phase 2: Scoring Engine
Load scoring criteria from `scoring/criteria.yaml`:
```yaml
agents:
max_points: 32
categories:
identity:
weight: 3
criteria:
- id: A1.1
name: "Clear name"
points: 3
detection: "frontmatter.name exists and is descriptive"
# ... (16 total criteria)
```
For each file:
1. Parse frontmatter (YAML)
2. Extract content sections
3. Run detection patterns (regex, keyword search)
4. Calculate score: `(points / max_points) × 100`
5. Assign grade (A-F)
### Phase 3: Comparative Analysis (Comparative Mode Only)
For each project file:
1. Find closest matching template (by description similarity)
2. Compare scores per criterion
3. Identify gaps: `template_score - project_score`
4. Flag significant gaps (>10 points difference)
**Example**:
```
Project file: .claude/agents/debugging-specialist.md (Score: 78%, Grade C)
Closest template: examples/agents/debugging-specialist.md (Score: 94%, Grade A)
Gaps:
- Anti-hallucination measures: -2 points (template has, project missing)
- Edge cases documented: -1 point (template has 5 examples, project has 1)
- Integration documented: -1 point (template references 3 skills, project none)
Total gap: 16 points (explains C vs A difference)
```
### Phase 4: Report Generation
**Markdown Report** (`audit-report.md`):
- Summary table (overall + by type)
- Individual scores with top issues
- Detailed breakdown per file (collapsible)
- Prioritized recommendations
**JSON Output** (`audit-report.json`):
```json
{
"metadata": {
"project_path": "/path/to/project",
"audit_date": "2026-02-07",
"mode": "full",
"version": "1.0.0"
},
"summary": {
"overall_score": 82.5,
"overall_grade": "B",
"total_files": 15,
"production_ready_count": 10,
"production_ready_percentage": 66.7
},
"by_type": {
"agents": { "count": 5, "avg_score": 85.2, "grade": "B" },
"skills": { "count": 8, "avg_score": 78.9, "grade": "C" },
"commands": { "count": 2, "avg_score": 92.0, "grade": "A" }
},
"files": [
{
"path": ".claude/agents/debugging-specialist.md",
"type": "agent",
"score": 78.1,
"grade": "C",
"points_obtained": 25,
"points_max": 32,
"failed_criteria": [
{
"id": "A2.4",
"name": "Anti-hallucination measures",
"points_lost": 2,
"recommendation": "Add section on source verification"
}
]
}
],
"top_issues": [
{
"issue": "Missing error handling",
"affected_files": 8,
"impact": "Runtime failures unhandled",
"priority": "high"
}
]
}
```
### Phase 5: Fix Suggestions (Optional)
For each failing criterion, generate **actionable fix**:
```markdown
### File: .claude/agents/debugging-specialist.md
**Issue**: Missing anti-hallucination measures (2 points lost)
**Fix**:
Add this section after "Methodology":
## Source Verification
- Always cite sources for technical claims
- Use phrases: "According to [documentation]...", "Based on [tool output]..."
- If uncertain, state: "I don't have verified information on..."
- Never invent: statistics, version numbers, API signatures, stack traces
**Detection**: Grep for keywords: "verify", "cite", "source", "evidence"
```
---
## Scoring Criteria
See `scoring/criteria.yaml` for complete definitions. Summary:
### Agents (32 points max)
| Category | Weight | Criteria Count | Max Points |
|----------|--------|----------------|------------|
| Identity | 3x | 4 | 12 |
| Prompt Quality | 2x | 4 | 8 |
| Validation | 1x | 4 | 4 |
| Design | 2x | 4 | 8 |
**Key Criteria**:
- Clear name (3 pts): Not generic like "agent1"
- Description with triggers (3 pts): Contains "when"/"use"
- Role defined (2 pts): "You are..." statement
- 3+ examples (1 pt): Usage scenarios documented
- Single responsibility (2 pts): Focused, not "general purpose"
### Skills (32 points max)
| Category | Weight | Criteria Count | Max Points |
|----------|--------|----------------|------------|
| Structure | 3x | 4 | 12 |
| Content | 2x | 4 | 8 |
| Technical | 1x | 4 | 4 |
| Design | 2x | 4 | 8 |
**Key Criteria**:
- Valid SKILL.md (3 pts): Proper naming
- Name valid (3 pts): Lowercase, 1-64 chars, no spaces
- Methodology described (2 pts): Workflow section exists
- No hardcoded paths (1 pt): No `/Users/`, `/home/`
- Clear triggers (2 pts): "When to use" section
### Commands (20 points max)
| Category | Weight | Criteria Count | Max Points |
|----------|--------|----------------|------------|
| Structure | 3x | 4 | 12 |
| Quality | 2x | 4 | 8 |
**Key Criteria**:
- Valid frontmatter (3 pts): name + description
- Argument hint (3 pts): If uses `$ARGUMENTS`
- Step-by-step workflow (3 pts): Numbered sections
- Error handling (2 pts): Mentions failure modes
---
## Detection Patterns
### Frontmatter Parsing
```python
import yaml
import re
def parse_frontmatter(content):
match = re.search(r'^---\n(.*?)\n---', content, re.DOTALL)
if match:
return yaml.safe_load(match.group(1))
return None
```
### Keyword Detection
```python
def has_keywords(text, keywords):
text_lower = text.lower()
return any(kw in text_lower for kw in keywords)
# Example
has_trigger = has_keywords(description, ['when', 'use', 'trigger'])
has_error_handling = has_keywords(content, ['error', 'failure', 'fallback'])
```
### Overlap Detection (Duplication Check)
```python
def jaccard_similarity(text1, text2):
words1 = set(text1.lower().split())
words2 = set(text2.lower().split())
intersection = words1 & words2
union = words1 | words2
return len(intersection) / len(union) if union else 0
# Flag if similarity > 0.5 (50% keyword overlap)
if jaccard_similarity(desc1, desc2) > 0.5:
issues.append("High overlap with another file")
```
### Token Counting (Approximate)
```python
def estimate_tokens(text):
# Rough estimate: 1 token ≈ 0.75 words
word_count = len(text.split())
return int(word_count * 1.3)
# Check budget
tokens = estimate_tokens(file_content)
if tokens > 5000:
issues.append("File too large (>5K tokens)")
```
---
## Industry Context
**Source**: LangChain Agent Report 2026 (public report, page 14-22)
**Key Findings**:
- **29.5%** of organizations deploy agents without systematic evaluation
- **18%** cite "agent bugs" as their primary challenge
- **Only 12%** use automated quality checks (88% manual or none)
- **43%** report difficulty maintaining agent quality over time
- **Top issues**: Hallucinations (31%), poor error handling (28%), unclear triggers (22%)
**Implications**:
1. **Automation gap**: Most teams rely on manual checklists (error-prone at scale)
2. **Quality debt**: Agents deployed without validation accumulate technical debt
3. **Maintenance burden**: 43% struggle with quality over time (no tracking system)
**This skill addresses**:
- Automation: Replaces manual checklists with quantitative scoring
- Tracking: JSON output enables trend analysis over time
- Standards: 80% threshold provides clear production gate
---
## Output Examples
### Quick Audit (Top-5 Criteria)
```markdown
# Quick Audit: Agents/Skills/Commands
**Files**: 15 (5 agents, 8 skills, 2 commands)
**Critical Issues**: 3 files fail top-5 criteria
## Top-5 Criteria (Pass/Fail)
| File | Valid Name | Has Triggers | Error Handling | No Hardcoded Paths | Examples |
|------|------------|--------------|----------------|--------------------|----------|
| agent1.md | ✅ | ✅ | ❌ | ✅ | ❌ |
| skill2/ | ✅ | ❌ | ✅ | ❌ | ✅ |
## Action Required
1. **Add error handling**: 5 files
2. **Remove hardcoded paths**: 3 files
3. **Add usage examples**: 4 files
```
### Full Audit
See Phase 4: Report Generation above for full structure.
### Comparative (Full + Benchmarks)
```markdown
# Comparative Audit
## Project vs Templates
| File | Project Score | Template Score | Gap | Top Missing |
|------|---------------|----------------|-----|-------------|
| debugging-specialist.md | 78% (C) | 94% (A) | -16 pts | Anti-hallucination, edge cases |
| testing-expert/ | 85% (B) | 91% (A) | -6 pts | Integration docs |
## Recommendations
Focus on these gaps to reach template quality:
1. **Anti-hallucination measures** (8 files): Add source verification sections
2. **Edge case documentation** (5 files): Add failure scenario examples
3. **Integration documentation** (4 files): List compatible agents/skills
```
---
## Usage
### Basic (Full Audit)
```bash
# In Claude Code
Use skill: audit-agents-skills
# Specify path
Use skill: audit-agents-skills for ~/projects/my-app
```
### With Options
```bash
# Quick audit (fast)
Use skill: audit-agents-skills with mode=quick
# Comparative (benchmark analysis)
Use skill: audit-agents-skills with mode=comparative
# Generate fixes
Use skill: audit-agents-skills with fixes=true
# Custom output path
Use skill: audit-agents-skills with output=~/Desktop/audit.json
```
### JSON Output Only
```bash
# For programmatic integration
Use skill: audit-agents-skills with format=json output=audit.json
```
---
## Integration with CI/CD
### Pre-commit Hook
```bash
#!/bin/bash
# .git/hooks/pre-commit
# Run quick audit on changed agent/skill/command files
changed_files=$(git diff --cached --name-only | grep -E "^\.claude/(agents|skills|commands)/")
if [ -n "$changed_files" ]; then
echo "Running quick audit on changed files..."
# Run audit (requires Claude Code CLI wrapper)
# Exit with 1 if any file scores <80%
fi
```
### GitHub Actions
```yaml
name: Audit Agents/Skills
on: [pull_request]
jobs:
audit:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Run quality audit
run: |
# Run audit skill
# Parse JSON output
# Fail if overall_score < 80
```
---
## Comparison: Command vs Skill
| Aspect | Command (`/audit-agents-skills`) | Skill (this file) |
|--------|----------------------------------|-------------------|
| **Scope** | Current project only | Multi-project, comparative |
| **Output** | Markdown report | Markdown + JSON |
| **Speed** | Fast (5-10 min) | Slower (10-20 min with comparative) |
| **Depth** | Standard 16 criteria | Same + benchmark analysis |
| **Fix suggestions** | Via `--fix` flag | Built-in with recommendations |
| **Programmatic** | Terminal output | JSON for CI/CD integration |
| **Best for** | Quick checks, dev workflow | Deep audits, quality tracking |
**Recommendation**: Use command for daily checks, skill for release gates and quality tracking.
---
## Maintenance
### Updating Criteria
Edit `scoring/criteria.yaml`:
```yaml
agents:
categories:
identity:
criteria:
- id: A1.5 # New criterion
name: "API versioning specified"
points: 3
detection: "mentions API version or compatibility"
```
Version bump: Increment `version` in frontmatter when criteria change.
### Adding File Types
To support new file types (e.g., "workflows"):
1. Add to `scoring/criteria.yaml`:
```yaml
workflows:
max_points: 24
categories: [...]
```
2. Update detection logic (file path patterns)
3. Update report templates
---
## Related
- **Command version**: `.claude/commands/audit-agents-skills.md`
- **Agent Validation Checklist**: guide line 4921 (manual 16 criteria)
- **Skill Validation**: guide line 5491 (spec documentation)
- **Reference templates**: `examples/agents/`, `examples/skills/`, `examples/commands/`
---
## Changelog
**v1.0.0** (2026-02-07):
- Initial release
- 16-criteria framework (agents/skills/commands)
- 3 audit modes (quick/full/comparative)
- JSON + Markdown output
- Fix suggestions
- Industry context (LangChain 2026 report)
---
**Skill ready for use**: `audit-agents-skills`Related Skills
eval-skills
Audit all skills in the current project for frontmatter completeness, effort level appropriateness, allowed-tools scoping, and content quality. Produces a scored report with effort-level recommendations for each skill. Use when onboarding to a new project, reviewing skill quality before shipping, or adding effort fields to an existing skill library.
voice-refine
Transform verbose voice input into structured, token-efficient Claude prompts. Use when cleaning up voice memos, dictation output, or speech-to-text transcriptions that contain filler words, repetitions, and unstructured thoughts.
talk-stage6-revision
Produces revision sheets with quick navigation by act, a master concept-to-URL table, Q&A cheat-sheet with 6-10 anticipated questions, glossary, and external resources list. Use when preparing for a talk with Q&A, creating shareable reference material for attendees, or building a safety-net glossary for live delivery.
talk-stage5-script
Produces a complete 5-act pitch with speaker notes, a slide-by-slide specification, and a ready-to-paste Kimi prompt for AI slide generation. Requires validated angle and title from Stage 4. Use when you have a confirmed talk angle and need the full script, slide spec, and AI-generated presentation prompt.
talk-stage4-position
Generates 3-4 strategic talk angles with strength/weakness analysis, title options, CFP descriptions, and a peer feedback draft, then enforces a mandatory CHECKPOINT for user confirmation before scripting. Use when deciding how to frame a talk, preparing a CFP submission, or choosing between multiple narrative angles.
talk-stage3-concepts
Builds a numbered, categorized concept catalogue from the talk summary and timeline, scoring each concept HIGH / MEDIUM / LOW for talk potential with optional repo enrichment. Use when you need a structured inventory of concepts before choosing a talk angle, or when assessing which ideas have the strongest presentation potential.
talk-stage2-research
Performs git archaeology, changelog analysis, and builds a verified factual timeline by cross-referencing git history with source material. REX mode only — skipped automatically in Concept mode. Use when building a REX talk and you need verified commit metrics, release timelines, and contributor data from a git repository.
talk-stage1-extract
Extracts and structures source material (articles, transcripts, notes) into a talk summary with narrative arc, themes, metrics, and gaps. Auto-detects REX vs Concept type. Use when starting a new talk from any source material or auditing existing material before committing to a talk.
talk-pipeline
Orchestrates the complete talk preparation pipeline from raw material to revision sheets, running 6 stages in sequence with human-in-the-loop checkpoints for REX or Concept mode talks. Use when starting a new talk pipeline, resuming a pipeline from a specific stage, or running the full end-to-end preparation workflow.
skill-creator
Scaffold a new Claude Code skill with SKILL.md, frontmatter, and bundled resources. Use when creating a custom skill, standardizing skill structure across a team, or packaging a skill for distribution.
rtk-optimizer
Wrap high-verbosity shell commands with RTK to reduce token consumption. Use when running git log, git diff, cargo test, pytest, or other verbose CLI output that wastes context window tokens.
release-notes-generator
Generate release notes in 3 formats (CHANGELOG.md, PR body, Slack announcement) from git commits. Automatically categorizes changes and converts technical language to user-friendly messaging. Use for releases, changelogs, version notes, what's new summaries, or ship announcements.