sherlock-review
Evidence-based investigative code review using deductive reasoning to determine what actually happened versus what was claimed. Use when verifying implementation claims, investigating bugs, validating fixes, or conducting root cause analysis. Elementary approach to finding truth through systematic observation.
Best use case
sherlock-review is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using sherlock-review should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/sherlock-review/SKILL.md` inside your project
- Restart your AI agent so it auto-discovers the skill
How sherlock-review Compares
| Feature / Agent | sherlock-review | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Evidence-based investigative code review using deductive reasoning to determine what actually happened versus what was claimed. Use when verifying implementation claims, investigating bugs, validating fixes, or conducting root cause analysis. Elementary approach to finding truth through systematic observation.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
Best AI Skills for Claude
Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.
Cursor vs Codex for AI Workflows
Compare Cursor and Codex for AI coding workflows, repository assistance, debugging, refactoring, and reusable developer skills.
SKILL.md Source
# Sherlock Review
<default_to_action>
When investigating code claims:
1. OBSERVE: Gather all evidence (code, tests, history, behavior)
2. DEDUCE: What does evidence actually show vs. what was claimed?
3. ELIMINATE: Rule out what cannot be true
4. CONCLUDE: Does evidence support the claim?
5. DOCUMENT: Findings with proof, not assumptions
**The 3-Step Investigation:**
```bash
# 1. OBSERVE: Gather evidence
git diff <commit>
npm test -- --coverage
# 2. DEDUCE: Compare claim vs reality
# Does code match description?
# Do tests prove the fix/feature?
# 3. CONCLUDE: Verdict with evidence
# SUPPORTED / PARTIALLY SUPPORTED / NOT SUPPORTED
```
**Holmesian Principles:**
- "Data! Data! Data!" - Collect before concluding
- "Eliminate the impossible" - What cannot be true?
- "You see, but do not observe" - Run code, don't just read
- Trust only reproducible evidence
</default_to_action>
## Quick Reference Card
### Evidence Collection Checklist
| Category | What to Check | How |
|----------|---------------|-----|
| **Claim** | PR description, commit messages | Read thoroughly |
| **Code** | Actual file changes | `git diff` |
| **Tests** | Coverage, assertions | Run independently |
| **Behavior** | Runtime output | Execute locally |
| **Timeline** | When things happened | `git log`, `git blame` |
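The checklist above can be sketched as a script that captures each evidence category into files, so the investigation works from artifacts rather than memory. The sketch builds a throwaway repo so the commands are runnable anywhere; in a real investigation, point `BASE` at the actual baseline commit and run the project's own test command (the `npm test` line is commented out because it is project-specific).

```shell
#!/usr/bin/env bash
set -euo pipefail
# Build a tiny throwaway repo so the evidence commands below are runnable.
repo="$(mktemp -d)"; cd "$repo"
git init -q
git config user.email sherlock@example.com && git config user.name sherlock
echo 'v1' > app.txt && git add -A && git commit -qm 'baseline'
echo 'v2' > app.txt && git commit -qam 'claimed fix'

BASE='HEAD^'                     # in practice: the commit before the claim
mkdir -p evidence
git diff "$BASE" > evidence/diff.patch                    # Code: what changed
git log --oneline "$BASE"..HEAD > evidence/timeline.txt   # Timeline: when
# npm test -- --coverage > evidence/tests.txt             # Tests: run yourself
echo "collected: $(ls evidence | tr '\n' ' ')"
```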
### Verdict Levels
| Verdict | Meaning |
|---------|---------|
| ✓ **TRUE** | Evidence fully supports claim |
| ⚠ **PARTIALLY TRUE** | Claim accurate but incomplete |
| ✗ **FALSE** | Evidence contradicts claim |
| ? **NONSENSICAL** | Claim doesn't apply to context |
---
## Investigation Template
```markdown
## Sherlock Investigation: [Claim]
### The Claim
"[What PR/commit claims to do]"
### Evidence Examined
- Code changes: [files, lines]
- Tests added: [count, coverage]
- Behavior observed: [what actually happens]
### Deductive Analysis
**Claim**: [specific assertion]
**Evidence**: [what you found]
**Deduction**: [logical conclusion]
**Verdict**: ✓/⚠/✗
### Findings
- What works: [with evidence]
- What doesn't: [with evidence]
- What's missing: [gaps in implementation/testing]
### Recommendations
1. [Action based on findings]
```
## Minimum Findings Enforcement
Every investigation MUST surface at least 3 weighted observations (CRITICAL=3, HIGH=2, MEDIUM=1, LOW=0.5). Elementary observations count at INFORMATIONAL=0.25 weight. A Sherlock investigation that finds nothing is a failed investigation -- Holmes always finds clues.
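A minimal sketch of this gate, reading "at least 3 weighted observations" as a weighted total of 3.0 (one plausible interpretation). The one-severity-per-line findings file is an illustrative format, not part of the skill; the weights are taken from the rule above.

```shell
#!/usr/bin/env bash
# Score findings against the minimum-findings threshold.
f="$(mktemp)"
printf '%s\n' HIGH MEDIUM LOW > "$f"   # example: one HIGH, one MEDIUM, one LOW

result="$(awk '
  BEGIN { w["CRITICAL"]=3; w["HIGH"]=2; w["MEDIUM"]=1
          w["LOW"]=0.5; w["INFORMATIONAL"]=0.25 }
  { total += w[$1] }                       # sum the weight of each observation
  END {
    printf "weighted total: %.2f\n", total
    if (total >= 3) print "PASS: minimum findings met"
    else            print "FAIL: keep digging -- Holmes always finds clues"
  }
' "$f")"
printf '%s\n' "$result"
```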
---
## Investigation Scenarios
### Scenario 1: "This Fixed the Bug"
**Steps:**
1. Reproduce bug on commit before fix
2. Verify bug is gone on commit with fix
3. Check if fix addresses root cause or symptom
4. Test edge cases not in original report
**Red Flags:**
- Fix that just removes error logging
- Works only for specific test case
- Workarounds instead of root cause fix
- No regression test added
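Steps 1-2 can be sketched as a before/after reproduction check. A throwaway repo with a stand-in `repro.sh` (which exits non-zero while the bug is present) makes the sketch runnable anywhere; against a real claim you would check out the actual commits and run the real reproduction steps.

```shell
#!/usr/bin/env bash
set -euo pipefail
# Build a demo repo: a "buggy" commit, then a "fix" commit.
repo="$(mktemp -d)"; cd "$repo"
git init -q
git config user.email sherlock@example.com && git config user.name sherlock
printf '#!/bin/sh\nexit 1\n' > repro.sh && chmod +x repro.sh  # repro fails: bug present
git add -A && git commit -qm 'buggy state'
printf '#!/bin/sh\nexit 0\n' > repro.sh                       # repro passes: bug gone
git commit -qam 'fix: resolve crash'
fix_commit="$(git rev-parse HEAD)"

git checkout -q "${fix_commit}^"               # 1. reproduce on commit before fix
before_ok=0; ./repro.sh && before_ok=1 || true
git checkout -q "$fix_commit"                  # 2. verify bug is gone with fix
after_ok=0; ./repro.sh && after_ok=1 || true

if [ "$before_ok" -eq 0 ] && [ "$after_ok" -eq 1 ]; then
  echo "VERDICT: SUPPORTED (bug reproduced before fix, gone after)"
else
  echo "VERDICT: NOT SUPPORTED"
fi
```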
### Scenario 2: "Improved Performance by 50%"
**Steps:**
1. Run benchmark on baseline commit
2. Run same benchmark on optimized commit
3. Compare in identical conditions
4. Verify measurement methodology
**Red Flags:**
- Tested only on toy data
- Different comparison conditions
- Trade-offs not mentioned
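Steps 1-3 amount to timing the same workload twice under identical conditions. The inline `seq` workloads below are stand-ins for "check out the baseline / optimized commit and run the project benchmark", which is repository-specific; the point is measuring both sides the same way, on the same machine.

```shell
#!/usr/bin/env bash
# Wall-clock milliseconds for a command (GNU date with nanosecond %N assumed).
bench_ms() {
  local start end
  start=$(date +%s%N)
  "$@" > /dev/null
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}

baseline=$(bench_ms seq 1 200000)    # stand-in for the baseline commit
optimized=$(bench_ms seq 1 50000)    # stand-in for the optimized commit
echo "baseline: ${baseline}ms  optimized: ${optimized}ms"
echo "verify the methodology: same machine, same data, several runs"
```

A single run proves little; repeat the measurement and compare distributions before accepting a percentage claim.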
### Scenario 3: "Handles All Edge Cases"
**Steps:**
1. List all edge cases in code path
2. Check each has test coverage
3. Test boundary conditions
4. Verify error handling paths
**Red Flags:**
- `catch {}` swallowing errors
- Generic error messages
- No logging of critical errors
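The first red flag can be hunted mechanically. The sketch below greps for empty `catch` blocks; the sample file is created inline so the command is runnable, and the regex only covers one whitespace style, so in practice a linter rule (e.g. ESLint's `no-empty`) is the more robust check.

```shell
#!/usr/bin/env bash
# Create a sample source tree with one swallowed error to scan.
src="$(mktemp -d)"
cat > "$src/handler.js" <<'EOF'
try { risky(); } catch (e) {}
try { other(); } catch (e) { log(e); }
EOF

# Look for catch blocks with an empty body (one formatting style only).
matches="$(grep -rnE 'catch \([a-zA-Z_]+\) \{\}' "$src" || true)"
if [ -n "$matches" ]; then
  echo "RED FLAG: empty catch block (swallowed error)"
  echo "$matches"
else
  echo "no empty catch blocks found"
fi
```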
---
## Example Investigation
```markdown
## Case: PR #123 "Fix race condition in async handler"
### Claims Examined:
1. "Eliminates race condition"
2. "Adds mutex locking"
3. "100% thread safe"
### Evidence:
- File: src/handlers/async-handler.js
- Changes: Added `async/await`, removed callbacks
- Tests: 2 new tests for async flow
- Coverage: 85% (was 75%)
### Analysis:
**Claim 1: "Eliminates race condition"**
Evidence: Added `await` to sequential operations. No actual mutex.
Deduction: Race avoided by removing concurrency, not synchronization.
Verdict: ⚠ PARTIALLY TRUE (solved differently than claimed)
**Claim 2: "Adds mutex locking"**
Evidence: No mutex library, no lock variables, no sync primitives.
Verdict: ✗ FALSE
**Claim 3: "100% thread safe"**
Evidence: JavaScript is single-threaded. No worker threads used.
Verdict: ? NONSENSICAL (meaningless in this context)
### Conclusion:
Fix works but not for reasons claimed. Race condition avoided by
making operations sequential, not by adding synchronization.
### Recommendations:
1. Update PR description to accurately reflect solution
2. Add test for concurrent request handling
3. Remove incorrect technical claims
```
---
## Agent Integration
```typescript
// Evidence-based code review
await Task("Sherlock Review", {
prNumber: 123,
claims: [
"Fixes memory leak",
"Improves performance 30%"
],
verifyReproduction: true,
testEdgeCases: true
}, "qe-code-reviewer");
// Bug fix verification
await Task("Verify Fix", {
bugCommit: 'abc123',
fixCommit: 'def456',
reproductionSteps: steps,
testBoundaryConditions: true
}, "qe-code-reviewer");
```
---
## Agent Coordination Hints
### Memory Namespace
```
aqe/sherlock/
├── investigations/* - Investigation reports
├── evidence/* - Collected evidence
├── verdicts/* - Claim verdicts
└── patterns/* - Common deception patterns
```
### Fleet Coordination
```typescript
const investigationFleet = await FleetManager.coordinate({
strategy: 'evidence-investigation',
agents: [
'qe-code-reviewer', // Code analysis
'qe-security-auditor', // Security claim verification
'qe-performance-validator' // Performance claim verification
],
topology: 'parallel'
});
```
---
## Related Skills
- [brutal-honesty-review](../brutal-honesty-review/) - Direct technical criticism
- [context-driven-testing](../context-driven-testing/) - Adapt to context
- [bug-reporting-excellence](../bug-reporting-excellence/) - Document findings
---
## Remember
**"It is a capital mistake to theorize before one has data."** Trust only reproducible evidence. Don't trust commit messages, documentation, or "works on my machine."
**The Sherlock Standard:** Every claim must be verified empirically. What does the evidence actually show?
Related Skills
qe-sherlock-review
Evidence-based investigative code review using deductive reasoning to determine what actually happened versus what was claimed. Use when verifying implementation claims, investigating bugs, validating fixes, or conducting root cause analysis. Elementary approach to finding truth through systematic observation.
qe-pr-review
Scope-aware GitHub PR review with user-friendly tone and trust tier validation
qe-github-code-review
Comprehensive GitHub code review with AI-powered swarm coordination
qe-code-review-quality
Conduct context-driven code reviews focusing on quality, testability, and maintainability. Use when reviewing code, providing feedback, or establishing review practices.
qe-brutal-honesty-review
Unvarnished technical criticism combining Linus Torvalds' precision, Gordon Ramsay's standards, and James Bach's BS-detection. Use when code/tests need harsh reality checks, certification schemes smell fishy, or technical decisions lack rigor. No sugar-coating, just surgical truth about what's broken and why.
pr-review
Use when reviewing a GitHub PR for quality, scope correctness, trust tier compliance, or generating user-friendly review feedback.
github-code-review
Comprehensive GitHub code review with AI-powered swarm coordination
code-review-quality
Conduct context-driven code reviews focusing on quality, testability, and maintainability. Use when reviewing code, providing feedback, or establishing review practices.
brutal-honesty-review
Unvarnished technical criticism combining Linus Torvalds' precision, Gordon Ramsay's standards, and James Bach's BS-detection. Use when code/tests need harsh reality checks, certification schemes smell fishy, or technical decisions lack rigor. No sugar-coating, just surgical truth about what's broken and why.
qe-visual-testing-advanced
Advanced visual regression testing with pixel-perfect comparison, AI-powered diff analysis, responsive design validation, and cross-browser visual consistency. Use when detecting UI regressions, validating designs, or ensuring visual consistency.
qe-verification-quality
Comprehensive truth scoring, code quality verification, and automatic rollback system with 0.95 accuracy threshold for ensuring high-quality agent outputs and codebase reliability.
qe-testability-scoring
AI-powered testability assessment using 10 principles of intrinsic testability with Playwright and optional Vibium integration. Evaluates web applications against Observability, Controllability, Algorithmic Simplicity, Transparency, Stability, Explainability, Unbugginess, Smallness, Decomposability, and Similarity. Use when assessing software testability, evaluating test readiness, identifying testability improvements, or generating testability reports.