This skill conducts evidence-based code reviews to verify claims against actual behavior, guiding you toward reproducible conclusions.

npx playbooks add skill proffesor-for-testing/agentic-qe --skill sherlock-review

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
6.6 KB
---
name: sherlock-review
description: "Evidence-based investigative code review using deductive reasoning to determine what actually happened versus what was claimed. Use when verifying implementation claims, investigating bugs, validating fixes, or conducting root cause analysis. Elementary approach to finding truth through systematic observation."
category: quality-review
priority: high
tokenEstimate: 1100
agents: [qe-code-reviewer, qe-security-auditor, qe-performance-validator]
implementation_status: optimized
optimization_version: 1.0
last_optimized: 2025-12-03
dependencies: []
quick_reference_card: true
tags: [investigation, evidence-based, code-review, root-cause, deduction]
---

# Sherlock Review

<default_to_action>
When investigating code claims:
1. OBSERVE: Gather all evidence (code, tests, history, behavior)
2. DEDUCE: What does evidence actually show vs. what was claimed?
3. ELIMINATE: Rule out what cannot be true
4. CONCLUDE: Does evidence support the claim?
5. DOCUMENT: Findings with proof, not assumptions

**The 3-Step Investigation:**
```bash
# 1. OBSERVE: Gather evidence
git diff <commit>
npm test -- --coverage

# 2. DEDUCE: Compare claim vs reality
# Does code match description?
# Do tests prove the fix/feature?

# 3. CONCLUDE: Verdict with evidence
# TRUE / PARTIALLY TRUE / FALSE / NONSENSICAL
```

**Holmesian Principles:**
- "Data! Data! Data!" - Collect before concluding
- "Eliminate the impossible" - What cannot be true?
- "You see, but do not observe" - Run code, don't just read
- Trust only reproducible evidence
</default_to_action>

## Quick Reference Card

### Evidence Collection Checklist

| Category | What to Check | How |
|----------|---------------|-----|
| **Claim** | PR description, commit messages | Read thoroughly |
| **Code** | Actual file changes | `git diff` |
| **Tests** | Coverage, assertions | Run independently |
| **Behavior** | Runtime output | Execute locally |
| **Timeline** | When things happened | `git log`, `git blame` |
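
A minimal evidence-gathering pass might look like the sketch below; the branch names, file path, and npm scripts are placeholders for whatever the change under investigation actually uses:

```bash
# OBSERVE: collect raw evidence before forming any theory
git log --oneline -10                       # timeline: what landed, and when
git diff main...feature-branch --stat       # scope: which files actually changed
git diff main...feature-branch              # code: the real changes, not the description
git blame src/handlers/async-handler.js     # history: who last touched the suspect lines
npm test -- --coverage                      # behavior: do the tests pass, what do they cover
```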

### Verdict Levels

| Verdict | Meaning |
|---------|---------|
| ✓ **TRUE** | Evidence fully supports claim |
| ⚠ **PARTIALLY TRUE** | Claim accurate but incomplete |
| ✗ **FALSE** | Evidence contradicts claim |
| ? **NONSENSICAL** | Claim doesn't apply to context |

---

## Investigation Template

```markdown
## Sherlock Investigation: [Claim]

### The Claim
"[What PR/commit claims to do]"

### Evidence Examined
- Code changes: [files, lines]
- Tests added: [count, coverage]
- Behavior observed: [what actually happens]

### Deductive Analysis

**Claim**: [specific assertion]
**Evidence**: [what you found]
**Deduction**: [logical conclusion]
**Verdict**: ✓/⚠/✗

### Findings
- What works: [with evidence]
- What doesn't work: [with evidence]
- What's missing: [gaps in implementation/testing]

### Recommendations
1. [Action based on findings]
```

---

## Investigation Scenarios

### Scenario 1: "This Fixed the Bug"

**Steps:**
1. Reproduce bug on commit before fix
2. Verify bug is gone on commit with fix
3. Check if fix addresses root cause or symptom
4. Test edge cases not in original report
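
A reproduction check for steps 1 and 2 might look like this sketch; the commit hashes are placeholders, and the `--grep` filter assumes a Mocha-style runner (adjust for your test framework):

```bash
# Reproduce the bug on the commit before the fix
git checkout <commit-before-fix>
npm test -- --grep "reported bug scenario"   # expect: failing

# Verify the bug is gone on the fix commit
git checkout <fix-commit>
npm test -- --grep "reported bug scenario"   # expect: passing

# Return to your working branch when done
git checkout -
```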

**Red Flags:**
- Fix that just removes error logging
- Works only for specific test case
- Workarounds instead of root cause fix
- No regression test added

### Scenario 2: "Improved Performance by 50%"

**Steps:**
1. Run benchmark on baseline commit
2. Run same benchmark on optimized commit
3. Compare results under identical conditions
4. Verify measurement methodology
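
One way to keep the comparison conditions identical, assuming the project exposes a `bench` npm script (the script name and commits are placeholders):

```bash
# Benchmark the baseline commit
git checkout <baseline-commit>
npm ci                                    # pin identical dependency state for both runs
npm run bench > /tmp/bench-baseline.txt

# Benchmark the optimized commit on the same machine and data set
git checkout <optimized-commit>
npm ci
npm run bench > /tmp/bench-optimized.txt

# Compare the two runs side by side
diff -u /tmp/bench-baseline.txt /tmp/bench-optimized.txt
```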

**Red Flags:**
- Tested only on toy data
- Different comparison conditions
- Trade-offs not mentioned

### Scenario 3: "Handles All Edge Cases"

**Steps:**
1. List all edge cases in the code path
2. Check each has test coverage
3. Test boundary conditions
4. Verify error handling paths
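
A quick first pass for swallowed errors and untested paths might start with something like this (the `src/` path is a placeholder, and the grep is a rough heuristic that only catches single-line cases):

```bash
# Rough heuristic: flag empty catch blocks that silently swallow errors
grep -rnE "catch ?(\([^)]*\))? ?\{ ?\}" src/

# Re-run coverage and inspect branch coverage for the changed files
npm test -- --coverage
```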

**Red Flags:**
- `catch {}` swallowing errors
- Generic error messages
- No logging of critical errors

---

## Example Investigation

```markdown
## Case: PR #123 "Fix race condition in async handler"

### Claims Examined:
1. "Eliminates race condition"
2. "Adds mutex locking"
3. "100% thread safe"

### Evidence:
- File: src/handlers/async-handler.js
- Changes: Added `async/await`, removed callbacks
- Tests: 2 new tests for async flow
- Coverage: 85% (was 75%)

### Analysis:

**Claim 1: "Eliminates race condition"**
Evidence: Added `await` to sequential operations. No actual mutex.
Deduction: Race avoided by removing concurrency, not synchronization.
Verdict: ⚠ PARTIALLY TRUE (solved differently than claimed)

**Claim 2: "Adds mutex locking"**
Evidence: No mutex library, no lock variables, no sync primitives.
Verdict: ✗ FALSE

**Claim 3: "100% thread safe"**
Evidence: JavaScript is single-threaded. No worker threads used.
Verdict: ? NONSENSICAL (meaningless in this context)

### Conclusion:
The fix works, but not for the reasons claimed. The race condition was avoided
by making operations sequential, not by adding synchronization.

### Recommendations:
1. Update PR description to accurately reflect solution
2. Add test for concurrent request handling
3. Remove incorrect technical claims
```

---

## Agent Integration

```typescript
// Evidence-based code review
await Task("Sherlock Review", {
  prNumber: 123,
  claims: [
    "Fixes memory leak",
    "Improves performance 30%"
  ],
  verifyReproduction: true,
  testEdgeCases: true
}, "qe-code-reviewer");

// Bug fix verification
await Task("Verify Fix", {
  bugCommit: 'abc123',
  fixCommit: 'def456',
  reproductionSteps: steps,
  testBoundaryConditions: true
}, "qe-code-reviewer");
```

---

## Agent Coordination Hints

### Memory Namespace
```
aqe/sherlock/
├── investigations/*   - Investigation reports
├── evidence/*         - Collected evidence
├── verdicts/*         - Claim verdicts
└── patterns/*         - Common deception patterns
```

### Fleet Coordination
```typescript
const investigationFleet = await FleetManager.coordinate({
  strategy: 'evidence-investigation',
  agents: [
    'qe-code-reviewer',        // Code analysis
    'qe-security-auditor',     // Security claim verification
    'qe-performance-validator' // Performance claim verification
  ],
  topology: 'parallel'
});
```

---

## Related Skills
- [brutal-honesty-review](../brutal-honesty-review/) - Direct technical criticism
- [context-driven-testing](../context-driven-testing/) - Adapt to context
- [bug-reporting-excellence](../bug-reporting-excellence/) - Document findings

---

## Remember

**"It is a capital mistake to theorize before one has data."** Trust only reproducible evidence. Don't trust commit messages, documentation, or "works on my machine."

**The Sherlock Standard:** Every claim must be verified empirically. What does the evidence actually show?

Overview

This skill performs evidence-based investigative code reviews using deductive reasoning to determine what actually happened versus what was claimed. It guides reviewers through systematic observation, elimination of impossibilities, and documented conclusions. Use it to verify implementation claims, validate fixes, and perform root cause analysis with reproducible proof.

How this skill works

The skill collects evidence from code diffs, commit history, tests, and runtime behavior, then compares the evidence to stated claims. It applies a three-step approach: observe (gather artifacts and run tests), deduce (compare claim vs. reality and rule out impossibilities), and conclude (produce a verdict supported by evidence). Findings are documented in a reproducible investigation template with concrete recommendations.

When to use it

  • Verifying that a PR or commit actually implements the claimed change
  • Investigating a reported bug to confirm root cause and validate a fix
  • Validating performance or scalability improvement claims with benchmarks
  • Assessing whether tests truly cover reported edge cases or regressions
  • Auditing code to resolve conflicting claims in documentation, commits, or tests

Best practices

  • Collect raw, reproducible evidence: git diffs, test runs, logs, and benchmarks
  • Reproduce behavior locally under the same conditions before concluding
  • Prefer specific, testable claims over vague statements
  • Rule out impossible explanations before accepting a deduction
  • Document every verdict with exact lines, test output, and commands run

Example use cases

  • Confirming a claimed bug fix by reproducing the bug on the baseline commit and proving it is gone on the fix commit
  • Verifying a claimed 50% performance improvement by running identical benchmarks on baseline and optimized commits
  • Checking that a PR claiming 'handles all edge cases' includes tests for each listed edge case and appropriate error handling
  • Determining whether a memory leak claim is supported by allocation traces and test coverage
  • Producing an investigation report that replaces speculative PR descriptions with evidence-backed conclusions

FAQ

What verdicts does the skill produce?

Verdicts are TRUE (supported), PARTIALLY TRUE (supported but incomplete), FALSE (contradicted), or NONSENSICAL (claim not applicable). Each verdict includes supporting evidence.

Do I need special tools to run investigations?

No. Use your existing VCS, test runners, and benchmarking tools. The skill expects reproducible outputs (git diffs, test runs, logs, benchmarks) and documents the exact commands used.