This skill detects code clones using PMD CPD and Semgrep and guides targeted refactoring to reduce duplication.

---
name: code-clone-assistant
description: Detect and refactor code duplication with PMD CPD. TRIGGERS - code clones, DRY violations, duplicate code.
allowed-tools: Read, Grep, Bash, Edit, Write
---

# Code Clone Assistant

Detect code clones and guide refactoring using PMD CPD (exact duplicates) + Semgrep (patterns).

## Tools

- **PMD CPD v7.17.0+**: Exact duplicate detection
- **Semgrep v1.140.0+**: Pattern-based detection

**Tested**: October 2025 - 30 violations detected across 3 sample files
**Coverage**: ~3x more violations in that test than using either tool alone

---

## When to Use This Skill

Use this skill when:

- Finding duplicate code in a codebase
- Detecting DRY violations
- Refactoring similar code patterns
- Identifying copy-paste code

---

## Why Two Tools?

PMD CPD and Semgrep detect different clone types:

| Aspect       | PMD CPD                          | Semgrep                          |
| ------------ | -------------------------------- | -------------------------------- |
| **Detects**  | Exact copy-paste duplicates      | Similar patterns with variations |
| **Scope**    | Across files ✅                  | Within files (cross-file is Pro) |
| **Matching** | Token-based (ignores formatting) | Pattern-based (AST matching)     |
| **Rules**    | ❌ No custom rules               | ✅ Custom rules                  |

**Result**: In the October 2025 sample test, using both tools found ~3x more DRY violations than either tool alone.

### Clone Types

| Type   | Description                     | PMD CPD         | Semgrep     |
| ------ | ------------------------------- | --------------- | ----------- |
| Type-1 | Exact copies                    | ✅ Default      | ✅          |
| Type-2 | Renamed identifiers             | ✅ `--ignore-*` | ✅          |
| Type-3 | Near-miss with variations       | ⚠️ Partial      | ✅ Patterns |
| Type-4 | Semantic clones (same behavior) | ❌              | ❌          |
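
To make the difference between Type-2 and Type-3 clones concrete, here is a minimal sketch of token normalization using Python's standard `tokenize` module. It illustrates the general idea behind ignoring identifiers and literals; it is not how PMD CPD is implemented, and the `normalize` function and the three sample snippets are hypothetical.

```python
import io
import keyword
import tokenize

# Token types that carry layout only and are dropped before comparison.
LAYOUT = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
          tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def normalize(source: str) -> list[str]:
    """Return the token stream with identifiers and literals replaced by placeholders."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in LAYOUT:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")   # any identifier
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            tokens.append("LIT")  # any literal
        else:
            tokens.append(tok.string)  # keywords and operators stay as-is
    return tokens


# Hypothetical samples: `renamed` is a Type-2 clone of `original`,
# `modified` is a Type-3 clone (one expression was changed).
original = "def total(prices):\n    s = 0\n    for p in prices:\n        s += p\n    return s\n"
renamed = "def sum_weights(ws):\n    acc = 0\n    for w in ws:\n        acc += w\n    return acc\n"
modified = "def total(prices):\n    s = 0\n    for p in prices:\n        s += p * 2\n    return s\n"

print(normalize(original) == normalize(renamed))   # renamed identifiers collapse
print(normalize(original) == normalize(modified))  # the extra tokens remain visible
```

After normalization the renamed copy is identical to the original, while the modified copy still differs. This is why Type-3 clones usually need pattern rules rather than pure token matching.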

---

## Quick Start Workflow

```bash
# Step 1: Detect exact duplicates (PMD CPD)
pmd cpd -d . -l python --minimum-tokens 20 -f markdown > pmd-results.md

# Step 2: Detect pattern violations (Semgrep)
semgrep --config=clone-rules.yaml --sarif --quiet > semgrep-results.sarif

# Step 3: Analyze combined results (Claude Code)
# Parse both outputs, prioritize by severity

# Step 4: Refactor (Claude Code with user approval)
# Extract shared functions, consolidate patterns, verify tests
```
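
Step 3 can be sketched in Python. The example below reads Semgrep's SARIF output and sorts findings by severity. The field names (`runs`, `results`, `ruleId`, `level`, `locations`) follow the SARIF 2.1.0 format. The rule IDs, file paths, and the `load_sarif_findings` helper are invented for illustration, so treat this as a starting point rather than the skill's actual parser.

```python
import json

# Lower rank sorts first; unknown levels go last.
LEVEL_RANK = {"error": 0, "warning": 1, "note": 2, "none": 3}


def load_sarif_findings(sarif: dict) -> list[dict]:
    """Flatten SARIF results into dicts sorted by severity, then file and line."""
    findings = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            locations = result.get("locations") or [{}]
            physical = locations[0].get("physicalLocation", {})
            findings.append({
                "rule": result.get("ruleId", "unknown"),
                # SARIF treats a missing level as "warning".
                "level": result.get("level", "warning"),
                "file": physical.get("artifactLocation", {}).get("uri", "?"),
                "line": physical.get("region", {}).get("startLine", 0),
                "message": result.get("message", {}).get("text", ""),
            })
    return sorted(
        findings,
        key=lambda f: (LEVEL_RANK.get(f["level"], 9), f["file"], f["line"]),
    )


# Hypothetical SARIF payload; in practice use json.load() on semgrep-results.sarif.
SAMPLE_SARIF = json.loads("""
{
  "version": "2.1.0",
  "runs": [{
    "results": [
      {"ruleId": "duplicate-validation", "level": "warning",
       "message": {"text": "Repeated validation block"},
       "locations": [{"physicalLocation": {
         "artifactLocation": {"uri": "src/b.py"},
         "region": {"startLine": 12}}}]},
      {"ruleId": "duplicate-error-handler", "level": "error",
       "message": {"text": "Repeated error handler"},
       "locations": [{"physicalLocation": {
         "artifactLocation": {"uri": "src/a.py"},
         "region": {"startLine": 40}}}]}
    ]
  }]
}
""")

ranked = load_sarif_findings(SAMPLE_SARIF)
for finding in ranked:
    print(f'{finding["level"]:8} {finding["file"]}:{finding["line"]} {finding["rule"]}')
```

The PMD CPD Markdown report would need its own parser; the same sort key can then be applied to the combined list.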

---

## Accepted Exceptions (Known Intentional Duplication)

Not all code duplication is a problem. Some codebases deliberately use copy-and-adapt patterns where refactoring would be harmful. When running clone detection, **always check for accepted exceptions before recommending refactoring**.

### When Duplication Is Acceptable

| Pattern                                         | Why Acceptable                                                                                                                                                                                                    | Example                                                            |
| ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------ |
| **Generation-per-directory experiments**        | Each generation is an immutable, self-contained experiment. Sharing code across generations would break provenance and make past experiments non-reproducible.                                                    | SQL templates, sweep scripts where each `gen{NNN}/` is independent |
| **SQL templates with placeholder substitution** | SQL has no import/include mechanism. Templates use `sed` placeholder replacement (`__PLACEHOLDER__`), not function calls. Extracting shared CTEs into separate files would break the single-file execution model. | ClickHouse sweep templates sharing signal detection + metrics CTEs |
| **Protocol/schema boilerplate**                 | Serialization formats, API contracts, and wire protocols require exact structure in each location. Abstracting them hides the contract.                                                                           | NDJSON telemetry line construction in wrapper scripts              |
| **Test fixtures and golden files**              | Test data intentionally duplicates production patterns to verify behavior. Sharing fixtures creates brittle cross-test dependencies.                                                                              | Test setup code, expected output snapshots                         |

### How to Report Accepted Exceptions

When clone detection finds duplication that matches an accepted exception pattern:

1. **Report it** — always show the user what was found (lines, tokens, files)
2. **Flag as accepted** — explicitly state it matches a known exception pattern
3. **Explain why** — cite the specific reason refactoring is not recommended
4. **Do NOT recommend refactoring** — this is the key difference from actionable findings

**Example output format**:

```
Code Clone Analysis Results

PMD CPD Findings:
  Clone 1: 115 lines (575 tokens) — base_bars → signals CTEs
    gen610_template.sql:33 ↔ gen710_template.sql:38
    Status: ACCEPTED EXCEPTION (generation-per-directory experiment)
    Reason: Each generation is immutable. Shared CTEs would break
            experiment provenance and reproducibility.

  Clone 2: 36 lines (478 tokens) — metrics aggregation
    gen610_template.sql:207 ↔ gen710_template.sql:244
    Status: ACCEPTED EXCEPTION (SQL template without include mechanism)

Actionable Findings: 0
Accepted Exceptions: 2
```

### Project-Level Exception Configuration

Projects can declare accepted exception patterns in their `CLAUDE.md`:

```markdown
## Code Clone Exceptions

- `sql/gen*_template.sql` — generation-per-directory experiments (immutable)
- `scripts/gen*/` — copy-and-adapt sweep scripts (no shared infrastructure)
- `tests/fixtures/` — intentional duplication for test isolation
```

When this section exists in a project's `CLAUDE.md`, the code-clone-assistant should check it before classifying findings.
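
One way to perform that check is sketched below. It assumes each entry follows the format shown above: a bullet, a backticked glob pattern, then a free-text reason. The helper names `load_exceptions` and `classify` are hypothetical, and treating a trailing `/` as "everything under this directory" is an assumption of this sketch, not documented behavior.

```python
import re
from fnmatch import fnmatchcase

SECTION_HEADING = "## Code Clone Exceptions"
# "- `pattern` <separator> reason"; the separator is any run of non-word characters.
ENTRY = re.compile(r"^- `([^`]+)`\W*(.*)$")


def load_exceptions(claude_md: str) -> list[tuple[str, str]]:
    """Collect (glob pattern, reason) pairs from the exceptions section."""
    exceptions = []
    in_section = False
    for line in claude_md.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            # Any heading either opens the section or closes it.
            in_section = stripped == SECTION_HEADING
            continue
        if in_section:
            match = ENTRY.match(stripped)
            if match:
                pattern, reason = match.groups()
                if pattern.endswith("/"):
                    pattern += "*"  # assumption: a directory covers all files below it
                exceptions.append((pattern, reason))
    return exceptions


def classify(path: str, exceptions: list[tuple[str, str]]):
    """Return the reason if the path is an accepted exception, else None."""
    for pattern, reason in exceptions:
        if fnmatchcase(path, pattern):
            return reason
    return None


# Hypothetical CLAUDE.md content (\u2014 is the dash used as separator above).
sample_md = (
    "# Project\n\n"
    "## Code Clone Exceptions\n\n"
    "- `sql/gen*_template.sql` \u2014 generation-per-directory experiments (immutable)\n"
    "- `tests/fixtures/` \u2014 intentional duplication for test isolation\n\n"
    "## Other\n"
    "- `src/` \u2014 not an exception\n"
)

exceptions = load_exceptions(sample_md)
for path in ("sql/gen610_template.sql", "tests/fixtures/a.json", "src/app.py"):
    print(path, "->", classify(path, exceptions) or "ACTIONABLE")
```

Note that `fnmatchcase` lets `*` match across `/`, which is convenient here but broader than shell globbing.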

---

## Reference Documentation

For detailed information, see:

- [Detection Commands](./references/detection-commands.md) - PMD CPD and Semgrep command details
- [Complete Workflow](./references/complete-workflow.md) - Detection, analysis, and presentation phases
- [Refactoring Strategies](./references/refactoring-strategies.md) - Approaches for addressing violations

---

## Troubleshooting

| Issue                      | Cause                        | Solution                                         |
| -------------------------- | ---------------------------- | ------------------------------------------------ |
| PMD CPD not found          | Not installed or not in PATH | `brew install pmd` or download from PMD releases |
| Semgrep timeout            | Large codebase scan          | Use `--exclude` to limit scope                   |
| No duplicates detected     | minimum-tokens too high      | Lower `--minimum-tokens` value (try 15)          |
| Too many false positives   | minimum-tokens too low       | Increase `--minimum-tokens` (try 30+)            |
| Language not recognized    | Wrong `-l` flag              | Check PMD CPD supported languages list           |
| SARIF parse error          | Semgrep output malformed     | Upgrade Semgrep to latest version                |
| Memory error on large repo | Java heap too small          | Set `PMD_JAVA_OPTS=-Xmx4g`                       |
| Missing clone rules file   | Custom rules not created     | Create `clone-rules.yaml` or use default config  |

Overview

This skill detects code duplication and guides safe refactoring by combining PMD CPD for exact duplicate detection with Semgrep for pattern-based matches. It parses both tool outputs, prioritizes findings, and produces actionable recommendations while honoring declared exceptions. The assistant helps extract shared functions, consolidate patterns, and verify changes with tests and user approval.

How this skill works

The skill runs PMD CPD to find token-based exact copies and Semgrep to detect near-miss or pattern-based clones. It merges results, de-duplicates overlaps, ranks by severity, and checks project-level accepted exceptions before recommending changes. Where refactoring is appropriate it suggests concrete edits (extract function, consolidate template, or create shared module) and frames the change with tests and rollback considerations.
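
The merge and de-duplication step might look like the following sketch. It assumes each finding has already been reduced to a tool name, a file, and a line range. The `Finding` type, the `merge_overlaps` function, and the sample data are hypothetical and only illustrate one reasonable approach: fold findings whose line ranges overlap within the same file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    tool: str   # e.g. "pmd-cpd" or "semgrep"
    file: str
    start: int  # first line of the flagged range
    end: int    # last line of the flagged range (inclusive)


def merge_overlaps(findings: list[Finding]) -> list[dict]:
    """Merge findings in the same file whose line ranges overlap."""
    merged: list[dict] = []
    for f in sorted(findings, key=lambda f: (f.file, f.start, f.end)):
        last = merged[-1] if merged else None
        if last and last["file"] == f.file and f.start <= last["end"]:
            # Overlap: widen the existing range and record the extra tool.
            last["end"] = max(last["end"], f.end)
            last["tools"].add(f.tool)
        else:
            merged.append({"file": f.file, "start": f.start,
                           "end": f.end, "tools": {f.tool}})
    return merged


# Hypothetical findings from both tools.
sample = [
    Finding("semgrep", "a.py", 35, 50),
    Finding("pmd-cpd", "a.py", 10, 40),
    Finding("semgrep", "b.py", 5, 9),
]

merged = merge_overlaps(sample)
for item in merged:
    print(item["file"], item["start"], item["end"], sorted(item["tools"]))
```

Regions reported by both tools end up as a single entry with two tool names, which is a natural signal for ranking them higher.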

When to use it

  • Scanning a codebase for copy-paste duplicates and DRY violations
  • Before a large refactor to identify safe consolidation candidates
  • During code review to flag repeated logic that could be abstracted
  • When maintaining generation-per-directory experiments or many templates
  • To validate whether suspected duplicates are actionable or accepted exceptions

Best practices

  • Run both PMD CPD and Semgrep; in the October 2025 sample test they found ~3x more violations together than either alone
  • Declare accepted exception patterns in a project-level CLAUDE.md so findings can be auto-classified
  • Prioritize small, high-impact clones for initial refactors and keep tests green
  • Always show file/line context and explicitly mark accepted exceptions instead of auto-refactoring
  • Tune PMD minimum-tokens and Semgrep rules to balance false positives and coverage

Example use cases

  • Detecting exact SQL template duplicates across generation directories and flagging acceptable exceptions
  • Finding repeated metrics aggregation code that should be extracted into a shared helper
  • Locating near-duplicate Python functions with renamed identifiers for consolidation
  • Auditing a repo before release to reduce maintenance cost from copy-paste code
  • Filtering clone findings by project exception rules to focus on actionable work
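
As an illustration of the shared-helper use case, the sketch below shows the shape of an extract-function refactor. All names here (`summarize`, `latency_report`, `throughput_report`) are hypothetical; the assumption is that both report functions previously contained their own copy of the count/total/mean logic.

```python
def summarize(values: list[float]) -> dict:
    """Shared helper extracted from two formerly duplicated blocks."""
    count = len(values)
    total = sum(values)
    mean = total / count if count else 0.0  # avoid dividing by zero on empty input
    return {"count": count, "total": total, "mean": mean}


def latency_report(latencies_ms: list[float]) -> dict:
    # Previously carried its own copy of the aggregation logic.
    return {"metric": "latency_ms", **summarize(latencies_ms)}


def throughput_report(requests_per_s: list[float]) -> dict:
    # Same aggregation, now delegated to the shared helper.
    return {"metric": "requests_per_s", **summarize(requests_per_s)}


print(latency_report([10, 20, 30]))
print(throughput_report([]))
```

Keeping the call sites' return shape unchanged lets existing tests verify the refactor without modification.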

FAQ

Why use both PMD CPD and Semgrep?

PMD CPD finds token-level exact copies efficiently; Semgrep detects pattern-based or near-miss clones with custom rules. Combining them expands coverage and reduces missed DRY violations.

When should duplication be accepted instead of refactored?

Accept duplication when removing it would compromise provenance, the execution model, protocol contracts, or test isolation, and document those cases as accepted exceptions. The skill will flag and explain these cases rather than recommend refactoring.