
finding-duplicate-functions skill


This skill helps identify duplicate-intent functions across a codebase by clustering semantically similar utilities for consolidation.

npx playbooks add skill obra/superpowers-lab --skill finding-duplicate-functions

Review the files below or copy the command above to add this skill to your agents.

Files (6)
SKILL.md
---
name: finding-duplicate-functions
description: Use when auditing a codebase for semantic duplication - functions that do the same thing but have different names or implementations. Especially useful for LLM-generated codebases where new functions are often created rather than reusing existing ones.
---

# Finding Duplicate-Intent Functions

## Overview

LLM-generated codebases accumulate semantic duplicates: functions that serve the same purpose but were implemented independently. Classical copy-paste detectors (e.g., jscpd) find syntactic duplicates but miss "same intent, different implementation."

This skill uses a two-stage approach: classical extraction followed by LLM-powered intent clustering, broken down into the phases below.
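
For example, a syntactic detector would never flag a pair like this (a hypothetical illustration; the names and bodies are invented):

```bash
# Hypothetical semantic duplicates: no shared text for jscpd to match,
# yet both turn a string into a lowercase, hyphen-separated slug.
slugify() {
  echo "$1" | tr '[:upper:]' '[:lower:]' | tr -s ' ' '-'
}

make_url_slug() {
  local s
  s=$(echo "$1" | awk '{print tolower($0)}')
  echo "${s// /-}"   # unlike tr -s above, repeated spaces become repeated hyphens
}
```

Note the edge-case drift: repeated spaces collapse in one version but not the other. This is exactly why consolidation ends with tests (Phase 6), not a mechanical merge.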

## When to Use

- Codebase has grown organically with multiple contributors (human or LLM)
- You suspect utility functions have been reimplemented multiple times
- Before major refactoring to identify consolidation opportunities
- After jscpd has been run and syntactic duplicates are already handled

## Quick Reference

| Phase | Tool | Model | Output |
|-------|------|-------|--------|
| 1. Extract | `scripts/extract-functions.sh` | - | `catalog.json` |
| 2. Categorize | `scripts/categorize-prompt.md` | haiku | `categorized.json` |
| 3. Split | `scripts/prepare-category-analysis.sh` | - | `categories/*.json` |
| 4. Detect | `scripts/find-duplicates-prompt.md` | opus | `duplicates/*.json` |
| 5. Report | `scripts/generate-report.sh` | - | `duplicates-report.md` |
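
Put together, a full run looks roughly like this; phases 2 and 4 are subagent dispatches, so they appear only as comments:

```bash
./scripts/extract-functions.sh src/ -o catalog.json                   # 1. extract
# 2. haiku subagent: scripts/categorize-prompt.md + catalog.json -> categorized.json
./scripts/prepare-category-analysis.sh categorized.json ./categories # 3. split
# 4. opus subagent per ./categories/*.json -> ./duplicates/*.json
./scripts/generate-report.sh ./duplicates ./duplicates-report.md     # 5. report
```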

## Process

```dot
digraph duplicate_detection {
  rankdir=TB;
  node [shape=box];

  extract [label="1. Extract function catalog\n./scripts/extract-functions.sh"];
  categorize [label="2. Categorize by domain\n(haiku subagent)"];
  split [label="3. Split into categories\n./scripts/prepare-category-analysis.sh"];
  detect [label="4. Find duplicates per category\n(opus subagent per category)"];
  report [label="5. Generate report\n./scripts/generate-report.sh"];
  review [label="6. Human review & consolidate"];

  extract -> categorize -> split -> detect -> report -> review;
}
```

### Phase 1: Extract Function Catalog

```bash
./scripts/extract-functions.sh src/ -o catalog.json
```

Options:
- `-o FILE`: Output file (default: stdout)
- `-c N`: Lines of context to capture (default: 15)
- `-t GLOB`: File types (default: `*.ts,*.tsx,*.js,*.jsx`)
- `--include-tests`: Include test files (excluded by default)

Test files (`*.test.*`, `*.spec.*`, `__tests__/**`) are excluded by default since test utilities are less likely to be consolidation candidates.
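
A typical invocation, followed by a quick sanity check (the check assumes `catalog.json` is a JSON array of function entries; verify against the actual extractor output):

```bash
# Capture extra context and restrict to TypeScript sources.
./scripts/extract-functions.sh src/ -c 20 -t '*.ts,*.tsx' -o catalog.json

# Sanity-check the catalog before spending LLM tokens on it.
jq 'length' catalog.json   # how many functions were extracted?
jq '.[0]' catalog.json     # inspect one entry to confirm the shape
```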

### Phase 2: Categorize by Domain

Dispatch a **haiku** subagent using the prompt in `scripts/categorize-prompt.md`.

Insert the contents of `catalog.json` where indicated in the prompt template. Save output as `categorized.json`.
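
One way to assemble the prompt is a marker substitution; `{{CATALOG}}` below is a hypothetical placeholder name, so check the template for its actual marker:

```bash
# Replace the marker line with the catalog contents (GNU sed idiom:
# `r` queues the file, `d` drops the marker line itself).
sed -e '/{{CATALOG}}/r catalog.json' \
    -e '/{{CATALOG}}/d' \
    scripts/categorize-prompt.md > categorize-prompt.filled.md
```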

### Phase 3: Split into Categories

```bash
./scripts/prepare-category-analysis.sh categorized.json ./categories
```

Creates one JSON file per category. Only categories with 3+ functions are worth analyzing.
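
A sketch for enforcing that threshold before dispatching agents, assuming each category file holds a JSON array of functions (adjust the `jq` filter if the real shape differs):

```bash
# Set aside categories too small to be worth an opus pass.
for f in ./categories/*.json; do
  n=$(jq 'length' "$f")
  if [ "$n" -lt 3 ]; then
    echo "skipping $f ($n functions)"
    mv "$f" "$f.skipped"
  fi
done
```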

### Phase 4: Find Duplicates (Per Category)

For each category file in `./categories/`, dispatch an **opus** subagent using the prompt in `scripts/find-duplicates-prompt.md`.

Save each output as `./duplicates/{category}.json`.
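
The fan-out itself is mechanical; in this sketch, `dispatch_opus_agent` is a hypothetical stand-in for however you launch subagents in your environment:

```bash
mkdir -p ./duplicates
for f in ./categories/*.json; do
  category=$(basename "$f" .json)
  # dispatch_opus_agent is illustrative only: substitute your actual
  # subagent mechanism (Task tool, agent CLI, etc.).
  dispatch_opus_agent scripts/find-duplicates-prompt.md "$f" \
    > "./duplicates/${category}.json"
done
```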

### Phase 5: Generate Report

```bash
./scripts/generate-report.sh ./duplicates ./duplicates-report.md
```

Produces a prioritized markdown report grouped by confidence level.

### Phase 6: Human Review

Review the report. For HIGH confidence duplicates:
1. Verify the recommended survivor has tests
2. Update callers to use the survivor
3. Delete the duplicates
4. Run tests
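
A minimal sketch of that loop for one HIGH-confidence pair; the names `fmtDate` (duplicate) and `formatDate` (survivor) and the file path are illustrative only:

```bash
git grep -l 'formatDate' -- '*.test.*'   # step 1: survivor has test coverage?
git grep -l 'fmtDate' -- src/            # step 2: find callers to migrate
# ...update those call sites to use formatDate instead, then:
rm src/utils/fmtDate.ts                  # step 3: delete the duplicate
npm test                                 # step 4 (assumes an npm test script)
```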

## High-Risk Duplicate Zones

Focus extraction on these areas first - they accumulate duplicates fastest:

| Zone | Common Duplicates |
|------|-------------------|
| `utils/`, `helpers/`, `lib/` | General utilities reimplemented |
| Validation code | Same checks written multiple ways |
| Error formatting | Error-to-string conversions |
| Path manipulation | Joining, resolving, normalizing paths |
| String formatting | Case conversion, truncation, escaping |
| Date formatting | Same formats implemented repeatedly |
| API response shaping | Similar transformations for different endpoints |

## Common Mistakes

**Extracting too much**: Focus on exported functions and public methods. Internal helpers are less likely to be duplicated across files.

**Skipping the categorization step**: Going straight to duplicate detection on the full catalog produces noise. Categories focus the comparison.

**Using haiku for duplicate detection**: Haiku is cost-effective for categorization but misses subtle semantic duplicates. Use Opus for the actual duplicate analysis.

**Consolidating without tests**: Before deleting duplicates, ensure the survivor has tests covering all use cases of the deleted functions.

Overview

This skill helps audit a codebase for semantic duplicates: functions that do the same thing but have different names or implementations. It combines a classical extraction step with LLM-powered categorization and per-category duplicate detection. The goal is to surface consolidation opportunities and produce a prioritized report for human review.

How this skill works

First, a shell extractor builds a catalog of functions, with surrounding context, from the source files. Next, an LLM categorizes the functions by intent into domains, and a script splits the catalog into per-category files. A stronger LLM then inspects each category to identify function pairs or groups that implement the same intent, producing confidence-tagged duplicate sets. Finally, a report generator consolidates the findings for human review and safe remediation.

When to use it

  • Codebase has grown organically with multiple contributors (human or LLM).
  • You suspect utility or validation functions were reimplemented in different places.
  • You are preparing a large refactor and want to eliminate redundant code before making changes.
  • You have already run syntactic duplicate tools (e.g., jscpd) and want semantic-level detection on top.
  • When auditing LLM-generated code where new functions are often created rather than reused.

Best practices

  • Restrict extraction to exported functions and public methods to reduce noise.
  • Run categorization before duplicate detection—compare only within intent categories.
  • Only analyze categories with 3+ functions for efficiency and signal quality.
  • Require tests for the chosen survivor function before deleting duplicates.
  • Treat the LLM outputs as suggestions; perform human review and integration testing before changes.

Example use cases

  • Find multiple implementations of date and string formatting utilities spread across a repo.
  • Detect repeated API response shaping functions implemented differently per endpoint.
  • Locate similar validation and error formatting logic in helper libraries.
  • Consolidate path and filesystem helpers that have diverged over time.
  • Prioritize cleanup before a library refactor or release to reduce maintenance burden.

FAQ

How accurate are the LLM detections?

LLM detections are probabilistic; outputs include confidence tags. Use HIGH-confidence results as strong candidates but always verify with tests and manual review.

What files should I include when extracting?

Start with source files (e.g., *.ts, *.js) and exclude tests by default. Include tests only if you suspect test utilities are duplicated across suites.