
14_detect_duplicate_files skill

/14_detect_duplicate_files

This skill detects duplicate files across a workspace by comparing content hashes and reports deduplication opportunities to save space.

npx playbooks add skill sounder25/google-antigravity-skills-library --skill 14_detect_duplicate_files

Review the files below or copy the command above to add this skill to your agents.

Files (4)
SKILL.md
1.6 KB
---
name: Detect Duplicate Files
description: Identify duplicate files across the workspace using SHA256 hashing to reduce redundancy and confusion.
version: 1.0.0
author: Antigravity Skills Library
created: 2026-01-16
leverage_score: 2/5
---

# SKILL-014: Detect Duplicate Files

## Overview

Scans the workspace for identical files (by content, not name) to detect redundancy, copy-paste errors, or accidental forks. Generates a report suggesting deduplication actions.

## Trigger Phrases

- `find duplicates`
- `check for duplicate files`
- `scan redundancy`

## Inputs

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `--workspace-path` | string | No | Current directory | Root to scan |
| `--min-size` | int | No | 0 | Minimum file size in bytes to check |
| `--exclude` | string[] | No | `node_modules`, `.git`, `bin`, `obj` | Directories to ignore |

## Outputs

### 1. DUPLICATE_REPORT.md

Summary of found duplicates:
```markdown
# Duplicate File Report

**Total Duplicates:** 12
**Wasted Space:** 4.5 MB

## Group 1 (Hash: a1b2...)
- `src/utils/math.ts` (Original?)
- `src/legacy/math_copy.ts`

## Group 2 (Hash: c3d4...)
- `config/settings.json`
- `deploy/settings.prod.json`
```
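The skill's script is written in PowerShell, but the report format above is straightforward to render from any grouping of duplicate paths. Here is an illustrative Python sketch (the function name `write_report` and the `groups` shape, a dict mapping a content hash to a list of `pathlib.Path` objects, are assumptions for this example, not part of the skill):

```python
from pathlib import Path

def write_report(groups: dict, out_path: str = "DUPLICATE_REPORT.md") -> None:
    """Render duplicate groups into a Markdown report like the sample above.

    `groups` maps a content hash (hex string) to a list of pathlib.Path
    objects whose contents are identical.
    """
    # One copy per group is the "original"; the rest are duplicates.
    total_dupes = sum(len(files) - 1 for files in groups.values())
    # Files in a group have identical content, hence identical size.
    wasted = sum(files[0].stat().st_size * (len(files) - 1)
                 for files in groups.values())
    lines = [
        "# Duplicate File Report",
        "",
        f"**Total Duplicates:** {total_dupes}",
        f"**Wasted Space:** {wasted / 1_048_576:.1f} MB",
        "",
    ]
    for i, (digest, files) in enumerate(groups.items(), start=1):
        lines.append(f"## Group {i} (Hash: {digest[:4]}...)")
        lines.extend(f"- `{f}`" for f in files)
        lines.append("")
    with open(out_path, "w") as fh:
        fh.write("\n".join(lines))
```

The report intentionally does not mark an original; that choice is left to manual review (see the FAQ).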

## Implementation

### Script: find_duplicates.ps1

1. Recurses through the directory tree (respecting excludes).
2. Calculates the SHA256 hash of every file at or above `--min-size`.
3. Groups files by hash.
4. Discards groups containing fewer than two files (unique content).
5. Generates the Markdown report.
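The steps above can be sketched in Python (the actual skill script is PowerShell; `find_duplicates` and its parameters here are illustrative, mirroring the documented inputs):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Default excluded directories, matching the skill's documented defaults.
EXCLUDES = {"node_modules", ".git", "bin", "obj"}

def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files are not loaded fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root: str, min_size: int = 0, excludes=EXCLUDES) -> dict:
    """Group files under `root` by content hash; keep only groups of 2+ files."""
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if any(part in excludes for part in path.parts):
            continue  # skip anything inside an excluded directory
        if path.is_file() and path.stat().st_size >= min_size:
            groups[sha256_of(path)].append(path)
    # Only groups with two or more files represent duplicates.
    return {h: files for h, files in groups.items() if len(files) >= 2}
```

Hashing by content rather than name is what lets the scan catch renamed copies such as `math.ts` and `math_copy.ts`.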

## Use Cases

1. **Cleanup:** Reducing repo size by removing accidental copies of large assets.
2. **Refactoring:** Finding code that was copy-pasted instead of shared.

Overview

This skill scans a workspace to identify files with identical content using SHA256 hashing. It produces a Markdown report that groups duplicates, shows total wasted space, and suggests candidates for deduplication. The goal is to reduce redundancy, surface accidental copies, and simplify cleanup tasks.

How this skill works

The script recursively walks the target directory while respecting configurable exclude patterns and a minimum file size. It computes a SHA256 hash for each file, groups files by hash, and filters out groups with only one file. Finally, it summarizes duplicate groups and aggregate wasted space in a DUPLICATE_REPORT.md file.
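The "wasted space" figure in the report follows directly from the grouping: for each duplicate group, one copy is kept and every extra copy counts as reclaimable bytes. A minimal Python sketch of that arithmetic (the function name `wasted_bytes` and the `groups` shape are assumptions for illustration; the skill itself is implemented in PowerShell):

```python
def wasted_bytes(groups: dict) -> int:
    """Sum the bytes reclaimable by keeping one file per duplicate group.

    `groups` maps a content hash to a list of pathlib.Path objects with
    identical content, as produced by a duplicate scan.
    """
    total = 0
    for files in groups.values():
        size = files[0].stat().st_size  # identical content => identical size
        total += size * (len(files) - 1)  # every copy beyond the first is waste
    return total
```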

When to use it

  • Before repository cleanup or large refactors to locate redundant assets or code copies
  • When you suspect accidental forks or copy-paste proliferation across the workspace
  • To reduce repository size by identifying duplicate large binaries or media files
  • As a periodic maintenance step in CI pipelines for monorepos or multi-package projects
  • Prior to creating releases or containers to avoid shipping duplicate files

Best practices

  • Exclude build folders (node_modules, .git, bin, obj) to avoid scanning generated artifacts
  • Set a sensible --min-size to skip trivial small files and speed up scans
  • Review DUPLICATE_REPORT.md before deleting: confirm originals and intended copies
  • Use version control or backups when removing files to prevent accidental data loss
  • Automate scans in CI but gate destructive actions behind manual review

Example use cases

  • Cleanup: locate duplicate large images or model files and replace copies with references
  • Refactoring: find copy-pasted utility modules to consolidate into a single shared file
  • Repository slimming: identify duplicated vendor libraries mistakenly committed in multiple places
  • Audit: verify that configuration files are not duplicated across environments with slight name changes

FAQ

How does it decide which file is the original?

The script only groups identical content by hash; it does not mark originals. The report lists files in each group for manual inspection to choose the canonical file.

Can I exclude directories or change the minimum size?

Yes. Use the --exclude parameter to add directories and --min-size to ignore small files. Defaults already exclude node_modules, .git, bin, and obj.