duplicate-handling skill

This skill helps you identify and remove PCR and optical duplicates in BAM files using fixmate and markdup to improve variant calling.

npx playbooks add skill gptomics/bioskills --skill duplicate-handling

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: bio-duplicate-handling
description: Mark and remove PCR/optical duplicates using samtools fixmate and markdup. Use when preparing alignments for variant calling or when duplicate reads would bias analysis.
tool_type: cli
primary_tool: samtools
---

## Version Compatibility

Reference examples tested with: picard 3.1+, pysam 0.22+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Duplicate Handling

**"Remove PCR duplicates from my BAM file"** → Mark or remove duplicate reads using the fixmate-sort-markdup pipeline to prevent duplicate bias in variant calling.
- CLI: `samtools fixmate`, `samtools markdup` (samtools)
- Python: `pysam.fixmate()`, `pysam.markdup()` (pysam)

Mark and remove PCR/optical duplicates using samtools.

## Why Remove Duplicates?

PCR duplicates are identical copies of the same original molecule, created during library preparation. They:
- Inflate coverage artificially
- Bias allele frequencies
- Can create false positive variant calls

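Optical duplicates arise on the sequencer rather than during library preparation: a single cluster is read as two or more nearby clusters on the flowcell, so they are recognized by the physical proximity of otherwise identical reads.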
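To make the allele-frequency bias concrete, here is a small arithmetic sketch. The read counts are invented for illustration and do not come from real data; it assumes a single alt-supporting fragment was over-amplified during PCR.

```python
def allele_fraction(ref_reads: int, alt_reads: int) -> float:
    """Fraction of reads supporting the alternate allele."""
    total = ref_reads + alt_reads
    return alt_reads / total if total else 0.0

# Hypothetical site: 18 unique ref fragments, 2 unique alt fragments.
unique_ref, unique_alt = 18, 2
# Assumption: one alt fragment was sequenced 6 extra times as PCR copies.
pcr_copies_of_alt = 6

with_dups = allele_fraction(unique_ref, unique_alt + pcr_copies_of_alt)
without_dups = allele_fraction(unique_ref, unique_alt)

print(f"AF with duplicates:    {with_dups:.3f}")     # 0.308
print(f"AF without duplicates: {without_dups:.3f}")  # 0.100
```

In this made-up case the duplicates triple the apparent alt allele fraction, which is the kind of shift that can turn a low-level artifact into a called variant.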

## Duplicate Marking Workflow

**Goal:** Mark PCR/optical duplicates so they can be excluded from downstream variant calling and coverage analysis.

**Approach:** Name-sort, add mate tags with fixmate, coordinate-sort, then run markdup. The pipeline version avoids intermediate files.

**Reference (samtools 1.19+):**
```bash
# 1. Sort by name (required for fixmate)
samtools sort -n -o namesort.bam input.bam

# 2. Add mate information with fixmate
samtools fixmate -m namesort.bam fixmate.bam

# 3. Sort by coordinate (required for markdup)
samtools sort -o coordsort.bam fixmate.bam

# 4. Mark duplicates
samtools markdup coordsort.bam marked.bam

# 5. Index result
samtools index marked.bam
```

### Pipeline Version
```bash
samtools sort -n input.bam | \
    samtools fixmate -m - - | \
    samtools sort - | \
    samtools markdup - marked.bam

samtools index marked.bam
```
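A plain shell pipeline reports only the exit status of its last command unless `pipefail` is set, so a failed sort or fixmate can go unnoticed. The following is a minimal sketch of the same four stages driven from Python with `subprocess`, checking every stage's exit code. The function names `build_markdup_pipeline` and `run_pipeline` are hypothetical helpers, not part of samtools or pysam.

```python
import shlex
import subprocess

def build_markdup_pipeline(input_bam, output_bam, threads=0):
    """Return the argv for each stage of sort -n | fixmate -m | sort | markdup."""
    t = ["-@", str(threads)] if threads > 0 else []
    return [
        ["samtools", "sort", "-n", *t, input_bam],
        ["samtools", "fixmate", "-m", *t, "-", "-"],
        ["samtools", "sort", *t, "-"],
        ["samtools", "markdup", *t, "-", output_bam],
    ]

def run_pipeline(stages):
    """Chain the stages with pipes and raise if any stage exits non-zero."""
    procs = []
    prev_stdout = None
    for i, argv in enumerate(stages):
        is_last = i == len(stages) - 1
        proc = subprocess.Popen(
            argv,
            stdin=prev_stdout,
            stdout=None if is_last else subprocess.PIPE,
        )
        if prev_stdout is not None:
            # Close our copy so the upstream stage sees a broken pipe
            # if the downstream stage exits early.
            prev_stdout.close()
        prev_stdout = proc.stdout
        procs.append(proc)
    codes = [p.wait() for p in procs]
    for argv, code in zip(stages, codes):
        if code != 0:
            raise RuntimeError(f"stage failed ({code}): {shlex.join(argv)}")

stages = build_markdup_pipeline("input.bam", "marked.bam", threads=4)
print(" | ".join(shlex.join(s) for s in stages))
# run_pipeline(stages)  # uncomment where samtools is installed
```

Building the commands separately from running them makes it possible to log or inspect the exact pipeline before any data is touched.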

## samtools fixmate

Adds mate information required by markdup. Must be run on name-sorted BAM.

### Basic Usage
```bash
samtools fixmate namesorted.bam fixmate.bam
```

### Add Mate Score Tag (-m)
```bash
# Required for markdup to work correctly
samtools fixmate -m namesorted.bam fixmate.bam
```

### Multi-threaded
```bash
samtools fixmate -m -@ 4 namesorted.bam fixmate.bam
```

### Remove Secondary/Unmapped
```bash
samtools fixmate -r -m namesorted.bam fixmate.bam
```

## samtools markdup

Marks or removes duplicate alignments. Requires coordinate-sorted BAM with mate tags from fixmate.

### Mark Duplicates (Keep in File)
```bash
samtools markdup input.bam marked.bam
```

### Remove Duplicates
```bash
samtools markdup -r input.bam deduped.bam
```

### Output Statistics
```bash
samtools markdup -s input.bam marked.bam 2> markdup_stats.txt
```

### Optical Duplicate Distance
```bash
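# Set pixel distance for optical duplicate detection (default: 0, i.e. disabled).
# The samtools manual suggests about 100 for HiSeq-style and 2500 for NovaSeq-style platforms.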
samtools markdup -d 2500 input.bam marked.bam
```

### Multi-threaded
```bash
samtools markdup -@ 4 input.bam marked.bam
```

### Write Stats to File
```bash
samtools markdup -f stats.txt input.bam marked.bam
```

## Duplicate Statistics

### Check Duplicate Rate
```bash
samtools flagstat marked.bam
# Look for "duplicates" line
```

### Count Duplicates
```bash
# Count reads with duplicate flag
samtools view -c -f 1024 marked.bam
```

### Percentage Duplicates
```bash
total=$(samtools view -c marked.bam)
dups=$(samtools view -c -f 1024 marked.bam)
echo "scale=2; $dups * 100 / $total" | bc
```
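The shell version above scans the BAM twice and fails in `bc` if the file is empty. As an alternative, the rate can be read from a single `samtools flagstat` run. This sketch assumes the default text output, where each line has the form `<passed> + <failed> <label>`; the helper name `duplicate_rate` and the sample numbers are made up for illustration.

```python
import re

# Matches flagstat lines such as "250 + 0 duplicates".
LINE = re.compile(r"^(\d+) \+ (\d+) (.*)$")

def duplicate_rate(flagstat_text: str) -> float:
    """Return duplicates as a percentage of QC-passed records (0.0 if empty)."""
    total = dups = None
    for line in flagstat_text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue
        passed, label = int(m.group(1)), m.group(3)
        if label.startswith("in total"):
            total = passed
        elif label.startswith("duplicates"):
            # "primary duplicates" starts with "primary", so it is not matched here.
            dups = passed
    if total is None or dups is None:
        raise ValueError("flagstat output missing 'in total' or 'duplicates' line")
    return 100.0 * dups / total if total else 0.0

# Illustrative input only; these counts are not real results.
example = """\
2000 + 0 in total (QC-passed reads + QC-failed reads)
1900 + 0 primary
250 + 0 duplicates
240 + 0 primary duplicates
"""
print(f"{duplicate_rate(example):.2f}%")  # 12.50%
```

In practice the text would come from something like `subprocess.run(["samtools", "flagstat", "marked.bam"], capture_output=True, text=True).stdout`.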

## pysam Python Alternative

### Full Pipeline
```python
import pysam

# Sort by name
pysam.sort('-n', '-o', 'namesort.bam', 'input.bam')

# Fixmate
pysam.fixmate('-m', 'namesort.bam', 'fixmate.bam')

# Sort by coordinate
pysam.sort('-o', 'coordsort.bam', 'fixmate.bam')

# Mark duplicates
pysam.markdup('coordsort.bam', 'marked.bam')

# Index
pysam.index('marked.bam')
```

### Check Duplicate Flag
```python
import pysam

with pysam.AlignmentFile('marked.bam', 'rb') as bam:
    total = 0
    duplicates = 0
    for read in bam:
        total += 1
        if read.is_duplicate:
            duplicates += 1

    print(f'Total: {total}')
    print(f'Duplicates: {duplicates}')
    # Guard against an empty file to avoid ZeroDivisionError
    rate = duplicates / total * 100 if total else 0.0
    print(f'Rate: {rate:.2f}%')
```

### Filter Out Duplicates
```python
import pysam

with pysam.AlignmentFile('marked.bam', 'rb') as infile:
    with pysam.AlignmentFile('nodup.bam', 'wb', header=infile.header) as outfile:
        for read in infile:
            if not read.is_duplicate:
                outfile.write(read)
```

### Mark Duplicates Manually (Simple Case)
```python
import pysam

def simple_markdup(input_bam, output_bam):
    # Naive illustration only: the first read seen at each position is kept,
    # whatever its quality, and soft clipping is ignored.
    # Prefer samtools markdup for real analyses.
    seen = set()

    with pysam.AlignmentFile(input_bam, 'rb') as infile:
        with pysam.AlignmentFile(output_bam, 'wb', header=infile.header) as outfile:
            for read in infile:
                # Unmapped reads have no position to compare; pass them through
                if read.is_unmapped:
                    outfile.write(read)
                    continue

                # Reads sharing position, strand, and mate position count as duplicates
                key = (read.reference_id, read.reference_start, read.is_reverse,
                       read.next_reference_id, read.next_reference_start)

                if key in seen:
                    read.is_duplicate = True
                else:
                    seen.add(key)

                outfile.write(read)

simple_markdup('sorted.bam', 'marked.bam')
```

## Alternative: From Aligner

Duplicates can also be marked by other tools, either streamed directly from aligner output (samblaster) or on an existing BAM (Picard):

### BWA-MEM2 with samblaster
```bash
bwa-mem2 mem ref.fa R1.fq R2.fq | \
    samblaster | \
    samtools sort -o marked.bam
```

### Using Picard (Alternative Tool)
```bash
java -jar picard.jar MarkDuplicates \
    -I input.bam \
    -O marked.bam \
    -M metrics.txt
```

## Quick Reference

| Task | Command |
|------|---------|
| Full workflow | `sort -n \| fixmate -m \| sort \| markdup` |
| Mark duplicates | `samtools markdup in.bam out.bam` |
| Remove duplicates | `samtools markdup -r in.bam out.bam` |
| Count duplicates | `samtools view -c -f 1024 marked.bam` |
| View non-duplicates | `samtools view -F 1024 marked.bam` |
| Get stats | `samtools markdup -s in.bam out.bam` |

## Duplicate FLAG

| Flag | Value | Meaning |
|------|-------|---------|
| 0x400 | 1024 | PCR or optical duplicate |
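The duplicate flag is a single bit in the SAM FLAG integer, so it can be tested or cleared with bitwise operations. A minimal sketch follows; the helper names are hypothetical.

```python
DUPLICATE = 0x400  # 1024, the duplicate bit in the SAM FLAG field

def is_duplicate(flag: int) -> bool:
    """True if the duplicate bit is set in a SAM FLAG value."""
    return bool(flag & DUPLICATE)

def clear_duplicate(flag: int) -> int:
    """Return the flag with the duplicate bit unset."""
    return flag & ~DUPLICATE

for flag in (99, 1123, 1024):
    print(flag, hex(flag), is_duplicate(flag))
# 99 0x63 False
# 1123 0x463 True
# 1024 0x400 True
```

This is the same test that `samtools view -f 1024` and `-F 1024` apply when including or excluding records.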

### Filter Commands
```bash
# View only duplicates
samtools view -f 1024 marked.bam

# View non-duplicates only
samtools view -F 1024 marked.bam

# Count non-duplicates
samtools view -c -F 1024 marked.bam
```

## Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| `mate not found` | Input not name-sorted | Run `samtools sort -n` first |
| `no MC tag` | fixmate not run before markdup, so the mate tags it adds are missing | Run `samtools fixmate -m` on name-sorted input, then coordinate-sort |
| `not coordinate sorted` | Input to markdup not sorted | Run `samtools sort` after fixmate |
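Most of these errors come down to sort order, which can be checked up front from the `SO:` field of the `@HD` header line (for example from `samtools view -H`). The sketch below is a hypothetical pre-flight check, not part of samtools. It is deliberately strict: it expects `SO:queryname` before fixmate, although fixmate also accepts name-collated input.

```python
def sort_order(header_text: str) -> str:
    """Return the SO value from the @HD line, or 'unknown' if absent."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return "unknown"

def check_ready_for(step: str, header_text: str) -> None:
    """Raise ValueError if the header's sort order does not suit the step."""
    expected = {"fixmate": "queryname", "markdup": "coordinate"}[step]
    actual = sort_order(header_text)
    if actual != expected:
        raise ValueError(f"{step} expects SO:{expected}, header says SO:{actual}")

# Illustrative header; the reference name and length are placeholders.
header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:ref1\tLN:1000\n"
print(sort_order(header))           # coordinate
check_ready_for("markdup", header)  # passes silently
```

Calling `check_ready_for("fixmate", header)` on the same header raises `ValueError`, pointing at the missing `samtools sort -n` step before any long-running job starts.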

## Related Skills

- alignment-sorting - Sort by name/coordinate for workflow
- alignment-filtering - Filter duplicates from output
- bam-statistics - Check duplicate rates with flagstat
- variant-calling - Duplicate marking before calling

Overview

This skill marks and optionally removes PCR and optical duplicates from BAM/CRAM alignments using samtools (fixmate + markdup) or equivalent pysam calls. It provides a reliable pipeline to add mate tags, ensure proper sorting, mark duplicates, and produce indexed outputs ready for variant calling. The goal is to prevent duplicate-induced coverage inflation and biased allele frequencies.

How this skill works

The workflow name-sorts the file, runs fixmate to add mate information (and mate score tags), coordinate-sorts the result, then runs markdup to mark or remove duplicates. Commands are available as a streaming pipeline to avoid unnecessary intermediates, and equivalent pysam examples show the same steps programmatically. The skill also covers statistics (flagstat, counts) and common error cases with solutions.

When to use it

  • Before variant calling to avoid duplicate-driven false positives
  • When PCR amplification or optical duplication is suspected in the library
  • When preparing alignments for accurate coverage or allele frequency estimates
  • When aligner output does not already include reliable duplicate marking
  • When you need audit-able duplicate statistics and indexed output

Best practices

  • Verify tool versions (samtools, pysam, picard) and adjust flags to match installed APIs
  • Always run samtools sort -n before fixmate and coordinate sort before markdup
  • Use fixmate -m to add mate score tags required by markdup
  • Stream commands to reduce temporary files: sort -n | fixmate -m - - | sort - | markdup - out.bam
  • Save markdup stats (-s or -f) and run samtools index on the final BAM

Example use cases

  • CLI pipeline to mark duplicates and index: samtools sort -n in.bam | samtools fixmate -m - - | samtools sort - | samtools markdup - marked.bam && samtools index marked.bam
  • Remove duplicates for a strict analysis: samtools markdup -r coordsort.bam deduped.bam
  • Programmatic pipeline in Python using pysam.sort, pysam.fixmate, pysam.markdup, then pysam.index
  • Compute duplicate rate: use samtools flagstat and samtools view -c -f 1024 marked.bam
  • Filter out duplicates in Python by checking read.is_duplicate and writing non-duplicates

FAQ

Why does markdup say "mate not found"?

That error means the input was not name-sorted before fixmate. Run samtools sort -n first, then fixmate -m, then coordinate-sort before markdup.

Do I need fixmate -m?

Yes. fixmate -m writes the mate score (ms) tags that markdup uses to choose which read of each duplicate set to keep; without them markdup cannot rank paired reads correctly.