home / skills / gptomics / bioskills / assembly-polishing

assembly-polishing skill

/genome-assembly/assembly-polishing

This skill helps you polish genome assemblies by applying Pilon, medaka, and racon workflows to improve base accuracy with short and long reads.

npx playbooks add skill gptomics/bioskills --skill assembly-polishing

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
7.6 KB
---
name: bio-genome-assembly-assembly-polishing
description: Polish genome assemblies to reduce errors using short reads (Pilon), long reads (Racon), or ONT-specific tools (medaka). Essential for improving long-read assembly accuracy. Use when improving assembly accuracy with polishing tools.
tool_type: cli
primary_tool: Pilon
---

## Version Compatibility

Reference examples tested with: BWA 0.7.17+, QUAST 5.2+, minimap2 2.26+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Assembly Polishing

**"Polish my genome assembly"** → Iteratively correct base-level errors in a draft assembly using short-read or long-read alignments.
- CLI: `pilon --genome draft.fa --frags short_reads.bam` (short-read), `medaka_polish` (ONT), `racon` (long-read)

## Polishing Strategies

| Tool | Input Reads | Best For |
|------|-------------|----------|
| Pilon | Illumina | Final polishing |
| medaka | ONT | ONT assemblies |
| Racon | Long reads | Quick polishing |
| NextPolish | Both | Combined approach |

## Recommended Workflows

### ONT Assembly
1. Racon (2-3 rounds with ONT)
2. medaka (1 round)
3. Pilon (2-3 rounds with Illumina)

### PacBio CLR Assembly
1. Racon (2-3 rounds)
2. Pilon (2-3 rounds with Illumina)

### PacBio HiFi Assembly
- Often no polishing needed (>99% accuracy)
- Optional Pilon if Illumina available

## Pilon (Illumina Polishing)

### Installation

```bash
conda install -c bioconda pilon
```

### Basic Usage

```bash
# Map short reads to assembly
bwa index assembly.fasta
bwa mem -t 16 assembly.fasta R1.fq.gz R2.fq.gz | samtools sort -o aligned.bam
samtools index aligned.bam

# Run Pilon
pilon --genome assembly.fasta --frags aligned.bam --output polished
```

### Key Options

| Option | Description |
|--------|-------------|
| `--genome` | Input assembly |
| `--frags` | Paired-end BAM |
| `--output` | Output prefix |
| `--changes` | Write changes file |
| `--vcf` | Write VCF of changes |
| `--fix` | What to fix (snps, indels, gaps, all) |
| `--threads` | Threads for alignment |
| `--mindepth` | Min depth for correction |

### Multiple Rounds

```bash
#!/bin/bash
ASSEMBLY=$1
R1=$2
R2=$3
ROUNDS=${4:-3}

current=$ASSEMBLY

for i in $(seq 1 $ROUNDS); do
    echo "=== Pilon round $i ==="
    bwa index $current
    bwa mem -t 16 $current $R1 $R2 | samtools sort -o round${i}.bam
    samtools index round${i}.bam

    pilon --genome $current --frags round${i}.bam --output pilon_round${i} --changes

    current=pilon_round${i}.fasta

    changes=$(wc -l < pilon_round${i}.changes)
    echo "Changes made: $changes"

    if [ $changes -eq 0 ]; then
        echo "No more changes, stopping"
        break
    fi
done

cp $current final_polished.fasta
```

### Fix Specific Issues

```bash
# Only fix SNPs and small indels
pilon --genome assembly.fa --frags aligned.bam --output polished --fix snps,indels

# Only fill gaps
pilon --genome assembly.fa --frags aligned.bam --output polished --fix gaps
```

## medaka (ONT Polishing)

### Installation

```bash
conda install -c bioconda medaka
```

### Basic Usage

```bash
medaka_consensus -i reads.fastq.gz -d assembly.fasta -o medaka_output -t 8
```

### Key Options

| Option | Description |
|--------|-------------|
| `-i` | Input reads |
| `-d` | Draft assembly |
| `-o` | Output directory |
| `-t` | Threads |
| `-m` | Model name |

### Model Selection

```bash
# List available models
medaka tools list_models

# Use specific model (match your basecaller)
medaka_consensus -i reads.fq.gz -d assembly.fa -o output -m r1041_e82_400bps_sup_v5.1.0
```

### Models for Common Chemistries

| Chemistry | Model |
|-----------|-------|
| R10.4.1 + SUP | r1041_e82_400bps_sup_* |
| R10.4.1 + HAC | r1041_e82_400bps_hac_* |
| R9.4.1 + SUP | r941_sup_* |

### Output

```
medaka_output/
├── consensus.fasta    # Polished assembly
├── calls_to_draft.bam # Alignments
└── *.hdf              # Intermediate files
```

## Racon (Long-Read Polishing)

### Installation

```bash
conda install -c bioconda racon
```

### Basic Usage

```bash
# Map reads to assembly
minimap2 -ax map-ont assembly.fasta reads.fastq.gz > aligned.sam

# Polish
racon -t 16 reads.fastq.gz aligned.sam assembly.fasta > polished.fasta
```

### Multiple Rounds

```bash
#!/bin/bash
ASSEMBLY=$1
READS=$2
ROUNDS=${3:-3}

current=$ASSEMBLY

for i in $(seq 1 $ROUNDS); do
    echo "=== Racon round $i ==="
    minimap2 -ax map-ont $current $READS > round${i}.sam
    racon -t 16 $READS round${i}.sam $current > racon_round${i}.fasta
    current=racon_round${i}.fasta
done

cp $current racon_polished.fasta
```

### Key Options

| Option | Description |
|--------|-------------|
| `-t` | Threads |
| `-m` | Match score (default: 3) |
| `-x` | Mismatch score (default: -5) |
| `-g` | Gap penalty (default: -4) |
| `-w` | Window size (default: 500) |

## Complete Polishing Workflow

**Goal:** Maximize assembly accuracy through iterative multi-tool polishing.

**Approach:** Apply Racon with long reads, then medaka for ONT-specific error correction, then Pilon with short reads for final accuracy.

### ONT Assembly Polishing

```bash
#!/bin/bash
set -euo pipefail

ASSEMBLY=$1      # Flye assembly
ONT_READS=$2     # ONT reads
ILLUMINA_R1=$3   # Illumina R1
ILLUMINA_R2=$4   # Illumina R2
OUTDIR=$5

mkdir -p $OUTDIR

# Step 1: Racon polishing (2 rounds)
echo "=== Racon Polishing ==="
current=$ASSEMBLY
for i in 1 2; do
    minimap2 -ax map-ont $current $ONT_READS > ${OUTDIR}/racon_${i}.sam
    racon -t 16 $ONT_READS ${OUTDIR}/racon_${i}.sam $current > ${OUTDIR}/racon_${i}.fasta
    current=${OUTDIR}/racon_${i}.fasta
done

# Step 2: medaka polishing
echo "=== medaka Polishing ==="
medaka_consensus -i $ONT_READS -d $current -o ${OUTDIR}/medaka -t 8
current=${OUTDIR}/medaka/consensus.fasta

# Step 3: Pilon polishing (2 rounds)
echo "=== Pilon Polishing ==="
for i in 1 2; do
    bwa index $current
    bwa mem -t 16 $current $ILLUMINA_R1 $ILLUMINA_R2 | samtools sort -o ${OUTDIR}/pilon_${i}.bam
    samtools index ${OUTDIR}/pilon_${i}.bam
    pilon --genome $current --frags ${OUTDIR}/pilon_${i}.bam --output ${OUTDIR}/pilon_${i}
    current=${OUTDIR}/pilon_${i}.fasta
done

cp $current ${OUTDIR}/final_polished.fasta
echo "Done: ${OUTDIR}/final_polished.fasta"
```

## NextPolish (Combined Approach)

### Installation

```bash
conda install -c bioconda nextpolish
```

### Usage

```bash
# Create config file
cat > run.cfg << EOF
[General]
job_type = local
job_prefix = nextPolish
task = best
rewrite = yes
rerun = 3
parallel_jobs = 2
multithread_jobs = 8
genome = assembly.fasta
genome_size = auto
workdir = ./01_rundir
[lgs_option]
lgs_fofn = lgs.fofn
lgs_options = -min_read_len 1k -max_depth 100
lgs_minimap2_options = -x map-ont
[sgs_option]
sgs_fofn = sgs.fofn
sgs_options = -max_depth 100
EOF

# File of filenames
ls reads.fastq.gz > lgs.fofn
ls R1.fq.gz R2.fq.gz > sgs.fofn

# Run
nextPolish run.cfg
```

## Quality Assessment

**Goal:** Measure the accuracy improvement from polishing.

**Approach:** Compare original and polished assemblies with QUAST against a reference, and check alignment error rates.

After polishing, assess improvement:

```bash
# Compare to reference (if available)
quast.py -r reference.fa original.fa polished.fa -o quast_comparison

# Check error rate
minimap2 -ax map-ont polished.fa reads.fq.gz | samtools stats | grep "error rate"
```

## Related Skills

- long-read-assembly - Initial assembly
- short-read-assembly - Source of polishing reads
- assembly-qc - Assess polishing improvement
- long-read-sequencing - medaka variant calling

Overview

This skill polishes draft genome assemblies to reduce base-level errors using short reads (Pilon), long reads (Racon), and ONT-specific polishing (medaka). It describes recommended multi-tool workflows, example commands, and iterative strategies to maximize final assembly accuracy. The guidance covers tool selection by read type and practical tips for multi-round polishing and quality assessment.

How this skill works

The skill aligns reads back to the draft assembly, then runs polishers to correct SNPs, indels and fill small gaps. For long-read drafts it recommends iterative Racon rounds followed by medaka for ONT-specific errors; for final accuracy it maps Illumina reads and runs Pilon for one or more rounds. It also covers combined tools (NextPolish) and verification using QUAST and alignment error-rate metrics.

When to use it

  • You have a draft assembly with residual base errors after initial assembly.
  • You have long-read (ONT/PacBio) data and want quick consensus improvement.
  • You have Illumina short reads and need final polishing for high base accuracy.
  • You plan to produce a reference-quality assembly or run downstream variant calling.
  • You want to compare different polishers or document improvements with QUAST.

Best practices

  • Match tools to read type: Racon for long reads, medaka for ONT, Pilon for Illumina short reads.
  • Run multiple polishing rounds and stop when changes converge (no changes file entries).
  • Index and sort BAM/SAM files before polishing; provide sufficient threads to speed mapping.
  • Select medaka models that match your basecaller/chemistry (list models before use).
  • Verify tool versions and adapt example flags if your installed version differs.

Example use cases

  • ONT assembly: run 2–3 rounds of Racon, then 1 medaka round, then 2 Pilon rounds with Illumina for final polishing.
  • PacBio CLR assembly: 2–3 Racon rounds followed by Pilon with short reads if available.
  • HiFi assemblies: often no polishing needed; optionally run Pilon with Illumina for edge improvements.
  • Combined NextPolish workflow to integrate long and short reads in a single automated run.
  • Quality assessment: run QUAST against a reference and compute alignment error rates with minimap2 + samtools stats.

FAQ

How many polishing rounds should I run?

Start with 2–3 rounds of long-read polishing (Racon) and 1 medaka round for ONT, then 1–3 Pilon rounds with Illumina. Stop when the changes file reports zero changes or improvements plateau.

Do I always need Pilon after medaka?

Not always. Pilon improves short-read–supported base accuracy and helps correct residual errors; include it when Illumina data are available and when highest per-base accuracy is required.