long-read-assembly skill

/genome-assembly/long-read-assembly

This skill helps assemble genomes from ONT or PacBio long reads using Flye and Canu, producing high-contiguity bacterial assemblies.

npx playbooks add skill gptomics/bioskills --skill long-read-assembly

---
name: bio-genome-assembly-long-read-assembly
description: De novo genome assembly from Oxford Nanopore or PacBio long reads using Flye and Canu. Produces highly contiguous assemblies suitable for complete bacterial genomes and resolving complex regions. Use when assembling genomes from ONT or PacBio reads.
tool_type: cli
primary_tool: Flye
---

## Version Compatibility

Reference examples tested with: Canu 2.2+, Flye 2.9+, hifiasm 0.19+, wtdbg2 2.5+

Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If a command fails with an unrecognized-option or usage error, check `<tool> --help`
for the installed version and adapt the example to the actual CLI rather than retrying.
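
The version check above can be scripted. The following is a minimal sketch, not part of the original skill: `version_ge`, `extract_version`, and `check_min_version` are hypothetical helper names, and the comparison assumes a `sort` that supports `-V` (GNU coreutils or a recent BSD sort).

```bash
# version_ge INSTALLED REQUIRED
# Succeeds when INSTALLED >= REQUIRED (assumes `sort -V` is available).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# extract_version: print the first dotted version number found on stdin.
extract_version() {
    grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

# check_min_version TOOL MIN
# Runs `TOOL --version` and compares the reported version against MIN.
check_min_version() {
    local tool=$1 min=$2 installed
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: not found on PATH" >&2
        return 1
    fi
    installed=$("$tool" --version 2>&1 | extract_version)
    if version_ge "$installed" "$min"; then
        echo "$tool $installed OK (>= $min)"
    else
        echo "$tool $installed is older than $min; confirm flags with $tool --help" >&2
        return 1
    fi
}
```

For example, `check_min_version flye 2.9` reports whether the installed Flye meets the version the examples were tested with.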

# Long-Read Assembly

**"Assemble a genome from long reads"** → Build a contiguous de novo assembly from ONT or PacBio reads, producing complete or near-complete chromosomes.
- CLI: `flye --nano-raw reads.fq -o output` (ONT), `canu -p asm -d output genomeSize=5m -nanopore reads.fq` (ONT)

## Tool Comparison

| Tool | Speed | Memory | Best For |
|------|-------|--------|----------|
| Flye | Fast | Moderate | General purpose, bacteria, ONT |
| Canu | Slow | High | High accuracy, complex genomes |
| wtdbg2 | Very fast | Low | Draft assemblies |

> **Note:** For PacBio HiFi data, see the dedicated **hifi-assembly** skill which covers hifiasm.

## Flye

### Installation

```bash
conda install -c bioconda flye
```

### Basic Usage

```bash
# Oxford Nanopore
flye --nano-raw reads.fastq.gz --out-dir flye_output --threads 16

# PacBio CLR
flye --pacbio-raw reads.fastq.gz --out-dir flye_output --threads 16

# PacBio HiFi
flye --pacbio-hifi reads.fastq.gz --out-dir flye_output --threads 16
```

### Read Type Options

| Option | Read Type |
|--------|-----------|
| `--nano-raw` | ONT regular reads (pre-Guppy 5, <20% error) |
| `--nano-corr` | ONT reads corrected with other methods (<3% error) |
| `--nano-hq` | ONT high-quality reads: Guppy 5+ SUP or Q20 (<5% error) |
| `--pacbio-raw` | PacBio CLR |
| `--pacbio-corr` | PacBio corrected |
| `--pacbio-hifi` | PacBio HiFi/CCS |
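
When a pipeline handles several platforms, the table can be encoded as a lookup so the read-type flag is chosen in one place. This is a sketch only: the function name `flye_read_flag` and the short keys (`ont-raw`, `clr`, and so on) are made up for illustration; the flags themselves are the ones listed above.

```bash
# flye_read_flag KIND
# Prints the Flye input flag for a read type; KIND keys are illustrative.
flye_read_flag() {
    case "$1" in
        ont-raw)          echo "--nano-raw" ;;
        ont-corrected)    echo "--nano-corr" ;;
        ont-hq)           echo "--nano-hq" ;;
        clr)              echo "--pacbio-raw" ;;
        pacbio-corrected) echo "--pacbio-corr" ;;
        hifi)             echo "--pacbio-hifi" ;;
        *)
            echo "unknown read kind: $1" >&2
            return 1
            ;;
    esac
}
```

Usage: `flye "$(flye_read_flag ont-hq)" reads.fastq.gz --out-dir flye_output --threads 16`.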

### Key Options

| Option | Description |
|--------|-------------|
| `--out-dir` | Output directory |
| `--threads` | Number of threads |
| `--genome-size` | Estimated genome size (e.g., 5m, 100m); optional since Flye 2.8 |
| `--iterations` | Polishing iterations (default: 1) |
| `--meta` | Metagenome mode |
| `--plasmids` | Recover short plasmids (Flye 2.8 and earlier; retained but unused in 2.9+, confirm with `flye --help`) |
| `--keep-haplotypes` | Don't collapse haplotypes |
| `--scaffold` | Enable scaffolding |

### Genome Size Estimation

```bash
# Optional since Flye 2.8; pass an estimate when the size is known
flye --nano-raw reads.fq.gz --out-dir output --genome-size 5m

# Size formats: 1000, 1k, 1m, 1g
```
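
The size suffixes can be expanded to a plain base count, which is handy for coverage arithmetic. The helper below is a sketch with a hypothetical name (`size_to_bp`); it accepts only whole numbers with an optional `k`, `m`, or `g` suffix, matching the formats listed above.

```bash
# size_to_bp SIZE
# Converts 1000, 1k, 1m, 1g style sizes to a number of bases.
size_to_bp() {
    local size num mult
    size=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    num=${size%[kmg]}          # strip one trailing suffix letter, if any
    case "$size" in
        *k) mult=1000 ;;
        *m) mult=1000000 ;;
        *g) mult=1000000000 ;;
        *)  mult=1 ;;
    esac
    if ! [[ "$num" =~ ^[0-9]+$ ]]; then
        echo "unrecognized size: $1" >&2
        return 1
    fi
    # 10# forces base 10 so a leading zero is not read as octal
    echo $(( 10#$num * mult ))
}
```

For instance, `size_to_bp 5m` prints `5000000`.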

### Output Files

```
flye_output/
├── assembly.fasta       # Final assembly
├── assembly_graph.gfa   # Assembly graph
├── assembly_info.txt    # Contig statistics
└── flye.log             # Log file
```
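
A quick summary of `assembly_info.txt` shows whether a bacterial assembly closed into circular contigs. This sketch assumes the tab-separated layout written by recent Flye releases (a `#`-prefixed header, length in column 2, circularity as `Y`/`N` in column 4); check the header of your own file first, since older versions may mark circularity differently. `summarize_flye_info` is a hypothetical helper name.

```bash
# summarize_flye_info FILE
# Prints contig count, total length, and number of circular contigs.
summarize_flye_info() {
    awk -F'\t' '
        /^#/ { next }                          # skip the header line
        { n++; total += $2; if ($4 == "Y") circ++ }
        END { printf "contigs=%d total_bp=%.0f circular=%d\n", n, total, circ }
    ' "$1"
}
```

Usage: `summarize_flye_info flye_output/assembly_info.txt`.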

### Bacterial Assembly

```bash
flye \
    --nano-raw bacteria.fastq.gz \
    --out-dir bacteria_assembly \
    --genome-size 5m \
    --threads 16
```

### Metagenome Assembly

```bash
flye \
    --nano-raw metagenome.fastq.gz \
    --out-dir meta_assembly \
    --meta \
    --threads 32
```

### With Plasmid Recovery

```bash
# --plasmids targets Flye 2.8 and earlier; confirm with `flye --help` before relying on it in 2.9+
flye \
    --nano-raw isolate.fastq.gz \
    --out-dir assembly \
    --plasmids \
    --threads 16
```

## Canu

### Installation

```bash
conda install -c bioconda canu
```

### Basic Usage

```bash
# ONT reads
canu -p assembly -d canu_output genomeSize=5m -nanopore reads.fastq.gz

# PacBio HiFi
canu -p assembly -d canu_output genomeSize=5m -pacbio-hifi reads.fastq.gz
```

### Key Options

| Option | Description |
|--------|-------------|
| `-p` | Assembly prefix |
| `-d` | Output directory |
| `genomeSize=` | Estimated size (required) |
| `maxThreads=` | Max threads |
| `maxMemory=` | Max memory (e.g., 64g) |
| `useGrid=false` | Disable grid execution |
| `correctedErrorRate=` | Allowed error rate for overlaps between corrected reads |

### Read Type Options

| Option | Read Type |
|--------|-----------|
| `-nanopore` | ONT reads |
| `-nanopore-raw` | ONT raw (deprecated) |
| `-pacbio` | PacBio CLR |
| `-pacbio-hifi` | PacBio HiFi/CCS |

### Local Run with Resource Limits

```bash
canu -p asm -d output genomeSize=5m \
    -nanopore reads.fq.gz \
    useGrid=false \
    maxThreads=16 \
    maxMemory=32g
```

### High-Quality Mode (PacBio HiFi)

```bash
canu -p asm -d output genomeSize=5m \
    -pacbio-hifi reads.fq.gz \
    correctedErrorRate=0.01
```

### Output Files

```
canu_output/
├── assembly.contigs.fasta   # Contigs
├── assembly.unassembled.fasta
├── assembly.report
└── assembly.seqStore/
```

## Wtdbg2 (Fast Draft)

### Installation

```bash
conda install -c bioconda wtdbg
```

### Basic Usage

```bash
# Assemble
wtdbg2 -x ont -g 5m -t 16 -i reads.fq.gz -o draft

# Consensus
wtpoa-cns -t 16 -i draft.ctg.lay.gz -o draft.ctg.fa
```

### Platform Presets

| Preset | Platform |
|--------|----------|
| `-x ont` | Oxford Nanopore |
| `-x ccs` | PacBio HiFi/CCS |
| `-x rs` | PacBio RSII (CLR) |
| `-x sq` | PacBio Sequel (CLR) |

## Complete Workflows

**Goal:** Run end-to-end long-read assembly pipelines from raw reads to contigs.

**Approach:** Use Flye for initial assembly, optionally followed by short-read polishing.

### ONT Bacterial Assembly

```bash
#!/bin/bash
set -euo pipefail

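READS=${1:?usage: $0 READS OUTDIR [GENOME_SIZE]}
OUTDIR=${2:?usage: $0 READS OUTDIR [GENOME_SIZE]}
SIZE=${3:-5m}   # default: typical bacterial genome size

echo "=== ONT Bacterial Assembly ==="

# Make sure the parent output directory exists
mkdir -p "$OUTDIR"

# Flye assembly
flye \
    --nano-raw "$READS" \
    --out-dir "${OUTDIR}/flye" \
    --genome-size "$SIZE" \
    --threads 16

# Stats
echo "Assembly statistics:"
cat "${OUTDIR}/flye/assembly_info.txt"

echo "Assembly: ${OUTDIR}/flye/assembly.fasta"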
```

### Hybrid Assembly (Long + Short)

```bash
#!/bin/bash
set -euo pipefail

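LONG=${1:?usage: $0 LONG_READS SHORT_R1 SHORT_R2 OUTDIR}
SHORT_R1=${2:?missing SHORT_R1}
SHORT_R2=${3:?missing SHORT_R2}
OUTDIR=${4:?missing OUTDIR}

# 1. Long-read assembly with Flye
flye --nano-raw "$LONG" --out-dir "${OUTDIR}/flye" --genome-size 5m --threads 16

# 2. Polish ${OUTDIR}/flye/assembly.fasta with the short reads (Pilon)
# SHORT_R1 and SHORT_R2 are consumed by that step; see the assembly-polishing skill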

## Quality Expectations

| Metric | Bacterial | Eukaryotic |
|--------|-----------|------------|
| Contigs | 1-10 | 100-1000+ |
| N50 | >1 Mb | Variable |
| Complete chromosomes | Often | Rare |
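
To compare an assembly against these expectations without extra tooling, N50 can be computed directly from the FASTA. This is a minimal sketch using `awk` and `sort`; `fasta_n50` is a hypothetical helper name, and dedicated QC tools (see the assembly-qc skill) report far more than this.

```bash
# fasta_n50 FASTA
# Prints the N50: the length of the contig at which the running total of
# contig lengths, sorted longest first, reaches half the assembly size.
fasta_n50() {
    awk '
        /^>/ { if (len) print len; len = 0; next }   # new record: emit previous length
        { len += length($0) }                         # sequence may span several lines
        END { if (len) print len }
    ' "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run >= half) { print lens[i]; exit }
            }
        }
    '
}
```

Usage: `fasta_n50 flye_output/assembly.fasta`.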

## Troubleshooting

### Low Contiguity
- Check coverage (need >30x)
- Inspect `assembly_graph.gfa` for unresolved repeats (extra Flye `--iterations` improve consensus accuracy, not contiguity)
- Consider supplementing with short reads
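
The coverage check is simple arithmetic: total sequenced bases divided by the expected genome size. The sketch below assumes standard four-line FASTQ records and takes the genome size as a plain base count (for example 5000000 for a 5 Mb genome); both function names are hypothetical.

```bash
# fastq_total_bases: read FASTQ on stdin, print the total number of bases.
fastq_total_bases() {
    awk 'NR % 4 == 2 { total += length($0) } END { printf "%.0f\n", total }'
}

# estimate_coverage TOTAL_BASES GENOME_SIZE_BP
# Prints fold coverage with one decimal place.
estimate_coverage() {
    awk -v b="$1" -v g="$2" 'BEGIN {
        if (g <= 0) exit 1
        printf "%.1f\n", b / g
    }'
}
```

Usage: `estimate_coverage "$(gzip -dc reads.fastq.gz | fastq_total_bases)" 5000000`. A result below 30 suggests that more sequencing, rather than parameter tuning, is the likely fix.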

### Memory Issues
- Use Flye (more memory efficient)
- Reduce threads
- Filter reads by length/quality
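
Length filtering can be sketched with plain `awk` when no dedicated read-filtering tool is installed. The example assumes four-line FASTQ records and filters on length only, not quality; `filter_fastq_min_len` is a hypothetical helper name.

```bash
# filter_fastq_min_len MIN_LEN
# Reads FASTQ on stdin and writes only reads with length >= MIN_LEN to stdout.
filter_fastq_min_len() {
    awk -v min="$1" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 {
            # $0 is the quality line; emit the full record if long enough
            if (length(seq) >= min) print header "\n" seq "\n" plus "\n" $0
        }
    '
}
```

Usage: `gzip -dc reads.fastq.gz | filter_fastq_min_len 1000 | gzip > reads.min1k.fastq.gz`. Recheck coverage after filtering so the remaining reads still exceed the 30x guideline.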

### Misassemblies
- Polish with Pilon/medaka
- Validate with short reads
- Check for contamination

## Related Skills

- hifi-assembly - PacBio HiFi assembly with hifiasm
- assembly-polishing - Polish long-read assemblies
- assembly-qc - QUAST and BUSCO assessment
- short-read-assembly - Hybrid with Illumina
- long-read-sequencing - Read QC and alignment

Overview

This skill performs de novo genome assembly from Oxford Nanopore or PacBio long reads using Flye and Canu to produce highly contiguous assemblies. It targets complete bacterial genomes and difficult repetitive regions, offering recommended CLI commands, tool comparisons, and end-to-end workflow examples. Use it to generate assembly contigs suitable for downstream polishing and QC.

How this skill works

The skill provides practical command patterns for running Flye and Canu with the appropriate read-type flags and genome-size settings, plus fast-draft options using wtdbg2. It explains output files to expect, tuning options (threads, memory, iterations), and workflow steps such as Flye-based assembly followed by short-read polishing. Troubleshooting guidance covers coverage checks, memory limits, and common fixes for misassemblies.

When to use it

  • Assembling bacterial genomes from ONT or PacBio CLR reads to obtain near-complete chromosomes
  • Building initial long-read assemblies before polishing with short reads or medaka/Pilon
  • Resolving repetitive or plasmid sequences that short reads cannot span
  • Generating draft assemblies quickly when exploring sample diversity or screening isolates
  • Prioritizing high contiguity over immediate base-level accuracy when >30× long-read coverage is available

Best practices

  • Estimate genome size and pass it to the assembler (e.g., 5m, 100m) to improve assembly behavior
  • Prefer Flye for fast, moderate-memory bacterial assemblies and Canu for higher-accuracy or complex-genome scenarios
  • Filter ultra-short or very low-quality reads if memory or runtime are constrained
  • Run multiple polishing iterations and validate with short reads or alignment-based tools to reduce errors
  • Check tool versions before running and adapt command flags if installed versions differ

Example use cases

  • ONT bacterial assembly: flye --nano-raw reads.fastq.gz --out-dir out --genome-size 5m --threads 16 to produce assembly.fasta and assembly_info.txt
  • High-accuracy or complex assembly: canu -p asm -d out genomeSize=5m -nanopore reads.fastq.gz with maxMemory and maxThreads tuned
  • Metagenome mode for mixed samples: flye --nano-raw metagenome.fastq.gz --out-dir meta --meta --threads 32
  • Fast draft for quick evaluation: wtdbg2 -x ont -g 5m -t 16 -i reads.fq.gz -o draft followed by wtpoa-cns for consensus
  • Hybrid workflow: assemble long reads with Flye then polish with Illumina reads using Pilon or similar tools

FAQ

What coverage do I need for good assemblies?

Aim for at least 30× long-read coverage for bacterial genomes; higher coverage improves contiguity and error correction potential.

Which assembler should I pick for bacteria?

Use Flye for fast, moderate-memory bacterial assemblies; use Canu when accuracy and handling complex repeats outweigh runtime and memory costs.

How do I handle memory or runtime failures?

Reduce threads, filter short reads, or switch to Flye if Canu is exceeding memory. Also verify tool versions and adjust flags accordingly.