home / skills / gptomics / bioskills / metagenomics-pipeline

metagenomics-pipeline skill

/workflows/metagenomics-pipeline

This skill orchestrates end-to-end metagenomics analysis from FASTQ to taxonomic and functional profiles using Kraken2, Bracken, MetaPhlAn, and HUMAnN.

npx playbooks add skill gptomics/bioskills --skill metagenomics-pipeline

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
8.2 KB
---
name: bio-workflows-metagenomics-pipeline
description: End-to-end metagenomics workflow from FASTQ to taxonomic and functional profiles. Covers Kraken2 classification, Bracken abundance estimation, and HUMAnN functional profiling. Use when profiling metagenomic samples.
tool_type: cli
primary_tool: Kraken2
workflow: true
depends_on:
  - read-qc/fastp-workflow
  - metagenomics/kraken-classification
  - metagenomics/metaphlan-profiling
  - metagenomics/abundance-estimation
  - metagenomics/functional-profiling
  - metagenomics/metagenome-visualization
qc_checkpoints:
  - after_qc: "Q30 >80%, host reads removed"
  - after_classification: "Classification rate >60%, known taxa dominant"
  - after_functional: "Pathway coverage reasonable, unmapped <50%"
---

## Version Compatibility

Reference examples tested with: Bowtie2 2.5.3+, Bracken 2.9+, HUMAnN 3.8+, Kraken2 2.1+, MetaPhlAn 4.1+, fastp 0.23+, matplotlib 3.8+, pandas 2.2+, seaborn 0.13+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Metagenomics Pipeline

**"Analyze my metagenomic samples from FASTQ to taxonomic and functional profiles"** → Orchestrate host depletion, Kraken2/Bracken taxonomic classification, MetaPhlAn profiling, HUMAnN3 functional analysis, and AMR gene detection.

Complete workflow from metagenomic FASTQ to taxonomic and functional profiles.

## Workflow Overview

```
FASTQ files
    |
    v
[1. QC & Host Removal] --> fastp + Bowtie2
    |
    v
[2. Taxonomic Classification]
    |
    +---> Kraken2 + Bracken (fast, database-dependent)
    |
    +---> MetaPhlAn (marker-based, standardized)
    |
    v
[3. Functional Profiling] --> HUMAnN
    |
    v
Taxonomic profiles + Pathway abundances
```

## Primary Path: Kraken2 + Bracken + HUMAnN

### Step 1: Quality Control and Host Removal

```bash
# QC with fastp
for sample in sample1 sample2 sample3; do
    fastp -i ${sample}_R1.fastq.gz -I ${sample}_R2.fastq.gz \
        -o trimmed/${sample}_R1.fq.gz -O trimmed/${sample}_R2.fq.gz \
        --detect_adapter_for_pe \
        --qualified_quality_phred 20 \
        --length_required 50 \
        --html qc/${sample}_fastp.html
done

# Remove host reads (human example)
for sample in sample1 sample2 sample3; do
    bowtie2 -p 8 -x human_index \
        -1 trimmed/${sample}_R1.fq.gz \
        -2 trimmed/${sample}_R2.fq.gz \
        --un-conc-gz host_removed/${sample}_R%.fq.gz \
        > /dev/null 2> qc/${sample}_host_removal.log
done
```

### Step 2A: Kraken2 Classification

```bash
# Classify reads
for sample in sample1 sample2 sample3; do
    kraken2 --db kraken2_db \
        --threads 8 \
        --paired \
        --report kraken/${sample}.report \
        --output kraken/${sample}.output \
        host_removed/${sample}_R1.fq.gz \
        host_removed/${sample}_R2.fq.gz
done
```

### Step 2B: Bracken Abundance Estimation

```bash
# Estimate species abundance
for sample in sample1 sample2 sample3; do
    bracken -d kraken2_db \
        -i kraken/${sample}.report \
        -o bracken/${sample}.species.txt \
        -r 150 \
        -l S \
        -t 10
done

# Combine samples into abundance matrix
combine_bracken_outputs.py \
    --files bracken/*.species.txt \
    -o bracken/combined_species.txt
```

### Step 2C: Alternative - MetaPhlAn Profiling

```bash
# Profile with MetaPhlAn 4
for sample in sample1 sample2 sample3; do
    metaphlan host_removed/${sample}_R1.fq.gz,host_removed/${sample}_R2.fq.gz \
        --bowtie2out metaphlan/${sample}.bowtie2.bz2 \
        --input_type fastq \
        --nproc 8 \
        -o metaphlan/${sample}_profile.txt
done

# Merge profiles
merge_metaphlan_tables.py metaphlan/*_profile.txt > metaphlan/merged_abundance.txt
```

### Step 3: Functional Profiling with HUMAnN

```bash
# Run HUMAnN
for sample in sample1 sample2 sample3; do
    # Concatenate paired reads
    cat host_removed/${sample}_R1.fq.gz host_removed/${sample}_R2.fq.gz > \
        host_removed/${sample}_concat.fq.gz

    humann --input host_removed/${sample}_concat.fq.gz \
        --output humann/${sample} \
        --threads 8 \
        --metaphlan-options "--bowtie2db metaphlan_db"
done

# Normalize and join tables
humann_renorm_table --input humann/sample1/sample1_pathabundance.tsv \
    --output humann/sample1/sample1_pathabundance_cpm.tsv \
    --units cpm

humann_join_tables --input humann \
    --output humann/merged_pathabundance.tsv \
    --file_name pathabundance
```

### Visualization

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load Bracken species table
species = pd.read_csv('bracken/combined_species.txt', sep='\t', index_col=0)

# Top 20 species heatmap
top20 = species.sum(axis=1).nlargest(20).index
plt.figure(figsize=(12, 8))
sns.heatmap(species.loc[top20], cmap='viridis', annot=False)
plt.title('Top 20 Species Abundance')
plt.tight_layout()
plt.savefig('top20_species_heatmap.pdf')

# Stacked bar plot
species_norm = species.div(species.sum()) * 100
top10 = species_norm.sum(axis=1).nlargest(10).index
other = species_norm.loc[~species_norm.index.isin(top10)].sum()

plot_data = species_norm.loc[top10].T
plot_data['Other'] = other
plot_data.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.ylabel('Relative Abundance (%)')
plt.legend(bbox_to_anchor=(1.05, 1))
plt.tight_layout()
plt.savefig('species_barplot.pdf')
```

## Parameter Recommendations

| Step | Parameter | Value |
|------|-----------|-------|
| fastp | --length_required | 50 (metagenomic reads) |
| Kraken2 | --confidence | 0.0 (default) or 0.1 |
| Bracken | -r | Read length (e.g., 150) |
| Bracken | -l | S (species) or G (genus) |
| Bracken | -t | 10 (min reads threshold) |
| MetaPhlAn | --min_cu_len | 2000 (default) |
| HUMAnN | --threads | 8+ |

## Troubleshooting

| Issue | Likely Cause | Solution |
|-------|--------------|----------|
| Low classification rate | Database mismatch, novel organisms | Try different database, check sample type |
| High unclassified | Novel microbes, host contamination | Remove host, use larger database |
| High host reads | Incomplete host removal | Use multiple host reference genomes |
| HUMAnN slow | Large files | Increase threads, pre-filter reads |

## Complete Pipeline Script

```bash
#!/bin/bash
set -e

THREADS=8
KRAKEN_DB="kraken2_standard_db"
HOST_INDEX="human_bt2_index"
SAMPLES="sample1 sample2 sample3"
OUTDIR="metagenomics_results"

mkdir -p ${OUTDIR}/{trimmed,host_removed,kraken,bracken,metaphlan,humann,qc}

# Step 1: QC
echo "=== QC ==="
for sample in $SAMPLES; do
    fastp -i ${sample}_R1.fastq.gz -I ${sample}_R2.fastq.gz \
        -o ${OUTDIR}/trimmed/${sample}_R1.fq.gz \
        -O ${OUTDIR}/trimmed/${sample}_R2.fq.gz \
        --length_required 50 \
        --html ${OUTDIR}/qc/${sample}_fastp.html -w ${THREADS}
done

# Host removal
echo "=== Host Removal ==="
for sample in $SAMPLES; do
    bowtie2 -p ${THREADS} -x ${HOST_INDEX} \
        -1 ${OUTDIR}/trimmed/${sample}_R1.fq.gz \
        -2 ${OUTDIR}/trimmed/${sample}_R2.fq.gz \
        --un-conc-gz ${OUTDIR}/host_removed/${sample}_R%.fq.gz \
        > /dev/null 2> ${OUTDIR}/qc/${sample}_host.log
done

# Step 2: Kraken2
echo "=== Kraken2 ==="
for sample in $SAMPLES; do
    kraken2 --db ${KRAKEN_DB} --threads ${THREADS} --paired \
        --report ${OUTDIR}/kraken/${sample}.report \
        --output ${OUTDIR}/kraken/${sample}.output \
        ${OUTDIR}/host_removed/${sample}_R1.fq.gz \
        ${OUTDIR}/host_removed/${sample}_R2.fq.gz
done

# Bracken
echo "=== Bracken ==="
for sample in $SAMPLES; do
    bracken -d ${KRAKEN_DB} \
        -i ${OUTDIR}/kraken/${sample}.report \
        -o ${OUTDIR}/bracken/${sample}.species.txt \
        -r 150 -l S -t 10
done

echo "=== Pipeline Complete ==="
echo "Kraken reports: ${OUTDIR}/kraken/"
echo "Bracken abundances: ${OUTDIR}/bracken/"
```

## Related Skills

- metagenomics/kraken-classification - Kraken2 details
- metagenomics/metaphlan-profiling - MetaPhlAn parameters
- metagenomics/abundance-estimation - Bracken options
- metagenomics/functional-profiling - HUMAnN workflow
- metagenomics/metagenome-visualization - Plotting functions

Overview

This skill implements an end-to-end metagenomics pipeline from raw FASTQ to taxonomic and functional profiles. It orchestrates quality control, host depletion, fast taxonomic classification (Kraken2 + Bracken), marker-based profiling (MetaPhlAn), and functional profiling with HUMAnN. The goal is reproducible, scalable profiling suitable for comparative studies and visualization.

How this skill works

The pipeline first runs fastp for read QC and Bowtie2 to remove host reads. It then classifies reads with Kraken2 and refines abundances with Bracken, with MetaPhlAn offered as a marker-based alternative. Finally, HUMAnN produces pathway and gene-family profiles; outputs are normalized and joined for downstream visualization and statistical analysis.

When to use it

  • Profiling microbial community composition from shotgun metagenomic FASTQ samples
  • Generating species-level abundance matrices for comparative analysis
  • Creating pathway and gene-family abundance tables for functional analysis
  • Removing host contamination prior to analysis (human, model organism)
  • Producing inputs for visualization and diversity statistics

Best practices

  • Verify tool versions (Kraken2, Bracken, HUMAnN, Bowtie2, fastp, MetaPhlAn) before running and adapt flags if APIs differ
  • Perform host removal with an appropriate reference genome(s) to reduce false classification
  • Choose Bracken read length (-r) matching your sequencer and set a sensible read threshold (-t) to filter low-support taxa
  • Run HUMAnN on concatenated host-depleted reads and normalize outputs (CPM) before merging samples
  • Keep separate directories for QC, host_removed, kraken, bracken, metaphlan, and humann outputs for reproducibility

Example use cases

  • Run the full pipeline on a cohort of environmental or clinical samples to generate taxonomic and pathway tables
  • Compare Kraken2+Bracken and MetaPhlAn results to assess database-dependent classification differences
  • Produce heatmaps and stacked barplots of top taxa for a manuscript figure using the combined Bracken matrix
  • Detect potential contamination by inspecting host removal logs and high host-read samples
  • Run HUMAnN pathway abundance across samples and join tables for differential pathway analysis

FAQ

What parameters are essential to set for Bracken?

Set the read length (-r) to your sequencing read length (e.g., 150) and choose taxonomic level (-l) such as S for species; use -t to exclude taxa with low read support.

Why run both Kraken2 and MetaPhlAn?

Kraken2+Bracken gives fast, database-dependent read-level classification and abundance estimation; MetaPhlAn uses conserved markers for standardized profiling. Running both helps validate results and capture different strengths.