home / skills / gptomics / bioskills / consensus-sequences

consensus-sequences skill

/variant-calling/consensus-sequences

This skill helps generate sample-specific consensus FASTA sequences by applying VCF variants to a reference using bcftools consensus, with haplotype and

npx playbooks add skill gptomics/bioskills --skill consensus-sequences

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
8.4 KB
---
name: bio-consensus-sequences
description: Generate consensus FASTA sequences by applying VCF variants to a reference using bcftools consensus. Use when creating sample-specific reference sequences or reconstructing haplotypes.
tool_type: cli
primary_tool: bcftools
---

## Version Compatibility

Reference examples tested with: BioPython 1.83+, bcftools 1.19+, bedtools 2.31+, minimap2 2.26+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Consensus Sequences

**"Generate a consensus sequence from my VCF"** → Apply called variants to a reference FASTA, producing a sample-specific genome with optional haplotype selection and low-coverage masking.
- CLI: `bcftools consensus -f reference.fa input.vcf.gz`
- Python: `cyvcf2` + `Bio.SeqIO` for simple SNP-only cases

## Basic Usage

### Generate Consensus

```bash
bcftools consensus -f reference.fa input.vcf.gz > consensus.fa
```

### Specify Sample

```bash
bcftools consensus -f reference.fa -s sample1 input.vcf.gz > sample1.fa
```

### Output to File

```bash
bcftools consensus -f reference.fa -o consensus.fa input.vcf.gz
```

## Haplotype Selection

### First Haplotype Only

```bash
bcftools consensus -f reference.fa -H 1 input.vcf.gz > haplotype1.fa
```

### Second Haplotype Only

```bash
bcftools consensus -f reference.fa -H 2 input.vcf.gz > haplotype2.fa
```

### Haplotype Options

| Option | Description |
|--------|-------------|
| `-H 1` | First haplotype |
| `-H 2` | Second haplotype |
| `-H A` | Apply all ALT alleles |
| `-H R` | Apply REF alleles where heterozygous |
| `-I` | Apply IUPAC ambiguity codes (separate flag) |

## IUPAC Codes for Heterozygous Sites

```bash
bcftools consensus -f reference.fa -I input.vcf.gz > consensus_iupac.fa
```

Heterozygous sites encoded with IUPAC ambiguity codes:
- A/G → R
- C/T → Y
- A/C → M
- G/T → K
- A/T → W
- C/G → S

## Missing Data Handling

### Mark Missing as N

```bash
bcftools consensus -f reference.fa -M N input.vcf.gz > consensus.fa
```

### Mark Low Coverage as N

Using a mask BED file:

```bash
# Create mask from depth
samtools depth input.bam | awk '$3<10 {print $1"\t"$2-1"\t"$2}' > low_coverage.bed

# Apply mask
bcftools consensus -f reference.fa -m low_coverage.bed input.vcf.gz > consensus.fa
```

### Mask Options

| Option | Description |
|--------|-------------|
| `-m FILE` | Mask regions in BED file with N |
| `-M CHAR` | Character for masked regions (default N) |

## Region Selection

### Specific Region

```bash
bcftools consensus -f reference.fa -r chr1:1000-2000 input.vcf.gz > region.fa
```

### Multiple Regions

Use with BED file to extract multiple regions.

## Chain Files

### Generate Chain File

```bash
bcftools consensus -f reference.fa -c chain.txt input.vcf.gz > consensus.fa
```

Chain files map coordinates between reference and consensus:
- Useful for liftover of annotations
- Required when indels change sequence length

### Chain File Format

```
chain score ref_name ref_size ref_strand ref_start ref_end query_name query_size query_strand query_start query_end id
```

## Sample-Specific Consensus

### For Each Sample

```bash
for sample in $(bcftools query -l input.vcf.gz); do
    bcftools consensus -f reference.fa -s "$sample" input.vcf.gz > "${sample}.fa"
done
```

### Both Haplotypes

```bash
sample="sample1"
bcftools consensus -f reference.fa -s "$sample" -H 1 input.vcf.gz > "${sample}_hap1.fa"
bcftools consensus -f reference.fa -s "$sample" -H 2 input.vcf.gz > "${sample}_hap2.fa"
```

## Filtering Before Consensus

### PASS Variants Only

```bash
bcftools view -f PASS input.vcf.gz | \
    bcftools consensus -f reference.fa > consensus.fa
```

### High-Quality Variants Only

```bash
bcftools filter -i 'QUAL>=30 && INFO/DP>=10' input.vcf.gz | \
    bcftools consensus -f reference.fa > consensus.fa
```

### SNPs Only

```bash
bcftools view -v snps input.vcf.gz | \
    bcftools consensus -f reference.fa > consensus_snps.fa
```

## Sequence Naming

### Default Naming

Output uses reference sequence names.

### Custom Prefix

```bash
bcftools consensus -f reference.fa -p "sample1_" input.vcf.gz > consensus.fa
```

Sequences named: `sample1_chr1`, `sample1_chr2`, etc.

## Common Workflows

**Goal:** Generate consensus sequences for downstream analyses like phylogenetics, viral surveillance, or gene-level comparison.

**Approach:** Filter variants to high-quality calls, apply per-sample consensus generation, mask low-coverage regions with N, then combine for multi-sample workflows.

### Phylogenetic Analysis Preparation

```bash
# For each sample, generate consensus
mkdir -p consensus
for sample in $(bcftools query -l cohort.vcf.gz); do
    bcftools view -s "$sample" cohort.vcf.gz | \
        bcftools view -c 1 | \
        bcftools consensus -f reference.fa > "consensus/${sample}.fa"
done

# Combine for alignment
cat consensus/*.fa > all_samples.fa
```

### Viral Genome Assembly

```bash
# Apply high-quality variants only
bcftools filter -i 'QUAL>=30 && INFO/DP>=20' variants.vcf.gz | \
    bcftools view -f PASS | \
    bcftools consensus -f reference.fa -M N > consensus.fa
```

### Gene-Specific Consensus

```bash
# Extract gene region
bcftools consensus -f reference.fa -r chr1:1000000-1010000 \
    -s sample1 variants.vcf.gz > gene.fa
```

### Masked Low-Coverage Regions

```bash
# Create mask from coverage
samtools depth -a input.bam | \
    awk '$3<5 {print $1"\t"$2-1"\t"$2}' | \
    bedtools merge > low_coverage.bed

# Generate consensus with mask
bcftools consensus -f reference.fa -m low_coverage.bed \
    variants.vcf.gz > consensus.fa
```

## Verify Consensus

### Check Differences

```bash
# Align consensus to reference
minimap2 -a reference.fa consensus.fa | samtools view -bS > alignment.bam

# Or simple comparison
diff <(grep -v "^>" reference.fa) <(grep -v "^>" consensus.fa) | head
```

### Count Changes

```bash
# Number of differences
bcftools view -H input.vcf.gz | wc -l
```

## Handling Overlapping Variants

bcftools consensus handles overlapping variants automatically:
- Applies variants in order
- Warns about conflicts

Check for warnings:
```bash
bcftools consensus -f reference.fa input.vcf.gz 2>&1 | grep -i warn
```

## cyvcf2 Consensus (Simple Cases)

### Manual Consensus Generation

```python
from cyvcf2 import VCF
from Bio import SeqIO

# Load reference
ref_dict = {rec.id: str(rec.seq) for rec in SeqIO.parse('reference.fa', 'fasta')}

# Apply variants (SNPs only, simplified)
vcf = VCF('input.vcf.gz')
changes = {}

for variant in vcf:
    if variant.is_snp and len(variant.ALT) == 1:
        chrom = variant.CHROM
        pos = variant.POS - 1  # 0-based
        if chrom not in changes:
            changes[chrom] = {}
        changes[chrom][pos] = variant.ALT[0]

# Apply changes
for chrom, positions in changes.items():
    seq = list(ref_dict[chrom])
    for pos, alt in positions.items():
        seq[pos] = alt
    ref_dict[chrom] = ''.join(seq)

# Write output
with open('consensus.fa', 'w') as f:
    for chrom, seq in ref_dict.items():
        f.write(f'>{chrom}\n{seq}\n')
```

Note: Use `bcftools consensus` for production - handles indels and edge cases properly.

## Quick Reference

| Task | Command |
|------|---------|
| Basic consensus | `bcftools consensus -f ref.fa in.vcf.gz` |
| Specific sample | `bcftools consensus -f ref.fa -s sample in.vcf.gz` |
| Haplotype 1 | `bcftools consensus -f ref.fa -H 1 in.vcf.gz` |
| IUPAC codes | `bcftools consensus -f ref.fa -I in.vcf.gz` |
| With mask | `bcftools consensus -f ref.fa -m mask.bed in.vcf.gz` |
| Generate chain | `bcftools consensus -f ref.fa -c chain.txt in.vcf.gz` |
| Specific region | `bcftools consensus -f ref.fa -r chr1:1-1000 in.vcf.gz` |

## Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| `not indexed` | VCF not indexed | Run `bcftools index` |
| `sequence not found` | Chromosome mismatch | Check chromosome names |
| `overlapping records` | Variants overlap | Usually OK, check warnings |
| `REF does not match` | Wrong reference | Use same reference as caller |

## Related Skills

- variant-calling - Generate VCF for consensus
- filtering-best-practices - Filter variants before consensus
- variant-normalization - Normalize indels first
- alignment-files/reference-operations - Reference manipulation

Overview

This skill generates consensus FASTA sequences by applying VCF variants to a reference using bcftools consensus and lightweight Python examples. It produces sample-specific references, haplotype reconstructions, and masked sequences for low-coverage regions. The skill documents command-line flags, masking, haplotype options, and minimal cyvcf2+BioPython patterns for simple SNP-only cases.

How this skill works

It applies variant records from a VCF/BCF to a reference FASTA and emits modified sequences per contig. Options let you pick samples (-s), haplotypes (-H), IUPAC encoding for heterozygous sites (-I), region selection (-r), and masks (-m/-M) to mark low-confidence bases as N. For production use it recommends bcftools consensus for correct indel handling; a small cyvcf2 example shows how to build a SNP-only consensus in Python.

When to use it

  • Create sample-specific reference genomes for downstream analyses (phylogenetics, annotation).
  • Reconstruct phased haplotypes or output each haplotype separately for diploid samples.
  • Mask low-coverage or unreliable regions before comparative analyses.
  • Produce region- or gene-specific consensus sequences for targeted studies.
  • Generate consensus for viral surveillance and outbreak tracking.

Best practices

  • Verify tool and library versions (bcftools, samtools, cyvcf2, BioPython) before running examples.
  • Filter VCF to high-quality/PASS calls (e.g., QUAL and DP thresholds) prior to consensus generation.
  • Index VCF/BCF and ensure chromosome names match the reference to avoid ‘sequence not found’ errors.
  • Mask low-coverage regions with BED generated from samtools depth and bedtools merge to avoid erroneous bases downstream.
  • Use bcftools consensus for production to correctly handle indels and overlapping variants; use cyvcf2 only for simple SNP-only scripts.

Example use cases

  • Generate one FASTA per sample from a cohort VCF for a phylogenetic pipeline.
  • Create two FASTAs per sample representing haplotype1 and haplotype2 for population analyses.
  • Produce a masked consensus for a viral genome, marking low-coverage areas as N for safer downstream alignment.
  • Extract a gene region consensus using -r chr:start-end for gene-level comparisons or primer design.
  • Apply a chain file when indels alter coordinates to preserve liftover compatibility of annotations.

FAQ

What handles indels and complex cases reliably, bcftools or cyvcf2?

Use bcftools consensus for indels and overlapping variants; cyvcf2 examples are limited to simple SNP-only transformations.

How do I mask low-coverage regions?

Generate a BED from samtools depth (positions with depth below threshold), merge with bedtools, then pass it to bcftools consensus with -m and optionally set the mask character with -M.