home / skills / gptomics / bioskills / fastq-quality

fastq-quality skill

/sequence-io/fastq-quality

This skill helps you analyze and filter FASTQ reads by quality using Biopython, enabling trimming, reporting, and quality-aware filtering.

npx playbooks add skill gptomics/bioskills --skill fastq-quality

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
8.5 KB
---
name: bio-fastq-quality
description: Work with FASTQ quality scores using Biopython. Use when analyzing read quality, filtering by quality, trimming low-quality bases, or generating quality reports.
tool_type: python
primary_tool: Bio.SeqIO
---

## Version Compatibility

Reference examples tested with: BioPython 1.83+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# FASTQ Quality Scores

**"Filter my FASTQ reads by quality score"** → Access, analyze, and filter Phred quality scores, trim low-quality bases, and generate per-position quality profiles.
- Python: `SeqIO.parse()` with `letter_annotations['phred_quality']` (BioPython)

Analyze and manipulate FASTQ quality scores using Biopython.

## Required Imports

```python
from Bio import SeqIO
from Bio.Seq import Seq
```

## Accessing Quality Scores

Quality scores are stored in `letter_annotations['phred_quality']` as a list of integers.

```python
for record in SeqIO.parse('reads.fastq', 'fastq'):
    qualities = record.letter_annotations['phred_quality']
    print(record.id, qualities[:10])  # First 10 quality scores
```

## Quality Score Basics

| Phred Score | Error Probability | Accuracy |
|-------------|-------------------|----------|
| 10 | 1 in 10 | 90% |
| 20 | 1 in 100 | 99% |
| 30 | 1 in 1000 | 99.9% |
| 40 | 1 in 10000 | 99.99% |

## Code Patterns

### Calculate Average Quality per Read
```python
for record in SeqIO.parse('reads.fastq', 'fastq'):
    quals = record.letter_annotations['phred_quality']
    avg_qual = sum(quals) / len(quals)
    print(f'{record.id}: {avg_qual:.1f}')
```

### Filter Reads by Mean Quality
```python
def high_quality_reads(records, min_avg_qual=20):
    for record in records:
        quals = record.letter_annotations['phred_quality']
        if sum(quals) / len(quals) >= min_avg_qual:
            yield record

records = SeqIO.parse('reads.fastq', 'fastq')
good_reads = high_quality_reads(records, min_avg_qual=25)
SeqIO.write(good_reads, 'filtered.fastq', 'fastq')
```

### Filter by Minimum Quality at Any Position
```python
def all_bases_above(records, min_qual=20):
    for record in records:
        if min(record.letter_annotations['phred_quality']) >= min_qual:
            yield record
```

### Trim Low-Quality Ends (3' Trimming)
```python
def trim_low_quality(record, min_qual=20):
    quals = record.letter_annotations['phred_quality']
    trim_pos = len(quals)
    for i in range(len(quals) - 1, -1, -1):
        if quals[i] >= min_qual:
            trim_pos = i + 1
            break
    return record[:trim_pos]

records = SeqIO.parse('reads.fastq', 'fastq')
trimmed = (trim_low_quality(rec) for rec in records)
SeqIO.write(trimmed, 'trimmed.fastq', 'fastq')
```

### Sliding Window Quality Trim

**Goal:** Trim a read at the first position where average quality in a sliding window drops below threshold.

**Approach:** Slide a fixed-size window across quality scores; when the window average falls below the cutoff, truncate the record at that position.

**Reference (BioPython 1.83+):**
```python
def sliding_window_trim(record, window_size=5, min_avg_qual=20):
    quals = record.letter_annotations['phred_quality']
    for i in range(len(quals) - window_size + 1):
        window = quals[i:i + window_size]
        if sum(window) / window_size < min_avg_qual:
            return record[:i] if i > 0 else None
    return record
```

### Quality Statistics Summary
```python
import statistics

all_quals = []
for record in SeqIO.parse('reads.fastq', 'fastq'):
    all_quals.extend(record.letter_annotations['phred_quality'])

print(f'Mean quality: {statistics.mean(all_quals):.1f}')
print(f'Median quality: {statistics.median(all_quals):.1f}')
print(f'Min quality: {min(all_quals)}')
print(f'Max quality: {max(all_quals)}')
```

### Per-Position Quality Profile

**Goal:** Compute mean quality at each read position to identify systematic quality drops (e.g., read-end degradation).

**Approach:** Accumulate quality scores by position across all reads, then compute per-position means.

**Reference (BioPython 1.83+):**
```python
from collections import defaultdict

position_quals = defaultdict(list)
for record in SeqIO.parse('reads.fastq', 'fastq'):
    for i, q in enumerate(record.letter_annotations['phred_quality']):
        position_quals[i].append(q)

for pos in sorted(position_quals.keys())[:20]:
    quals = position_quals[pos]
    print(f'Position {pos}: mean={sum(quals)/len(quals):.1f}')
```

### Count Reads by Quality Threshold
```python
thresholds = [20, 25, 30, 35]
counts = {t: 0 for t in thresholds}

for record in SeqIO.parse('reads.fastq', 'fastq'):
    avg = sum(record.letter_annotations['phred_quality']) / len(record.seq)
    for t in thresholds:
        if avg >= t:
            counts[t] += 1

for t, c in counts.items():
    print(f'Q>={t}: {c} reads')
```

### Remove N Bases and Low Quality Together
```python
def clean_read(record, min_qual=20):
    seq = str(record.seq)
    quals = record.letter_annotations['phred_quality']
    keep = [(s, q) for s, q in zip(seq, quals) if s != 'N' and q >= min_qual]
    if not keep:
        return None
    new_seq, new_quals = zip(*keep)
    new_record = record[:0]  # Empty copy with same metadata
    new_record.seq = Seq(''.join(new_seq))
    new_record.letter_annotations['phred_quality'] = list(new_quals)
    return new_record
```

## FASTQ Format Variants

| Variant | Format String | Quality Encoding | ASCII Range |
|---------|---------------|------------------|-------------|
| Sanger/Illumina 1.8+ | `'fastq'` | Phred+33 (standard) | 33-126 |
| Solexa | `'fastq-solexa'` | Solexa+64 | 59-126 |
| Illumina 1.3-1.7 | `'fastq-illumina'` | Phred+64 | 64-126 |

Most modern data uses standard `'fastq'` (Sanger/Illumina 1.8+).

## Quality Score Conversion

For legacy data using different quality encodings:

```python
from Bio.SeqIO.QualityIO import phred_quality_from_solexa, solexa_quality_from_phred
```

### Convert Solexa to Phred

```python
from Bio.SeqIO.QualityIO import phred_quality_from_solexa

# Convert single score
solexa_score = 10
phred_score = phred_quality_from_solexa(solexa_score)

# Convert list of scores
solexa_scores = [10, 20, 30, 40]
phred_scores = [phred_quality_from_solexa(s) for s in solexa_scores]
```

### Convert Phred to Solexa

```python
from Bio.SeqIO.QualityIO import solexa_quality_from_phred

phred_score = 30
solexa_score = solexa_quality_from_phred(phred_score)
```

### Convert Between FASTQ Variants

```python
from Bio import SeqIO

# Read old Illumina format, write standard format
records = SeqIO.parse('old_reads.fastq', 'fastq-illumina')
SeqIO.write(records, 'standard_reads.fastq', 'fastq')

# Read Solexa format, write standard format
records = SeqIO.parse('solexa_reads.fastq', 'fastq-solexa')
SeqIO.write(records, 'standard_reads.fastq', 'fastq')
```

### Auto-Detect Quality Encoding

**Goal:** Determine which FASTQ quality encoding (Sanger, Solexa, or Illumina 1.3+) a file uses.

**Approach:** Sample quality lines, find the minimum ASCII value, and compare against known offset ranges.

**Reference (BioPython 1.83+):**
```python
def detect_quality_encoding(filepath, sample_size=1000):
    min_qual = 126
    max_qual = 0
    count = 0

    with open(filepath) as f:
        for i, line in enumerate(f):
            if i % 4 == 3:  # Quality line
                for char in line.strip():
                    min_qual = min(min_qual, ord(char))
                    max_qual = max(max_qual, ord(char))
                count += 1
                if count >= sample_size:
                    break

    if min_qual < 59:
        return 'fastq'  # Sanger/Illumina 1.8+ (Phred+33)
    elif min_qual < 64:
        return 'fastq-solexa'  # Solexa+64
    else:
        return 'fastq-illumina'  # Illumina 1.3+ (Phred+64)
```

## Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| `KeyError: 'phred_quality'` | Not FASTQ or wrong variant | Check format, try 'fastq-illumina' |
| Quality scores all 0 | Wrong encoding assumed | Try different FASTQ variant |
| Trimmed reads empty | Too aggressive trimming | Lower quality threshold |

## Related Skills

- read-sequences - Parse FASTQ files
- filter-sequences - Filter reads by other criteria (length, content)
- paired-end-fastq - Handle R1/R2 paired quality filtering
- sequence-statistics - Generate summary statistics including quality
- alignment-files - After filtering, align reads with bwa/bowtie2; quality scores in BAM

Overview

This skill provides concise Biopython patterns for inspecting and manipulating FASTQ quality scores. It shows how to read Phred scores, compute per-read and per-position statistics, trim low-quality bases, and convert legacy encodings. Use the examples as ready-to-run snippets you can adapt to your pipeline.

How this skill works

The code uses Bio.SeqIO to parse FASTQ records and accesses quality values via record.letter_annotations['phred_quality']. It demonstrates common operations: average and per-position summaries, filters by mean or minimum score, end- and sliding-window trimming, and encoding detection/conversion. Examples assume modern Biopython (tested with 1.83+) but include guidance to inspect your installed API if signatures differ.

When to use it

  • Evaluate overall read quality before alignment or assembly
  • Filter reads with low mean quality or any low-quality base
  • Trim low-quality tails or apply sliding-window trimming
  • Generate per-position quality profiles to detect systematic drops
  • Convert FASTQ files between legacy encodings (Solexa/Illumina)

Best practices

  • Verify Biopython and Python versions and adapt calls if APIs differ
  • Sample quality encodings before assuming Phred+33 to avoid mis-decoding scores
  • Use conservative thresholds first; check how many reads are discarded before tightening cutoffs
  • Combine base-level filters (remove Ns, minimum base quality) with read-level filters for cleaner output
  • Write filtered/trimmed records to new files to preserve originals for audit

Example use cases

  • Compute mean/median quality across a run to decide sequencing batch quality
  • Filter reads with mean quality >= 25 and write to filtered.fastq for downstream alignment
  • Trim 3' low-quality bases or apply sliding-window trim to maximize usable read length
  • Produce per-position mean quality to visualize read-end degradation and inform trimming parameters
  • Detect FASTQ encoding and convert old Illumina/Solexa files to standard Phred+33 format

FAQ

What if record.letter_annotations['phred_quality'] raises KeyError?

The file may not be parsed with the correct format. Try alternate format strings (e.g., 'fastq-illumina' or 'fastq-solexa') or inspect the raw file to confirm it is FASTQ.

How do I choose trimming thresholds?

Start with common cutoffs (min base Q20, mean Q25), run on a sample, and inspect how many reads are trimmed or dropped. Adjust to balance data retention vs. error rate.