bio-seq skill

This skill reads, writes, and manipulates biological sequence files, enabling format conversion, sequence operations, and indexed random access for large FASTA files.

npx playbooks add skill benchflow-ai/skillsbench --skill bio-seq

---
name: bio-fasta
description: "Read/write FASTA, GenBank, FASTQ files. Sequence manipulation (complement, translate). Indexed random access via faidx. For NGS pipelines (SAM/BAM/VCF), use pysam. For BLAST, use gget or blat-integration."
user_invocable: true
---

# Sequence I/O

Read, write, and manipulate biological sequence files (FASTA, GenBank, FASTQ).

## When to Use This Skill

This skill should be used when:

- Reading or writing sequence files (FASTA, GenBank, FASTQ)
- Converting between sequence file formats
- Manipulating sequences (complement, reverse complement, translate)
- Extracting sequences from large indexed FASTA files (faidx)
- Calculating sequence statistics (GC content, molecular weight, Tm)

## When NOT to Use This Skill

- **NGS alignment files (SAM/BAM/VCF)** → Use `pysam`
- **BLAST searches** → Use `gget` (quick) or `blat-integration` (large-scale)
- **Multiple sequence alignment** → Use `msa-advanced`
- **Phylogenetic analysis** → Use `etetoolkit`
- **NCBI database queries** → Use `pubmed-database` or `gene-database`

## Tool Selection Guide

| Task | Tool | Reference |
|------|------|-----------|
| Parse FASTA/GenBank/FASTQ | `Bio.SeqIO` | `biopython_seqio.md` |
| Convert file formats | `Bio.SeqIO.convert()` | `biopython_seqio.md` |
| Sequence operations | `Bio.Seq` | `biopython_seqio.md` |
| Large FASTA random access | `pysam.FastaFile` + faidx | `faidx.md` |
| GC%, Tm, molecular weight | `Bio.SeqUtils` | `utilities.md` |

## Quick Start

### Installation

```bash
uv pip install biopython pysam
```

### Read FASTA

```python
from Bio import SeqIO

for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(f"{record.id}: {len(record.seq)} bp")
```

### Convert GenBank to FASTA

```python
from Bio import SeqIO

SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta")
```
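
`SeqIO.convert()` also accepts open handles in place of file names, which is convenient for trying a conversion without touching disk. The following is a minimal sketch using a made-up two-read FASTQ string (the read names and sequences are illustrative only); it assumes Biopython is installed as shown under Installation.

```python
import io

from Bio import SeqIO

# Hypothetical in-memory FASTQ input (two reads, Phred+33 qualities).
fastq_text = "@read1\nACGT\n+\nIIII\n@read2\nGGCCAT\n+\nIIIIII\n"

in_handle = io.StringIO(fastq_text)
out_handle = io.StringIO()

# convert() returns the number of records written.
count = SeqIO.convert(in_handle, "fastq", out_handle, "fasta")

print(count)
print(out_handle.getvalue())
```

Quality scores are dropped in this direction because FASTA has no field for them.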

### Random Access with faidx

```python
import pysam

# Create index (once)
pysam.faidx("reference.fasta")

# Random access
fasta = pysam.FastaFile("reference.fasta")
seq = fasta.fetch("chr1", 1000, 2000)  # 0-based, half-open: 1000 bases (1-based positions 1001-2000)
fasta.close()
```

### Sequence Operations

```python
from Bio.Seq import Seq

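seq = Seq("ATGCGATCGATC")        # 12 nt, a whole number of codons (avoids a partial-codon warning)
print(seq.complement())          # TACGCTAGCTAG
print(seq.reverse_complement())  # GATCGATCGCAT
print(seq.translate())           # MRSI (standard genetic code)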
```

## Reference Documentation

Consult the appropriate reference file for detailed documentation:

### `references/biopython_seqio.md`

- `Bio.Seq` object and sequence operations
- `Bio.SeqIO` for file parsing and writing
- `SeqRecord` object and annotations
- Supported file formats
- Format conversion patterns

### `references/faidx.md`

- Creating FASTA index with `pysam.faidx()`
- `pysam.FastaFile` for random access
- Coordinate systems (0-based vs 1-based)
- Performance considerations for large files
- Common patterns (variant context, gene extraction)

### `references/utilities.md`

- GC content calculation (`gc_fraction`)
- Molecular weight (`molecular_weight`)
- Melting temperature (`MeltingTemp`)
- Codon usage analysis
- Restriction enzyme sites
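
To make the first and third items concrete, here is a stdlib-only sketch of a GC fraction and a Wallace-rule melting temperature estimate (2 °C per A/T plus 4 °C per G/C, a rule of thumb for short oligos). The function names `gc_fraction_sketch` and `wallace_tm` are hypothetical and are not Biopython APIs; `Bio.SeqUtils` covers cases, such as ambiguous bases and more detailed Tm models, that this sketch ignores.

```python
def gc_fraction_sketch(seq: str) -> float:
    """Return the fraction of G and C bases (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq: str) -> int:
    """Estimate Tm in degrees Celsius with the Wallace rule: 2*(A+T) + 4*(G+C)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


primer = "ATGCGATCGATC"  # illustrative 12-mer with 3 each of A, C, G, T
print(gc_fraction_sketch(primer))  # 0.5
print(wallace_tm(primer))          # 36
```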

### `references/formats.md`

- FASTA format specification
- GenBank format specification
- FASTQ format and quality scores
- Format detection and validation
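
As a rough idea of what format detection can look like, the sketch below guesses a format from the first non-blank line: `>` for FASTA, `@` for FASTQ, and `LOCUS` for GenBank. The helper name `sniff_format` is hypothetical, and the heuristic is deliberately simple; it does not validate the rest of the file.

```python
from typing import Optional


def sniff_format(text: str) -> Optional[str]:
    """Guess a SeqIO format name from the first non-blank line, or None."""
    for line in text.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        if line.startswith(">"):
            return "fasta"
        if line.startswith("@"):
            return "fastq"
        if line.startswith("LOCUS"):
            return "genbank"
        return None  # first real line matched nothing we know
    return None  # empty input


print(sniff_format(">seq1\nACGT\n"))          # fasta
print(sniff_format("@r1\nACGT\n+\nIIII\n"))   # fastq
print(sniff_format("not a sequence file"))    # None
```

The returned strings match the format names passed to `SeqIO.parse()`, so the result can be checked against the format string before parsing.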

## Coordinate Systems

**Biopython**: Uses Python-style 0-based, half-open intervals for slicing.

**pysam.FastaFile.fetch()**:
- Numeric arguments: 0-based, half-open (`fetch("chr1", 999, 2000)` returns 0-based positions 999-1999, i.e. 1-based positions 1000-2000)
- Region strings: 1-based, inclusive (`fetch(region="chr1:1000-2000")` returns the same 1001 bases)
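
The relationship between the two conventions can be sketched without pysam. The helper `region_to_interval` below is hypothetical; it converts a 1-based inclusive region string into the 0-based half-open interval that Python slicing (and numeric `fetch()` arguments) expect.

```python
import re
from typing import Tuple


def region_to_interval(region: str) -> Tuple[str, int, int]:
    """Convert 'contig:start-end' (1-based, inclusive) to (contig, start0, end0)."""
    match = re.fullmatch(r"(.+):(\d+)-(\d+)", region)
    if match is None:
        raise ValueError(f"unrecognized region: {region!r}")
    contig = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    # 1-based inclusive [start, end] -> 0-based half-open [start - 1, end)
    return contig, start - 1, end


contig, start0, end0 = region_to_interval("chr1:1000-2000")
print(contig, start0, end0)  # chr1 999 2000
print(end0 - start0)         # 1001

# The same rule applied to a toy sequence with plain slicing.
toy = "ACGTACGTAC"
_, s, e = region_to_interval("toy:3-6")
print(toy[s:e])              # GTAC (1-based bases 3 through 6)
```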

## Common Pitfalls

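1. **Coordinate confusion**: Check whether each call expects 0-based half-open or 1-based inclusive coordinates (see Coordinate Systems)
2. **Missing faidx index**: Random access requires a `.fai` file alongside the FASTA
3. **Format mismatch**: Verify that the file format matches the format string passed to `SeqIO.parse()`
4. **Iterator exhaustion**: `SeqIO.parse()` returns an iterator that can be consumed only once; for small files convert it to a list, and for large files call `SeqIO.parse()` again for each pass
5. **Large files**: Iterate over records directly instead of calling `list()`, which loads every record into memory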
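
Pitfalls 4 and 5 can be seen with any lazy parser. The sketch below uses a small stdlib-only generator, `read_fasta`, as a hypothetical stand-in for `SeqIO.parse()`; the exhaustion behavior it shows is that of Python iterators in general.

```python
import io


def read_fasta(handle):
    """Lazily yield (header, sequence) pairs; a stand-in for SeqIO.parse()."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


FASTA_TEXT = ">seq1\nACGT\n>seq2\nGGCC\nAT\n"

records = read_fasta(io.StringIO(FASTA_TEXT))
first_pass = sum(1 for _ in records)   # consumes the iterator
second_pass = sum(1 for _ in records)  # nothing left to yield
print(first_pass, second_pass)         # 2 0

# Small input: materialize once, then iterate as often as needed.
cached = list(read_fasta(io.StringIO(FASTA_TEXT)))
print(len(cached), len(cached))        # 2 2
```

For large inputs, reopening the source and parsing again for each pass keeps memory use flat, at the cost of re-reading the file.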

Overview

This skill provides sequence I/O and basic sequence manipulation for FASTA, GenBank, and FASTQ files. It supports format conversion, common sequence operations (complement, reverse complement, translate), and indexed random access to large FASTA files via faidx. It is intended for genomics pipelines that need memory-efficient access to sequences and conversion between formats.

How this skill works

The skill uses Biopython's SeqIO and Seq objects to parse, write, convert, and manipulate sequences. For large reference FASTA files it creates and uses a faidx index via pysam.FastaFile for fast, coordinate-based random access. Utility functions expose GC content, molecular weight, and melting temperature calculations from SeqUtils.

When to use it

  • Reading or writing sequence files (FASTA, GenBank, FASTQ)
  • Converting between sequence file formats or exporting SeqRecord annotations
  • Performing sequence ops: complement, reverse complement, translation
  • Extracting subsequences from large indexed FASTA files with faidx
  • Computing sequence statistics: GC%, molecular weight, melting temperature

Best practices

  • Create a .fai index (pysam.faidx) once and reuse pysam.FastaFile for repeated fetches
  • Prefer iterators (SeqIO.parse) for large files to avoid high memory use
  • Be explicit about coordinate systems—confirm 0-based vs 1-based depending on API
  • Validate input format strings to match file contents before parsing
  • Close file handles (pysam.FastaFile.close(), file objects) to avoid resource leaks

Example use cases

  • Convert a GenBank file to FASTA for downstream tools using SeqIO.convert
  • Fetch a gene region from a 3+ GB reference FASTA with pysam.FastaFile.fetch for variant context
  • Compute GC% and Tm for primers during primer design workflows
  • Batch-translate coding sequences to protein sequences for annotation pipelines
  • Stream-process large FASTQ files to compute read length distributions without loading entire file
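
The last use case can be sketched with the standard library alone. The example below assumes simple four-line FASTQ records (no wrapped sequence lines); the helper name `read_lengths` and the sample reads are hypothetical. It keeps only a counter in memory, never the reads themselves.

```python
import io
from collections import Counter
from itertools import islice


def read_lengths(handle) -> Counter:
    """Count read lengths from four-line FASTQ records, one record at a time."""
    lengths = Counter()
    while True:
        record = list(islice(handle, 4))  # header, sequence, '+', qualities
        if not record:
            break
        if len(record) < 4 or not record[0].startswith("@"):
            raise ValueError("malformed FASTQ record")
        lengths[len(record[1].strip())] += 1
    return lengths


fastq_text = (
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGTAC\n+\nIIIIII\n"
    "@r3\nTTGA\n+\nIIII\n"
)
dist = read_lengths(io.StringIO(fastq_text))
print(sorted(dist.items()))  # [(4, 2), (6, 1)]
```

With a real file, passing an open file object in place of the `io.StringIO` wrapper gives the same streaming behavior.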

FAQ

Do I need an index to fetch subsequences from FASTA?

Yes. Create a .fai index with pysam.faidx("reference.fasta") before using pysam.FastaFile.fetch for random access.

Which tool handles SAM/BAM/VCF?

Use pysam for alignment and variant file formats; this skill focuses on FASTA/GenBank/FASTQ and sequence ops.

What coordinate system does fetch() use?

pysam.FastaFile.fetch() numeric start/end are 0-based half-open; region strings like "chr1:1000-2000" are 1-based inclusive.