home / skills / gptomics / bioskills / structure-navigation

structure-navigation skill

safe

/structural-biology/structure-navigation

This skill helps you navigate Bio.PDB SMCRA hierarchy in Python to access structure elements, extract sequences, and inspect atomic coordinates.

npx playbooks add skill gptomics/bioskills --skill structure-navigation

Review the files below or copy the command above to add this skill to your agents.

Files (5)

SKILL.md

8.8 KB

---
name: bio-pdb-structure-navigation
description: Navigate protein structure hierarchy using Biopython Bio.PDB SMCRA model. Use when accessing models, chains, residues, and atoms, iterating over structure levels, or extracting sequences from PDB files.
tool_type: python
primary_tool: Bio.PDB
---

## Version Compatibility

Reference examples tested with: BioPython 1.83+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Structure Navigation

**"Access residues and atoms in a PDB structure"** → Navigate the Structure-Model-Chain-Residue-Atom hierarchy to iterate over components, extract sequences, and access atomic coordinates.
- Python: `structure[0]['A'][100]['CA'].get_vector()` for direct access

Navigate the Structure-Model-Chain-Residue-Atom (SMCRA) hierarchy to access and iterate over structure components.

## Required Imports

```python
from Bio.PDB import PDBParser, PPBuilder, Selection
from Bio.Data.PDBData import protein_letters_3to1
```

## SMCRA Hierarchy

```
Structure
    |
    +-- Model (0, 1, ...)      # NMR ensembles, crystal asymmetric unit
          |
          +-- Chain (A, B, ...) # Polypeptide chains, ligands
                |
                +-- Residue     # Amino acids, nucleotides, hetero groups
                      |
                      +-- Atom  # Individual atoms
```

## Accessing Hierarchy Levels

```python
from Bio.PDB import PDBParser

parser = PDBParser(QUIET=True)
structure = parser.get_structure('protein', 'protein.pdb')

# Access by index/ID
model = structure[0]          # First model
chain = model['A']            # Chain A
residue = chain[100]          # Residue 100 (simple numbering)
residue = chain[(' ', 100, ' ')]  # Full residue ID (hetfield, resseq, icode)
atom = residue['CA']          # C-alpha atom
```

## Iterating Over Structure

```python
# Iterate all levels
for model in structure:
    for chain in model:
        for residue in chain:
            for atom in residue:
                print(f'{chain.id}:{residue.id[1]}:{atom.name}')

# Shortcut iterators (all levels below current)
for chain in structure.get_chains():
    print(f'Chain: {chain.id}')

for residue in structure.get_residues():
    print(f'Residue: {residue.resname}')

for atom in structure.get_atoms():
    print(f'Atom: {atom.name} at {atom.coord}')
```

## Residue Identification

```python
# Residue ID is a tuple: (hetfield, resseq, icode)
for residue in chain:
    hetfield, resseq, icode = residue.id
    print(f'Residue {resseq}{icode}: {residue.resname}')

# hetfield values:
#   ' '  - standard amino acid
#   'W'  - water
#   'H_xxx' - hetero residue (ligand, modified residue)

# Filter standard residues only
standard_residues = [r for r in chain if r.id[0] == ' ']

# Filter water
waters = [r for r in chain if r.id[0] == 'W']

# Filter hetero atoms (ligands)
hetero = [r for r in chain if r.id[0].startswith('H_')]
```

## Atom Properties

```python
for atom in residue:
    print(f'Name: {atom.name}')
    print(f'Element: {atom.element}')
    print(f'Coordinates: {atom.coord}')
    print(f'B-factor: {atom.bfactor}')
    print(f'Occupancy: {atom.occupancy}')
    print(f'Full ID: {atom.full_id}')
    print(f'Serial number: {atom.serial_number}')
```

## Getting Full Identifiers

```python
# Full hierarchical ID from any entity
atom = structure[0]['A'][100]['CA']
print(atom.get_full_id())
# ('protein', 0, 'A', (' ', 100, ' '), ('CA', ' '))

# Components: (structure_id, model_id, chain_id, residue_id, atom_id)
```

## Checking for Children

```python
# Check if entity has child
if chain.has_id(100):
    residue = chain[100]

# Check if residue has atom
if residue.has_id('CA'):
    ca = residue['CA']

# Get list of all children
chains = structure[0].get_list()
residues = chain.get_list()
atoms = residue.get_list()
```

## Getting Parent Entity

```python
# Navigate up hierarchy
atom = structure[0]['A'][100]['CA']

residue = atom.get_parent()
chain = residue.get_parent()
model = chain.get_parent()
structure = model.get_parent()
```

## Extracting Polypeptide Sequences

```python
from Bio.PDB import PDBParser, PPBuilder

parser = PDBParser(QUIET=True)
structure = parser.get_structure('protein', 'protein.pdb')

ppb = PPBuilder()
for pp in ppb.build_peptides(structure):
    seq = pp.get_sequence()
    print(f'Polypeptide: {seq}')
    print(f'Length: {len(seq)}')

# Get all sequences as list
sequences = [pp.get_sequence() for pp in ppb.build_peptides(structure)]
```

## Using CaPPBuilder for Broken Chains

```python
from Bio.PDB import CaPPBuilder

# Use when backbone is incomplete
# Connects residues if CA atoms are within 4.3 Angstroms
ppb = CaPPBuilder()
for pp in ppb.build_peptides(structure):
    print(f'Fragment: {pp.get_sequence()}')
```

## Converting Residue Names

```python
from Bio.Data.PDBData import protein_letters_3to1

# Three-letter to one-letter conversion
three_letter = 'ALA'
one_letter = protein_letters_3to1.get(three_letter, 'X')
print(f'{three_letter} -> {one_letter}')  # ALA -> A

# Build sequence manually
sequence = ''
for residue in chain:
    if residue.id[0] == ' ':  # Standard residue
        code = protein_letters_3to1.get(residue.resname, 'X')
        sequence += code
print(f'Sequence: {sequence}')
```

## Using Selection.unfold_entities

```python
from Bio.PDB import Selection

# Extract entities at specific level
# Codes: S=structure, M=model, C=chain, R=residue, A=atom

# Get all residues from structure
residues = Selection.unfold_entities(structure, 'R')
print(f'Total residues: {len(residues)}')

# Get all atoms from a chain
atoms = Selection.unfold_entities(chain, 'A')
print(f'Atoms in chain: {len(atoms)}')

# Get all chains from model
chains = Selection.unfold_entities(model, 'C')
```

## Handling Disordered Atoms

```python
# Check for disorder
if atom.is_disordered():
    print(f'Atom {atom.name} has multiple conformations')
    print(f'Alt locations: {atom.disordered_get_id_list()}')

    # Select specific conformation
    atom.disordered_select('A')
    print(f'Coord for alt A: {atom.coord}')

    # Get all conformations
    for altloc in atom.disordered_get_id_list():
        atom.disordered_select(altloc)
        print(f'  {altloc}: {atom.coord}')

# Get unpacked list (all conformations)
all_atoms = atom.disordered_get_list()
```

## Handling Disordered Residues

```python
# Point mutations at same position
if residue.is_disordered():
    print(f'Disordered residue at {residue.id}')
    names = residue.disordered_get_id_list()
    print(f'Alternative residues: {names}')

    # Select specific residue type
    residue.disordered_select('ALA')
```

## Finding Specific Atoms

```python
# Get backbone atoms
backbone_names = ['N', 'CA', 'C', 'O']
for residue in chain:
    backbone = [residue[name] for name in backbone_names if residue.has_id(name)]

# Get all C-alpha atoms
ca_atoms = [r['CA'] for r in structure.get_residues() if r.has_id('CA')]
print(f'Found {len(ca_atoms)} CA atoms')

# Get sidechain atoms
for residue in chain:
    sidechain = [a for a in residue if a.name not in ['N', 'CA', 'C', 'O']]
```

## Filtering by Residue Type

```python
# Get only amino acids
amino_acids = [r for r in chain if r.id[0] == ' ']

# Get specific amino acid types
arginines = [r for r in chain if r.resname == 'ARG']
charged = [r for r in chain if r.resname in ['ARG', 'LYS', 'ASP', 'GLU']]

# Get hetero atoms
ligands = [r for r in chain if r.id[0].startswith('H_')]
for lig in ligands:
    print(f'Ligand: {lig.resname} at position {lig.id[1]}')
```

## Counting Entities

```python
# Count at each level
n_models = len(list(structure.get_models()))
n_chains = len(list(structure.get_chains()))
n_residues = len(list(structure.get_residues()))
n_atoms = len(list(structure.get_atoms()))

print(f'Models: {n_models}, Chains: {n_chains}')
print(f'Residues: {n_residues}, Atoms: {n_atoms}')

# Count per chain
for chain in structure.get_chains():
    n_res = len([r for r in chain if r.id[0] == ' '])
    print(f'Chain {chain.id}: {n_res} amino acids')
```

## Working with NMR Ensembles

```python
# NMR structures have multiple models
parser = PDBParser(QUIET=True)
structure = parser.get_structure('nmr', 'nmr_structure.pdb')

n_models = len(list(structure.get_models()))
print(f'NMR ensemble with {n_models} conformers')

# Iterate over models
for model in structure:
    # Each model is a separate conformation
    ca_coords = [r['CA'].coord for r in model.get_residues() if r.has_id('CA')]
    print(f'Model {model.id}: {len(ca_coords)} CA atoms')
```

## Related Skills

- structure-io - Parse and write structure files
- geometric-analysis - Measure distances, angles, RMSD
- structure-modification - Modify coordinates and properties
- sequence-manipulation/seq-objects - Work with extracted sequences

Overview

This skill provides concise patterns and helpers to navigate PDB structure hierarchy using Biopython's Bio.PDB SMCRA model. It shows practical ways to access models, chains, residues, and atoms, extract sequences, and handle common PDB issues like disorder or broken backbones. Examples assume Biopython >= 1.83 but include guidance to adapt to other versions.

How this skill works

The skill inspects the Structure→Model→Chain→Residue→Atom (SMCRA) hierarchy and exposes code patterns to index, iterate, filter, and query each level. It demonstrates direct access (e.g., structure[0]['A'][100]['CA']), bulk iterators (get_chains(), get_residues(), get_atoms()), Selection.unfold_entities shortcuts, and polypeptide extraction via PPBuilder/CaPPBuilder. It also covers identifying residue types, extracting coordinates, and navigating disordered entities.

When to use it

Access a specific atom, residue, chain, or model by index/ID.
Iterate entire structures to collect coordinates, atoms, or residue-level data.
Extract polypeptide sequences from PDB files, including fragmented backbones.
Filter residues (standard, water, hetero/ligand) or atom subsets (CA, backbone, sidechains).
Handle NMR ensembles or multiple models in a structure.

Best practices

Verify Biopython version before running examples and adapt if signatures differ.
Prefer Selection.unfold_entities for bulk extraction (residues/atoms) to simplify code.
Filter standard residues by residue.id[0] == ' ' to avoid waters/hetero groups when building sequences.
Use PPBuilder for intact backbones and CaPPBuilder when CA-only or broken chains exist.
Check has_id and is_disordered before direct indexing to avoid exceptions.

Example use cases

Extract all C-alpha coordinates for RMSD or distance calculations across models.
Build one-letter sequences from chains while skipping hetero residues and waters.
Collect all ligand residues using id[0].startswith('H_') for binding-site analysis.
Iterate an NMR ensemble to compute per-model CA counts or compare conformers.
Select alternative atom conformations from disordered atoms and retrieve coordinates for each altloc.

FAQ

What if a direct index like chain[100] raises KeyError?

Residue indexing can require the full ID tuple (hetfield, resseq, icode). Use chain[(' ', 100, ' ')] or check chain.has_id before accessing.

How do I convert three-letter residue names to one-letter codes?

Use Bio.Data.PDBData.protein_letters_3to1.get(res.resname, 'X') to map three-letter codes to one-letter; fallback to 'X' for unknowns.