
substructure-search skill

/chemoinformatics/substructure-search

This skill helps you identify compounds containing specific substructures in molecular libraries using SMARTS patterns with RDKit.

npx playbooks add skill gptomics/bioskills --skill substructure-search

---
name: bio-substructure-search
description: Searches molecular libraries for substructure matches using SMARTS patterns with RDKit. Filters compounds by pharmacophore features, functional groups, or scaffold matches with atom mapping. Use when finding compounds containing specific chemical moieties or filtering libraries by structural features.
tool_type: python
primary_tool: RDKit
---

## Version Compatibility

Reference examples tested with: RDKit 2024.03+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Substructure Search

**"Filter my library for compounds containing a specific functional group"** → Search molecular collections for substructure matches using SMARTS patterns, identifying compounds that contain specified chemical moieties, scaffolds, or pharmacophore features.
- Python: `mol.HasSubstructMatch()`, `Chem.MolFromSmarts()` (RDKit)

Find molecules containing specific structural patterns using SMARTS.

## Basic Substructure Search

```python
from rdkit import Chem

mol = Chem.MolFromSmiles('c1ccc(O)cc1CCO')

# Check if pattern exists
pattern = Chem.MolFromSmarts('[OH]')  # Hydroxyl group
has_hydroxyl = mol.HasSubstructMatch(pattern)
print(f'Contains hydroxyl: {has_hydroxyl}')

# Get all matches (atom indices)
matches = mol.GetSubstructMatches(pattern)
print(f'Hydroxyl positions: {matches}')
```

## Common SMARTS Patterns

| Pattern | SMARTS | Description |
|---------|--------|-------------|
| Hydroxyl | `[OH]` | Alcohol/phenol |
| Primary amine | `[NH2]` | Nitrogen with two hydrogens (also matches primary amides) |
| Secondary amine | `[NH1]` | Nitrogen with one hydrogen (also matches secondary amides) |
| Carboxylic acid | `[CX3](=O)[OX2H1]` | COOH |
| Amide | `[CX3](=O)[NX3]` | C(=O)N |
| Benzene | `c1ccccc1` | Phenyl ring |
| Any aromatic | `[a]` | Any aromatic atom |
| Halogen | `[F,Cl,Br,I]` | Any halogen |

## Library Filtering

**Goal:** Filter a molecular library to retain only compounds containing (or lacking) a specific structural pattern.

**Approach:** Parse a SMARTS pattern and test each molecule for a substructure match, returning those that pass the inclusion or exclusion criterion.

```python
from rdkit import Chem

def filter_by_substructure(molecules, smarts, exclude=False):
    '''
    Filter molecules by substructure presence/absence.

    Args:
        molecules: List of RDKit mol objects
        smarts: SMARTS pattern string
        exclude: If True, return molecules WITHOUT the pattern
    '''
    pattern = Chem.MolFromSmarts(smarts)
    if pattern is None:
        raise ValueError(f'Invalid SMARTS: {smarts}')

    filtered = []
    for mol in molecules:
        if mol is None:
            continue
        has_match = mol.HasSubstructMatch(pattern)
        if exclude:
            if not has_match:
                filtered.append(mol)
        else:
            if has_match:
                filtered.append(mol)

    return filtered

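# `library` is assumed to be a list of RDKit mol objects loaded elsewhere

# Filter for trivalent nitrogen (amines, but this also matches amides and anilines)
amines = filter_by_substructure(library, '[NX3;H2,H1,H0]')

# Exclude reactive groups
clean = filter_by_substructure(library, '[N+]([O-])=O', exclude=True)  # No nitro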
```
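
The `library` variable used above is not defined in the example. One way to build it, assuming the compounds are available as SMILES strings, is sketched here; the `smiles_list` contents are made up for illustration, and the inline filter is a compact stand-in for `filter_by_substructure` so the block runs on its own.

```python
from rdkit import Chem

# Hypothetical input: three valid SMILES and one deliberately invalid entry
smiles_list = ['CCN', 'CC(=O)O', 'c1ccccc1[N+](=O)[O-]', 'not_a_smiles']

# MolFromSmiles returns None for entries it cannot parse
library = [Chem.MolFromSmiles(s) for s in smiles_list]

nitro = Chem.MolFromSmarts('[N+]([O-])=O')

# Keep the input SMILES of every parsed molecule that lacks a nitro group
kept = [
    smi for smi, mol in zip(smiles_list, library)
    if mol is not None and not mol.HasSubstructMatch(nitro)
]
print(kept)
```

Keeping the original SMILES alongside each mol object makes it easier to report which inputs were dropped because they failed to parse and which were dropped by the filter.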

## Multiple Pattern Filtering

**Goal:** Apply multiple inclusion and exclusion substructure filters to narrow a compound set.

**Approach:** Sequentially apply SMARTS-based inclusion filters (must match all) then exclusion filters (must match none) to progressively narrow the library.

```python
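from rdkit import Chem

def filter_multiple_patterns(molecules, include_patterns=None, exclude_patterns=None):
    '''
    Filter by multiple inclusion and exclusion patterns.
    '''
    def compile_pattern(smarts):
        # Fail early with a clear message instead of passing None to RDKit
        pattern = Chem.MolFromSmarts(smarts)
        if pattern is None:
            raise ValueError(f'Invalid SMARTS: {smarts}')
        return pattern

    # Drop molecules that failed to parse
    result = [m for m in molecules if m is not None]

    # Keep only molecules matching every inclusion pattern
    for smarts in include_patterns or []:
        pattern = compile_pattern(smarts)
        result = [m for m in result if m.HasSubstructMatch(pattern)]

    # Drop molecules matching any exclusion pattern
    for smarts in exclude_patterns or []:
        pattern = compile_pattern(smarts)
        result = [m for m in result if not m.HasSubstructMatch(pattern)]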

    return result

# Find compounds with both amine and carboxylic acid (amino acids)
amino_acids = filter_multiple_patterns(
    library,
    include_patterns=['[NX3;H2]', '[CX3](=O)[OX2H1]']
)
```

## Atom Mapping

```python
from rdkit import Chem

def get_substructure_atoms(mol, smarts):
    '''
    Get all atoms matching a pattern with their indices.
    '''
    pattern = Chem.MolFromSmarts(smarts)
    matches = mol.GetSubstructMatches(pattern)

    results = []
    for match in matches:
        atoms = [mol.GetAtomWithIdx(i) for i in match]
        results.append({
            'indices': match,
            'symbols': [a.GetSymbol() for a in atoms]
        })

    return results

# Find and characterize all aromatic rings
mol = Chem.MolFromSmiles('c1ccc2c(c1)cccc2')
rings = get_substructure_atoms(mol, 'c1ccccc1')
print(f'Found {len(rings)} aromatic 6-membered rings')
```
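
`GetSubstructMatches` returns one match per unique set of atoms by default. Passing `uniquify=False` returns every atom ordering instead, which matters when the order of indices within a match is used for mapping. A minimal sketch of the difference on naphthalene:

```python
from rdkit import Chem

mol = Chem.MolFromSmiles('c1ccc2ccccc2c1')  # naphthalene
pattern = Chem.MolFromSmarts('c1ccccc1')

# Default: symmetry-equivalent mappings onto the same atoms are collapsed
unique_matches = mol.GetSubstructMatches(pattern)

# uniquify=False: every rotation and reflection of the ring is reported
all_matches = mol.GetSubstructMatches(pattern, uniquify=False)

print(f'Unique matches: {len(unique_matches)}')
print(f'All mappings:   {len(all_matches)}')
```

For counting rings or functional groups the default is usually what you want; the full set of mappings is mainly useful for alignment or atom-by-atom correspondence.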

## Recursive SMARTS

```python
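from rdkit import Chem

# Recursive SMARTS for complex patterns.
# $(...) matches a single atom: the first atom of the enclosed environment.

# Phenyl attached to carbonyl
phenyl_carbonyl = '[$(c1ccccc1C(=O))]'

# Ortho-substituted phenyl (substituents on adjacent ring atoms)
ortho_pattern = '[$(c1ccc([*])c([*])c1)]'

# Electron-withdrawing group on aromatic
# (nitro is written charge-separated, the form RDKit uses after sanitization)
ewg_aromatic = '[$(c[$(C(=O)),$(C#N),$([N+](=O)[O-])])]'

mol = Chem.MolFromSmiles('c1ccc(C(=O)O)cc1')
pattern = Chem.MolFromSmarts(phenyl_carbonyl)
print(mol.HasSubstructMatch(pattern))  # True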
```
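
Because a recursive SMARTS expression matches only one atom, the indices it returns differ from those of the equivalent plain pattern. The sketch below compares the two on benzoic acid; the `[$(cC=O)]` pattern is an extra illustration (not from the examples above) of how to target the ring carbon that actually carries the carbonyl.

```python
from rdkit import Chem

# Benzoic acid; atom 3 is the ring carbon bonded to the acid carbon
mol = Chem.MolFromSmiles('c1ccc(C(=O)O)cc1')

recursive = Chem.MolFromSmarts('[$(c1ccccc1C(=O))]')  # one atom per match
plain = Chem.MolFromSmarts('c1ccccc1C(=O)')           # all eight pattern atoms
ipso = Chem.MolFromSmarts('[$(cC=O)]')                # ring carbon bearing C=O

recursive_matches = mol.GetSubstructMatches(recursive)
plain_matches = mol.GetSubstructMatches(plain)
ipso_matches = mol.GetSubstructMatches(ipso)

print(f'Recursive: {recursive_matches}')
print(f'Plain:     {plain_matches}')
print(f'Ipso:      {ipso_matches}')
```

In `[$(c1ccccc1C(=O))]` the first atom of the environment is ring-bonded to the carbon that carries the carbonyl, so the reported atoms are the two ortho carbons rather than the attachment point.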

## Visualization with Highlighting

```python
from rdkit import Chem
from rdkit.Chem.Draw import rdMolDraw2D

def draw_with_highlights(mol, smarts, filename):
    '''Draw molecule with substructure highlighted.'''
    pattern = Chem.MolFromSmarts(smarts)
    match = mol.GetSubstructMatch(pattern)

    if not match:
        print('No match found')
        return

    drawer = rdMolDraw2D.MolDraw2DCairo(400, 300)
    drawer.DrawMolecule(mol, highlightAtoms=match)
    drawer.FinishDrawing()

    with open(filename, 'wb') as f:
        f.write(drawer.GetDrawingText())

# Highlight carboxylic acid in benzoic acid
mol = Chem.MolFromSmiles('c1ccc(C(=O)O)cc1')
draw_with_highlights(mol, '[CX3](=O)[OX2H1]', 'highlighted.png')
```
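
`MolDraw2DCairo` is only available when RDKit was built with Cairo support. Where PNG output is not required, an SVG variant of the same idea can be sketched with `MolDraw2DSVG`; the function name `svg_with_highlights` is a hypothetical helper, not part of the skill.

```python
from rdkit import Chem
from rdkit.Chem.Draw import rdMolDraw2D

def svg_with_highlights(mol, smarts, width=400, height=300):
    '''Return SVG text with the first match highlighted, or None if no match.'''
    pattern = Chem.MolFromSmarts(smarts)
    if pattern is None:
        raise ValueError(f'Invalid SMARTS: {smarts}')
    match = mol.GetSubstructMatch(pattern)
    if not match:
        return None

    drawer = rdMolDraw2D.MolDraw2DSVG(width, height)
    drawer.DrawMolecule(mol, highlightAtoms=list(match))
    drawer.FinishDrawing()
    return drawer.GetDrawingText()  # SVG drawers return a str, not bytes

mol = Chem.MolFromSmiles('c1ccc(C(=O)O)cc1')
svg = svg_with_highlights(mol, '[CX3](=O)[OX2H1]')
no_match = svg_with_highlights(mol, '[F,Cl,Br,I]')
```

The returned string can be written to a `.svg` file in text mode or embedded directly in an HTML report.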

## Related Skills

- molecular-io - Load molecules for searching
- similarity-searching - Fingerprint-based searching
- admet-prediction - Filter before ADMET analysis

Overview

This skill searches molecular libraries for substructure matches using SMARTS patterns and RDKit. It helps find compounds containing specific functional groups, pharmacophore features, or scaffolds and can return atom-mapped matches for precise localization. The skill supports inclusion/exclusion rules and sequential multi-pattern filtering for library triage. Examples target common patterns (hydroxyl, amine, carboxyl, aromatic rings) and include visualization with highlighted matches.

How this skill works

The skill parses SMARTS strings with RDKit (Chem.MolFromSmarts) and tests each molecule for matches using HasSubstructMatch and GetSubstructMatches. It supports single-pattern filters, sequential inclusion/exclusion sets, and returns atom indices and element symbols for matched substructures. Optional drawing utilities use rdMolDraw2D to render molecules with highlighted atoms for matched patterns.

When to use it

  • Filter a screening library for compounds that contain a required functional group (e.g., primary amine).
  • Remove compounds that contain reactive or undesirable groups (e.g., nitro, Michael acceptors).
  • Identify and map scaffolds or pharmacophores across a dataset for SAR or clustering.
  • Generate annotated images showing where a SMARTS pattern matches for reporting or review.
  • Combine with molecular I/O or ADMET pipelines to prefilter sets before modeling.

Best practices

  • Validate SMARTS patterns before running large searches; Chem.MolFromSmarts returns None for invalid patterns.
  • Test examples on a small subset of molecules to confirm expected matches and atom mapping.
  • Prefer sequential include-then-exclude filtering when applying multiple criteria to keep logic clear.
  • Be aware of RDKit version differences; confirm function signatures if you encounter ImportError/AttributeError.
  • Skip None or malformed molecules in libraries to avoid errors during batch processing.
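
The first practice above can be sketched as a small helper that compiles every pattern up front and reports all invalid ones together. The name `compile_smarts` is hypothetical, and `'[NX3'` is a deliberately malformed pattern used only to trigger the error path.

```python
from rdkit import Chem

def compile_smarts(smarts_list):
    '''Compile all patterns once; raise if any of them is invalid.'''
    compiled, invalid = {}, []
    for smarts in smarts_list:
        query = Chem.MolFromSmarts(smarts)
        if query is None:
            invalid.append(smarts)
        else:
            compiled[smarts] = query
    if invalid:
        raise ValueError(f'Invalid SMARTS: {invalid}')
    return compiled

# Valid patterns compile into reusable query molecules
queries = compile_smarts(['[OH]', '[CX3](=O)[OX2H1]'])

# An unclosed bracket is rejected before any library is touched
try:
    compile_smarts(['[OH]', '[NX3'])
    bad_pattern_error = None
except ValueError as exc:
    bad_pattern_error = str(exc)
print(bad_pattern_error)
```

Compiling once also avoids re-parsing the same SMARTS for every molecule in a large library.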

Example use cases

  • Find all compounds with a carboxylic acid and a primary amine to identify potential amino acids or zwitterions.
  • Exclude nitro-containing molecules from a hit list using an exclusion SMARTS filter.
  • Map and export atom indices for all phenyl-carbonyl motifs in a library for downstream annotation.
  • Draw and save PNGs showing highlighted hydroxyl positions for a set of benzene derivatives to include in a report.
  • Apply multiple SMARTS filters to isolate compounds that are aromatic, halogen-free, and possess a tertiary amine.
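
The last use case can be sketched as follows. The tertiary amine SMARTS `[NX3;H0;!$(NC=O)]` is an assumption chosen for this illustration (it excludes amides but not every edge case), and the candidate SMILES are made up.

```python
from rdkit import Chem

# Hypothetical candidates: only the first should pass all three criteria
candidates = [
    'CN(C)Cc1ccccc1',       # aromatic, no halogen, tertiary amine
    'CN(C)Cc1ccc(Cl)cc1',   # contains chlorine
    'CN(C)CCC',             # not aromatic
    'NCc1ccccc1',           # primary amine only
    'CN(C)C(=O)c1ccccc1',   # nitrogen is an amide, not an amine
]

include = [Chem.MolFromSmarts(s) for s in ('[a]', '[NX3;H0;!$(NC=O)]')]
exclude = [Chem.MolFromSmarts(s) for s in ('[F,Cl,Br,I]',)]

selected = []
for smiles in candidates:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        continue
    # Must match every inclusion pattern and none of the exclusion patterns
    if all(mol.HasSubstructMatch(p) for p in include) and \
            not any(mol.HasSubstructMatch(p) for p in exclude):
        selected.append(smiles)

print(selected)
```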

FAQ

Which RDKit version is required?

Patterns and examples were tested with RDKit 2024.03+. If you use a different version, verify function signatures and adjust code accordingly.

What if a SMARTS string is invalid?

Chem.MolFromSmarts returns None for invalid SMARTS. The filter_by_substructure example checks for this and raises a ValueError, so you can correct the pattern before processing large libraries.