home / skills / gptomics / bioskills / positive-selection

positive-selection skill

/comparative-genomics/positive-selection

This skill helps you detect positive selection in genes using dN/dS tests with PAML and HyPhy, identifying adaptive sites and branches.

npx playbooks add skill gptomics/bioskills --skill positive-selection

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
11.2 KB
---
name: bio-comparative-genomics-positive-selection
description: Detect positive selection using dN/dS (omega) tests with PAML codeml and HyPhy. Identify sites and branches under adaptive evolution through codon models and branch-site tests. Use when testing for adaptive evolution in gene families or identifying positively selected sites.
tool_type: mixed
primary_tool: PAML
---

## Version Compatibility

Reference examples tested with: BioPython 1.83+, HyPhy 2.5+, PAML 4.10+, scipy 1.12+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Positive Selection Analysis

**"Test my gene for positive selection"** → Detect adaptive evolution using dN/dS (omega) codon models including site models, branch models, and branch-site tests to identify positively selected sites.
- Python: PAML `codeml` for site and branch-site dN/dS tests
- CLI: `hyphy busted`, `hyphy meme` for HyPhy selection tests

## dN/dS Overview

```python
'''
dN/dS (omega, ω) interpretation:
- ω < 1: Purifying (negative) selection - deleterious mutations removed
- ω = 1: Neutral evolution - no selective pressure
- ω > 1: Positive (diversifying) selection - advantageous mutations favored

Most genes: ω << 1 (strong purifying selection)
Immune genes, reproduction: Often show ω > 1 at specific sites
'''
```

## PAML Codeml Analysis

**Goal:** Test for positive selection across a gene using codon-based site models.

**Approach:** Prepare a codon alignment in PHYLIP format, create codeml control files for nested models (M7 vs M8 or M1a vs M2a), run codeml, extract log-likelihoods, perform a likelihood ratio test, and identify positively selected sites from the BEB analysis.

```python
'''Run PAML codeml for selection analysis'''

import subprocess
import os
from Bio import SeqIO
from Bio.Seq import Seq


def prepare_codon_alignment(cds_fasta, output_phy):
    '''Prepare codon alignment in PHYLIP format

    Requirements:
    - CDS sequences (in-frame, no stop codons except terminal)
    - Multiple sequence alignment already performed
    - Sequence length divisible by 3
    '''
    records = list(SeqIO.parse(cds_fasta, 'fasta'))

    # Validate codon alignment
    for rec in records:
        if len(rec.seq) % 3 != 0:
            print(f'Warning: {rec.id} length not divisible by 3')

    # Write PHYLIP format
    n_seq = len(records)
    seq_len = len(records[0].seq)

    with open(output_phy, 'w') as f:
        f.write(f' {n_seq} {seq_len}\n')
        for rec in records:
            # PHYLIP names: 10 characters, padded
            name = rec.id[:10].ljust(10)
            f.write(f'{name}{str(rec.seq)}\n')

    return output_phy


def create_codeml_control(alignment_file, tree_file, output_dir, model='M8'):
    '''Create codeml control file

    Site models for detecting positive selection:
    - M0: One ratio (single ω for all sites)
    - M1a: Nearly neutral (ω0 < 1, ω1 = 1)
    - M2a: Positive selection (ω0 < 1, ω1 = 1, ω2 > 1)
    - M7: Beta (ω from beta distribution, 0 < ω < 1)
    - M8: Beta + ω > 1 (allows positive selection)
    - M8a: Beta + ω = 1 (null for M8 comparison)

    Standard comparison: M8 vs M7 or M8 vs M8a
    '''
    model_params = {
        'M0': {'NSsites': 0, 'model': 0},
        'M1a': {'NSsites': 1, 'model': 0},
        'M2a': {'NSsites': 2, 'model': 0},
        'M7': {'NSsites': 7, 'model': 0},
        'M8': {'NSsites': 8, 'model': 0},
        'M8a': {'NSsites': 8, 'model': 0, 'fix_omega': 1, 'omega': 1},
    }

    params = model_params.get(model, model_params['M8'])

    ctl_content = f'''
      seqfile = {alignment_file}
     treefile = {tree_file}
      outfile = {output_dir}/mlc

        noisy = 9
      verbose = 1
      runmode = 0
      seqtype = 1
    CodonFreq = 2
        model = {params.get('model', 0)}
      NSsites = {params.get('NSsites', 8)}
        icode = 0
    fix_kappa = 0
        kappa = 2
    fix_omega = {params.get('fix_omega', 0)}
        omega = {params.get('omega', 1)}
    '''

    ctl_file = f'{output_dir}/codeml_{model}.ctl'
    with open(ctl_file, 'w') as f:
        f.write(ctl_content)

    return ctl_file


def run_codeml(ctl_file):
    '''Run PAML codeml'''
    cmd = f'codeml {ctl_file}'
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)

    if result.returncode != 0:
        print(f'Codeml error: {result.stderr}')

    return result


def parse_codeml_output(mlc_file):
    '''Parse codeml output for likelihood and parameters'''
    results = {'lnL': None, 'omega': None, 'kappa': None, 'sites': []}

    with open(mlc_file) as f:
        content = f.read()

    # Extract log-likelihood
    for line in content.split('\n'):
        if 'lnL' in line and 'np' in line:
            parts = line.split()
            for i, p in enumerate(parts):
                if p == 'lnL':
                    results['lnL'] = float(parts[i + 2])
                    break

        # Extract omega values
        if 'omega' in line.lower() and '=' in line:
            parts = line.split('=')
            if len(parts) >= 2:
                try:
                    results['omega'] = float(parts[-1].strip().split()[0])
                except ValueError:
                    pass

    # Extract positively selected sites (BEB analysis)
    if 'Bayes Empirical Bayes' in content:
        beb_section = content.split('Bayes Empirical Bayes')[1]
        for line in beb_section.split('\n'):
            parts = line.split()
            if len(parts) >= 5:
                try:
                    site = int(parts[0])
                    aa = parts[1]
                    prob = float(parts[2])
                    # Sites with P > 0.95 considered significant
                    # Sites with P > 0.99 highly significant
                    if prob > 0.95:
                        results['sites'].append({
                            'position': site,
                            'amino_acid': aa,
                            'probability': prob,
                            'significance': '**' if prob > 0.99 else '*'
                        })
                except (ValueError, IndexError):
                    continue

    return results


def likelihood_ratio_test(lnL_null, lnL_alt, df=2):
    '''Perform likelihood ratio test

    For M8 vs M7: df = 2
    For M2a vs M1a: df = 2
    For branch-site test: df = 1

    Significance thresholds (chi-square):
    - df=1: 3.84 (p<0.05), 6.63 (p<0.01)
    - df=2: 5.99 (p<0.05), 9.21 (p<0.01)
    '''
    from scipy import stats

    lrt = 2 * (lnL_alt - lnL_null)
    p_value = 1 - stats.chi2.cdf(lrt, df)

    return {
        'LRT_statistic': lrt,
        'degrees_freedom': df,
        'p_value': p_value,
        'significant': p_value < 0.05
    }
```

## Branch-Site Models

**Goal:** Test for positive selection on a specific lineage (foreground branch) while allowing variable rates across the tree.

**Approach:** Mark the foreground lineage with #1 in the tree file, create branch-site model A control files for both alternative and null hypotheses, run codeml, and perform an LRT with df=1.

```python
def create_branch_site_control(alignment, tree, foreground_branch, output_dir):
    '''Create control file for branch-site test

    Branch-site model A tests for positive selection on
    specific branches (foreground) while allowing variation
    across the tree.

    Foreground branch: Mark with #1 in tree file
    Example: ((A,B),(C #1,D));

    Comparison: Model A vs null (fix_omega = 1)
    '''
    ctl_content = f'''
      seqfile = {alignment}
     treefile = {tree}
      outfile = {output_dir}/branch_site.mlc

      runmode = 0
      seqtype = 1
    CodonFreq = 2
        model = 2
      NSsites = 2
    fix_kappa = 0
        kappa = 2
    fix_omega = 0
        omega = 1
    '''

    ctl_file = f'{output_dir}/branch_site.ctl'
    with open(ctl_file, 'w') as f:
        f.write(ctl_content)

    return ctl_file


def mark_foreground_branch(tree_file, foreground_taxa, output_file):
    '''Mark foreground lineage in Newick tree

    For testing selection on specific lineage:
    - Add #1 after the clade of interest
    - Can mark multiple branches with different tags
    '''
    with open(tree_file) as f:
        tree = f.read().strip()

    # Simple marking - for complex trees use ete3
    for taxon in foreground_taxa:
        tree = tree.replace(taxon, f'{taxon} #1')

    with open(output_file, 'w') as f:
        f.write(tree)

    return output_file
```

## HyPhy Alternative

**Goal:** Detect positive selection using HyPhy's suite of methods for gene-wide and site-specific tests.

**Approach:** Run BUSTED for gene-wide episodic selection, MEME for site-specific episodic selection, or FEL for pervasive site selection, and parse the JSON output for p-values and selected sites.

```python
def run_hyphy_busted(alignment_file, tree_file, output_json):
    '''Run HyPhy BUSTED for gene-wide selection

    BUSTED (Branch-site Unrestricted Statistical Test for
    Episodic Diversification) tests whether a gene has
    experienced positive selection at any site on any branch.

    More sensitive than PAML for episodic selection
    '''
    cmd = f'hyphy busted --alignment {alignment_file} --tree {tree_file} --output {output_json}'
    subprocess.run(cmd, shell=True)

    return output_json


def run_hyphy_meme(alignment_file, tree_file, output_json):
    '''Run HyPhy MEME for site-specific selection

    MEME (Mixed Effects Model of Evolution) detects
    episodic diversifying selection at individual sites.

    Better than PAML M8 when selection is episodic
    (not constant across all branches)
    '''
    cmd = f'hyphy meme --alignment {alignment_file} --tree {tree_file} --output {output_json}'
    subprocess.run(cmd, shell=True)

    return output_json


def run_hyphy_fel(alignment_file, tree_file, output_json):
    '''Run HyPhy FEL for pervasive selection

    FEL (Fixed Effects Likelihood) tests for pervasive
    selection at each site across the entire phylogeny.

    Use when selection is expected to be constant
    '''
    cmd = f'hyphy fel --alignment {alignment_file} --tree {tree_file} --output {output_json}'
    subprocess.run(cmd, shell=True)

    return output_json


def parse_hyphy_json(json_file):
    '''Parse HyPhy JSON output'''
    import json

    with open(json_file) as f:
        results = json.load(f)

    # Extract test results
    test_results = results.get('test results', {})

    # Extract sites under selection (MEME/FEL)
    sites = []
    mle = results.get('MLE', {}).get('content', {})
    for site_data in mle.get('0', {}).values():
        if isinstance(site_data, list) and len(site_data) > 5:
            # Structure varies by method
            pass

    return {
        'p_value': test_results.get('p-value'),
        'LRT': test_results.get('LRT'),
        'sites': sites
    }
```

## Related Skills

- comparative-genomics/synteny-analysis - Find syntenic genes for selection tests
- comparative-genomics/ortholog-inference - Identify orthologs for analysis
- alignment/msa-parsing - Parse and manipulate codon alignments
- phylogenetics/modern-tree-inference - Generate trees for codeml

Overview

This skill detects positive selection in coding sequences using codon-based dN/dS methods implemented with PAML codeml and HyPhy. It runs site models, branch models, and branch-site tests to identify genes, branches, and specific codons under adaptive evolution. The outputs include likelihood ratio tests, Bayes Empirical Bayes site probabilities, and parsed HyPhy JSON results for MEME/BUSTED/FEL analyses.

How this skill works

Prepare an in-frame codon alignment and a Newick tree, then run PAML codeml with nested site models (e.g., M7 vs M8 or M1a vs M2a) to compare null and alternative hypotheses via likelihood ratio tests. For lineage-specific tests, mark the foreground branch and run branch-site model A with an LRT (df=1). Alternatively, call HyPhy commands (BUSTED, MEME, FEL) to detect episodic or pervasive selection and parse JSON outputs for site-level significance.

When to use it

  • Testing whether a gene family has experienced adaptive evolution across species.
  • Finding specific codons under positive selection (site-level inference).
  • Testing a particular lineage for episodic positive selection (branch-site tests).
  • Comparing gene-wide episodic selection with site models to capture different selection regimes.
  • Validating selection signals before downstream functional or structural follow-up.

Best practices

  • Use high-quality, in-frame codon alignments with lengths divisible by three and no internal stop codons.
  • Generate an accurate phylogenetic tree from the same sequences before running codon tests.
  • Run nested model comparisons (M7 vs M8 or M1a vs M2a) and report LRT statistic, df, and p-value.
  • Mark foreground branches explicitly in the tree for branch-site tests and compare model A vs null (fix_omega=1).
  • Complement PAML results with HyPhy methods (BUSTED/MEME/FEL) for episodic selection detection.

Example use cases

  • Compare M7 vs M8 in codeml to test for sites with ω>1 and list BEB sites with P>0.95.
  • Mark a disease-associated lineage as foreground and run branch-site model A to test adaptive changes on that branch.
  • Run HyPhy MEME to identify episodic diversifying sites missed by constant-site models.
  • Use BUSTED to screen multiple genes for any evidence of episodic selection before detailed site analyses.
  • Integrate results with structural models to prioritize mutations for experimental validation.

FAQ

Which test is best for episodic selection?

HyPhy MEME and BUSTED are more sensitive to episodic selection than PAML site models; use MEME for site-level episodic detection and BUSTED for gene-wide screening.

How do I interpret ω (dN/dS)?

ω < 1 implies purifying selection, ω = 1 neutral evolution, and ω > 1 indicates diversifying (positive) selection; significance should be supported by LRTs or p-values, not ω alone.