Zero-Shot Prediction

Overview

Zero-shot predictors in AIDE can assess protein variants without requiring training data. These models leverage different types of information:

  • Pretrained language models that capture protein sequence patterns

  • Multiple sequence alignments that capture evolutionary information

  • Structural information for 3D context and conservation signals

  • Combinations of these approaches for more robust predictions

Note that the examples below all use single point mutations, but many of the models can also score multiple mutations.
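
As a sketch of the last point above, scores from two zero-shot predictors can be combined by averaging their ranks, which sidesteps the fact that different models score on different scales. The arrays below are hypothetical outputs; any two models' `predict` results over the same variants would slot in the same way:

```python
import numpy as np

# Hypothetical scores for the same five variants from two zero-shot models
esm2_scores = np.array([0.2, -1.5, 0.7, 0.1, -0.3])
eve_scores = np.array([0.5, -0.9, 0.9, 0.6, -1.2])

def to_ranks(scores):
    """Convert scores to 0-based ranks (higher score -> higher rank)."""
    order = scores.argsort()
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(scores))
    return ranks

# Average the per-model ranks for a simple, scale-free ensemble
combined = (to_ranks(esm2_scores) + to_ranks(eve_scores)) / 2
best_variant = int(combined.argmax())
```

Rank averaging is deliberately crude; it treats both models as equally trustworthy, but it is a common first baseline before weighted ensembles.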

Transformer-Based Models

ESM2

ESM2 uses masked language modeling to predict mutation effects based on the likelihood of amino acids in context:

from aide_predict import ESM2LikelihoodWrapper, ProteinSequence

# Setup wild type sequence
wt = ProteinSequence(
    "MKLLVLGLPGAGKGT",
    id="wild_type"
)

# Choose marginal method for computing likelihoods
model = ESM2LikelihoodWrapper(
    wt=wt,
    marginal_method="masked_marginal",  # or "wildtype_marginal" or "mutant_marginal"
    pool=True  # True to get single score per sequence
)

# No training needed
model.fit()

# Score mutations
mutants = wt.saturation_mutagenesis()
scores = model.predict(mutants)

The marginal method determines how likelihoods are computed:

  • masked_marginal: masks each position and scores substitutions from the model's probabilities at the mask

  • wildtype_marginal: computes likelihoods from a single pass over the wild type sequence

  • mutant_marginal: computes likelihoods using each mutant sequence as its own context
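
Assuming `predict` returns one score per input sequence in input order (as `pool=True` suggests), a downstream step is typically to rank the mutants. The labels and scores below are hypothetical stand-ins for `mutants` and the model output:

```python
import numpy as np

# Hypothetical mutant labels and the per-sequence scores a pooled
# likelihood model might return for them
mutant_ids = ["M1A", "K2R", "L3V", "L4I", "V5A"]
scores = np.array([-0.8, 1.2, 0.3, -0.1, 0.9])

# Rank mutants from highest to lowest predicted likelihood
top_k = 3
top_idx = np.argsort(scores)[::-1][:top_k]
top_mutants = [(mutant_ids[i], float(scores[i])) for i in top_idx]
```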

MSA Transformer

MSA Transformer extends ESM’s approach by incorporating evolutionary information from multiple sequence alignments:

from aide_predict import MSATransformerLikelihoodWrapper, ProteinSequence

# Setup wild type sequence with MSA
wt = ProteinSequence.from_a3m("protein_family.a3m")

# Create model with MSA context
model = MSATransformerLikelihoodWrapper(
    wt=wt,
    marginal_method="masked_marginal",
    n_msa_seqs=360  # Number of MSA sequences to use
)

# Fit to MSA 
model.fit()

# Score mutations
mutants = wt.saturation_mutagenesis()
scores = model.predict(mutants)

MSA Transformer combines the power of language models with evolutionary information, often improving predictions for proteins with rich evolutionary profiles.

VESPA

VESPA uses a pretrained head on top of transformer embeddings that was trained specifically to predict variant effects:

from aide_predict import VESPAWrapper, ProteinSequence

# Setup wild type sequence
wt = ProteinSequence(
    "MKLLVLGLPGAGKGT",
    id="wild_type"
)

# Create VESPA model (light version by default)
model = VESPAWrapper(
    wt=wt,
    light=True  # Use lighter VESPAl model instead of full VESPA
)

# No training needed
model.fit()

# Score single mutations (VESPA is only for single mutations)
mutants = wt.saturation_mutagenesis()
scores = model.predict(mutants)

VESPA was trained on human disease variants and is particularly useful for predicting pathogenicity of human protein variants.

Structure-Aware Models

SaProt

SaProt incorporates protein structure information with sequence to improve predictions:

from aide_predict import SaProtLikelihoodWrapper, ProteinStructure

# Load wild type with its structure
wt = ProteinStructure.from_pdb("structures/structure.pdb")

# Create model
model = SaProtLikelihoodWrapper(
    wt=wt,
    marginal_method="masked_marginal"
)

# No training needed
model.fit()

# Score mutations with structure info
mutants = wt.saturation_mutagenesis()
scores = model.predict(mutants)

SaProt is particularly valuable for proteins where structural context plays a significant role in function or stability.

SSEmb

SSEmb combines structure and sequence information through a joint embedding approach:

import os

from aide_predict import SSEmbWrapper, ProteinSequence

# Point AIDE at the SSEmb conda environment and repository first
os.environ['SSEMB_CONDA_ENV'] = 'ssemb_env'
os.environ['SSEMB_REPO'] = '/path/to/ssemb/repo'

# Setup wild type with structure and MSA
wt = ProteinSequence.from_a3m("protein_family.a3m")
wt.structure = "structures/structure.pdb"

# Create model
model = SSEmbWrapper(wt=wt)

# Fit using MSA
model.fit()

# Score mutations
mutants = wt.saturation_mutagenesis()
scores = model.predict(mutants)

SSEmb is especially effective for scoring mutations in proteins with known structures and rich evolutionary information.

Evolutionary Models

HMM

Hidden Markov Models capture position-specific amino acid preferences from MSAs:

from aide_predict import HMMWrapper, ProteinSequences

# Load MSA
msa = ProteinSequences.from_fasta("protein_family.a3m")

# Create and fit model
model = HMMWrapper(threshold=100)  # bit score threshold
model.fit(msa)

# Score new sequences
sequences = ProteinSequences.from_fasta("variants.fasta")
scores = model.predict(sequences)

HMMs are fast and interpretable but don’t capture dependencies between positions.
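
To see what position independence means in practice, here is a toy position-specific scoring sketch (not AIDE's HMM implementation): each alignment column is scored on its own, so a variant's score is just a sum of per-position log-frequencies and cannot reflect interactions between positions.

```python
import math
from collections import Counter

# Toy alignment: three aligned sequences of length 4
alignment = ["MKLV", "MKIV", "MRLV"]

def column_log_freqs(column, pseudocount=0.1, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Per-column amino acid log-frequencies with a small pseudocount."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(alphabet)
    return {aa: math.log((counts.get(aa, 0) + pseudocount) / total) for aa in alphabet}

columns = [column_log_freqs(col) for col in zip(*alignment)]

def score(seq):
    """Sum independent per-position log-frequencies."""
    return sum(col[aa] for col, aa in zip(columns, seq))
```

Substitutions at conserved columns (M at position 1, V at position 4) are penalized heavily, but the score for a change at one position never depends on what occupies another, which is exactly the dependency an HMM also misses.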

EVMutation

EVMutation analyzes co-evolution patterns in MSAs to capture epistatic effects:

from aide_predict import EVMutationWrapper, ProteinSequence

# Load MSA and wild type
wt = ProteinSequence.from_a3m("protein_family.a3m")

# Create and fit model
model = EVMutationWrapper(
    wt=wt,
    theta=0.8,  # Sequence weighting parameter
    protocol="standard"  # or "complex" or "mean_field"
)

# Fit using MSA
model.fit()

# Score mutations
mutants = wt.saturation_mutagenesis()
scores = model.predict(mutants)

EVMutation captures pairwise dependencies between positions, making it effective for predicting epistatic effects where multiple mutations interact.
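
When scoring epistatic combinations, variants can also be built explicitly rather than via saturation_mutagenesis. The helper below is a hypothetical illustration (not part of the aide_predict API) that applies mutation codes like "L3V" (wild-type residue, 1-based position, new residue) to a plain sequence string:

```python
def apply_mutations(wt_seq, mutations):
    """Apply mutation codes like 'L3V' to a wild-type sequence string."""
    seq = list(wt_seq)
    for code in mutations:
        wt_aa, pos, new_aa = code[0], int(code[1:-1]), code[-1]
        # Sanity-check that the code matches the wild type
        if seq[pos - 1] != wt_aa:
            raise ValueError(f"{code}: expected {wt_aa} at position {pos}, found {seq[pos - 1]}")
        seq[pos - 1] = new_aa
    return "".join(seq)

# Double mutant of the example wild type from earlier sections
wt_seq = "MKLLVLGLPGAGKGT"
double = apply_mutations(wt_seq, ["L3V", "G7A"])
```

The resulting string could then be wrapped in a ProteinSequence and passed to `predict` alongside single mutants.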

EVE

EVE constructs a posterior latent distribution over an MSA and scores how “in-distribution” a sequence is:

from aide_predict import EVEWrapper, ProteinSequence

# Load MSA and wild type
wt = ProteinSequence.from_a3m("protein_family.a3m")

# Create model with custom parameters
model = EVEWrapper(
    wt=wt,
    encoder_z_dim=50,  # Dimensionality of latent space
    training_steps=400000  # Number of training steps
)

# Fit using MSA
model.fit()

# Score mutations
mutants = wt.saturation_mutagenesis()
scores = model.predict(mutants)
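
Because different zero-shot models (EVE, ESM2, EVMutation, ...) return scores on different scales, a common step before comparing or averaging them is to standardize each model's output. A minimal sketch, with hypothetical score arrays standing in for two models' `predict` results:

```python
import numpy as np

# Hypothetical raw scores from two models on the same four variants
model_a = np.array([10.0, 12.0, 8.0, 11.0])
model_b = np.array([-0.5, 0.1, -1.3, 0.0])

def zscore(x):
    """Center and scale so scores from different models are comparable."""
    return (x - x.mean()) / x.std()

# After standardization both arrays are zero-mean, unit-scale, and can be averaged
combined = (zscore(model_a) + zscore(model_b)) / 2
```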

Contributing

If you have a zero-shot method you would like to have added, please reach out:

evan.komp (at) nrel.gov