Zero-Shot Prediction
Overview
Zero-shot predictors in AIDE can assess protein variants without requiring training data. These models leverage different types of information:
Pretrained language models that capture protein sequence patterns
Multiple sequence alignments that capture evolutionary information
Structural information for 3D context and conservation signals
Combinations of these approaches for more robust predictions
Note that the examples below all use single point mutations, but many of the models can also score sequences carrying multiple mutations.
Transformer-Based Models
ESM2
ESM2 uses masked language modeling to predict mutation effects based on the likelihood of amino acids in context:
from aide_predict import ESM2LikelihoodWrapper, ProteinSequence
# Setup wild type sequence
wt = ProteinSequence(
"MKLLVLGLPGAGKGT",
id="wild_type"
)
# Choose marginal method for computing likelihoods
model = ESM2LikelihoodWrapper(
wt=wt,
marginal_method="masked_marginal", # or "wildtype_marginal" or "mutant_marginal"
pool=True # True to get single score per sequence
)
# No training needed
model.fit()
# Score mutations
mutants = wt.saturation_mutagenesis()
scores = model.predict(mutants)
The marginal method determines how likelihoods are computed:
masked_marginal: Masks each position to compute the probability directly
wildtype_marginal: Uses wild type context only
mutant_marginal: Uses mutant sequence context
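All three methods reduce to a log-likelihood ratio between the mutant and wild-type residue at the mutated position. A minimal sketch of the masked-marginal idea, with a stand-in aa_distribution function (hypothetical, not part of aide_predict) in place of a real language model:

```python
import math

def aa_distribution(context, position):
    """Stand-in for a language model: returns a probability
    distribution over amino acids at `position` given `context`.
    A uniform toy distribution here, purely for illustration."""
    probs = {aa: 1.0 for aa in "ACDEFGHIKLMNPQRSTVWY"}
    total = sum(probs.values())
    return {aa: p / total for aa, p in probs.items()}

def masked_marginal_score(wt_seq, position, mut_aa):
    """Score one substitution as log P(mut) - log P(wt),
    with the mutated position masked out of the context."""
    masked = wt_seq[:position] + "<mask>" + wt_seq[position + 1:]
    probs = aa_distribution(masked, position)
    return math.log(probs[mut_aa]) - math.log(probs[wt_seq[position]])

score = masked_marginal_score("MKLLVLGLPGAGKGT", 2, "A")  # L3A
```

With a real model, masked_marginal is often reported as the most accurate but requires one forward pass per masked position, while wildtype_marginal can score every mutation from a single pass over the unmasked wild type.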
MSA Transformer
MSA Transformer extends ESM’s approach by incorporating evolutionary information from multiple sequence alignments:
from aide_predict import MSATransformerLikelihoodWrapper, ProteinSequence
# Setup wild type sequence with MSA
wt = ProteinSequence.from_a3m("protein_family.a3m")
# Create model with MSA context
model = MSATransformerLikelihoodWrapper(
wt=wt,
marginal_method="masked_marginal",
n_msa_seqs=360 # Number of MSA sequences to use
)
# Fit to MSA
model.fit()
# Score mutations
mutants = wt.saturation_mutagenesis()
scores = model.predict(mutants)
MSA Transformer combines the power of language models with evolutionary information, often improving predictions for proteins with rich evolutionary profiles.
VESPA
VESPA uses a pretrained model head on top of transformer embeddings specifically trained to predict variant effects:
from aide_predict import VESPAWrapper, ProteinSequence
# Setup wild type sequence
wt = ProteinSequence(
"MKLLVLGLPGAGKGT",
id="wild_type"
)
# Create VESPA model (light version by default)
model = VESPAWrapper(
wt=wt,
light=True # Use lighter VESPAl model instead of full VESPA
)
# No training needed
model.fit()
# Score single mutations (VESPA is only for single mutations)
mutants = wt.saturation_mutagenesis()
scores = model.predict(mutants)
VESPA was trained on human disease variants and is particularly useful for predicting pathogenicity of human protein variants.
Structure-Aware Models
SaProt
SaProt incorporates protein structure information with sequence to improve predictions:
from aide_predict import SaProtLikelihoodWrapper, ProteinStructure
# Load wild type structure
wt = ProteinStructure.from_pdb("structures/structure.pdb")
# Create model
model = SaProtLikelihoodWrapper(
wt=wt,
marginal_method="masked_marginal"
)
# No training needed
model.fit()
# Score mutations with structure info
mutants = wt.saturation_mutagenesis()
scores = model.predict(mutants)
SaProt is particularly valuable for proteins where structural context plays a significant role in function or stability.
SSEmb
SSEmb combines structure and sequence information through a joint embedding approach:
from aide_predict import SSEmbWrapper, ProteinSequence
# Setup environment variables first
# os.environ['SSEMB_CONDA_ENV'] = 'ssemb_env'
# os.environ['SSEMB_REPO'] = '/path/to/ssemb/repo'
# Setup wild type with structure and MSA
wt = ProteinSequence.from_a3m("protein_family.a3m")
wt.structure = "structures/structure.pdb"
# Create model
model = SSEmbWrapper(wt=wt)
# Fit using MSA
model.fit()
# Score mutations
mutants = wt.saturation_mutagenesis()
scores = model.predict(mutants)
SSEmb is especially effective for scoring mutations in proteins with known structures and rich evolutionary information.
Evolutionary Models
HMM
Hidden Markov Models capture position-specific amino acid preferences from MSAs:
from aide_predict import HMMWrapper, ProteinSequences
# Load MSA
msa = ProteinSequences.from_fasta("protein_family.a3m")
# Create and fit model
model = HMMWrapper(threshold=100) # bit score threshold
model.fit(msa)
# Score new sequences
sequences = ProteinSequences.from_fasta("variants.fasta")
scores = model.predict(sequences)
HMMs are fast and interpretable but don’t capture dependencies between positions.
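The position-independence is easy to see in a toy profile model (an illustrative sketch, not the HMM implementation behind HMMWrapper): each MSA column is modeled on its own, and a sequence's score is just a sum over columns, so interactions between positions cannot be represented.

```python
import math
from collections import Counter

AAS = "ACDEFGHIKLMNPQRSTVWY"

def position_log_probs(msa, pseudocount=1.0):
    """Per-column amino acid log-probabilities from an MSA,
    with a pseudocount so unseen residues get nonzero mass."""
    table = []
    for i in range(len(msa[0])):
        counts = Counter(seq[i] for seq in msa)
        total = len(msa) + pseudocount * len(AAS)
        table.append({aa: math.log((counts[aa] + pseudocount) / total)
                      for aa in AAS})
    return table

def profile_score(seq, table):
    """Log-likelihood of a sequence: an independent sum over positions."""
    return sum(table[i][aa] for i, aa in enumerate(seq))

msa = ["MKL", "MKV", "MRL", "MKL"]
table = position_log_probs(msa)
```

Sequences that match the common residue at each column score higher; a sequence unlike anything in the alignment scores much lower.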
EVMutation
EVMutation analyzes co-evolution patterns in MSAs to capture epistatic effects:
from aide_predict import EVMutationWrapper, ProteinSequence
# Load MSA and wild type
wt = ProteinSequence.from_a3m("protein_family.a3m")
# Create and fit model
model = EVMutationWrapper(
wt=wt,
theta=0.8, # Sequence weighting parameter
protocol="standard" # or "complex" or "mean_field"
)
# Fit using MSA
model.fit()
# Score mutations
mutants = wt.saturation_mutagenesis()
scores = model.predict(mutants)
EVMutation captures pairwise dependencies between positions, making it effective for predicting epistatic effects where multiple mutations interact.
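The underlying idea is a Potts-style energy with per-position fields plus pairwise couplings. A toy sketch (hypothetical hand-set parameters, not a fitted EVMutation model) shows how the coupling term makes a double mutant score differently from the sum of its single mutants:

```python
def potts_energy(seq, fields, couplings):
    """Potts-style score: per-position fields plus pairwise couplings.
    fields[i][aa] and couplings[(i, j)][(aa_i, aa_j)] are toy parameters."""
    e = sum(fields[i].get(aa, 0.0) for i, aa in enumerate(seq))
    for (i, j), table in couplings.items():
        e += table.get((seq[i], seq[j]), 0.0)
    return e

# Toy parameters: positions 0 and 2 are coupled (e.g. a contact pair)
fields = [{"M": 1.0, "A": 0.2}, {"K": 0.5}, {"L": 1.0, "V": 0.1}]
couplings = {(0, 2): {("M", "L"): 2.0, ("A", "V"): 1.5}}

wt = "MKL"
single_a = potts_energy("AKL", fields, couplings) - potts_energy(wt, fields, couplings)
single_v = potts_energy("MKV", fields, couplings) - potts_energy(wt, fields, couplings)
double = potts_energy("AKV", fields, couplings) - potts_energy(wt, fields, couplings)
epistasis = double - (single_a + single_v)  # nonzero: the positions interact
```

A purely position-independent model (like the HMM above) would always give epistasis of exactly zero; the coupling table is what lets the model reward or penalize specific residue pairs.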
EVE
EVE constructs a posterior latent distribution over an MSA and scores how “in-distribution” a sequence is:
from aide_predict import EVEWrapper, ProteinSequence
# Load MSA and wild type
wt = ProteinSequence.from_a3m("protein_family.a3m")
# Create model with custom parameters
model = EVEWrapper(
wt=wt,
encoder_z_dim=50, # Dimensionality of latent space
training_steps=400000 # Number of training steps
)
# Fit using MSA
model.fit()
# Score mutations
mutants = wt.saturation_mutagenesis()
scores = model.predict(mutants)
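As a rough sketch of scoring by latent density: project sequences into a low-dimensional latent space fit on the MSA and measure how probable each one is there. The linear PCA-plus-Gaussian stand-in below is purely illustrative; EVE itself learns this distribution with a variational autoencoder, and none of these helper functions exist in aide_predict.

```python
import numpy as np

AAS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    """Flattened one-hot encoding of a protein sequence."""
    x = np.zeros((len(seq), len(AAS)))
    for i, aa in enumerate(seq):
        x[i, AAS.index(aa)] = 1.0
    return x.ravel()

def fit_latent_gaussian(msa, k=2):
    """Project one-hot MSA sequences to a k-dim latent space (PCA)
    and fit a diagonal Gaussian there -- a linear stand-in for the
    latent distribution EVE actually learns."""
    X = np.stack([one_hot(s) for s in msa])
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = vt[:k]
    Z = (X - mean) @ basis.T
    return mean, basis, Z.mean(axis=0), Z.var(axis=0) + 1e-3

def log_density(seq, params):
    """How 'in-distribution' a sequence is under the latent Gaussian."""
    mean, basis, mu, var = params
    z = (one_hot(seq) - mean) @ basis.T
    return float(-0.5 * np.sum((z - mu) ** 2 / var + np.log(2 * np.pi * var)))

msa = ["MKL", "MKV", "MRL", "MKL", "MKL"]
params = fit_latent_gaussian(msa)
# Mutants are scored relative to the wild type:
delta = log_density("MKV", params) - log_density("MKL", params)
```

The score is relative: a mutant that moves the sequence away from the family's latent distribution receives a lower density than the wild type.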
Contributing
If you have a zero-shot method you would like to have added, please reach out:
evan.komp (at) nrel.gov