Model Compatibility

Understanding Model Requirements

AIDE models have different requirements and capabilities that determine whether they can be used with your data. Key considerations include:

  • Whether the model requires training data (supervised vs zero-shot)

  • Whether sequences must be aligned or of fixed length

  • Whether the model requires a Multiple Sequence Alignment (MSA)

  • Whether the model requires or can use structural information

  • Whether the model needs a wild-type sequence for comparison

  • Whether the model can handle variable-length sequences
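
Several of these considerations come down to simple properties of the sequences themselves. The following is a minimal sketch in plain Python (it does not use AIDE) of how you might summarize those properties before choosing a model. The function name `describe_sequences` and the example sequences are hypothetical and serve only as illustration.

```python
def describe_sequences(seqs):
    """Summarize sequence properties that commonly affect model choice.

    `seqs` is a non-empty list of plain strings. This is an illustrative
    helper, not part of AIDE.
    """
    if not seqs:
        raise ValueError("at least one sequence is required")
    lengths = {len(s) for s in seqs}
    return {
        "fixed_length": len(lengths) == 1,          # all sequences share one length
        "has_gaps": any("-" in s for s in seqs),    # gap characters suggest an alignment
        "min_length": min(lengths),
        "max_length": max(lengths),
    }


# Point variants of one protein: same length, no gaps
variants = ["MKLLVLGLPGAGKGT", "MKLLVLGAPGAGKGT", "MKLLVLGLPGAGKGA"]
# Unaligned homologs: lengths differ
homologs = ["MKLLVLGLPGAGKGT", "MKIVLGPGSGKGT", "MRLLLVGLPGSGKSTQ"]

print(describe_sequences(variants))
print(describe_sequences(homologs))
```

Fixed-length data such as a set of point variants leaves the most options open, while variable-length data narrows the choice to models that accept it.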

Checking Model Compatibility

AIDE provides a utility function to check which models are compatible with your data:

from aide_predict.utils.checks import check_model_compatibility
from aide_predict import ProteinSequences, ProteinSequence

# Example setup
sequences = ProteinSequences.from_fasta("my_sequences.fasta")
msa = ProteinSequences.from_fasta("family_msa.fasta")
wt = ProteinSequence("MKLLVLGLPGAGKGT", id="wild_type")
wt.msa = msa 

# Check compatibility
compatibility = check_model_compatibility(
    training_sequences=sequences,  # Optional: sequences for supervised learning
    testing_sequences=None,        # Optional: test sequences if different from training
    wt=wt                          # Optional: wild-type sequence, may have structure, MSA
)

print("Compatible models:", compatibility["compatible"])
print("Incompatible models:", compatibility["incompatible"])
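
Once you have the result, a common next step is to choose a model from the compatible set. The sketch below assumes that the `"compatible"` and `"incompatible"` entries are lists of model names given as strings; that is an assumption about the return format, and the helper `pick_model` and the example dictionary are hypothetical.

```python
def pick_model(compatibility, preferences):
    """Return the first preferred model name found in the compatible set.

    Assumes compatibility["compatible"] is a list of model-name strings.
    Returns None when no preferred model is compatible.
    """
    compatible = set(compatibility["compatible"])
    for name in preferences:
        if name in compatible:
            return name
    return None


# Hypothetical result, shaped like the dictionary printed above
example_result = {
    "compatible": ["ESM2LikelihoodWrapper", "KmerEmbedding"],
    "incompatible": ["MSATransformerLikelihoodWrapper"],
}

# Prefer the MSA-based model, fall back to ESM2
choice = pick_model(
    example_result,
    ["MSATransformerLikelihoodWrapper", "ESM2LikelihoodWrapper"],
)
print(choice)
```

Listing preferences in order lets the same script run on datasets with and without an MSA, falling back gracefully when the preferred model is unavailable.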

The compatibility checker performs several validation steps:

  • Verifies whether structure information is available (in the sequences or in the wild-type)

  • Checks if MSAs are available and properly aligned

  • Validates that sequence lengths match requirements

  • Ensures wild-type sequences are available when needed

  • Verifies that per-sequence MSAs match sequence lengths when required
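
To make the MSA-related checks concrete, here is a minimal sketch in plain Python. It is not AIDE's implementation: it assumes that "properly aligned" means every row of the MSA has the same width, and that an MSA "matches" a sequence when that width equals the sequence length. Both function names are hypothetical.

```python
def msa_is_aligned(msa_rows):
    """True if the MSA is non-empty and every row has the same width."""
    return len(msa_rows) > 0 and len({len(row) for row in msa_rows}) == 1


def msa_matches_sequence(msa_rows, sequence):
    """True if the MSA is aligned and its width equals the sequence length.

    Assumption: the MSA columns correspond one-to-one to sequence positions.
    """
    return msa_is_aligned(msa_rows) and len(msa_rows[0]) == len(sequence)


wt_seq = "MKLLV"
good_msa = ["MKLLV", "MK-LV", "MRLLI"]   # three rows, all width 5
ragged_msa = ["MKLLV", "MKLV"]           # rows of different width

print(msa_is_aligned(good_msa), msa_matches_sequence(good_msa, wt_seq))
print(msa_is_aligned(ragged_msa), msa_matches_sequence(ragged_msa, wt_seq))
```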

You can also check which tools are available in your current installation:

from aide_predict.utils.checks import get_supported_tools
print(get_supported_tools())

Model Categories

The AIDE models covered here fall into two categories:

1. Zero-Shot Predictors

These models don’t require training data but may have other requirements:

# ESM2 - Requires only sequences (no MSA or structure)
from aide_predict import ESM2LikelihoodWrapper
model = ESM2LikelihoodWrapper(wt=wt)
model.fit()  # No training data needed; fit() is still called before predict()
scores = model.predict(sequences)

# MSATransformer - Requires MSA for the WT
from aide_predict import MSATransformerLikelihoodWrapper
model = MSATransformerLikelihoodWrapper(wt=wt) 
model.fit() 
scores = model.predict(sequences)

# SaProt - Can use structural information
from aide_predict import SaProtLikelihoodWrapper
model = SaProtLikelihoodWrapper(wt=wt) # wt must have structure
model.fit()
scores = model.predict(sequences)  # Will use structure if available

Other zero-shot predictors include:

  • HMM: Creates Hidden Markov Models from MSAs

  • EVMutation: Uses evolutionary couplings from MSAs

  • VESPA: Pre-trained model for variant effect prediction

  • EVE: Evolutionary model using latent space representations

  • SSEmb: Structure and sequence-based variant effect predictor

2. Embedding Models

These models convert sequences into numerical features for downstream ML:

# Simple one-hot encoding
from aide_predict import OneHotProteinEmbedding
embedder = OneHotProteinEmbedding()
X = embedder.fit_transform(sequences)

# Advanced language model embeddings
from aide_predict import ESM2Embedding
embedder = ESM2Embedding(pool=True)  # pool=True for sequence-level embeddings
X = embedder.fit_transform(sequences)

# K-mer based embeddings
from aide_predict import KmerEmbedding
embedder = KmerEmbedding(k=3)
X = embedder.fit_transform(sequences)
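
To illustrate the idea behind k-mer features, the sketch below counts overlapping k-mers with the Python standard library. It is a conceptual example only and is not how `KmerEmbedding` is implemented; the function name `kmer_counts` and the fixed 20-letter alphabet are assumptions made for the illustration.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def kmer_counts(sequence, k=2):
    """Return a fixed-size vector of overlapping k-mer counts.

    The vector has one entry per possible k-mer over AMINO_ACIDS,
    so its length is 20**k regardless of the sequence length.
    """
    vocabulary = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts[kmer] for kmer in vocabulary]


short_vec = kmer_counts("MKLLV")
long_vec = kmer_counts("MKLLVLGLPGAGKGT")
print(len(short_vec), len(long_vec))  # both vectors have 400 entries
```

Because the output size depends only on k, a representation of this kind can be computed for sequences of different lengths without alignment.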

Other embedding models include:

  • MSATransformerEmbedding: Produces embeddings using MSAs

  • SaProtEmbedding: Structure-aware protein language model embeddings

  • OneHotAlignedEmbedding: One-hot encodings for aligned sequences

Importance of Data Structure

The compatibility of models depends heavily on the structure of your data:

Data Characteristic        Compatible Models                        Incompatible Models
-------------------------  ---------------------------------------  -------------------------------------
Fixed-length sequences     All models                               -
Variable-length sequences  Models without RequiresFixedLengthMixin  Models with RequiresFixedLengthMixin
Has MSA                    All models                               -
No MSA                     Models without MSA requirements          MSATransformer, EVMutation, EVE
Has structure              All models                               -
No structure               Models without structure requirements    SaProt, SSEmb
Has wild-type              All models                               -
No wild-type               Models without WT requirements           Models with RequiresWTToFunctionMixin
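
The mixin names in the table suggest that requirements are declared through class inheritance. The sketch below shows how such a scheme could work in principle. It is a simplified illustration, not AIDE's code: the two mixin classes are local stand-ins that only reuse the names from the table, and the `Toy*` models and `split_by_compatibility` are hypothetical.

```python
class RequiresFixedLengthMixin:
    """Stand-in marker: the model needs all sequences to share one length."""


class RequiresWTToFunctionMixin:
    """Stand-in marker: the model needs a wild-type sequence."""


class ToyAlignedModel(RequiresFixedLengthMixin):
    pass


class ToyWTModel(RequiresWTToFunctionMixin):
    pass


class ToyFlexibleModel:
    pass


def split_by_compatibility(models, sequences, wt=None):
    """Sort model classes into compatible/incompatible based on their mixins."""
    fixed_length = len({len(s) for s in sequences}) == 1
    result = {"compatible": [], "incompatible": []}
    for model in models:
        ok = True
        if issubclass(model, RequiresFixedLengthMixin) and not fixed_length:
            ok = False  # variable-length data, model needs fixed length
        if issubclass(model, RequiresWTToFunctionMixin) and wt is None:
            ok = False  # no wild type supplied, model needs one
        result["compatible" if ok else "incompatible"].append(model.__name__)
    return result


toy_models = [ToyAlignedModel, ToyWTModel, ToyFlexibleModel]
# Variable-length sequences and no wild type
print(split_by_compatibility(toy_models, ["MKLLV", "MKL"]))
# Fixed-length sequences with a wild type
print(split_by_compatibility(toy_models, ["MKLLV", "MKLLA"], wt="MKLLV"))
```

Declaring requirements as marker classes keeps the check generic: a new model states its needs through inheritance, and the checking function does not have to change.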

Structuring your data to match a model's requirements (fixed-length or aligned sequences, an MSA, a structure, a wild-type sequence) determines which AIDE models you can use for a given modeling task.