Position-Specific Models
Overview
Some protein models can generate outputs for each amino acid position in a sequence, for example language model embeddings or one-hot encodings. These models use the PositionSpecificMixin to handle position selection and output formatting. Restricting the output to specific positions is useful when only a few positions change among variants, or when you have a specific hypothesis about the importance of certain positions.
Using Position-Specific Models
Position-specific models have three key parameters that control their output: positions, pool, and flatten. The pool and flatten options are mutually exclusive.
from aide_predict import ESM2Embedding

# Basic usage - outputs pooled across all positions
model = ESM2Embedding(
    positions=None,  # Consider all positions
    pool='mean',     # Average across positions
    flatten=False    # Must be False because we are pooling
)

# Position-specific - get embeddings for specific positions
model = ESM2Embedding(
    positions=[0, 1, 2],  # Only these positions
    pool=False,           # Keep positions separate
    flatten=True          # Concatenate per-position features into a single vector per sequence
)
Output Shapes
The output shape depends on the parameter combination:
# Example with ESM2 (1280-dimensional embeddings)
from aide_predict import ProteinSequences, ESM2Embedding

X = ProteinSequences.from_fasta("sequences.fasta")

# Default: pooled across positions
model = ESM2Embedding(pool=True)
output = model.transform(X)  # Shape: (n_sequences, 1280)

# Selected positions, no pooling
model = ESM2Embedding(
    positions=[0, 1, 2],
    pool=False
)
output = model.transform(X)  # Shape: (n_sequences, 3, 1280)

# Selected positions, no pooling, flattened
model = ESM2Embedding(
    positions=[0, 1, 2],
    pool=False,
    flatten=True
)
output = model.transform(X)  # Shape: (n_sequences, 3*1280)
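The shape rules above can be summarized in a small helper for reasoning about parameter combinations. This is only a sketch: `expected_shape` is a hypothetical function, not part of AIDE, and it assumes fixed-length sequences and that any truthy `pool` value collapses the position axis.

```python
from typing import Optional, Sequence, Tuple


def expected_shape(
    n_sequences: int,
    seq_len: int,
    embedding_dim: int,
    positions: Optional[Sequence[int]] = None,
    pool: bool = True,
    flatten: bool = False,
) -> Tuple[int, ...]:
    """Hypothetical helper: predict the output shape for a parameter combination."""
    if pool and flatten:
        raise ValueError("pool and flatten are mutually exclusive")
    # Selecting positions shrinks the position axis; None keeps every position.
    n_positions = seq_len if positions is None else len(positions)
    if pool:
        # Pooling collapses the position axis entirely.
        return (n_sequences, embedding_dim)
    if flatten:
        # Flattening merges the position and feature axes into one vector.
        return (n_sequences, n_positions * embedding_dim)
    return (n_sequences, n_positions, embedding_dim)


print(expected_shape(10, 100, 1280))                                       # (10, 1280)
print(expected_shape(10, 100, 1280, positions=[0, 1, 2], pool=False))      # (10, 3, 1280)
print(expected_shape(10, 100, 1280, [0, 1, 2], pool=False, flatten=True))  # (10, 3840)
```

Without pooling or flattening and with positions left as None, the sketch keeps the full position axis, so the shape depends on the sequence length.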
Position Specificity for Variable Length Sequences
In some cases models can be position specific even if not all sequences are the same length, such as when working with homologs. However, to map positions between sequences properly, we need to:
Know the positions of interest in a reference sequence (usually wild type)
Align all sequences
Map the reference positions to positions in the alignment
AIDE provides tools to handle this workflow:
from aide_predict import ProteinSequences, MSATransformerEmbedding

# Start with unaligned sequences
X = ProteinSequences.from_fasta("sequences.fasta")
wt = X['wt']
wt_positions = [1, 2, 3]  # 0-indexed positions of interest in the wild type

# Align sequences
X = X.align_all()
wt.msa = X

# Get the alignment mapping and convert wild-type positions to alignment columns
alignment_mapping = X.get_alignment_mapping()
wt_alignment_mapping = alignment_mapping[wt.id]  # or use str(hash(wt)) if no ID
# Index one position at a time so this works whether the mapping is a list or an array
aligned_positions = [wt_alignment_mapping[p] for p in wt_positions]

# Now use these positions in any position-specific model
model = MSATransformerEmbedding(
    positions=aligned_positions,
    pool=False,
    # wt supplies the alignment that incoming sequences are aligned to.
    # Alternatively, wt can be None if every sequence in X has its msa attribute set to X.
    wt=wt,
)
model.fit()
embeddings = model.transform(X)
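To make the position mapping step concrete, here is a minimal sketch, independent of AIDE, of how ungapped reference positions map to alignment columns. `map_to_alignment_columns` is a hypothetical helper, and the sketch assumes gaps are written as `-`. AIDE's own `get_alignment_mapping` may differ in return type and details.

```python
from typing import List


def map_to_alignment_columns(aligned_seq: str, gap: str = "-") -> List[int]:
    """Return, for each ungapped residue index, its column in the alignment."""
    return [col for col, char in enumerate(aligned_seq) if char != gap]


aligned_wt = "M-KT--AY"   # wild type "MKTAY" after alignment (toy example)
mapping = map_to_alignment_columns(aligned_wt)
wt_positions = [1, 2, 3]  # 0-indexed positions in the ungapped wild type
aligned_positions = [mapping[p] for p in wt_positions]
print(aligned_positions)  # [2, 3, 6]
```

Residues K, T, and A sit at ungapped indices 1, 2, and 3 but at alignment columns 2, 3, and 6, which is why the raw wild-type indices cannot be passed directly to a model operating on aligned sequences.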
Implementation Notes
If positions is specified but pool=True, the model will first select the positions, then pool across them.
flatten=True only applies when pool=False and there are multiple dimensions.
Models will raise an error if positions are specified but the sequences are neither aligned nor of fixed length.
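The select-then-pool ordering and the fixed-length requirement described in these notes can be sketched with plain Python lists. `select_and_reduce` is a hypothetical illustration, not AIDE's implementation, and it assumes mean pooling.

```python
from typing import List, Optional, Sequence

Matrix = List[List[float]]  # one sequence: positions x features


def select_and_reduce(
    batch: Sequence[Matrix],
    positions: Optional[Sequence[int]] = None,
    pool: bool = True,
    flatten: bool = False,
) -> list:
    """Hypothetical sketch: select positions first, then pool or flatten."""
    if pool and flatten:
        raise ValueError("pool and flatten are mutually exclusive")
    # Position indices are only meaningful if every sequence has the same length.
    if positions is not None and len({len(seq) for seq in batch}) > 1:
        raise ValueError("positions require aligned or fixed-length sequences")
    out = []
    for seq in batch:
        rows = seq if positions is None else [seq[p] for p in positions]
        if pool:
            # Mean over the selected positions only, one value per feature.
            out.append([sum(col) / len(rows) for col in zip(*rows)])
        elif flatten:
            # Concatenate the per-position feature vectors.
            out.append([value for row in rows for value in row])
        else:
            out.append(rows)
    return out


seq = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(select_and_reduce([seq], positions=[0, 1]))                            # [[2.0, 3.0]]
print(select_and_reduce([seq], positions=[0, 1], pool=False, flatten=True))  # [[1.0, 2.0, 3.0, 4.0]]
```

In the pooled call, the third position never contributes to the mean because selection happens before pooling.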