Position-Specific Models

Overview

Some protein models can generate outputs for each amino acid position in a sequence. These models use the PositionSpecificMixin to handle position selection and output formatting. EG. lanmguage models or one hot encodings. You might want to do this if only a few positions are changing among variants or you have a specific hypothesis about the importance of certain positions.

Using Position-Specific Models

Position-specific models have three key parameters that control their output. Flatten and pool are mutually exclusive.

from aide_predict import ESM2Embedding

# Basic usage - outputs pooled across all positions
model = ESM2Embedding(
    positions=None,  # Consider all positions
    pool='mean',      # Average across positions
    flatten=False    # because pooling by mean
)

# Position-specific - get embeddings for specific positions
model = ESM2Embedding(
    positions=[0, 1, 2],  # Only these positions
    pool=False,          # Keep positions separate
    flatten=True         # Flatten features for each position so we get a single vector
)

Output Shapes

The output shape depends on the parameter combination:

# Example with ESM2 (1280-dimensional embeddings)
X = ProteinSequences.from_fasta("sequences.fasta")

# Default: pooled across positions
model = ESM2Embedding(pool=True)
output = model.transform(X)  # Shape: (n_sequences, 1280)

# Selected positions, no pooling
model = ESM2Embedding(
    positions=[0, 1, 2],
    pool=False
)
output = model.transform(X)  # Shape: (n_sequences, 3, 1280)

# Selected positions, no pooling, flattened
model = ESM2Embedding(
    positions=[0, 1, 2],
    pool=False,
    flatten=True
)
output = model.transform(X)  # Shape: (n_sequences, 3*1280)

Position Specificity for Variable Length Sequences

In some cases models can be position specific even if not all sequences are the same length, such as when working with homologs. However, to map positions between sequences properly, we need to:

  1. Know the positions of interest in a reference sequence (usually wild type)

  2. Align all sequences

  3. Map the reference positions to positions in the alignment

AIDE provides tools to handle this workflow:

# Start with unaligned sequences
X = ProteinSequences.from_fasta("sequences.fasta")
wt = X['wt']
wt_positions = [1, 2, 3]  # 0-indexed positions of interest in wild type

# Align sequences
X = X.align_all()
wt.msa = X

# Get alignment mapping and convert positions
alignment_mapping = X.get_alignment_mapping()
wt_alignment_mapping = alignment_mapping[wt.id]  # or use str(hash(wt)) if no ID
aligned_positions = wt_alignment_mapping[wt_positions]

# Now use these positions in any position-specific model
model = MSATransformerEmbedding(
    positions=aligned_positions,
    pool=False,
    wt=wt,  # used to get the alignment to align incoming sequence to. Alternative, wt can be None if all seqs in X have the msa attribute set to X 

)
model.fit()
embeddings = model.transform(X)

Implementation Notes

  • If positions is specified but pool=True, the model will first select the positions then pool across them

  • flatten=True only applies when pool=False and there are multiple dimensions

  • Models will raise an error if positions are specified but the sequences are not aligned or of fixed length