Generating MSAs with MMseqs2
For problems where you have not already built an MSA with another tool (e.g. Jackhmmer, EVCouplings, MMseqs), AIDE provides a high-level wrapper for generating Multiple Sequence Alignments (MSAs) with MMseqs2, implementing a sensitive search similar to ColabFold's. This is useful when you need MSAs for models like EVMutation, MSATransformer, or EVE. The wrapper simply calls MMseqs2 with a few parameters set, so all credit should go to the authors of MMseqs2 and ColabFold:
Steinegger M and Soeding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi: 10.1038/nbt.3988 (2017).
Mirdita, M., Schütze, K., Moriwaki, Y. et al. ColabFold: making protein folding accessible to all. Nat Methods 19, 679–682 (2022). https://doi.org/10.1038/s41592-022-01488-1
Installation
Ensure MMseqs2 is installed and available in your PATH:
conda install -c bioconda mmseqs2
Download the ColabFold database(s) from https://colabfold.mmseqs.com/. You will need to pass the path to this database when running the search.
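To confirm that the MMseqs2 executable is actually visible on your PATH before starting a long search, a small check like the following can help. This is a minimal sketch using only the Python standard library; the helper name `find_mmseqs` is hypothetical and not part of AIDE.

```python
import shutil


def find_mmseqs(binary: str = "mmseqs") -> str:
    """Return the full path to the MMseqs2 executable, or raise if it is not on PATH.

    `find_mmseqs` is a hypothetical helper for illustration, not an AIDE function.
    """
    path = shutil.which(binary)
    if path is None:
        raise RuntimeError(
            f"'{binary}' was not found on PATH; "
            "install it with: conda install -c bioconda mmseqs2"
        )
    return path
```

Calling `find_mmseqs()` at the top of a script fails early with a readable message instead of partway through a search.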
Basic Usage
Python Interface
from aide_predict import ProteinSequences
from aide_predict.utils.mmseqs_msa_search import run_mmseqs_search
# Load sequences
sequences = ProteinSequences.from_fasta("proteins.fasta")
# Generate MSAs
msa_paths = run_mmseqs_search(
    sequences=sequences,
    uniref_db="path/to/uniref30_2302",
    output_dir="./msas"
)
# Load MSAs for use with models (ProteinSequences is already imported above)
msas = [ProteinSequences.from_a3m(path) for path in msa_paths]
Command Line Interface
You can also run MSA generation directly from the command line:
python -m aide_predict.utils.mmseqs_msa_search \
    proteins.fasta \
    path/to/uniref30_2302 \
    ./msas
Advanced Options
The search can be customized with several parameters:
msa_paths = run_mmseqs_search(
    sequences=sequences,
    uniref_db="path/to/uniref30_2302",
    output_dir="./msas",
    mode='sensitive',  # Search sensitivity: 'fast', 'standard', or 'sensitive'
    threads=8,         # Number of CPU threads
)
Command line equivalents:
python -m aide_predict.utils.mmseqs_msa_search \
    proteins.fasta \
    path/to/uniref30_2302 \
    ./msas \
    --mode sensitive \
    --threads 8 \
    --keep-tmp
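If you want to launch the command line interface from a Python script (for example inside a pipeline), the argument list can be assembled programmatically. The sketch below uses only the positional arguments and flags shown in the command above; the function name `build_msa_command` is hypothetical and not part of AIDE.

```python
import sys
from typing import List, Optional


def build_msa_command(
    fasta: str,
    uniref_db: str,
    output_dir: str,
    mode: Optional[str] = None,
    threads: Optional[int] = None,
    keep_tmp: bool = False,
) -> List[str]:
    """Build the argument list for the CLI shown above (hypothetical helper)."""
    cmd = [
        sys.executable, "-m", "aide_predict.utils.mmseqs_msa_search",
        fasta, uniref_db, output_dir,
    ]
    # Optional flags are appended only when explicitly requested.
    if mode is not None:
        cmd += ["--mode", mode]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if keep_tmp:
        cmd.append("--keep-tmp")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which raises an error if the search exits with a non-zero status.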
Search Modes
Three sensitivity modes are available:
fast
: Quick search with sensitivity 4.0
standard
: Balanced approach with sensitivity 5.7 (default)
sensitive
: More thorough search with sensitivity 7.5
Higher sensitivity will find more distant homologs but takes longer to run.
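The mode names map onto numeric sensitivity values, which MMseqs2 accepts through its `-s` option. The lookup below is a sketch of that mapping using the values listed above; the names `SENSITIVITY_BY_MODE` and `sensitivity_for` are hypothetical, and how AIDE stores or passes these values internally is an assumption.

```python
# Sensitivity values as documented for each search mode.
SENSITIVITY_BY_MODE = {
    "fast": 4.0,
    "standard": 5.7,  # default
    "sensitive": 7.5,
}


def sensitivity_for(mode: str = "standard") -> float:
    """Return the sensitivity value for a mode name (hypothetical helper)."""
    try:
        return SENSITIVITY_BY_MODE[mode]
    except KeyError:
        valid = ", ".join(sorted(SENSITIVITY_BY_MODE))
        raise ValueError(f"Unknown mode '{mode}'; expected one of: {valid}") from None
```

Validating the mode name up front gives a clear error for a typo such as `"sensitve"` before any search is started.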
Output Format
MSAs are generated in A3M format, one file per input sequence. The files are named based on the sequence IDs in your input FASTA file. These files can be directly used with AIDE’s MSA-based models:
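# Use MSA with a model
from aide_predict import ProteinSequences, MSATransformerLikelihoodWrapper
msa = ProteinSequences.from_a3m("msas/sequence1.a3m")
# wt is the wild-type sequence for this MSA; it must be defined before this point
model = MSATransformerLikelihoodWrapper(wt=wt)
model.fit(msa)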
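If you want to inspect an A3M file outside of AIDE, it helps to know that A3M marks insertions relative to the query as lowercase letters; dropping them leaves every sequence with the same number of columns as the query. The reader below is a minimal standard-library sketch of that convention, not AIDE's own parser, and the function name `read_a3m_aligned` is hypothetical.

```python
import io
from typing import List, TextIO, Tuple


def read_a3m_aligned(handle: TextIO) -> List[Tuple[str, str]]:
    """Read A3M records and drop lowercase insertion columns (hypothetical helper)."""
    records: List[Tuple[str, str]] = []
    name = None
    chunks: List[str] = []
    for raw in handle:
        line = raw.strip()
        # Skip blank lines and '#' comment/header lines.
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        else:
            # Lowercase letters are insertions relative to the query; remove them.
            chunks.append("".join(c for c in line if not c.islower()))
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


example = io.StringIO(">query\nMKV\n>hit1 some description\nM-aV\n")
aligned = read_a3m_aligned(example)
```

Here `aligned` holds the query and one hit, both three columns long after the inserted `a` is removed. For real files, pass an open file handle, e.g. `open("msas/sequence1.a3m")`.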