Contributing Models to AIDE
Overview
AIDE is designed to make it easy to wrap new protein prediction models into a scikit-learn compatible interface. This guide walks through the process of contributing a new model.
1. Setting Up Development Environment
git clone https://github.com/beckham-lab/aide_predict
cd aide_predict
conda env create -f environment.yaml
conda activate aide_predict
pip install -e ".[dev]" # Installs in editable mode with development dependencies
2. Understanding Model Dependencies
AIDE uses a tiered dependency system to minimize conflicts and installation complexity:
Base Dependencies: If your model only needs numpy, scipy, scikit-learn, etc., it can be included in the base package.
Optional Dependencies: If your model needs additional pip-installable packages:
Create or update a
requirements-<feature>.txt
fileExample:
requirements-transformers.txt
for models using HuggingFace transformers
Complex Dependencies: If your model requires a specific environment or complex setup:
Package should be installed separately
AIDE will call it via subprocess
Model checks for environment variables pointing to installation
Example: EVE model checking for
EVE_REPO
andEVE_CONDA_ENV
3. Creating the Model Class
Models should be placed in one of two directories:
aide_predict/bespoke_models/embedders/
: For models that create numerical featuresaide_predict/bespoke_models/predictors/
: For models that predict protein properties
Basic structure:
from aide_predict.bespoke_models.base import ProteinModelWrapper
from aide_predict.utils.common import MessageBool
# Check dependencies
try:
import some_required_package
AVAILABLE = MessageBool(True, "Model is available")
except ImportError:
AVAILABLE = MessageBool(False, "Requires some_required_package")
class MyModel(ProteinModelWrapper):
"""Documentation in NumPy style.
Parameters
----------
param1 : type
Description
metadata_folder : str, optional
Directory for model files
wt : ProteinSequence, optional
Wild-type sequence for comparative predictions
Attributes
----------
fitted_ : bool
Whether model has been fitted
"""
_available = AVAILABLE # Class attribute for availability
def __init__(self, param1, metadata_folder=None, wt=None, **kwargs):
super().__init__(metadata_folder=metadata_folder, wt=wt, **kwargs)
self.param1 = param1 # Save user parameters as attributes
def _fit(self, X, y=None):
"""Fit the model. Called by public fit() method."""
# Implementation
self.fitted_ = True # Mark as fitted
return self
def _transform(self, X):
"""Transform sequences. Called by public transform() method."""
# Implementation
return features
4. Adding Model Requirements with Mixins
AIDE uses mixins to declare model requirements and capabilities. Common mixins:
# Input requirements
RequiresMSAForFitMixin # Needs MSA for fit method. If not found, will attempt to fall back to WT sequence msa
RequiresWTMSAMixin # Needs a WT sequence with an msa
RequiresMSAPerSequenceMixin # The model needs msas, but can handle having different MSAs for each input. If inputs do not have MSAs, will attempt fall back to WT sequence msa
RequiresFixedLengthMixin # Sequences must be same length
RequiresStructureMixin # Uses structural information
RequiresWTToFunctionMixin # Needs wild-type sequence
RequiresWTDuringInferenceMixin # Model does its own normalization to any WT internally. If not inheritted, aide will automatically normalize outputs to any WT sequence provided
# Output capabilities
CanRegressMixin # Can predict numeric values
PositionSpecificMixin # Outputs per-position scores or embeddings
# Processing behavior
CacheMixin # Enables result caching
AcceptsLowerCaseMixin # Handles lowercase sequences
ExpectsNoFitMixin # Does not require any inputs to the fit method
ShouldRefitOnSequencesMixin # restore sklearn default behaviour to refit when fit is called or params are set. Be default, models do not refit.
Example with mixins:
class MyModel(
RequiresMSAMixin, # Needs MSA for training
CanRegressMixin, # Makes predictions
PositionSpecificMixin, # Per-position outputs
CacheMixin, # Caches results
ProteinModelWrapper # Always last
):
pass
Ensure that the _avialable
attribute is set to a valid MessageBool
object that is computed on import based on the availability of the model’s dependencies.
5. Testing Your Model
If applicable, add scientific validation tests in tests/test_not_base_models/
:
from aide_predict.bespoke_models.embedders.my_model import MyModel
def test_my_model_benchmark():
"""Test against published benchmark."""
model = MyModel()
score = model.score(benchmark_data)
assert score >= expected_performance
Run the tests with pytest tests/test_not_base_models/test_my_model.py
, and copy the results.
Ensure that this test is not tracked by coverage, as we do not run CI on non-base models that have additional dependencies:
Update .coveragerc
:
omit =
... other omitted files are here ...
aide_predict/bespoke_models/embedders/my_model.py
7. Expose your model so that AIDE can find it and test it against user data
Update aide_predict/bespoke_models/__init__.py
to include your model in the TOOLS
list:
from .embedders.my_model import MyModel
TOOLS = [
...other tools are here...
MyModel
]
7. Submitting Your Contribution
Create a new branch
Implement your model in its own module
Add any tests
Submit a pull request, add any test results to the pull request so the expected performance can be verified