---
title: Contributing Models to AIDE
---

# Contributing Models to AIDE

## Overview

AIDE is designed to make it easy to wrap new protein prediction models in a scikit-learn compatible interface. This guide walks through the process of contributing a new model.

### 1. Setting Up the Development Environment

```bash
git clone https://github.com/beckham-lab/aide_predict
cd aide_predict
conda env create -f environment.yaml
conda activate aide_predict
pip install -e ".[dev]"  # Installs in editable mode with development dependencies
```

### 2. Understanding Model Dependencies

AIDE uses a tiered dependency system to minimize conflicts and installation complexity:

1. **Base Dependencies**: If your model only needs numpy, scipy, scikit-learn, etc., it can be included in the base package.
2. **Optional Dependencies**: If your model needs additional pip-installable packages:
   - Create or update a `requirements-<name>.txt` file
   - Example: `requirements-transformers.txt` for models using HuggingFace transformers
3. **Complex Dependencies**: If your model requires a specific environment or complex setup (see the sketch below):
   - The package should be installed separately
   - AIDE will call it via subprocess
   - The model checks for environment variables pointing to the installation
   - Example: the EVE model checks for `EVE_REPO` and `EVE_CONDA_ENV`
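For the complex-dependency tier, the wrapper typically resolves the external installation from environment variables at import time and shells out to it when transforming. The following is a minimal sketch of that pattern, not the actual EVE wrapper: only `EVE_REPO` and `EVE_CONDA_ENV` come from the example above, while `MyComplexModel`, `run_model.py`, the command-line flags, and the use of `self.metadata_folder` are illustrative assumptions.

```python
import os
import subprocess

from aide_predict.bespoke_models.base import ProteinModelWrapper
from aide_predict.utils.common import MessageBool

# Availability is decided at import time from the environment variables that
# point at the external installation (EVE_REPO / EVE_CONDA_ENV mirror the EVE example above).
if "EVE_REPO" in os.environ and "EVE_CONDA_ENV" in os.environ:
    AVAILABLE = MessageBool(True, "External installation found")
else:
    AVAILABLE = MessageBool(False, "Set EVE_REPO and EVE_CONDA_ENV to use this model")


class MyComplexModel(ProteinModelWrapper):
    """Hypothetical wrapper that shells out to an externally installed tool."""

    _available = AVAILABLE

    def _transform(self, X):
        repo = os.environ["EVE_REPO"]
        conda_env = os.environ["EVE_CONDA_ENV"]
        # Run the external tool inside its own conda environment. The script
        # name and flags below are placeholders, not the real EVE command line.
        subprocess.run(
            [
                "conda", "run", "-n", conda_env,
                "python", os.path.join(repo, "run_model.py"),
                # metadata_folder is assumed to be stored by the base wrapper
                "--output", str(self.metadata_folder),
            ],
            check=True,
        )
        # Parse whatever the tool wrote to metadata_folder into a feature array here.
        raise NotImplementedError("Output parsing is model-specific")
```

Keeping the check at import time lets `_available` report a helpful message when the external installation is missing, matching the `MessageBool` pattern shown in section 3 below.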
### 3. Creating the Model Class

Models should be placed in one of two directories:

- `aide_predict/bespoke_models/embedders/`: for models that create numerical features
- `aide_predict/bespoke_models/predictors/`: for models that predict protein properties

Basic structure:

```python
from aide_predict.bespoke_models.base import ProteinModelWrapper
from aide_predict.utils.common import MessageBool

# Check dependencies
try:
    import some_required_package
    AVAILABLE = MessageBool(True, "Model is available")
except ImportError:
    AVAILABLE = MessageBool(False, "Requires some_required_package")


class MyModel(ProteinModelWrapper):
    """Documentation in NumPy style.

    Parameters
    ----------
    param1 : type
        Description
    metadata_folder : str, optional
        Directory for model files
    wt : ProteinSequence, optional
        Wild-type sequence for comparative predictions

    Attributes
    ----------
    fitted_ : bool
        Whether the model has been fitted
    """
    _available = AVAILABLE  # Class attribute for availability

    def __init__(self, param1, metadata_folder=None, wt=None, **kwargs):
        super().__init__(metadata_folder=metadata_folder, wt=wt, **kwargs)
        self.param1 = param1  # Save user parameters as attributes

    def _fit(self, X, y=None):
        """Fit the model. Called by the public fit() method."""
        # Implementation
        self.fitted_ = True  # Mark as fitted
        return self

    def _transform(self, X):
        """Transform sequences. Called by the public transform() method."""
        # Implementation
        return features
```

### 4. Adding Model Requirements with Mixins

AIDE uses mixins to declare model requirements and capabilities. Common mixins:

```python
# Input requirements
RequiresMSAForFitMixin  # Needs an MSA for the fit method; if none is found, attempts to fall back to the WT sequence's MSA
RequiresWTMSAMixin  # Needs a WT sequence that has an MSA
RequiresMSAPerSequenceMixin  # Needs MSAs, but can handle a different MSA for each input sequence; if inputs do not have MSAs, attempts to fall back to the WT sequence's MSA
RequiresFixedLengthMixin  # Sequences must be the same length
RequiresStructureMixin  # Uses structural information
RequiresWTToFunctionMixin  # Needs a wild-type sequence
RequiresWTDuringInferenceMixin  # Model does its own normalization to any WT internally; if not inherited, AIDE automatically normalizes outputs to any WT sequence provided

# Output capabilities
CanRegressMixin  # Can predict numeric values
PositionSpecificMixin  # Outputs per-position scores or embeddings

# Processing behavior
CacheMixin  # Enables result caching
AcceptsLowerCaseMixin  # Handles lowercase sequences
ExpectsNoFitMixin  # Does not require any inputs to the fit method
ShouldRefitOnSequencesMixin  # Restores the sklearn default behaviour of refitting when fit is called or params are set; by default, models do not refit
```

Example with mixins:

```python
class MyModel(
    RequiresMSAForFitMixin,  # Needs an MSA for fitting
    CanRegressMixin,         # Makes predictions
    PositionSpecificMixin,   # Per-position outputs
    CacheMixin,              # Caches results
    ProteinModelWrapper      # Always last
):
    pass
```

Ensure that the `_available` attribute is set to a valid `MessageBool` object that is computed on import based on the availability of the model's dependencies.

### 5. Testing Your Model

If applicable, add scientific validation tests in `tests/test_not_base_models/`:

```python
from aide_predict.bespoke_models.embedders.my_model import MyModel


def test_my_model_benchmark():
    """Test against a published benchmark."""
    model = MyModel()
    score = model.score(benchmark_data)
    assert score >= expected_performance
```

Run the tests with `pytest tests/test_not_base_models/test_my_model.py` and record the results so they can be included in your pull request. Because we do not run CI on non-base models that have additional dependencies, ensure your model is not tracked by coverage by updating `.coveragerc`:

```
omit =
    ... other omitted files are here ...
    aide_predict/bespoke_models/embedders/my_model.py
```

### 6. Exposing Your Model

Update `aide_predict/bespoke_models/__init__.py` to include your model in the `TOOLS` list so that AIDE can find it and test it against user data:

```python
from .embedders.my_model import MyModel

TOOLS = [
    # ... other tools are here ...
    MyModel,
]
```

### 7. Submitting Your Contribution

1. Create a new branch
2. Implement your model in its own module
3. Add any tests
4. Submit a pull request and include any test results so that the expected performance can be verified