Building ML Pipelines

AIDE models can be combined with standard scikit-learn components into pipelines. Here’s an example that combines one-hot encoding and ESM2 zero-shot (ZS) predictions with a random forest:

import numpy as np
from aide_predict import OneHotProteinEmbedding, ESM2LikelihoodWrapper, ProteinSequence, ProteinSequences
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.ensemble import RandomForestRegressor

# Load data
sequences = ProteinSequences.from_fasta("sequences.fasta")
y = np.load("activity_values.npy")

# Create wild type reference
wt = sequences["wild_type"]

# Named helper instead of a lambda, so the fitted pipeline can be pickled by joblib
def to_column(x):
    """Reshape a 1D array of scores into a single-column 2D array."""
    return x.reshape(-1, 1)

# Create feature union that combines raw OHE with scaled ESM2 scores
features = FeatureUnion([
    # One-hot encoding (keep as binary)
    ('ohe', OneHotProteinEmbedding(flatten=True)),

    # ESM2 features (apply scaling)
    ('esm2', Pipeline([
        ('predictor', ESM2LikelihoodWrapper(wt=wt, marginal_method="masked_marginal")),
        ('reshaper', FunctionTransformer(to_column)),
        ('scaler', StandardScaler())
    ]))
])

# Create and train pipeline
pipeline = Pipeline([
    ('features', features),
    ('rf', RandomForestRegressor())
])

pipeline.fit(sequences, y)
predictions = pipeline.predict(sequences)

The pipeline can be saved and loaded like any scikit-learn model:

from joblib import dump, load

dump(pipeline, 'protein_model.joblib')    # save the fitted pipeline
pipeline = load('protein_model.joblib')   # load it back later

Standard scikit-learn tools such as GridSearchCV and cross_val_score work with these pipelines.
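
As a rough illustration of that last point, the sketch below runs cross_val_score and GridSearchCV on a pipeline. To keep it runnable without sequence files or model weights, it uses a synthetic numeric matrix and a plain StandardScaler step as stand-ins (both are assumptions for this example, not part of AIDE); with the protein pipeline above, you would pass `sequences` and `y` instead. Nested parameters are addressed with scikit-learn's `<step name>__<parameter>` convention.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 60 samples, 5 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=60)

# Stand-in pipeline; in the protein example the first step would be the FeatureUnion
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestRegressor(n_estimators=20, random_state=0)),
])

# 3-fold cross-validated R^2 scores, one per fold
scores = cross_val_score(pipeline, X, y, cv=3, scoring='r2')

# Grid search over a parameter of the 'rf' step
search = GridSearchCV(pipeline, {'rf__max_depth': [2, None]}, cv=3, scoring='r2')
search.fit(X, y)
best_params = search.best_params_
```

After fitting, `search.best_estimator_` holds the pipeline refit on all the data with the selected parameters, and it can be saved with joblib in the same way as above.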