---
title: Sequence Optimization towards target function with BADASS
---

# Protein Optimization with BADASS

## Overview

AIDE integrates BADASS, an adaptive simulated annealing algorithm that efficiently explores protein sequence space to find variants with optimal properties. The BADASS algorithm was introduced in [this paper](https://www.biorxiv.org/content/10.1101/2024.10.25.620340v1) and has been adapted in AIDE to work with any of its protein prediction models.

## Installation

To use BADASS with AIDE, install the required dependencies:

```bash
pip install -r requirements-badass.txt
```

## Basic Usage

Here's a complete example of using BADASS with an ESM2 zero-shot predictor:

```python
from aide_predict import ProteinSequence, ESM2LikelihoodWrapper
from aide_predict.utils.badass import BADASSOptimizer, BADASSOptimizerParams

# 1. Define your protein sequence and prediction model
wt = ProteinSequence("MKLLVLGLPGAGKGTQAEKIVAAYGIPHISTGDMFRAAMKEGTPLGLQAKQYMDEGDLVPDEVTIGIVRERLSKDDCQNGFLLDGFPRTVAQAEALETMLADIASRLSALPPATQTRMILMVEDELRNLHRGQVLPSENTFRVADDNEETIKKIRQKYGNSSGVI")

# 2. Set up a prediction model
# Note that this can be a supervised model. In general, any ProteinModel or
# scikit-learn pipeline whose input models are ProteinModelWrapper can be used.
model = ESM2LikelihoodWrapper(wt=wt)
model.fit([])  # No training needed for zero-shot model

# 3. Configure optimization parameters
params = BADASSOptimizerParams(
    num_mutations=3,    # Maximum mutations per variant
    num_iter=100,       # Number of optimization iterations
    seqs_per_iter=200   # Sequences evaluated per iteration
)

# 4. Create and run the optimizer
optimizer = BADASSOptimizer(
    predictor=model.predict,
    reference_sequence=wt,
    params=params
)

# 5. Run optimization
# This returns protein variants as well as scores from the optimizer
# (which may be scaled and not equal to direct model outputs)
results_df, stats_df = optimizer.optimize()

# 6. Visualize the optimization process
optimizer.plot()

# 7. Print top variants
print(results_df.sort_values('scores', ascending=False).head(10))
```

## Optimization Parameters

BADASS behavior can be extensively customized through the `BADASSOptimizerParams` class:

```python
params = BADASSOptimizerParams(
    # Core parameters
    seqs_per_iter=500,              # Sequences per iteration
    num_iter=200,                   # Total optimization iterations
    num_mutations=5,                # Maximum mutations per variant
    init_score_batch_size=500,      # Batch size for initial scoring

    # Algorithm behavior
    temperature=1.5,                # Initial temperature
    cooling_rate=0.92,              # Cooling rate for SA
    seed=42,                        # Random seed
    gamma=0.5,                      # Variance boosting weight

    # Constraints
    sites_to_ignore=[1, 2, 3],      # Positions to exclude from mutation (1-indexed)

    # Advanced options
    normalize_scores=True,              # Normalize scores
    simple_simulated_annealing=False,   # Use simple SA without adaptation
    cool_then_heat=False,               # Use cooling-then-heating schedule
    adaptive_upper_threshold=None,      # Threshold for adaptivity (float for quantile, int for top N)
    n_seqs_to_keep=None,                # Number of sequences to keep in results
    score_threshold=None,               # Score threshold for phase transitions (auto-computed if None)
    reversal_threshold=None             # Score threshold for phase reversals (auto-computed if None)
)
```

## How BADASS Works

BADASS operates through the following key mechanisms:

1. **Initialization**: Computes a score matrix of all single-point mutations
2. **Sampling**: Uses Boltzmann sampling to generate candidate sequences
3. **Scoring**: Evaluates candidates with the provided predictor function
4. **Phase detection**: Identifies when the optimizer has found a promising region
5. **Adaptive temperature**: Adjusts temperature to balance exploration/exploitation
6. **Score normalization**: Standardizes scores for better comparison

During optimization, BADASS maintains several tracking matrices:

- Score matrix for each amino acid at each position
- Observation counts for statistical significance
- Variance estimates for uncertainty quantification

## Optimization Results

The `optimize()` method returns two DataFrames:

1. `results_df`: Contains information about all evaluated sequences:
   - `sequences`: Compact mutation representation (e.g., "M1L-K5R")
   - `scores`: Predicted fitness scores
   - `full_sequence`: Complete protein sequence
   - `counts`: Number of times each sequence was evaluated
   - `num_mutations`: Number of mutations in each sequence
   - `iteration`: When the sequence was first observed

2. `stats_df`: Contains statistics for each iteration:
   - `iteration`: Iteration number
   - `avg_score`: Average score per iteration
   - `var_score`: Variance of scores
   - `n_eff_joint`: Effective number of joint samples
   - `n_eff_sites`: Effective number of sites explored
   - `n_eff_aa`: Effective number of amino acids explored
   - `T`: Temperature at each iteration
   - `n_seqs`: Number of sequences evaluated
   - `n_new_seqs`: Number of new sequences evaluated
   - `num_phase_transitions`: Cumulative number of phase transitions

## Analyzing Results

After optimization, BADASS offers several visualization and analysis options:

```python
# Plot optimization progress
optimizer.plot()  # Creates multiple plots showing optimization trajectory

# Save results to CSV
optimizer.save_results("optimization_run")

# Get best sequences
best_sequences = results_df.sort_values('scores', ascending=False).head(10)

# Create a ProteinSequences object from best variants
from aide_predict import ProteinSequences
top_variants = ProteinSequences(best_sequences['full_sequence'].tolist())

# Further analyze with other AIDE tools
from aide_predict.utils.plotting import plot_mutation_heatmap
mutations = [seq.get_mutations(wt)[0] for seq in top_variants]
scores = best_sequences['scores'].values
plot_mutation_heatmap(mutations, scores)
```

The visualization includes:

1. Statistics by iteration (scores, effective samples, temperature)
2. Score distributions vs temperature
3. Score density distributions across early and late iterations

## Performance Considerations

- BADASS evaluates thousands of sequences, so efficient predictors are important
- For computationally expensive models, consider:
  - Using model caching (via `CacheMixin`)
  - Reducing `seqs_per_iter` and `num_iter`
  - Using batch processing in custom predictors
- Increasing `init_score_batch_size` for better initial sampling

## References

- BADASS: [biphasic annealing for diverse adaptive sequence sampling](https://www.biorxiv.org/content/10.1101/2024.10.25.620340v1)
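
## Example: Boltzmann Sampling with Cooling

The sampling and temperature steps listed under "How BADASS Works" can be illustrated with a small, standalone toy. This is a minimal sketch and not the BADASS implementation: it assumes sampling probabilities proportional to `exp(score / T)` over single-point mutations and a plain geometric cooling schedule (closer in spirit to simple simulated annealing than to the adaptive, phase-aware schedule BADASS uses). The names `boltzmann_weights`, `sample_mutations`, and the dictionary layout of `score_matrix` are hypothetical, and the random scores stand in for real model predictions.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def boltzmann_weights(scores, temperature):
    """Convert scores into probabilities proportional to exp(score / T)."""
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_mutations(score_matrix, temperature, n_samples, rng):
    """Draw (position, amino_acid) pairs from a toy score matrix.

    score_matrix maps (position, amino_acid) -> score (hypothetical layout).
    """
    keys = list(score_matrix)
    probs = boltzmann_weights([score_matrix[k] for k in keys], temperature)
    return rng.choices(keys, weights=probs, k=n_samples)


rng = random.Random(42)

# Toy score matrix for a 5-residue sequence; random values replace model scores.
score_matrix = {
    (pos, aa): rng.gauss(0.0, 1.0)
    for pos in range(1, 6)
    for aa in AMINO_ACIDS
}

temperature = 1.5    # same illustrative values as the parameter example
cooling_rate = 0.92
for iteration in range(5):
    batch = sample_mutations(score_matrix, temperature, n_samples=200, rng=rng)
    mean_score = sum(score_matrix[m] for m in batch) / len(batch)
    print(f"iter={iteration} T={temperature:.3f} mean sampled score={mean_score:.3f}")
    temperature *= cooling_rate  # assumed geometric cooling, for illustration only
```

At high temperature the weights are nearly uniform, so sampling explores broadly; as the temperature drops, probability mass concentrates on the highest-scoring mutations, which is the exploration/exploitation trade-off the temperature controls.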
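
## Example: Caching a Custom Predictor

The caching and batching advice under "Performance Considerations" can be sketched for a custom predictor function. AIDE models can use `CacheMixin` for this; the wrapper below is only a generic illustration for plain callables. It assumes the predictor receives a list of sequences that convert cleanly to strings and returns one score per sequence. `make_cached_predictor` and `toy_predict` are hypothetical names, and whether the optimizer expects a NumPy array rather than a list as the return value is not covered here.

```python
from typing import Callable, Dict, List, Sequence


def make_cached_predictor(
    predict_fn: Callable[[List[str]], Sequence[float]],
) -> Callable[[List[str]], List[float]]:
    """Wrap a batch predictor so each distinct sequence is scored only once."""
    cache: Dict[str, float] = {}

    def cached_predict(sequences: List[str]) -> List[float]:
        keys = [str(s) for s in sequences]
        # Unseen sequences, in order, with in-batch duplicates removed.
        missing = list(dict.fromkeys(k for k in keys if k not in cache))
        if missing:
            # One batched call for everything that still needs a score.
            for key, score in zip(missing, predict_fn(missing)):
                cache[key] = float(score)
        return [cache[k] for k in keys]

    return cached_predict


calls = []  # records each batch sent to the underlying predictor


def toy_predict(sequences):
    """Hypothetical stand-in for an expensive model: counts 'A' residues."""
    calls.append(list(sequences))
    return [seq.count("A") for seq in sequences]


predict = make_cached_predictor(toy_predict)
print(predict(["AAK", "MKL", "AAK"]))  # [2.0, 0.0, 2.0]
print(predict(["MKL", "GAV"]))         # [0.0, 1.0]
print(calls)                           # [['AAK', 'MKL'], ['GAV']]
```

Because simulated annealing tends to revisit sequences, a cache like this avoids paying for repeated evaluations, and grouping all unseen sequences into one call keeps the underlying model's batch processing effective.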