Related papers: Combination of digital signal processing and assembled predictive models facilitates the rational design of proteins

URL: http://arxiv.org/abs/2010.03516v1
Date: Wed, 7 Oct 2020 16:35:02 GMT
Title: Combination of digital signal processing and assembled predictive models facilitates the rational design of proteins
Authors: David Medina-Ortiz and Sebastian Contreras and Juan Amado-Hinojosa and Jorge Torres-Almonacid and Juan A. Asenjo and Marcelo Navarrete and Álvaro Olivera-Nappa
Abstract summary: Predicting the effect of mutations in proteins is one of the most critical challenges in protein engineering. We use clustering, embedding, and dimensionality reduction techniques to select combinations of physicochemical properties for the encoding stage. We then select the best performing predictive models in each set of properties and create an assembled model.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Predicting the effect of mutations in proteins is one of the most critical challenges in protein engineering; by knowing the effect that substituting one (or several) residues in a protein's sequence has on its overall properties, one could design a variant with a desirable function. New strategies and methodologies to create predictive models are continually being developed. However, those that claim to be general often do not reach adequate performance, and those that target a particular task improve their predictive performance at the cost of the method's generality. Moreover, these approaches typically require a particular decision on how to encode the amino acid sequence, and there is no explicit methodological agreement on how to make it. To address these issues, in this work, we applied clustering, embedding, and dimensionality reduction techniques to the AAIndex database to select meaningful combinations of physicochemical properties for the encoding stage. We then used the chosen sets of properties to obtain several encodings of the same sequence and applied the Fast Fourier Transform (FFT) to each of them. We explored machine-learning models in the frequency space, using different algorithms and hyperparameters. Finally, we selected the best performing predictive models for each set of properties and created an assembled model. We extensively tested the proposed methodology on different datasets and demonstrated that the generated assembled model achieved notably better performance metrics than models based on a single encoding and, in most cases, better than those previously reported. The proposed method is available as a Python library for non-commercial use under the GNU General Public License (GPLv3).
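The encoding-plus-FFT step described in the abstract can be illustrated with a minimal sketch. It is not the authors' library: the function names, the zero-padding length, and the choice of the Kyte-Doolittle hydropathy scale as a stand-in for a property selected from AAIndex are all assumptions made for illustration.

```python
import numpy as np

# Kyte-Doolittle hydropathy values. Assumption: used here only as one example
# of a physicochemical property; the paper selects properties from AAIndex.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_sequence(sequence, scale, length):
    """Map each residue to its property value and zero-pad to a fixed length."""
    values = [scale[residue] for residue in sequence.upper()]
    if len(values) > length:
        raise ValueError("sequence is longer than the padded length")
    padded = np.zeros(length)
    padded[: len(values)] = values
    return padded


def fft_spectrum(signal):
    """Return the magnitude spectrum of a real-valued signal."""
    return np.abs(np.fft.rfft(signal))


def encode_fft(sequence, scale=KYTE_DOOLITTLE, length=64):
    """Hypothetical helper: numeric encoding followed by the FFT."""
    return fft_spectrum(encode_sequence(sequence, scale, length))


# Example: a short, made-up peptide becomes a fixed-size feature vector.
features = encode_fft("MKTAYIAKQR")
print(features.shape)
```

Under these assumptions, every sequence padded to the same length yields a feature vector of the same size, which can then be passed to any standard regression or classification model. Repeating the encoding with a different property scale gives the additional frequency-space representations that the abstract combines into an assembled model.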

Related papers

Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models [68.57424628540907]
Large language models (LLMs) often develop learned mechanisms specialized to specific datasets. We introduce a fine-tuning approach designed to enhance generalization by identifying and pruning neurons associated with dataset-specific mechanisms. Our method employs Integrated Gradients to quantify each neuron's influence on high-confidence predictions, pinpointing those that disproportionately contribute to dataset-specific performance.
arXiv Detail & Related papers (2025-07-12T08:10:10Z)
Prot2Token: A Unified Framework for Protein Modeling via Next-Token Prediction [19.164841536081568]
We introduce Prot2Token, a unified framework that overcomes challenges by converting a wide spectrum of protein-related predictions. At its core, Prot2Token employs an autoregressive decoder, conditioned on embeddings from pre-trained protein encoders and guided by learnable task tokens. We present extensive experimental validation across a variety of benchmarks, demonstrating Prot2Token's strong predictive power in different types of protein-prediction tasks.
arXiv Detail & Related papers (2025-05-26T23:50:36Z)
Steering Generative Models with Experimental Data for Protein Fitness Optimization [22.131533900376457]
Protein fitness optimization involves finding a sequence that maximizes desired quantitative properties in a large design space of possible sequences. Recent developments in steering protein generative models (e.g. diffusion models, language models) offer a promising approach. We show that plug-and-play guidance strategies offer advantages compared to alternatives such as reinforcement learning with protein language models.
arXiv Detail & Related papers (2025-05-21T04:30:48Z)
Functional Graphical Models: Structure Enables Offline Data-Driven Optimization [111.28605744661638]
We show how structure can enable sample-efficient data-driven optimization. We also present a data-driven optimization algorithm that infers the FGM structure itself.
arXiv Detail & Related papers (2024-01-08T22:33:14Z)
Best-Subset Selection in Generalized Linear Models: A Fast and Consistent Algorithm via Splicing Technique [0.6338047104436422]
Best subset selection has been widely regarded as the Holy Grail of problems of this type. We proposed and illustrated an algorithm for best subset recovery in mild conditions. Our implementation achieves approximately a fourfold speedup compared to popular variable selection toolkits.
arXiv Detail & Related papers (2023-08-01T03:11:31Z)
Protein Design with Guided Discrete Diffusion [67.06148688398677]
A popular approach to protein design is to combine a generative model with a discriminative model for conditional sampling. We propose diffusioN Optimized Sampling (NOS), a guidance method for discrete diffusion models. NOS makes it possible to perform design directly in sequence space, circumventing significant limitations of structure-based methods.
arXiv Detail & Related papers (2023-05-31T16:31:24Z)
HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models. We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
Fourier Representations for Black-Box Optimization over Categorical Variables [34.0277529502051]
We propose to use existing methods in conjunction with a surrogate model for the black-box evaluations over purely categorical variables. To learn such representations, we consider two different settings to update our surrogate model. Numerical experiments over synthetic benchmarks as well as real-world RNA sequence optimization and design problems demonstrate the representational power of the proposed methods.
arXiv Detail & Related papers (2022-02-08T08:14:58Z)
Conservative Objective Models for Effective Offline Model-Based Optimization [78.19085445065845]
Computational design problems arise in a number of settings, from synthetic biology to computer architectures. We propose a method that learns a model of the objective function that lower bounds the actual value of the ground-truth objective on out-of-distribution inputs. COMs are simple to implement and outperform a number of existing methods on a wide range of MBO problems.
arXiv Detail & Related papers (2021-07-14T17:55:28Z)
Adaptive machine learning for protein engineering [0.4568777157687961]
We discuss how to use a sequence-to-function machine-learning surrogate model to select sequences for experimental measurement. First, we discuss how to select sequences through a single round of machine-learning optimization. Then, we discuss sequential optimization, where the goal is to discover optimized sequences and improve the model across multiple rounds of training, optimization, and experimental measurement.
arXiv Detail & Related papers (2021-06-10T02:56:35Z)
Evolutionary Variational Optimization of Generative Models [0.0]
We combine two popular optimization approaches to derive learning algorithms for generative models: variational optimization and evolutionary algorithms. We show that evolutionary algorithms can effectively and efficiently optimize the variational bound. In the category of "zero-shot" learning, we observed the evolutionary variational algorithm to significantly improve the state-of-the-art in many benchmark settings.
arXiv Detail & Related papers (2020-12-22T19:06:33Z)
AdaLead: A simple and robust adaptive greedy search algorithm for sequence design [55.41644538483948]
We develop an easy-to-implement, scalable, and robust evolutionary greedy algorithm (AdaLead). AdaLead is a remarkably strong benchmark that out-competes more complex state of the art approaches in a variety of biologically motivated sequence design challenges.
arXiv Detail & Related papers (2020-10-05T16:40:38Z)
Fast differentiable DNA and protein sequence optimization for molecular design [0.0]
Machine learning models that accurately predict biological fitness from sequence are becoming a powerful tool for molecular design. Here, we build on a previously proposed straight-through approximation method to optimize through discrete sequence samples. The resulting algorithm, which we call Fast SeqProp, achieves up to 100-fold faster convergence compared to previous versions.
arXiv Detail & Related papers (2020-05-22T17:03:55Z)
Stepwise Model Selection for Sequence Prediction via Deep Kernel Learning [100.83444258562263]
We propose a novel Bayesian optimization (BO) algorithm to tackle the challenge of model selection in this setting. In order to solve the resulting multiple black-box function optimization problem jointly and efficiently, we exploit potential correlations among black-box functions. We are the first to formulate the problem of stepwise model selection (SMS) for sequence prediction, and to design and demonstrate an efficient joint-learning algorithm for this purpose.
arXiv Detail & Related papers (2020-01-12T09:42:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.