Related papers: A primer on model-guided exploration of fitness landscapes for biological sequence design

A primer on model-guided exploration of fitness landscapes for biological sequence design

URL: http://arxiv.org/abs/2010.10614v2
Date: Fri, 23 Oct 2020 14:25:05 GMT
Title: A primer on model-guided exploration of fitness landscapes for biological sequence design
Authors: Sam Sinai and Eric D Kelsic
Abstract summary: In this primer we highlight that algorithms for experimental design, what we call "exploration strategies", are a related, yet distinct problem from building good models of sequence-to-function maps. This primer can serve as a starting point for researchers from different domains that are interested in the problem of searching a sequence space with a model.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Machine learning methods are increasingly employed to address challenges faced by biologists. One area that will greatly benefit from this cross-pollination is the problem of biological sequence design, which has massive potential for therapeutic applications. However, significant inefficiencies remain in communication between these fields which result in biologists finding the progress in machine learning inaccessible, and hinder machine learning scientists from contributing to impactful problems in bioengineering. Sequence design can be seen as a search process on a discrete, high-dimensional space, where each sequence is associated with a function. This sequence-to-function map is known as a "Fitness Landscape". Designing a sequence with a particular function is hence a matter of "discovering" such a (often rare) sequence within this space. Today we can build predictive models with good interpolation ability due to impressive progress in the synthesis and testing of biological sequences in large numbers, which enables model training and validation. However, it often remains a challenge to find useful sequences with the properties that we like using these models. In particular, in this primer we highlight that algorithms for experimental design, what we call "exploration strategies", are a related, yet distinct problem from building good models of sequence-to-function maps. We review advances and insights from current literature -- by no means a complete treatment -- while highlighting desirable features of optimal model-guided exploration, and cover potential pitfalls drawn from our own experience. This primer can serve as a starting point for researchers from different domains that are interested in the problem of searching a sequence space with a model, but are perhaps unaware of approaches that originate outside their field.

Related papers

GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
ProtGO: A Transformer based Fusion Model for accurately predicting Gene Ontology (GO) Terms from full scale Protein Sequences [0.11049608786515838]
We propose a transformer-based fusion model capable of predicting Gene Ontology terms from full-scale protein sequences. The model is able to understand both short and long term dependencies within the enzyme's structure and can precisely identify the motifs associated with the various GO terms.
arXiv Detail & Related papers (2024-12-08T02:09:45Z)
A Learning Search Algorithm for the Restricted Longest Common Subsequence Problem [40.64116457007417]
The Restricted Longest Common Subsequence (RLCS) problem has significant applications in bioinformatics. This paper introduces two novel approaches designed to enhance the search process by steering it towards promising regions. An important contribution of this paper is found in the generation of real-world instances where scientific abstracts serve as input strings.
arXiv Detail & Related papers (2024-10-15T20:02:15Z)
Towards Statistically Significant Taxonomy Aware Co-location Pattern Detection [4.095979270829907]
The goal is to find subsets of feature types or their parents whose spatial interaction is statistically significant. The problem is computationally challenging due to the exponential number of candidate co-location patterns generated by the taxonomy. This paper introduces two methods for incorporating and assessing the statistical significance of co-location patterns.
arXiv Detail & Related papers (2024-06-29T04:48:39Z)
Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues. We incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space. A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv Detail & Related papers (2024-03-02T00:56:05Z)
ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab [67.24684071577211]
The challenge of replicating research results has posed a significant impediment to the field of molecular biology. We first curate a comprehensive multimodal dataset, named ProBio, as an initial step towards this objective. Next, we devise two challenging benchmarks, transparent solution tracking and multimodal action recognition, to emphasize the unique characteristics and difficulties associated with activity understanding in BioLab settings.
arXiv Detail & Related papers (2023-11-01T14:44:01Z)
Causal machine learning for single-cell genomics [94.28105176231739]
We discuss the application of machine learning techniques to single-cell genomics and their challenges. We first present the model that underlies most of current causal approaches to single-cell biology. We then identify open problems in the application of causal approaches to single-cell data.
arXiv Detail & Related papers (2023-10-23T13:35:24Z)
Amortized Inference for Causal Structure Learning [72.84105256353801]
Learning causal structure poses a search problem that typically involves evaluating structures using a score or independence test. We train a variational inference model to predict the causal structure from observational/interventional data. Our models exhibit robust generalization capabilities under substantial distribution shift.
arXiv Detail & Related papers (2022-05-25T17:37:08Z)
Interpretable Structured Learning with Sparse Gated Sequence Encoder for Protein-Protein Interaction Prediction [2.9488233765621295]
Predicting protein-protein interactions (PPIs) by learning informative representations from amino acid sequences is a challenging yet important problem in biology. We present a novel deep framework to model and predict PPIs from sequence alone. Our model incorporates a bidirectional gated recurrent unit to learn sequence representations by leveraging contextualized and sequential information from sequences.
arXiv Detail & Related papers (2020-10-16T17:13:32Z)
AdaLead: A simple and robust adaptive greedy search algorithm for sequence design [55.41644538483948]
We develop an easy-to-directed, scalable, and robust evolutionary greedy algorithm (AdaLead) AdaLead is a remarkably strong benchmark that out-competes more complex state of the art approaches in a variety of biologically motivated sequence design challenges.
arXiv Detail & Related papers (2020-10-05T16:40:38Z)
A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention [96.77554122595578]
We introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference. Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
arXiv Detail & Related papers (2020-06-22T08:35:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.