A primer on model-guided exploration of fitness landscapes for
biological sequence design
- URL: http://arxiv.org/abs/2010.10614v2
- Date: Fri, 23 Oct 2020 14:25:05 GMT
- Title: A primer on model-guided exploration of fitness landscapes for
biological sequence design
- Authors: Sam Sinai and Eric D Kelsic
- Abstract summary: In this primer we highlight that algorithms for experimental design, what we call "exploration strategies", are a related, yet distinct problem from building good models of sequence-to-function maps.
This primer can serve as a starting point for researchers from different domains that are interested in the problem of searching a sequence space with a model.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning methods are increasingly employed to address challenges
faced by biologists. One area that will greatly benefit from this
cross-pollination is the problem of biological sequence design, which has
massive potential for therapeutic applications. However, significant
inefficiencies remain in communication between these fields which result in
biologists finding the progress in machine learning inaccessible, and hinder
machine learning scientists from contributing to impactful problems in
bioengineering. Sequence design can be seen as a search process on a discrete,
high-dimensional space, where each sequence is associated with a function. This
sequence-to-function map is known as a "Fitness Landscape". Designing a
sequence with a particular function is hence a matter of "discovering" such a
(often rare) sequence within this space. Today we can build predictive models
with good interpolation ability due to impressive progress in the synthesis and
testing of biological sequences in large numbers, which enables model training
and validation. However, using these models to find useful sequences with the
properties we want often remains a challenge. In particular, in this
primer we highlight that algorithms for experimental design, what we call
"exploration strategies", are a related, yet distinct problem from building
good models of sequence-to-function maps. We review advances and insights from
current literature -- by no means a complete treatment -- while highlighting
desirable features of optimal model-guided exploration, and cover potential
pitfalls drawn from our own experience. This primer can serve as a starting
point for researchers from different domains that are interested in the problem
of searching a sequence space with a model, but are perhaps unaware of
approaches that originate outside their field.
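
The distinction the abstract draws between the model and the exploration strategy can be illustrated with a small toy loop. The sketch below is not the authors' method: the hidden-target landscape, the additive site-wise surrogate, and the names `true_fitness`, `fit_model`, `predict`, `mutate`, and `explore` are all assumptions made for illustration. It alternates between fitting a surrogate on measured sequences and using a simple greedy rule to choose the next batch to "measure".

```python
import random

ALPHABET = "ACGT"


def true_fitness(seq, target):
    """Toy stand-in for a wet-lab measurement (assumption): the fraction
    of positions where seq matches a hidden target sequence."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def fit_model(measured):
    """Fit an additive surrogate: mean measured fitness per (position, letter)."""
    sums, counts = {}, {}
    for seq, y in measured.items():
        for site in enumerate(seq):  # site is a (position, letter) tuple
            sums[site] = sums.get(site, 0.0) + y
            counts[site] = counts.get(site, 0) + 1
    site_means = {site: sums[site] / counts[site] for site in sums}
    global_mean = sum(measured.values()) / len(measured)
    return site_means, global_mean


def predict(model, seq):
    """Average the per-site means; unseen sites fall back to the global mean."""
    site_means, global_mean = model
    total = sum(site_means.get(site, global_mean) for site in enumerate(seq))
    return total / len(seq)


def mutate(seq, rng):
    """Replace one randomly chosen position with a random letter."""
    i = rng.randrange(len(seq))
    return seq[:i] + rng.choice(ALPHABET) + seq[i + 1:]


def explore(target, rounds=5, batch_size=8, n_candidates=200, seed=0):
    """Model-guided greedy exploration: propose mutants of the best measured
    sequences, rank them with the surrogate, and measure the top batch."""
    rng = random.Random(seed)
    start = "".join(rng.choice(ALPHABET) for _ in target)
    measured = {start: true_fitness(start, target)}
    for _ in range(rounds):
        model = fit_model(measured)
        parents = sorted(measured, key=measured.get, reverse=True)[:batch_size]
        candidates = {mutate(rng.choice(parents), rng) for _ in range(n_candidates)}
        candidates.difference_update(measured)  # never re-measure a sequence
        # Sort alphabetically first so ties in predicted fitness break reproducibly.
        ranked = sorted(sorted(candidates), key=lambda s: predict(model, s), reverse=True)
        for seq in ranked[:batch_size]:
            measured[seq] = true_fitness(seq, target)
    return measured


if __name__ == "__main__":
    results = explore("ACGTACGTACGT")
    best = max(results, key=results.get)
    print(best, results[best])
```

Here the surrogate (`fit_model`, `predict`) and the proposal and selection rule inside `explore` are separate pieces, so either can be swapped without touching the other. The purely greedy ranking is the simplest possible choice and does nothing to encourage diversity in the measured batch.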
Related papers
- A Learning Search Algorithm for the Restricted Longest Common Subsequence Problem [40.64116457007417]
The Restricted Longest Common Subsequence (RLCS) problem has significant applications in bioinformatics.
This paper introduces two novel approaches designed to enhance the search process by steering it towards promising regions.
An important contribution of this paper is found in the generation of real-world instances where scientific abstracts serve as input strings.
arXiv Detail & Related papers (2024-10-15T20:02:15Z)
- Towards Statistically Significant Taxonomy Aware Co-location Pattern Detection [4.095979270829907]
The goal is to find subsets of feature types or their parents whose spatial interaction is statistically significant.
The problem is computationally challenging due to the exponential number of candidate co-location patterns generated by the taxonomy.
This paper introduces two methods for incorporating and assessing the statistical significance of co-location patterns.
arXiv Detail & Related papers (2024-06-29T04:48:39Z)
- Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues.
We incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space.
A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv Detail & Related papers (2024-03-02T00:56:05Z)
- ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab [67.24684071577211]
The challenge of replicating research results has posed a significant impediment to the field of molecular biology.
We first curate a comprehensive multimodal dataset, named ProBio, as an initial step towards this objective.
Next, we devise two challenging benchmarks, transparent solution tracking and multimodal action recognition, to emphasize the unique characteristics and difficulties associated with activity understanding in BioLab settings.
arXiv Detail & Related papers (2023-11-01T14:44:01Z)
- Causal machine learning for single-cell genomics [94.28105176231739]
We discuss the application of machine learning techniques to single-cell genomics and their challenges.
We first present the model that underlies most current causal approaches to single-cell biology.
We then identify open problems in the application of causal approaches to single-cell data.
arXiv Detail & Related papers (2023-10-23T13:35:24Z)
- Amortized Inference for Causal Structure Learning [72.84105256353801]
Learning causal structure poses a search problem that typically involves evaluating structures using a score or independence test.
We train a variational inference model to predict the causal structure from observational/interventional data.
Our models exhibit robust generalization capabilities under substantial distribution shift.
arXiv Detail & Related papers (2022-05-25T17:37:08Z)
- Interpretable Structured Learning with Sparse Gated Sequence Encoder for Protein-Protein Interaction Prediction [2.9488233765621295]
Predicting protein-protein interactions (PPIs) by learning informative representations from amino acid sequences is a challenging yet important problem in biology.
We present a novel deep framework to model and predict PPIs from sequence alone.
Our model incorporates a bidirectional gated recurrent unit to learn sequence representations by leveraging contextualized and sequential information from sequences.
arXiv Detail & Related papers (2020-10-16T17:13:32Z)
- AdaLead: A simple and robust adaptive greedy search algorithm for sequence design [55.41644538483948]
We develop an easy-to-implement, scalable, and robust evolutionary greedy algorithm (AdaLead).
AdaLead is a remarkably strong benchmark that out-competes more complex state-of-the-art approaches in a variety of biologically motivated sequence design challenges.
arXiv Detail & Related papers (2020-10-05T16:40:38Z)
- A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention [96.77554122595578]
We introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference.
Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
arXiv Detail & Related papers (2020-06-22T08:35:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.