Diversifying Design of Nucleic Acid Aptamers Using Unsupervised Machine
Learning
- URL: http://arxiv.org/abs/2208.05341v1
- Date: Wed, 10 Aug 2022 13:30:58 GMT
- Authors: Siba Moussa, Michael Kilgour, Clara Jans, Alex Hernandez-Garcia,
Miroslava Cuperlovic-Culf, Yoshua Bengio, and Lena Simine
- Abstract summary: Inverse design of short single-stranded RNA and DNA sequences (aptamers) is the task of finding sequences that satisfy a set of desired criteria.
We propose to use an unsupervised machine learning model known as the Potts model to discover new, useful sequences with controllable sequence diversity.
- Score: 54.247560894146105
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inverse design of short single-stranded RNA and DNA sequences (aptamers) is
the task of finding sequences that satisfy a set of desired criteria. Relevant
criteria may be, for example, the presence of specific folding motifs, binding
to molecular ligands, sensing properties, etc. Most practical approaches to
aptamer design identify a small set of promising candidate sequences using
high-throughput experiments (e.g. SELEX), and then optimize performance by
introducing only minor modifications to the empirically found candidates.
Sequences that possess the desired properties but differ drastically in
chemical composition will add diversity to the search space and facilitate the
discovery of useful nucleic acid aptamers. Systematic diversification protocols
are needed. Here we propose to use an unsupervised machine learning model known
as the Potts model to discover new, useful sequences with controllable sequence
diversity. We start by training a Potts model using the maximum entropy
principle on a small set of empirically identified sequences unified by a
common feature. To generate new candidate sequences with a controllable degree
of diversity, we take advantage of the model's spectral feature: an energy
bandgap separating sequences that are similar to the training set from those
that are distinct. By controlling the Potts energy range that is sampled, we
generate sequences that are distinct from the training set yet still likely to
have the encoded features. To demonstrate performance, we apply our approach to
design diverse pools of sequences with specified secondary structure motifs in
30-mer RNA and DNA aptamers.
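The sampling idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the sign convention of the energy, the random placeholder parameters `h` and `J` (standing in for a trained model), the helper names, and the restricted Metropolis scheme are all assumptions made for illustration. The sketch scores a sequence with a Potts energy and then samples single-site mutations while rejecting any proposal whose energy leaves a chosen window.

```python
import numpy as np

ALPHABET = "ACGU"  # RNA alphabet (assumption); use "ACGT" for DNA
Q = len(ALPHABET)


def encode(seq):
    """Map a sequence string to an integer array."""
    return np.array([ALPHABET.index(c) for c in seq], dtype=int)


def decode(x):
    """Map an integer array back to a sequence string."""
    return "".join(ALPHABET[i] for i in x)


def potts_energy(x, h, J):
    """Assumed convention: E(x) = -sum_i h[i, x_i] - sum_{i<j} J[i, j, x_i, x_j]."""
    n = len(x)
    energy = -h[np.arange(n), x].sum()
    for i in range(n):
        for j in range(i + 1, n):
            energy -= J[i, j, x[i], x[j]]
    return energy


def sample_in_window(h, J, e_min, e_max, x0, n_steps, beta=1.0, rng=None):
    """Metropolis sampling restricted to e_min <= E(x) <= e_max.

    Proposals are single-site mutations; any proposal whose energy falls
    outside the window is rejected, so the chain never leaves it.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    e = potts_energy(x, h, J)
    if not (e_min <= e <= e_max):
        raise ValueError("starting sequence must lie inside the energy window")
    samples = []
    for _ in range(n_steps):
        prop = x.copy()
        prop[rng.integers(len(x))] = rng.integers(Q)
        e_new = potts_energy(prop, h, J)
        accept_prob = np.exp(min(0.0, -beta * (e_new - e)))
        if e_min <= e_new <= e_max and rng.random() < accept_prob:
            x, e = prop, e_new
        samples.append((decode(x), e))
    return samples


# Placeholder parameters; a real model would be fitted to the training set.
rng = np.random.default_rng(0)
L = 30
h = rng.normal(scale=0.1, size=(L, Q))
J = rng.normal(scale=0.05, size=(L, L, Q, Q))
x0 = encode("GGGAAACCCUUUGGGAAACCCUUUGGGAAA")  # hypothetical 30-mer
e0 = potts_energy(x0, h, J)
samples = sample_in_window(h, J, e0 - 1.0, e0 + 1.0, x0, n_steps=200, rng=rng)
```

Shifting the window `[e_min, e_max]` away from the energies of the training sequences is, in this sketch, the knob that trades similarity to the training set for diversity.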
Related papers
- Reinforcement Learning for Sequence Design Leveraging Protein Language Models [14.477268882311991]
We propose to use protein language models (PLMs) as a reward function to generate new sequences.
We perform extensive experiments on various sequence lengths to benchmark RL-based approaches.
We provide comprehensive evaluations of the biological plausibility and diversity of the generated proteins.
Date: 2024-07-03T14:31:36Z
- Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
Date: 2024-07-03T10:31:30Z
- Mutual Exclusivity Training and Primitive Augmentation to Induce Compositionality [84.94877848357896]
Recent datasets expose the lack of systematic generalization ability in standard sequence-to-sequence models.
We analyze this behavior of seq2seq models and identify two contributing factors: a lack of mutual exclusivity bias and the tendency to memorize whole examples.
We show substantial empirical improvements using standard sequence-to-sequence models on two widely-used compositionality datasets.
Date: 2022-11-28T17:36:41Z
- A Pareto-optimal compositional energy-based model for sampling and optimization of protein sequences [55.25331349436895]
Deep generative models have emerged as a popular machine learning-based approach for inverse problems in the life sciences.
These problems often require sampling new designs that satisfy multiple properties of interest in addition to learning the data distribution.
Date: 2022-10-19T19:04:45Z
- Reprogramming Pretrained Language Models for Antibody Sequence Infilling [72.13295049594585]
Computational design of antibodies involves generating novel and diverse sequences, while maintaining structural consistency.
Recent deep learning models have shown impressive results; however, the limited number of known antibody sequence/structure pairs frequently leads to degraded performance.
In our work, we address this challenge by leveraging Model Reprogramming (MR), which repurposes models pretrained on a source language for tasks that are in a different language and have scarce data.
Date: 2022-10-05T20:44:55Z
- Retrieval-based Controllable Molecule Generation [63.44583084888342]
We propose a new retrieval-based framework for controllable molecule generation.
We use a small set of molecules to steer the pre-trained generative model towards synthesizing molecules that satisfy the given design criteria.
Our approach is agnostic to the choice of generative models and requires no task-specific fine-tuning.
Date: 2022-08-23T17:01:16Z
- Biological Sequence Design with GFlowNets [75.1642973538266]
Design of de novo biological sequences with desired properties often involves an active loop with several rounds of molecule ideation and expensive wet-lab evaluations.
This makes the diversity of proposed candidates a key consideration in the ideation phase.
We propose an active learning algorithm leveraging uncertainty estimation and the recently proposed GFlowNets as a generator of diverse candidate solutions.
Date: 2022-03-02T15:53:38Z
- HpGAN: Sequence Search with Generative Adversarial Networks [21.770047587104923]
This article proposes a novel method, called HpGAN, to search for desired sequences algorithmically using generative adversarial networks (GANs).
HpGAN uses the idea of a zero-sum game to train a generative model that can generate sequences with characteristics similar to the training sequences.
Date: 2020-12-10T13:05:20Z