Biological Sequence Design with GFlowNets
- URL: http://arxiv.org/abs/2203.04115v3
- Date: Wed, 24 May 2023 11:21:54 GMT
- Title: Biological Sequence Design with GFlowNets
- Authors: Moksh Jain, Emmanuel Bengio, Alex-Hernandez Garcia, Jarrid
Rector-Brooks, Bonaventure F. P. Dossou, Chanakya Ekbote, Jie Fu, Tianyu
Zhang, Micheal Kilgour, Dinghuai Zhang, Lena Simine, Payel Das, Yoshua Bengio
- Abstract summary: Design of de novo biological sequences with desired properties often involves an active loop with several rounds of molecule ideation and expensive wet-lab evaluations.
This makes the diversity of proposed candidates a key consideration in the ideation phase.
We propose an active learning algorithm leveraging uncertainty estimation and the recently proposed GFlowNets as a generator of diverse candidate solutions.
- Score: 75.1642973538266
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Design of de novo biological sequences with desired properties, like protein
and DNA sequences, often involves an active loop with several rounds of
molecule ideation and expensive wet-lab evaluations. These experiments can
consist of multiple stages, with increasing levels of precision and cost of
evaluation, where candidates are filtered. This makes the diversity of proposed
candidates a key consideration in the ideation phase. In this work, we propose
an active learning algorithm leveraging epistemic uncertainty estimation and
the recently proposed GFlowNets as a generator of diverse candidate solutions,
with the objective of obtaining a diverse batch of useful (as defined by some
utility function, for example, the predicted anti-microbial activity of a
peptide) and informative candidates after each round. We also propose a scheme
to incorporate existing labeled datasets of candidates, in addition to a reward
function, to speed up learning in GFlowNets. We present empirical results on
several biological sequence design tasks, and we find that our method generates
more diverse and novel batches with high-scoring candidates compared to
existing approaches.
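The loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation, and every name in it is hypothetical. It makes three loud assumptions: a bootstrap ensemble of per-letter score tables stands in for epistemic uncertainty estimation, an upper-confidence-bound score (mean plus standard deviation) stands in for the acquisition function, and exact sampling proportional to that score over a fully enumerated toy space stands in for a trained GFlowNet, whose target is a distribution proportional to reward.

```python
import itertools
import random
import statistics

ALPHABET = "ACGT"  # toy alphabet (assumption)
LENGTH = 4         # toy sequence length, small enough to enumerate (assumption)
ALL_SEQS = ["".join(p) for p in itertools.product(ALPHABET, repeat=LENGTH)]


def oracle(seq):
    """Hypothetical stand-in for an expensive wet-lab measurement."""
    return seq.count("A") + 0.5 * seq.count("G")


def fit_ensemble(data, rng, n_models=5):
    """Fit n_models bootstrap models; each is a per-letter average score table."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]  # bootstrap resample
        totals = {c: 0.0 for c in ALPHABET}
        counts = {c: 0 for c in ALPHABET}
        for seq, y in sample:
            for c in seq:
                totals[c] += y / len(seq)
                counts[c] += 1
        models.append(
            {c: totals[c] / counts[c] if counts[c] else 0.0 for c in ALPHABET}
        )
    return models


def predict(models, seq):
    """Return ensemble mean and spread (used here as epistemic uncertainty)."""
    preds = [sum(m[c] for c in seq) for m in models]
    return statistics.mean(preds), statistics.pstdev(preds)


def acquisition(models, seq, kappa=1.0):
    """UCB-style reward, clipped to stay strictly positive."""
    mean, std = predict(models, seq)
    return max(mean + kappa * std, 1e-6)


def sample_batch(models, labeled, batch_size, rng):
    """Draw distinct unlabeled sequences with probability proportional to reward.

    A trained GFlowNet would approximate this distribution without enumeration.
    """
    pool = [s for s in ALL_SEQS if s not in labeled]
    weights = [acquisition(models, s) for s in pool]
    batch = set()
    while len(batch) < batch_size:
        batch.add(rng.choices(pool, weights=weights, k=1)[0])
    return sorted(batch)


def active_learning(rounds=3, batch_size=8, seed=0):
    """Run a few rounds: fit proxy, propose a batch, query the oracle."""
    rng = random.Random(seed)
    data = [(s, oracle(s)) for s in rng.sample(ALL_SEQS, batch_size)]
    for _ in range(rounds):
        models = fit_ensemble(data, rng)
        labeled = {s for s, _ in data}
        batch = sample_batch(models, labeled, batch_size, rng)
        data.extend((s, oracle(s)) for s in batch)
    return data


if __name__ == "__main__":
    results = active_learning()
    best = max(results, key=lambda pair: pair[1])
    print(len(results), "labeled sequences; best so far:", best)
```

Sampling in proportion to the reward, rather than taking the top-scoring candidates, is what keeps each batch spread across several promising regions instead of collapsing onto one.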
Related papers
- Efficient Biological Data Acquisition through Inference Set Design [3.9633147697178996]
In this work, we aim to select the smallest set of candidates in order to achieve some desired level of accuracy for the system as a whole.
We call this mechanism inference set design, and propose the use of a confidence-based active learning solution to prune out challenging examples.
arXiv Detail & Related papers (2024-10-25T15:34:03Z)
- Sample-efficient Multi-objective Molecular Optimization with GFlowNets [5.030493242666028]
We propose a multi-objective Bayesian optimization (MOBO) algorithm leveraging the hypernetwork-based GFlowNets (HN-GFN).
Using a single preference-conditioned hypernetwork, HN-GFN learns to explore various trade-offs between objectives.
Experiments in various real-world settings demonstrate that our framework predominantly outperforms existing methods in terms of candidate quality and sample efficiency.
arXiv Detail & Related papers (2023-02-08T13:30:28Z)
- Multi-Objective GFlowNets [59.16787189214784]
We study the problem of generating diverse candidates in the context of Multi-Objective Optimization.
In many applications of machine learning such as drug discovery and material design, the goal is to generate candidates which simultaneously optimize a set of potentially conflicting objectives.
We propose Multi-Objective GFlowNets (MOGFNs), a novel method for generating diverse optimal solutions, based on GFlowNets.
arXiv Detail & Related papers (2022-10-23T16:15:36Z)
- Diversifying Design of Nucleic Acid Aptamers Using Unsupervised Machine Learning [54.247560894146105]
Inverse design of short single-stranded RNA and DNA sequences (aptamers) is the task of finding sequences that satisfy a set of desired criteria.
We propose to use an unsupervised machine learning model known as the Potts model to discover new, useful sequences with controllable sequence diversity.
arXiv Detail & Related papers (2022-08-10T13:30:58Z)
- Exploiting Diversity of Unlabeled Data for Label-Efficient Semi-Supervised Active Learning [57.436224561482966]
Active learning is a research area that addresses the issues of expensive labeling by selecting the most important samples for labeling.
We introduce a new diversity-based initial dataset selection algorithm to select the most informative set of samples for initial labeling in the active learning setting.
Also, we propose a novel active learning query strategy, which uses diversity-based sampling on consistency-based embeddings.
arXiv Detail & Related papers (2022-07-25T16:11:55Z)
- Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation [110.09855163856326]
This paper is about the problem of learning a policy for generating an object from a sequence of actions.
We propose GFlowNet, based on a view of the generative process as a flow network.
We prove that any global minimum of the proposed objectives yields a policy which samples from the desired distribution.
arXiv Detail & Related papers (2021-06-08T14:21:10Z)
- Online Active Model Selection for Pre-trained Classifiers [72.84853880948894]
We design an online selective sampling approach that actively selects informative examples to label and outputs the best model with high probability at any round.
Our algorithm can be used for online prediction tasks for both adversarial and stochastic streams.
arXiv Detail & Related papers (2020-10-19T19:53:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.