Active Learning on Synthons for Molecular Design
- URL: http://arxiv.org/abs/2505.12913v1
- Date: Mon, 19 May 2025 09:48:02 GMT
- Title: Active Learning on Synthons for Molecular Design
- Authors: Tom George Grigg, Mason Burlage, Oliver Brook Scott, Adam Taouil, Dominique Sydow, Liam Wilbraham,
- Abstract summary: We introduce Scalable Active Learning via Synthon Acquisition (SALSA), a simple algorithm applicable to multi-vector expansion.<n>SALSA extends pool-based active learning to non-enumerable spaces by factoring modeling and acquisition over synthon or fragment choices.<n>We show that SALSA-generated molecules have comparable chemical property profiles to known bioactives, and exhibit greater diversity and higher scores over an industry-leading generative approach.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Exhaustive virtual screening is highly informative but often intractable against the expensive objective functions involved in modern drug discovery. This problem is exacerbated in combinatorial contexts such as multi-vector expansion, where molecular spaces can quickly become ultra-large. Here, we introduce Scalable Active Learning via Synthon Acquisition (SALSA): a simple algorithm applicable to multi-vector expansion which extends pool-based active learning to non-enumerable spaces by factoring modeling and acquisition over synthon or fragment choices. Through experiments on ligand- and structure-based objectives, we highlight SALSA's sample efficiency, and its ability to scale to spaces of trillions of compounds. Further, we demonstrate application toward multi-parameter objective design tasks on three protein targets - finding SALSA-generated molecules have comparable chemical property profiles to known bioactives, and exhibit greater diversity and higher scores over an industry-leading generative approach.
Related papers
- Active Learning-Guided Seq2Seq Variational Autoencoder for Multi-target Inhibitor Generation [0.0]
We propose a structured active learning paradigm to balance chemical diversity, molecular quality, and multi-target affinity.<n>Our method alternates between expanding chemically feasible regions of latent space and progressively constraining molecules.<n>We demonstrate that careful timing and strategic placement of chemical filters within this active learning pipeline markedly enhance exploration of beneficial chemical space.
arXiv Detail & Related papers (2025-06-18T09:39:51Z) - Biology Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models [51.316001071698224]
We introduce Biology-Instructions, the first large-scale multi-omics biological sequences-related instruction-tuning dataset.<n>This dataset can bridge the gap between large language models (LLMs) and complex biological sequences-related tasks.<n>We also develop a strong baseline called ChatMultiOmics with a novel three-stage training pipeline.
arXiv Detail & Related papers (2024-12-26T12:12:23Z) - Many-Shot In-Context Learning for Molecular Inverse Design [56.65345962071059]
Large Language Models (LLMs) have demonstrated great performance in few-shot In-Context Learning (ICL)
We develop a new semi-supervised learning method that overcomes the lack of experimental data available for many-shot ICL.
As we show, the new method greatly improves upon existing ICL methods for molecular design while being accessible and easy to use for scientists.
arXiv Detail & Related papers (2024-07-26T21:10:50Z) - RGFN: Synthesizable Molecular Generation Using GFlowNets [51.33672611338754]
We propose Reaction-GFlowNet, an extension of the GFlowNet framework that operates directly in the space of chemical reactions.
RGFN allows out-of-the-box synthesizability while maintaining comparable quality of generated candidates.
We demonstrate the effectiveness of the proposed approach across a range of oracle models, including pretrained proxy models and GPU-accelerated docking.
arXiv Detail & Related papers (2024-06-01T13:11:11Z) - Latent Chemical Space Searching for Plug-in Multi-objective Molecule Generation [9.442146563809953]
We develop a versatile 'plug-in' molecular generation model that incorporates objectives related to target affinity, drug-likeness, and synthesizability.
We identify PSO-ENP as the optimal variant for multi-objective molecular generation and optimization.
arXiv Detail & Related papers (2024-04-10T02:37:24Z) - Active Causal Learning for Decoding Chemical Complexities with Targeted Interventions [0.0]
We introduce an active learning approach that discerns underlying cause-effect relationships through strategic sampling.
This method identifies the smallest subset of the dataset capable of encoding the most information representative of a much larger chemical space.
The identified causal relations are then leveraged to conduct systematic interventions, optimizing the design task within a chemical space that the models have not encountered previously.
arXiv Detail & Related papers (2024-04-05T17:15:48Z) - Implicit Geometry and Interaction Embeddings Improve Few-Shot Molecular
Property Prediction [53.06671763877109]
We develop molecular embeddings that encode complex molecular characteristics to improve the performance of few-shot molecular property prediction.
Our approach leverages large amounts of synthetic data, namely the results of molecular docking calculations.
On multiple molecular property prediction benchmarks, training from the embedding space substantially improves Multi-Task, MAML, and Prototypical Network few-shot learning performance.
arXiv Detail & Related papers (2023-02-04T01:32:40Z) - Diversifying Design of Nucleic Acid Aptamers Using Unsupervised Machine
Learning [54.247560894146105]
Inverse design of short single-stranded RNA and DNA sequences (aptamers) is the task of finding sequences that satisfy a set of desired criteria.
We propose to use an unsupervised machine learning model known as the Potts model to discover new, useful sequences with controllable sequence diversity.
arXiv Detail & Related papers (2022-08-10T13:30:58Z) - De novo design of protein target specific scaffold-based Inhibitors via
Reinforcement Learning [8.210294479991118]
Current approaches to develop molecules for a target protein are intuition-driven, hampered by slow iterative design-test cycles.
We propose a novel framework, called 3D-MolGNN$_RL$, coupling reinforcement learning to a deep generative model based on 3D-Scaffold.
Our approach can serve as an interpretable artificial intelligence (AI) tool for lead optimization with optimized activity, potency, and biophysical properties.
arXiv Detail & Related papers (2022-05-21T00:47:35Z) - ChemoVerse: Manifold traversal of latent spaces for novel molecule
discovery [0.7742297876120561]
It is essential to identify molecular structures with the desired chemical properties.
Recent advances in generative models using neural networks and machine learning are being widely used to design virtual libraries of drug-like compounds.
arXiv Detail & Related papers (2020-09-29T12:11:40Z) - Stochastic-based Neural Network hardware acceleration for an efficient
ligand-based virtual screening [0.6431253679501663]
Virtual Screening studies how to identify molecular compounds with the highest probability to present biological activity for a therapeutic target.
Due to the vast number of small organic compounds and the thousands of targets for which such large-scale screening can potentially be carried out, there has been an increasing interest in the research community to increase both, processing speed and energy efficiency in the screening of molecular databases.
In this work, we present a classification model describing each molecule with a single energy-based vector and propose a machine-learning system based on the use of ANNs.
arXiv Detail & Related papers (2020-06-03T20:18:15Z) - Learning To Navigate The Synthetically Accessible Chemical Space Using
Reinforcement Learning [75.95376096628135]
We propose a novel forward synthesis framework powered by reinforcement learning (RL) for de novo drug design.
In this setup, the agent learns to navigate through the immense synthetically accessible chemical space.
We describe how the end-to-end training in this study represents an important paradigm in radically expanding the synthesizable chemical space.
arXiv Detail & Related papers (2020-04-26T21:40:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.