Pretrained Joint Predictions for Scalable Batch Bayesian Optimization of Molecular Designs
- URL: http://arxiv.org/abs/2511.10590v2
- Date: Fri, 14 Nov 2025 15:52:01 GMT
- Title: Pretrained Joint Predictions for Scalable Batch Bayesian Optimization of Molecular Designs
- Authors: Miles Wang-Henderson, Benjamin Kaufman, Edward Williams, Ryan Pederson, Matteo Rossi, Owen Howell, Carl Underkoffler, Narbe Mardirossian, John Parkhill,
- Abstract summary: We show how to obtain scalable probabilistic surrogates of binding affinity for use in Batch Bayesian Optimization.<n>This demands parallel acquisition functions that hedge between designs and the ability to rapidly sample from a joint predictive density to approximate them.<n>Key to this work is an investigation into the importance of prior networks in ENNs and how to pretrain them on synthetic data to improve downstream performance.
- Score: 1.3505106522886807
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Batched synthesis and testing of molecular designs is the key bottleneck of drug development. There has been great interest in leveraging biomolecular foundation models as surrogates to accelerate this process. In this work, we show how to obtain scalable probabilistic surrogates of binding affinity for use in Batch Bayesian Optimization (Batch BO). This demands parallel acquisition functions that hedge between designs and the ability to rapidly sample from a joint predictive density to approximate them. Through the framework of Epistemic Neural Networks (ENNs), we obtain scalable joint predictive distributions of binding affinity on top of representations taken from large structure-informed models. Key to this work is an investigation into the importance of prior networks in ENNs and how to pretrain them on synthetic data to improve downstream performance in Batch BO. Their utility is demonstrated by rediscovering known potent EGFR inhibitors on a semi-synthetic benchmark in up to 5x fewer iterations, as well as potent inhibitors from a real-world small-molecule library in up to 10x fewer iterations, offering a promising solution for large-scale drug discovery applications.
Related papers
- Learning a Generative Meta-Model of LLM Activations [75.30161960337892]
We create "meta-models" that learn the distribution of a network's internal states.<n>Applying the meta-model's learned prior to steering interventions improves fluency, with larger gains as loss decreases.<n>These results suggest generative meta-models offer a scalable path toward interpretability without restrictive structural assumptions.
arXiv Detail & Related papers (2026-02-06T18:59:56Z) - Generative Multi-Objective Bayesian Optimization with Scalable Batch Evaluations for Sample-Efficient De Novo Molecular Design [1.8517039579627974]
This work introduces an alternative, modular "generate-then-optimize" framework for de novo molecular design/discovery.<n>We benchmark the framework against state-of-the-art latent-space and discrete molecular optimization methods.<n>Specifically, in a case study related to sustainable energy storage, we show that our approach quickly uncovers novel, diverse, and high-performing organic (quinone-based) cathode materials.
arXiv Detail & Related papers (2025-12-19T14:59:27Z) - Informing Acquisition Functions via Foundation Models for Molecular Discovery [23.39033856519757]
We propose a likelihood-free BO method that bypasses explicit surrogate modeling and directly leverages priors from general LLMs and chemistry-specific foundation models to inform acquisition functions.<n>Our method also learns a tree-structured partition of the molecular search space with local acquisition functions, enabling efficient candidate selection via Monte Carlo Tree Search.
arXiv Detail & Related papers (2025-12-15T22:19:21Z) - Reimagining Target-Aware Molecular Generation through Retrieval-Enhanced Aligned Diffusion [22.204642926984526]
READ is introduced, which is the first to merge molecular Retrieval-Augmented Generation with an SE(3)-equivariant diffusion model.<n>It can achieve very competitive performance in CBGBench, surpassing state-of-the-art generative models and even native scaffolds.
arXiv Detail & Related papers (2025-06-17T13:09:11Z) - BetterBodies: Reinforcement Learning guided Diffusion for Antibody
Sequence Design [10.6101758867776]
Antibodies offer great potential for the treatment of various diseases.
The discovery of therapeutic antibodies through traditional wet lab methods is expensive and time-consuming.
The use of generative models in designing antibodies holds great promise, as it can reduce the time and resources required.
arXiv Detail & Related papers (2024-09-09T13:06:01Z) - Aligning Target-Aware Molecule Diffusion Models with Exact Energy Optimization [147.7899503829411]
AliDiff is a novel framework to align pretrained target diffusion models with preferred functional properties.
It can generate molecules with state-of-the-art binding energies with up to -7.07 Avg. Vina Score.
arXiv Detail & Related papers (2024-07-01T06:10:29Z) - YZS-model: A Predictive Model for Organic Drug Solubility Based on Graph Convolutional Networks and Transformer-Attention [9.018408514318631]
Traditional methods often miss complex molecular structures, leading to inaccuracies.
We introduce the YZS-Model, a deep learning framework integrating Graph Convolutional Networks (GCN), Transformer architectures, and Long Short-Term Memory (LSTM) networks.
YZS-Model achieved an $R2$ of 0.59 and an RMSE of 0.57, outperforming benchmark models.
arXiv Detail & Related papers (2024-06-27T12:40:29Z) - Learning Energy-Based Models by Cooperative Diffusion Recovery Likelihood [64.95663299945171]
Training energy-based models (EBMs) on high-dimensional data can be both challenging and time-consuming.
There exists a noticeable gap in sample quality between EBMs and other generative frameworks like GANs and diffusion models.
We propose cooperative diffusion recovery likelihood (CDRL), an effective approach to tractably learn and sample from a series of EBMs.
arXiv Detail & Related papers (2023-09-10T22:05:24Z) - Protein Design with Guided Discrete Diffusion [67.06148688398677]
A popular approach to protein design is to combine a generative model with a discriminative model for conditional sampling.
We propose diffusioN Optimized Sampling (NOS), a guidance method for discrete diffusion models.
NOS makes it possible to perform design directly in sequence space, circumventing significant limitations of structure-based methods.
arXiv Detail & Related papers (2023-05-31T16:31:24Z) - Drug Synergistic Combinations Predictions via Large-Scale Pre-Training
and Graph Structure Learning [82.93806087715507]
Drug combination therapy is a well-established strategy for disease treatment with better effectiveness and less safety degradation.
Deep learning models have emerged as an efficient way to discover synergistic combinations.
Our framework achieves state-of-the-art results in comparison with other deep learning-based methods.
arXiv Detail & Related papers (2023-01-14T15:07:43Z) - LIMO: Latent Inceptionism for Targeted Molecule Generation [14.391216237573369]
We present Latent Inceptionism on Molecules (LIMO), which significantly accelerates molecule generation with an inceptionism-like technique.
Comprehensive experiments show that LIMO performs competitively on benchmark tasks.
One of our generated drug-like compounds has a predicted $K_D$ of $6 cdot 10-14$ M against the human estrogen receptor.
arXiv Detail & Related papers (2022-06-17T21:05:58Z) - MIMOSA: Multi-constraint Molecule Sampling for Molecule Optimization [51.00815310242277]
generative models and reinforcement learning approaches made initial success, but still face difficulties in simultaneously optimizing multiple drug properties.
We propose the MultI-constraint MOlecule SAmpling (MIMOSA) approach, a sampling framework to use input molecule as an initial guess and sample molecules from the target distribution.
arXiv Detail & Related papers (2020-10-05T20:18:42Z) - Deep Learning for Virtual Screening: Five Reasons to Use ROC Cost
Functions [80.12620331438052]
deep learning has become an important tool for rapid screening of billions of molecules in silico for potential hits containing desired chemical features.
Despite its importance, substantial challenges persist in training these models, such as severe class imbalance, high decision thresholds, and lack of ground truth labels in some datasets.
We argue in favor of directly optimizing the receiver operating characteristic (ROC) in such cases, due to its robustness to class imbalance.
arXiv Detail & Related papers (2020-06-25T08:46:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.