Informing Acquisition Functions via Foundation Models for Molecular Discovery
- URL: http://arxiv.org/abs/2512.13935v1
- Date: Mon, 15 Dec 2025 22:19:21 GMT
- Title: Informing Acquisition Functions via Foundation Models for Molecular Discovery
- Authors: Qi Chen, Fabio Ramos, Alán Aspuru-Guzik, Florian Shkurti
- Abstract summary: We propose a likelihood-free BO method that bypasses explicit surrogate modeling and directly leverages priors from general LLMs and chemistry-specific foundation models to inform acquisition functions. Our method also learns a tree-structured partition of the molecular search space with local acquisition functions, enabling efficient candidate selection via Monte Carlo Tree Search.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bayesian Optimization (BO) is a key methodology for accelerating molecular discovery by estimating the mapping from molecules to their properties while seeking the optimal candidate. Typically, BO iteratively updates a probabilistic surrogate model of this mapping and optimizes acquisition functions derived from the model to guide molecule selection. However, its performance is limited in low-data regimes with insufficient prior knowledge and vast candidate spaces. Large language models (LLMs) and chemistry foundation models offer rich priors to enhance BO, but high-dimensional features, costly in-context learning, and the computational burden of deep Bayesian surrogates hinder their full utilization. To address these challenges, we propose a likelihood-free BO method that bypasses explicit surrogate modeling and directly leverages priors from general LLMs and chemistry-specific foundation models to inform acquisition functions. Our method also learns a tree-structured partition of the molecular search space with local acquisition functions, enabling efficient candidate selection via Monte Carlo Tree Search. By further incorporating coarse-grained LLM-based clustering, it substantially improves scalability to large candidate sets by restricting acquisition function evaluations to clusters with statistically higher property values. We show through extensive experiments and ablations that the proposed method substantially improves scalability, robustness, and sample efficiency in LLM-guided BO for molecular discovery.
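The likelihood-free idea in the abstract can be illustrated with a toy sketch: candidates are ranked by an acquisition function that combines a foundation-model prior score with a simple distance-based exploration bonus, so no probabilistic surrogate is ever fit. Everything here is hypothetical, not the paper's implementation: `prior_score` is an analytic stand-in for an LLM or chemistry foundation-model prior, and the one-dimensional objective simulates an expensive property evaluation.

```python
import math

# Hypothetical stand-in for a foundation-model prior over candidates.
# In the paper's setting this score would come from an LLM or chemistry
# foundation model; here it is an analytic bump (higher = more promising).
def prior_score(x: float) -> float:
    return math.exp(-(x - 0.7) ** 2 / 0.1)

# Simulated expensive black-box property evaluation.
def objective(x: float) -> float:
    return math.exp(-(x - 0.72) ** 2 / 0.05)

def acquisition(x, observed, beta=1.0):
    """Likelihood-free acquisition: combine the prior score with the best
    nearby observation and a distance-based exploration bonus, without
    fitting an explicit probabilistic surrogate."""
    if not observed:
        return prior_score(x)
    novelty = min(abs(x - xo) for xo, _ in observed)
    best_near = max((y for xo, y in observed if abs(x - xo) < 0.2), default=0.0)
    return prior_score(x) + best_near + beta * novelty

candidates = [i / 99 for i in range(100)]
observed = []
for _ in range(10):  # small evaluation budget
    x = max(candidates, key=lambda c: acquisition(c, observed))
    observed.append((x, objective(x)))
    candidates.remove(x)

best_x, best_y = max(observed, key=lambda p: p[1])
print(f"best candidate {best_x:.2f} with value {best_y:.3f}")
```

Because the prior already concentrates near good regions, the toy loop finds a near-optimal candidate in a handful of queries; the paper's method additionally organizes such evaluations with a tree-structured partition searched via MCTS.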
Related papers
- DrugR: Optimizing Molecular Drugs through LLM-based Explicit Reasoning
DrugR is a large language model that introduces explicit, step-by-step pharmacological reasoning into the optimization process. Our approach integrates domain-specific continual pretraining, supervised fine-tuning via reverse data engineering, and self-balanced multi-granular reinforcement learning. Experimental results demonstrate that DrugR achieves comprehensive enhancement across multiple properties without compromising structural similarity or target binding affinity.
arXiv Detail & Related papers (2026-02-09T02:26:25Z) - Generative Multi-Objective Bayesian Optimization with Scalable Batch Evaluations for Sample-Efficient De Novo Molecular Design
This work introduces an alternative, modular "generate-then-optimize" framework for de novo molecular design and discovery. We benchmark the framework against state-of-the-art latent-space and discrete molecular optimization methods. Specifically, in a case study related to sustainable energy storage, we show that our approach quickly uncovers novel, diverse, and high-performing organic (quinone-based) cathode materials.
arXiv Detail & Related papers (2025-12-19T14:59:27Z) - Can Molecular Foundation Models Know What They Don't Know? A Simple Remedy with Preference Optimization
We introduce Molecular-Aligned Preference Instance Ranking (Mole-PAIR), a plug-and-play module that can be flexibly integrated with existing foundation models. We show that our approach significantly improves the OOD detection capabilities of existing molecular foundation models.
arXiv Detail & Related papers (2025-09-29T21:06:52Z) - ChemBOMAS: Accelerated BO in Chemistry with LLM-Enhanced Multi-Agent System
We introduce ChemBOMAS, a large language model (LLM)-enhanced multi-agent system that accelerates Bayesian optimization. The data-driven strategy involves an 8B-scale LLM regressor fine-tuned on a mere 1% of labeled samples. The knowledge-driven strategy employs a hybrid Retrieval-Augmented Generation approach to guide the LLM in dividing the search space. ChemBOMAS sets a new state of the art, accelerating optimization efficiency by up to 5-fold compared to baseline methods.
arXiv Detail & Related papers (2025-09-10T16:24:08Z) - Understanding Reinforcement Learning-Based Fine-Tuning of Diffusion Models: A Tutorial and Review
This tutorial provides a comprehensive survey of methods for fine-tuning diffusion models to optimize downstream reward functions.
We explain the application of various RL algorithms, including PPO, differentiable optimization, reward-weighted MLE, value-weighted sampling, and path consistency learning.
arXiv Detail & Related papers (2024-07-18T17:35:32Z) - Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF
Reinforcement learning from human feedback has emerged as a central tool for language model alignment.
We propose a new algorithm for online exploration in RLHF, Exploratory Preference Optimization (XPO).
XPO enjoys the strongest known provable guarantees and promising empirical performance.
arXiv Detail & Related papers (2024-05-31T17:39:06Z) - Model Extrapolation Expedites Alignment
We propose a method called ExPO to expedite alignment training with human preferences. We demonstrate that ExPO boosts a DPO model trained with only 20% of the training steps to outperform the fully-trained one. We show that ExPO notably improves existing open-source LLMs on the leading AlpacaEval 2.0 and MT-Bench benchmarks.
arXiv Detail & Related papers (2024-04-25T17:39:50Z) - Large Language Models to Enhance Bayesian Optimization [57.474613739645605]
We present LLAMBO, a novel approach that integrates the capabilities of Large Language Models (LLM) within Bayesian optimization.
At a high level, we frame the BO problem in natural language, enabling LLMs to iteratively propose and evaluate promising solutions conditioned on historical evaluations.
Our findings illustrate that LLAMBO is effective at zero-shot warmstarting, and enhances surrogate modeling and candidate sampling, especially in the early stages of search when observations are sparse.
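The summary above describes framing the BO problem in natural language so an LLM can propose candidates conditioned on past evaluations. A minimal sketch of that serialization step is below; the prompt wording, the `build_proposal_prompt` helper, and the example SMILES/value pairs are all hypothetical, not LLAMBO's actual prompts.

```python
# Hypothetical sketch of serializing a BO history into a natural-language
# prompt, so that an LLM can propose the next candidate conditioned on it.
def build_proposal_prompt(history, task="maximize solubility"):
    lines = [f"Task: {task}. Past evaluations:"]
    for smiles, value in history:
        lines.append(f"- candidate {smiles}: observed value {value:.3f}")
    lines.append("Propose one new promising candidate and briefly justify it.")
    return "\n".join(lines)

history = [("CCO", 0.412), ("c1ccccc1O", 0.655)]
prompt = build_proposal_prompt(history)
print(prompt)
```

The LLM's reply would then be parsed back into a candidate molecule and evaluated, closing the BO loop without an explicit numeric surrogate.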
arXiv Detail & Related papers (2024-02-06T11:44:06Z) - Accelerating Black-Box Molecular Property Optimization by Adaptively Learning Sparse Subspaces
We show that our proposed method substantially outperforms existing MPO methods on a variety of benchmark and real-world problems.
Specifically, we show that our method can routinely find near-optimal molecules out of a set of more than 100k alternatives within 100 or fewer expensive queries.
arXiv Detail & Related papers (2024-01-02T18:34:29Z) - DrugAssist: A Large Language Model for Molecule Optimization
DrugAssist is an interactive molecule optimization model that performs optimization through human-machine dialogue.
DrugAssist has achieved leading results in both single and multiple property optimization.
We publicly release a large instruction-based dataset called MolOpt-Instructions for fine-tuning language models on molecule optimization tasks.
arXiv Detail & Related papers (2023-12-28T10:46:56Z) - Beam Enumeration: Probabilistic Explainability For Sample Efficient Self-conditioned Molecular Design
Generative molecular design has moved from proof-of-concept to real-world applicability.
Key challenges in explainability and sample efficiency present opportunities to enhance generative design.
Beam Enumeration is generally applicable to any language-based molecular generative model.
arXiv Detail & Related papers (2023-09-25T08:43:13Z) - Efficient Model-Free Exploration in Low-Rank MDPs
Low-Rank Markov Decision Processes offer a simple, yet expressive framework for RL with function approximation.
Existing algorithms are either (1) computationally intractable, or (2) reliant upon restrictive statistical assumptions.
We propose the first provably sample-efficient algorithm for exploration in Low-Rank MDPs.
arXiv Detail & Related papers (2023-07-08T15:41:48Z) - ALMERIA: Boosting pairwise molecular contrasts with scalable methods
ALMERIA is a tool for estimating compound similarities and activity prediction based on pairwise molecular contrasts.
It has been implemented using scalable software and methods to exploit large volumes of data.
Experiments show state-of-the-art performance for molecular activity prediction.
arXiv Detail & Related papers (2023-04-28T16:27:06Z) - CASTELO: Clustered Atom Subtypes aidEd Lead Optimization -- a combined machine learning and molecular modeling method
We propose a combined machine learning and molecular modeling approach that automates lead optimization workflow.
Our method provides new hints for drug modification hotspots which can be used to improve drug efficacy.
arXiv Detail & Related papers (2020-11-27T15:41:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.