GP-MoLFormer-Sim: Test Time Molecular Optimization through Contextual Similarity Guidance
- URL: http://arxiv.org/abs/2506.05628v1
- Date: Thu, 05 Jun 2025 23:09:33 GMT
- Title: GP-MoLFormer-Sim: Test Time Molecular Optimization through Contextual Similarity Guidance
- Authors: Jiri Navratil, Jarret Ross, Payel Das, Youssef Mroueh, Samuel C Hoffman, Vijil Chenthamarakshan, Brian Belgodere
- Abstract summary: The ability to design molecules while preserving similarity to a target molecule and/or property is crucial for various applications in drug discovery, chemical design, and biology. We introduce in this paper an efficient training-free method for navigating and sampling from the molecular space with a generative Chemical Language Model (CLM). Our method leverages the contextual representations learned from the CLM itself to estimate the molecular similarity, which is then used to adjust the autoregressive sampling strategy of the CLM.
- Score: 29.578666490023057
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The ability to design molecules while preserving similarity to a target molecule and/or property is crucial for various applications in drug discovery, chemical design, and biology. We introduce in this paper an efficient training-free method for navigating and sampling from the molecular space with a generative Chemical Language Model (CLM), while using the molecular similarity to the target as a guide. Our method leverages the contextual representations learned from the CLM itself to estimate the molecular similarity, which is then used to adjust the autoregressive sampling strategy of the CLM. At each step of the decoding process, the method tracks the distance of the current generations from the target and updates the logits to encourage the preservation of similarity in generations. We implement the method using a recently proposed $\sim$47M parameter SMILES-based CLM, GP-MoLFormer, and therefore refer to the method as GP-MoLFormer-Sim, which enables a test-time update of the deep generative policy to reflect the contextual similarity to a set of guide molecules. The method is further integrated into a genetic algorithm (GA) and tested on a set of standard molecular optimization benchmarks involving property optimization, molecular rediscovery, and structure-based drug design. Results show that GP-MoLFormer-Sim combined with GA (GP-MoLFormer-Sim+GA) outperforms existing training-free baseline methods when the oracle remains black-box. The findings in this work are a step forward in understanding and guiding the generative mechanisms of CLMs.
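The decoding idea described in the abstract (track similarity to the guide molecules at each step and update the logits accordingly) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the `alpha` guidance weight, the use of cosine similarity, and the way a candidate representation is formed (context embedding plus token embedding) are all assumptions made for illustration. In the paper, the representations come from the CLM itself.

```python
import numpy as np


def cosine(a, b, eps=1e-12):
    """Cosine similarity between two vectors (eps avoids division by zero)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def guided_logits(logits, context_emb, token_embs, guide_embs, alpha=2.0):
    """Return logits shifted by a similarity bonus for each candidate token.

    logits      : (V,) raw next-token logits from the language model
    context_emb : (D,) embedding of the partial sequence (hypothetical input)
    token_embs  : (V, D) one embedding per vocabulary token (hypothetical input)
    guide_embs  : (G, D) embeddings of the guide molecules (hypothetical input)
    alpha       : guidance strength; alpha=0 leaves the logits unchanged
    """
    adjusted = np.array(logits, dtype=float)
    for tok in range(len(adjusted)):
        # Toy stand-in for "the contextual representation if tok were appended".
        candidate = context_emb + token_embs[tok]
        # Score the candidate by its closest guide molecule.
        sim = max(cosine(candidate, g) for g in guide_embs)
        adjusted[tok] += alpha * sim
    return adjusted


def softmax(x):
    """Numerically stable softmax."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = np.zeros(3)                       # the model has no preference
    context_emb = np.zeros(2)
    token_embs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
    guide_embs = np.array([[1.0, 0.0]])        # a single guide molecule
    probs = softmax(guided_logits(logits, context_emb, token_embs, guide_embs))
    next_token = rng.choice(len(probs), p=probs)  # sample from the guided policy
    print(probs, next_token)
```

With `alpha=0` the function returns the unmodified logits, so the guidance strength can be tuned at test time without retraining, which is the training-free property the abstract emphasizes.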
Related papers
- DrugR: Optimizing Molecular Drugs through LLM-based Explicit Reasoning [24.70952870676648]
DrugR is a large language model that introduces explicit, step-by-step pharmacological reasoning into the optimization process.
Our approach integrates domain-specific continual pretraining, supervised fine-tuning via reverse data engineering, and self-balanced multi-granular reinforcement learning.
Experimental results demonstrate that DrugR achieves comprehensive enhancement across multiple properties without compromising structural similarity or target binding affinity.
arXiv Detail & Related papers (2026-02-09T02:26:25Z) - Multi-Constrained Evolutionary Molecular Design Framework: An Interpretable Drug Design Method Combining Rule-Based Evolution and Molecular Crossover [11.76474263237491]
MCEMOL (Multi-Constrained Evolutionary Molecular Design Framework) is a molecular optimization approach integrating rule-based evolution with molecular crossover.
Unlike deep learning methods requiring large datasets and extensive training, our algorithm evolves efficiently from minimal starting molecules with low computational overhead.
Experimental results demonstrate that MCEMOL provides transparent design pathways through its evolutionary mechanism while generating valid, diverse, target-compliant molecules.
arXiv Detail & Related papers (2026-01-15T06:37:07Z) - MolGuidance: Advanced Guidance Strategies for Conditional Molecular Generation with Flow Matching [6.649784863468093]
Key objectives in conditional molecular generation include ensuring chemical validity, aligning generated molecules with target properties, and enabling efficient sampling for discovery.
Recent advances in computer vision introduced a range of new guidance strategies for generative models.
We integrate state-of-the-art guidance methods in a leading molecule generation framework built on an SE(3)-equivariant flow matching process.
arXiv Detail & Related papers (2025-12-13T06:05:09Z) - Breaking the Modality Barrier: Generative Modeling for Accurate Molecule Retrieval from Mass Spectra [60.08608779794957]
We propose GLMR, a Generative Language Model-based Retrieval framework.
In the pre-retrieval stage, a contrastive learning-based model identifies top candidate molecules as contextual priors for the input mass spectrum.
In the generative retrieval stage, these candidate molecules are integrated with the input mass spectrum to guide a generative model in producing refined molecular structures.
arXiv Detail & Related papers (2025-11-09T07:25:53Z) - Graph-based Molecular In-context Learning Grounded on Morgan Fingerprints [28.262593876388397]
In-context learning (ICL) conditions large language models (LLMs) for molecular tasks, such as property prediction and molecule captioning, by embedding carefully selected demonstration examples into the input prompt.
However, current prompt retrieval methods for molecular tasks have relied on molecule feature similarity, such as Morgan fingerprints, which do not adequately capture the global molecular and atom-binding relationships.
We propose a self-supervised learning technique, GAMIC, which aligns global molecular structures, represented by graph neural networks (GNNs), with textual captions (descriptions) while leveraging local feature similarity through Morgan fingerprints.
arXiv Detail & Related papers (2025-02-08T02:46:33Z) - Conditional Synthesis of 3D Molecules with Time Correction Sampler [58.0834973489875]
Time-Aware Conditional Synthesis (TACS) is a novel approach to conditional generation on diffusion models.
It integrates adaptively controlled plug-and-play "online" guidance into a diffusion model, driving samples toward the desired properties.
arXiv Detail & Related papers (2024-11-01T12:59:25Z) - Text-Guided Multi-Property Molecular Optimization with a Diffusion Language Model [20.250683535089617]
We propose a text-guided multi-property molecular optimization method utilizing a transformer-based diffusion language model (TransDLM).
By fusing physically and chemically detailed semantics with specialized molecular representations, TransDLM effectively integrates diverse information sources to guide precise optimization.
arXiv Detail & Related papers (2024-10-17T14:30:27Z) - WGFormer: An SE(3)-Transformer Driven by Wasserstein Gradient Flows for Molecular Ground-State Conformation Prediction [24.18678055892153]
Ground-state conformation prediction is crucial for many chemical applications.
We propose a novel and effective method to bridge the energy-based simulation and the learning-based strategy.
Our method consistently outperforms state-of-the-art competitors.
arXiv Detail & Related papers (2024-10-13T10:48:22Z) - Quantum-inspired Reinforcement Learning for Synthesizable Drug Design [20.00111975801053]
We introduce a novel approach using the reinforcement learning method with quantum-inspired simulated annealing policy neural network to navigate the vast discrete space of chemical structures intelligently.
Specifically, we employ a deterministic REINFORCE algorithm using policy neural networks to output transitional probability to guide state transitions and local search.
Our methods are evaluated with the Practical Molecular Optimization (PMO) benchmark framework with a 10K query budget.
arXiv Detail & Related papers (2024-09-13T20:43:16Z) - Many-Shot In-Context Learning for Molecular Inverse Design [56.65345962071059]
Large Language Models (LLMs) have demonstrated great performance in few-shot In-Context Learning (ICL).
We develop a new semi-supervised learning method that overcomes the lack of experimental data available for many-shot ICL.
As we show, the new method greatly improves upon existing ICL methods for molecular design while being accessible and easy to use for scientists.
arXiv Detail & Related papers (2024-07-26T21:10:50Z) - Data-Efficient Molecular Generation with Hierarchical Textual Inversion [48.816943690420224]
We introduce Hierarchical textual Inversion for Molecular generation (HI-Mol), a novel data-efficient molecular generation method.
HI-Mol is inspired by the importance of hierarchical information, e.g., both coarse- and fine-grained features, in understanding the molecule distribution.
Compared to the conventional textual inversion method in the image domain using a single-level token embedding, our multi-level token embeddings allow the model to effectively learn the underlying low-shot molecule distribution.
arXiv Detail & Related papers (2024-05-05T08:35:23Z) - MolTC: Towards Molecular Relational Modeling In Language Models [28.960416816491392]
We propose a novel framework for Molecular inTeraction prediction following Chain-of-Thought (CoT) theory termed MolTC.
Our experiments, conducted across various datasets involving over 4,000,000 molecular pairs, exhibit the superiority of our method over current GNN and LLM-based baselines.
arXiv Detail & Related papers (2024-02-06T07:51:56Z) - Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective [53.300288393173204]
Large Language Models (LLMs) have shown remarkable performance in various cross-modal tasks.
In this work, we propose an In-context Few-Shot Molecule Learning paradigm for molecule-caption translation.
We evaluate the effectiveness of MolReGPT on molecule-caption translation, including molecule understanding and text-based molecule generation.
arXiv Detail & Related papers (2023-06-11T08:16:25Z) - Retrieval-based Controllable Molecule Generation [63.44583084888342]
We propose a new retrieval-based framework for controllable molecule generation.
We use a small set of molecules to steer the pre-trained generative model towards synthesizing molecules that satisfy the given design criteria.
Our approach is agnostic to the choice of generative models and requires no task-specific fine-tuning.
arXiv Detail & Related papers (2022-08-23T17:01:16Z) - Improving Small Molecule Generation using Mutual Information Machine [0.0]
MolMIM is a probabilistic auto-encoder for small molecule drug discovery.
We demonstrate MolMIM's superior generation as measured in terms of validity, uniqueness, and novelty.
We then utilize CMA-ES, a naive black-box and gradient free search algorithm, over MolMIM's latent space for the task of property guided molecule optimization.
arXiv Detail & Related papers (2022-08-18T18:32:48Z) - Molecular Attributes Transfer from Non-Parallel Data [57.010952598634944]
We formulate molecular optimization as a style transfer problem and present a novel generative model that could automatically learn internal differences between two groups of non-parallel data.
Experiments on two molecular optimization tasks, toxicity modification and synthesizability improvement, demonstrate that our model significantly outperforms several state-of-the-art methods.
arXiv Detail & Related papers (2021-11-30T06:10:22Z)