Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition
- URL: http://arxiv.org/abs/2602.16684v1
- Date: Wed, 18 Feb 2026 18:27:21 GMT
- Title: Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition
- Authors: Bo Pan, Peter Zhiping Zhang, Hao-Wei Pang, Alex Zhu, Xiang Yu, Liying Zhang, Liang Zhao
- Abstract summary: We propose a variable-to-variable formulation of analog generation and train a foundation model on large-scale MMP transformations. We develop prompting mechanisms that let the users specify preferred transformation patterns during generation. Experiments on general chemical corpora and patent-specific datasets demonstrate improved diversity, novelty, and controllability.
- Score: 11.475465740098683
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Matched molecular pairs (MMPs) capture the local chemical edits that medicinal chemists routinely use to design analogs, but existing ML approaches either operate at the whole-molecule level with limited edit controllability or learn MMP-style edits from restricted settings and small models. We propose a variable-to-variable formulation of analog generation and train a foundation model on large-scale MMP transformations (MMPTs) to generate diverse variables conditioned on an input variable. To enable practical control, we develop prompting mechanisms that let the users specify preferred transformation patterns during generation. We further introduce MMPT-RAG, a retrieval-augmented framework that uses external reference analogs as contextual guidance to steer generation and generalize from project-specific series. Experiments on general chemical corpora and patent-specific datasets demonstrate improved diversity, novelty, and controllability, and show that our method recovers realistic analog structures in practical discovery scenarios.
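The abstract describes two ideas that a small example may make more concrete: treating an edit as a mapping from one variable fragment to another, and using retrieved reference analogs as context for generation. The following is a minimal sketch under stated assumptions, not the authors' implementation. The `MMPT` class, the character-bigram similarity, the `>>` and `;` prompt separators, and the example fragments are all hypothetical and chosen only for illustration; the abstract does not specify the retrieval metric or the prompt format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MMPT:
    """One hypothetical MMP transformation record: variable_from -> variable_to.

    Both fields are fragment SMILES strings with an attachment point "[*:1]".
    """
    variable_from: str
    variable_to: str


def char_bigrams(s):
    """Set of adjacent character pairs in a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def similarity(a, b):
    """Jaccard overlap of character bigrams.

    This is a crude stand-in (an assumption) for whatever chemical
    similarity a real retrieval step would use.
    """
    x, y = char_bigrams(a), char_bigrams(b)
    if not x and not y:
        return 1.0 if a == b else 0.0
    return len(x & y) / len(x | y)


def retrieve(query, references, k=2):
    """Return the k reference transformations whose source variable is most similar to the query."""
    ranked = sorted(
        references,
        key=lambda r: similarity(query, r.variable_from),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, retrieved):
    """Concatenate retrieved transformations as context, then the open-ended query.

    The "from>>to ; ... ; query>>" layout is an illustrative assumption.
    """
    context = " ; ".join(f"{r.variable_from}>>{r.variable_to}" for r in retrieved)
    return f"{context} ; {query}>>"


# Hypothetical project-specific reference edits.
references = [
    MMPT("[*:1]Cl", "[*:1]F"),
    MMPT("[*:1]c1ccccc1", "[*:1]c1ccncc1"),
    MMPT("[*:1]OC", "[*:1]OC(F)(F)F"),
]

top = retrieve("[*:1]OCC", references, k=1)
prompt = build_prompt("[*:1]OCC", top)
```

In this sketch, a generative model would be asked to complete `prompt` with a new variable fragment; the retrieved pairs serve only as in-context guidance toward transformation patterns seen in the reference series.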
Related papers
- Transformer-Based Approach for Automated Functional Group Replacement in Chemical Compounds [12.414301421345227]
We develop a novel two-stage transformer model for functional group removal and replacement. Unlike one-shot approaches that generate entire molecules in a single pass, our method generates the functional group to be removed and appended sequentially.
arXiv Detail & Related papers (2026-01-12T19:01:11Z)
- Task-Specific Sparse Feature Masks for Molecular Toxicity Prediction with Chemical Language Models [5.563119267291969]
We propose a novel multi-task learning (MTL) framework to jointly enhance accuracy and interpretability. Our architecture integrates a shared chemical language model with task-specific attention modules. By imposing an L1 sparsity penalty on these modules, the framework is constrained to focus on a minimal set of salient molecular fragments for each distinct toxicity endpoint.
arXiv Detail & Related papers (2025-12-12T09:41:04Z)
- MoRE: Batch-Robust Multi-Omics Representations from Frozen Pre-trained Transformers [0.0]
We present MoRE (Multi-Omics Representation Embedding), a framework that repurposes frozen pre-trained transformers to align heterogeneous assays into a shared latent space. Specifically, MoRE attaches lightweight, modality-specific adapters and a task-adaptive fusion layer to the frozen backbone. We benchmark MoRE against established baselines, including scGPT, scVI, and Harmony with Scrublet, evaluating integration fidelity, rare population detection, and modality transfer.
arXiv Detail & Related papers (2025-11-25T15:04:06Z)
- GP-MoLFormer-Sim: Test Time Molecular Optimization through Contextual Similarity Guidance [29.578666490023057]
The ability to design molecules while preserving similarity to a target molecule and/or property is crucial for various applications in drug discovery, chemical design, and biology. We introduce in this paper an efficient training-free method for navigating and sampling from the molecular space with a generative Chemical Language Model (CLM). Our method leverages the contextual representations learned from the CLM itself to estimate the molecular similarity, which is then used to adjust the autoregressive sampling strategy of the CLM.
arXiv Detail & Related papers (2025-06-05T23:09:33Z)
- Human-level molecular optimization driven by mol-gene evolution [5.409648262203544]
This study introduces the Deep Genetic Modification Algorithm (DGMM), which brings structure modification to the level of medicinal chemists.
A discrete variational autoencoder (D-VAE) is used in DGMM to encode molecules as quantization code, mol-gene, which incorporates deep learning into genetic algorithms for flexible structural optimization.
arXiv Detail & Related papers (2024-06-13T01:06:03Z)
- Learning Invariant Molecular Representation in Latent Discrete Space [52.13724532622099]
We propose a new framework for learning molecular representations that exhibit invariance and robustness against distribution shifts.
Our model achieves stronger generalization against state-of-the-art baselines in the presence of various distribution shifts.
arXiv Detail & Related papers (2023-10-22T04:06:44Z)
- Learning Modulated Transformation in GANs [69.95217723100413]
We equip the generator in generative adversarial networks (GANs) with a plug-and-play module, termed the modulated transformation module (MTM).
MTM predicts spatial offsets under the control of latent codes, based on which the convolution operation can be applied at variable locations.
It is noteworthy that towards human generation on the challenging TaiChi dataset, we improve the FID of StyleGAN3 from 21.36 to 13.60, demonstrating the efficacy of learning modulated geometry transformation.
arXiv Detail & Related papers (2023-08-29T17:51:22Z)
- Str2Str: A Score-based Framework for Zero-shot Protein Conformation Sampling [23.74897713386661]
The dynamic nature of proteins is crucial for determining their biological functions and properties.
Existing learning-based approaches perform direct sampling yet heavily rely on target-specific simulation data for training.
We propose Str2Str, a novel structure-to-structure translation framework capable of zero-shot conformation sampling.
arXiv Detail & Related papers (2023-06-05T15:19:06Z)
- Retrieval-based Controllable Molecule Generation [63.44583084888342]
We propose a new retrieval-based framework for controllable molecule generation.
We use a small set of molecules to steer the pre-trained generative model towards synthesizing molecules that satisfy the given design criteria.
Our approach is agnostic to the choice of generative models and requires no task-specific fine-tuning.
arXiv Detail & Related papers (2022-08-23T17:01:16Z)
- Local manifold learning and its link to domain-based physics knowledge [53.15471241298841]
In many reacting flow systems, the thermo-chemical state-space is assumed to evolve close to a low-dimensional manifold (LDM).
We show that PCA applied in local clusters of data (local PCA) is capable of detecting the intrinsic parameterization of the thermo-chemical state-space.
arXiv Detail & Related papers (2022-07-01T09:06:25Z)
- Improving Molecular Representation Learning with Metric Learning-enhanced Optimal Transport [49.237577649802034]
We develop a novel optimal transport-based algorithm termed MROT to enhance their generalization capability for molecular regression problems.
MROT significantly outperforms state-of-the-art models, showing promising potential in accelerating the discovery of new substances.
arXiv Detail & Related papers (2022-02-13T04:56:18Z)
- Geometric Transformer for End-to-End Molecule Properties Prediction [92.28929858529679]
We introduce a Transformer-based architecture for molecule property prediction, which is able to capture the geometry of the molecule.
We modify the classical positional encoder by an initial encoding of the molecule geometry, as well as a learned gated self-attention mechanism.
arXiv Detail & Related papers (2021-10-26T14:14:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.