Efficient and Programmable Exploration of Synthesizable Chemical Space
- URL: http://arxiv.org/abs/2512.00384v1
- Date: Sat, 29 Nov 2025 08:21:21 GMT
- Title: Efficient and Programmable Exploration of Synthesizable Chemical Space
- Authors: Shitong Luo, Connor W. Coley
- Abstract summary: We present PrexSyn, an efficient and programmable model for molecular discovery within synthesizable chemical space. PrexSyn is based on a decoder-only transformer trained on a billion-scale datastream of synthesizable pathways paired with molecular properties. By exploiting this property-based querying capability, PrexSyn can efficiently optimize molecules against black-box oracle functions.
- Score: 19.94593615043411
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The constrained nature of synthesizable chemical space poses a significant challenge for sampling molecules that are both synthetically accessible and possess desired properties. In this work, we present PrexSyn, an efficient and programmable model for molecular discovery within synthesizable chemical space. PrexSyn is based on a decoder-only transformer trained on a billion-scale datastream of synthesizable pathways paired with molecular properties, enabled by a real-time, high-throughput C++-based data generation engine. The large-scale training data allows PrexSyn to reconstruct the synthesizable chemical space nearly perfectly at a high inference speed and learn the association between properties and synthesizable molecules. Based on its learned property-pathway mappings, PrexSyn can generate synthesizable molecules that satisfy not only single-property conditions but also composite property queries joined by logical operators, thereby allowing users to "program" generation objectives. Moreover, by exploiting this property-based querying capability, PrexSyn can efficiently optimize molecules against black-box oracle functions via iterative query refinement, achieving higher sampling efficiency than even synthesis-agnostic baselines, making PrexSyn a powerful general-purpose molecular optimization tool. Overall, PrexSyn pushes the frontier of synthesizable molecular design by setting a new state of the art in synthesizable chemical space coverage, molecular sampling efficiency, and inference speed.
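The abstract's two central ideas, composite property queries joined by logical operators and iterative query refinement against a black-box oracle, can be illustrated with a toy sketch. This is not PrexSyn's API or algorithm: the `Query` class, `in_range`, `select`, `refine`, the candidate table, and the oracle are all hypothetical, and a small lookup table stands in for the generative model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Props = Dict[str, float]  # hypothetical property vector for one molecule


@dataclass(frozen=True)
class Query:
    """A predicate over properties that composes with &, | and ~."""
    test: Callable[[Props], bool]

    def __and__(self, other: "Query") -> "Query":
        return Query(lambda p: self.test(p) and other.test(p))

    def __or__(self, other: "Query") -> "Query":
        return Query(lambda p: self.test(p) or other.test(p))

    def __invert__(self) -> "Query":
        return Query(lambda p: not self.test(p))

    def __call__(self, p: Props) -> bool:
        return self.test(p)


def in_range(name: str, lo: float, hi: float) -> Query:
    """Single-property condition: lo <= property <= hi."""
    return Query(lambda p: lo <= p[name] <= hi)


# Toy stand-in for a generator: a fixed table of made-up candidates.
CANDIDATES: Dict[str, Props] = {
    "mol_a": {"logp": 0.5, "mw": 250.0},
    "mol_b": {"logp": 2.0, "mw": 320.0},
    "mol_c": {"logp": 2.8, "mw": 410.0},
    "mol_d": {"logp": 3.5, "mw": 560.0},
    "mol_e": {"logp": 4.6, "mw": 480.0},
}


def select(query: Query) -> List[str]:
    """Return the names of all candidates that satisfy the query."""
    return [name for name, props in CANDIDATES.items() if query(props)]


def refine(oracle: Callable[[Props], float], center: float,
           width: float, rounds: int) -> str:
    """Re-center a logp window on the best-scoring hit each round."""
    best_name, best_score = "", float("-inf")
    for _ in range(rounds):
        hits = select(in_range("logp", center - width, center + width))
        if not hits:
            break
        top = max(hits, key=lambda n: oracle(CANDIDATES[n]))
        if oracle(CANDIDATES[top]) > best_score:
            best_name, best_score = top, oracle(CANDIDATES[top])
        center = CANDIDATES[top]["logp"]  # refine the query around the best hit
    return best_name


# Composite query: logp in [1, 4] AND NOT (mw in [500, 1000]).
query = in_range("logp", 1.0, 4.0) & ~in_range("mw", 500.0, 1000.0)


def oracle(props: Props) -> float:
    """Black-box score (made up): prefers molecular weight near 480."""
    return -abs(props["mw"] - 480.0)


print(select(query))
print(refine(oracle, center=1.0, width=1.0, rounds=4))
```

In this sketch the refinement loop only ever sees oracle scores, never the oracle's internals, which is the sense in which the oracle is a black box. Because the window moves locally, it can settle on a candidate that is not the global optimum of the table.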
Related papers
- Agentic reinforcement learning empowers next-generation chemical language models for molecular design and synthesis [51.83339196548892]
ChemCraft is a novel framework that decouples chemical reasoning from knowledge storage. ChemCraft achieves superior performance with minimal inference costs. This work establishes a cost-effective and privacy-preserving paradigm for AI-aided chemistry.
arXiv Detail & Related papers (2026-01-25T04:23:34Z)
- Rethinking Molecule Synthesizability with Chain-of-Reaction [47.744071119775676]
We introduce ReaSyn, a generative framework for synthesizable projection. We propose a novel perspective that views synthetic pathways akin to reasoning paths in large language models (LLMs). With the CoR notation, ReaSyn can get dense supervision in every reaction step to explicitly learn chemical reaction rules.
arXiv Detail & Related papers (2025-09-19T15:29:57Z)
- Synthesizable by Design: A Retrosynthesis-Guided Framework for Molecular Analog Generation [0.5852077003870417]
We introduce SynTwins, a novel retrosynthesis-guided molecular analog design framework. In comparative evaluations, SynTwins demonstrates superior performance in generating synthetically accessible analogs. Our benchmarking across diverse molecular datasets demonstrates that SynTwins effectively bridges the gap between computational design and experimental synthesis.
arXiv Detail & Related papers (2025-07-03T16:14:57Z)
- ChemActor: Enhancing Automated Extraction of Chemical Synthesis Actions with LLM-Generated Data [53.78763789036172]
We present ChemActor, a fully fine-tuned large language model (LLM) as a chemical executor to convert between unstructured experimental procedures and structured action sequences. This framework integrates a data selection module that selects data based on distribution divergence, with a general-purpose LLM, to generate machine-executable actions from a single molecule input. Experiments on reaction-to-description (R2D) and description-to-action (D2A) tasks demonstrate that ChemActor achieves state-of-the-art performance, outperforming the baseline model by 10%.
arXiv Detail & Related papers (2025-06-30T05:11:19Z)
- Scaling Laws of Synthetic Data for Language Models [125.41600201811417]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z)
- SynLlama: Generating Synthesizable Molecules and Their Analogs with Large Language Models [3.750173223006525]
We present a novel approach by fine-tuning Meta's Llama3 Large Language Models to create SynLlama. SynLlama generates full synthetic pathways made of commonly accessible building blocks and robust organic reaction templates. We find that SynLlama, even without training on external building blocks, can effectively generalize to unseen yet purchasable building blocks.
arXiv Detail & Related papers (2025-03-16T18:30:56Z)
- Text-Guided Multi-Property Molecular Optimization with a Diffusion Language Model [20.250683535089617]
We propose a text-guided multi-property molecular optimization method utilizing a transformer-based diffusion language model (TransDLM). By fusing physically and chemically detailed semantics with specialized molecular representations, TransDLM effectively integrates diverse information sources to guide precise optimization.
arXiv Detail & Related papers (2024-10-17T14:30:27Z)
- Generative Artificial Intelligence for Navigating Synthesizable Chemical Space [25.65907958071386]
We introduce SynFormer, a generative modeling framework designed to efficiently explore and navigate synthesizable chemical space.
By incorporating a scalable transformer architecture and a diffusion module for building block selection, SynFormer surpasses existing models in synthesizable molecular design.
arXiv Detail & Related papers (2024-10-04T15:09:05Z)
- SynthFormer: Equivariant Pharmacophore-based Generation of Synthesizable Molecules for Ligand-Based Drug Design [19.578382119811238]
We introduce SynthFormer, a novel machine learning model that generates fully synthesizable molecules, structured as synthetic trees, by introducing both 3D information and pharmacophores as input. It is a first-of-its-kind approach that could provide capabilities for designing active molecules based on pharmacophores.
arXiv Detail & Related papers (2024-10-03T17:38:46Z)
- BatGPT-Chem: A Foundation Large Model For Retrosynthesis Prediction [65.93303145891628]
BatGPT-Chem is a large language model with 15 billion parameters, tailored for enhanced retrosynthesis prediction.
Our model captures a broad spectrum of chemical knowledge, enabling precise prediction of reaction conditions.
This development empowers chemists to adeptly address novel compounds, potentially expediting the innovation cycle in drug manufacturing and materials science.
arXiv Detail & Related papers (2024-08-19T05:17:40Z)
- Learning To Navigate The Synthetically Accessible Chemical Space Using Reinforcement Learning [75.95376096628135]
We propose a novel forward synthesis framework powered by reinforcement learning (RL) for de novo drug design.
In this setup, the agent learns to navigate through the immense synthetically accessible chemical space.
We describe how the end-to-end training in this study represents an important paradigm in radically expanding the synthesizable chemical space.
arXiv Detail & Related papers (2020-04-26T21:40:03Z)