Related papers: When Single Answer Is Not Enough: Rethinking Single-Step Retrosynthesis Benchmarks for LLMs

URL: http://arxiv.org/abs/2602.03554v1
Date: Tue, 03 Feb 2026 14:03:32 GMT
Title: When Single Answer Is Not Enough: Rethinking Single-Step Retrosynthesis Benchmarks for LLMs
Authors: Bogdan Zagribelnyy, Ivan Ilin, Maksim Kuznetsov, Nikita Bondarev, Roman Schutski, Thomas MacDougall, Rim Shayakhmetov, Zulfat Miftakhutdinov, Mikolaj Mizera, Vladimir Aladinskiy, Alex Aliper, Alex Zhavoronkov
Abstract summary: We propose a new benchmarking framework for single-step retrosynthesis. By emphasizing plausibility over exact match, this approach better aligns with human synthesis planning practices. We also introduce CREED, a novel dataset comprising millions of ChemCensor-validated reaction records for LLM training.
Score: 3.973137925060284
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Recent progress has expanded the use of large language models (LLMs) in drug discovery, including synthesis planning. However, objective evaluation of retrosynthesis performance remains limited. Existing benchmarks and metrics typically rely on published synthetic procedures and Top-K accuracy based on single ground-truth, which does not capture the open-ended nature of real-world synthesis planning. We propose a new benchmarking framework for single-step retrosynthesis that evaluates both general-purpose and chemistry-specialized LLMs using ChemCensor, a novel metric for chemical plausibility. By emphasizing plausibility over exact match, this approach better aligns with human synthesis planning practices. We also introduce CREED, a novel dataset comprising millions of ChemCensor-validated reaction records for LLM training, and use it to train a model that improves over the LLM baselines under this benchmark.
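The contrast the abstract draws between Top-K exact match and plausibility-based scoring can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the naive dot-split comparison of SMILES strings (a real evaluation would canonicalize structures with a cheminformatics toolkit), and the caller-supplied `is_plausible` predicate, which is a hypothetical stand-in and not ChemCensor, whose scoring procedure is not described in this listing.

```python
from typing import Callable, FrozenSet, Sequence


def normalize(reactants: str) -> FrozenSet[str]:
    """Order-insensitive view of a dot-separated reactant string.

    Assumption: plain string comparison; no chemical canonicalization.
    """
    return frozenset(p.strip() for p in reactants.split(".") if p.strip())


def top_k_exact_match(
    predictions: Sequence[Sequence[str]],
    ground_truth: Sequence[str],
    k: int,
) -> float:
    """Fraction of targets whose single reference answer is in the top k."""
    if not ground_truth or len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must be non-empty and aligned")
    hits = 0
    for preds, truth in zip(predictions, ground_truth):
        target = normalize(truth)
        if any(normalize(p) == target for p in preds[:k]):
            hits += 1
    return hits / len(ground_truth)


def top_k_plausible(
    predictions: Sequence[Sequence[str]],
    is_plausible: Callable[[str], bool],
    k: int,
) -> float:
    """Fraction of targets with at least one plausible answer in the top k.

    `is_plausible` is a hypothetical predicate supplied by the caller.
    """
    if not predictions:
        raise ValueError("predictions must be non-empty")
    hits = sum(1 for preds in predictions if any(is_plausible(p) for p in preds[:k]))
    return hits / len(predictions)


# Toy data: ranked reactant proposals for two targets, one reference each.
predictions = [
    ["CC(=O)O.CCO", "CC(=O)Cl.CCO"],
    ["c1ccccc1Br.OB(O)c1ccccc1"],
]
ground_truth = ["CCO.CC(=O)Cl", "c1ccccc1I.OB(O)c1ccccc1"]

# Toy plausibility judgment: an allow-list chosen by hand for this example.
allowed = {normalize(p) for preds in predictions for p in preds}
toy_is_plausible = lambda s: normalize(s) in allowed

exact_top1 = top_k_exact_match(predictions, ground_truth, k=1)
exact_top2 = top_k_exact_match(predictions, ground_truth, k=2)
plausible_top1 = top_k_plausible(predictions, toy_is_plausible, k=1)
```

In this toy setup the second target's prediction differs from the reference only in the halide, so exact match counts it as a miss while a plausibility judgment may accept it. That is the kind of gap the abstract attributes to single-ground-truth metrics.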

Related papers

RAVEL: Reasoning Agents for Validating and Evaluating LLM Text Synthesis [78.32151470154422]
We introduce RAVEL, an agentic framework that enables the testers to autonomously plan and execute typical synthesis operations. We present C3EBench, a benchmark comprising 1,258 samples derived from professional human writings. By augmenting RAVEL with SOTA LLMs as operators, we find that such agentic text synthesis is dominated by the LLM's reasoning capability.
arXiv Detail & Related papers (2026-02-28T14:47:34Z)
InfoSynth: Information-Guided Benchmark Synthesis for LLMs [69.80981631587501]
Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation. Traditional benchmark creation relies on manual human effort, a process that is both expensive and time-consuming. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks.
arXiv Detail & Related papers (2026-01-02T05:26:27Z)
Synthelite: Chemist-aligned and feasibility-aware synthesis planning with LLMs [3.7129661557601854]
We introduce Synthelite, a synthesis planning framework that uses large language models to propose retrosynthetic transformations. Synthelite can generate end-to-end synthesis routes by harnessing the intrinsic chemical knowledge and reasoning capabilities of LLMs. Our experiments demonstrate that Synthelite can flexibly adapt its planning trajectory to diverse user-specified constraints, achieving up to 95% success rates.
arXiv Detail & Related papers (2025-12-18T11:24:30Z)
A Scientific Reasoning Model for Organic Synthesis Procedure Generation [12.609346156252393]
We present QFANG, a scientific reasoning language model capable of generating precise, structured experimental procedures. We introduce a Chemistry-Guided Reasoning (CGR) framework that produces chain-of-thought data grounded in chemical knowledge at scale. We apply Reinforcement Learning from Verifiable Rewards (RLVR) to further enhance procedural accuracy.
arXiv Detail & Related papers (2025-12-15T18:55:39Z)
AOT*: Efficient Synthesis Planning via LLM-Empowered AND-OR Tree Search [22.026497456502806]
AOT* is a framework that transforms retrosynthetic planning by integrating LLM-generated chemical synthesis pathways with systematic AND-OR tree search. AOT* exhibits competitive solve rates using 3-5× fewer iterations than existing LLM-based approaches.
arXiv Detail & Related papers (2025-09-25T10:30:37Z)
DeepRetro: Retrosynthetic Pathway Discovery using Iterative LLM Reasoning [0.0]
DeepRetro is a novel, open-source framework that tightly integrates large language models (LLMs), traditional retrosynthetic engines, and expert human feedback in an iterative design loop. By releasing DeepRetro as an open-source tool, we aim to empower chemists to tackle increasingly ambitious synthetic targets.
arXiv Detail & Related papers (2025-07-07T19:41:39Z)
ChemActor: Enhancing Automated Extraction of Chemical Synthesis Actions with LLM-Generated Data [53.78763789036172]
We present ChemActor, a fully fine-tuned large language model (LLM) as a chemical executor to convert between unstructured experimental procedures and structured action sequences. This framework integrates a data selection module that selects data based on distribution divergence, with a general-purpose LLM, to generate machine-executable actions from a single molecule input. Experiments on reaction-to-description (R2D) and description-to-action (D2A) tasks demonstrate that ChemActor achieves state-of-the-art performance, outperforming the baseline model by 10%.
arXiv Detail & Related papers (2025-06-30T05:11:19Z)
BatGPT-Chem: A Foundation Large Model For Retrosynthesis Prediction [65.93303145891628]
BatGPT-Chem is a large language model with 15 billion parameters, tailored for enhanced retrosynthesis prediction. Our model captures a broad spectrum of chemical knowledge, enabling precise prediction of reaction conditions. This development empowers chemists to adeptly address novel compounds, potentially expediting the innovation cycle in drug manufacturing and materials science.
arXiv Detail & Related papers (2024-08-19T05:17:40Z)
Mitigating Catastrophic Forgetting in Large Language Models with Self-Synthesized Rehearsal [49.24054920683246]
Large language models (LLMs) suffer from catastrophic forgetting during continual learning. We propose a framework called Self-Synthesized Rehearsal (SSR) that uses the LLM to generate synthetic instances for rehearsal.
arXiv Detail & Related papers (2024-03-02T16:11:23Z)
Re-evaluating Retrosynthesis Algorithms with Syntheseus [13.384695742156152]
We present a synthesis planning library with an extensive benchmarking framework, called syntheseus. We demonstrate the capabilities of syntheseus by re-evaluating several previous retrosynthesis algorithms. We end with guidance for future works in this area, and call the community to engage in the discussion on how to improve benchmarks for synthesis planning.
arXiv Detail & Related papers (2023-10-30T17:59:04Z)
FusionRetro: Molecule Representation Fusion via In-Context Learning for Retrosynthetic Planning [58.47265392465442]
Retrosynthetic planning aims to devise a complete multi-step synthetic route from starting materials to a target molecule. Current strategies use a decoupled approach of single-step retrosynthesis models and search algorithms. We propose a novel framework that utilizes context information for improved retrosynthetic planning.
arXiv Detail & Related papers (2022-09-30T08:44:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.