Procrustean Bed for AI-Driven Retrosynthesis: A Unified Framework for Reproducible Evaluation
- URL: http://arxiv.org/abs/2512.07079v1
- Date: Mon, 08 Dec 2025 01:26:39 GMT
- Title: Procrustean Bed for AI-Driven Retrosynthesis: A Unified Framework for Reproducible Evaluation
- Authors: Anton Morgunov, Victor S. Batista
- Abstract summary: RetroCast is a unified evaluation suite that standardizes heterogeneous model outputs into a common schema. We evaluate leading search-based and sequence-based algorithms on a new suite of standardized benchmarks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Progress in computer-aided synthesis planning (CASP) is obscured by the lack of standardized evaluation infrastructure and the reliance on metrics that prioritize topological completion over chemical validity. We introduce RetroCast, a unified evaluation suite that standardizes heterogeneous model outputs into a common schema to enable statistically rigorous, apples-to-apples comparison. The framework includes a reproducible benchmarking pipeline with stratified sampling and bootstrapped confidence intervals, accompanied by SynthArena, an interactive platform for qualitative route inspection. We utilize this infrastructure to evaluate leading search-based and sequence-based algorithms on a new suite of standardized benchmarks. Our analysis reveals a divergence between "solvability" (stock-termination rate) and route quality; high solvability scores often mask chemical invalidity or fail to correlate with the reproduction of experimental ground truths. Furthermore, we identify a "complexity cliff" in which search-based methods, despite high solvability rates, exhibit a sharp performance decay in reconstructing long-range synthetic plans compared to sequence-based approaches. We release the full framework, benchmark definitions, and a standardized database of model predictions to support transparent and reproducible development in the field.
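The abstract's "bootstrapped confidence intervals" for per-target metrics such as solvability (stock-termination rate) can be illustrated with a minimal percentile-bootstrap sketch. This is not RetroCast's actual implementation; the function name, resample count, and interface are illustrative assumptions.

```python
import random

def bootstrap_ci(successes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a success rate.

    `successes` is a list of per-target booleans (e.g. whether each
    target molecule was 'solved', i.e. all routes terminate in stock).
    Returns (point_estimate, (ci_low, ci_high)).
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(successes)
    stats = []
    for _ in range(n_resamples):
        # Resample targets with replacement and recompute the rate.
        sample = [successes[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(successes) / n, (lo, hi)
```

In a stratified setup of the kind the abstract describes, one would apply this per stratum (e.g. per route-length bucket) so that intervals are reported separately for each difficulty band rather than pooled across them.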
Related papers
- RubricBench: Aligning Model-Generated Rubrics with Human Standards [37.33662546555801]
Reward Models are shifting from simple completions to complex, highly sophisticated generation to mitigate surface-level biases. Existing benchmarks lack both the discriminative complexity and the ground-truth annotations required for rigorous analysis. We introduce a benchmark with 1,147 pairwise comparisons specifically designed to assess the reliability of rubric-based evaluation.
arXiv Detail & Related papers (2026-03-02T07:39:49Z) - From Monolith to Microservices: A Comparative Evaluation of Decomposition Frameworks [1.516795490965608]
This work presents a unified evaluation of state-of-the-art microservice decomposition approaches spanning static, dynamic, and hybrid techniques. We assess the decomposition quality across widely used benchmark systems (JPetStore, AcmeAir, DayTrader, and Plants) using Structural Modularity (SM), Interface Number (IFN), Inter-partition Communication (ICP), Non-Extreme Distribution (NED), and related indicators. Findings indicate that the hierarchical clustering-based methods, particularly HDBScan, produce the most consistently balanced decompositions across benchmarks.
arXiv Detail & Related papers (2026-01-30T16:28:47Z) - Bridging Information Asymmetry: A Hierarchical Framework for Deterministic Blind Face Restoration [31.878334664450776]
We present Pref-Restore, a hierarchical framework that integrates discrete semantic logic with continuous texture generation. Our methodology fundamentally addresses this information disparity through two complementary strategies. Pref-Restore achieves state-of-the-art performance across synthetic and real-world benchmarks.
arXiv Detail & Related papers (2026-01-27T11:50:31Z) - DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing [53.85037373860246]
We introduce DeepSynth-Eval, a benchmark designed to objectively evaluate information consolidation capabilities. We propose a fine-grained evaluation protocol using General Checklists (for factual coverage) and Constraint Checklists (for structural organization). Our results demonstrate that agentic plan-and-write approaches significantly outperform single-turn generation.
arXiv Detail & Related papers (2026-01-07T03:07:52Z) - EvoSyn: Generalizable Evolutionary Data Synthesis for Verifiable Learning [63.03672166010434]
We introduce an evolutionary, task-agnostic, strategy-guided, executably-checkable data synthesis framework. It jointly synthesizes problems, diverse candidate solutions, and verification artifacts. It iteratively discovers strategies via a consistency-based evaluator that enforces agreement between human-annotated and strategy-induced checks.
arXiv Detail & Related papers (2025-10-20T11:56:35Z) - OmniQuality-R: Advancing Reward Models Through All-Encompassing Quality Assessment [55.59322229889159]
We propose OmniQuality-R, a unified reward modeling framework that transforms multi-task quality reasoning into continuous and interpretable reward signals. We use a reasoning-enhanced reward modeling dataset to form a reliable chain-of-thought dataset for supervised fine-tuning. We evaluate OmniQuality-R on three key IQA tasks: aesthetic quality assessment, technical quality evaluation, and text-image alignment.
arXiv Detail & Related papers (2025-10-12T13:46:28Z) - Beyond Memorization: Reasoning-Driven Synthesis as a Mitigation Strategy Against Benchmark Contamination [77.69093448529455]
We present an empirical study using an infinitely scalable framework to synthesize research-level QA directly from arXiv papers. We observe no significant performance decay near knowledge cutoff dates for models of various sizes, developers, and release dates. We hypothesize that the multi-step reasoning required by our synthesis pipeline adds complexity that goes deeper than shallow memorization.
arXiv Detail & Related papers (2025-08-26T16:41:37Z) - On Evaluating Performance of LLM Inference Serving Systems [11.712948114304925]
We identify recurring anti-patterns across three key dimensions: Baseline Fairness, Evaluation Setup, and Metric Design. These anti-patterns are uniquely problematic for Large Language Model (LLM) inference due to its dual-phase nature. We provide a comprehensive checklist derived from our analysis, establishing a framework for recognizing and avoiding these anti-patterns.
arXiv Detail & Related papers (2025-07-11T20:58:21Z) - EVA-MILP: Towards Standardized Evaluation of MILP Instance Generation [13.49043811341421]
Mixed-Integer Linear Programming (MILP) is fundamental to solving complex decision-making problems. The proliferation of MILP instance generation methods, driven by machine learning's demand for diverse datasets, has significantly outpaced standardized evaluation techniques. This paper introduces a comprehensive benchmark framework designed for the systematic and objective evaluation of MILP instance generation methods.
arXiv Detail & Related papers (2025-05-30T16:42:15Z) - Generalization is not a universal guarantee: Estimating similarity to training data with an ensemble out-of-distribution metric [0.09363323206192666]
Failure of machine learning models to generalize to new data is a core problem limiting the reliability of AI systems. We propose a standardized approach for assessing data similarity by constructing a supervised autoencoder for generalizability estimation (SAGE). We show that out-of-the-box model performance increases after SAGE score filtering, even when applied to data from the model's own training and test datasets.
arXiv Detail & Related papers (2025-02-22T19:21:50Z) - Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning [59.25951947621526]
We propose an approach which can transform existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers. We release four new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+) and analyze synthetic verification methods with standard, reasoning-based, and reward-based LLMs. Our experiments show that reasoning can significantly improve test case generation and that scaling the number of test cases enhances verification accuracy.
arXiv Detail & Related papers (2025-02-19T15:32:11Z) - GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z) - A Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attributions [60.06461883533697]
We first identify a set of fidelity criteria that reliable benchmarks for attribution methods are expected to fulfill. We then introduce a Backdoor-based eXplainable AI benchmark (BackX) that adheres to the desired fidelity criteria. Our analysis also offers insights into defending against neural Trojans by utilizing the attributions.
arXiv Detail & Related papers (2024-05-02T13:48:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.