Procrustean Bed for AI-Driven Retrosynthesis: A Unified Framework for Reproducible Evaluation
- URL: http://arxiv.org/abs/2512.07079v1
- Date: Mon, 08 Dec 2025 01:26:39 GMT
- Title: Procrustean Bed for AI-Driven Retrosynthesis: A Unified Framework for Reproducible Evaluation
- Authors: Anton Morgunov, Victor S. Batista
- Abstract summary: RetroCast is a unified evaluation suite that standardizes heterogeneous model outputs into a common schema. We evaluate leading search-based and sequence-based algorithms on a new suite of standardized benchmarks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Progress in computer-aided synthesis planning (CASP) is obscured by the lack of standardized evaluation infrastructure and the reliance on metrics that prioritize topological completion over chemical validity. We introduce RetroCast, a unified evaluation suite that standardizes heterogeneous model outputs into a common schema to enable statistically rigorous, apples-to-apples comparison. The framework includes a reproducible benchmarking pipeline with stratified sampling and bootstrapped confidence intervals, accompanied by SynthArena, an interactive platform for qualitative route inspection. We utilize this infrastructure to evaluate leading search-based and sequence-based algorithms on a new suite of standardized benchmarks. Our analysis reveals a divergence between "solvability" (stock-termination rate) and route quality; high solvability scores often mask chemical invalidity or fail to correlate with the reproduction of experimental ground truths. Furthermore, we identify a "complexity cliff" in which search-based methods, despite high solvability rates, exhibit a sharp performance decay in reconstructing long-range synthetic plans compared to sequence-based approaches. We release the full framework, benchmark definitions, and a standardized database of model predictions to support transparent and reproducible development in the field.
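The abstract's "bootstrapped confidence intervals" for per-target metrics such as solvability (stock-termination rate) can be illustrated with a minimal percentile-bootstrap sketch. This is not RetroCast's actual implementation; the function name, resample count, and interface are illustrative assumptions.

```python
import random

def bootstrap_ci(successes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a success rate.

    `successes` is a list of per-target booleans (e.g. whether each
    target molecule was 'solved', i.e. all routes terminate in stock).
    Returns (point_estimate, (ci_low, ci_high)).
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(successes)
    stats = []
    for _ in range(n_resamples):
        # Resample targets with replacement and recompute the rate.
        sample = [successes[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(successes) / n, (lo, hi)
```

In a stratified setup of the kind the abstract describes, one would apply this per stratum (e.g. per route-length bucket) so that intervals are reported separately for each difficulty band rather than pooled across them.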
Related papers
- RubricBench: Aligning Model-Generated Rubrics with Human Standards [37.33662546555801]
Reward Models are shifting from simple completions to complex, highly sophisticated generation to mitigate surface-level biases. Existing benchmarks lack both the discriminative complexity and the ground-truth annotations required for rigorous analysis. We introduce a benchmark with 1,147 pairwise comparisons specifically designed to assess the reliability of rubric-based evaluation.
arXiv Detail & Related papers (2026-03-02T07:39:49Z) - From Monolith to Microservices: A Comparative Evaluation of Decomposition Frameworks [1.516795490965608]
This work presents a unified evaluation of state-of-the-art microservice decomposition approaches spanning static, dynamic, and hybrid techniques. We assess the decomposition quality across widely used benchmark systems (JPetStore, AcmeAir, DayTrader, and Plants) using Structural Modularity (SM), Interface Number (IFN), Inter-partition Communication (ICP), Non-Extreme Distribution (NED), and related indicators. Findings indicate that the hierarchical clustering-based methods, particularly HDBScan, produce the most consistently balanced decompositions across benchmarks.
arXiv Detail & Related papers (2026-01-30T16:28:47Z) - Bridging Information Asymmetry: A Hierarchical Framework for Deterministic Blind Face Restoration [31.878334664450776]
We present Pref-Restore, a hierarchical framework that integrates discrete semantic logic with continuous texture generation. Our methodology fundamentally addresses this information disparity through two complementary strategies. Pref-Restore achieves state-of-the-art performance across synthetic and real-world benchmarks.
arXiv Detail & Related papers (2026-01-27T11:50:31Z) - DeepSynth-Eval: Objectively Evaluating Information Consolidation in Deep Survey Writing [53.85037373860246]
We introduce DeepSynth-Eval, a benchmark designed to objectively evaluate information consolidation capabilities. We propose a fine-grained evaluation protocol using General Checklists (for factual coverage) and Constraint Checklists (for structural organization). Our results demonstrate that agentic plan-and-write approaches significantly outperform single-turn generation.
arXiv Detail & Related papers (2026-01-07T03:07:52Z) - EvoSyn: Generalizable Evolutionary Data Synthesis for Verifiable Learning [63.03672166010434]
We introduce an evolutionary, task-agnostic, strategy-guided, executably-checkable data synthesis framework. It jointly synthesizes problems, diverse candidate solutions, and verification artifacts. It iteratively discovers strategies via a consistency-based evaluator that enforces agreement between human-annotated and strategy-induced checks.
arXiv Detail & Related papers (2025-10-20T11:56:35Z) - OmniQuality-R: Advancing Reward Models Through All-Encompassing Quality Assessment [55.59322229889159]
We propose OmniQuality-R, a unified reward modeling framework that transforms multi-task quality reasoning into continuous and interpretable reward signals. We use a reasoning-enhanced reward modeling dataset to form a reliable chain-of-thought dataset for supervised fine-tuning. We evaluate OmniQuality-R on three key IQA tasks: aesthetic quality assessment, technical quality evaluation, and text-image alignment.
arXiv Detail & Related papers (2025-10-12T13:46:28Z) - Beyond Memorization: Reasoning-Driven Synthesis as a Mitigation Strategy Against Benchmark Contamination [77.69093448529455]
We present an empirical study using an infinitely scalable framework to synthesize research-level QA directly from arXiv papers. We observe no significant performance decay near knowledge cutoff dates for models of various sizes, developers, and release dates. We hypothesize that the multi-step reasoning required by our synthesis pipeline adds complexity that goes deeper than shallow memorization.
arXiv Detail & Related papers (2025-08-26T16:41:37Z) - On Evaluating Performance of LLM Inference Serving Systems [11.712948114304925]
We identify recurring anti-patterns across three key dimensions: Baseline Fairness, Evaluation Setup, and Metric Design. These anti-patterns are uniquely problematic for Large Language Model (LLM) inference due to its dual-phase nature. We provide a comprehensive checklist derived from our analysis, establishing a framework for recognizing and avoiding these anti-patterns.
arXiv Detail & Related papers (2025-07-11T20:58:21Z) - EVA-MILP: Towards Standardized Evaluation of MILP Instance Generation [13.49043811341421]
Mixed-Integer Linear Programming (MILP) is fundamental to solving complex decision-making problems. The proliferation of MILP instance generation methods, driven by machine learning's demand for diverse datasets, has significantly outpaced standardized evaluation techniques. This paper introduces a comprehensive benchmark framework designed for the systematic and objective evaluation of MILP instance generation methods.
arXiv Detail & Related papers (2025-05-30T16:42:15Z) - Generalization is not a universal guarantee: Estimating similarity to training data with an ensemble out-of-distribution metric [0.09363323206192666]
Failure of machine learning models to generalize to new data is a core problem limiting the reliability of AI systems. We propose a standardized approach for assessing data similarity by constructing a supervised autoencoder for generalizability estimation (SAGE). We show that out-of-the-box model performance increases after SAGE score filtering, even when applied to data from the model's own training and test datasets.
arXiv Detail & Related papers (2025-02-22T19:21:50Z) - Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning [59.25951947621526]
We propose an approach which can transform existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers. We release four new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+) and analyze synthetic verification methods with standard, reasoning-based, and reward-based LLMs. Our experiments show that reasoning can significantly improve test case generation and that scaling the number of test cases enhances verification accuracy.
arXiv Detail & Related papers (2025-02-19T15:32:11Z) - GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z) - A Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attributions [60.06461883533697]
We first identify a set of fidelity criteria that reliable benchmarks for attribution methods are expected to fulfill. We then introduce a Backdoor-based eXplainable AI benchmark (BackX) that adheres to the desired fidelity criteria. Our analysis also offers insights into defending against neural Trojans by utilizing the attributions.
arXiv Detail & Related papers (2024-05-02T13:48:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.