Solver-in-the-Loop: MDP-Based Benchmarks for Self-Correction and Behavioral Rationality in Operations Research
- URL: http://arxiv.org/abs/2601.21008v1
- Date: Wed, 28 Jan 2026 20:02:44 GMT
- Title: Solver-in-the-Loop: MDP-Based Benchmarks for Self-Correction and Behavioral Rationality in Operations Research
- Authors: Ruicheng Ao, David Simchi-Levi, Xinshang Wang
- Abstract summary: Operations Research practitioners routinely debug infeasible models through an iterative process. We introduce two benchmarks that place the solver in the evaluation loop. We find that domain-specific RLVR training enables an 8B model to surpass frontier APIs.
- Score: 19.31559944205485
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Operations Research practitioners routinely debug infeasible models through an iterative process: analyzing Irreducible Infeasible Subsystems (\IIS{}), identifying constraint conflicts, and systematically repairing formulations until feasibility is achieved. Yet existing LLM benchmarks evaluate OR as one-shot translation -- given a problem description, generate solver code -- ignoring this diagnostic loop entirely. We introduce two benchmarks that place the \textbf{solver in the evaluation loop}. \textbf{\ORDebug{}} evaluates iterative self-correction through 5,000+ problems spanning 9 error types; each repair action triggers solver re-execution and \IIS{} recomputation, providing deterministic, verifiable feedback. \textbf{\ORBias{}} evaluates behavioral rationality through 2,000 newsvendor instances (1,000 ID + 1,000 OOD), measuring systematic deviations from closed-form optimal policies. Across 26 models and 12,000+ samples, we find that domain-specific RLVR training enables an 8B model to surpass frontier APIs: 95.3\% vs 86.2\% recovery rate (+9.1\%), 62.4\% vs 47.8\% diagnostic accuracy (+14.6\%), and 2.25 vs 3.78 steps to resolution (1.7$\times$ faster). On \ORBias{}, curriculum training achieves the only negative ID$\rightarrow$OOD bias drift among models evaluated (-9.6\%), reducing systematic bias by 48\% (from 20.0\% to 10.4\%). These results demonstrate that process-level evaluation with verifiable oracles enables targeted training that outperforms scale.
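The \ORBias{} benchmark measures deviations from the closed-form optimal newsvendor policy. For readers unfamiliar with that baseline, the following sketch computes the textbook critical-fractile solution for normally distributed demand; the specific demand parameters and costs below are illustrative assumptions, not values from the paper.

```python
from statistics import NormalDist

def newsvendor_optimal(mu, sigma, underage_cost, overage_cost):
    """Closed-form newsvendor order quantity for demand ~ Normal(mu, sigma).

    The critical fractile cu / (cu + co) is the optimal service level;
    the optimal order quantity is the corresponding demand quantile.
    """
    critical_fractile = underage_cost / (underage_cost + overage_cost)
    return mu + sigma * NormalDist().inv_cdf(critical_fractile)

# Illustrative instance: demand ~ N(100, 20), underage cost 3, overage cost 1.
# Critical fractile = 0.75, so the policy orders above mean demand.
q_star = newsvendor_optimal(100, 20, underage_cost=3, overage_cost=1)
```

A model's "bias" on such an instance is then the systematic gap between its recommended quantity and `q_star`, e.g. the pull-to-center effect documented in behavioral OR.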
Related papers
- In-Context Environments Induce Evaluation-Awareness in Language Models [0.12691047660244334]
Humans often become more self-aware under threat, yet can lose self-awareness when absorbed in a task. We introduce a black-box adversarial optimization framework treating the in-context prompt as an optimizable environment. We show that adversarially optimized prompts pose a substantially greater threat to evaluation reliability than previously understood.
arXiv Detail & Related papers (2026-03-04T08:22:02Z) - OptiRepair: Closed-Loop Diagnosis and Repair of Supply Chain Optimization Models with LLM Agents [19.31559944205485]
Supply chain optimization models frequently become infeasible because of modeling errors. We decompose this task into two phases: a domain-agnostic feasibility phase and a domain-specific validation phase. We test 22 API models from seven families on 976 multi-echelon supply chain problems.
arXiv Detail & Related papers (2026-02-23T02:19:05Z) - How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs [49.61011897610774]
How2Everything is a framework to evaluate and improve goal-conditioned procedure generation. Our framework includes How2Mine, which mines 351K procedures from 980K web pages across 14 topics. How2Score is an evaluation protocol that uses an LLM judge to detect whether a generation contains any critical failure that would prevent achieving the goal.
arXiv Detail & Related papers (2026-02-09T15:47:14Z) - ACAR: Adaptive Complexity Routing for Multi-Model Ensembles with Auditable Decision Traces [3.151184728006369]
We present ACAR, a measurement framework for studying multi-model orchestration under auditable conditions. ACAR uses self-consistency variance ($\sigma$) computed from N=3 probe samples to route tasks across single-model, two-model, and three-model execution modes. We evaluate ACAR on 1,510 tasks spanning four benchmarks, producing more than 7,550 auditable runs.
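The routing mechanism described above can be sketched as follows. This is a minimal illustration of variance-based routing, not ACAR's implementation: the numeric probe answers and the two thresholds are hypothetical placeholders.

```python
from statistics import pvariance

def route_by_variance(probe_answers, low_thresh=0.1, high_thresh=0.5):
    """Route a task by self-consistency variance over N probe samples.

    Low variance among probes suggests the task is easy enough for a
    single model; high variance escalates to a larger ensemble.
    Thresholds here are illustrative, not from the paper.
    """
    sigma2 = pvariance(probe_answers)
    if sigma2 <= low_thresh:
        return "single-model"
    if sigma2 <= high_thresh:
        return "two-model"
    return "three-model"

# Three identical probe answers: zero variance, cheapest mode suffices.
mode = route_by_variance([1.0, 1.0, 1.0])
```

The design choice is that disagreement among cheap probe samples is used as a proxy for task difficulty, so expensive multi-model modes are reserved for tasks where a single model is unreliable.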
arXiv Detail & Related papers (2026-02-06T23:27:17Z) - Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains [97.5573252172065]
We train a family of Automatic Reasoning Evaluators (FARE) with a simple iterative rejection-sampling supervised finetuning approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators. As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH.
arXiv Detail & Related papers (2025-10-20T17:52:06Z) - Reliable Fine-Grained Evaluation of Natural Language Math Proofs [30.992321135182905]
We propose a systematic methodology for developing evaluators that assign fine-grained scores on a 0-7 scale to model-generated math proofs. We introduce ProofBench, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions. Our analysis delivers ProofGrader, an evaluator that combines a strong reasoning backbone LM, rich context from reference solutions and marking schemes, and a simple ensembling method.
arXiv Detail & Related papers (2025-10-14T02:59:07Z) - Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z) - Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems [20.846301581161978]
Failure attribution in multi-agent systems is a critical yet unsolved challenge. Current methods treat this as a pattern recognition task over long conversation logs. A2P Scaffolding transforms failure attribution from pattern recognition into a structured causal inference task.
arXiv Detail & Related papers (2025-09-12T16:51:15Z) - Probing for Arithmetic Errors in Language Models [86.8227317662622]
Internal activations in language models can be used to detect arithmetic errors. We show that simple probes can accurately decode both the model's predicted output and the correct answer from hidden states. We train lightweight error detectors that predict model correctness with over 90% accuracy.
arXiv Detail & Related papers (2025-07-16T16:27:50Z) - Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
This paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps. We find significant performance degradation on novel or incomplete data. These findings highlight the reliance on recall over rigorous logical inference.
arXiv Detail & Related papers (2025-03-06T15:36:06Z) - SummExecEdit: A Factual Consistency Benchmark in Summarization with Executable Edits [31.98028879922584]
We introduce SummExecEdit, a novel pipeline and benchmark to assess models on their ability to both detect factual errors and provide accurate explanations. The top-performing model, Claude3-Opus, achieves a joint detection and explanation score of only 0.49 in our benchmark. We identify four primary types of explanation errors, with 45.4% of them involving a focus on completely unrelated parts of the summary.
arXiv Detail & Related papers (2024-12-17T23:26:44Z) - Preference Optimization for Reasoning with Pseudo Feedback [100.62603571434167]
We introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions as an evaluation against associated test cases. We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks.
arXiv Detail & Related papers (2024-11-25T12:44:02Z) - Enhancing Mathematical Reasoning in LLMs by Stepwise Correction [39.67266805233599]
Best-of-N decoding methods instruct large language models (LLMs) to generate multiple solutions, score each using a scoring function, and select the highest scored as the final answer to mathematical reasoning problems.
We propose a novel prompting method named Stepwise Correction (StepCo) that helps LLMs identify and revise incorrect steps in their generated reasoning paths.
The verify-then-revise process not only improves answer correctness but also reduces token consumption by requiring fewer generation paths.
arXiv Detail & Related papers (2024-10-16T18:18:42Z) - xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation [9.22621553566816]
This paper shows that optimizing the key answer extraction module improves extraction accuracy and enhances evaluation reliability. We propose xFinder, a novel evaluator for answer extraction and matching in large language models (LLMs) evaluation. Generalization tests and real-world evaluations show that the smallest xFinder model, with only 500 million parameters, achieves an average extraction accuracy of 93.42%. The final judgment accuracy of xFinder reaches 97.61%, outperforming existing evaluation frameworks and judge models.
arXiv Detail & Related papers (2024-05-20T08:30:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.