Solver-in-the-Loop: MDP-Based Benchmarks for Self-Correction and Behavioral Rationality in Operations Research
- URL: http://arxiv.org/abs/2601.21008v1
- Date: Wed, 28 Jan 2026 20:02:44 GMT
- Title: Solver-in-the-Loop: MDP-Based Benchmarks for Self-Correction and Behavioral Rationality in Operations Research
- Authors: Ruicheng Ao, David Simchi-Levi, Xinshang Wang
- Abstract summary: Operations Research practitioners routinely debug infeasible models through an iterative process. We introduce two benchmarks that place the solver in the evaluation loop. We find that domain-specific RLVR training enables an 8B model to surpass frontier APIs.
- Score: 19.31559944205485
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Operations Research practitioners routinely debug infeasible models through an iterative process: analyzing Irreducible Infeasible Subsystems (\IIS{}), identifying constraint conflicts, and systematically repairing formulations until feasibility is achieved. Yet existing LLM benchmarks evaluate OR as one-shot translation -- given a problem description, generate solver code -- ignoring this diagnostic loop entirely. We introduce two benchmarks that place the \textbf{solver in the evaluation loop}. \textbf{\ORDebug{}} evaluates iterative self-correction through 5,000+ problems spanning 9 error types; each repair action triggers solver re-execution and \IIS{} recomputation, providing deterministic, verifiable feedback. \textbf{\ORBias{}} evaluates behavioral rationality through 2,000 newsvendor instances (1,000 ID + 1,000 OOD), measuring systematic deviations from closed-form optimal policies. Across 26 models and 12,000+ samples, we find that domain-specific RLVR training enables an 8B model to surpass frontier APIs: 95.3\% vs 86.2\% recovery rate (+9.1\%), 62.4\% vs 47.8\% diagnostic accuracy (+14.6\%), and 2.25 vs 3.78 steps to resolution (1.7$\times$ faster). On \ORBias{}, curriculum training achieves the only negative ID$\rightarrow$OOD bias drift among models evaluated (-9.6\%), reducing systematic bias by 48\% (from 20.0\% to 10.4\%). These results demonstrate that process-level evaluation with verifiable oracles enables targeted training that outperforms scale.
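The \ORBias{} benchmark measures deviations from the closed-form optimal newsvendor policy. For readers unfamiliar with that baseline, the following sketch computes the textbook critical-fractile solution for normally distributed demand; the specific demand parameters and costs below are illustrative assumptions, not values from the paper.

```python
from statistics import NormalDist

def newsvendor_optimal(mu, sigma, underage_cost, overage_cost):
    """Closed-form newsvendor order quantity for demand ~ Normal(mu, sigma).

    The critical fractile cu / (cu + co) is the optimal service level;
    the optimal order quantity is the corresponding demand quantile.
    """
    critical_fractile = underage_cost / (underage_cost + overage_cost)
    return mu + sigma * NormalDist().inv_cdf(critical_fractile)

# Illustrative instance: demand ~ N(100, 20), underage cost 3, overage cost 1.
# Critical fractile = 0.75, so the policy orders above mean demand.
q_star = newsvendor_optimal(100, 20, underage_cost=3, overage_cost=1)
```

A model's "bias" on such an instance is then the systematic gap between its recommended quantity and `q_star`, e.g. the pull-to-center effect documented in behavioral OR.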
Related papers
- In-Context Environments Induce Evaluation-Awareness in Language Models [0.12691047660244334]
Humans often become more self-aware under threat, yet can lose self-awareness when absorbed in a task. We introduce a black-box adversarial optimization framework treating the in-context prompt as an optimizable environment. We show that adversarially optimized prompts pose a substantially greater threat to evaluation reliability than previously understood.
arXiv Detail & Related papers (2026-03-04T08:22:02Z) - OptiRepair: Closed-Loop Diagnosis and Repair of Supply Chain Optimization Models with LLM Agents [19.31559944205485]
Supply chain optimization models frequently become infeasible because of modeling errors. We decompose this task into two phases: a domain-agnostic feasibility phase and a domain-specific validation phase. We test 22 API models from seven families on 976 multi-echelon supply chain problems.
arXiv Detail & Related papers (2026-02-23T02:19:05Z) - How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs [49.61011897610774]
How2Everything is a framework to evaluate and improve goal-conditioned procedure generation. Our framework includes How2Mine, which mines 351K procedures from 980K web pages across 14 topics. How2Score is an evaluation protocol that uses an LLM judge to detect whether a generation contains any critical failure that would prevent achieving the goal.
arXiv Detail & Related papers (2026-02-09T15:47:14Z) - ACAR: Adaptive Complexity Routing for Multi-Model Ensembles with Auditable Decision Traces [3.151184728006369]
We present ACAR, a measurement framework for studying multi-model orchestration under auditable conditions. ACAR uses self-consistency variance ($\sigma$) computed from N=3 probe samples to route tasks across single-model, two-model, and three-model execution modes. We evaluate ACAR on 1,510 tasks spanning four benchmarks, producing more than 7,550 auditable runs.
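The routing mechanism described above can be sketched as follows. This is a minimal illustration of variance-based routing, not ACAR's implementation: the numeric probe answers and the two thresholds are hypothetical placeholders.

```python
from statistics import pvariance

def route_by_variance(probe_answers, low_thresh=0.1, high_thresh=0.5):
    """Route a task by self-consistency variance over N probe samples.

    Low variance among probes suggests the task is easy enough for a
    single model; high variance escalates to a larger ensemble.
    Thresholds here are illustrative, not from the paper.
    """
    sigma2 = pvariance(probe_answers)
    if sigma2 <= low_thresh:
        return "single-model"
    if sigma2 <= high_thresh:
        return "two-model"
    return "three-model"

# Three identical probe answers: zero variance, cheapest mode suffices.
mode = route_by_variance([1.0, 1.0, 1.0])
```

The design choice is that disagreement among cheap probe samples is used as a proxy for task difficulty, so expensive multi-model modes are reserved for tasks where a single model is unreliable.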
arXiv Detail & Related papers (2026-02-06T23:27:17Z) - Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains [97.5573252172065]
We train a family of Automatic Reasoning Evaluators (FARE) with a simple iterative rejection-sampling supervised finetuning approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators. As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH.
arXiv Detail & Related papers (2025-10-20T17:52:06Z) - Reliable Fine-Grained Evaluation of Natural Language Math Proofs [30.992321135182905]
We propose a systematic methodology for developing evaluators that assign fine-grained scores on a 0-7 scale to model-generated math proofs. We introduce ProofBench, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions. Our analysis delivers ProofGrader, an evaluator that combines a strong reasoning backbone LM, rich context from reference solutions and marking schemes, and a simple ensembling method.
arXiv Detail & Related papers (2025-10-14T02:59:07Z) - Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z) - Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems [20.846301581161978]
Failure attribution in multi-agent systems is a critical yet unsolved challenge. Current methods treat this as a pattern recognition task over long conversation logs. A2P Scaffolding transforms failure attribution from pattern recognition into a structured causal inference task.
arXiv Detail & Related papers (2025-09-12T16:51:15Z) - Probing for Arithmetic Errors in Language Models [86.8227317662622]
Internal activations in language models can be used to detect arithmetic errors. We show that simple probes can accurately decode both the model's predicted output and the correct answer from hidden states. We train lightweight error detectors that predict model correctness with over 90% accuracy.
arXiv Detail & Related papers (2025-07-16T16:27:50Z) - Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
This paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps. We find significant performance degradation on novel or incomplete data. These findings highlight the reliance on recall over rigorous logical inference.
arXiv Detail & Related papers (2025-03-06T15:36:06Z) - SummExecEdit: A Factual Consistency Benchmark in Summarization with Executable Edits [31.98028879922584]
We introduce SummExecEdit, a novel pipeline and benchmark to assess models on their ability to both detect factual errors and provide accurate explanations. The top-performing model, Claude3-Opus, achieves a joint detection and explanation score of only 0.49 in our benchmark. We identify four primary types of explanation errors, with 45.4% of them involving a focus on completely unrelated parts of the summary.
arXiv Detail & Related papers (2024-12-17T23:26:44Z) - Preference Optimization for Reasoning with Pseudo Feedback [100.62603571434167]
We introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions as an evaluation against associated test cases. We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks.
arXiv Detail & Related papers (2024-11-25T12:44:02Z) - Enhancing Mathematical Reasoning in LLMs by Stepwise Correction [39.67266805233599]
Best-of-N decoding methods instruct large language models (LLMs) to generate multiple solutions, score each using a scoring function, and select the highest scored as the final answer to mathematical reasoning problems.
We propose a novel prompting method named Stepwise Correction (StepCo) that helps LLMs identify and revise incorrect steps in their generated reasoning paths.
The verify-then-revise process not only improves answer correctness but also reduces token consumption by requiring fewer generation paths.
arXiv Detail & Related papers (2024-10-16T18:18:42Z) - xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation [9.22621553566816]
This paper shows that optimizing the key answer extraction module improves extraction accuracy and enhances evaluation reliability. We propose xFinder, a novel evaluator for answer extraction and matching in large language models (LLMs) evaluation. Generalization tests and real-world evaluations show that the smallest xFinder model, with only 500 million parameters, achieves an average extraction accuracy of 93.42%. The final judgment accuracy of xFinder reaches 97.61%, outperforming existing evaluation frameworks and judge models.
arXiv Detail & Related papers (2024-05-20T08:30:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.