ReLoop: Structured Modeling and Behavioral Verification for Reliable LLM-Based Optimization
- URL: http://arxiv.org/abs/2602.15983v1
- Date: Tue, 17 Feb 2026 20:20:33 GMT
- Title: ReLoop: Structured Modeling and Behavioral Verification for Reliable LLM-Based Optimization
- Authors: Junbo Jacob Lian, Yujun Sun, Huiling Chen, Chaoyu Zhang, Chung-Piaw Teo
- Abstract summary: Large language models (LLMs) can translate natural language into optimization code, but silent failures pose a critical risk. We introduce ReLoop, addressing silent failures from two complementary directions.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) can translate natural language into optimization code, but silent failures pose a critical risk: code that executes and returns solver-feasible solutions may encode semantically incorrect formulations, creating a feasibility-correctness gap of up to 90 percentage points on compositional problems. We introduce ReLoop, addressing silent failures from two complementary directions. Structured generation decomposes code production into a four-stage reasoning chain (understand, formalize, synthesize, verify) that mirrors expert modeling practice, with explicit variable-type reasoning and self-verification to prevent formulation errors at their source. Behavioral verification detects errors that survive generation by testing whether the formulation responds correctly to solver-based parameter perturbation, without requiring ground truth -- an external semantic signal that bypasses the self-consistency problem inherent in LLM-based code review. The two mechanisms are complementary: structured generation dominates on complex compositional problems, while behavioral verification becomes the largest single contributor on problems with localized formulation defects. Together with execution recovery via IIS-enhanced diagnostics, ReLoop raises correctness from 22.6% to 31.1% and execution from 72.1% to 100.0% on the strongest model, with consistent gains across five models spanning three paradigms (foundation, SFT, RL) and three benchmarks. We additionally release RetailOpt-190, 190 compositional retail optimization scenarios targeting the multi-constraint interactions where LLMs most frequently fail.
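The behavioral-verification idea from the abstract, checking that a formulation responds correctly to parameter perturbation without any ground truth, can be sketched on a toy problem. The following is a hypothetical illustration, not ReLoop's actual implementation: `knapsack`, `buggy_knapsack`, and `behavioral_check` are invented names, and the two probes (monotonicity under capacity relaxation, responsiveness across a capacity sweep) are one plausible reading of "solver-based parameter perturbation".

```python
from itertools import product

def knapsack(profits, weights, capacity):
    """Brute-force 0/1 knapsack: maximize profit s.t. total weight <= capacity."""
    best = 0
    for pick in product([0, 1], repeat=len(profits)):
        if sum(c * w for c, w in zip(pick, weights)) <= capacity:
            best = max(best, sum(c * p for c, p in zip(pick, profits)))
    return best

def buggy_knapsack(profits, weights, capacity):
    """A silent failure: runs and returns feasible solutions, but the
    capacity parameter is ignored (hardcoded bound), so the formulation
    is semantically wrong while looking perfectly healthy."""
    return knapsack(profits, weights, 10)

def behavioral_check(model, profits, weights):
    """Ground-truth-free perturbation probes in the spirit of the paper:
    (1) monotonicity: relaxing capacity must never lower a maximization objective;
    (2) responsiveness: the objective must react somewhere as capacity sweeps."""
    caps = range(0, sum(weights) + 1)
    objs = [model(profits, weights, c) for c in caps]
    monotone = all(a <= b for a, b in zip(objs, objs[1:]))
    responsive = len(set(objs)) > 1
    return monotone and responsive

profits, weights = [3, 4, 5], [2, 3, 4]
print(behavioral_check(knapsack, profits, weights))        # True
print(behavioral_check(buggy_knapsack, profits, weights))  # False: flat response
```

The point of the sketch: both models execute and return solver-feasible answers, so execution success alone cannot separate them; only probing the model's *response* to perturbed parameters exposes the hardcoded-capacity defect.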
Related papers
- Constructing Industrial-Scale Optimization Modeling Benchmark [26.61380804019141]
A key bottleneck is the lack of benchmarks that align natural-language specifications with reference formulations/solver code grounded in real optimization models. We introduce MIPLIB-NL, built via a structure-aware reverse construction methodology from real mixed-integer linear programs. Experiments show substantial performance degradation on MIPLIB-NL for systems that perform strongly on existing benchmarks.
arXiv Detail & Related papers (2026-02-11T02:45:31Z) - Task-Awareness Improves LLM Generations and Uncertainty [48.857040212979484]
Bayes-optimal responses consistently outperform standard decoding methods like beam search. Our decision-theoretic framework is applicable to any problem that admits a latent response structure.
arXiv Detail & Related papers (2026-01-29T10:16:23Z) - Structured Uncertainty guided Clarification for LLM Agents [126.26213027785813]
LLM agents extend large language models with tool-calling capabilities, but ambiguous user instructions often lead to incorrect invocations and task failures. We introduce a principled formulation of structured uncertainty over tool-call parameters, modeling joint tool-argument clarification as a POMDP with an Expected Value of Perfect Information (EVPI) objective for optimal question selection and aspect-based cost modeling to prevent redundancy. Our SAGE-Agent leverages this structured uncertainty to achieve superior efficiency: increasing coverage on ambiguous tasks by 7-39% while reducing clarification questions by 1.5-2.7× compared to strong prompting and uncertainty-based baselines.
arXiv Detail & Related papers (2025-11-11T21:50:44Z) - ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization [73.0780809974414]
We propose a Reflective Autoformalization method that integrates semantic consistency evaluation into the autoformalization process. This enables the model to iteratively generate formal statements, assess their semantic fidelity, and self-correct identified errors. Experiments show that ReForm achieves an average improvement of 22.6 percentage points over the strongest baselines.
arXiv Detail & Related papers (2025-10-28T16:22:54Z) - Peering Inside the Black Box: Uncovering LLM Errors in Optimization Modelling through Component-Level Evaluation [0.0]
We present a component-level evaluation framework for large language models (LLMs). We evaluate GPT-5, LLaMA 3.1 Instruct, and DeepSeek Math across optimization problems of varying complexity. Results show GPT-5 consistently outperforms the other models, with chain-of-thought, self-consistency, and modular prompting proving most effective.
arXiv Detail & Related papers (2025-10-19T17:47:59Z) - Optimization Modeling via Semantic Anchored Alignment [30.047608671041104]
We propose SAC-Opt, a backward-guided correction framework that grounds optimization modeling in problem semantics rather than solver feedback. At each step, SAC-Opt aligns the original semantic anchors with those reconstructed from the generated code and selectively corrects only the mismatched components. Empirical results on seven public datasets demonstrate that SAC-Opt improves average modeling accuracy by 7.8%, with gains of up to 21.9% on the ComplexLP dataset.
arXiv Detail & Related papers (2025-09-28T12:25:31Z) - EVALOOOP: A Self-Consistency-Centered Framework for Assessing Large Language Model Robustness in Programming [8.52533297070733]
EVALOOOP is an assessment framework that evaluates robustness from a self-consistency perspective. We evaluate 96 popular large language models (LLMs) on the MBPP Plus benchmark. EVALOOOP induces a 2.65%-47.62% absolute drop in pass@1 accuracy within ten loops.
arXiv Detail & Related papers (2025-05-18T01:02:33Z) - LLM2: Let Large Language Models Harness System 2 Reasoning [65.89293674479907]
Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs. We introduce LLM2, a novel framework that combines an LLM with a process-based verifier. The LLM generates plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable from undesirable outputs.
arXiv Detail & Related papers (2024-12-29T06:32:36Z) - Autoformulation of Mathematical Optimization Models Using LLMs [50.030647274271516]
This paper approaches the problem of autoformulation: the automated creation of solver-ready optimization models from natural language problem descriptions. We identify three core challenges of autoformulation: (1) the vast, problem-dependent hypothesis space, and (2) efficient and diverse exploration of this space under uncertainty. We present a novel method leveraging Large Language Models with Monte-Carlo Tree Search, exploiting the hierarchical nature of optimization modeling to generate and systematically explore possible formulations.
arXiv Detail & Related papers (2024-11-03T20:41:38Z) - Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z) - RCOT: Detecting and Rectifying Factual Inconsistency in Reasoning by Reversing Chain-of-Thought [56.558892336235914]
Reversing Chain-of-Thought (RCoT) is a novel method to improve large language models' reasoning abilities.
RCoT automatically detects and rectifies factual inconsistencies in generated solutions.
We show that manually written fine-grained feedback can dramatically improve LLMs' reasoning abilities.
arXiv Detail & Related papers (2023-05-19T08:02:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.