Constructing Industrial-Scale Optimization Modeling Benchmark
- URL: http://arxiv.org/abs/2602.10450v1
- Date: Wed, 11 Feb 2026 02:45:31 GMT
- Title: Constructing Industrial-Scale Optimization Modeling Benchmark
- Authors: Zhong Li, Hongliang Lu, Tao Wei, Wenyu Liu, Yuxuan Chen, Yuan Lan, Fan Zhang, Zaiwen Wen,
- Abstract summary: A key bottleneck is the lack of benchmarks that align natural-language specifications with reference formulations/solver code grounded in real optimization models.<n>We introduce MIPLIB-NL, built via a structure-aware reverse construction methodology from real mixed-integer linear programs.<n>Experiments show substantial performance degradation on MIPLIB-NL for systems that perform strongly on existing benchmarks.
- Score: 26.61380804019141
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Optimization modeling underpins decision-making in logistics, manufacturing, energy, and finance, yet translating natural-language requirements into correct optimization formulations and solver-executable code remains labor-intensive. Although large language models (LLMs) have been explored for this task, evaluation is still dominated by toy-sized or synthetic benchmarks, masking the difficulty of industrial problems with $10^{3}$--$10^{6}$ (or more) variables and constraints. A key bottleneck is the lack of benchmarks that align natural-language specifications with reference formulations/solver code grounded in real optimization models. To fill in this gap, we introduce MIPLIB-NL, built via a structure-aware reverse construction methodology from real mixed-integer linear programs in MIPLIB~2017. Our pipeline (i) recovers compact, reusable model structure from flat solver formulations, (ii) reverse-generates natural-language specifications explicitly tied to this recovered structure under a unified model--data separation format, and (iii) performs iterative semantic validation through expert review and human--LLM interaction with independent reconstruction checks. This yields 223 one-to-one reconstructions that preserve the mathematical content of the original instances while enabling realistic natural-language-to-optimization evaluation. Experiments show substantial performance degradation on MIPLIB-NL for systems that perform strongly on existing benchmarks, exposing failure modes invisible at toy scale.
Related papers
- ReLoop: Structured Modeling and Behavioral Verification for Reliable LLM-Based Optimization [6.572539312871392]
Large language models (LLMs) can translate natural language into optimization code, but silent failures pose a critical risk.<n>We introduce ReLoop, addressing silent failures from two complementary directions.
arXiv Detail & Related papers (2026-02-17T20:20:33Z) - Unlocking Reasoning Capability on Machine Translation in Large Language Models [57.60641851466707]
Reasoning-oriented large language models (RLMs) achieve strong gains on tasks such as mathematics and coding by generating explicit intermediate reasoning.<n>We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark.<n>We find that enabling explicit reasoning consistently degrades translation quality across languages and models.
arXiv Detail & Related papers (2026-02-16T14:05:59Z) - Structure-Aware Robust Counterfactual Explanations via Conditional Gaussian Network Classifiers [0.26999000177990923]
This work presents a structure-aware robustness-and-counterfactual search method based on conditional conditional graphs.<n>Results show that our method achieves strong consistency, with direct optimization of the original formulation providing especially stable dependencies.<n>The proposed framework lays the groundwork for future advances in counterfactual reasoning under noncyclic constraints.
arXiv Detail & Related papers (2026-02-08T15:51:45Z) - From Brute Force to Semantic Insight: Performance-Guided Data Transformation Design with LLMs [48.83701310501069]
Large language models (LLMs) have achieved notable performance in code synthesis.<n>We introduce a performance-aware, closed-loop solution that enables LLMs to autonomously engineer optimal transformations.<n>We fine-tune LLMs with Low-Rank Adaptation on a novel repository of more than 6,000 empirically evaluated PyTorch augmentation functions.
arXiv Detail & Related papers (2026-01-07T11:13:02Z) - ORGEval: Graph-Theoretic Evaluation of LLMs in Optimization Modeling [18.8099769877788]
ORGEval is a graph-theoretic evaluation framework for assessing Large Language Models' capabilities in formulating linear and mixed-integer linear programs.<n>We show that ORGEval can successfully detect model equivalence and produce 100% consistent results across random parameter configurations.<n>Our results reveal that although optimization modeling remains challenging for all LLMs, DeepSeek-V3 and Claude-Opus-4 achieve the highest accuracies under direct prompting.
arXiv Detail & Related papers (2025-10-31T16:35:52Z) - Benchmarking Generative AI Against Bayesian Optimization for Constrained Multi-Objective Inverse Design [0.15293427903448018]
This paper investigates the performance of Large Language Models (LLMs) as generative feasibles for solving constrained multi-objective regression tasks.<n>The best-performing LLM (Math-7B) achieved a Generational Distance (GD) of 1.21, significantly outperforming the traditional BoTorch Ax baseline.<n>The findings have direct industrial applications in optimizing formulation design for resins, rheological, and chemical properties.
arXiv Detail & Related papers (2025-10-29T10:37:09Z) - Peering Inside the Black Box: Uncovering LLM Errors in Optimization Modelling through Component-Level Evaluation [0.0]
We present a component-level evaluation framework for large language models (LLMs)<n>We evaluate GPT-5, LLaMA 3.1 Instruct, and DeepSeek Math across optimization problems of varying complexity.<n>Results show GPT-5 consistently outperforms other models, with chain-of-thought, self-consistency, and modular prompting proving most effective.
arXiv Detail & Related papers (2025-10-19T17:47:59Z) - Autoformalizer with Tool Feedback [52.334957386319864]
Autoformalization addresses the scarcity of data for Automated Theorem Proving (ATP) by translating mathematical problems from natural language into formal statements.<n>Existing formalizer still struggles to consistently generate valid statements that meet syntactic validity and semantic consistency.<n>We propose the Autoformalizer with Tool Feedback (ATF), a novel approach that incorporates syntactic and consistency information as tools into the formalization process.
arXiv Detail & Related papers (2025-10-08T10:25:12Z) - Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments [70.42705564227548]
We propose an automated environment construction pipeline for large language models (LLMs)<n>This enables the creation of high-quality training environments that provide detailed and measurable feedback without relying on external tools.<n>We also introduce a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution.
arXiv Detail & Related papers (2025-08-12T09:45:19Z) - Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications.<n>One core challenge of evaluation in the large language model (LLM) era is the generalization issue.<n>We propose Model Utilization Index (MUI), a mechanism interpretability enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z) - Autoformulation of Mathematical Optimization Models Using LLMs [50.030647274271516]
This paper approaches the problem of $textitautoformulation$: the automated creation of solver-ready optimization models from natural language problem descriptions.<n>We identify three core challenges of autoformulation: $textit(1)$ the vast, problem-dependent hypothesis space, and $textit(2)$ efficient and diverse exploration of this space under uncertainty.<n>We present a novel method leveraging $textitLarge Language Models$ with $textitMonte-Carlo Tree Search$, exploiting the hierarchical nature of optimization modeling to generate and systematically explore possible formulations
arXiv Detail & Related papers (2024-11-03T20:41:38Z) - Symbolic Regression by Exhaustive Search: Reducing the Search Space
Using Syntactical Constraints and Efficient Semantic Structure Deduplication [2.055204980188575]
Symbolic regression is a powerful system identification technique in industrial scenarios where no prior knowledge on model structure is available.
In this chapter we introduce a deterministic symbolic regression algorithm specifically designed to address these issues.
A finite enumeration of all possible models is guaranteed by structural restrictions as well as a caching mechanism for detecting semantically equivalent solutions.
arXiv Detail & Related papers (2021-09-28T17:47:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.