Related papers: ORGEval: Graph-Theoretic Evaluation of LLMs in Optimization Modeling

ORGEval: Graph-Theoretic Evaluation of LLMs in Optimization Modeling

URL: http://arxiv.org/abs/2510.27610v1
Date: Fri, 31 Oct 2025 16:35:52 GMT
Title: ORGEval: Graph-Theoretic Evaluation of LLMs in Optimization Modeling
Authors: Zhuohan Wang, Ziwei Zhu, Ziniu Li, Congliang Chen, Yizhou Han, Yufeng Lin, Zhihang Lin, Angyang Gu, Xinglin Hu, Ruoyu Sun, Tian Ding,
Abstract summary: ORGEval is a graph-theoretic evaluation framework for assessing Large Language Models' capabilities in formulating linear and mixed-integer linear programs.<n>We show that ORGEval can successfully detect model equivalence and produce 100% consistent results across random parameter configurations.<n>Our results reveal that although optimization modeling remains challenging for all LLMs, DeepSeek-V3 and Claude-Opus-4 achieve the highest accuracies under direct prompting.
Score: 18.8099769877788
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Formulating optimization problems for industrial applications demands significant manual effort and domain expertise. While Large Language Models (LLMs) show promise in automating this process, evaluating their performance remains difficult due to the absence of robust metrics. Existing solver-based approaches often face inconsistency, infeasibility issues, and high computational costs. To address these issues, we propose ORGEval, a graph-theoretic evaluation framework for assessing LLMs' capabilities in formulating linear and mixed-integer linear programs. ORGEval represents optimization models as graphs, reducing equivalence detection to graph isomorphism testing. We identify and prove a sufficient condition, when the tested graphs are symmetric decomposable (SD), under which the Weisfeiler-Lehman (WL) test is guaranteed to correctly detect isomorphism. Building on this, ORGEval integrates a tailored variant of the WL-test with an SD detection algorithm to evaluate model equivalence. By focusing on structural equivalence rather than instance-level configurations, ORGEval is robust to numerical variations. Experimental results show that our method can successfully detect model equivalence and produce 100\% consistent results across random parameter configurations, while significantly outperforming solver-based methods in runtime, especially on difficult problems. Leveraging ORGEval, we construct the Bench4Opt dataset and benchmark state-of-the-art LLMs on optimization modeling. Our results reveal that although optimization modeling remains challenging for all LLMs, DeepSeek-V3 and Claude-Opus-4 achieve the highest accuracies under direct prompting, outperforming even leading reasoning models.

Related papers

Constructing Industrial-Scale Optimization Modeling Benchmark [26.61380804019141]
A key bottleneck is the lack of benchmarks that align natural-language specifications with reference formulations/solver code grounded in real optimization models.<n>We introduce MIPLIB-NL, built via a structure-aware reverse construction methodology from real mixed-integer linear programs.<n>Experiments show substantial performance degradation on MIPLIB-NL for systems that perform strongly on existing benchmarks.
arXiv Detail & Related papers (2026-02-11T02:45:31Z)
Mixture-of-Experts Models in Vision: Routing, Optimization, and Generalization [0.0]
We study MoE behavior in an image classification setting, focusing on predictive performance, expert utilization, and generalization.<n>We compare dense, SoftMoE, and SparseMoE classifier heads on the CIFAR10 dataset under comparable model capacity.<n>Both MoE variants achieve slightly higher validation accuracy than the dense baseline while maintaining balanced expert utilization through regularization.<n>We find that SoftMoE exhibits higher sharpness by these metrics, while Dense and SparseMoE lie in a similar curvature regime, despite all models achieving comparable generalization performance.
arXiv Detail & Related papers (2026-01-21T14:22:25Z)
Self-Boost via Optimal Retraining: An Analysis via Approximate Message Passing [58.52119063742121]
Retraining a model using its own predictions together with the original, potentially noisy labels is a well-known strategy for improving the model performance.<n>This paper addresses the question of how to optimally combine the model's predictions and the provided labels.<n>Our main contribution is the derivation of the Bayes optimal aggregator function to combine the current model's predictions and the given labels.
arXiv Detail & Related papers (2025-05-21T07:16:44Z)
Solver-Informed RL: Grounding Large Language Models for Authentic Optimization Modeling [3.253908111652627]
Large Language Models (LLMs) often struggle to generate formally correct and usable models against hallucinations.<n>We present a novel framework that significantly improves the authenticity of LLMs for optimization modeling using Reinforcement Learning with Verifiable Reward.
arXiv Detail & Related papers (2025-05-17T02:32:03Z)
Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications.<n>One core challenge of evaluation in the large language model (LLM) era is the generalization issue.<n>We propose Model Utilization Index (MUI), a mechanism interpretability enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
OptMATH: A Scalable Bidirectional Data Synthesis Framework for Optimization Modeling [9.617742955894247]
Lack of high-quality optimization modeling datasets hampers large language models.<n>We propose a scalable framework for synthesizing a high-quality dataset, named OptMATH.<n>We demonstrate that models of various sizes trained on OptMATH achieve superior results on multiple modeling benchmarks.
arXiv Detail & Related papers (2025-02-16T12:38:37Z)
Autoformulation of Mathematical Optimization Models Using LLMs [50.030647274271516]
This paper approaches the problem of $textitautoformulation$: the automated creation of solver-ready optimization models from natural language problem descriptions.<n>We identify three core challenges of autoformulation: $textit(1)$ the vast, problem-dependent hypothesis space, and $textit(2)$ efficient and diverse exploration of this space under uncertainty.<n>We present a novel method leveraging $textitLarge Language Models$ with $textitMonte-Carlo Tree Search$, exploiting the hierarchical nature of optimization modeling to generate and systematically explore possible formulations
arXiv Detail & Related papers (2024-11-03T20:41:38Z)
A Gradient Analysis Framework for Rewarding Good and Penalizing Bad Examples in Language Models [63.949883238901414]
We present a unique angle of gradient analysis of loss functions that simultaneously reward good examples and penalize bad ones in LMs. We find that ExMATE serves as a superior surrogate for MLE, and that combining DPO with ExMATE instead of MLE further enhances both the statistical (5-7%) and generative (+18% win rate) performance.
arXiv Detail & Related papers (2024-08-29T17:46:18Z)
Estimate-Then-Optimize versus Integrated-Estimation-Optimization versus Sample Average Approximation: A Stochastic Dominance Perspective [21.945745750737952]
We show that a reverse behavior appears when the model class is well-specified and there is sufficient data.<n>We also demonstrate how standard sample average approximation (SAA) performs the worst when the model class is well-specified in terms of regret.
arXiv Detail & Related papers (2023-04-13T21:54:53Z)
Goal-directed Generation of Discrete Structures with Conditional Generative Models [85.51463588099556]
We introduce a novel approach to directly optimize a reinforcement learning objective, maximizing an expected reward. We test our methodology on two tasks: generating molecules with user-defined properties and identifying short python expressions which evaluate to a given target value.
arXiv Detail & Related papers (2020-10-05T20:03:13Z)
Automatically Learning Compact Quality-aware Surrogates for Optimization Problems [55.94450542785096]
Solving optimization problems with unknown parameters requires learning a predictive model to predict the values of the unknown parameters and then solving the problem using these values. Recent work has shown that including the optimization problem as a layer in a complex training model pipeline results in predictions of iteration of unobserved decision making. We show that we can improve solution quality by learning a low-dimensional surrogate model of a large optimization problem.
arXiv Detail & Related papers (2020-06-18T19:11:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.