Related papers: OR-R1: Automating Modeling and Solving of Operations Research Optimization Problem via Test-Time Reinforcement Learning

URL: http://arxiv.org/abs/2511.09092v1
Date: Thu, 13 Nov 2025 01:31:16 GMT
Title: OR-R1: Automating Modeling and Solving of Operations Research Optimization Problem via Test-Time Reinforcement Learning
Authors: Zezhen Ding, Zhen Tan, Jiheng Zhang, Tianlong Chen
Abstract summary: We present OR-R1, a data-efficient training framework for automated optimization modeling and solving. Experiments show that OR-R1 achieves state-of-the-art performance with an average solving accuracy of $67.7\%$.
Score: 44.346973471913856
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Optimization modeling and solving are fundamental to the application of Operations Research (OR) in real-world decision making, yet the process of translating natural language problem descriptions into formal models and solver code remains highly expertise-intensive. While recent advances in large language models (LLMs) have opened new opportunities for automation, the generalization ability and data efficiency of existing LLM-based methods are still limited, as most require vast amounts of annotated or synthetic data, resulting in high costs and scalability barriers. In this work, we present OR-R1, a data-efficient training framework for automated optimization modeling and solving. OR-R1 first employs supervised fine-tuning (SFT) to help the model acquire the essential reasoning patterns for problem formulation and code generation from limited labeled data. In addition, it improves the capability and consistency through Test-Time Group Relative Policy Optimization (TGRPO). This two-stage design enables OR-R1 to leverage both scarce labeled and abundant unlabeled data for effective learning. Experiments show that OR-R1 achieves state-of-the-art performance with an average solving accuracy of $67.7\%$, using only $1/10$ the synthetic data required by prior methods such as ORLM, exceeding ORLM's solving accuracy by up to $4.2\%$. Remarkably, OR-R1 outperforms ORLM by over $2.4\%$ with just $100$ synthetic samples. Furthermore, TGRPO contributes an additional $3.1\%-6.4\%$ improvement in accuracy, significantly narrowing the gap between single-attempt (Pass@1) and multi-attempt (Pass@8) performance from $13\%$ to $7\%$. Extensive evaluations across diverse real-world benchmarks demonstrate that OR-R1 provides a robust, scalable, and cost-effective solution for automated OR optimization problem modeling and solving, lowering the expertise and data barriers for industrial OR applications.
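
The abstract compares single-attempt (Pass@1) and multi-attempt (Pass@8) accuracy but does not say how these are computed. As a minimal sketch, the widely used unbiased pass@k estimator can illustrate why the two numbers diverge when a model is inconsistent across attempts. The function name `pass_at_k` and the per-problem counts below are hypothetical and are not taken from the paper, which may compute the metrics differently.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k
    attempts, drawn without replacement from n attempts, is correct,
    given that c of the n attempts were correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # With fewer than k wrong attempts, every k-subset contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts of correct attempts out of 8 per problem (illustrative only).
correct_counts = [8, 5, 0, 1]
n = 8

pass1 = sum(pass_at_k(n, c, 1) for c in correct_counts) / len(correct_counts)
pass8 = sum(pass_at_k(n, c, 8) for c in correct_counts) / len(correct_counts)
print(f"Pass@1 = {pass1:.4f}, Pass@8 = {pass8:.4f}")
```

In this toy data the problems solved in only some of the 8 attempts count fully toward Pass@8 but only fractionally toward Pass@1. A method that makes the model more consistent across attempts would therefore shrink the gap between the two metrics, which is the kind of effect the abstract attributes to TGRPO.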

Related papers

MIRROR: A Multi-Agent Framework with Iterative Adaptive Revision and Hierarchical Retrieval for Optimization Modeling in Operations Research [15.28095645151852]
MIRROR is a fine-tuning-free, end-to-end multi-agent framework for operations research. It translates natural language optimization problems into mathematical models and solver code. Experiments show that MIRROR outperforms existing methods on standard Operations Research benchmarks.
arXiv Detail & Related papers (2026-02-03T09:46:56Z)
Budget-Aware Anytime Reasoning with LLM-Synthesized Preference Data [57.996437077411315]
We study the reasoning behavior of large language models (LLMs) under limited computation budgets. We introduce an anytime reasoning framework and the Anytime Index, a metric that quantifies how effectively solution quality improves as reasoning tokens increase. Experiments on NaturalPlan (Trip), AIME, and GPQA datasets show consistent gains across Grok-3, GPT-oss, GPT-4.1/4o, and LLaMA models.
arXiv Detail & Related papers (2026-01-16T07:09:30Z)
ROAD: Reflective Optimization via Automated Debugging for Zero-Shot Agent Alignment [1.6968020497268546]
ROAD is a novel framework that treats optimization as a dynamic debug investigation rather than a search. ROAD is highly sample-efficient, achieving a 5.6 percent increase in success rate and a 3.8 percent increase in search accuracy. These findings suggest that mimicking the human engineering loop of failure analysis and patching offers a viable, data-efficient alternative to resource-intensive training.
arXiv Detail & Related papers (2025-12-30T07:31:34Z)
MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization [103.74675519953898]
Long-chain reflective reasoning is a prerequisite for solving complex real-world problems. We build a benchmark consisting of 1,260 samples of 42 challenging synthetic tasks. We generate post-training data and explore learning paradigms for exploiting such data.
arXiv Detail & Related papers (2025-10-09T17:53:58Z)
Step-Opt: Boosting Optimization Modeling in LLMs through Iterative Data Synthesis and Structured Validation [18.18239596347168]
Step-Opt-Instruct is a framework that augments existing datasets and generates high-quality fine-tuning data tailored to optimization modeling. We fine-tune open-source LLMs, including LLaMA-3-8B and Mistral-7B, to develop Step-Opt, a model that achieves state-of-the-art performance on benchmarks such as NL4OPT, MAMO, and IndustryOR.
arXiv Detail & Related papers (2025-06-21T08:42:27Z)
AdaptiveLLM: A Framework for Selecting Optimal Cost-Efficient LLM for Code-Generation Based on CoT Length [5.856039862078523]
We introduce AdaptiveLLM, a framework that dynamically selects optimal Large Language Models (LLMs) for a given coding task by automatically assessing task difficulty. Our framework first estimates task difficulty using Chain-of-Thought lengths generated by a reasoning model, clusters these into three difficulty levels via k-means, and fine-tunes CodeBERT to embed difficulty-aware features. Our framework achieves a 7.86% improvement in pass@1 score while reducing resource consumption by 88.9% compared to the baseline method ComplexityNet.
arXiv Detail & Related papers (2025-06-12T09:43:48Z)
ORMind: A Cognitive-Inspired End-to-End Reasoning Framework for Operations Research [56.961539386979354]
We introduce ORMind, a cognitive-inspired framework that enhances optimization through counterfactual reasoning. Our approach emulates human cognition, implementing an end-to-end workflow that transforms requirements into mathematical models and executable code. It is currently being tested internally in Lenovo's AI Assistant, with plans to enhance optimization capabilities for both business and consumer customers.
arXiv Detail & Related papers (2025-06-02T05:11:21Z)
OR-LLM-Agent: Automating Modeling and Solving of Operations Research Optimization Problems with Reasoning LLM [15.260794368585692]
We propose OR-LLM-Agent, an AI agent framework built on reasoning LLMs for automated Operations Research problem solving. We show that OR-LLM-Agent utilizing DeepSeek-R1 in its framework outperforms advanced methods, including GPT-o3, Gemini 2.5 Pro, DeepSeek-R1, and ORLM, by at least 7% in accuracy.
arXiv Detail & Related papers (2025-03-13T03:40:50Z)
Autoformulation of Mathematical Optimization Models Using LLMs [50.030647274271516]
This paper approaches the problem of $\textit{autoformulation}$: the automated creation of solver-ready optimization models from natural language problem descriptions. We identify three core challenges of autoformulation, including $\textit{(1)}$ the vast, problem-dependent hypothesis space and $\textit{(2)}$ efficient and diverse exploration of this space under uncertainty. We present a novel method leveraging $\textit{Large Language Models}$ with $\textit{Monte-Carlo Tree Search}$, exploiting the hierarchical nature of optimization modeling to generate and systematically explore possible formulations.
arXiv Detail & Related papers (2024-11-03T20:41:38Z)
Self-Steering Optimization: Autonomous Preference Optimization for Large Language Models [79.84205827056907]
We present Self-Steering Optimization ($SSO$), an algorithm that autonomously generates high-quality preference data. $SSO$ employs a specialized optimization objective to build a data generator from the policy model itself, which is used to produce accurate and on-policy data. Our evaluation shows that $SSO$ consistently outperforms baselines in human preference alignment and reward optimization.
arXiv Detail & Related papers (2024-10-22T16:04:03Z)
ORLM: A Customizable Framework in Training Large Models for Automated Optimization Modeling [15.67321902882617]
We propose a viable path for training open-source LLMs capable of optimization modeling and developing solver codes. This work also introduces IndustryOR, the first industrial benchmark for evaluating LLMs in solving practical OR problems.
arXiv Detail & Related papers (2024-05-28T01:55:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.