ORPR: An OR-Guided Pretrain-then-Reinforce Learning Model for Inventory Management
- URL: http://arxiv.org/abs/2512.19001v1
- Date: Mon, 22 Dec 2025 03:39:43 GMT
- Title: ORPR: An OR-Guided Pretrain-then-Reinforce Learning Model for Inventory Management
- Authors: Lingjie Zhao, Xue Yu, Yongzhi Qi, Hao Hu, Jianshen Zhang, Yingzheng Ma, Shuyu Han, Wei Qi, Zuo-Jun Max Shen
- Abstract summary: The "Pretrain-then-Reinforce" approach reconciles AI's adaptive perception with Operations Research's structural rigor. We show that a lightweight, domain-informed model can deliver state-of-the-art performance and robust transferability when guided by structured OR logic.
- Score: 9.138155308817215
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As the pursuit of synergy between Artificial Intelligence (AI) and Operations Research (OR) gains momentum in handling complex inventory systems, a critical challenge persists: how to effectively reconcile AI's adaptive perception with OR's structural rigor. To bridge this gap, we propose a novel OR-Guided "Pretrain-then-Reinforce" framework. To provide structured guidance, we propose a simulation-augmented OR model that generates high-quality reference decisions, implicitly capturing complex business constraints and managerial preferences. Leveraging these OR-derived decisions as foundational training labels, we design a domain-informed deep learning foundation model to establish foundational decision-making capabilities, followed by a reinforcement learning (RL) fine-tuning stage. Uniquely, we position RL as a deep alignment mechanism that enables the AI agent to internalize the optimality principles of OR, while simultaneously leveraging exploration for general policy refinement and allowing expert guidance for scenario-specific adaptation (e.g., promotional events). Validated through extensive numerical experiments and a field deployment at JD.com augmented by a Difference-in-Differences (DiD) analysis, our model significantly outperforms incumbent industrial practices, delivering real-world gains of a 5.27-day reduction in turnover and a 2.29% increase in in-stock rates, alongside a 29.95% decrease in holding costs. Contrary to the prevailing trend of brute-force model scaling, our study demonstrates that a lightweight, domain-informed model can deliver state-of-the-art performance and robust transferability when guided by structured OR logic. This approach offers a scalable and cost-effective paradigm for intelligent supply chain management, highlighting the value of deeply aligning AI with OR.
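The abstract's three-stage pipeline (simulation-augmented OR teacher, supervised pretraining on OR labels, RL fine-tuning) can be illustrated with a minimal sketch. Everything below is a toy stand-in, not the paper's method: a simulated newsvendor quantile rule plays the role of the OR model, a linear least-squares fit plays the role of the deep learning foundation model, and a perturb-and-accept search stands in for RL fine-tuning; all names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy inventory setting (hypothetical; stands in for the paper's real data).
UNDERAGE, OVERAGE = 2.0, 1.0          # per-unit shortage vs. holding cost
CR = UNDERAGE / (UNDERAGE + OVERAGE)  # newsvendor critical ratio

def simulate_cost(order, demand):
    """Monte-Carlo estimate of expected inventory cost for an order quantity."""
    return np.mean(UNDERAGE * np.maximum(demand - order, 0)
                   + OVERAGE * np.maximum(order - demand, 0))

# Stage 1: a simulation-augmented OR "teacher" generates reference decisions.
def or_reference(mu, sigma, n_sim=20000):
    samples = rng.normal(mu, sigma, n_sim)
    return np.quantile(samples, CR)   # critical-ratio quantile = base-stock level

# SKU features: (mean demand, demand std); labels come from the OR teacher.
features = np.array([[m, s] for m in (20, 50, 80) for s in (5, 10, 15)], float)
labels = np.array([or_reference(m, s) for m, s in features])

# Stage 2: pretrain a decision model (here, linear regression) on OR labels.
X = np.column_stack([features, np.ones(len(features))])  # add bias column
w, *_ = np.linalg.lstsq(X, labels, rcond=None)

def policy(w, mu, sigma):
    return w[0] * mu + w[1] * sigma + w[2]

# Stage 3: RL-style fine-tuning -- perturb weights, keep cost improvements.
def total_cost(w):
    return sum(simulate_cost(policy(w, m, s), rng.normal(m, s, 5000))
               for m, s in features)

pretrain_cost = total_cost(w)
best_w, best_cost = w.copy(), pretrain_cost
for _ in range(200):
    cand = best_w + rng.normal(0, 0.05, size=3)
    c = total_cost(cand)
    if c < best_cost:                 # accept only cost improvements
        best_w, best_cost = cand, c
```

With underage cost twice the overage cost, the fine-tuned linear policy should track the 2/3-quantile base-stock levels; swapping the hill-climbing step for a genuine policy-gradient update would be the natural next refinement.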
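The field deployment is evaluated with a Difference-in-Differences (DiD) design. The estimator itself is one line; the numbers below are invented for illustration and bear no relation to the paper's reported results.

```python
# Toy panel: average inventory turnover days for treated (new policy) and
# control SKUs, before and after rollout. All numbers are illustrative.
pre_treated, post_treated = 10.0, 7.0
pre_control, post_control = 10.5, 9.8

# DiD subtracts the control group's trend from the treated group's change,
# isolating the policy effect under the parallel-trends assumption.
did = (post_treated - pre_treated) - (post_control - pre_control)
```

A negative estimate corresponds to a reduction in turnover days, the direction of the improvement reported in the paper.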
Related papers
- Native Reasoning Models: Training Language Models to Reason on Unverifiable Data [16.065264121785294]
We introduce NRT (Native Reasoning Training), a novel framework that cultivates complex reasoning. NRT reframes the training problem by treating the reasoning process as a latent variable. NRT achieves state-of-the-art performance among verifier-free methods.
arXiv Detail & Related papers (2026-02-12T04:15:46Z)
- LLM-Inspired Pretrain-Then-Finetune for Small-Data, Large-Scale Optimization [7.8639568562295965]
We consider small-data, large-scale decision problems in which a firm must make many operational decisions simultaneously. We propose a pretrain-then-finetune approach built on a purpose-designed Transformer model to address this challenge.
arXiv Detail & Related papers (2026-02-03T16:08:33Z)
- Optimizing Generative Ranking Relevance via Reinforcement Learning in Xiaohongshu Search [32.56725829132154]
We investigate whether explicit reasoning can enhance both interpretability and performance in relevance modeling. In this work, we formulate relevance modeling in Xiaohongshu search as a reasoning task. We introduce a Reinforcement Learning (RL)-based training framework to enhance the grounded reasoning capabilities of GRMs.
arXiv Detail & Related papers (2025-11-30T16:31:16Z)
- Demystifying Reinforcement Learning in Agentic Reasoning [90.3737088727791]
We conduct a comprehensive and systematic investigation to demystify reinforcement learning in agentic reasoning. We highlight our key insights: (i) replacing stitched synthetic trajectories with real end-to-end tool-use trajectories yields far stronger SFT; (ii) exploration-friendly techniques such as clip-higher, overlong reward shaping, and maintaining adequate policy entropy are crucial for agentic RL and improve training efficiency.
arXiv Detail & Related papers (2025-10-13T17:57:15Z)
- TaoSR-AGRL: Adaptive Guided Reinforcement Learning Framework for E-commerce Search Relevance [10.092283121886679]
TaoSR-AGRL is an Adaptive Guided Reinforcement Learning framework for relevance prediction in Taobao Search. It decomposes the final relevance judgment into dense, structured rewards aligned with domain-specific relevance criteria. It consistently outperforms DPO and standard GRPO baselines in offline experiments, improving relevance accuracy, rule adherence, and training stability.
arXiv Detail & Related papers (2025-10-09T10:34:39Z)
- Evolutionary Reinforcement Learning for Interpretable Decision-Making in Supply Chain Management [3.195234044113248]
Supply Chain Management (SCM) faces challenges in adopting advanced optimization techniques due to the "black-box" nature of most AI-based solutions. We employ an Interpretable Artificial Intelligence (IAI) approach that combines evolutionary computation with Reinforcement Learning (RL) to generate interpretable decision-making policies. This IAI solution is embedded within a simulation-based optimization framework specifically designed to handle the inherent uncertainties and behaviors of modern supply chains.
arXiv Detail & Related papers (2025-04-16T12:28:35Z)
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute (TTC) scaling framework that leverages increased inference-time computation instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
- DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal [55.13854171147104]
Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development. We present Dynamic Action Re-Sampling (DARS), a novel inference-time compute scaling approach for coding agents. We evaluate our approach on the SWE-bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2.
arXiv Detail & Related papers (2025-03-18T14:02:59Z)
- A Survey of Direct Preference Optimization [103.59317151002693]
Large Language Models (LLMs) have demonstrated unprecedented generative capabilities. Their alignment with human values remains critical for ensuring helpful and harmless deployments. Direct Preference Optimization (DPO) has recently gained prominence as a streamlined alternative to reinforcement learning from human feedback.
arXiv Detail & Related papers (2025-03-12T08:45:15Z)
- On the Modeling Capabilities of Large Language Models for Sequential Decision Making [52.128546842746246]
Large pretrained models are showing increasingly better performance in reasoning and planning tasks.
We evaluate their ability to produce decision-making policies, either directly, by generating actions, or indirectly.
In environments with unfamiliar dynamics, we explore how fine-tuning LLMs with synthetic data can significantly improve their reward modeling capabilities.
arXiv Detail & Related papers (2024-10-08T03:12:57Z)
- Mitigating Distribution Shift in Model-based Offline RL via Shifts-aware Reward Learning [36.01269673940484]
This paper offers a comprehensive analysis that disentangles the problem into two fundamental components: model bias and policy shift. Our theoretical and empirical investigations reveal how these factors distort value estimation and policy optimization. We derive a novel shifts-aware reward through a unified probabilistic inference framework, which modifies the vanilla reward to refine value learning and facilitate policy training.
arXiv Detail & Related papers (2024-08-23T04:25:09Z)
- When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive settings such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.