Deep Policy Iteration with Integer Programming for Inventory Management
- URL: http://arxiv.org/abs/2112.02215v3
- Date: Tue, 07 Jan 2025 20:32:52 GMT
- Title: Deep Policy Iteration with Integer Programming for Inventory Management
- Authors: Pavithra Harsha, Ashish Jagmohan, Jayant Kalagnanam, Brian Quanz, Divya Singhvi
- Abstract summary: We present a framework for optimizing long-term discounted reward problems with large combinatorial action space and state dependent constraints. Our proposed Programmable Actor Reinforcement Learning (PARL) uses a deep-policy iteration method that leverages neural networks (NNs) to approximate the value function. We benchmark the proposed algorithm against state-of-the-art RL algorithms and commonly used replenishment heuristics and find it considerably outperforms existing methods by as much as 14.7% on average.
- Score: 8.27175065641495
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a Reinforcement Learning (RL) based framework for optimizing long-term discounted reward problems with large combinatorial action space and state dependent constraints. These characteristics are common to many operations management problems, e.g., network inventory replenishment, where managers have to deal with uncertain demand, lost sales, and capacity constraints that result in more complex feasible action spaces. Our proposed Programmable Actor Reinforcement Learning (PARL) uses a deep-policy iteration method that leverages neural networks (NNs) to approximate the value function and combines it with mathematical programming (MP) and sample average approximation (SAA) to solve the per-step-action optimally while accounting for combinatorial action spaces and state-dependent constraint sets. We show how the proposed methodology can be applied to complex inventory replenishment problems where analytical solutions are intractable. We also benchmark the proposed algorithm against state-of-the-art RL algorithms and commonly used replenishment heuristics and find it considerably outperforms existing methods by as much as 14.7% on average in various complex supply chain settings. We find that this improvement of PARL over benchmark algorithms can be directly attributed to better inventory cost management, especially in inventory constrained settings. Furthermore, in the simpler setting where optimal replenishment policy is tractable or known near optimal heuristics exist, we find that the RL approaches can learn near optimal policies. Finally, to make RL algorithms more accessible for inventory management researchers, we also discuss the development of a modular Python library that can be used to test the performance of RL algorithms with various supply chain structures and spur future research in developing practical and near-optimal algorithms for inventory management problems.
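The per-step action selection described in the abstract (a learned value function combined with sample average approximation over demand, subject to a state-dependent constraint) can be illustrated with a toy single-product sketch. This is not the authors' implementation: brute-force enumeration of integer order quantities stands in for the mathematical program, `value_fn` is a hypothetical stand-in for a trained neural network, and the function name, price, cost, and discount values are all assumptions chosen for illustration.

```python
def saa_greedy_order(inventory, value_fn, demand_samples, capacity,
                     price=5.0, order_cost=2.0, holding_cost=0.5, discount=0.95):
    """Return (order, score): the integer order quantity that maximizes the
    sample-average one-step reward plus the discounted value of the next state."""
    # State-dependent constraint: on-hand stock after ordering cannot exceed capacity.
    max_order = max(capacity - inventory, 0)
    best_order, best_score = 0, float("-inf")
    # Enumeration stands in for the integer program (assumption, toy scale only).
    for order in range(max_order + 1):
        total = 0.0
        for demand in demand_samples:
            on_hand = inventory + order
            sales = min(on_hand, demand)       # lost sales: unmet demand is dropped
            next_inventory = on_hand - sales
            reward = (price * sales
                      - order_cost * order
                      - holding_cost * next_inventory)
            total += reward + discount * value_fn(next_inventory)
        score = total / len(demand_samples)    # sample average approximation
        if score > best_score:
            best_order, best_score = order, score
    return best_order, best_score


if __name__ == "__main__":
    # Hypothetical value function that assigns no value to leftover stock.
    zero_value = lambda next_inventory: 0.0
    print(saa_greedy_order(0, zero_value, [3, 3, 3], capacity=10))
```

With real problems the action is a vector over many nodes and products, which is why the abstract relies on mathematical programming rather than enumeration; the sketch only shows how the sampled demands, the constraint, and the value estimate enter a single objective.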
Related papers
- Collab: Controlled Decoding using Mixture of Agents for LLM Alignment [90.6117569025754]
Reinforcement learning from human feedback has emerged as an effective technique to align large language models.
Controlled Decoding provides a mechanism for aligning a model at inference time without retraining.
We propose a mixture of agent-based decoding strategies leveraging the existing off-the-shelf aligned LLM policies.
arXiv Detail & Related papers (2025-03-27T17:34:25Z)
- Memory-Enhanced Neural Solvers for Efficient Adaptation in Combinatorial Optimization [6.713974813995327]
We present MEMENTO, an approach that leverages memory to improve the adaptation of neural solvers at inference time.
We successfully train all RL auto-regressive solvers on large instances, and show that MEMENTO can scale and is data-efficient.
Overall, MEMENTO pushes the state of the art on 11 out of 12 evaluated tasks.
arXiv Detail & Related papers (2024-06-24T08:18:19Z)
- Joint Demonstration and Preference Learning Improves Policy Alignment with Human Feedback [58.049113055986375]
We develop a single stage approach named Alignment with Integrated Human Feedback (AIHF) to train reward models and the policy.
The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms.
We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z)
- Stochastic Q-learning for Large Discrete Action Spaces [79.1700188160944]
In complex environments with discrete action spaces, effective decision-making is critical in reinforcement learning (RL).
We present value-based RL approaches which, as opposed to optimizing over the entire set of $n$ actions, only consider a variable set of actions, possibly as small as $\mathcal{O}(\log(n))$.
The presented value-based RL methods include, among others, Q-learning, StochDQN, StochDDQN, all of which integrate this approach for both value-function updates and action selection.
arXiv Detail & Related papers (2024-05-16T17:58:44Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.
In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.
We find that REBEL provides a unified approach to language modeling and image generation, with performance stronger than or similar to PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
- Combinatorial Optimization with Policy Adaptation using Latent Space Search [44.12073954093942]
We present a novel approach for designing performant algorithms to solve complex, typically NP-hard, problems.
We show that our search strategy outperforms state-of-the-art approaches on 11 standard benchmarking tasks.
arXiv Detail & Related papers (2023-11-13T12:24:54Z)
- Stepsize Learning for Policy Gradient Methods in Contextual Markov Decision Processes [35.889129338603446]
Policy-based algorithms are among the most widely adopted techniques in model-free RL.
They tend to struggle when asked to accomplish a series of heterogeneous tasks.
We introduce a new formulation, known as meta-MDP, that can be used to solve any hyperparameter selection problem in RL.
arXiv Detail & Related papers (2023-06-13T12:58:12Z)
- Addressing the issue of stochastic environments and local decision-making in multi-objective reinforcement learning [0.0]
Multi-objective reinforcement learning (MORL) is a relatively new field which builds on conventional Reinforcement Learning (RL).
This thesis focuses on what factors influence the frequency with which value-based MORL Q-learning algorithms learn the optimal policy for an environment.
arXiv Detail & Related papers (2022-11-16T04:56:42Z)
- CACTO: Continuous Actor-Critic with Trajectory Optimization -- Towards global optimality [5.0915256711576475]
This paper presents a novel algorithm for the continuous control of dynamical systems that combines Trajectory Optimization (TO) and Reinforcement Learning (RL) in a single framework.
arXiv Detail & Related papers (2022-11-12T10:16:35Z)
- Policy Information Capacity: Information-Theoretic Measure for Task Complexity in Deep Reinforcement Learning [83.66080019570461]
We propose two environment-agnostic, algorithm-agnostic quantitative metrics for task difficulty.
We show that these metrics have higher correlations with normalized task solvability scores than a variety of alternatives.
These metrics can also be used for fast and compute-efficient optimizations of key design parameters.
arXiv Detail & Related papers (2021-03-23T17:49:50Z)
- A Two-stage Framework and Reinforcement Learning-based Optimization Algorithms for Complex Scheduling Problems [54.61091936472494]
We develop a two-stage framework, in which reinforcement learning (RL) and traditional operations research (OR) algorithms are combined together.
The scheduling problem is solved in two stages, including a finite Markov decision process (MDP) and a mixed-integer programming process, respectively.
Results show that the proposed algorithms could stably and efficiently obtain satisfactory scheduling schemes for agile Earth observation satellite scheduling problems.
arXiv Detail & Related papers (2021-03-10T03:16:12Z)
- Deep Controlled Learning for Inventory Control [0.0]
Deep Controlled Learning (DCL) is a new DRL framework based on approximate policy iteration, specifically designed to tackle inventory problems.
DCL outperforms existing state-of-the-art heuristics in lost sales inventory control, perishable inventory systems, and inventory systems with random lead times.
These substantial performance and robustness improvements pave the way for the effective application of tailored DRL algorithms to inventory management problems.
arXiv Detail & Related papers (2020-11-30T18:53:08Z)
- Combining Deep Learning and Optimization for Security-Constrained Optimal Power Flow [94.24763814458686]
Security-constrained optimal power flow (SCOPF) is fundamental in power systems.
Modeling of automatic primary response (APR) within the SCOPF problem results in complex large-scale mixed-integer programs.
This paper proposes a novel approach that combines deep learning and robust optimization techniques.
arXiv Detail & Related papers (2020-07-14T12:38:21Z)
- SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning [102.78958681141577]
We present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy deep reinforcement learning algorithms.
SUNRISE integrates two key ingredients: (a) ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble, and (b) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration.
arXiv Detail & Related papers (2020-07-09T17:08:44Z)
- A Survey of Reinforcement Learning Algorithms for Dynamically Varying Environments [1.713291434132985]
Reinforcement learning (RL) algorithms find applications in inventory control, recommender systems, vehicular traffic management, cloud computing and robotics.
Real-world complications of many tasks arising in these domains make them difficult to solve with the basic assumptions underlying classical RL algorithms.
This paper provides a survey of RL methods developed for handling dynamically varying environment models.
A representative collection of these algorithms is discussed in detail in this work along with their categorization and their relative merits and demerits.
arXiv Detail & Related papers (2020-05-19T09:42:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.