Adjustable Robust Reinforcement Learning for Online 3D Bin Packing
- URL: http://arxiv.org/abs/2310.04323v1
- Date: Fri, 6 Oct 2023 15:34:21 GMT
- Title: Adjustable Robust Reinforcement Learning for Online 3D Bin Packing
- Authors: Yuxin Pan, Yize Chen, Fangzhen Lin
- Abstract summary: Current deep reinforcement learning methods for online 3D-BPP often fail in real-world settings where worst-case scenarios can materialize.
We propose an adjustable robust reinforcement learning (AR2L) framework that allows efficient adjustment of robustness weights.
Experiments demonstrate that AR2L is versatile in the sense that it improves policy robustness while maintaining an acceptable level of performance for the nominal case.
- Score: 11.157035538606968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Designing effective policies for the online 3D bin packing problem (3D-BPP)
has been a long-standing challenge, primarily due to the unpredictable nature
of incoming box sequences and stringent physical constraints. While current
deep reinforcement learning (DRL) methods for online 3D-BPP have shown
promising results in optimizing average performance over an underlying box
sequence distribution, they often fail in real-world settings where some
worst-case scenarios can materialize. Standard robust DRL algorithms tend to
overly prioritize optimizing the worst-case performance at the expense of
performance under the normal problem instance distribution. To address these
issues, we first introduce a permutation-based attacker to investigate the
practical robustness of both DRL-based and heuristic methods proposed for
solving online 3D-BPP. Then, we propose an adjustable robust reinforcement
learning (AR2L) framework that allows efficient adjustment of robustness
weights to achieve the desired balance of the policy's performance in average
and worst-case environments. Specifically, we formulate the objective function
as a weighted sum of expected and worst-case returns, and derive the lower
performance bound by relating to the return under a mixture dynamics. To
realize this lower bound, we adopt an iterative procedure that searches for the
associated mixture dynamics and improves the corresponding policy. We integrate
this procedure into two popular robust adversarial algorithms to develop the
exact and approximate AR2L algorithms. Experiments demonstrate that AR2L is
versatile in the sense that it improves policy robustness while maintaining an
acceptable level of performance for the nominal case.
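To make the weighted objective concrete, below is a minimal Python sketch of the idea: an adjustable weight trades off the expected return under the nominal box-sequence distribution against the return under a permutation attack. The helper names, the weight `alpha`, and the random-search attacker are illustrative assumptions, not the paper's method; AR2L instead derives a lower performance bound via a mixture dynamics and optimizes it with exact and approximate iterative algorithms.

```python
# Illustrative sketch only (not the authors' code): the adjustable robust
# objective from the abstract, written as a weighted sum of the expected
# return under the nominal box-sequence distribution and the return under
# an adversarially permuted (worst-case) sequence. Helper names, `alpha`,
# and the random-search attacker are assumptions for illustration.
import random
from typing import Callable, List, Tuple

Box = Tuple[float, float, float]        # (length, width, height) of an incoming box
Policy = Callable[[List[Box]], float]   # maps a box sequence to an episode return (e.g., space utilization)


def nominal_return(policy: Policy, sample_sequence: Callable[[], List[Box]], episodes: int = 32) -> float:
    """Average return over box sequences drawn from the nominal distribution."""
    return sum(policy(sample_sequence()) for _ in range(episodes)) / episodes


def worst_case_return(policy: Policy, sequence: List[Box], attack_trials: int = 64) -> float:
    """Crude stand-in for the permutation-based attacker: search random
    reorderings of one box sequence and keep the lowest return found."""
    worst = policy(sequence)
    for _ in range(attack_trials):
        permuted = random.sample(sequence, len(sequence))  # random reordering of the boxes
        worst = min(worst, policy(permuted))
    return worst


def adjustable_objective(policy: Policy,
                         sample_sequence: Callable[[], List[Box]],
                         alpha: float) -> float:
    """Weighted sum of average-case and worst-case returns.

    alpha = 1.0 recovers the standard nominal objective;
    alpha = 0.0 recovers a purely robust (worst-case) objective.
    """
    average = nominal_return(policy, sample_sequence)
    worst = worst_case_return(policy, sample_sequence())
    return alpha * average + (1.0 - alpha) * worst
```

Sweeping `alpha` between 0 and 1 corresponds to adjusting the robustness weight described above, which is the balance the AR2L framework is designed to expose.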
Related papers
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.
In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.
We find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
- PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback [106.63518036538163]
We present a novel unified bilevel optimization-based framework, PARL, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning.
Our framework addresses these concerns by explicitly parameterizing the distribution of the upper alignment objective (reward design) by the lower optimal variable.
Our empirical results substantiate that the proposed PARL can address the alignment concerns in RL by showing significant improvements.
arXiv Detail & Related papers (2023-08-03T18:03:44Z)
- Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
- Online Policy Optimization for Robust MDP [17.995448897675068]
Reinforcement learning (RL) has exceeded human performance in many synthetic settings such as video games and Go.
In this work, we consider online robust Markov decision process (MDP) by interacting with an unknown nominal system.
We propose a robust optimistic policy optimization algorithm that is provably efficient.
arXiv Detail & Related papers (2022-09-28T05:18:20Z)
- Online 3D Bin Packing Reinforcement Learning Solution with Buffer [1.8060107352742993]
We present a new reinforcement learning framework for solving the online 3D-BPP with improved performance.
We implement a model-based RL method adapted from the popular AlphaGo algorithm.
Our adaptation is capable of working in single-player, score-based environments.
arXiv Detail & Related papers (2022-08-15T11:28:20Z)
- Robust Reinforcement Learning using Offline Data [23.260211453437055]
We propose a robust reinforcement learning algorithm called Robust Fitted Q-Iteration (RFQI).
RFQI uses only an offline dataset to learn the optimal robust policy.
We prove that RFQI learns a near-optimal robust policy under standard assumptions.
arXiv Detail & Related papers (2022-08-10T03:47:45Z)
- OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy.
We show that OptiDICE performs competitively with the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-21T00:43:30Z)
- Queueing Network Controls via Deep Reinforcement Learning [0.0]
We develop a Proximal Policy Optimization (PPO) algorithm for queueing networks.
The algorithm consistently generates control policies that outperform the state of the art in the literature.
A key to the successes of our PPO algorithm is the use of three variance reduction techniques in estimating the relative value function.
arXiv Detail & Related papers (2020-07-31T01:02:57Z)
- Guided Constrained Policy Optimization for Dynamic Quadrupedal Robot Locomotion [78.46388769788405]
We introduce guided constrained policy optimization (GCPO), an RL framework based upon our implementation of constrained proximal policy optimization (CPPO).
We show that guided constrained RL offers faster convergence close to the desired optimum, resulting in optimal yet physically feasible robotic control behavior without the need for precise reward function tuning.
arXiv Detail & Related papers (2020-02-22T10:15:53Z)