Adjustable Robust Reinforcement Learning for Online 3D Bin Packing
- URL: http://arxiv.org/abs/2310.04323v1
- Date: Fri, 6 Oct 2023 15:34:21 GMT
- Title: Adjustable Robust Reinforcement Learning for Online 3D Bin Packing
- Authors: Yuxin Pan, Yize Chen, Fangzhen Lin
- Abstract summary: Current deep reinforcement learning methods for online 3D-BPP often fail in real-world settings where some worst-case scenarios can materialize.
We propose an adjustable robust reinforcement learning (AR2L) framework that allows efficient adjustment of robustness weights.
Experiments demonstrate that AR2L is versatile in the sense that it improves policy robustness while maintaining an acceptable level of performance for the nominal case.
- Score: 11.157035538606968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Designing effective policies for the online 3D bin packing problem (3D-BPP)
has been a long-standing challenge, primarily due to the unpredictable nature
of incoming box sequences and stringent physical constraints. While current
deep reinforcement learning (DRL) methods for online 3D-BPP have shown
promising results in optimizing average performance over an underlying box
sequence distribution, they often fail in real-world settings where some
worst-case scenarios can materialize. Standard robust DRL algorithms tend to
overly prioritize optimizing the worst-case performance at the expense of
performance under the normal problem instance distribution. To address these
issues, we first introduce a permutation-based attacker to investigate the
practical robustness of both DRL-based and heuristic methods proposed for
solving online 3D-BPP. Then, we propose an adjustable robust reinforcement
learning (AR2L) framework that allows efficient adjustment of robustness
weights to achieve the desired balance of the policy's performance in average
and worst-case environments. Specifically, we formulate the objective function
as a weighted sum of expected and worst-case returns, and derive the lower
performance bound by relating to the return under a mixture dynamics. To
realize this lower bound, we adopt an iterative procedure that searches for the
associated mixture dynamics and improves the corresponding policy. We integrate
this procedure into two popular robust adversarial algorithms to develop the
exact and approximate AR2L algorithms. Experiments demonstrate that AR2L is
versatile in the sense that it improves policy robustness while maintaining an
acceptable level of performance for the nominal case.
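As a reading aid, the weighted-sum objective described in the abstract might be sketched as follows; the notation is assumed for illustration and is not taken from the paper:

```latex
% Sketch of the weighted-sum objective (assumed notation): \pi is the
% packing policy, P_0 the nominal box-sequence dynamics, \mathcal{P} the
% set of attacked (permuted) dynamics, R the packing return, and
% \alpha \in [0,1] the adjustable robustness weight trading off average
% and worst-case performance.
J_{\alpha}(\pi) \;=\; \alpha\, \mathbb{E}_{P_0}\!\big[R(\pi)\big]
\;+\; (1-\alpha) \min_{P \in \mathcal{P}} \mathbb{E}_{P}\!\big[R(\pi)\big]
```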
Related papers
- VAR RL Done Right: Tackling Asynchronous Policy Conflicts in Visual Autoregressive Generation [31.201343197395573]
Visual generation is dominated by three paradigms: AutoRegressive (AR), diffusion, and Visual AutoRegressive (VAR) models. Unlike AR and diffusion, VARs operate on heterogeneous input structures across their generation steps, which creates severe asynchronous policy conflicts. We propose a novel framework to enhance Group Relative Policy Optimization (GRPO) by explicitly managing these conflicts.
arXiv Detail & Related papers (2026-01-05T16:36:40Z) - Deep Reinforcement Learning for Dynamic Algorithm Configuration: A Case Study on Optimizing OneMax with the (1+($λ$,$λ$))-GA [3.5485296570255183]
We conduct a systematic analysis of controlling the population size parameter of the (1+($λ$,$λ$))-GA on OneMax instances. Our investigation of DDQN and PPO reveals two fundamental challenges that limit their effectiveness in DAC. We introduce an adaptive reward shifting mechanism that leverages reward distribution statistics to enhance DDQN agent exploration.
arXiv Detail & Related papers (2025-12-03T13:54:41Z) - BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping [69.74252624161652]
We propose BAlanced Policy Optimization with Adaptive Clipping (BAPO). BAPO dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B.
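The clipping idea can be illustrated with a PPO-style surrogate that decouples the upper and lower clipping bounds; this is a generic sketch of asymmetric clipping only, with fixed bounds and hypothetical names, and does not reproduce BAPO's dynamic bound-update rule.

```python
import torch

def asymmetric_clip_loss(logp_new, logp_old, adv, clip_upper=0.28, clip_lower=0.2):
    """PPO-style surrogate with decoupled upper/lower clipping bounds.

    A sketch of the general idea: the upper bound caps updates on
    positive-advantage samples, the lower bound on negative ones.
    BAPO (per the summary) adjusts these bounds dynamically to
    re-balance positive and negative contributions; that update rule
    is not reproduced here, and all defaults are hypothetical.
    """
    ratio = torch.exp(logp_new - logp_old)
    clipped_ratio = torch.clamp(ratio, 1.0 - clip_lower, 1.0 + clip_upper)
    # Pessimistic (min) surrogate, as in standard PPO.
    surrogate = torch.minimum(ratio * adv, clipped_ratio * adv)
    return -surrogate.mean()
```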
arXiv Detail & Related papers (2025-10-21T12:55:04Z) - ADARL: Adaptive Low-Rank Structures for Robust Policy Learning under Uncertainty [28.291179179647795]
We propose Adaptive Rank Representation (AdaRL), a bi-level optimization framework that improves robustness. At the lower level, AdaRL performs policy optimization under fixed-rank constraints with dynamics sampled from a Wasserstein ball around a centroid model. At the upper level, it adaptively adjusts the rank to balance the bias-variance trade-off, projecting policy parameters onto a low-rank manifold.
arXiv Detail & Related papers (2025-10-13T20:05:34Z) - Rectified Robust Policy Optimization for Model-Uncertain Constrained Reinforcement Learning without Strong Duality [53.525547349715595]
We propose a novel primal-only algorithm called Rectified Robust Policy Optimization (RRPO). RRPO operates directly on the primal problem without relying on dual formulations. We show convergence to an approximately optimal feasible policy with complexity matching the best-known lower bound.
arXiv Detail & Related papers (2025-08-24T16:59:38Z) - Preference Optimization for Combinatorial Optimization Problems [54.87466279363487]
Reinforcement Learning (RL) has emerged as a powerful tool for neural combinatorial optimization, enabling models to learn to solve complex problems without requiring expert knowledge. Despite significant progress, existing RL approaches face challenges such as diminishing reward signals and inefficient exploration in vast action spaces. We propose Preference Optimization, a novel method that transforms quantitative reward signals into qualitative preference signals via statistical comparison modeling.
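In spirit, converting scalar rewards into preference signals can look like the sketch below; the sampling scheme and all names are hypothetical, and the paper's statistical comparison model is not reproduced.

```python
import random

def rewards_to_preferences(solutions, reward_fn, num_pairs=64, seed=0):
    """Turn scalar rewards into pairwise preference labels.

    A minimal sketch of the general idea only: sample pairs of
    candidate solutions and keep the higher-reward one as preferred.
    """
    rng = random.Random(seed)
    scored = [(s, reward_fn(s)) for s in solutions]
    pairs = []
    for _ in range(num_pairs):
        (a, ra), (b, rb) = rng.sample(scored, 2)
        if ra != rb:                      # skip ties: no preference signal
            pairs.append((a, b) if ra > rb else (b, a))  # (winner, loser)
    return pairs
```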
arXiv Detail & Related papers (2025-05-13T16:47:00Z) - Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model [84.00480999255628]
Reinforcement Learning algorithms for safety alignment of Large Language Models (LLMs) encounter the challenge of distribution shift.
Current approaches typically address this issue through online sampling from the target policy.
We propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals.
arXiv Detail & Related papers (2025-03-13T06:40:34Z) - Efficient Online Reinforcement Learning for Diffusion Policy [38.39095131927252]
We generalize the conventional denoising score matching by reweighting the loss function. The resulting Reweighted Score Matching (RSM) preserves the optimal solution and low computational cost. We introduce two practical algorithms named Diffusion Policy Mirror Descent (DPMD) and Soft Diffusion Actor-Critic (SDAC).
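For context, conventional denoising score matching already admits a time-dependent weight; reweighting amounts to choosing that weight differently while keeping the same minimizer. A sketch with assumed notation:

```latex
% Denoising score matching with an explicit time-dependent weight
% \lambda(t) (assumed notation): s_\theta is the score network and
% p_t(x_t \mid x_0) the forward noising kernel.
\mathcal{L}(\theta) \;=\; \mathbb{E}_{t,\, x_0,\, x_t \mid x_0}
\Big[\lambda(t)\,\big\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t \mid x_0)\big\|^2\Big]
```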
arXiv Detail & Related papers (2025-02-01T07:55:06Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
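The combined objective suggested by this summary might be sketched as a preference loss plus a weighted SFT term; the exact form and the weight $\beta$ are assumptions, not the paper's notation:

```latex
% Sketch of the combined objective (assumed notation):
% \mathcal{L}_{\text{pref}} is a preference optimization loss,
% \mathcal{L}_{\text{SFT}} a maximum-likelihood loss on preferred
% responses y^{+}, and \beta an assumed trade-off weight.
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{pref}}(\theta)
\;+\; \beta\, \mathcal{L}_{\text{SFT}}(\theta),
\qquad
\mathcal{L}_{\text{SFT}}(\theta) \;=\; -\,\mathbb{E}_{(x,\,y^{+})}\!\big[\log \pi_\theta(y^{+}\mid x)\big]
```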
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.
In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.
We find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z) - PARL: A Unified Framework for Policy Alignment in Reinforcement Learning from Human Feedback [106.63518036538163]
We present a novel unified bilevel optimization-based framework, PARL, formulated to address the recently highlighted critical issue of policy alignment in reinforcement learning.
Our framework addresses these concerns by explicitly parameterizing the distribution of the upper alignment objective (reward design) by the lower optimal variable.
Our empirical results substantiate that the proposed PARL can address the alignment concerns in RL by showing significant improvements.
arXiv Detail & Related papers (2023-08-03T18:03:44Z) - Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
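In outline, a variance-regularized offline objective of this kind can be sketched as below; the notation is assumed for illustration, and the Fenchel-dual treatment of the variance term is not reproduced:

```latex
% A generic variance-regularized offline objective (assumed notation):
% d^{\pi} is the stationary state-action distribution of \pi, estimated
% offline via distribution corrections, and \lambda > 0 the penalty weight.
\max_{\pi} \;\; \mathbb{E}_{(s,a) \sim d^{\pi}}\!\big[r(s,a)\big]
\;-\; \lambda\, \mathrm{Var}_{(s,a) \sim d^{\pi}}\!\big[r(s,a)\big]
```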
arXiv Detail & Related papers (2022-12-29T18:25:01Z) - Online Policy Optimization for Robust MDP [17.995448897675068]
Reinforcement learning (RL) has exceeded human performance in many synthetic settings such as video games and Go.
In this work, we consider online robust Markov decision process (MDP) by interacting with an unknown nominal system.
We propose a robust optimistic policy optimization algorithm that is provably efficient.
arXiv Detail & Related papers (2022-09-28T05:18:20Z) - Online 3D Bin Packing Reinforcement Learning Solution with Buffer [1.8060107352742993]
We present a new reinforcement learning framework for solving the online 3D-BPP with improved performance.
We implement a model-based RL method adapted from the popular algorithm AlphaGo.
Our adaptation is capable of working in single-player and score-based environments.
arXiv Detail & Related papers (2022-08-15T11:28:20Z) - Robust Reinforcement Learning using Offline Data [23.260211453437055]
We propose a robust reinforcement learning algorithm called Robust Fitted Q-Iteration (RFQI).
RFQI uses only an offline dataset to learn the optimal robust policy.
We prove that RFQI learns a near-optimal robust policy under standard assumptions.
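The robust Bellman update that robust fitted Q-iteration methods build on can be sketched as follows, with assumed notation:

```latex
% Robust Bellman update (assumed notation): \mathcal{P}(s,a) is an
% uncertainty set of transition models around the nominal kernel,
% \gamma the discount factor.
Q_{k+1}(s,a) \;\leftarrow\; r(s,a) \;+\; \gamma
\inf_{P \in \mathcal{P}(s,a)} \mathbb{E}_{s' \sim P}\!\Big[\max_{a'} Q_k(s',a')\Big]
```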
arXiv Detail & Related papers (2022-08-10T03:47:45Z) - OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy.
We show that OptiDICE performs competitively with the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-21T00:43:30Z) - Queueing Network Controls via Deep Reinforcement Learning [0.0]
We develop a Proximal Policy Optimization (PPO) algorithm for queueing networks.
The algorithm consistently generates control policies that outperform the state of the art in the literature.
A key to the successes of our PPO algorithm is the use of three variance reduction techniques in estimating the relative value function.
arXiv Detail & Related papers (2020-07-31T01:02:57Z) - Guided Constrained Policy Optimization for Dynamic Quadrupedal Robot Locomotion [78.46388769788405]
We introduce guided constrained policy optimization (GCPO), an RL framework based upon our implementation of constrained proximal policy optimization (CPPO).
We show that guided constrained RL offers faster convergence close to the desired optimum, resulting in an optimal yet physically feasible robotic control behavior without the need for precise reward function tuning.
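For reference, CPPO-style methods optimize the generic constrained-RL program below; this is a textbook sketch with assumed notation, not the paper's formulation:

```latex
% Generic constrained-RL program (assumed notation): C_i are per-step
% constraint costs with budgets d_i.
\max_{\pi} \;\; \mathbb{E}_{\pi}\!\Big[\sum_{t} \gamma^{t} r(s_t, a_t)\Big]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\!\Big[\sum_{t} \gamma^{t} C_i(s_t, a_t)\Big] \le d_i,
\quad i = 1, \dots, m
```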
arXiv Detail & Related papers (2020-02-22T10:15:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.