Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales
- URL: http://arxiv.org/abs/2405.17618v2
- Date: Wed, 29 May 2024 04:19:00 GMT
- Title: Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales
- Authors: Ju-Seung Byun, Andrew Perrault
- Abstract summary: Reinforcement learning (RL) training is inherently unstable due to factors such as moving targets and high gradient variance.
In this work, we improve the stability of RL training by adapting the reverse cross entropy (RCE) from supervised learning for noisy data to define a symmetric RL loss.
- Score: 13.818149654692863
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning (RL) training is inherently unstable due to factors such as moving targets and high gradient variance. Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) can introduce additional difficulty. Differing preferences can complicate the alignment process, and prediction errors in a trained reward model can become more severe as the LLM generates unseen outputs. To enhance training robustness, RL has adopted techniques from supervised learning, such as ensembles and layer normalization. In this work, we improve the stability of RL training by adapting the reverse cross entropy (RCE) from supervised learning for noisy data to define a symmetric RL loss. We demonstrate performance improvements across various tasks and scales. We conduct experiments in discrete action tasks (Atari games) and continuous action space tasks (MuJoCo benchmark and Box2D) using Symmetric A2C (SA2C) and Symmetric PPO (SPPO), with and without added noise, with especially notable performance for SPPO across different hyperparameters. Furthermore, we validate the benefits of the symmetric RL loss when using SPPO for large language models through improved performance in RLHF tasks, such as IMDB positive sentiment and TL;DR summarization tasks.
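For readers unfamiliar with the reverse cross entropy mentioned in the abstract, the following is a minimal sketch of its supervised-learning form, as used for classification with noisy labels. It is not the paper's SA2C or SPPO loss, whose exact definition is not given here. The function names, the weights `alpha` and `beta`, and the constant `log_zero` (a finite stand-in for log 0) are illustrative assumptions.

```python
import math


def cross_entropy(probs, label):
    """Standard cross entropy against a one-hot label: -log p[label]."""
    return -math.log(probs[label])


def reverse_cross_entropy(probs, label, log_zero=-4.0):
    """Reverse cross entropy: -sum_k p[k] * log q[k], with q the one-hot label.

    log q[k] is log(1) = 0 at the label and log(0) elsewhere; log(0) is
    replaced by the finite constant `log_zero` (an assumed value).
    """
    return -sum(
        p * (0.0 if k == label else log_zero) for k, p in enumerate(probs)
    )


def symmetric_cross_entropy(probs, label, alpha=1.0, beta=1.0, log_zero=-4.0):
    """Weighted sum of the standard and reverse terms (weights are assumptions)."""
    return (alpha * cross_entropy(probs, label)
            + beta * reverse_cross_entropy(probs, label, log_zero))


# Hypothetical prediction over three classes, with class 0 as the given label.
probs = [0.7, 0.2, 0.1]
ce = cross_entropy(probs, 0)
rce = reverse_cross_entropy(probs, 0)
sce = symmetric_cross_entropy(probs, 0)
```

With a one-hot label, the reverse term reduces to `-log_zero * (1 - probs[label])`, so it stays bounded even when the prediction puts almost no mass on the given label. That boundedness is what makes the reverse term tolerant of mislabeled examples in the supervised setting.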
Related papers
- Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach [0.9549646359252346]
We propose dynamic Learning Rate for deep Reinforcement Learning (LRRL).
LRRL is a meta-learning approach that selects the learning rate based on the agent's performance during training.
Our empirical results demonstrate that LRRL can substantially improve the performance of deep RL algorithms.
arXiv Detail & Related papers (2024-10-16T14:15:28Z)
- Avoiding mode collapse in diffusion models fine-tuned with reinforcement learning [0.0]
Fine-tuning foundation models via reinforcement learning (RL) has proven promising for aligning to downstream objectives.
We exploit the hierarchical nature of diffusion models (DMs) and train them dynamically at each epoch with a tailored RL method.
We show that models trained with HRF better preserve diversity in downstream tasks, thus enhancing fine-tuning robustness without compromising mean rewards.
arXiv Detail & Related papers (2024-10-10T19:06:23Z)
- Stop Regressing: Training Value Functions via Classification for Scalable Deep RL [109.44370201929246]
We show that training value functions with categorical cross-entropy improves performance and scalability in a variety of domains.
These include: single-task RL on Atari 2600 games with SoftMoEs, multi-task RL on Atari with large-scale ResNets, robotic manipulation with Q-transformers, playing Chess without search, and a language-agent Wordle task with high-capacity Transformers.
arXiv Detail & Related papers (2024-03-06T18:55:47Z)
- Robust Learning with Progressive Data Expansion Against Spurious Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features.
Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process.
We propose a new training algorithm called PDE that efficiently enhances the model's robustness for a better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z)
- Learning to Optimize for Reinforcement Learning [58.01132862590378]
Reinforcement learning (RL) is essentially different from supervised learning, and in practice, learned optimizers do not work well even in simple RL tasks.
The agent-gradient distribution is not independent and identically distributed, leading to inefficient meta-training.
We show that, although trained only on toy tasks, our learned optimizer can generalize to unseen complex tasks in Brax.
arXiv Detail & Related papers (2023-02-03T00:11:02Z)
- Curriculum-based Asymmetric Multi-task Reinforcement Learning [14.5357225087828]
We introduce CAMRL, the first curriculum-based asymmetric multi-task learning (AMTL) algorithm for dealing with multiple reinforcement learning (RL) tasks altogether.
To mitigate the negative influence of customizing the one-off training order in curriculum-based AMTL, CAMRL switches its training mode between parallel single-task RL and asymmetric multi-task RL (MTRL).
We have conducted experiments on a wide range of benchmarks in multi-task RL, covering Gym-minigrid, Meta-world, Atari video games, vision-based PyBullet tasks, and RLBench.
arXiv Detail & Related papers (2022-11-07T08:05:13Z)
- Improving GANs with A Dynamic Discriminator [106.54552336711997]
We argue that a discriminator with an on-the-fly adjustment on its capacity can better accommodate such a time-varying task.
A comprehensive empirical study confirms that the proposed training strategy, termed as DynamicD, improves the synthesis performance without incurring any additional cost or training objectives.
arXiv Detail & Related papers (2022-09-20T17:57:33Z)
- DL-DRL: A double-level deep reinforcement learning approach for large-scale task scheduling of multi-UAV [65.07776277630228]
We propose a double-level deep reinforcement learning (DL-DRL) approach based on a divide and conquer framework (DCF).
Particularly, we design an encoder-decoder structured policy network in our upper-level DRL model to allocate the tasks to different UAVs.
We also use another attention-based policy network in our lower-level DRL model to construct the route for each UAV, with the objective of maximizing the number of executed tasks.
arXiv Detail & Related papers (2022-08-04T04:35:53Z)
- Value Penalized Q-Learning for Recommender Systems [30.704083806571074]
Scaling reinforcement learning to recommender systems (RS) is promising since maximizing the expected cumulative rewards for RL agents meets the objective of RS.
A key approach to this goal is offline RL, which aims to learn policies from logged data.
We propose Value Penalized Q-learning (VPQ), an uncertainty-based offline RL algorithm.
arXiv Detail & Related papers (2021-10-15T08:08:28Z)
- Dynamic Multi-Scale Loss Optimization for Object Detection [14.256807110937622]
We study the objective imbalance of multi-scale detector training.
We propose an Adaptive Variance Weighting (AVW) to balance multi-scale loss according to the statistical variance.
We develop a novel Reinforcement Learning Optimization (RLO) to decide the weighting scheme probabilistically during training.
arXiv Detail & Related papers (2021-08-09T13:12:41Z)
- Combining Pessimism with Optimism for Robust and Efficient Model-Based Deep Reinforcement Learning [56.17667147101263]
In real-world tasks, reinforcement learning agents encounter situations that are not present during training time.
To ensure reliable performance, the RL agents need to exhibit robustness against worst-case situations.
We propose the Robust Hallucinated Upper-Confidence RL (RH-UCRL) algorithm to provably solve this problem.
arXiv Detail & Related papers (2021-03-18T16:50:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.