SIBRE: Self Improvement Based REwards for Adaptive Feedback in
Reinforcement Learning
- URL: http://arxiv.org/abs/2004.09846v3
- Date: Mon, 21 Dec 2020 10:08:03 GMT
- Title: SIBRE: Self Improvement Based REwards for Adaptive Feedback in
Reinforcement Learning
- Authors: Somjit Nath, Richa Verma, Abhik Ray, Harshad Khadilkar
- Abstract summary: We propose a generic reward shaping approach for improving the rate of convergence in reinforcement learning (RL).
The approach is designed for use in conjunction with any existing RL algorithm, and consists of rewarding improvement over the agent's own past performance.
We prove that SIBRE converges in expectation under the same conditions as the original RL algorithm.
- Score: 5.868852957948178
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a generic reward shaping approach for improving the rate of
convergence in reinforcement learning (RL), called Self Improvement Based
REwards, or SIBRE. The approach is designed for use in conjunction with any
existing RL algorithm, and consists of rewarding improvement over the agent's
own past performance. We prove that SIBRE converges in expectation under the
same conditions as the original RL algorithm. The reshaped rewards help
discriminate between policies when the original rewards are weakly
discriminated or sparse. Experiments on several well-known benchmark
environments with different RL algorithms show that SIBRE converges to the
optimal policy faster and more stably. We also perform sensitivity analysis
with respect to hyper-parameters, in comparison with baseline RL algorithms.
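As a rough illustration of the approach described in the abstract, the sketch below (not the authors' code) wraps a Gymnasium environment so that the learner is rewarded at the end of each episode with its improvement over a running baseline of its own past episode returns. The episode-level placement of the shaped reward, the baseline update rule, and the step size beta are assumptions made for illustration; the abstract does not spell out the exact SIBRE update.

import gymnasium as gym


class SelfImprovementReward(gym.Wrapper):
    """SIBRE-style reward shaping sketch (illustrative, not the authors' code).

    The terminal reward becomes the improvement of the episode return over a
    running baseline of the agent's own past returns; intermediate rewards are
    withheld under this episode-level interpretation.
    """

    def __init__(self, env, beta=0.1):
        super().__init__(env)
        self.beta = beta            # assumed step size for the running baseline
        self.baseline = 0.0         # estimate of the agent's past performance
        self._episode_return = 0.0

    def reset(self, **kwargs):
        self._episode_return = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._episode_return += reward
        shaped_reward = 0.0
        if terminated or truncated:
            # Reward the improvement over the agent's own past performance,
            # then move the baseline toward the latest episode return.
            shaped_reward = self._episode_return - self.baseline
            self.baseline += self.beta * (self._episode_return - self.baseline)
        return obs, shaped_reward, terminated, truncated, info

Because only the reward signal changes, any existing RL algorithm can be trained on the wrapped environment unchanged, which is the plug-in property the abstract emphasizes.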
Related papers
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.
In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.
We find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
- Reinforcement Replaces Supervision: Query focused Summarization using Deep Reinforcement Learning [43.123290672073814]
We deal with systems that generate summaries from document(s) based on a query.
Motivated by the insight that Reinforcement Learning (RL) provides a generalization to Supervised Learning (SL) for Natural Language Generation, we use an RL-based approach for this task.
We develop multiple Policy Gradient networks, trained on various reward signals: ROUGE, BLEU, and Semantic Similarity.
arXiv Detail & Related papers (2023-11-29T10:38:16Z)
- Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired.
arXiv Detail & Related papers (2023-05-29T15:00:09Z)
- One-Step Distributional Reinforcement Learning [10.64435582017292]
We present a simpler one-step distributional reinforcement learning (OS-DistrRL) framework.
We show that our approach comes with a unified theory for both policy evaluation and control.
We propose two OS-DistrRL algorithms for which we provide an almost sure convergence analysis.
arXiv Detail & Related papers (2023-04-27T06:57:00Z)
- Offline Policy Optimization in RL with Variance Regularizaton [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
- Deep Black-Box Reinforcement Learning with Movement Primitives [15.184283143878488]
We present a new algorithm for deep reinforcement learning (RL).
It is based on differentiable trust region layers, a successful on-policy deep RL algorithm.
We compare our ERL algorithm to state-of-the-art step-based algorithms in many complex simulated robotic control tasks.
arXiv Detail & Related papers (2022-10-18T06:34:52Z)
- ARC -- Actor Residual Critic for Adversarial Imitation Learning [3.4806267677524896]
We show that ARC-aided AIL outperforms standard AIL in simulated continuous-control and real robotic manipulation tasks.
ARC algorithms are simple to implement and can be incorporated into any existing AIL implementation with an AC algorithm.
arXiv Detail & Related papers (2022-06-05T04:49:58Z)
- False Correlation Reduction for Offline Reinforcement Learning [115.11954432080749]
We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm.
We empirically show that SCORE achieves state-of-the-art performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL).
arXiv Detail & Related papers (2021-10-24T15:34:03Z)
- Combining Pessimism with Optimism for Robust and Efficient Model-Based Deep Reinforcement Learning [56.17667147101263]
In real-world tasks, reinforcement learning agents encounter situations that are not present during training time.
To ensure reliable performance, the RL agents need to exhibit robustness against worst-case situations.
We propose the Robust Hallucinated Upper-Confidence RL (RH-UCRL) algorithm to provably solve this problem.
arXiv Detail & Related papers (2021-03-18T16:50:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.