Internally Rewarded Reinforcement Learning
- URL: http://arxiv.org/abs/2302.00270v3
- Date: Thu, 24 Aug 2023 19:43:02 GMT
- Title: Internally Rewarded Reinforcement Learning
- Authors: Mengdi Li, Xufeng Zhao, Jae Hee Lee, Cornelius Weber, Stefan Wermter
- Abstract summary: We study a class of reinforcement learning problems where the reward signals for policy learning are generated by an internal reward model.
We show that the proposed reward function can consistently stabilize the training process by reducing the impact of reward noise.
- Score: 22.01249652558878
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study a class of reinforcement learning problems where the reward signals
for policy learning are generated by an internal reward model that is dependent
on and jointly optimized with the policy. This interdependence between the
policy and the reward model leads to an unstable learning process because
reward signals from an immature reward model are noisy and impede policy
learning, and conversely, an under-optimized policy impedes reward estimation
learning. We call this learning setting $\textit{Internally Rewarded
Reinforcement Learning}$ (IRRL) as the reward is not provided directly by the
environment but $\textit{internally}$ by a reward model. In this paper, we
formally formulate IRRL and present a class of problems that belong to IRRL. We
theoretically derive and empirically analyze the effect of the reward function
in IRRL and based on these analyses propose the clipped linear reward function.
Experimental results show that the proposed reward function can consistently
stabilize the training process by reducing the impact of reward noise, which
leads to faster convergence and higher performance compared with baselines in
diverse tasks.
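
The interplay between a jointly trained policy and internal reward model, and the role of the clipped linear reward, can be pictured with a short sketch. The code below is a minimal, hypothetical PyTorch illustration rather than the authors' implementation: it assumes a discrete-action policy trained with REINFORCE, a classifier-style internal reward model, and a clipped linear reward of the assumed form max(0, p_correct - baseline). All names (PolicyNet, RewardModel, clipped_linear_reward, env_step) and dimensions are illustrative, and the actual reward definition, architectures, and tasks in the paper differ.

```python
# Minimal IRRL sketch (not the authors' code): a policy and an internal reward
# model are optimized jointly; the policy is reinforced with a clipped linear
# transform of the reward model's confidence, which bounds the reward below so
# that an immature, noisy reward model cannot inject large negative rewards.
import torch
import torch.nn as nn
from torch.distributions import Categorical

OBS_DIM, N_ACTIONS, N_CLASSES = 16, 4, 10  # illustrative sizes

class PolicyNet(nn.Module):
    """Illustrative policy that picks a discrete action from an observation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh(), nn.Linear(64, N_ACTIONS))
    def forward(self, obs):                       # obs: (B, OBS_DIM)
        return Categorical(logits=self.net(obs))  # action distribution

class RewardModel(nn.Module):
    """Internal reward model, e.g. a classifier scoring the policy's outcome."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh(), nn.Linear(64, N_CLASSES))
    def forward(self, feat):
        return self.net(feat)                     # class logits

def clipped_linear_reward(probs, labels, baseline=0.0):
    """Assumed form of a clipped linear reward: linear in the probability the
    reward model assigns to the correct label, clipped at zero."""
    p_correct = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    return torch.clamp(p_correct - baseline, min=0.0)

def env_step(obs, actions):
    """Placeholder environment: returns features for the reward model to score."""
    return obs + 0.1 * torch.randn_like(obs)

policy, reward_model = PolicyNet(), RewardModel()
opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-3)
opt_rm = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

for step in range(1000):
    obs = torch.randn(32, OBS_DIM)
    labels = torch.randint(0, N_CLASSES, (32,))

    dist = policy(obs)
    actions = dist.sample()
    feat = env_step(obs, actions)

    # 1) Update the internal reward model on outcomes produced by the current policy.
    rm_loss = nn.functional.cross_entropy(reward_model(feat), labels)
    opt_rm.zero_grad()
    rm_loss.backward()
    opt_rm.step()

    # 2) Reinforce the policy with the clipped linear reward; no gradient flows
    #    through the reward model here, only through the policy's log-probabilities.
    with torch.no_grad():
        probs = torch.softmax(reward_model(feat), dim=1)
        rewards = clipped_linear_reward(probs, labels)
    # Mean-reward baseline subtraction is a standard variance-reduction choice here.
    pi_loss = -(dist.log_prob(actions) * (rewards - rewards.mean())).mean()
    opt_pi.zero_grad()
    pi_loss.backward()
    opt_pi.step()
```

The sketch makes the instability argument concrete: early in training the reward model's confidences are noisy, and clipping at zero keeps those noisy scores from dominating the policy gradient while the two components bootstrap each other.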
Related papers
- CLARE: Conservative Model-Based Reward Learning for Offline Inverse
Reinforcement Learning [26.05184273238923]
This work aims to tackle a major challenge in offline Inverse Reinforcement Learning (IRL).
We devise a principled algorithm (namely CLARE) that solves offline IRL efficiently by integrating "conservatism" into a learned reward function.
Our theoretical analysis provides an upper bound on the return gap between the learned policy and the expert policy.
arXiv Detail & Related papers (2023-02-09T17:16:29Z) - Distributional Reward Estimation for Effective Multi-Agent Deep
Reinforcement Learning [19.788336796981685]
We propose a novel Distributional Reward Estimation framework for effective Multi-Agent Reinforcement Learning (DRE-MARL).
Our main idea is to design the multi-action-branch reward estimation and policy-weighted reward aggregation for stabilized training.
The superiority of DRE-MARL over SOTA baselines is demonstrated on benchmark multi-agent scenarios, in terms of both effectiveness and robustness.
arXiv Detail & Related papers (2022-10-14T08:31:45Z) - Transferable Reward Learning by Dynamics-Agnostic Discriminator Ensemble [8.857776147129464]
Recovering reward function from expert demonstrations is a fundamental problem in reinforcement learning.
We present a dynamics-agnostic discriminator-ensemble reward learning method capable of learning both state-action and state-only reward functions.
arXiv Detail & Related papers (2022-06-01T05:16:39Z) - Reward Uncertainty for Exploration in Preference-based Reinforcement
Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward that measures novelty based on the learned reward.
Our experiments show that an exploration bonus derived from uncertainty in the learned reward improves both the feedback- and sample-efficiency of preference-based RL algorithms.
arXiv Detail & Related papers (2022-05-24T23:22:10Z) - Imitating, Fast and Slow: Robust learning from demonstrations via
decision-time planning [96.72185761508668]
Planning at Test-time (IMPLANT) is a new meta-algorithm for imitation learning.
We demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments.
arXiv Detail & Related papers (2022-04-07T17:16:52Z) - Off-policy Reinforcement Learning with Optimistic Exploration and
Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z) - Hindsight Reward Tweaking via Conditional Deep Reinforcement Learning [37.61951923445689]
We propose a novel paradigm for deep reinforcement learning to model the influences of reward functions within a near-optimal space.
We demonstrate the feasibility of this approach and study one of its potential applications: boosting policy performance on multiple MuJoCo tasks.
arXiv Detail & Related papers (2021-09-06T10:06:48Z) - Simplifying Deep Reinforcement Learning via Self-Supervision [51.2400839966489]
Self-Supervised Reinforcement Learning (SSRL) is a simple algorithm that optimizes policies with purely supervised losses.
We show that SSRL is surprisingly competitive with contemporary algorithms, with more stable performance and shorter running time.
arXiv Detail & Related papers (2021-06-10T06:29:59Z) - Information Directed Reward Learning for Reinforcement Learning [64.33774245655401]
We learn a model of the reward function that allows standard RL algorithms to achieve high expected return with as few expert queries as possible.
In contrast to prior active reward learning methods designed for specific types of queries, IDRL naturally accommodates different query types.
We support our findings with extensive evaluations in multiple environments and with different types of queries.
arXiv Detail & Related papers (2021-02-24T18:46:42Z) - Disturbing Reinforcement Learning Agents with Corrupted Rewards [62.997667081978825]
We analyze the effects of different attack strategies based on reward perturbations on reinforcement learning algorithms.
We show that smoothly crafted adversarial rewards can mislead the learner, and that with low exploration probability values the learned policy is more robust to corrupted rewards.
arXiv Detail & Related papers (2021-02-12T15:53:48Z) - Off-Policy Adversarial Inverse Reinforcement Learning [0.0]
Adversarial Imitation Learning (AIL) is a class of algorithms in reinforcement learning (RL).
We propose an Off-Policy Adversarial Inverse Reinforcement Learning (Off-policy-AIRL) algorithm, which is sample-efficient and achieves good imitation performance.
arXiv Detail & Related papers (2020-05-03T16:51:40Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.