Related papers: Offline Reinforcement Learning with Imputed Rewards

Offline Reinforcement Learning with Imputed Rewards

URL: http://arxiv.org/abs/2407.10839v1
Date: Mon, 15 Jul 2024 15:53:13 GMT
Title: Offline Reinforcement Learning with Imputed Rewards
Authors: Carlo Romeo, Andrew D. Bagdanov,
Abstract summary: We propose a Reward Model that can estimate the reward signal from a very limited sample of environment transitions annotated with rewards. Our results show that, using only 1% of reward-labeled transitions from the original datasets, our learned reward model is able to impute rewards for the remaining 99% of the transitions.
Score: 8.856568375969848
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Offline Reinforcement Learning (ORL) offers a robust solution to training agents in applications where interactions with the environment must be strictly limited due to cost, safety, or lack of accurate simulation environments. Despite its potential to facilitate deployment of artificial agents in the real world, Offline Reinforcement Learning typically requires very many demonstrations annotated with ground-truth rewards. Consequently, state-of-the-art ORL algorithms can be difficult or impossible to apply in data-scarce scenarios. In this paper we propose a simple but effective Reward Model that can estimate the reward signal from a very limited sample of environment transitions annotated with rewards. Once the reward signal is modeled, we use the Reward Model to impute rewards for a large sample of reward-free transitions, thus enabling the application of ORL techniques. We demonstrate the potential of our approach on several D4RL continuous locomotion tasks. Our results show that, using only 1\% of reward-labeled transitions from the original datasets, our learned reward model is able to impute rewards for the remaining 99\% of the transitions, from which performant agents can be learned using Offline Reinforcement Learning.

Related papers

Residual Off-Policy RL for Finetuning Behavior Cloning Policies [41.99435186991878]
We present a recipe that combines the benefits of behavior cloning (BC) and reinforcement learning (RL) through a residual learning framework.<n>Our method requires only sparse binary reward signals and can effectively improve manipulation policies on high-degree-of-freedom (DoF) systems.<n>In particular, we demonstrate, to the best of our knowledge, the first successful real-world RL training on a humanoid robot with dexterous hands.
arXiv Detail & Related papers (2025-09-23T17:59:46Z)
Agentic Reinforcement Learning with Implicit Step Rewards [92.26560379363492]
Large language models (LLMs) are increasingly developed as autonomous agents using reinforcement learning (agentic RL)<n>We introduce implicit step rewards for agentic RL (iStar), a general credit-assignment strategy that integrates seamlessly with standard RL algorithms.<n>We evaluate our method on three challenging agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA.
arXiv Detail & Related papers (2025-09-23T16:15:42Z)
From Novelty to Imitation: Self-Distilled Rewards for Offline Reinforcement Learning [7.559920170287638]
Offline Reinforcement Learning (RL) aims to learn effective policies from a static dataset without requiring further agent-environment interactions.<n>We propose ReLOAD, a novel reward annotation framework for offline RL.<n>Our approach adapts Random Network Distillation (RND) to generate intrinsic rewards from expert demonstrations.
arXiv Detail & Related papers (2025-07-17T06:16:06Z)
Learning to Reason without External Rewards [100.27210579418562]
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision.<n>We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data.<n>We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal.
arXiv Detail & Related papers (2025-05-26T07:01:06Z)
Attention-Based Reward Shaping for Sparse and Delayed Rewards [7.811459544911894]
We propose Attention-based REward Shaping (ARES) to generate shaped rewards for any environment.<n>ARES can be trained entirely offline and is able to generate meaningful shaped rewards even when using small datasets or episodes produced by agents taking random actions.<n>Our results show that ARES can significantly improve learning in delayed reward settings, enabling RL agents to train in scenarios that would otherwise require impractical amounts of data or even be unlearnable.
arXiv Detail & Related papers (2025-05-16T02:43:05Z)
Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems [54.4392552373835]
Reward models (RMs) are crucial for the training and inference-time scaling up of large language models (LLMs) We propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals to provide reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and inference time best-of-n searches on real-world downstream tasks.
arXiv Detail & Related papers (2025-02-26T17:19:12Z)
Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment [65.15914284008973]
We propose to leverage an Inverse Reinforcement Learning (IRL) technique to simultaneously build an reward model and a policy model. We show that the proposed algorithms converge to the stationary solutions of the IRL problem. Our results indicate that it is beneficial to leverage reward learning throughout the entire alignment process.
arXiv Detail & Related papers (2024-05-28T07:11:05Z)
Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output. We use these attention weights to redistribute the reward along the whole completion. Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
arXiv Detail & Related papers (2024-02-01T17:10:35Z)
Leveraging Optimal Transport for Enhanced Offline Reinforcement Learning in Surgical Robotic Environments [4.2569494803130565]
We introduce an innovative algorithm designed to assign rewards to offline trajectories, using a small number of high-quality expert demonstrations. This approach circumvents the need for handcrafted rewards, unlocking the potential to harness vast datasets for policy learning.
arXiv Detail & Related papers (2023-10-13T03:39:15Z)
Deep Reinforcement Learning from Hierarchical Preference Design [99.46415116087259]
This paper shows by exploiting certain structures, one can ease the reward design process. We propose a hierarchical reward modeling framework -- HERON for scenarios: (I) The feedback signals naturally present hierarchy; (II) The reward is sparse, but with less important surrogate feedback to help policy learning.
arXiv Detail & Related papers (2023-09-06T00:44:29Z)
Optimal Transport for Offline Imitation Learning [31.218468923400373]
offline reinforcement learning (RL) is a promising framework for learning good decision-making policies without the need to interact with the real environment. We introduce Optimal Transport Reward labeling (OTR), an algorithm that assigns rewards to offline trajectories. We show that OTR with a single demonstration can consistently match the performance of offline RL with ground-truth rewards.
arXiv Detail & Related papers (2023-03-24T12:45:42Z)
Actively Learning Costly Reward Functions for Reinforcement Learning [56.34005280792013]
We show that it is possible to train agents in complex real-world environments orders of magnitudes faster. By enabling the application of reinforcement learning methods to new domains, we show that we can find interesting and non-trivial solutions.
arXiv Detail & Related papers (2022-11-23T19:17:20Z)
Handling Sparse Rewards in Reinforcement Learning Using Model Predictive Control [9.118706387430883]
Reinforcement learning (RL) has recently proven great success in various domains. Yet, the design of the reward function requires detailed domain expertise and tedious fine-tuning to ensure that agents are able to learn the desired behaviour. We propose to use model predictive control(MPC) as an experience source for training RL agents in sparse reward environments.
arXiv Detail & Related papers (2022-10-04T11:06:38Z)
Residual Reinforcement Learning from Demonstrations [51.56457466788513]
Residual reinforcement learning (RL) has been proposed as a way to solve challenging robotic tasks by adapting control actions from a conventional feedback controller to maximize a reward signal. We extend the residual formulation to learn from visual inputs and sparse rewards using demonstrations. Our experimental evaluation on simulated manipulation tasks on a 6-DoF UR5 arm and a 28-DoF dexterous hand demonstrates that residual RL from demonstrations is able to generalize to unseen environment conditions more flexibly than either behavioral cloning or RL fine-tuning.
arXiv Detail & Related papers (2021-06-15T11:16:49Z)
Semi-supervised reward learning for offline reinforcement learning [71.6909757718301]
Training agents usually requires reward functions, but rewards are seldom available in practice and their engineering is challenging and laborious. We propose semi-supervised learning algorithms that learn from limited annotations and incorporate unlabelled data. In our experiments with a simulated robotic arm, we greatly improve upon behavioural cloning and closely approach the performance achieved with ground truth rewards.
arXiv Detail & Related papers (2020-12-12T20:06:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.