DreamSmooth: Improving Model-based Reinforcement Learning via Reward
Smoothing
- URL: http://arxiv.org/abs/2311.01450v2
- Date: Sun, 18 Feb 2024 00:59:14 GMT
- Title: DreamSmooth: Improving Model-based Reinforcement Learning via Reward
Smoothing
- Authors: Vint Lee, Pieter Abbeel, Youngwoon Lee
- Abstract summary: DreamSmooth learns to predict a temporally-smoothed reward, instead of the exact reward at the given timestep.
We show that DreamSmooth achieves state-of-the-art performance on long-horizon sparse-reward tasks.
- Score: 60.21269454707625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Model-based reinforcement learning (MBRL) has gained much attention for its
ability to learn complex behaviors in a sample-efficient way: planning actions
by generating imaginary trajectories with predicted rewards. Despite its
success, we found that, surprisingly, reward prediction is often a bottleneck of
MBRL, especially for sparse rewards that are challenging (or even ambiguous) to
predict. Motivated by the intuition that humans can learn from rough reward
estimates, we propose a simple yet effective reward smoothing approach,
DreamSmooth, which learns to predict a temporally-smoothed reward, instead of
the exact reward at the given timestep. We empirically show that DreamSmooth
achieves state-of-the-art results on long-horizon sparse-reward tasks, in both
sample efficiency and final performance, without degrading performance on common
benchmarks such as the DeepMind Control Suite and Atari.
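The smoothing idea itself is simple enough to sketch. The snippet below is a minimal illustration, assuming a Gaussian smoothing kernel; the kernel choice, the `sigma` value, and the function name are illustrative assumptions rather than the paper's implementation. It shows how a sparse per-timestep reward sequence can be turned into the temporally-smoothed targets that the reward model learns to predict.

```python
import numpy as np

def smooth_rewards(rewards: np.ndarray, sigma: float = 3.0) -> np.ndarray:
    """Temporally smooth a reward sequence with a truncated Gaussian kernel.

    Illustrative sketch only: the reward model would be trained to predict
    this smoothed signal instead of the exact reward at each timestep.
    """
    radius = int(3 * sigma)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()  # normalize so the total reward is (approximately) preserved
    return np.convolve(rewards, kernel, mode="same")

# Example: an episode with a single sparse reward at t = 50. The smoothed
# target spreads that reward over nearby timesteps, giving the reward model
# an easier and less ambiguous prediction problem.
episode_rewards = np.zeros(100)
episode_rewards[50] = 1.0
smoothed_targets = smooth_rewards(episode_rewards)
```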
Related papers
- R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback [25.27230140274847]
Reinforcement learning from human feedback (RLHF) provides a paradigm for aligning large language models (LLMs) with human preferences.
This paper proposes a novel reward redistribution method called R3HF, which facilitates a more fine-grained, token-level reward allocation.
arXiv Detail & Related papers (2024-11-13T02:45:21Z)
- Beyond Simple Sum of Delayed Rewards: Non-Markovian Reward Modeling for Reinforcement Learning [44.770495418026734]
Reinforcement Learning (RL) empowers agents to acquire various skills by learning from reward signals.
Traditional methods assume the existence of underlying Markovian rewards and that the observed delayed reward is simply the sum of instance-level rewards.
We propose Composite Delayed Reward Transformer (CoDeTr), which incorporates a specialized in-sequence attention mechanism.
arXiv Detail & Related papers (2024-10-26T13:12:27Z)
- Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output.
We use these attention weights to redistribute the reward along the whole completion.
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
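(A minimal sketch of this attention-based reward redistribution appears after this list.)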
arXiv Detail & Related papers (2024-02-01T17:10:35Z)
- Unpacking Reward Shaping: Understanding the Benefits of Reward Engineering on Sample Complexity [114.88145406445483]
Reinforcement learning provides an automated framework for learning behaviors from high-level reward specifications.
In practice, the choice of reward function can be crucial for good results.
arXiv Detail & Related papers (2022-10-18T04:21:25Z)
- Handling Sparse Rewards in Reinforcement Learning Using Model Predictive Control [9.118706387430883]
Reinforcement learning (RL) has recently shown great success in various domains.
Yet, the design of the reward function requires detailed domain expertise and tedious fine-tuning to ensure that agents are able to learn the desired behaviour.
We propose to use model predictive control (MPC) as an experience source for training RL agents in sparse-reward environments.
arXiv Detail & Related papers (2022-10-04T11:06:38Z)
- Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL): inferring the reward function of an agent from observing its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z)
- Learning Dense Reward with Temporal Variant Self-Supervision [5.131840233837565]
Complex real-world robotic applications lack explicit and informative descriptions that can directly be used as rewards.
Prior work has shown that it is possible to algorithmically extract dense rewards directly from multimodal observations.
This paper proposes a more efficient and robust way of sampling and learning.
arXiv Detail & Related papers (2022-05-20T20:30:57Z)
- Imitating, Fast and Slow: Robust learning from demonstrations via decision-time planning [96.72185761508668]
IMPLANT is a new meta-algorithm for imitation learning that plans at test time.
We demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments.
arXiv Detail & Related papers (2022-04-07T17:16:52Z)
- SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning [168.89470249446023]
We present SURF, a semi-supervised reward learning framework that utilizes a large amount of unlabeled samples with data augmentation.
In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor.
Our experiments demonstrate that our approach significantly improves the feedback-efficiency of the preference-based method on a variety of locomotion and robotic manipulation tasks.
arXiv Detail & Related papers (2022-03-18T16:50:38Z)
- Reward prediction for representation learning and reward shaping [0.8883733362171032]
We propose learning a state representation in a self-supervised manner for reward prediction.
We augment the training of out-of-the-box RL agents by shaping the reward using our reward predictor during policy learning.
arXiv Detail & Related papers (2021-05-07T11:29:32Z)
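As noted in the Dense Reward for Free entry above, the attention-based reward redistribution it describes can be sketched in a few lines. The snippet below is a hypothetical illustration under simple assumptions: the attention weights stand in for those extracted from a learned reward model, and the function name and shapes are made up for the example.

```python
import numpy as np

def redistribute_reward(attention_weights: np.ndarray, sequence_reward: float) -> np.ndarray:
    """Spread one sequence-level scalar reward over tokens in proportion to attention.

    Illustrative only: normalizing the weights keeps the per-token rewards
    summing to the original scalar reward.
    """
    weights = attention_weights / attention_weights.sum()
    return weights * sequence_reward

# Example: a 5-token completion whose reward model output is 2.0.
attention = np.array([0.10, 0.40, 0.20, 0.20, 0.10])
dense_rewards = redistribute_reward(attention, 2.0)  # -> [0.2, 0.8, 0.4, 0.4, 0.2]
```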
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.