Shaping Sparse Rewards in Reinforcement Learning: A Semi-supervised Approach
- URL: http://arxiv.org/abs/2501.19128v3
- Date: Tue, 05 Aug 2025 10:02:43 GMT
- Title: Shaping Sparse Rewards in Reinforcement Learning: A Semi-supervised Approach
- Authors: Wenyun Li, Wenjie Huang, Chen Sun
- Abstract summary: Experimental results in Atari and robotic manipulation demonstrate that our method outperforms supervised approaches in reward inference. In more sparse-reward environments, our method achieves up to twice the peak scores compared to supervised baselines.
- Score: 7.200081267352692
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In many real-world scenarios, reward signals for agents are exceedingly sparse, making it challenging to learn an effective reward function for reward shaping. To address this issue, the approach proposed in this paper performs reward shaping not only by utilizing non-zero-reward transitions but also by employing Semi-Supervised Learning (SSL) combined with a novel data augmentation to learn trajectory-space representations from the majority of transitions, i.e., zero-reward transitions, thereby improving the efficacy of reward shaping. Experimental results in Atari and robotic manipulation demonstrate that our method outperforms supervised approaches in reward inference, leading to higher agent scores. Notably, in more sparse-reward environments, our method achieves up to twice the peak scores compared to supervised baselines. The proposed double entropy data augmentation further enhances performance, showing a 15.8% increase in best score over other augmentation methods.
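To make the recipe concrete, here is a minimal sketch of the general idea, not the paper's exact algorithm: fit a reward model on the few non-zero-reward transitions and let confident pseudo-labels on the zero-reward majority refine it. All data and names below are illustrative, and the proposed double entropy data augmentation is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy features: a handful of non-zero-reward transitions (positives) and a
# large pool of zero-reward transitions, most of which we treat as unlabeled.
X_pos = rng.normal(1.0, 1.0, size=(32, 8))
X_neg = rng.normal(-1.0, 1.0, size=(32, 8))   # small trusted negative set
X_unl = rng.normal(0.0, 1.0, size=(2048, 8))  # bulk of zero-reward data

X_lab = np.vstack([X_pos, X_neg])
y_lab = np.concatenate([np.ones(32), np.zeros(32)])

def predict(X, w):
    """Tiny linear reward model squashed to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-X @ w))

w = np.zeros(8)
for _ in range(200):
    # Supervised term on the sparse labeled transitions.
    p = predict(X_lab, w)
    grad = X_lab.T @ (p - y_lab) / len(y_lab)

    # Semi-supervised term: pseudo-label only confident unlabeled transitions,
    # letting the majority of (zero-reward) data shape the reward model.
    q = predict(X_unl, w)
    mask = (q < 0.1) | (q > 0.9)
    if mask.any():
        pseudo = (q[mask] > 0.5).astype(float)
        grad += 0.1 * X_unl[mask].T @ (q[mask] - pseudo) / mask.sum()

    w -= 0.5 * grad

dense_shaping_signal = predict(X_unl, w)  # inferred rewards for shaping
```

The inferred `dense_shaping_signal` would then be added to the environment reward during policy training.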
Related papers
- Decomposing the Entropy-Performance Exchange: The Missing Keys to Unlocking Effective Reinforcement Learning [106.68304931854038]
Reinforcement learning with verifiable rewards (RLVR) has been widely used for enhancing the reasoning abilities of large language models (LLMs). We conduct a systematic empirical analysis of the entropy-performance exchange mechanism of RLVR across different levels of granularity. Our analysis reveals that, in the rising stage, entropy reduction in negative samples facilitates the learning of effective reasoning patterns. In the plateau stage, learning efficiency strongly correlates with high-entropy tokens present in low-perplexity samples and those located at the end of sequences.
arXiv Detail & Related papers (2025-08-04T10:08:10Z)
- ReDit: Reward Dithering for Improved LLM Policy Optimization [6.841631032347429]
DeepSeek-R1 has successfully enhanced Large Language Model (LLM) reasoning capabilities through its rule-based reward system. While it is a "perfect" reward system that effectively mitigates reward hacking, such reward functions are often discrete. We propose ReDit (Reward Dithering), a method that dithers the discrete reward signal by adding simple random noise.
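The core trick is easy to sketch. In the snippet below, Gaussian noise and the scale `sigma` are assumptions; the abstract only specifies "simple random noise" added to the discrete reward.

```python
import torch

def redit_reward(discrete_reward: torch.Tensor, sigma: float = 0.05) -> torch.Tensor:
    """Dither a discrete rule-based reward with zero-mean noise (sketch).

    The expected reward is unchanged, but the signal is no longer
    piecewise-constant, which can smooth policy-gradient optimization.
    """
    return discrete_reward + sigma * torch.randn_like(discrete_reward)

rewards = torch.tensor([0.0, 1.0, 0.0, 1.0])  # e.g. pass/fail verifier rewards
print(redit_reward(rewards))
```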
arXiv Detail & Related papers (2025-06-23T13:36:24Z)
- TreeRPO: Tree Relative Policy Optimization [55.97385410074841]
TreeRPO is a novel method that estimates the mathematical expectations of rewards at various reasoning steps using tree sampling. Building on the group-relative reward training mechanism of GRPO, TreeRPO innovatively computes rewards based on step-level groups generated during tree sampling.
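A hedged reading of that mechanism: a step's value is the mean terminal reward of completions sampled beneath it, and sibling steps are scored relative to their group mean, GRPO-style. The `Node` structure and numbers below are illustrative only.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Optional

@dataclass
class Node:
    children: list = field(default_factory=list)
    leaf_reward: Optional[float] = None  # terminal reward from a verifier

def subtree_value(node: Node) -> float:
    """Monte-Carlo estimate of a step's expected reward over its leaves."""
    if node.leaf_reward is not None:
        return node.leaf_reward
    return mean(subtree_value(c) for c in node.children)

def step_advantages(node: Node) -> list:
    """Group-relative signal among sibling steps, GRPO-style."""
    values = [subtree_value(c) for c in node.children]
    baseline = mean(values)
    return [v - baseline for v in values]

# Two candidate next steps; the first leads to mostly correct completions.
root = Node(children=[
    Node(children=[Node(leaf_reward=1.0), Node(leaf_reward=1.0)]),
    Node(children=[Node(leaf_reward=0.0), Node(leaf_reward=1.0)]),
])
print(step_advantages(root))  # [0.25, -0.25]
```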
arXiv Detail & Related papers (2025-06-05T15:56:38Z)
- Highly Efficient Self-Adaptive Reward Shaping for Reinforcement Learning [5.242869847419834]
Reward shaping is a technique in reinforcement learning that addresses the sparse-reward problem by providing more frequent and informative rewards.
We introduce a self-adaptive and highly efficient reward shaping mechanism that incorporates success rates derived from historical experiences as shaped rewards.
Our method is validated on various tasks with extremely sparse rewards, demonstrating notable improvements in sample efficiency and convergence stability over relevant baselines.
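The mechanism suggested by the abstract admits a very small sketch, assuming success rates are tracked per (discretized) state; the class name and the `beta` scale are hypothetical.

```python
from collections import defaultdict

class SuccessRateShaper:
    """Shape sparse rewards with success rates from past episodes (sketch)."""

    def __init__(self, beta: float = 0.1):
        self.successes = defaultdict(int)
        self.visits = defaultdict(int)
        self.beta = beta  # shaping scale (hypothetical)

    def end_episode(self, visited_states, succeeded: bool):
        # Update historical statistics once the sparse outcome is known.
        for s in set(visited_states):
            self.visits[s] += 1
            self.successes[s] += int(succeeded)

    def shaped(self, env_reward: float, state) -> float:
        # Add the state's historical success rate as a dense bonus.
        rate = self.successes[state] / self.visits[state] if self.visits[state] else 0.0
        return env_reward + self.beta * rate

shaper = SuccessRateShaper()
shaper.end_episode(["s0", "s1"], succeeded=True)
print(shaper.shaped(0.0, "s1"))  # dense signal where env reward was zero
```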
arXiv Detail & Related papers (2024-08-06T08:22:16Z)
- Unveiling the Significance of Toddler-Inspired Reward Transition in Goal-Oriented Reinforcement Learning [16.93475375389869]
Drawing inspiration from the Toddler-Inspired Reward Transition, we set out to explore the implications of varying reward transitions when incorporated into Reinforcement Learning (RL) tasks.
Through various experiments, including those in egocentric navigation and robotic arm manipulation tasks, we found that proper reward transitions significantly influence sample efficiency and success rates.
arXiv Detail & Related papers (2024-03-11T16:34:23Z)
- Transductive Reward Inference on Graph [53.003245457089406]
We develop a reward inference method based on the contextual properties of information propagation on graphs.
We leverage both the available data and limited reward annotations to construct a reward propagation graph.
We employ the constructed graph for transductive reward inference, thereby estimating rewards for unlabelled data.
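The propagation step can be sketched as classic graph label propagation with annotated rewards clamped; the update rule and `alpha` are assumptions, not the paper's exact formulation.

```python
import numpy as np

def propagate_rewards(A, rewards, labeled, alpha=0.9, iters=100):
    """Transductive reward inference over a similarity graph (sketch).

    A: (n, n) nonnegative adjacency; rewards: annotations (0 where unknown);
    labeled: boolean mask of annotated nodes, which stay clamped.
    """
    P = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)  # row-normalize
    r = rewards.astype(float).copy()
    for _ in range(iters):
        r = alpha * (P @ r) + (1 - alpha) * rewards  # average over neighbors
        r[labeled] = rewards[labeled]                # keep annotations fixed
    return r

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
print(propagate_rewards(A, np.array([1.0, 0.0, 0.0]),
                        np.array([True, False, False])))
```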
arXiv Detail & Related papers (2024-02-06T03:31:28Z)
- Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output.
We use these attention weights to redistribute the reward along the whole completion.
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
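The redistribution itself is a one-liner once per-token attention weights have been extracted from the reward model; how those weights are aggregated across heads and layers is left out of this sketch.

```python
import torch

def redistribute_reward(scalar_reward: torch.Tensor,
                        attention: torch.Tensor) -> torch.Tensor:
    """Spread a sequence-level scalar reward over tokens (sketch).

    Weights are normalized so the dense rewards sum to the original
    scalar, leaving the total return unchanged.
    """
    weights = attention / attention.sum()
    return scalar_reward * weights

attn = torch.tensor([0.1, 0.4, 0.2, 0.3])  # per-token attention (assumed given)
print(redistribute_reward(torch.tensor(1.0), attn))
```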
arXiv Detail & Related papers (2024-02-01T17:10:35Z)
- DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing [60.21269454707625]
DreamSmooth learns to predict a temporally-smoothed reward, instead of the exact reward at the given timestep.
We show that DreamSmooth achieves state-of-the-art performance on long-horizon sparse-reward tasks.
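A minimal sketch of the smoothing target, here with a Gaussian kernel; treat the kernel choice and `sigma` as assumptions about one reasonable instantiation.

```python
import numpy as np

def smooth_rewards(rewards: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """Temporally smooth a sparse reward sequence (sketch).

    The smoothed sequence, rather than the exact per-step reward,
    becomes the reward model's prediction target.
    """
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()
    return np.convolve(rewards, kernel, mode="same")

sparse = np.zeros(20)
sparse[15] = 1.0  # single sparse reward near the end
print(smooth_rewards(sparse).round(3))  # mass spread around step 15
```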
arXiv Detail & Related papers (2023-11-02T17:57:38Z)
- Adversarial Batch Inverse Reinforcement Learning: Learn to Reward from Imperfect Demonstration for Interactive Recommendation [23.048841953423846]
We focus on the problem of learning to reward, which is fundamental to reinforcement learning.
Previous approaches introduce additional procedures for learning to reward, thereby increasing the complexity of optimization.
We propose a novel batch inverse reinforcement learning paradigm that achieves the desired properties.
arXiv Detail & Related papers (2023-10-30T13:43:20Z)
- A State Augmentation based approach to Reinforcement Learning from Human Preferences [20.13307800821161]
Preference-Based Reinforcement Learning attempts to solve this issue by utilizing binary feedback on queried trajectory pairs.
We present a state augmentation technique that makes the agent's reward model robust.
arXiv Detail & Related papers (2023-02-17T07:10:50Z)
- Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL): inferring the reward function of an agent from observations of its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z)
- Learning Dense Reward with Temporal Variant Self-Supervision [5.131840233837565]
Complex real-world robotic applications lack explicit and informative descriptions that can directly be used as rewards.
Previous effort has shown that it is possible to algorithmically extract dense rewards directly from multimodal observations.
This paper proposes a more efficient and robust way of sampling and learning.
arXiv Detail & Related papers (2022-05-20T20:30:57Z)
- SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning [168.89470249446023]
We present SURF, a semi-supervised reward learning framework that utilizes a large amount of unlabeled samples with data augmentation.
In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor.
Our experiments demonstrate that our approach significantly improves the feedback-efficiency of the preference-based method on a variety of locomotion and robotic manipulation tasks.
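The pseudo-labelling step is straightforward to sketch for a binary preference predictor over segment pairs; the confidence threshold `tau` is illustrative.

```python
import torch

def surf_pseudo_labels(logits: torch.Tensor, tau: float = 0.9):
    """Confidence-based pseudo-labelling for unlabeled segment pairs (sketch).

    logits: predictor outputs for 'segment 0 is preferred to segment 1'.
    Pairs whose predicted preference probability exceeds tau are kept and
    labelled with the predictor's own choice; the rest are discarded.
    """
    probs = torch.sigmoid(logits)
    confidence = torch.maximum(probs, 1 - probs)
    keep = confidence > tau
    labels = (probs > 0.5).float()
    return keep, labels

keep, labels = surf_pseudo_labels(torch.tensor([3.0, 0.2, -2.5]))
print(keep, labels)  # only confident pairs enter the reward-learning loss
```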
arXiv Detail & Related papers (2022-03-18T16:50:38Z)
- Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping [71.214923471669]
Reward shaping is an effective technique for incorporating domain knowledge into reinforcement learning (RL).
In this paper, we consider the problem of adaptively utilizing a given shaping reward function.
Experiments in sparse-reward cartpole and MuJoCo environments show that our algorithms can fully exploit beneficial shaping rewards.
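One way to read "adaptively utilizing a given shaping reward": learn a state-dependent weight on the shaping term, trained so that the environment return improves. The sketch below assumes that reading and omits the bi-level training loop.

```python
import torch

# z_phi outputs a weight in [-1, 1], so the agent can exploit, ignore,
# or even flip a given shaping reward f(s, a) per state.
z_phi = torch.nn.Sequential(torch.nn.Linear(4, 16), torch.nn.ReLU(),
                            torch.nn.Linear(16, 1), torch.nn.Tanh())

def combined_reward(state: torch.Tensor, r_env: float,
                    f_shaping: float) -> torch.Tensor:
    weight = z_phi(state).squeeze(-1)
    return r_env + weight * f_shaping

print(combined_reward(torch.randn(4), r_env=0.0, f_shaping=0.3))
```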
arXiv Detail & Related papers (2020-11-05T05:34:14Z)
- Deep Reinforcement Learning with a Stage Incentive Mechanism of Dense Reward for Robotic Trajectory Planning [3.0242753679068466]
We present three dense reward functions to improve the efficiency of DRL-based methods for robot manipulator trajectory planning.
A posture reward function is proposed to speed up the learning process by encouraging more reasonable trajectories.
A stride reward function is proposed to improve the stability of the learning process.
arXiv Detail & Related papers (2020-09-25T07:36:32Z)
- Intrinsic Reward Driven Imitation Learning via Generative Model [48.97800481338626]
Most inverse reinforcement learning (IRL) methods fail to outperform the demonstrator in a high-dimensional environment.
We propose a novel reward learning module to generate intrinsic reward signals via a generative model.
Empirical results show that our method outperforms state-of-the-art IRL methods on multiple Atari games, even with a one-life demonstration.
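A hedged sketch of the general mechanism, with the reconstruction error of a stand-in autoencoder as the intrinsic signal; the paper's actual generative module and training objective differ.

```python
import torch

# Stand-in generative model; poorly reconstructed (novel) transitions
# earn larger intrinsic rewards.
autoencoder = torch.nn.Sequential(torch.nn.Linear(8, 3), torch.nn.ReLU(),
                                  torch.nn.Linear(3, 8))

def intrinsic_reward(obs: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        recon = autoencoder(obs)
    return ((recon - obs) ** 2).mean(dim=-1)  # higher error -> higher reward

print(intrinsic_reward(torch.randn(5, 8)))
```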
arXiv Detail & Related papers (2020-06-26T15:39:40Z)