Delayed Rewards Calibration via Reward Empirical Sufficiency
- URL: http://arxiv.org/abs/2102.10527v2
- Date: Tue, 23 Feb 2021 03:19:53 GMT
- Title: Delayed Rewards Calibration via Reward Empirical Sufficiency
- Authors: Yixuan Liu, Hu Wang, Xiaowei Wang, Xiaoyue Sun, Liuyue Jiang and
Minhui Xue
- Abstract summary: We introduce a delayed-reward calibration paradigm inspired by a classification perspective.
We define an empirical sufficient distribution: state vectors within this distribution lead agents to reward signals in subsequent steps.
A purify-trained classifier is designed to obtain the distribution and generate the calibrated rewards.
- Score: 11.089718301262433
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Appropriate credit assignment for delayed rewards is a fundamental
challenge for reinforcement learning. To tackle this problem, we introduce a
delayed-reward calibration paradigm inspired by a classification perspective.
We hypothesize that well-represented state vectors share similarities with one
another, since they contain the same or equivalent essential information. To
this end, we define an empirical sufficient distribution: state vectors within
this distribution lead agents to environmental reward signals in subsequent
steps. A purify-trained classifier is therefore designed to obtain the
distribution and generate the calibrated rewards. We examine the correctness of
sufficient-state extraction by tracking the extraction in real time and by
building different reward functions in the environments. The results
demonstrate that the classifier generates timely and accurate calibrated
rewards, and that these rewards make model training more efficient. Finally, we
discuss how the sufficient states extracted by our model resonate with human
observations.
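As a concrete illustration of the paradigm, the following minimal sketch shows how a classifier over state vectors could produce calibrated rewards. It is not the authors' implementation: the k-step labelling heuristic, the network shape, and the name SufficiencyClassifier are assumptions made for this sketch.

```python
# Minimal sketch of classifier-based delayed-reward calibration.
# NOT the paper's implementation: the labelling rule, network shape,
# and class name are illustrative assumptions.
import torch
import torch.nn as nn

class SufficiencyClassifier(nn.Module):
    """Scores whether a state vector belongs to the empirical
    sufficient distribution (i.e. tends to precede a reward)."""
    def __init__(self, state_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(s)).squeeze(-1)

def label_states(states, env_rewards, horizon=10):
    """Heuristic labels: a state counts as 'sufficient' if an
    environmental reward arrives within `horizon` subsequent steps."""
    labels = torch.zeros(len(states))
    for t in range(len(states)):
        if any(r > 0 for r in env_rewards[t:t + horizon]):
            labels[t] = 1.0
    return labels

def calibrated_rewards(clf, states, scale=1.0):
    """Dense surrogate rewards from the classifier's confidence."""
    with torch.no_grad():
        return scale * clf(states)
```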
Related papers
- Reinforcement Learning from Bagged Reward [46.16904382582698]
In Reinforcement Learning (RL), it is commonly assumed that an immediate reward signal is generated for each action taken by the agent.
In many real-world scenarios, immediate reward signals are not obtainable; instead, agents receive a single reward that is contingent upon a partial sequence or a complete trajectory.
We propose a Transformer-based reward model, the Reward Bag Transformer, which employs a bidirectional attention mechanism to interpret contextual nuances.
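A hedged sketch of the bagged-reward idea follows: a bidirectional Transformer encoder predicts per-step rewards whose sum is trained to match the single bag-level reward. The layer sizes, the name BagRewardModel, and the sum-matching loss are assumptions, not the paper's specification.

```python
# Hedged sketch in the spirit of a bagged-reward Transformer:
# bidirectional attention over the bag, per-step reward heads,
# trained so step rewards sum to the observed bag reward.
import torch
import torch.nn as nn

class BagRewardModel(nn.Module):
    def __init__(self, input_dim, d_model=64, nhead=4, layers=2):
        super().__init__()
        self.proj = nn.Linear(input_dim, d_model)
        enc = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, bag):                  # bag: (B, T, input_dim)
        h = self.encoder(self.proj(bag))     # bidirectional attention
        return self.head(h).squeeze(-1)      # per-step rewards (B, T)

def bag_loss(model, bag, bag_reward):
    """Predicted step rewards should sum to the bag-level reward."""
    step_r = model(bag)
    return ((step_r.sum(dim=1) - bag_reward) ** 2).mean()
```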
arXiv Detail & Related papers (2024-02-06T07:26:44Z)
- Transductive Reward Inference on Graph [53.003245457089406]
We develop a reward inference method based on the contextual properties of information propagation on graphs.
We leverage both the available data and limited reward annotations to construct a reward propagation graph.
We employ the constructed graph for transductive reward inference, thereby estimating rewards for unlabelled data.
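The propagation step can be illustrated with a generic label-propagation analogue: rewards annotated on a few nodes are diffused over a similarity graph. The update rule below is a standard choice assumed for illustration, not necessarily the one used in the paper.

```python
# Illustrative reward propagation over a similarity graph.
import numpy as np

def propagate_rewards(W, rewards, known_mask, alpha=0.9, iters=50):
    """W: (N, N) nonnegative similarity matrix; rewards: (N,), valid
    only where known_mask is True. Returns estimates for all nodes."""
    # Row-normalise similarities into a transition matrix.
    P = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    r = np.where(known_mask, rewards, 0.0)
    est = r.copy()
    for _ in range(iters):
        est = alpha * P @ est + (1 - alpha) * r
        est[known_mask] = rewards[known_mask]  # clamp annotated nodes
    return est
```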
arXiv Detail & Related papers (2024-02-06T03:31:28Z)
- Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output; in particular, it computes attention weights over the completion.
We use these attention weights to redistribute the reward along the whole completion.
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
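One way to realise this, sketched below under assumptions about how heads and layers are pooled, is to spread the scalar reward over tokens in proportion to the reward model's attention on them.

```python
# Hedged sketch of attention-based reward redistribution.
import torch

def redistribute_reward(scalar_reward, attn):
    """attn: (heads, T, T) attention from the reward model's final
    layer; we use attention from the last position, where the scalar
    score is typically read out. Pooling choices are assumptions."""
    weights = attn.mean(dim=0)[-1]     # (T,): average heads, last query
    weights = weights / weights.sum()  # normalise to a distribution
    return scalar_reward * weights     # per-token dense rewards (T,)
```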
arXiv Detail & Related papers (2024-02-01T17:10:35Z)
- Time-series Generation by Contrastive Imitation [87.51882102248395]
We study a generative framework that seeks to combine the strengths of both approaches: motivated by a moment-matching objective to mitigate compounding error, we optimize a local (but forward-looking) transition policy, and by contrastive estimation we learn an energy function over trajectories.
At inference, the learned policy serves as the generator for iterative sampling, and the learned energy serves as a trajectory-level measure for evaluating sample quality.
arXiv Detail & Related papers (2023-11-02T16:45:25Z)
- Reward-Directed Conditional Diffusion: Provable Distribution Estimation and Reward Improvement [42.45888600367566]
Directed generation aims to generate samples with desired properties as measured by a reward function.
We consider the common learning scenario where the data set consists of unlabeled data along with a smaller set of data with noisy reward labels.
arXiv Detail & Related papers (2023-07-13T20:20:40Z)
- Interpretable Reward Redistribution in Reinforcement Learning: A Causal Approach [45.83200636718999]
A major challenge in reinforcement learning is to determine which state-action pairs are responsible for future rewards that are delayed.
We propose to explicitly model the contributions of state and action from a causal perspective, resulting in an interpretable reward redistribution.
Experimental results show that our method outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-05-28T21:51:38Z)
- Reward Collapse in Aligning Large Language Models [64.98482888193267]
We study the phenomenon of 'reward collapse', an empirical observation where the prevailing ranking-based approach results in an identical reward distribution regardless of the prompt.
Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.
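A hedged sketch of a prompt-aware ranking objective follows; the concrete utility family (a logistic utility with a prompt-dependent temperature) is an assumption made for illustration, not the paper's exact construction.

```python
# Illustrative prompt-aware ranking loss for reward-model training.
import torch
import torch.nn.functional as F

def prompt_aware_loss(r_chosen, r_rejected, openness):
    """openness in [0, 1]: 0 = closed-ended prompt (sharp preference),
    1 = open-ended prompt (soft preference). The temperature schedule
    below is an assumed stand-in for a prompt-aware utility."""
    temperature = 0.5 + 2.0 * openness   # softer for open-ended prompts
    gap = (r_chosen - r_rejected) / temperature
    return -F.logsigmoid(gap).mean()
```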
arXiv Detail & Related papers (2023-05-28T02:12:00Z)
- Augmentation by Counterfactual Explanation -- Fixing an Overconfident Classifier [11.233334009240947]
A highly accurate but overconfident model is ill-suited for deployment in critical applications such as healthcare and autonomous driving.
This paper proposes an application of counterfactual explanations in fixing an over-confident classifier.
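The mechanism can be sketched as follows for a binary classifier: counterfactuals near the decision boundary are added to training with maximally uncertain soft labels, so the model learns lower confidence there. The generator generate_cf stands in for the paper's counterfactual-explanation model and is assumed, not part of the source.

```python
# Hedged sketch: fine-tune with counterfactuals carrying soft labels.
import torch
import torch.nn.functional as F

def augment_with_counterfactuals(model, x, y, generate_cf, optimizer):
    """One training step for a 2-class model. `generate_cf` is an
    assumed stand-in that returns near-boundary counterfactuals."""
    x_cf = generate_cf(model, x)
    soft = torch.full((len(x_cf), 2), 0.5)   # maximal-uncertainty targets
    logits = model(torch.cat([x, x_cf]))
    hard_loss = F.cross_entropy(logits[:len(x)], y)
    cf_loss = F.kl_div(F.log_softmax(logits[len(x):], dim=1), soft,
                       reduction="batchmean")
    loss = hard_loss + cf_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```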
arXiv Detail & Related papers (2022-10-21T18:53:16Z)
- Unpacking Reward Shaping: Understanding the Benefits of Reward Engineering on Sample Complexity [114.88145406445483]
Reinforcement learning provides an automated framework for learning behaviors from high-level reward specifications.
In practice, however, the choice of reward function can be crucial for good results.
arXiv Detail & Related papers (2022-10-18T04:21:25Z)
- Distributional Reward Estimation for Effective Multi-Agent Deep Reinforcement Learning [19.788336796981685]
We propose a novel Distributional Reward Estimation framework for effective Multi-Agent Reinforcement Learning (DRE-MARL).
Our main idea is to design the multi-action-branch reward estimation and policy-weighted reward aggregation for stabilized training.
The superiority of DRE-MARL is demonstrated on benchmark multi-agent scenarios against SOTA baselines in terms of both effectiveness and robustness.
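A minimal sketch of the multi-action-branch idea, under assumed architecture sizes: one reward head per discrete action, aggregated under the current policy.

```python
# Hedged sketch of multi-action-branch reward estimation with
# policy-weighted aggregation; sizes and names are assumptions.
import torch
import torch.nn as nn

class BranchRewardEstimator(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.branches = nn.Linear(hidden, n_actions)  # one head per action

    def forward(self, obs):
        return self.branches(self.trunk(obs))         # (B, n_actions)

def aggregated_reward(estimator, obs, policy_probs):
    """Policy-weighted aggregation: E_{a ~ pi}[r_hat(s, a)]."""
    return (estimator(obs) * policy_probs).sum(dim=-1)
```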
arXiv Detail & Related papers (2022-10-14T08:31:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.