Episodic Return Decomposition by Difference of Implicitly Assigned
Sub-Trajectory Reward
- URL: http://arxiv.org/abs/2312.10642v1
- Date: Sun, 17 Dec 2023 07:58:19 GMT
- Title: Episodic Return Decomposition by Difference of Implicitly Assigned
Sub-Trajectory Reward
- Authors: Haoxin Lin, Hongqiu Wu, Jiaji Zhang, Yihao Sun, Junyin Ye, Yang Yu
- Abstract summary: We propose a novel episodic return decomposition method called Diaster.
Diaster decomposes any episodic reward into credits of two divided sub-trajectories at any cut point.
Experimental results show that our method outperforms previous state-of-the-art methods in terms of both sample efficiency and performance.
- Score: 8.445578144906415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-world decision-making problems are usually accompanied by delayed
rewards, which affects the sample efficiency of Reinforcement Learning,
especially in the extremely delayed case where the only feedback is the
episodic reward obtained at the end of an episode. Episodic return
decomposition is a promising way to deal with the episodic-reward setting.
Several corresponding algorithms have shown remarkable effectiveness of the
learned step-wise proxy rewards from return decomposition. However, these
existing methods lack either attribution or representation capacity, leading to
inefficient decomposition in the case of long-term episodes. In this paper, we
propose a novel episodic return decomposition method called Diaster (Difference
of implicitly assigned sub-trajectory reward). Diaster decomposes any episodic
reward into credits of two divided sub-trajectories at any cut point, and the
step-wise proxy rewards come from differences in expectation. We theoretically
and empirically verify that the decomposed proxy reward function can guide the
policy to be nearly optimal. Experimental results show that our method
outperforms previous state-of-the-art methods in terms of both sample
efficiency and performance.
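To make the decomposition concrete, here is a minimal sketch in our own notation, based only on the abstract: a hypothetical learned sub-trajectory credit function $\hat{c}_\phi$ (the single-network form is our simplification, not necessarily the paper's parameterization):

    $$R_{\mathrm{ep}}(\tau) \approx \hat{c}_\phi(\tau_{0:t}) + \hat{c}_\phi(\tau_{t:T}) \quad \text{for any cut point } t,$$
    $$\hat{r}(s_t, a_t) = \mathbb{E}\big[\hat{c}_\phi(\tau_{0:t+1})\big] - \mathbb{E}\big[\hat{c}_\phi(\tau_{0:t})\big],$$

so that summing $\hat{r}$ over $t$ telescopes back to an estimate of the full episodic return, which is what allows the step-wise proxy rewards to stand in for the delayed episodic reward during policy learning.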
Related papers
- Reinforcement Learning with Segment Feedback [56.54271464134885]
We consider a model named RL with segment feedback, which offers a general paradigm filling the gap between per-state-action feedback and trajectory feedback.
Under binary feedback, increasing the number of segments $m$ decreases the regret at an exponential rate; in contrast, surprisingly, under sum feedback, increasing $m$ does not reduce the regret significantly.
arXiv Detail & Related papers (2025-02-03T23:08:42Z)
- ABPT: Amended Backpropagation through Time with Partially Differentiable Rewards [3.1986315488647588]
Partially differentiable rewards will result in biased gradient propagation that degrades training performance.
We propose Amended Backpropagation-through-Time (ABPT), a novel approach that mitigates gradient bias while preserving the training efficiency of BPTT.
ABPT combines 0-step and N-step returns, effectively reducing the bias by leveraging value gradients from the learned Q-value function.
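As a rough illustration of combining 0-step and N-step returns (a generic sketch under our own assumptions, not the authors' exact ABPT objective; policy, q_value_fn, and env_step are hypothetical callables):

    def mixed_return(policy, q_value_fn, env_step, s0, n_steps, gamma=0.99, alpha=0.5):
        # 0-step return: trust the learned Q-value function entirely,
        # avoiding gradients through possibly non-differentiable rewards.
        zero_step = q_value_fn(s0, policy(s0))

        # N-step return: accumulate (partially differentiable) rewards along
        # a short rollout, then bootstrap with the Q-value at the last state.
        s, acc, discount = s0, 0.0, 1.0
        for _ in range(n_steps):
            a = policy(s)
            s, r = env_step(s, a)      # assumed differentiable simulator step
            acc = acc + discount * r
            discount *= gamma
        n_step = acc + discount * q_value_fn(s, policy(s))

        # Blend the two estimates; alpha trades bias against variance.
        return alpha * zero_step + (1.0 - alpha) * n_step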
arXiv Detail & Related papers (2025-01-24T14:18:22Z)
- Reward Collapse in Aligning Large Language Models [64.98482888193267]
We study the phenomenon of 'reward collapse', an empirical observation where the prevailing ranking-based approach results in an identical reward distribution.
Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.
arXiv Detail & Related papers (2023-05-28T02:12:00Z)
- Truncating Trajectories in Monte Carlo Reinforcement Learning [48.97155920826079]
In Reinforcement Learning (RL), an agent acts in an unknown environment to maximize the expected cumulative discounted sum of an external reward signal.
We propose an a-priori budget allocation strategy that leads to the collection of trajectories of different lengths.
We show that an appropriate truncation of the trajectories can succeed in improving performance.
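For intuition only, a small sketch of an a-priori budget split over trajectory lengths; the discount-based weighting below is our own illustrative choice, not the paper's allocation rule:

    def truncation_schedule(step_budget, horizon, gamma=0.99):
        """Spread a fixed budget of environment steps over trajectory lengths
        1..horizon, favoring shorter trajectories because early steps carry
        more weight in the discounted return (illustrative heuristic only)."""
        lengths = range(1, horizon + 1)
        weights = [gamma ** (l - 1) for l in lengths]
        steps_per_unit = sum(w * l for w, l in zip(weights, lengths))
        schedule = {}
        for l, w in zip(lengths, weights):
            n_traj = int(step_budget * w / steps_per_unit)
            if n_traj > 0:
                schedule[l] = n_traj   # number of trajectories truncated at length l
        return schedule

    # Example: 10_000 steps with horizon 100 -> many short, few long trajectories.
    print(truncation_schedule(10_000, 100))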
arXiv Detail & Related papers (2023-05-07T19:41:57Z)
- STAS: Spatial-Temporal Return Decomposition for Multi-agent Reinforcement Learning [10.102447181869005]
We introduce a novel method that learns credit assignment in both temporal and spatial dimensions.
Our results demonstrate that our method effectively assigns spatial-temporal credit, outperforming all state-of-the-art baselines.
arXiv Detail & Related papers (2023-04-15T10:09:03Z)
- Reward Imputation with Sketching for Contextual Batched Bandits [48.80803376405073]
Contextual batched bandit (CBB) is a setting where a batch of rewards is observed from the environment at the end of each episode.
Existing approaches for CBB often ignore the rewards of the non-executed actions, leading to underutilization of feedback information.
We propose Sketched Policy Updating with Imputed Rewards (SPUIR) that completes the unobserved rewards using sketching.
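A minimal numpy sketch of the core ingredient as we read the summary: a sketched ridge regression that imputes rewards for non-executed actions. The Gaussian sketch and all dimensions are illustrative assumptions, not the exact SPUIR estimator.

    import numpy as np

    def sketched_reward_imputation(X_obs, y_obs, X_unobs, lam=1.0, sketch_size=64, seed=0):
        """Fit a ridge reward model on observed (context, reward) pairs after
        compressing the samples with a random Gaussian sketch, then impute
        rewards for the contexts of non-executed actions. Illustrative only."""
        rng = np.random.default_rng(seed)
        n, d = X_obs.shape
        S = rng.normal(scale=1.0 / np.sqrt(sketch_size), size=(sketch_size, n))
        SX, Sy = S @ X_obs, S @ y_obs
        theta = np.linalg.solve(SX.T @ SX + lam * np.eye(d), SX.T @ Sy)
        return X_unobs @ theta                    # imputed rewards

    # Toy usage with synthetic data.
    X_obs = np.random.randn(500, 8)
    y_obs = X_obs @ np.ones(8) + 0.1 * np.random.randn(500)
    X_unobs = np.random.randn(20, 8)
    print(sketched_reward_imputation(X_obs, y_obs, X_unobs)[:5])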
arXiv Detail & Related papers (2022-10-13T04:26:06Z)
- Deterministic and Discriminative Imitation (D2-Imitation): Revisiting Adversarial Imitation for Sample Efficiency [61.03922379081648]
We propose an off-policy sample efficient approach that requires no adversarial training or min-max optimization.
Our empirical results show that D2-Imitation is effective in achieving good sample efficiency, outperforming several off-policy extension approaches of adversarial imitation.
arXiv Detail & Related papers (2021-12-11T19:36:19Z)
- Learning Long-Term Reward Redistribution via Randomized Return Decomposition [18.47810850195995]
We consider the problem formulation of episodic reinforcement learning with trajectory feedback.
It refers to an extreme delay of reward signals, in which the agent can only obtain one reward signal at the end of each trajectory.
We propose a novel reward redistribution algorithm, randomized return decomposition (RRD), to learn a proxy reward function for episodic reinforcement learning.
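A simplified sketch of the randomized-return-decomposition idea as summarized above: regress the episodic return onto a rescaled sum of predicted proxy rewards over a random subset of steps. Names and the exact scaling are our assumptions; consult the paper for the precise estimator.

    import torch

    def rrd_style_loss(reward_net, states, actions, episodic_return, subset_size=32):
        """One-trajectory loss: sample a random subset of steps, predict proxy
        rewards there, and match the rescaled subset sum to the episodic
        return. Simplified sketch, not necessarily the paper's exact objective."""
        T = states.shape[0]
        idx = torch.randperm(T)[:subset_size]               # random subset of time steps
        proxy = reward_net(states[idx], actions[idx]).squeeze(-1)
        estimate = (T / idx.numel()) * proxy.sum()           # rescale subset sum to full horizon
        return (episodic_return - estimate) ** 2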
arXiv Detail & Related papers (2021-11-26T13:23:36Z)
- Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification [133.20816939521941]
In the standard Markov decision process formalism, users specify tasks by writing down a reward function.
In many scenarios, the user is unable to describe the task in words or numbers, but can readily provide examples of what the world would look like if the task were solved.
Motivated by this observation, we derive a control algorithm that aims to visit states that have a high probability of leading to successful outcomes, given only examples of successful outcome states.
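A loose sketch of the flavor of such example-based methods: train a classifier on success-example state-actions (label 1) and bootstrap its target on ordinary transitions. This is a simplified illustration under our own assumptions, not the paper's recursive-classification update.

    import torch
    import torch.nn.functional as F

    def success_classifier_loss(classifier, success_states, success_actions,
                                states, actions, next_states, policy, gamma=0.99):
        # Positive term: success examples should score high.
        pos_logits = classifier(success_states, success_actions)
        pos_loss = F.binary_cross_entropy_with_logits(
            pos_logits, torch.ones_like(pos_logits))

        # Bootstrapped term: predictions on ordinary transitions track the
        # discounted success probability at the next state under the policy.
        with torch.no_grad():
            target = gamma * torch.sigmoid(classifier(next_states, policy(next_states)))
        logits = classifier(states, actions)
        boot_loss = F.binary_cross_entropy_with_logits(logits, target)

        return pos_loss + boot_loss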
arXiv Detail & Related papers (2021-03-23T16:19:55Z)