Interpretable Reward Redistribution in Reinforcement Learning: A Causal
Approach
- URL: http://arxiv.org/abs/2305.18427v3
- Date: Fri, 10 Nov 2023 21:58:57 GMT
- Title: Interpretable Reward Redistribution in Reinforcement Learning: A Causal
Approach
- Authors: Yudi Zhang, Yali Du, Biwei Huang, Ziyan Wang, Jun Wang, Meng Fang,
Mykola Pechenizkiy
- Abstract summary: A major challenge in reinforcement learning is to determine which state-action pairs are responsible for future rewards that are delayed.
We propose to explicitly model the contributions of state and action from a causal perspective, resulting in an interpretable reward redistribution.
Experimental results show that our method outperforms state-of-the-art methods.
- Score: 45.83200636718999
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A major challenge in reinforcement learning is to determine which
state-action pairs are responsible for future rewards that are delayed. Reward
redistribution serves as a solution to re-assign credits for each time step
from observed sequences. While the majority of current approaches construct the
reward redistribution in an uninterpretable manner, we propose to explicitly
model the contributions of state and action from a causal perspective,
resulting in an interpretable reward redistribution and preserving policy
invariance. In this paper, we start by studying the role of causal generative
models in reward redistribution by characterizing the generation of Markovian
rewards and trajectory-wise long-term return and further propose a framework,
called Generative Return Decomposition (GRD), for policy optimization in
delayed reward scenarios. Specifically, GRD first identifies the unobservable
Markovian rewards and causal relations in the generative process. Then, GRD
makes use of the identified causal generative model to form a compact
representation to train policy over the most favorable subspace of the state
space of the agent. Theoretically, we show that the unobservable Markovian
reward function is identifiable, as well as the underlying causal structure and
causal models. Experimental results show that our method outperforms
state-of-the-art methods and the provided visualization further demonstrates
the interpretability of our method. The project page is located at
https://reedzyd.github.io/GenerativeReturnDecomposition/.
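The abstract describes GRD only at a high level. As a rough illustration, the sketch below shows the core return-decomposition idea under stated assumptions: a per-step reward model gated by learnable soft masks (standing in for the identified causal parents) is fit so that its predicted Markovian rewards sum to the observed trajectory-wise return, and the active mask entries define a compact state subspace on which a policy would be trained. The module names, the soft-mask parameterization, and the sparsity penalty are illustrative choices, not the authors' implementation.
```python
# Hypothetical sketch of GRD-style generative return decomposition (not the
# authors' code): a masked per-step reward model whose predictions must sum
# to the observed trajectory-wise return, plus a masked compact state for
# the policy. Shapes: states [T, S], actions [T, A].
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedRewardModel(nn.Module):
    """Predicts unobservable Markovian rewards r_t from (s_t, a_t),
    gated by learnable soft masks that stand in for causal parents."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        # Soft masks over state/action dimensions; sigmoid keeps them in (0, 1).
        self.state_mask_logits = nn.Parameter(torch.zeros(state_dim))
        self.action_mask_logits = nn.Parameter(torch.zeros(action_dim))
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def masks(self):
        return torch.sigmoid(self.state_mask_logits), torch.sigmoid(self.action_mask_logits)

    def forward(self, states, actions):
        m_s, m_a = self.masks()
        x = torch.cat([states * m_s, actions * m_a], dim=-1)
        return self.net(x).squeeze(-1)  # predicted r_1 .. r_T


def decomposition_loss(model, states, actions, episodic_return, sparsity=1e-3):
    """Fit sum_t r_hat_t to the delayed trajectory return (episodic_return is a
    scalar tensor); a penalty on the masks stands in for learning sparse structure."""
    r_hat = model(states, actions)
    m_s, m_a = model.masks()
    fit = F.mse_loss(r_hat.sum(), episodic_return)
    return fit + sparsity * (m_s.sum() + m_a.sum())


def compact_state(model, states, threshold=0.5):
    """Keep only state dimensions whose learned mask is active; the policy
    would be trained on this reduced representation."""
    m_s, _ = model.masks()
    return states[:, m_s > threshold]
```
In GRD proper, the masks correspond to an identified causal structure and the predicted per-step rewards serve as the interpretable redistributed rewards for policy optimization; the sketch only mirrors that pipeline in its simplest form.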
Related papers
- R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback [25.27230140274847]
Reinforcement learning from human feedback (RLHF) provides a paradigm for aligning large language models (LLMs) with human preferences.
This paper proposes a novel reward redistribution method called R3HF, which facilitates a more fine-grained, token-level reward allocation.
arXiv Detail & Related papers (2024-11-13T02:45:21Z)
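The R3HF summary above names token-level reward allocation but not a mechanism. One common way to obtain token-level credit from a sequence-level reward model is to score successive prefixes of the response and use the increments as per-token rewards; the sketch below illustrates that idea only, with `reward_model` as an assumed scalar-scoring callable, and is not necessarily R3HF's published procedure.
```python
# Hypothetical sketch of token-level reward redistribution from a
# sequence-level reward model: score each prefix of the response and treat
# the increment as that token's reward. Illustration of the general idea,
# not R3HF's algorithm.
from typing import Callable, List, Sequence


def token_level_rewards(
    prompt: Sequence[int],
    response: Sequence[int],
    reward_model: Callable[[Sequence[int]], float],
) -> List[float]:
    rewards = []
    prev_score = reward_model(list(prompt))   # baseline: score of the prompt alone
    prefix = list(prompt)
    for token in response:
        prefix.append(token)
        score = reward_model(prefix)          # score of prompt + current prefix
        rewards.append(score - prev_score)    # incremental credit for this token
        prev_score = score
    return rewards
```
By construction the per-token rewards telescope to the full-sequence score minus the prompt-only baseline, so the original sequence-level training signal is preserved.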
- Beyond Simple Sum of Delayed Rewards: Non-Markovian Reward Modeling for Reinforcement Learning [44.770495418026734]
Reinforcement Learning (RL) empowers agents to acquire various skills by learning from reward signals.
Traditional methods assume the existence of underlying Markovian rewards and that the observed delayed reward is simply the sum of instance-level rewards.
We propose Composite Delayed Reward Transformer (CoDeTr), which incorporates a specialized in-sequence attention mechanism.
arXiv Detail & Related papers (2024-10-26T13:12:27Z)
- Reinforcement Learning from Bagged Reward [46.16904382582698]
In Reinforcement Learning (RL), it is commonly assumed that an immediate reward signal is generated for each action taken by the agent.
In many real-world scenarios, designing immediate reward signals is difficult.
We propose a novel reward redistribution method equipped with a bidirectional attention mechanism.
arXiv Detail & Related papers (2024-02-06T07:26:44Z)
- Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important to solve sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
arXiv Detail & Related papers (2023-08-12T14:59:19Z)
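The EQR entry above says only that a value distribution function is learned. For background, the snippet below sketches the generic quantile-regression (quantile Huber) loss commonly used to learn return distributions in distributional RL; it is an illustration of that generic building block, not EQR's model-based Bayesian algorithm, and the tensor shapes are assumptions.
```python
# Generic quantile-regression sketch for learning a distribution over
# returns (as in distributional RL); illustrative only, not EQR itself.
import torch


def quantile_huber_loss(pred_quantiles, target_samples, kappa: float = 1.0):
    """pred_quantiles: [N] learned quantile values of the return.
    target_samples: [M] sampled Bellman targets. Returns a scalar loss."""
    n = pred_quantiles.shape[0]
    taus = (torch.arange(n, dtype=torch.float32) + 0.5) / n         # quantile midpoints
    td = target_samples.unsqueeze(0) - pred_quantiles.unsqueeze(1)  # [N, M] errors
    huber = torch.where(td.abs() <= kappa,
                        0.5 * td.pow(2),
                        kappa * (td.abs() - 0.5 * kappa))
    # Asymmetric weighting pushes each output toward its target quantile level.
    weight = (taus.unsqueeze(1) - (td.detach() < 0).float()).abs()
    return (weight * huber / kappa).mean()
```
Minimizing this loss over transitions moves the N predicted values toward the corresponding quantiles of the return distribution, which is one standard way to represent uncertainty about long-term performance.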
- Reward Collapse in Aligning Large Language Models [64.98482888193267]
We study the phenomenon of 'reward collapse', an empirical observation where the prevailing ranking-based approach results in an identical reward distribution.
Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.
arXiv Detail & Related papers (2023-05-28T02:12:00Z)
- Learning Long-Term Reward Redistribution via Randomized Return Decomposition [18.47810850195995]
We consider the problem formulation of episodic reinforcement learning with trajectory feedback.
It refers to an extreme delay of reward signals, in which the agent can only obtain one reward signal at the end of each trajectory.
We propose a novel reward redistribution algorithm, randomized return decomposition (RRD), to learn a proxy reward function for episodic reinforcement learning.
arXiv Detail & Related papers (2021-11-26T13:23:36Z)
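RRD is summarized above only as learning a proxy reward function. The sketch below shows a subsampled least-squares return decomposition in that spirit: proxy per-step rewards over a random subset of time steps are rescaled and regressed onto the episodic return. The rescaling and variable names are simplifying assumptions; the paper's randomized surrogate estimator differs in its exact form.
```python
# Hypothetical sketch of subsampled return-decomposition regression in the
# spirit of RRD: a proxy reward network is trained so that a rescaled sum of
# per-step predictions over a random subset of steps matches the episodic
# return. Illustration only; RRD's actual surrogate objective differs.
import torch
import torch.nn as nn
import torch.nn.functional as F


def rrd_style_loss(reward_net: nn.Module,
                   states: torch.Tensor,           # [T, S]
                   actions: torch.Tensor,          # [T, A]
                   episodic_return: torch.Tensor,  # scalar tensor
                   subset_size: int = 32) -> torch.Tensor:
    T = states.shape[0]
    k = min(subset_size, T)
    idx = torch.randperm(T)[:k]                    # random subset of time steps
    sa = torch.cat([states[idx], actions[idx]], dim=-1)
    r_hat = reward_net(sa).squeeze(-1)             # proxy per-step rewards
    # Rescale the partial sum so it estimates the full-trajectory sum.
    estimate = (T / k) * r_hat.sum()
    return F.mse_loss(estimate, episodic_return)
```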
- Distributional Reinforcement Learning for Multi-Dimensional Reward Functions [91.88969237680669]
We introduce Multi-Dimensional Distributional DQN (MD3QN) to model the joint return distribution from multiple reward sources.
As a by-product of joint distribution modeling, MD3QN can capture the randomness in returns for each source of reward.
In experiments, our method accurately models the joint return distribution in environments with richly correlated reward functions.
arXiv Detail & Related papers (2021-10-26T11:24:23Z)
- Delayed Rewards Calibration via Reward Empirical Sufficiency [11.089718301262433]
We introduce a delayed-reward calibration paradigm inspired by a classification perspective.
We define an empirical sufficient distribution, where the state vectors within the distribution will lead agents to reward signals.
A purify-trained classifier is designed to obtain the distribution and generate the calibrated rewards.
arXiv Detail & Related papers (2021-02-21T06:42:31Z)
- Maximizing Information Gain in Partially Observable Environments via Prediction Reward [64.24528565312463]
This paper tackles the challenge of using belief-based rewards for a deep RL agent.
We derive the exact error between negative entropy and the expected prediction reward.
This insight provides theoretical motivation for several fields using prediction rewards.
arXiv Detail & Related papers (2020-05-11T08:13:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.