Interpretable Reward Redistribution in Reinforcement Learning: A Causal
Approach
- URL: http://arxiv.org/abs/2305.18427v3
- Date: Fri, 10 Nov 2023 21:58:57 GMT
- Title: Interpretable Reward Redistribution in Reinforcement Learning: A Causal
Approach
- Authors: Yudi Zhang, Yali Du, Biwei Huang, Ziyan Wang, Jun Wang, Meng Fang,
Mykola Pechenizkiy
- Abstract summary: A major challenge in reinforcement learning is to determine which state-action pairs are responsible for future rewards that are delayed.
We propose to explicitly model the contributions of state and action from a causal perspective, resulting in an interpretable reward redistribution.
Experimental results show that our method outperforms state-of-the-art methods.
- Score: 45.83200636718999
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A major challenge in reinforcement learning is to determine which
state-action pairs are responsible for future rewards that are delayed. Reward
redistribution serves as a solution to re-assign credits for each time step
from observed sequences. While the majority of current approaches construct the
reward redistribution in an uninterpretable manner, we propose to explicitly
model the contributions of state and action from a causal perspective,
resulting in an interpretable reward redistribution and preserving policy
invariance. In this paper, we start by studying the role of causal generative
models in reward redistribution by characterizing the generation of Markovian
rewards and trajectory-wise long-term return and further propose a framework,
called Generative Return Decomposition (GRD), for policy optimization in
delayed reward scenarios. Specifically, GRD first identifies the unobservable
Markovian rewards and causal relations in the generative process. Then, GRD
makes use of the identified causal generative model to form a compact
representation to train policy over the most favorable subspace of the state
space of the agent. Theoretically, we show that the unobservable Markovian
reward function is identifiable, as well as the underlying causal structure and
causal models. Experimental results show that our method outperforms
state-of-the-art methods and the provided visualization further demonstrates
the interpretability of our method. The project page is located at
https://reedzyd.github.io/GenerativeReturnDecomposition/.
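The core of return decomposition can be illustrated with a deliberately simple sketch: if a hidden Markovian reward happens to be linear in (state, action) features and only the trajectory-wise return is observed, least squares recovers per-step credit. Everything below (the linear model, feature shapes, variable names) is illustrative and is not GRD's causal generative architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, N = 4, 10, 200                      # feature dim, horizon, num trajectories
w_true = np.array([1.0, -0.5, 0.3, 0.8])  # hidden linear reward parameters

Phi = rng.normal(size=(N, T, d))          # (state, action) features per step
R = (Phi @ w_true).sum(axis=1)            # only episodic returns are observed

# Fit w so that the predicted per-step rewards sum to the observed return.
X = Phi.sum(axis=1)                       # (N, d): summed features per trajectory
w_hat, *_ = np.linalg.lstsq(X, R, rcond=None)

r_hat = Phi @ w_hat                       # redistributed per-step rewards, (N, T)
```

In this noise-free linear toy the fit is exact; the point of GRD is that a causal generative model gives an interpretable, identifiable version of this redistribution for nonlinear, partially observed settings.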
Related papers
- Reinforcement Learning from Bagged Reward [46.16904382582698]
In Reinforcement Learning (RL), it is commonly assumed that an immediate reward signal is generated for each action taken by the agent.
In many real-world scenarios, immediate reward signals are not obtainable; instead, agents receive a single reward that is contingent upon a partial sequence or a complete trajectory.
We propose a Transformer-based reward model, the Reward Bag Transformer, which employs a bidirectional attention mechanism to interpret contextual nuances.
arXiv Detail & Related papers (2024-02-06T07:26:44Z)
- Value-Distributional Model-Based Reinforcement Learning [63.32053223422317]
Quantifying uncertainty about a policy's long-term performance is important to solve sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
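The value-distribution idea can be made concrete with the standard pinball (quantile-regression) loss that EQR-style methods build on. The Gaussian return samples, learning rate, and step count below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(loc=5.0, scale=2.0, size=20_000)  # sampled returns

# Gradient of the pinball loss w.r.t. a scalar estimate theta is
# F(theta) - tau in expectation, so descent drives theta to the tau-quantile.
def fit_quantile(samples, tau, lr=0.05, steps=3000):
    theta = 0.0
    for _ in range(steps):
        grad = np.mean(np.where(samples - theta > 0.0, -tau, 1.0 - tau))
        theta -= lr * grad
    return theta

q10, q50, q90 = (fit_quantile(returns, t) for t in (0.1, 0.5, 0.9))
```

Fitting several quantiles at once yields a discretized return distribution rather than a single expected value, which is what lets such methods quantify uncertainty about long-term performance.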
arXiv Detail & Related papers (2023-08-12T14:59:19Z)
- Reward Collapse in Aligning Large Language Models [64.98482888193267]
We study the phenomenon of "reward collapse", an empirical observation where the prevailing ranking-based approach results in an identical reward distribution.
Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.
arXiv Detail & Related papers (2023-05-28T02:12:00Z)
- Distributional Reward Estimation for Effective Multi-Agent Deep Reinforcement Learning [19.788336796981685]
We propose a novel Distributional Reward Estimation framework for effective Multi-Agent Reinforcement Learning (DRE-MARL).
Our main idea is to design multi-action-branch reward estimation and policy-weighted reward aggregation for stabilized training.
The superiority of DRE-MARL is demonstrated on benchmark multi-agent scenarios against SOTA baselines in terms of both effectiveness and robustness.
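Policy-weighted reward aggregation itself is simple to state: estimated rewards for every action branch are averaged under the current policy. A minimal sketch with made-up shapes, not DRE-MARL's training pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_actions = 3, 4

reward_branches = rng.normal(size=(n_agents, n_actions))  # r_hat(s, a) per branch
logits = rng.normal(size=(n_agents, n_actions))
policy = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # pi(a|s)

# Policy-weighted aggregation: E_{a ~ pi}[r_hat(s, a)] for each agent.
aggregated = (policy * reward_branches).sum(axis=1)
```

Because the aggregate is a convex combination of the branch estimates, it stays within their range, which is one source of the stabilization the paper reports.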
arXiv Detail & Related papers (2022-10-14T08:31:45Z)
- Learning Long-Term Reward Redistribution via Randomized Return Decomposition [18.47810850195995]
We consider the problem formulation of episodic reinforcement learning with trajectory feedback.
It refers to an extreme delay of reward signals, in which the agent can only obtain one reward signal at the end of each trajectory.
We propose a novel reward redistribution algorithm, randomized return decomposition (RRD), to learn a proxy reward function for episodic reinforcement learning.
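The statistical trick behind RRD can be checked directly: a scaled sum of proxy rewards over a uniformly sampled subset of timesteps is an unbiased estimator of the full-trajectory sum, which is what allows the return-decomposition regression to run on short subsequences. A toy check with arbitrary proxy rewards rather than a learned model:

```python
import numpy as np

rng = np.random.default_rng(0)
T, k = 100, 16
r_hat = rng.normal(size=T)        # proxy rewards along one trajectory

full_sum = r_hat.sum()
# Scale each random-subset sum by T/k so its expectation matches the full sum.
estimates = np.array([
    (T / k) * r_hat[rng.choice(T, size=k, replace=False)].sum()
    for _ in range(100_000)
])
```

The estimator's variance shrinks as k grows toward T, the trade-off the RRD paper analyzes when choosing the subsequence length.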
arXiv Detail & Related papers (2021-11-26T13:23:36Z)
- Distributional Reinforcement Learning for Multi-Dimensional Reward Functions [91.88969237680669]
We introduce Multi-Dimensional Distributional DQN (MD3QN) to model the joint return distribution from multiple reward sources.
As a by-product of joint distribution modeling, MD3QN can capture the randomness in returns for each source of reward.
In experiments, our method accurately models the joint return distribution in environments with richly correlated reward functions.
arXiv Detail & Related papers (2021-10-26T11:24:23Z)
- Delayed Rewards Calibration via Reward Empirical Sufficiency [11.089718301262433]
We introduce a delayed-reward calibration paradigm inspired by a classification perspective.
We define an empirical sufficient distribution, where the state vectors within the distribution lead agents to reward signals.
A purification-trained classifier is designed to obtain the distribution and generate the calibrated rewards.
arXiv Detail & Related papers (2021-02-21T06:42:31Z)
- Maximizing Information Gain in Partially Observable Environments via Prediction Reward [64.24528565312463]
This paper tackles the challenge of using belief-based rewards for a deep RL agent.
We derive the exact error between negative entropy and the expected prediction reward.
This insight provides theoretical motivation for several fields using prediction rewards.
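For a log-loss prediction reward, the relation between belief entropy and expected prediction reward follows a textbook identity: E_{s~b}[log q(s)] = -H(b) - KL(b || q), so the gap is exactly the KL term. This is a generic check of that identity, not the paper's specific derivation; the belief and prediction below are arbitrary:

```python
import numpy as np

b = np.array([0.5, 0.25, 0.125, 0.125])   # belief over 4 hidden states
q = np.array([0.25, 0.25, 0.25, 0.25])    # agent's state prediction

neg_entropy = float(np.sum(b * np.log(b)))        # -H(b)
exp_pred_reward = float(np.sum(b * np.log(q)))    # E_{s~b}[log q(s)]
kl = float(np.sum(b * np.log(b / q)))             # KL(b || q)

# exp_pred_reward equals neg_entropy - kl: the prediction reward matches
# negative entropy exactly when q == b, and falls short by KL(b || q) otherwise.
```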
arXiv Detail & Related papers (2020-05-11T08:13:49Z)
- Nested-Wasserstein Self-Imitation Learning for Sequence Generation [158.19606942252284]
We propose the concept of nested-Wasserstein distance for distributional semantic matching.
A novel nested-Wasserstein self-imitation learning framework is developed, encouraging the model to exploit historical high-rewarded sequences.
arXiv Detail & Related papers (2020-01-20T02:19:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.