Counterfactual Credit Assignment in Model-Free Reinforcement Learning
- URL: http://arxiv.org/abs/2011.09464v2
- Date: Tue, 14 Dec 2021 13:36:12 GMT
- Title: Counterfactual Credit Assignment in Model-Free Reinforcement Learning
- Authors: Thomas Mesnard, Théophane Weber, Fabio Viola, Shantanu Thakoor, Alaa Saade, Anna Harutyunyan, Will Dabney, Tom Stepleton, Nicolas Heess, Arthur Guez, Éric Moulines, Marcus Hutter, Lars Buesing, Rémi Munos
- Abstract summary: Credit assignment in reinforcement learning is the problem of measuring an action's influence on future rewards.
We adapt the notion of counterfactuals from causality theory to a model-free RL setup.
We formulate a family of policy gradient algorithms that use future-conditional value functions as baselines or critics, and show that they are provably low variance.
- Score: 47.79277857377155
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Credit assignment in reinforcement learning is the problem of measuring an
action's influence on future rewards. In particular, this requires separating
skill from luck, i.e. disentangling the effect of an action on rewards from
that of external factors and subsequent actions. To achieve this, we adapt the
notion of counterfactuals from causality theory to a model-free RL setup. The
key idea is to condition value functions on future events, by learning to
extract relevant information from a trajectory. We formulate a family of policy
gradient algorithms that use these future-conditional value functions as
baselines or critics, and show that they are provably low variance. To avoid
the potential bias from conditioning on future information, we constrain the
hindsight information to not contain information about the agent's actions. We
demonstrate the efficacy and validity of our algorithm on a number of
illustrative and challenging problems.
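To make the idea above more concrete, here is a minimal sketch, in PyTorch, of how a future-conditional baseline of this kind could be wired up in a toy discrete-action setting. It is a hypothetical rendering under simplifying assumptions, not the authors' implementation: the backward GRU stands in for the learned hindsight statistic (call it phi_t), the probe network together with a KL penalty stands in for the constraint that hindsight information contain nothing about the agent's action, and all module and argument names (hindsight_rnn, probe, future_conditional_pg_loss, lam) are illustrative.

```python
# Minimal sketch (illustrative, not the paper's code) of a REINFORCE-style
# policy gradient with a future-conditional value baseline V(x_t, phi_t),
# where phi_t is a learned summary of the trajectory from time t onward.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_obs, n_act, hid = 8, 4, 64

policy = nn.Sequential(nn.Linear(n_obs, hid), nn.Tanh(), nn.Linear(hid, n_act))
# Backward RNN: summarises observations and returns from time t onward into phi_t.
hindsight_rnn = nn.GRU(n_obs + 1, hid, batch_first=True)
# Future-conditional baseline V(x_t, phi_t).
value = nn.Sequential(nn.Linear(n_obs + hid, hid), nn.Tanh(), nn.Linear(hid, 1))
# Probe that tries to recover a_t from (x_t, phi_t); used to discourage action leakage.
probe = nn.Sequential(nn.Linear(n_obs + hid, hid), nn.Tanh(), nn.Linear(hid, n_act))


def future_conditional_pg_loss(obs, acts, returns, lam=1.0):
    """obs: [T, n_obs] float, acts: [T] long, returns: [T] discounted returns-to-go."""
    pi_logp = F.log_softmax(policy(obs), dim=-1)               # log pi(.|x_t)
    act_logp = pi_logp.gather(1, acts[:, None]).squeeze(1)     # log pi(a_t|x_t)

    # Run the RNN over the time-reversed trajectory so phi_t depends only on
    # the trajectory from step t onward (a crude "hindsight" summary).
    fut = torch.cat([obs, returns[:, None]], dim=-1).flip(0).unsqueeze(0)
    phi = hindsight_rnn(fut)[0].squeeze(0).flip(0)             # [T, hid]

    v = value(torch.cat([obs, phi], dim=-1)).squeeze(1)        # V(x_t, phi_t)
    adv = (returns - v).detach()                               # hindsight advantage

    pg_loss = -(act_logp * adv).mean()                         # policy gradient term
    v_loss = F.mse_loss(v, returns)                            # baseline regression

    # The probe learns to predict a_t from hindsight features (phi detached here)...
    probe_logp = F.log_softmax(probe(torch.cat([obs, phi.detach()], dim=-1)), dim=-1)
    probe_loss = F.nll_loss(probe_logp, acts)
    # ...while the encoder is penalised whenever the probe's prediction given phi_t
    # drifts away from pi(.|x_t), i.e. whenever phi_t leaks action information.
    # This single combined loss is a simplification of the paper's constraint.
    probe_on_phi = F.log_softmax(probe(torch.cat([obs, phi], dim=-1)), dim=-1)
    indep_penalty = F.kl_div(probe_on_phi, pi_logp.detach(),
                             log_target=True, reduction="batchmean")

    return pg_loss + 0.5 * v_loss + probe_loss + lam * indep_penalty


# Toy usage on a random rollout of length T = 10:
# loss = future_conditional_pg_loss(torch.randn(10, n_obs),
#                                   torch.randint(n_act, (10,)),
#                                   torch.randn(10))
# loss.backward()
```

The sketch only mirrors the overall structure described in the abstract (a future-conditional baseline, a variance-reduced policy gradient, and a penalty discouraging action information in the hindsight features); the paper derives the estimator and the independence constraint more carefully.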
Related papers
- What Hides behind Unfairness? Exploring Dynamics Fairness in Reinforcement Learning [52.51430732904994]
In reinforcement learning problems, agents must consider long-term fairness while maximizing returns.
Recent works have proposed many different types of fairness notions, but how unfairness arises in RL problems remains unclear.
We introduce a novel notion called dynamics fairness, which explicitly captures the inequality stemming from environmental dynamics.
arXiv Detail & Related papers (2024-04-16T22:47:59Z) - Preserving Commonsense Knowledge from Pre-trained Language Models via Causal Inference [20.5696436171006]
Most existing studies attribute the loss of pre-trained commonsense knowledge during fine-tuning to catastrophic forgetting, and they retain the pre-trained knowledge indiscriminately.
We frame fine-tuning into a causal graph and discover that the crux of catastrophic forgetting lies in the missing causal effects from the pretrained data.
In the experiments, our method outperforms state-of-the-art fine-tuning methods on all six commonsense QA datasets.
arXiv Detail & Related papers (2023-06-19T09:06:44Z) - Reinforcement Learning from Passive Data via Latent Intentions [86.4969514480008]
We show that passive data can still be used to learn features that accelerate downstream RL.
Our approach learns from passive data by modeling intentions.
Our experiments demonstrate the ability to learn from many forms of passive data, including cross-embodiment video data and YouTube videos.
arXiv Detail & Related papers (2023-04-10T17:59:05Z) - CLARE: Conservative Model-Based Reward Learning for Offline Inverse Reinforcement Learning [26.05184273238923]
This work aims to tackle a major challenge in offline Inverse Reinforcement Learning (IRL).
We devise a principled algorithm (namely CLARE) that solves offline IRL efficiently by integrating "conservatism" into a learned reward function.
Our theoretical analysis provides an upper bound on the return gap between the learned policy and the expert policy.
arXiv Detail & Related papers (2023-02-09T17:16:29Z) - Learning Models for Actionable Recourse [31.30850378503406]
We propose an algorithm that theoretically guarantees recourse to affected individuals with high probability without sacrificing accuracy.
We demonstrate the efficacy of our approach via extensive experiments on real data.
arXiv Detail & Related papers (2020-11-12T01:15:18Z) - Learning "What-if" Explanations for Sequential Decision-Making [92.8311073739295]
Building interpretable parameterizations of real-world decision-making on the basis of demonstrated behavior is essential.
We propose learning explanations of expert decisions by modeling their reward function in terms of preferences with respect to "what if" outcomes.
We highlight the effectiveness of our batch, counterfactual inverse reinforcement learning approach in recovering accurate and interpretable descriptions of behavior.
arXiv Detail & Related papers (2020-07-02T14:24:17Z) - DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction [96.90215318875859]
We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from corrective feedback.
We propose a new algorithm, DisCor, which computes an approximation to the optimal training distribution and uses it to re-weight the transitions used for training.
arXiv Detail & Related papers (2020-03-16T16:18:52Z) - Transfer Reinforcement Learning under Unobserved Contextual Information [16.895704973433382]
We study a transfer reinforcement learning problem where the state transitions and rewards are affected by the environmental context.
We develop a method to obtain causal bounds on the transition and reward functions using the demonstrator's data.
We propose new Q learning and UCB-Q learning algorithms that converge to the true value function without bias.
arXiv Detail & Related papers (2020-03-09T22:00:04Z) - Value-driven Hindsight Modelling [68.658900923595]
Value estimation is a critical component of the reinforcement learning (RL) paradigm.
Model learning can make use of the rich transition structure present in sequences of observations, but this approach is usually not sensitive to the reward function.
We develop an approach for representation learning in RL that sits in between these two extremes.
This provides tractable prediction targets that are directly relevant for a task, and can thus accelerate learning the value function.
arXiv Detail & Related papers (2020-02-19T18:10:20Z)