Breaking Habits: On the Role of the Advantage Function in Learning Causal State Representations
- URL: http://arxiv.org/abs/2506.11912v1
- Date: Fri, 13 Jun 2025 16:06:47 GMT
- Title: Breaking Habits: On the Role of the Advantage Function in Learning Causal State Representations
- Authors: Miguel Suau
- Abstract summary: We show that the advantage function, commonly used in policy gradient methods, not only reduces the variance of gradient estimates but also mitigates the effects of policy confounding. We provide both analytical and empirical evidence demonstrating that training with the advantage function leads to improved out-of-trajectory performance.
- Score: 4.514386953429771
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work has shown that reinforcement learning agents can develop policies that exploit spurious correlations between rewards and observations. This phenomenon, known as policy confounding, arises because the agent's policy influences both past and future observation variables, creating a feedback loop that can hinder the agent's ability to generalize beyond its usual trajectories. In this paper, we show that the advantage function, commonly used in policy gradient methods, not only reduces the variance of gradient estimates but also mitigates the effects of policy confounding. By adjusting action values relative to the state representation, the advantage function downweights state-action pairs that are more likely under the current policy, breaking spurious correlations and encouraging the agent to focus on causal factors. We provide both analytical and empirical evidence demonstrating that training with the advantage function leads to improved out-of-trajectory performance.
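As a minimal, hypothetical illustration of the mechanism described in the abstract (a sketch, not the paper's setup or code), the snippet below contrasts a raw Q-weighted policy-gradient term with an advantage-weighted one for a single-state softmax policy. Because the baseline V(s) is the policy-weighted average of Q(s, ·), actions the current policy already favours receive advantages near zero and therefore contribute little to the update; the values in `q_values` and the single-state setting are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3
theta = np.zeros(n_actions)          # logits of a softmax policy for one illustrative state

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Illustrative (assumed) action values Q(s, a) for this state.
q_values = np.array([1.0, 1.2, 0.8])

pi = softmax(theta)
a = rng.choice(n_actions, p=pi)

# Score function for a softmax policy: d/d(theta_i) log pi(a|s) = 1{i == a} - pi_i.
grad_log_pi = -pi.copy()
grad_log_pi[a] += 1.0

# (1) Raw action-value weighting.
g_raw = grad_log_pi * q_values[a]

# (2) Advantage weighting: subtract the policy-weighted baseline V(s) = sum_a pi(a|s) Q(s, a).
v = pi @ q_values
advantage = q_values[a] - v          # close to zero for actions the policy already prefers
g_adv = grad_log_pi * advantage

print(f"V(s) = {v:.3f}, A(s, a={a}) = {advantage:.3f}")
print("raw Q-weighted gradient:    ", g_raw)
print("advantage-weighted gradient:", g_adv)
```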
Related papers
- Learning Nonlinear Causal Reductions to Explain Reinforcement Learning Policies [50.30741668990102]
We take a causal perspective on explaining the behavior of reinforcement learning policies.
We learn a simplified high-level causal model that explains these relationships.
We prove that for a class of nonlinear causal models, there exists a unique solution.
arXiv Detail & Related papers (2025-07-20T10:25:24Z)
- Learning Causally Invariant Reward Functions from Diverse Demonstrations [6.351909403078771]
Inverse reinforcement learning methods aim to retrieve the reward function of a Markov decision process based on a dataset of expert demonstrations.
The recovered reward function often overfits to the expert dataset, so a policy trained on it degrades under distribution shift of the environment dynamics.
In this work, we explore a novel regularization approach for inverse reinforcement learning methods based on the causal invariance principle with the goal of improved reward function generalization.
arXiv Detail & Related papers (2024-09-12T12:56:24Z)
- Skill or Luck? Return Decomposition via Advantage Functions [15.967056781224102]
Learning from off-policy data is essential for sample-efficient reinforcement learning.
We show that the advantage function can be understood as the causal effect of an action on the return.
This decomposition enables us to naturally extend Direct Advantage Estimation to off-policy settings.
arXiv Detail & Related papers (2024-02-20T10:09:00Z)
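For context, the "causal effect" reading above can be made concrete with a standard telescoping identity (a sketch in common actor-critic notation, not necessarily the exact formulation used in that paper): the return splits into an advantage ("skill") part and an environment-noise ("luck") part.

```latex
G_0 \;=\; V^{\pi}(s_0)
      \;+\; \underbrace{\sum_{t \ge 0} \gamma^{t} A^{\pi}(s_t, a_t)}_{\text{action choices (``skill'')}}
      \;+\; \underbrace{\sum_{t \ge 0} \gamma^{t} \bigl( r_t + \gamma V^{\pi}(s_{t+1}) - Q^{\pi}(s_t, a_t) \bigr)}_{\text{environment noise (``luck'')}},
\qquad A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s).
```

The second sum has zero mean conditioned on each $(s_t, a_t)$, which is what permits reading $A^{\pi}$ as the effect of the action itself on the return.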
- Offline Reinforcement Learning with On-Policy Q-Function Regularization [57.09073809901382]
We deal with the (potentially catastrophic) extrapolation error induced by the distribution shift between the history dataset and the desired policy.
We propose two algorithms taking advantage of the estimated Q-function through regularizations, and demonstrate they exhibit strong performance on the D4RL benchmarks.
arXiv Detail & Related papers (2023-07-25T21:38:08Z)
- Bad Habits: Policy Confounding and Out-of-Trajectory Generalization in RL [20.43882227518439]
Reinforcement learning agents tend to develop habits that are effective only under specific policies.
This paper presents a mathematical characterization of this phenomenon, termed policy confounding.
arXiv Detail & Related papers (2023-06-04T17:51:37Z)
- Reinforcement Learning Your Way: Agent Characterization through Policy Regularization [0.0]
We develop a method to imbue a characteristic behaviour into agents' policies through regularization of their objective functions.
Our method guides the agents' behaviour during learning, which results in an intrinsic characterization.
In future work, we intend to employ it to develop agents that optimize individual financial customers' investment portfolios based on their spending personalities.
arXiv Detail & Related papers (2022-01-21T08:18:38Z)
- Taylor Expansion of Discount Factors [56.46324239692532]
In practical reinforcement learning (RL), the discount factor used for estimating value functions often differs from that used for defining the evaluation objective.
In this work, we study the effect that this discrepancy of discount factors has during learning, and discover a family of objectives that interpolate value functions of two distinct discount factors.
arXiv Detail & Related papers (2021-06-11T05:02:17Z)
- On Proximal Policy Optimization's Heavy-tailed Gradients [150.08522793940708]
We study the heavy-tailed nature of the gradients of the Proximal Policy Optimization surrogate reward function.
In this paper, we study the effects of the standard PPO clipping techniques, demonstrating that they primarily serve to offset the heavy-tailedness of the gradients.
We propose incorporating GMOM, a high-dimensional robust estimator, into PPO as a substitute for the three clipping tricks.
arXiv Detail & Related papers (2021-02-20T05:51:28Z)
- Fighting Copycat Agents in Behavioral Cloning from Observation Histories [85.404120663644]
Imitation learning trains policies to map from input observations to the actions that an expert would choose.
We propose an adversarial approach to learn a feature representation that removes excess information about the previous expert action, a nuisance correlate.
arXiv Detail & Related papers (2020-10-28T10:52:10Z)
- Learning "What-if" Explanations for Sequential Decision-Making [92.8311073739295]
Building interpretable parameterizations of real-world decision-making on the basis of demonstrated behavior is essential.
We propose learning explanations of expert decisions by modeling their reward function in terms of preferences with respect to "what if" outcomes.
We highlight the effectiveness of our batch, counterfactual inverse reinforcement learning approach in recovering accurate and interpretable descriptions of behavior.
arXiv Detail & Related papers (2020-07-02T14:24:17Z)