Bad Habits: Policy Confounding and Out-of-Trajectory Generalization in RL
- URL: http://arxiv.org/abs/2306.02419v2
- Date: Mon, 24 Jun 2024 07:06:44 GMT
- Title: Bad Habits: Policy Confounding and Out-of-Trajectory Generalization in RL
- Authors: Miguel Suau, Matthijs T. J. Spaan, Frans A. Oliehoek
- Abstract summary: Reinforcement learning agents tend to develop habits that are effective only under specific policies.
This paper presents a mathematical characterization of this phenomenon, termed policy confounding.
- Score: 20.43882227518439
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning agents tend to develop habits that are effective only under specific policies. Following an initial exploration phase where agents try out different actions, they eventually converge onto a particular policy. As this occurs, the distribution over state-action trajectories becomes narrower, leading agents to repeatedly experience the same transitions. This repetitive exposure fosters spurious correlations between certain observations and rewards. Agents may then pick up on these correlations and develop simplistic habits tailored to the specific set of trajectories dictated by their policy. The problem is that these habits may yield incorrect outcomes when agents are forced to deviate from their typical trajectories, prompted by changes in the environment. This paper presents a mathematical characterization of this phenomenon, termed policy confounding, and illustrates, through a series of examples, the circumstances under which it occurs.
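To make the mechanism concrete, here is a minimal sketch (my own construction, not an experiment from the paper): on the narrow set of trajectories visited by a converged policy, a spurious feature perfectly mirrors the causal one, so a learner trained only on that data splits its weight between them and degrades the moment the correlation breaks.

```python
# Toy sketch of policy confounding: under the converged policy a
# spurious feature always agrees with the causal one, so a learner
# trained on-policy picks up a "habit" that fails off-trajectory.
import numpy as np

rng = np.random.default_rng(0)

def make_data(on_policy, n=2000):
    x_causal = rng.integers(0, 2, n)          # truly decides the action
    if on_policy:
        x_spurious = x_causal.copy()          # visited trajectories: perfect correlation
    else:
        x_spurious = rng.integers(0, 2, n)    # off-trajectory: correlation breaks
    X = np.stack([x_causal, x_spurious], axis=1).astype(float)
    return X, x_causal                        # label = optimal action

# Fit a logistic regression on on-policy data only.
X, y = make_data(on_policy=True)
w = np.zeros(2)
for _ in range(500):
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - y) / len(y)

for on_policy in (True, False):
    Xt, yt = make_data(on_policy)
    acc = (((Xt @ w) > 0) == yt).mean()
    print(f"on_policy={on_policy}: accuracy={acc:.2f}")
# Prints ~1.00 on-policy but ~0.75 off-trajectory: half the learned
# weight sits on the spurious feature, a "bad habit" in the paper's sense.
```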
Related papers
- Breaking Habits: On the Role of the Advantage Function in Learning Causal State Representations [4.514386953429771]
We show that the advantage function, commonly used in policy gradient methods, reduces the variance of gradient estimates.
We provide both analytical and empirical evidence demonstrating that training with the advantage function leads to improved out-of-trajectory performance.
arXiv Detail & Related papers (2025-06-13T16:06:47Z)
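The variance-reduction claim above has a textbook analogue that a few lines can demonstrate. The sketch below (my own toy setup, not the paper's experiment) compares REINFORCE-style gradient estimates with and without a state-value baseline, i.e. with the advantage $A(s,a) = Q(s,a) - V(s)$.

```python
# Toy check: subtracting a state-value baseline leaves the policy
# gradient unbiased while reducing its variance.
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3                       # logit of a two-action softmax policy
p = 1 / (1 + np.exp(-theta))      # probability of action 1
q = np.array([1.0, 3.0])          # true Q-values of actions 0 and 1
v = (1 - p) * q[0] + p * q[1]     # V(s) = E_pi[Q(s, a)]

def grad_samples(baseline, n=100_000):
    a = rng.random(n) < p                         # actions sampled from pi
    ret = q[a.astype(int)] + rng.normal(0, 1, n)  # noisy returns
    dlogp = np.where(a, 1 - p, -p)                # d log pi(a) / d theta
    return dlogp * (ret - baseline)

for name, b in [("raw return", 0.0), ("advantage (b=V)", v)]:
    g = grad_samples(b)
    print(f"{name:>16}: mean={g.mean():+.3f}  var={g.var():.3f}")
# Both estimators agree in the mean (unbiased); subtracting V(s) cuts
# the variance, the effect the paper links to better causal state
# representations.
```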
- Partial Identifiability in Inverse Reinforcement Learning For Agents With Non-Exponential Discounting [64.13583792391783]
Inverse reinforcement learning (IRL) aims to infer an agent's preferences from observations of its behaviour.
One of the central difficulties in IRL is that multiple preferences may lead to the same observed behaviour.
We show that, in general, IRL is unable to infer enough information about the reward function $R$ to identify the correct optimal policy.
arXiv Detail & Related papers (2024-12-15T11:08:58Z)
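The ambiguity behind that claim is easy to exhibit concretely. The toy below (hypothetical numbers, my own) shows two distinct reward functions that induce identical optimal behaviour, so no amount of demonstration data can separate them.

```python
# Toy example of partial identifiability: two different reward
# functions rank the actions identically, so optimal behaviour
# observed by IRL cannot distinguish them.
import numpy as np

actions = ["left", "right"]
R1 = np.array([1.0, 2.0])       # one candidate reward per action
R2 = 5.0 * R1 - 3.0             # a positive affine transform of R1

for name, r in [("R1", R1), ("R2", R2)]:
    print(f"optimal action under {name}: {actions[int(np.argmax(r))]}")
# Both print "right": demonstrations generated under either reward are
# identical, so IRL recovers at best an equivalence class containing
# both R1 and R2, not $R$ itself.
```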
- Towards Generalizable Reinforcement Learning via Causality-Guided Self-Adaptive Representations [22.6449779859417]
General intelligence requires quick adaptation across tasks.
In this paper, we explore a wider range of scenarios where not only the distribution but also the environment spaces may change.
We introduce a causality-guided self-adaptive representation-based approach, called CSR, that equips the agent to generalize effectively.
arXiv Detail & Related papers (2024-07-30T08:48:49Z)
- Invariant Causal Imitation Learning for Generalizable Policies [87.51882102248395]
We propose Invariant Causal Imitation Learning (ICIL) to learn an imitation policy.
ICIL learns a representation of causal features that is disentangled from the specific representations of noise variables.
We show that ICIL is effective in learning imitation policies capable of generalizing to unseen environments.
arXiv Detail & Related papers (2023-11-02T16:52:36Z)
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution (PC) family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
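PC's estimator itself is not spelled out in this summary. As a point of reference, the sketch below (my own, not PC) implements the vanilla inverse propensity scoring (IPS) estimator, whose importance weights become extremely noisy in large action spaces; that is the failure mode PC addresses by convolving the policies over latent action structure.

```python
# Vanilla IPS off-policy evaluation: reweight logged rewards by the
# target-to-logging propensity ratio. With 1000 actions the weights
# pi/mu are high-variance, which motivates PC's smoothing.
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_logs = 1000, 5000
mu = rng.dirichlet(np.ones(n_actions))     # logging policy
pi = rng.dirichlet(np.ones(n_actions))     # target policy to evaluate
true_r = rng.random(n_actions)             # expected reward per action

a = rng.choice(n_actions, size=n_logs, p=mu)   # logged actions
r = true_r[a] + rng.normal(0, 0.1, n_logs)     # logged rewards

ips = np.mean(pi[a] / mu[a] * r)           # importance-weighted estimate
print(f"IPS estimate: {ips:.3f}   true value: {pi @ true_r:.3f}")
```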
- Hierarchical Imitation Learning for Stochastic Environments [31.64016324441371]
Existing methods that improve distributional realism typically rely on hierarchical policies.
We propose Robust Type Conditioning (RTC), which eliminates the resulting distribution shift with adversarial training while accounting for environment stochasticity.
Experiments on two domains, including the large-scale Open Motion dataset, show improved distributional realism while maintaining or improving task performance compared to state-of-the-art baselines.
arXiv Detail & Related papers (2023-09-25T10:10:34Z)
- Policy Dispersion in Non-Markovian Environment [53.05904889617441]
This paper aims to learn diverse policies from histories of state-action pairs in a non-Markovian environment.
We first adopt a transformer-based method to learn policy embeddings.
Then, we stack the policy embeddings to construct a dispersion matrix to induce a set of diverse policies.
arXiv Detail & Related papers (2023-02-28T11:58:39Z)
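The summary does not pin down how the dispersion matrix is built, so the sketch below is one natural reading (my assumption, not the paper's construction): stack the learned policy embeddings and take pairwise dissimilarities, giving a matrix whose near-zero off-diagonal entries expose near-duplicate policies.

```python
# One possible dispersion matrix: pairwise cosine dissimilarity over
# stacked policy embeddings. This is an illustrative guess, not the
# paper's exact definition.
import numpy as np

rng = np.random.default_rng(0)
n_policies, dim = 4, 8
E = rng.normal(size=(n_policies, dim))        # stacked policy embeddings
E /= np.linalg.norm(E, axis=1, keepdims=True)

dispersion = 1.0 - E @ E.T                    # pairwise dissimilarity
print(np.round(dispersion, 2))
# A diversity objective can then push these entries up, e.g. by
# maximizing the smallest off-diagonal value across the policy set.
```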
- Let Offline RL Flow: Training Conservative Agents in the Latent Space of Normalizing Flows [58.762959061522736]
Offline reinforcement learning aims to train a policy on a pre-recorded and fixed dataset without any additional environment interactions.
We build upon recent works on learning policies in latent action spaces and use a special form of Normalizing Flows for constructing a generative model.
We evaluate our method on various locomotion and navigation tasks, demonstrating that our approach outperforms recently proposed algorithms.
arXiv Detail & Related papers (2022-11-20T21:57:10Z)
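A minimal sketch of the latent-action idea follows; the architecture details here are my assumptions, not the paper's. A normalizing flow gives an invertible map from a simple latent space to the dataset's action distribution, so a policy acting in the latent space stays close to actions the behaviour policy actually took.

```python
# One affine coupling layer of a normalizing flow, mapping a latent z
# invertibly to an action. In practice the weights are trained by
# maximum likelihood on the dataset's actions; here they are random.
import numpy as np

rng = np.random.default_rng(0)
W_s = rng.normal(size=(2, 2)) * 0.1
W_t = rng.normal(size=(2, 2)) * 0.1

def coupling_forward(z):
    z1, z2 = z[:2], z[2:]
    s, t = W_s @ z1, W_t @ z1           # scale/shift conditioned on z1
    return np.concatenate([z1, z2 * np.exp(s) + t])  # invertible in z2

z = rng.normal(size=4)                  # latent action from the policy
a = coupling_forward(z)                 # decoded environment action
print("latent:", np.round(z, 2), "-> action:", np.round(a, 2))
# Invertibility gives tractable log-densities, letting a conservative
# agent reason about how in-distribution its decoded action is.
```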
- On Proximal Policy Optimization's Heavy-tailed Gradients [150.08522793940708]
We study the heavy-tailed nature of the gradients of the Proximal Policy Optimization surrogate reward function.
In this paper, we study the effects of the standard PPO clipping heuristics, demonstrating that these tricks primarily serve to offset heavy-tailedness in gradients.
We propose incorporating GMOM, a high-dimensional robust estimator, into PPO as a substitute for three clipping tricks.
arXiv Detail & Related papers (2021-02-20T05:51:28Z)
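The sketch below implements a simple geometric median-of-means estimator in the spirit of GMOM (a simplified stand-in, not the paper's exact routine). It shows why such an estimator helps: under heavy-tailed gradient noise the raw mean never concentrates, while the geometric median of block means stays near the true gradient.

```python
# Geometric median-of-means vs raw mean under heavy-tailed (Cauchy)
# gradient noise, where the sample mean does not converge at all.
import numpy as np

rng = np.random.default_rng(0)
true_grad = np.array([1.0, -2.0])
grads = true_grad + rng.standard_cauchy(size=(4096, 2))  # heavy tails

def gmom(x, n_blocks=32, iters=50):
    means = x.reshape(n_blocks, -1, x.shape[1]).mean(axis=1)
    m = np.median(means, axis=0)
    for _ in range(iters):          # Weiszfeld iterations for the
        w = 1.0 / (np.linalg.norm(means - m, axis=1) + 1e-8)  # geometric median
        m = (w[:, None] * means).sum(axis=0) / w.sum()
    return m

print("raw mean       :", np.round(grads.mean(axis=0), 2))
print("median-of-means:", np.round(gmom(grads), 2))
# The second line lands near (1, -2); the first can be arbitrarily far
# off, which is the heavy-tailedness the PPO clipping tricks paper over.
```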
- Fighting Copycat Agents in Behavioral Cloning from Observation Histories [85.404120663644]
Imitation learning trains policies to map from input observations to the actions that an expert would choose.
We propose an adversarial approach to learn a feature representation that removes excess information about the nuisance correlate, namely the previous expert action.
arXiv Detail & Related papers (2020-10-28T10:52:10Z)
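As a closing illustration, here is a minimal linear version of that adversarial recipe (my own construction; the paper's architecture differs): an encoder is trained so that a task head can still predict the expert action while a best-response adversary can no longer read the previous action out of the features.

```python
# Minimal linear sketch of adversarial nuisance removal. The encoder
# descends the task loss and ascends the adversary's loss (gradient
# reversal), decorrelating its features from the previous action.
import numpy as np

rng = np.random.default_rng(0)
n = 4000
cause = rng.normal(size=n)                        # true cause of the action
prev_a = np.sign(cause + rng.normal(0, 0.5, n))   # correlated nuisance
X = np.stack([cause, prev_a], axis=1)             # observation history
y = np.sign(cause)                                # expert action to imitate

def corr(u, v):
    return np.corrcoef(u, v)[0, 1]

bc = np.linalg.lstsq(X, y, rcond=None)[0]         # plain behavioural cloning

enc = np.array([1.0, 1.0]) / np.sqrt(2)           # linear encoder direction
lr, lam = 0.05, 2.0
for _ in range(500):
    z = X @ enc
    head = (z @ y) / (z @ z)                      # best task head
    adv = (z @ prev_a) / (z @ z)                  # best-response adversary
    g_task = X.T @ ((head * z - y) * head) * 2 / n
    g_adv = X.T @ ((adv * z - prev_a) * adv) * 2 / n
    enc -= lr * (g_task - lam * g_adv)            # reversed adversary grad
    enc /= np.linalg.norm(enc)

print(f"corr(features, prev action):   BC {corr(X @ bc, prev_a):+.2f}, "
      f"adversarial {corr(X @ enc, prev_a):+.2f}")
print(f"corr(features, expert action): BC {corr(X @ bc, y):+.2f}, "
      f"adversarial {corr(X @ enc, y):+.2f}")
# The adversarial features keep predictive signal for the expert action
# while their correlation with the previous action is driven toward zero.
```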