Policy Gradients Incorporating the Future
- URL: http://arxiv.org/abs/2108.02096v1
- Date: Wed, 4 Aug 2021 14:57:11 GMT
- Title: Policy Gradients Incorporating the Future
- Authors: David Venuto, Elaine Lau, Doina Precup, Ofir Nachum
- Abstract summary: We introduce a method that allows an agent to "look into the future" without explicitly predicting it.
We propose to allow an agent, during its training on past experience, to observe what actually happened in the future at that time.
This gives our agent the opportunity to utilize rich and useful information about the future trajectory dynamics in addition to the present.
- Score: 66.20567145291342
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reasoning about the future -- understanding how decisions in the present time
affect outcomes in the future -- is one of the central challenges for
reinforcement learning (RL), especially in highly-stochastic or partially
observable environments. While predicting the future directly is hard, in this
work we introduce a method that allows an agent to "look into the future"
without explicitly predicting it. Namely, we propose to allow an agent, during
its training on past experience, to observe what \emph{actually} happened in
the future at that time, while enforcing an information bottleneck to avoid the
agent overly relying on this privileged information. This gives our agent the
opportunity to utilize rich and useful information about the future trajectory
dynamics in addition to the present. Our method, Policy Gradients Incorporating
the Future (PGIF), is easy to implement and versatile, being applicable to
virtually any policy gradient algorithm. We apply our proposed method to a
number of off-the-shelf RL algorithms and show that PGIF is able to achieve
higher reward faster in a variety of online and offline RL domains, as well as
sparse-reward and partially observable environments.
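To make the description above concrete, the following is a minimal, hypothetical sketch of how a future-conditioned latent with a KL information bottleneck could be attached to a REINFORCE-style policy gradient update. The module names, network sizes, the "future summary" input, and the bottleneck weight `beta` are illustrative assumptions, not the authors' implementation.
```python
# Sketch only: a hindsight (future-conditioned) latent with a KL bottleneck,
# added to a simple REINFORCE loss. Names and hyperparameters are assumptions.
import torch
import torch.nn as nn

class FutureEncoder(nn.Module):
    """q(z | s_t, future summary) -- usable only during training (hindsight)."""
    def __init__(self, obs_dim, future_dim, z_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + future_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 2 * z_dim))
    def forward(self, obs, future_summary):
        mu, log_std = self.net(torch.cat([obs, future_summary], -1)).chunk(2, -1)
        return torch.distributions.Normal(mu, log_std.clamp(-5, 2).exp())

class PriorEncoder(nn.Module):
    """p(z | s_t) -- present-only prior, used when the future is unavailable."""
    def __init__(self, obs_dim, z_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 2 * z_dim))
    def forward(self, obs):
        mu, log_std = self.net(obs).chunk(2, -1)
        return torch.distributions.Normal(mu, log_std.clamp(-5, 2).exp())

class Policy(nn.Module):
    """pi(a | s_t, z): an ordinary policy network that also conditions on z."""
    def __init__(self, obs_dim, z_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + z_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))
    def forward(self, obs, z):
        return torch.distributions.Categorical(logits=self.net(torch.cat([obs, z], -1)))

def future_conditioned_pg_loss(policy, q_future, p_prior,
                               obs, future_summary, actions, returns, beta=0.1):
    """REINFORCE loss plus a KL information bottleneck on the hindsight latent.
    `beta` limits how much privileged future information the policy can rely on."""
    q = q_future(obs, future_summary)   # posterior with access to the actual future
    p = p_prior(obs)                    # prior computed from the present only
    z = q.rsample()                     # reparameterized sample of the latent
    logp = policy(obs, z).log_prob(actions)
    kl = torch.distributions.kl_divergence(q, p).sum(-1)
    # Policy gradient term uses the hindsight latent; the KL term is the bottleneck.
    return -(logp * returns).mean() + beta * kl.mean()
```
In this sketch, an evaluation-time rollout would draw z from the present-only prior instead of the future encoder, so the deployed policy never needs access to the future.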
Related papers
- FIMP: Future Interaction Modeling for Multi-Agent Motion Prediction [18.10147252674138]
We propose Future Interaction modeling for Motion Prediction (FIMP), which captures potential future interactions in an end-to-end manner.
Experiments show that our future interaction modeling improves the performance remarkably, leading to superior performance on the Argoverse motion forecasting benchmark.
arXiv Detail & Related papers (2024-01-29T14:41:55Z)
- Reinforcement Learning from Passive Data via Latent Intentions [86.4969514480008]
We show that passive data can still be used to learn features that accelerate downstream RL.
Our approach learns from passive data by modeling intentions.
Our experiments demonstrate the ability to learn from many forms of passive data, including cross-embodiment video data and YouTube videos.
arXiv Detail & Related papers (2023-04-10T17:59:05Z)
- Dichotomy of Control: Separating What You Can Control from What You Cannot [129.62135987416164]
We propose Dichotomy of Control (DoC), a future-conditioned supervised learning framework that separates mechanisms within a policy's control (actions) from those beyond a policy's control (environment stochasticity).
We show that DoC yields policies that are consistent with their conditioning inputs, ensuring that conditioning a learned policy on a desired high-return future outcome will correctly induce high-return behavior.
arXiv Detail & Related papers (2022-10-24T17:49:56Z)
- Lifelong Hyper-Policy Optimization with Multiple Importance Sampling Regularization [40.17392342387002]
We propose an approach which learns a hyper-policy, whose input is time, that outputs the parameters of the policy to be queried at that time.
This hyper-policy is trained to maximize the estimated future performance, efficiently reusing past data by means of importance sampling.
We empirically validate our approach, in comparison with state-of-the-art algorithms, on realistic environments.
arXiv Detail & Related papers (2021-12-13T13:09:49Z)
- Useful Policy Invariant Shaping from Arbitrary Advice [24.59807772487328]
A major challenge of RL research is to discover how to learn with less data.
Potential-based reward shaping (PBRS) holds promise, but it is limited by the need for a well-defined potential function.
The recently introduced dynamic potential based advice (DPBA) method tackles this challenge by admitting arbitrary advice from a human or other agent.
arXiv Detail & Related papers (2020-11-02T20:29:09Z)
- Deep Reinforcement Learning amidst Lifelong Non-Stationarity [67.24635298387624]
We show that an off-policy RL algorithm can reason about and tackle lifelong non-stationarity.
Our method leverages latent variable models to learn a representation of the environment from current and past experiences.
We also introduce several simulation environments that exhibit lifelong non-stationarity, and empirically find that our approach substantially outperforms approaches that do not reason about environment shift.
arXiv Detail & Related papers (2020-06-18T17:34:50Z)
- Deep Reinforcement and InfoMax Learning [32.426674181365456]
We introduce an objective based on Deep InfoMax which trains the agent to predict the future by maximizing the mutual information between its internal representation of successive timesteps.
We test our approach in several synthetic settings, where it successfully learns representations that are predictive of the future.
arXiv Detail & Related papers (2020-06-12T14:19:46Z)
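As a rough, hypothetical illustration of the objective described in the Deep Reinforcement and InfoMax Learning entry above (not that paper's code), an InfoNCE-style lower bound on the mutual information between representations of successive timesteps could look like the following; the function name and temperature are assumptions.
```python
# Toy sketch: contrastive (InfoNCE-style) objective between encodings of
# observations at timestep t and t+1; matching rows are positives, the rest
# of the batch serves as negatives.
import torch
import torch.nn.functional as F

def successive_timestep_infonce(z_t, z_tp1, temperature=0.1):
    """z_t, z_tp1: (batch, dim) representations at t and t+1."""
    z_t = F.normalize(z_t, dim=-1)
    z_tp1 = F.normalize(z_tp1, dim=-1)
    logits = z_t @ z_tp1.t() / temperature            # (batch, batch) similarities
    labels = torch.arange(z_t.size(0), device=z_t.device)
    return F.cross_entropy(logits, labels)            # lower loss = tighter MI bound
```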
- Optimizing for the Future in Non-Stationary MDPs [52.373873622008944]
We present a policy gradient algorithm that maximizes a forecast of future performance.
We show that our algorithm, called Prognosticator, is more robust to non-stationarity than two online adaptation techniques.
arXiv Detail & Related papers (2020-05-17T03:41:19Z)
- Value-driven Hindsight Modelling [68.658900923595]
Value estimation is a critical component of the reinforcement learning (RL) paradigm.
Model learning can make use of the rich transition structure present in sequences of observations, but this approach is usually not sensitive to the reward function.
We develop an approach for representation learning in RL that sits in between these two extremes.
This provides tractable prediction targets that are directly relevant for a task, and can thus accelerate learning the value function.
arXiv Detail & Related papers (2020-02-19T18:10:20Z)