Decoupling Value and Policy for Generalization in Reinforcement Learning
- URL: http://arxiv.org/abs/2102.10330v1
- Date: Sat, 20 Feb 2021 12:40:11 GMT
- Title: Decoupling Value and Policy for Generalization in Reinforcement Learning
- Authors: Roberta Raileanu, Rob Fergus
- Abstract summary: We argue that more information is needed to accurately estimate the value function than to learn the optimal policy.
We propose two approaches which are combined to create IDAAC: Invariant Decoupled Advantage Actor-Critic.
IDAAC shows good generalization to unseen environments, achieving a new state-of-the-art on the Procgen benchmark and outperforming popular methods on DeepMind Control tasks with distractors.
- Score: 20.08992844616678
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Standard deep reinforcement learning algorithms use a shared representation
for the policy and value function. However, we argue that more information is
needed to accurately estimate the value function than to learn the optimal
policy. Consequently, the use of a shared representation for the policy and
value function can lead to overfitting. To alleviate this problem, we propose
two approaches which are combined to create IDAAC: Invariant Decoupled
Advantage Actor-Critic. First, IDAAC decouples the optimization of the policy
and value function, using separate networks to model them. Second, it
introduces an auxiliary loss which encourages the representation to be
invariant to task-irrelevant properties of the environment. IDAAC shows good
generalization to unseen environments, achieving a new state-of-the-art on the
Procgen benchmark and outperforming popular methods on DeepMind Control tasks
with distractors. Moreover, IDAAC learns representations, value predictions,
and policies that are more robust to aesthetic changes in the observations that
do not change the underlying state of the environment.
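The two components described in the abstract lend themselves to a compact illustration: a policy network and a value network with no shared weights, plus an auxiliary loss that discourages the policy representation from encoding task-irrelevant factors. The PyTorch sketch below is a minimal reading of that setup, not the paper's implementation; the layer sizes, the probe-based invariance penalty, and names such as PolicyNet, ValueNet, and invariance_penalty are assumptions introduced for illustration.
```python
# Minimal sketch (not the authors' code): separate policy and value networks,
# plus an illustrative auxiliary invariance penalty on the policy representation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyNet(nn.Module):
    """Policy with its own encoder; no weights are shared with the critic."""
    def __init__(self, obs_dim, n_actions, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.pi_head = nn.Linear(hidden, n_actions)

    def forward(self, obs):
        z = self.encoder(obs)                                # policy representation
        return torch.distributions.Categorical(logits=self.pi_head(z)), z


class ValueNet(nn.Module):
    """Separate critic; its gradients never reach the policy encoder."""
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, obs):
        return self.net(obs).squeeze(-1)


def invariance_penalty(z, nuisance_labels, probe):
    """Illustrative auxiliary loss (an assumption, not the paper's exact term):
    the encoder is penalized when a small probe can recover a task-irrelevant
    label (e.g. background theme) from the representation z."""
    return -F.cross_entropy(probe(z), nuisance_labels)       # encoder wants the probe to fail


# Dummy rollout batch: obs_dim=64, 15 discrete actions, 2 nuisance classes.
obs = torch.randn(32, 64)
returns = torch.randn(32)
nuisance = torch.randint(0, 2, (32,))
policy, critic, probe = PolicyNet(64, 15), ValueNet(64), nn.Linear(256, 2)

dist, z = policy(obs)
actions = dist.sample()
advantages = returns - critic(obs).detach()                  # advantage uses the *separate* critic
policy_loss = -(dist.log_prob(actions) * advantages).mean() \
              + 0.1 * invariance_penalty(z, nuisance, probe)
value_loss = F.mse_loss(critic(obs), returns)                # optimized on its own network
```
In a full adversarial setup the probe would be updated in a separate step to succeed at predicting the nuisance label, so that encoder and probe compete; the single term above only shows where an auxiliary invariance loss enters the policy objective, as the abstract describes.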
Related papers
- Clipped-Objective Policy Gradients for Pessimistic Policy Optimization [3.2996723916635275]
Policy gradient methods seek to produce monotonic improvement through bounded changes in policy outputs.
In this work, we find that the performance of PPO, when applied to continuous action spaces, may be consistently improved through a simple change in objective.
We show that (1) the clipped-objective policy gradient (COPG) objective is on average "pessimistic" compared to the PPO objective, and (2) this pessimism promotes enhanced exploration (a sketch of the PPO clipped surrogate referenced here appears after this list).
arXiv Detail & Related papers (2023-11-10T03:02:49Z)
- Invariant Causal Imitation Learning for Generalizable Policies [87.51882102248395]
We propose Invariant Causal Imitation Learning (ICIL) to learn an imitation policy.
ICIL learns a representation of causal features that is disentangled from the specific representations of noise variables.
We show that ICIL is effective in learning imitation policies capable of generalizing to unseen environments.
arXiv Detail & Related papers (2023-11-02T16:52:36Z)
- Adversarial Policy Optimization in Deep Reinforcement Learning [16.999444076456268]
The policy represented by the deep neural network can overfit, which hampers a reinforcement learning agent from learning an effective policy.
Data augmentation can provide a performance boost to RL agents by mitigating the effect of overfitting.
We propose a novel RL algorithm to mitigate the above issue and improve the efficiency of the learned policy.
arXiv Detail & Related papers (2023-04-27T21:01:08Z)
- Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation [69.0695698566235]
We study reinforcement learning with linear function approximation and adversarially changing cost functions.
We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback.
arXiv Detail & Related papers (2023-01-30T17:26:39Z)
- Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior-constrained policy optimization has been demonstrated to be a successful paradigm for tackling offline reinforcement learning.
In this paper, we propose our closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
- A Regularized Implicit Policy for Offline Reinforcement Learning [54.7427227775581]
Offline reinforcement learning enables learning from a fixed dataset, without further interactions with the environment.
We propose a framework that supports learning a flexible yet well-regularized fully-implicit policy.
Experiments and ablation studies on the D4RL dataset validate our framework and the effectiveness of our algorithmic designs.
arXiv Detail & Related papers (2022-02-19T20:22:04Z)
- Neural Network Compatible Off-Policy Natural Actor-Critic Algorithm [16.115903198836694]
Learning optimal behavior from existing data is one of the most important problems in Reinforcement Learning (RL).
This is known as "off-policy control" in RL, where an agent's objective is to compute an optimal policy based on the data obtained from a given policy (known as the behavior policy).
This work proposes an off-policy natural actor-critic algorithm that utilizes state-action distribution correction for handling the off-policy behavior and the natural policy gradient for sample efficiency.
arXiv Detail & Related papers (2021-10-19T14:36:45Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients [54.98496284653234]
We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions.
We solve this problem by introducing a regularizer based on the mutual information between the sensitive state and the actions.
We develop a model-based estimator for optimization of privacy-constrained policies.
arXiv Detail & Related papers (2020-12-30T03:22:35Z)
- What About Inputting Policy in Value Function: Policy Representation and Policy-extended Value Function Approximator [39.287998861631]
We study the Policy-extended Value Function Approximator (PeVFA) in Reinforcement Learning (RL).
We show that generalized value estimates offered by PeVFA may have lower initial approximation error to true values of successive policies.
We propose a representation learning framework for RL policy, providing several approaches to learn effective policy embeddings from policy network parameters or state-action pairs.
arXiv Detail & Related papers (2020-10-19T14:09:18Z)
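The COPG entry above contrasts a clipped-objective policy gradient with PPO's surrogate. As a point of reference, here is a minimal sketch of the standard PPO clipped surrogate that the comparison is made against; COPG's own modification of this objective is not reproduced here and should be taken from the cited paper.
```python
# Reference sketch of the standard PPO clipped surrogate (not COPG itself).
import torch

def ppo_clipped_surrogate(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """L^CLIP = E[min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)], maximized in PPO."""
    ratio = torch.exp(log_prob_new - log_prob_old)       # r_t = pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return torch.min(unclipped, clipped).mean()          # keep the more pessimistic branch

# Dummy usage: 8 transitions with random advantages.
lp_new, lp_old, adv = torch.randn(8), torch.randn(8), torch.randn(8)
loss = -ppo_clipped_surrogate(lp_new, lp_old, adv)       # minimize the negative surrogate
```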