What About Inputing Policy in Value Function: Policy Representation and
Policy-extended Value Function Approximator
- URL: http://arxiv.org/abs/2010.09536v4
- Date: Wed, 15 Dec 2021 17:14:53 GMT
- Title: What About Inputing Policy in Value Function: Policy Representation and
Policy-extended Value Function Approximator
- Authors: Hongyao Tang, Zhaopeng Meng, Jianye Hao, Chen Chen, Daniel Graves,
Dong Li, Changmin Yu, Hangyu Mao, Wulong Liu, Yaodong Yang, Wenyuan Tao, Li
Wang
- Abstract summary: We study Policy-extended Value Function Approximator (PeVFA) in Reinforcement Learning (RL)
We show that generalized value estimates offered by PeVFA may have lower initial approximation error to true values of successive policies.
We propose a representation learning framework for RL policies, providing several approaches to learn effective policy embeddings from policy network parameters or state-action pairs.
- Score: 39.287998861631
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study Policy-extended Value Function Approximator (PeVFA) in Reinforcement
Learning (RL), which extends conventional value function approximator (VFA) to
take as input not only the state (and action) but also an explicit policy
representation. Such an extension enables PeVFA to preserve values of multiple
policies at the same time and brings an appealing characteristic, i.e.,
value generalization among policies. We formally analyze this value
generalization under Generalized Policy Iteration (GPI). Through both
theoretical and empirical lenses, we show that the generalized value estimates
offered by PeVFA may have lower initial approximation error to the true values
of successive policies, which is expected to improve consecutive value
approximation during GPI. Based on these insights, we introduce a new form of
GPI with PeVFA that leverages value generalization along the policy
improvement path. Moreover, we propose a representation learning framework
for RL policies, providing several approaches
to learn effective policy embeddings from policy network parameters or
state-action pairs. In our experiments, we evaluate the efficacy of value
generalization offered by PeVFA and policy representation learning in several
OpenAI Gym continuous control tasks. As a representative algorithmic instance,
Proximal Policy Optimization (PPO) re-implemented under the paradigm of GPI
with PeVFA achieves about a 40% performance improvement over its vanilla
counterpart in most environments.
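To make the PeVFA idea concrete, below is a minimal sketch (not the paper's actual architecture) of a value network that takes an explicit policy representation as an extra input, with the policy embedded from a batch of its state-action pairs via a simple mean-pooling encoder. All layer sizes, the encoder design, and the choice of a Q-function rather than a state-value function are illustrative assumptions; the paper's alternative of learning embeddings directly from policy network parameters is not shown.

```python
# Minimal PeVFA-style critic sketch (PyTorch). Layer sizes and the mean-pooling
# encoder are illustrative assumptions, not the architecture from the paper.
import torch
import torch.nn as nn


class StateActionPolicyEncoder(nn.Module):
    """Embed a policy from a batch of its (state, action) pairs via mean pooling."""

    def __init__(self, state_dim, action_dim, embed_dim=64):
        super().__init__()
        self.pair_net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, states, actions):
        # states: (N, state_dim), actions: (N, action_dim), sampled from one policy
        pair_features = self.pair_net(torch.cat([states, actions], dim=-1))
        return pair_features.mean(dim=0)  # permutation-invariant policy embedding


class PeVFACritic(nn.Module):
    """Q(s, a, chi): a value function that also conditions on a policy embedding chi."""

    def __init__(self, state_dim, action_dim, embed_dim=64):
        super().__init__()
        self.q_net = nn.Sequential(
            nn.Linear(state_dim + action_dim + embed_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action, policy_embedding):
        # Broadcast the single policy embedding across the batch of (s, a) queries.
        chi = policy_embedding.expand(state.shape[0], -1)
        return self.q_net(torch.cat([state, action, chi], dim=-1))


if __name__ == "__main__":
    state_dim, action_dim = 8, 2
    encoder = StateActionPolicyEncoder(state_dim, action_dim)
    critic = PeVFACritic(state_dim, action_dim)
    # A handful of (s, a) pairs sampled from the current policy stand in for it.
    s_pairs, a_pairs = torch.randn(32, state_dim), torch.randn(32, action_dim)
    chi = encoder(s_pairs, a_pairs)
    q_values = critic(torch.randn(5, state_dim), torch.randn(5, action_dim), chi)
    print(q_values.shape)  # torch.Size([5, 1])
```

Because the embedding is an input rather than something baked into the weights, one critic can, in principle, hold value estimates for many policies at once, which is the property the abstract calls value generalization among policies.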
Related papers
- Reflective Policy Optimization [20.228281670899204]
Reflective Policy Optimization (RPO) amalgamates past and future state-action information for policy optimization.
RPO empowers the agent with introspection, allowing it to modify its actions within the current state.
Empirical results demonstrate RPO's feasibility and efficacy in two reinforcement learning benchmarks.
arXiv Detail & Related papers (2024-06-06T01:46:49Z)
- Clipped-Objective Policy Gradients for Pessimistic Policy Optimization [3.2996723916635275]
Policy gradient methods seek to produce monotonic improvement through bounded changes in policy outputs.
In this work, we find that the performance of PPO, when applied to continuous action spaces, may be consistently improved through a simple change in objective.
We show that (1) the clipped-objective policy gradient (COPG) objective is on average "pessimistic" compared to the PPO objective, and (2) this pessimism promotes enhanced exploration.
arXiv Detail & Related papers (2023-11-10T03:02:49Z)
- Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z)
- Towards an Understanding of Default Policies in Multitask Policy Optimization [29.806071693039655]
Much of the recent success of deep reinforcement learning has been driven by regularized policy optimization (RPO) algorithms, but the role of the default policy these methods regularize towards is not well understood.
We take a first step towards filling this gap by formally linking the quality of the default policy to its effect on optimization.
We then derive a principled RPO algorithm for multitask learning with strong performance guarantees.
arXiv Detail & Related papers (2021-11-04T16:45:15Z)
- Hinge Policy Optimization: Rethinking Policy Improvement and Reinterpreting PPO [6.33198867705718]
Policy optimization is a fundamental principle for designing reinforcement learning algorithms.
Despite its superior empirical performance, PPO-clip has not been justified by a theoretical proof to date (a minimal sketch of the clipped surrogate in question appears after this list).
This is the first work to prove global convergence to an optimal policy for a variant of PPO-clip.
arXiv Detail & Related papers (2021-10-26T15:56:57Z)
- Decoupling Value and Policy for Generalization in Reinforcement Learning [20.08992844616678]
We argue that more information is needed to accurately estimate the value function than to learn the optimal policy.
We propose two approaches which are combined to create IDAAC: Invariant Decoupled Advantage Actor-Critic.
IDAAC shows good generalization to unseen environments, achieving a new state-of-the-art on the Procgen benchmark and outperforming popular methods on DeepMind Control tasks with distractors.
arXiv Detail & Related papers (2021-02-20T12:40:11Z)
- Efficient Evaluation of Natural Stochastic Policies in Offline Reinforcement Learning [80.42316902296832]
We study the efficient off-policy evaluation of natural policies, which are defined in terms of deviations from the behavior policy.
This is a departure from the off-policy evaluation literature, where most work considers the evaluation of explicitly specified policies.
arXiv Detail & Related papers (2020-06-06T15:08:24Z)
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
- Policy Evaluation Networks [50.53250641051648]
We introduce a scalable, differentiable fingerprinting mechanism that retains essential policy information in a concise embedding.
Our empirical results demonstrate that combining these three elements can produce policies that outperform those that generated the training data.
arXiv Detail & Related papers (2020-02-26T23:00:27Z)
- BRPO: Batch Residual Policy Optimization [79.53696635382592]
In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy.
We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent.
We derive a new RL method, BRPO, which learns both the policy and the allowable deviation so as to jointly maximize a lower bound on policy performance.
arXiv Detail & Related papers (2020-02-08T01:59:33Z)
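Several entries above, as well as the "vanilla counterpart" in the main abstract, revolve around PPO's clipped surrogate objective. The snippet below is a minimal sketch of that surrogate for reference only: the variable names are placeholders, the advantages are assumed to be computed elsewhere (e.g., by GAE), and the value-function and entropy terms of a full PPO loss are omitted.

```python
# Minimal sketch of PPO's clipped surrogate objective (PyTorch).
# log_probs_* and advantages are assumed to be supplied by the training loop;
# this is only the policy term of a full PPO loss.
import torch


def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Negative clipped surrogate: minimizing this (approximately) improves the policy."""
    ratio = torch.exp(log_probs_new - log_probs_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The pointwise minimum makes the surrogate a pessimistic, lower-bound-style objective.
    return -torch.min(unclipped, clipped).mean()


if __name__ == "__main__":
    lp_new, lp_old = torch.randn(4), torch.randn(4)
    adv = torch.randn(4)
    print(ppo_clip_loss(lp_new, lp_old, adv))
```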