CUP: Critic-Guided Policy Reuse
- URL: http://arxiv.org/abs/2210.08153v1
- Date: Sat, 15 Oct 2022 00:53:03 GMT
- Title: CUP: Critic-Guided Policy Reuse
- Authors: Jin Zhang, Siyuan Li, Chongjie Zhang
- Abstract summary: Critic-gUided Policy reuse (CUP) is a policy reuse algorithm that avoids training any extra components and efficiently reuses source policies.
CUP selects the source policy that has the largest one-step improvement over the current target policy, and forms a guidance policy.
Empirical results demonstrate that CUP achieves efficient transfer and significantly outperforms baseline algorithms.
- Score: 37.12379523150601
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The ability to reuse previous policies is an important aspect of human
intelligence. To achieve efficient policy reuse, a Deep Reinforcement Learning
(DRL) agent needs to decide when to reuse and which source policies to reuse.
Previous methods solve this problem by introducing extra components to the
underlying algorithm, such as hierarchical high-level policies over source
policies, or estimations of source policies' value functions on the target
task. However, training these components induces either optimization
non-stationarity or heavy sampling cost, significantly impairing the
effectiveness of transfer. To tackle this problem, we propose a novel policy
reuse algorithm called Critic-gUided Policy reuse (CUP), which avoids training
any extra components and efficiently reuses source policies. CUP utilizes the
critic, a common component in actor-critic methods, to evaluate and choose
source policies. At each state, CUP chooses the source policy that has the
largest one-step improvement over the current target policy, and forms a
guidance policy. The guidance policy is theoretically guaranteed to be a
monotonic improvement over the current target policy. Then the target policy is
regularized to imitate the guidance policy to perform efficient policy search.
Empirical results demonstrate that CUP achieves efficient transfer and
significantly outperforms baseline algorithms.
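As a rough illustration of the selection rule described above, the sketch below forms the guidance action at a state by comparing Monte-Carlo estimates of the target critic's expected Q-value under each candidate policy. It assumes an actor-critic setup with a critic `critic_q(state, action)` and callable policies that sample actions; these interfaces, and the inclusion of the current target policy itself in the candidate set, are illustrative assumptions rather than the paper's actual code.

```python
import numpy as np

def expected_q(critic_q, state, policy, n_action_samples=16):
    # Monte-Carlo estimate of E_{a ~ policy(.|state)}[Q(state, a)] under the
    # target policy's critic; critic_q and policy are assumed interfaces.
    actions = [policy(state) for _ in range(n_action_samples)]
    return float(np.mean([critic_q(state, a) for a in actions]))

def guidance_action(critic_q, state, target_policy, source_policies):
    # Per-state policy selection: pick whichever candidate (the current target
    # policy or a source policy) has the largest expected Q-value, i.e. the
    # largest one-step improvement over the current target policy, and act
    # with it. The resulting guidance policy serves as the imitation target
    # that regularizes the target policy during training.
    candidates = [target_policy] + list(source_policies)
    scores = [expected_q(critic_q, state, pi) for pi in candidates]
    return candidates[int(np.argmax(scores))](state)
```

In training, the target policy's usual actor-critic loss would be augmented with an imitation term (e.g., a KL penalty) toward this guidance policy, as the abstract describes; the exact loss and hyperparameters are not reproduced here.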
Related papers
- IOB: Integrating Optimization Transfer and Behavior Transfer for
Multi-Policy Reuse [50.90781542323258]
Reinforcement learning (RL) agents can transfer knowledge from source policies to a related target task.
Previous methods introduce additional components, such as hierarchical policies or estimations of source policies' value functions.
We propose a novel transfer RL method that selects the source policy without training extra components.
arXiv Detail & Related papers (2023-08-14T09:22:35Z)
- Value Enhancement of Reinforcement Learning via Efficient and Robust Trust Region Optimization [14.028916306297928]
Reinforcement learning (RL) is a powerful machine learning technique that enables an intelligent agent to learn an optimal policy.
We propose a novel value enhancement method to improve the performance of a given initial policy computed by existing state-of-the-art RL algorithms.
arXiv Detail & Related papers (2023-01-05T18:43:40Z)
- Local Policy Improvement for Recommender Systems [8.617221361305901]
We show how to train a new policy given data collected from a previously-deployed policy.
We suggest an alternative approach of local policy improvement without off-policy correction.
This local policy improvement paradigm is ideal for recommender systems, as previous policies are typically of decent quality and policies are updated frequently.
arXiv Detail & Related papers (2022-12-22T00:47:40Z)
- Hinge Policy Optimization: Rethinking Policy Improvement and Reinterpreting PPO [6.33198867705718]
Policy optimization is a fundamental principle for designing reinforcement learning algorithms.
Despite its superior empirical performance, PPO-clip has not been justified by a theoretical proof to date.
This is the first work to prove global convergence to an optimal policy for a variant of PPO-clip.
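For context, the clipped surrogate objective at the core of PPO-clip can be written compactly; the sketch below shows the standard textbook form rather than the paper's hinge-based variant, and names such as `ppo_clip_objective` are illustrative, not taken from the paper.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    # Per-sample PPO-clip surrogate, where ratio = pi_theta(a|s) / pi_theta_old(a|s)
    # and advantage is an advantage estimate. This is the standard objective,
    # not the hinge reformulation studied in the paper above.
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)
```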
arXiv Detail & Related papers (2021-10-26T15:56:57Z)
- Neural Network Compatible Off-Policy Natural Actor-Critic Algorithm [16.115903198836694]
Learning optimal behavior from existing data is one of the most important problems in Reinforcement Learning (RL).
This is known as "off-policy control" in RL, where an agent's objective is to compute an optimal policy based on data obtained from a given policy (known as the behavior policy).
This work proposes an off-policy natural actor-critic algorithm that utilizes state-action distribution correction for handling the off-policy behavior and the natural policy gradient for sample efficiency.
arXiv Detail & Related papers (2021-10-19T14:36:45Z)
- Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in terms of both rank correlation and the performance gap between the truly best policy and the best of the top three ranked policies.
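A policy scoring model of the kind described above can be trained with a generic learning-to-rank objective; the sketch below uses a logistic pairwise ranking loss as one plausible choice and is not taken from the paper's implementation.

```python
import numpy as np

def pairwise_ranking_loss(score_a, score_b, perf_a, perf_b):
    # Logistic pairwise ranking loss for a policy scoring model: if policy A
    # truly performs better than policy B (perf_a > perf_b), the loss is small
    # only when the model also scores A above B. A generic learning-to-rank
    # loss, not the paper's exact formulation.
    sign = 1.0 if perf_a > perf_b else -1.0
    return float(np.log1p(np.exp(-sign * (score_a - score_b))))
```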
arXiv Detail & Related papers (2021-07-03T07:01:23Z)
- Efficient Evaluation of Natural Stochastic Policies in Offline Reinforcement Learning [80.42316902296832]
We study the efficient off-policy evaluation of natural policies, which are defined in terms of deviations from the behavior policy.
This is a departure from the off-policy evaluation literature, where most work considers the evaluation of explicitly specified policies.
arXiv Detail & Related papers (2020-06-06T15:08:24Z)
- Policy Evaluation Networks [50.53250641051648]
We introduce a scalable, differentiable fingerprinting mechanism that retains essential policy information in a concise embedding.
Our empirical results demonstrate that combining these three elements can produce policies that outperform those that generated the training data.
arXiv Detail & Related papers (2020-02-26T23:00:27Z)
- Efficient Deep Reinforcement Learning via Adaptive Policy Transfer [50.51637231309424]
A Policy Transfer Framework (PTF) is proposed to accelerate Reinforcement Learning (RL).
Our framework learns when and which source policy is the best to reuse for the target policy and when to terminate it.
Experimental results show it significantly accelerates the learning process and surpasses state-of-the-art policy transfer methods.
arXiv Detail & Related papers (2020-02-19T07:30:57Z)