Adaptive Policy Learning to Additional Tasks
- URL: http://arxiv.org/abs/2305.15193v1
- Date: Wed, 24 May 2023 14:31:11 GMT
- Title: Adaptive Policy Learning to Additional Tasks
- Authors: Wenjian Hao, Zehui Lu, Zihao Liang, Tianyu Zhou, Shaoshuai Mou
- Abstract summary: This paper develops a policy learning method for tuning a pre-trained policy to adapt to additional tasks without altering the original task.
A method named Adaptive Policy Gradient (APG) is proposed in this paper, which combines Bellman's principle of optimality with the policy gradient approach to improve the convergence rate.
- Score: 3.43814540650436
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper develops a policy learning method for tuning a pre-trained policy
to adapt to additional tasks without altering the original task. A method named
Adaptive Policy Gradient (APG) is proposed in this paper, which combines
Bellman's principle of optimality with the policy gradient approach to improve
the convergence rate. This paper provides theoretical analysis which guarantees
the convergence rate and sample complexity of $\mathcal{O}(1/T)$ and
$\mathcal{O}(1/\epsilon)$, respectively, where $T$ denotes the number of
iterations and $\epsilon$ denotes the accuracy of the resulting stationary
policy. Furthermore, several challenging numerical simulations, including
cartpole, lunar lander, and robot arm, are provided to show that APG obtains
similar performance compared to existing deterministic policy gradient methods
while utilizing much less data and converging at a faster rate.
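The paper gives no code; below is a minimal sketch of the kind of adaptation objective it describes (tune a pre-trained policy on an additional task without losing the original one), assuming a deterministic linear policy, a toy scalar tracking task, and an illustrative proximity penalty. `rollout_return`, `ALPHA`, `BETA`, and the finite-difference gradient are hypothetical stand-ins, not APG's actual estimator.

```python
# A minimal sketch (not APG itself): adapt a pre-trained deterministic linear
# policy u = theta * s to an additional task while a quadratic penalty keeps
# it close to the pre-trained parameters, so the original behaviour is
# preserved. rollout_return, ALPHA, BETA, and the dynamics are illustrative.

def rollout_return(theta, target, horizon=20):
    """Return of the linear policy on a toy scalar tracking task."""
    s, total = 1.0, 0.0
    for _ in range(horizon):
        u = theta * s
        s = 0.9 * s + 0.1 * u        # simple stable linear dynamics
        total -= (s - target) ** 2   # reward: negative tracking error
    return total

theta_pre = -0.5   # pretend this was pre-trained on the original task
theta = theta_pre
ALPHA, BETA, EPS = 0.05, 1.0, 1e-4

for _ in range(200):
    # Finite-difference gradient of the additional-task return (illustration
    # only; APG is a policy-gradient method, not finite differences).
    g_add = (rollout_return(theta + EPS, target=0.5)
             - rollout_return(theta - EPS, target=0.5)) / (2 * EPS)
    # Proximity penalty gradient: do not drift far from the pre-trained policy.
    g_keep = -2.0 * BETA * (theta - theta_pre)
    theta += ALPHA * (g_add + g_keep)

print(f"pre-trained theta = {theta_pre:.3f}, adapted theta = {theta:.3f}")
```

The penalty term is one simple way to trade off the new task against the original behaviour; APG's actual update, which combines Bellman's principle of optimality with the policy gradient, is not reproduced here.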
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
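As a rough illustration of the re-weighting involved, here is a minimal sketch of an IS-weighted REINFORCE estimator on a one-step Gaussian-policy problem; the paper's contribution, actively choosing the behavioural policy that minimizes gradient variance, is not reproduced, and all names are illustrative.

```python
# Minimal sketch of an importance-sampling-weighted REINFORCE estimator on a
# one-step problem with Gaussian policies a ~ N(theta * s, SIGMA^2).
import numpy as np

rng = np.random.default_rng(1)
SIGMA = 0.5

def logp(theta, s, a):
    """Log-density of a ~ N(theta * s, SIGMA^2)."""
    return (-0.5 * ((a - theta * s) / SIGMA) ** 2
            - np.log(SIGMA * np.sqrt(2 * np.pi)))

theta_target, theta_behav = 0.3, 0.0
grads = []
for _ in range(1000):
    s = rng.normal()                              # state
    a = theta_behav * s + SIGMA * rng.normal()    # action from behavioural policy
    reward = -(a - s) ** 2                        # reward: act close to the state
    w = np.exp(logp(theta_target, s, a) - logp(theta_behav, s, a))  # IS weight
    score = (a - theta_target * s) / SIGMA**2 * s  # score of the TARGET policy
    grads.append(w * score * reward)

g = np.array(grads)
print(f"IS gradient estimate: {g.mean():.3f} +/- {g.std() / np.sqrt(len(g)):.3f}")
```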
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Low-Switching Policy Gradient with Exploration via Online Sensitivity Sampling [23.989009116398208]
We design a low-switching sample-efficient policy optimization algorithm, LPO, with general non-linear function approximation.
We show that our algorithm obtains an $\varepsilon$-optimal policy with only $\widetilde{O}(\text{poly}(d)/\varepsilon^3)$ samples.
arXiv Detail & Related papers (2023-06-15T23:51:46Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
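A hedged sketch of the single-loop idea on a tabular toy MDP: each iteration takes one soft Bellman backup for the policy and one gradient step on the reward, instead of solving the inner RL problem to completion. The uniform-state visitation proxy and all constants are illustrative assumptions, not the paper's algorithm.

```python
# Hedged sketch of a single-loop max-likelihood IRL scheme on a tabular toy
# MDP: one soft Bellman backup, then one reward-gradient step, per iteration.
import numpy as np

nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = next-state dist
r = np.zeros((nS, nA))                          # learned reward
Q = np.zeros((nS, nA))
# Stand-in for empirical expert state-action visitation frequencies.
expert_visits = rng.dirichlet(np.ones(nS * nA)).reshape(nS, nA)

for _ in range(200):
    V = np.log(np.exp(Q).sum(axis=1))           # soft state value
    Q = r + gamma * P @ V                       # one soft Bellman backup
    pi = np.exp(Q - Q.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)         # soft-optimal policy so far
    agent_visits = pi / nS                      # crude uniform-state proxy
    # One likelihood-gradient step: move reward toward matching the expert.
    r += 0.1 * (expert_visits - agent_visits)

print("learned reward:\n", np.round(r, 2))
```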
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Sample Complexity of Policy-Based Methods under Off-Policy Sampling and Linear Function Approximation [8.465228064780748]
Off-policy sampling and linear function approximation are employed for policy evaluation.
Various policy update rules, including natural policy gradient (NPG), are considered for policy update.
We establish for the first time an overall $\mathcal{O}(\epsilon^{-2})$ sample complexity for finding an optimal policy.
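For reference, a minimal sketch of the NPG update with a tabular softmax policy, where the Fisher-preconditioned step reduces to adding advantages to the logits; the Q-values are given exactly here, whereas the paper estimates them from off-policy samples with linear function approximation.

```python
# Minimal sketch of natural policy gradient (NPG) with a tabular softmax
# policy: the Fisher-preconditioned step adds advantages to the logits.
import numpy as np

nS, nA, eta = 2, 3, 0.5
rng = np.random.default_rng(3)
logits = np.zeros((nS, nA))
Q = rng.normal(size=(nS, nA))      # stand-in for estimated Q-values

for _ in range(50):
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    A = Q - (pi * Q).sum(axis=1, keepdims=True)   # advantage under pi
    logits += eta * A              # softmax NPG: logits move by the advantages

print("greedy actions:", logits.argmax(axis=1), "optimal:", Q.argmax(axis=1))
```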
arXiv Detail & Related papers (2022-08-05T15:59:05Z)
- Policy Gradient Method For Robust Reinforcement Learning [23.62008807533706]
This paper develops the first policy gradient method with global optimality guarantee and complexity analysis for robust reinforcement learning under model mismatch.
We show that the proposed robust policy gradient method converges to the global optimum under direct policy parameterization.
We then extend our methodology to the general model-free setting and design a robust actor-critic method with a differentiable parametric policy class and value function.
arXiv Detail & Related papers (2022-05-15T17:35:17Z)
- Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
First, we show that the preferability of optimization methods depends critically on whether exact or stochastic gradients are used.
Second, to explain these findings we introduce the concept of committal rate for policy optimization.
Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z)
- Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
We study optimization of areas under precision-recall curves (AUPRC), which is widely used for imbalanced tasks.
We develop novel momentum methods with a better iteration complexity of $O(1/\epsilon^4)$ for finding an $\epsilon$-stationary solution.
We also design a novel family of adaptive methods with the same complexity of $O(1/\epsilon^4)$, which enjoy faster convergence in practice.
arXiv Detail & Related papers (2021-07-02T16:21:52Z)
- Bregman Gradient Policy Optimization [97.73041344738117]
We design a Bregman gradient policy optimization framework for reinforcement learning based on Bregman divergences and momentum techniques.
The variance-reduced variant, VR-BGPO, reaches the best known complexity $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point while requiring only one trajectory at each iteration.
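A minimal sketch of the mirror-ascent step underlying Bregman policy optimization: with the KL divergence as the Bregman distance, the simplex update becomes an exponentiated-gradient (multiplicative-weights) step. The toy bandit gradient below stands in for BGPO's momentum-based, variance-reduced estimator.

```python
# Minimal sketch of KL-Bregman mirror ascent on the probability simplex.
import numpy as np

rewards = np.array([0.1, 0.5, 0.9])   # expected reward per action (toy)
pi = np.ones(3) / 3                   # initial policy on the simplex
eta = 0.5

for _ in range(100):
    grad = rewards                    # gradient of E_pi[r] w.r.t. pi is r
    pi = pi * np.exp(eta * grad)      # KL-Bregman mirror-ascent step
    pi /= pi.sum()                    # normalize back onto the simplex

print("policy after mirror ascent:", np.round(pi, 3))
```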
arXiv Detail & Related papers (2021-06-23T01:08:54Z)
- Cautiously Optimistic Policy Optimization and Exploration with Linear Function Approximation [48.744735294559824]
Policy optimization methods are popular reinforcement learning algorithms, because their incremental and on-policy nature makes them more stable than their value-based counterparts.
In this paper, we propose a new algorithm, COPOE, that overcomes the sample complexity issue of PCPG while retaining its robustness to model misspecification.
The result is an improvement in sample complexity from $\widetilde{O}(1/\epsilon^{11})$ for PCPG to $\widetilde{O}(1/\epsilon^3)$ for COPOE, nearly bridging the gap with value-based techniques.
arXiv Detail & Related papers (2021-03-24T01:42:59Z)
- On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method [38.34416337932712]
Policy gradient gives rise to a rich class of reinforcement learning (RL) methods, for example REINFORCE.
Yet the best known sample complexity result for such methods to find an $\epsilon$-optimal policy is $\mathcal{O}(\epsilon^{-3})$, which is suboptimal.
We study the fundamental convergence properties and sample efficiency of first-order policy optimization method.
arXiv Detail & Related papers (2021-02-17T07:06:19Z)
- Stochastic Recursive Momentum for Policy Gradient Methods [28.277961340108313]
We propose a novel algorithm named STOchastic Recursive Momentum for Policy Gradient (STORM-PG).
STORM-PG enjoys a provably sharp $O(1/\epsilon^3)$ sample complexity bound, matching the best-known convergence rate for policy gradient algorithms.
Numerical experiments demonstrate the superiority of our algorithm over comparable policy gradient algorithms.
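A hedged sketch of the STORM recursive-momentum estimator that STORM-PG builds on, shown on a noisy 1-D quadratic rather than a policy objective. Each step queries the same noisy gradient sample at the current and previous iterate, which is what removes the need for checkpoints or large batches; `eta` and `a` are illustrative constants.

```python
# Hedged sketch of the STORM recursive-momentum gradient estimator on a
# noisy quadratic (not the policy-gradient setting of STORM-PG).
import numpy as np

rng = np.random.default_rng(4)

def noisy_grad(x, noise):
    return 2.0 * x + noise           # gradient of x^2 plus shared noise

x, x_prev, d = 5.0, 5.0, 0.0
eta, a = 0.05, 0.3

for _ in range(300):
    noise = rng.normal(scale=2.0)
    g_curr = noisy_grad(x, noise)
    g_prev = noisy_grad(x_prev, noise)     # same sample, previous iterate
    d = g_curr + (1.0 - a) * (d - g_prev)  # STORM recursive momentum
    x_prev = x
    x -= eta * d                           # descent step

print(f"final x = {x:.4f} (optimum at 0)")
```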
arXiv Detail & Related papers (2020-03-09T17:59:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.