Related papers: On the Convergence Rate of Off-Policy Policy Optimization Methods with Density-Ratio Correction

On the Convergence Rate of Off-Policy Policy Optimization Methods with Density-Ratio Correction

URL: http://arxiv.org/abs/2106.00993v1
Date: Wed, 2 Jun 2021 07:26:29 GMT
Title: On the Convergence Rate of Off-Policy Policy Optimization Methods with Density-Ratio Correction
Authors: Jiawei Huang, Nan Jiang
Abstract summary: We study the convergence properties of off-policy policy improvement algorithms with state-action density ratio correction. We present two strategies with finite-time convergence guarantees. We prove that O-SPIM converges to a stationary point with total complexity $O(epsilon-4)$, which matches the convergence rate of some recent actor-critic algorithms in the on-policy setting.
Score: 28.548040329949387
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we study the convergence properties of off-policy policy improvement algorithms with state-action density ratio correction under function approximation setting, where the objective function is formulated as a max-max-min optimization problem. We characterize the bias of the learning objective and present two strategies with finite-time convergence guarantees. In our first strategy, we present algorithm P-SREDA with convergence rate $O(\epsilon^{-3})$, whose dependency on $\epsilon$ is optimal. In our second strategy, we propose a new off-policy actor-critic style algorithm named O-SPIM. We prove that O-SPIM converges to a stationary point with total complexity $O(\epsilon^{-4})$, which matches the convergence rate of some recent actor-critic algorithms in the on-policy setting.

Related papers

Convergence and Sample Complexity of First-Order Methods for Agnostic Reinforcement Learning [66.4260157478436]
We study reinforcement learning in the policy learning setting.<n>The goal is to find a policy whose performance is competitive with the best policy in a given class of interest.
arXiv Detail & Related papers (2025-07-06T14:40:05Z)
Rethinking the Global Convergence of Softmax Policy Gradient with Linear Function Approximation [52.772454746132276]
We show that the approximation error in modeling problem-dependent quantities is irrelevant to the algorithm's global convergence.<n>We prove that $textttLin-SPG$ with any arbitrary constant learning rate can ensure global convergence to the optimal policy.
arXiv Detail & Related papers (2025-05-06T04:03:06Z)
FastPart: Over-Parameterized Stochastic Gradient Descent for Sparse optimisation on Measures [3.377298662011438]
This paper presents a novel algorithm that leverages Gradient Descent strategies in conjunction with Random Features to augment the scalability of Conic Particle Gradient Descent (CPGD)<n>We provide rigorous mathematical proofs demonstrating the following key findings: $mathrm(i)$ The total variation norms of the solution measures along the descent trajectory remain bounded, ensuring stability and preventing undesirable divergence; $mathrm(ii)$ We establish a global convergence guarantee with a convergence rate of $O(log(K)/sqrtK)$ over $K$, showcasing the efficiency and effectiveness of
arXiv Detail & Related papers (2023-12-10T20:41:43Z)
Low-Switching Policy Gradient with Exploration via Online Sensitivity Sampling [23.989009116398208]
We design a low-switching sample-efficient policy optimization algorithm, LPO, with general non-linear function approximation. We show that, our algorithm obtains an $varepsilon$-optimal policy with only $widetildeO(fractextpoly(d)varepsilon3)$ samples.
arXiv Detail & Related papers (2023-06-15T23:51:46Z)
Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO. We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z)
Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
We show that the preferability of optimization methods depends critically on whether exact gradients are used. Second, to explain these findings we introduce the concept of committal rate for policy optimization. Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z)
Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
We study optimization of areas under precision-recall curves (AUPRC), which is widely used for imbalanced tasks. We develop novel momentum methods with a better iteration of $O (1/epsilon4)$ for finding an $epsilon$stationary solution. We also design a novel family of adaptive methods with the same complexity of $O (1/epsilon4)$, which enjoy faster convergence in practice.
arXiv Detail & Related papers (2021-07-02T16:21:52Z)
Bregman Gradient Policy Optimization [97.73041344738117]
We design a Bregman gradient policy optimization for reinforcement learning based on Bregman divergences and momentum techniques. VR-BGPO reaches the best complexity $tilde(epsilon-3)$ for finding an $epsilon$stationary point only requiring one trajectory at each iteration.
arXiv Detail & Related papers (2021-06-23T01:08:54Z)
On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method [38.34416337932712]
Policy gives rise to a rich class of reinforcement learning (RL) methods, for example the REINFORCE. Yet the best known sample complexity result for such methods to find an $epsilon$-optimal policy is $mathcalO(epsilon-3)$, which is suboptimal. We study the fundamental convergence properties and sample efficiency of first-order policy optimization method.
arXiv Detail & Related papers (2021-02-17T07:06:19Z)
Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs. bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad. We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z)
Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization [44.24881971917951]
Natural policy gradient (NPG) methods are among the most widely used policy optimization algorithms. We develop convergence guarantees for entropy-regularized NPG methods under softmax parameterization. Our results accommodate a wide range of learning rates, and shed light upon the role of entropy regularization in enabling fast convergence.
arXiv Detail & Related papers (2020-07-13T17:58:41Z)
Convergence of adaptive algorithms for weakly convex constrained optimization [59.36386973876765]
We prove the $mathcaltilde O(t-1/4)$ rate of convergence for the norm of the gradient of Moreau envelope. Our analysis works with mini-batch size of $1$, constant first and second order moment parameters, and possibly smooth optimization domains.
arXiv Detail & Related papers (2020-06-11T17:43:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.