Related papers: Provably Efficient Exploration in Policy Optimization

Provably Efficient Exploration in Policy Optimization

URL: http://arxiv.org/abs/1912.05830v4
Date: Mon, 1 Apr 2024 00:56:31 GMT
Title: Provably Efficient Exploration in Policy Optimization
Authors: Qi Cai, Zhuoran Yang, Chi Jin, Zhaoran Wang,
Abstract summary: This paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO) OPPO achieves $tildeO(sqrtd2 H3 T )$ regret. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
Score: 117.09887790160406
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While policy-based reinforcement learning (RL) achieves tremendous successes in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge such a gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an ``optimistic version'' of the policy gradient direction. This paper proves that, in the problem of episodic Markov decision process with linear function approximation, unknown transition, and adversarial reward with full-information feedback, OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T} )$ regret. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.

Related papers

Accelerating RL for LLM Reasoning with Optimal Advantage Regression [52.0792918455501]
We propose a novel two-stage policy optimization framework that directly approximates the optimal advantage function.<n>$A$*-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks.<n>It reduces training time by up to 2$times$ and peak memory usage by over 30% compared to PPO, GRPO, and REBEL.
arXiv Detail & Related papers (2025-05-27T03:58:50Z)
$f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization [91.43730624072226]
$f$-PO is a novel framework that generalizes and extends existing approaches. We conduct experiments on state-of-the-art language models using benchmark datasets.
arXiv Detail & Related papers (2024-10-29T02:11:45Z)
Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data. We show that DPO derived based on the optimal solution of problem leads to a compromised mean-seeking approximation of the optimal solution in practice. We propose efficient exact optimization (EXO) of the alignment objective.
arXiv Detail & Related papers (2024-02-01T18:51:54Z)
Low-Switching Policy Gradient with Exploration via Online Sensitivity Sampling [23.989009116398208]
We design a low-switching sample-efficient policy optimization algorithm, LPO, with general non-linear function approximation. We show that, our algorithm obtains an $varepsilon$-optimal policy with only $widetildeO(fractextpoly(d)varepsilon3)$ samples.
arXiv Detail & Related papers (2023-06-15T23:51:46Z)
Optimistic Natural Policy Gradient: a Simple Efficient Policy Optimization Framework for Online RL [23.957148537567146]
This paper proposes a simple efficient policy optimization framework -- Optimistic NPG for online RL. For $d$-dimensional linear MDPs, Optimistic NPG is computationally efficient, and learns an $varepsilon$-optimal policy within $tildeO(d2/varepsilon3)$ samples.
arXiv Detail & Related papers (2023-05-18T15:19:26Z)
Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences. Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer. We propose the first optimistic model-based algorithm for PbRL with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
We show that the preferability of optimization methods depends critically on whether exact gradients are used. Second, to explain these findings we introduce the concept of committal rate for policy optimization. Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z)
Hinge Policy Optimization: Rethinking Policy Improvement and Reinterpreting PPO [6.33198867705718]
Policy optimization is a fundamental principle for designing reinforcement learning algorithms. Despite its superior empirical performance, PPO-clip has not been justified via theoretical proof up to date. This is the first ever that can prove global convergence to an optimal policy for a variant of PPO-clip.
arXiv Detail & Related papers (2021-10-26T15:56:57Z)
Optimistic Policy Optimization is Provably Efficient in Non-stationary MDPs [45.6318149525364]
We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs) We propose the $underlinetextp$eriodically $underlinetextr$estarted $underlinetexto$ptimistic $underlinetextp$olicy $underlinetexto$ptimization algorithm (PROPO)
arXiv Detail & Related papers (2021-10-18T02:33:20Z)
Optimistic Policy Optimization with Bandit Feedback [70.75568142146493]
We propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $tilde O(sqrtS2 A H4 K)$ regret for previous rewards. To the best of our knowledge, the two results are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.
arXiv Detail & Related papers (2020-02-19T15:41:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.