Related papers: Best of Both Worlds Policy Optimization

Best of Both Worlds Policy Optimization

URL: http://arxiv.org/abs/2302.09408v1
Date: Sat, 18 Feb 2023 19:46:11 GMT
Title: Best of Both Worlds Policy Optimization
Authors: Christoph Dann, Chen-Yu Wei, Julian Zimmert
Abstract summary: We show that by properly designing the regularizer, the exploration bonus and the learning rates, one can achieve a more favorable polylog$(T)$ regret when the losses are adversarial. This is the first time a gap-dependent polylog$(T)$ regret bound is shown for policy optimization.
Score: 33.13041034490332
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Policy optimization methods are popular reinforcement learning algorithms in practice. Recent works have built theoretical foundation for them by proving $\sqrt{T}$ regret bounds even when the losses are adversarial. Such bounds are tight in the worst case but often overly pessimistic. In this work, we show that in tabular Markov decision processes (MDPs), by properly designing the regularizer, the exploration bonus and the learning rates, one can achieve a more favorable polylog$(T)$ regret when the losses are stochastic, without sacrificing the worst-case guarantee in the adversarial regime. To our knowledge, this is also the first time a gap-dependent polylog$(T)$ regret bound is shown for policy optimization. Specifically, we achieve this by leveraging a Tsallis entropy or a Shannon entropy regularizer in the policy update. Then we show that under known transitions, we can further obtain a first-order regret bound in the adversarial regime by leveraging the log-barrier regularizer.

Related papers

Logarithmic Regret of Exploration in Average Reward Markov Decision Processes [2.9158689853305693]
State-of-the-art algorithms for regret follow a well-established framework. We show that there is a significant advantage in replacing the Doubling Trick (DT) rule with the Vanishing Multiplicative (VM) rule.
arXiv Detail & Related papers (2025-02-10T13:59:00Z)
Achieving $\widetilde{\mathcal{O}}(\sqrt{T})$ Regret in Average-Reward POMDPs with Known Observation Models [56.92178753201331]
We tackle average-reward infinite-horizon POMDPs with an unknown transition model. We present a novel and simple estimator that overcomes this barrier.
arXiv Detail & Related papers (2025-01-30T22:29:41Z)
Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs [63.47351876442425]
We study episodic linear mixture MDPs with the unknown transition and adversarial rewards under full-information feedback. We propose a novel algorithm that combines the benefits of two popular methods: occupancy-measure-based and policy-based. Our algorithm enjoys an $widetildemathcalO(d sqrtH3 K + sqrtHK(H + barP_K$)$ dynamic regret, where $d$ is the feature dimension.
arXiv Detail & Related papers (2024-11-05T13:55:52Z)
Off-Policy Primal-Dual Safe Reinforcement Learning [16.918188277722503]
We show that the error in cumulative cost estimation causes significant underestimation of cost when using off-policy methods. We propose conservative policy optimization, which learns a policy in a constraint-satisfying area by considering the uncertainty in estimation. We then introduce local policy convexification to help eliminate such suboptimality by gradually reducing the estimation uncertainty.
arXiv Detail & Related papers (2024-01-26T10:33:38Z)
Revisiting Weighted Strategy for Non-stationary Parametric Bandits [82.1942459195896]
This paper revisits the weighted strategy for non-stationary parametric bandits. We propose a refined analysis framework, which produces a simpler weight-based algorithm. Our new framework can be used to improve regret bounds of other parametric bandits.
arXiv Detail & Related papers (2023-03-05T15:11:14Z)
Complete Policy Regret Bounds for Tallying Bandits [51.039677652803675]
Policy regret is a well established notion of measuring the performance of an online learning algorithm against an adaptive adversary. We study restrictions on the adversary that enable efficient minimization of the emphcomplete policy regret We provide an algorithm that w.h.p a complete policy regret guarantee of $tildemathcalO(mKsqrtT)$, where the $tildemathcalO$ notation hides only logarithmic factors.
arXiv Detail & Related papers (2022-04-24T03:10:27Z)
The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret and Policy Switches [84.54669549718075]
We study the problem of regret minimization for episodic Reinforcement Learning (RL) We focus on learning with general function classes and general model classes. We show that a logarithmic regret bound is realizable by algorithms with $O(log T)$ switching cost.
arXiv Detail & Related papers (2022-03-03T02:55:55Z)
The best of both worlds: stochastic and adversarial episodic MDPs with unknown transition [49.78053380710322]
We consider the best-of-both-worlds problem for learning an episodic Markov Decision Process through $T$ episodes. Recent work by [Jin and Luo, 2020] achieves this goal when the fixed transition is known. In this work, we resolve this open problem by using the same Follow-the-Regularized-Leader ($textFTRL$) framework together with a set of new techniques.
arXiv Detail & Related papers (2021-06-08T05:46:35Z)
Improved Regret Bound and Experience Replay in Regularized Policy Iteration [22.621710838468097]
We study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation. We first show that the regret analysis of the Politex algorithm can be sharpened from $O(T3/4)$ to $O(sqrtT)$ under nearly identical assumptions. Our result provides the first high-probability $O(sqrtT)$ regret bound for a computationally efficient algorithm in this setting.
arXiv Detail & Related papers (2021-02-25T00:55:07Z)
Robust Policy Gradient against Strong Data Corruption [30.910088777897045]
We study the problem of robust reinforcement learning under adversarial corruption on both rewards and transitions. Our attack model assumes an textitadaptive adversary who can arbitrarily corrupt the reward and transition at every step within an episode. We develop a Filtered Policy Gradient algorithm that can tolerate even reward corruption and can find an $O(epsilon1/4)$-optimal policy.
arXiv Detail & Related papers (2021-02-11T01:48:38Z)
Optimistic Policy Optimization with Bandit Feedback [70.75568142146493]
We propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $tilde O(sqrtS2 A H4 K)$ regret for previous rewards. To the best of our knowledge, the two results are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.
arXiv Detail & Related papers (2020-02-19T15:41:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.