Best of Both Worlds Policy Optimization
- URL: http://arxiv.org/abs/2302.09408v1
- Date: Sat, 18 Feb 2023 19:46:11 GMT
- Title: Best of Both Worlds Policy Optimization
- Authors: Christoph Dann, Chen-Yu Wei, Julian Zimmert
- Abstract summary: We show that by properly designing the regularizer, the exploration bonus and the learning rates, one can achieve a more favorable polylog$(T)$ regret when the losses are adversarial.
This is the first time a gap-dependent polylog$(T)$ regret bound is shown for policy optimization.
- Score: 33.13041034490332
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policy optimization methods are popular reinforcement learning algorithms in
practice. Recent works have built theoretical foundation for them by proving
$\sqrt{T}$ regret bounds even when the losses are adversarial. Such bounds are
tight in the worst case but often overly pessimistic. In this work, we show
that in tabular Markov decision processes (MDPs), by properly designing the
regularizer, the exploration bonus and the learning rates, one can achieve a
more favorable polylog$(T)$ regret when the losses are stochastic, without
sacrificing the worst-case guarantee in the adversarial regime. To our
knowledge, this is also the first time a gap-dependent polylog$(T)$ regret
bound is shown for policy optimization. Specifically, we achieve this by
leveraging a Tsallis entropy or a Shannon entropy regularizer in the policy
update. Then we show that under known transitions, we can further obtain a
first-order regret bound in the adversarial regime by leveraging the
log-barrier regularizer.
Related papers
- Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs [63.47351876442425]
We study episodic linear mixture MDPs with the unknown transition and adversarial rewards under full-information feedback.
We propose a novel algorithm that combines the benefits of two popular methods: occupancy-measure-based and policy-based.
Our algorithm enjoys an $widetildemathcalO(d sqrtH3 K + sqrtHK(H + barP_K$)$ dynamic regret, where $d$ is the feature dimension.
arXiv Detail & Related papers (2024-11-05T13:55:52Z) - Off-Policy Primal-Dual Safe Reinforcement Learning [16.918188277722503]
We show that the error in cumulative cost estimation causes significant underestimation of cost when using off-policy methods.
We propose conservative policy optimization, which learns a policy in a constraint-satisfying area by considering the uncertainty in estimation.
We then introduce local policy convexification to help eliminate such suboptimality by gradually reducing the estimation uncertainty.
arXiv Detail & Related papers (2024-01-26T10:33:38Z) - Revisiting Weighted Strategy for Non-stationary Parametric Bandits [82.1942459195896]
This paper revisits the weighted strategy for non-stationary parametric bandits.
We propose a refined analysis framework, which produces a simpler weight-based algorithm.
Our new framework can be used to improve regret bounds of other parametric bandits.
arXiv Detail & Related papers (2023-03-05T15:11:14Z) - Complete Policy Regret Bounds for Tallying Bandits [51.039677652803675]
Policy regret is a well established notion of measuring the performance of an online learning algorithm against an adaptive adversary.
We study restrictions on the adversary that enable efficient minimization of the emphcomplete policy regret
We provide an algorithm that w.h.p a complete policy regret guarantee of $tildemathcalO(mKsqrtT)$, where the $tildemathcalO$ notation hides only logarithmic factors.
arXiv Detail & Related papers (2022-04-24T03:10:27Z) - The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret
and Policy Switches [84.54669549718075]
We study the problem of regret minimization for episodic Reinforcement Learning (RL)
We focus on learning with general function classes and general model classes.
We show that a logarithmic regret bound is realizable by algorithms with $O(log T)$ switching cost.
arXiv Detail & Related papers (2022-03-03T02:55:55Z) - The best of both worlds: stochastic and adversarial episodic MDPs with
unknown transition [49.78053380710322]
We consider the best-of-both-worlds problem for learning an episodic Markov Decision Process through $T$ episodes.
Recent work by [Jin and Luo, 2020] achieves this goal when the fixed transition is known.
In this work, we resolve this open problem by using the same Follow-the-Regularized-Leader ($textFTRL$) framework together with a set of new techniques.
arXiv Detail & Related papers (2021-06-08T05:46:35Z) - Improved Regret Bound and Experience Replay in Regularized Policy
Iteration [22.621710838468097]
We study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation.
We first show that the regret analysis of the Politex algorithm can be sharpened from $O(T3/4)$ to $O(sqrtT)$ under nearly identical assumptions.
Our result provides the first high-probability $O(sqrtT)$ regret bound for a computationally efficient algorithm in this setting.
arXiv Detail & Related papers (2021-02-25T00:55:07Z) - Robust Policy Gradient against Strong Data Corruption [30.910088777897045]
We study the problem of robust reinforcement learning under adversarial corruption on both rewards and transitions.
Our attack model assumes an textitadaptive adversary who can arbitrarily corrupt the reward and transition at every step within an episode.
We develop a Filtered Policy Gradient algorithm that can tolerate even reward corruption and can find an $O(epsilon1/4)$-optimal policy.
arXiv Detail & Related papers (2021-02-11T01:48:38Z) - Optimistic Policy Optimization with Bandit Feedback [70.75568142146493]
We propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $tilde O(sqrtS2 A H4 K)$ regret for previous rewards.
To the best of our knowledge, the two results are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.
arXiv Detail & Related papers (2020-02-19T15:41:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.