Policy Optimization in Adversarial MDPs: Improved Exploration via
Dilated Bonuses
- URL: http://arxiv.org/abs/2107.08346v1
- Date: Sun, 18 Jul 2021 02:30:48 GMT
- Title: Policy Optimization in Adversarial MDPs: Improved Exploration via
Dilated Bonuses
- Authors: Haipeng Luo, Chen-Yu Wei, Chung-Wei Lee
- Abstract summary: We develop a general solution that adds dilated bonuses to the policy update to facilitate global exploration.
We apply it to several episodic MDP settings with adversarial losses and bandit feedback.
When a simulator is unavailable, we further consider a linear MDP setting and obtain $\widetilde{\mathcal{O}}(T^{14/15})$ regret.
- Score: 40.12297110530343
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Policy optimization is a widely-used method in reinforcement learning. Due to
its local-search nature, however, theoretical guarantees on global optimality
often rely on extra assumptions on the Markov Decision Processes (MDPs) that
bypass the challenge of global exploration. To eliminate the need of such
assumptions, in this work, we develop a general solution that adds dilated
bonuses to the policy update to facilitate global exploration. To showcase the
power and generality of this technique, we apply it to several episodic MDP
settings with adversarial losses and bandit feedback, improving and
generalizing the state-of-the-art. Specifically, in the tabular case, we obtain
$\widetilde{\mathcal{O}}(\sqrt{T})$ regret where $T$ is the number of episodes,
improving the $\widetilde{\mathcal{O}}({T}^{2/3})$ regret bound by Shani et al.
(2020). When the number of states is infinite, under the assumption that the
state-action values are linear in some low-dimensional features, we obtain
$\widetilde{\mathcal{O}}({T}^{2/3})$ regret with the help of a simulator,
matching the result of Neu and Olkhovskaya (2020) while importantly removing
the need of an exploratory policy that their algorithm requires. When a
simulator is unavailable, we further consider a linear MDP setting and obtain
$\widetilde{\mathcal{O}}({T}^{14/15})$ regret, which is the first result for
linear MDPs with adversarial losses and bandit feedback.
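To make the dilated-bonus technique described in the abstract concrete, below is a minimal tabular sketch under simplifying assumptions (known transitions, placeholder loss-Q estimates `Q_hat`, base bonus `b`, and learning rate `eta`); it illustrates the idea only and is not the authors' full algorithm. A per-state-action bonus is propagated backward through a dilated Bellman-style recursion with a $(1+1/H)$ factor, and the policy is then updated by exponential weights on the dilated bonus minus the estimated loss-Q values.
```python
import numpy as np

# Minimal illustrative sketch of a dilated-bonus policy update (tabular, known transitions).
# NOT the paper's full algorithm: b, Q_hat, and eta are placeholder inputs here.

def dilated_bonus(P, pi, b, H):
    """Backward recursion
        B_h(s, a) = b_h(s, a) + (1 + 1/H) * E_{s' ~ P_h(.|s,a)}[ sum_a' pi_{h+1}(a'|s') B_{h+1}(s', a') ],
    with B_H = 0. Shapes: P[h] is (S, A, S); pi[h] and b[h] are (S, A)."""
    B = [None] * H
    exp_next = np.zeros(P[0].shape[-1])        # E_{a' ~ pi}[B(s', a')] one layer below; zero past the horizon
    for h in range(H - 1, -1, -1):
        B[h] = b[h] + (1.0 + 1.0 / H) * (P[h] @ exp_next)   # (S, A, S) @ (S,) -> (S, A)
        exp_next = (pi[h] * B[h]).sum(axis=1)
    return B

def policy_update(pi, Q_hat, B, eta):
    """Exponential-weights step: pi'(a|s) proportional to pi(a|s) * exp(eta * (B(s,a) - Q_hat(s,a)))."""
    new_pi = []
    for p, q, bonus in zip(pi, Q_hat, B):
        logits = np.log(p) + eta * (bonus - q)
        w = np.exp(logits - logits.max(axis=1, keepdims=True))  # stabilize before normalizing
        new_pi.append(w / w.sum(axis=1, keepdims=True))
    return new_pi

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H, S, A = 3, 4, 2
    P = [rng.dirichlet(np.ones(S), size=(S, A)) for _ in range(H)]  # P[h][s, a] is a distribution over s'
    pi = [np.full((S, A), 1.0 / A) for _ in range(H)]               # start from the uniform policy
    Q_hat = [rng.random((S, A)) for _ in range(H)]                  # stand-in loss-Q estimates
    b = [0.1 * np.ones((S, A)) for _ in range(H)]                   # stand-in per-(s, a) exploration bonus
    B = dilated_bonus(P, pi, b, H)
    pi = policy_update(pi, Q_hat, B, eta=0.1)
    print(np.round(pi[0], 3))
```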
Related papers
- Narrowing the Gap between Adversarial and Stochastic MDPs via Policy Optimization [11.11876897168701]
We propose an algorithm that achieves a regret bound of order $\tilde{\mathcal{O}}(\mathrm{poly}(H)\sqrt{SAT})$.
The proposed algorithm and analysis completely avoid occupancy measures, the typical tool in this setting.
arXiv Detail & Related papers (2024-07-08T08:06:45Z)
- Scalable Primal-Dual Actor-Critic Method for Safe Multi-Agent RL with General Utilities [12.104551746465932]
We investigate safe multi-agent reinforcement learning, where agents seek to collectively maximize an aggregate sum of local objectives while satisfying their own safety constraints.
Our algorithm converges to a first-order stationary point (FOSP) at the rate of $\mathcal{O}\left(T^{-2/3}\right)$.
In the sample-based setting, we demonstrate that, with high probability, our algorithm requires $\widetilde{\mathcal{O}}\left(\epsilon^{-3.5}\right)$ samples to achieve an $\epsilon$-FOSP.
arXiv Detail & Related papers (2023-05-27T20:08:35Z)
- Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve the regret to $\tilde{\mathcal{O}}(K^{2/3})$ in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z)
- Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards (the redistribution step is sketched below).
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in a general MDP with trajectory-wise reward.
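As a rough illustration of the least-squares redistribution step mentioned above (a hedged sketch, not PARTED's exact estimator): assuming per-step proxy rewards that are linear in known features, the trajectory-level returns are regressed onto the per-step features summed within each trajectory; the ridge term and all names below are illustrative.
```python
import numpy as np

# Illustrative least-squares reward redistribution (a sketch, not PARTED's exact estimator):
# each trajectory only reveals its total return R_i; we fit theta so that
# sum_h phi(s_{i,h}, a_{i,h})^T theta is close to R_i, then use phi(s, a)^T theta
# as a per-step proxy reward.

def redistribute_rewards(step_features, returns, ridge=1e-3):
    """step_features: list of (H, d) arrays of per-step features; returns: (N,) trajectory returns.
    Solves min_theta sum_i (sum_h phi_{i,h}^T theta - R_i)^2 + ridge * ||theta||^2."""
    X = np.stack([f.sum(axis=0) for f in step_features])   # (N, d): features summed within each trajectory
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ returns)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    N, H, d = 200, 5, 3
    feats = [rng.random((H, d)) for _ in range(N)]
    true_theta = np.array([1.0, -0.5, 2.0])                 # synthetic ground truth for the demo
    R = np.array([(f @ true_theta).sum() for f in feats])   # trajectory-wise returns only
    theta_hat = redistribute_rewards(feats, R)
    print(np.round(theta_hat, 3))                           # per-step proxy reward is phi(s, a) @ theta_hat
```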
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
- Towards Painless Policy Optimization for Constrained MDPs [46.12526917024248]
We study policy optimization in an infinite-horizon, $\gamma$-discounted constrained Markov decision process (CMDP).
Our objective is to return a policy that achieves large expected reward with a small constraint violation.
We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms.
arXiv Detail & Related papers (2022-04-11T15:08:09Z)
- Provably Breaking the Quadratic Error Compounding Barrier in Imitation Learning, Optimally [58.463668865380946]
We study the statistical limits of Imitation Learning in episodic Markov Decision Processes (MDPs) with a state space $\mathcal{S}$.
We establish an upper bound of $O(|\mathcal{S}|H^{3/2}/N)$ on the suboptimality of the Mimic-MD algorithm of Rajaraman et al. (2020).
We show that the minimax suboptimality grows as $\Omega(H^{3/2}/N)$ when $|\mathcal{S}| \geq 3$, while the unknown-transition setting suffers from a larger sharp rate.
arXiv Detail & Related papers (2021-02-25T15:50:19Z)
- Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)
- Optimistic Policy Optimization with Bandit Feedback [70.75568142146493]
We propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $\tilde{O}(\sqrt{S^2 A H^4 K})$ regret for stochastic rewards (a generic sketch of such an optimistic policy-update step follows this entry).
To the best of our knowledge, the two results are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.
arXiv Detail & Related papers (2020-02-19T15:41:18Z)
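To close the list, here is a generic sketch of the kind of optimistic policy-optimization step mentioned in the last entry (a hedged illustration, not that paper's TRPO variant): an exploration bonus is added to an estimated Q-function and the policy takes a KL-regularized (mirror-descent) step, whose closed form is an exponential-weights update; `Q_hat`, `bonus`, and `eta` are placeholders.
```python
import numpy as np

# Generic optimistic mirror-descent policy step (illustrative only):
# pi'(a|s) proportional to pi(a|s) * exp(eta * (Q_hat(s,a) + bonus(s,a))),
# the closed form of argmax_pi <pi, Q_hat + bonus> - (1/eta) * KL(pi || pi_old).

def optimistic_policy_step(pi, Q_hat, bonus, eta):
    logits = np.log(pi) + eta * (Q_hat + bonus)               # pi, Q_hat, bonus: arrays of shape (S, A)
    w = np.exp(logits - logits.max(axis=1, keepdims=True))    # stabilize before normalizing
    return w / w.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    S, A = 4, 3
    pi = np.full((S, A), 1.0 / A)                              # start from the uniform policy
    Q_hat = rng.random((S, A))                                 # stand-in empirical Q estimate
    bonus = 0.2 / np.sqrt(1.0 + rng.integers(0, 10, (S, A)))   # stand-in count-based optimism bonus
    print(np.round(optimistic_policy_step(pi, Q_hat, bonus, eta=0.5), 3))
```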