Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs
- URL: http://arxiv.org/abs/2305.08359v1
- Date: Mon, 15 May 2023 05:37:32 GMT
- Title: Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs
- Authors: Kaixuan Ji and Qingyue Zhao and Jiafan He and Weitong Zhang and
Quanquan Gu
- Abstract summary: We show that our algorithm achieves an $\tilde{O}\big((d+\log (|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
- Score: 72.40181882916089
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies have shown that episodic reinforcement learning (RL) is no
harder than bandits when the total reward is bounded by $1$, and proved regret
bounds that have a polylogarithmic dependence on the planning horizon $H$.
However, it remains an open question whether such results can be carried over
to adversarial RL, where the reward is adversarially chosen at each episode. In
this paper, we answer this question affirmatively by proposing the first
horizon-free policy search algorithm. To tackle the challenges caused by
exploration and adversarially chosen reward, our algorithm employs (1) a
variance-uncertainty-aware weighted least square estimator for the transition
kernel; and (2) an occupancy measure-based technique for the online search of a
\emph{stochastic} policy. We show that our algorithm achieves an
$\tilde{O}\big((d+\log (|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret
with full-information feedback, where $d$ is the dimension of a known feature
mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is
the number of episodes, $|\mathcal{S}|$ and $|\mathcal{A}|$ are the
cardinalities of the state and action spaces. We also provide hardness results
and regret lower bounds to justify the near optimality of our algorithm and the
unavoidability of $\log|\mathcal{S}|$ and $\log|\mathcal{A}|$ in the regret
bound.
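To make the first algorithmic ingredient concrete, below is a minimal sketch, in Python, of a variance-weighted ridge-regression (weighted least squares) update for the transition parameter of a linear mixture MDP, where the transition kernel is linearly parametrized as $\mathbb{P}(s' \mid s, a) = \langle \phi(s' \mid s, a), \theta^* \rangle$. This is an illustration under simplifying assumptions, not the paper's estimator: the class name, the weight floor `alpha`, and the bonus scale `beta` are hypothetical, and the paper's variance-uncertainty-aware weights are built from explicit variance and uncertainty estimates rather than a user-supplied `var_estimate`.

```python
import numpy as np

class WeightedTransitionEstimator:
    """Minimal sketch of a variance-weighted ridge-regression (weighted least
    squares) estimator for the parameter theta of a linear mixture MDP with
    P(s' | s, a) = <phi(s' | s, a), theta>. Hypothetical interface, for
    illustration only."""

    def __init__(self, dim, lam=1.0):
        # Regularized Gram matrix:  lam * I + sum_i w_i x_i x_i^T
        self.Sigma = lam * np.eye(dim)
        # Weighted moment vector:   sum_i w_i x_i y_i
        self.b = np.zeros(dim)
        self.theta_hat = np.zeros(dim)

    def update(self, x, y, var_estimate, alpha=1e-2):
        """One regression sample:
        x = phi_V(s, a) = sum_{s'} phi(s' | s, a) V(s')  (feature of a value function V),
        y = V(s') at the observed next state,
        var_estimate = an estimate of Var[V(s') | s, a] (assumed given here)."""
        w = 1.0 / max(var_estimate, alpha)   # inverse-variance weight with a floor
        self.Sigma += w * np.outer(x, x)
        self.b += w * x * y
        self.theta_hat = np.linalg.solve(self.Sigma, self.b)
        return self.theta_hat

    def bonus(self, x, beta=1.0):
        """Elliptical exploration bonus  beta * ||x||_{Sigma^{-1}}."""
        return beta * np.sqrt(x @ np.linalg.solve(self.Sigma, x))
```

Each observed transition is down-weighted by its estimated conditional variance, so low-variance samples dominate the Gram matrix; at a high level, this is what allows the confidence radius, and hence the regret, to avoid polynomial dependence on the horizon $H$. The second ingredient, the occupancy measure-based policy search, is not sketched here; it maintains a distribution $q(s,a)$ over state-action pairs and reads the stochastic policy off as $\pi(a \mid s) \propto q(s,a)$, the standard device that turns the online policy search against adversarially chosen rewards into a convex problem.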
Related papers
- Low-Rank Bandits via Tight Two-to-Infinity Singular Subspace Recovery [45.601316850669406]
We present efficient algorithms for policy evaluation, best policy identification and regret minimization.
For policy evaluation and best policy identification, we show that our algorithms are nearly minimax optimal.
All the proposed algorithms consist of two phases: they first leverage spectral methods to estimate the left and right singular subspaces of the low-rank reward matrix.
arXiv Detail & Related papers (2024-02-24T06:36:08Z)
- Best-of-Both-Worlds Algorithms for Linear Contextual Bandits [11.94312915280916]
We study best-of-both-worlds algorithms for $K$-armed linear contextual bandits.
Our algorithms deliver near-optimal regret bounds in both the stochastic and adversarial regimes.
arXiv Detail & Related papers (2023-12-24T08:27:30Z)
- Safe Posterior Sampling for Constrained MDPs with Bounded Constraint Violation [8.849815837266977]
Constrained Markov decision processes (CMDPs) model scenarios of sequential decision making with multiple objectives that are increasingly important in many applications.
We propose the Safe PSRL (posterior sampling-based RL) algorithm that does not need such assumptions and yet performs very well.
We establish a sub-linear $\tilde{\mathcal{O}}\left(H^{2.5} \sqrt{|\mathcal{S}|^2 |\mathcal{A}| K}\right)$ upper bound on the Bayesian reward objective regret along with a bounded constraint violation.
arXiv Detail & Related papers (2023-01-27T06:18:25Z)
- Computationally Efficient Horizon-Free Reinforcement Learning for Linear Mixture MDPs [111.75736569611159]
We propose the first computationally efficient horizon-free algorithm for linear mixture MDPs.
Our algorithm adapts a weighted least squares estimator for the unknown transition dynamics.
This also improves upon the best-known algorithms in this setting when the $\sigma_k^2$'s are known.
arXiv Detail & Related papers (2022-05-23T17:59:18Z)
- Gap-Dependent Unsupervised Exploration for Reinforcement Learning [40.990467706237396]
We present an efficient algorithm for task-agnostic reinforcement learning.
The algorithm takes only $\widetilde{\mathcal{O}}\big(1/\epsilon \cdot (H^3 S A / \rho + H^4 S^2 A)\big)$ episodes of exploration.
We show that, information-theoretically, this bound is nearly tight for $\rho < \Theta(1/(HS))$ and $H>1$.
arXiv Detail & Related papers (2021-08-11T20:42:46Z)
- Contextual Recommendations and Low-Regret Cutting-Plane Algorithms [49.91214213074933]
We consider the following variant of contextual linear bandits motivated by routing applications in navigational engines and recommendation systems.
We design novel cutting-plane algorithms with low "regret" -- the total distance between the true point $w^*$ and the hyperplanes the separation oracle returns.
arXiv Detail & Related papers (2021-06-09T05:39:05Z)
- Provably Breaking the Quadratic Error Compounding Barrier in Imitation Learning, Optimally [58.463668865380946]
We study the statistical limits of Imitation Learning in episodic Markov Decision Processes (MDPs) with a state space $\mathcal{S}$.
We establish an upper bound $O(|\mathcal{S}| H^{3/2} / N)$ for the suboptimality using the Mimic-MD algorithm in Rajaraman et al. (2020).
We show the minimax suboptimality grows as $\Omega(H^{3/2}/N)$ when $|\mathcal{S}| \geq 3$, while the unknown-transition setting suffers from a larger sharp rate.
arXiv Detail & Related papers (2021-02-25T15:50:19Z)
- Variance-Aware Confidence Set: Variance-Dependent Bound for Linear Bandits and Horizon-Free Bound for Linear Mixture MDP [76.94328400919836]
We show how to construct variance-aware confidence sets for linear bandits and linear mixture Markov decision processes (MDPs).
For linear bandits, we obtain an $\widetilde{O}\big(\mathrm{poly}(d)\sqrt{1 + \sum_{i=1}^{K}\sigma_i^2}\big)$ regret bound, where $d$ is the feature dimension.
For linear mixture MDP, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{K})$ regret bound, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-29T18:57:52Z)
- Naive Exploration is Optimal for Online LQR [49.681825576239355]
We show that the optimal regret scales as $\widetilde{\Theta}\big(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T}\big)$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ is the dimension of the input space, and $d_{\mathbf{x}}$ is the dimension of the system state.
Our lower bounds rule out the possibility of a $\mathrm{poly}(\log T)$-regret algorithm, which had been
arXiv Detail & Related papers (2020-01-27T03:44:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.