Refined Regret for Adversarial MDPs with Linear Function Approximation
- URL: http://arxiv.org/abs/2301.12942v2
- Date: Thu, 1 Jun 2023 22:43:47 GMT
- Title: Refined Regret for Adversarial MDPs with Linear Function Approximation
- Authors: Yan Dai, Haipeng Luo, Chen-Yu Wei, Julian Zimmert
- Abstract summary: We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve the best existing $\tilde{\mathcal O}(K^{2/3})$ regret to $\tilde{\mathcal O}(\sqrt K)$ in the same setting.
- Score: 50.00022394876222
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider learning in an adversarial Markov Decision Process (MDP) where
the loss functions can change arbitrarily over $K$ episodes and the state space
can be arbitrarily large. We assume that the Q-function of any policy is linear
in some known features, that is, a linear function approximation exists. The
best existing regret upper bound for this setting (Luo et al., 2021) is of
order $\tilde{\mathcal O}(K^{2/3})$ (omitting all other dependencies), given
access to a simulator. This paper provides two algorithms that improve the
regret to $\tilde{\mathcal O}(\sqrt K)$ in the same setting. Our first
algorithm makes use of a refined analysis of the Follow-the-Regularized-Leader
(FTRL) algorithm with the log-barrier regularizer. This analysis allows the
loss estimators to be arbitrarily negative and might be of independent
interest. Our second algorithm develops a magnitude-reduced loss estimator,
further removing the polynomial dependency on the number of actions in the
first algorithm and leading to the optimal regret bound (up to logarithmic
terms and dependency on the horizon). Moreover, we also extend the first
algorithm to simulator-free linear MDPs, which achieves $\tilde{\mathcal
O}(K^{8/9})$ regret and greatly improves over the best existing bound
$\tilde{\mathcal O}(K^{14/15})$. This algorithm relies on a better alternative
to the Matrix Geometric Resampling procedure by Neu & Olkhovskaya (2020), which
could again be of independent interest.
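The paper's first algorithm builds on FTRL with a log-barrier regularizer whose refined analysis tolerates arbitrarily negative loss estimators. As a minimal, hypothetical sketch of that regularizer (not the paper's actual algorithm, which operates on policies with simulator access), the snippet below solves the per-round log-barrier FTRL update over a probability simplex by bisection on the KKT multiplier; the function name, the step size `eta`, and the toy loop with Gaussian loss estimators are illustrative assumptions.

```python
import numpy as np

def log_barrier_ftrl_step(cum_loss, eta, tol=1e-10, max_iter=200):
    """One FTRL step with the log-barrier regularizer over the simplex.

    Solves  p = argmin_{p in simplex}  <p, cum_loss> + (1/eta) * sum_a log(1/p_a).
    The KKT conditions give p_a = 1 / (eta * (cum_loss_a + lam)) for a scalar
    lam > -min_a cum_loss_a chosen so that p sums to 1; we find lam by bisection.
    """
    cum_loss = np.asarray(cum_loss, dtype=float)
    lo = -cum_loss.min() + 1e-12                        # total mass blows up here
    hi = -cum_loss.min() + len(cum_loss) / eta + 1.0    # total mass < 1 here
    lam = hi
    for _ in range(max_iter):
        lam = 0.5 * (lo + hi)
        total = np.sum(1.0 / (eta * (cum_loss + lam)))
        if abs(total - 1.0) < tol:
            break
        if total > 1.0:
            lo = lam    # too much mass -> increase the multiplier
        else:
            hi = lam
    p = 1.0 / (eta * (cum_loss + lam))
    return p / p.sum()  # renormalize to absorb residual bisection error

# Toy usage: loss estimators may be arbitrarily negative, which the refined
# log-barrier analysis is designed to handle.
rng = np.random.default_rng(0)
eta, num_actions = 0.05, 5
cum_loss = np.zeros(num_actions)
for t in range(1000):
    p = log_barrier_ftrl_step(cum_loss, eta)
    loss_hat = rng.normal(size=num_actions) * 3.0   # stand-in loss estimators
    cum_loss += loss_hat
```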
Related papers
- Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation [1.8416014644193066]
We propose an algorithm for learning linear Markov decision processes (MDPs) and linear mixture MDPs under the Bellman optimality condition.
Our algorithm for linear MDPs achieves the best-known regret upper bound of $\widetilde{\mathcal O}(d^{3/2}\mathrm{sp}(v^*)\sqrt{T})$ over $T$ time steps.
For linear mixture MDPs, our algorithm attains a regret bound of $\widetilde{\mathcal O}(d\cdot\ldots)$
arXiv Detail & Related papers (2024-09-16T23:13:42Z) - Achieving Tractable Minimax Optimal Regret in Average Reward MDPs [19.663336027878408]
We present the first tractable algorithm with minimax optimal regret of $\widetilde{\mathrm O}(\sqrt{\mathrm{sp}(h^*) S A T})$.
Remarkably, our algorithm does not require prior information on $\mathrm{sp}(h^*)$.
arXiv Detail & Related papers (2024-06-03T11:53:44Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound on the cost function or an upper bound on the expected length of the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Best-of-Both-Worlds Algorithms for Linear Contextual Bandits [11.94312915280916]
We study best-of-both-worlds algorithms for $K$-armed linear contextual bandits.
Our algorithms deliver near-optimal regret bounds in both the stochastic and adversarial regimes.
arXiv Detail & Related papers (2023-12-24T08:27:30Z) - A Doubly Robust Approach to Sparse Reinforcement Learning [19.68978899041642]
We propose a new regret-minimization algorithm for episodic sparse linear Markov decision processes (SMDPs).
The regret of the proposed algorithm is $\tilde{O}(\sigma^{-1}_{\min} s_{\star} H \sqrt{N})$, where $\sigma_{\min}$ denotes the restrictive minimum eigenvalue of the average Gram matrix of feature vectors.
arXiv Detail & Related papers (2023-10-23T18:52:17Z) - Provably Efficient Reinforcement Learning via Surprise Bound [66.15308700413814]
We propose a provably efficient reinforcement learning algorithm (both computationally and statistically) with general value function approximations.
Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting.
arXiv Detail & Related papers (2023-02-22T20:21:25Z) - Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision
Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde{O}(d\sqrt{H^3K})$.
Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
arXiv Detail & Related papers (2022-12-12T18:58:59Z) - Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function
Approximation [92.3161051419884]
We study reinforcement learning for finite-horizon episodic Markov decision processes with adversarial reward and full information feedback.
We propose an algorithm that achieves $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode.
We also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors.
arXiv Detail & Related papers (2021-02-17T18:54:08Z) - Nearly Minimax Optimal Regret for Learning Infinite-horizon
Average-reward MDPs with Linear Function Approximation [95.80683238546499]
We propose a new algorithm UCRL2-VTR, which can be seen as an extension of the UCRL2 algorithm with linear function approximation.
We show that UCRL2-VTR with Bernstein-type bonus can achieve a regret of $\tilde{O}(d\sqrt{DT})$, where $d$ is the dimension of the feature mapping.
We also prove a matching lower bound $\tilde{\Omega}(d\sqrt{DT})$, which suggests that the proposed UCRL2-VTR is minimax optimal up to logarithmic factors.
arXiv Detail & Related papers (2021-02-15T02:08:39Z) - Learning Infinite-horizon Average-reward MDPs with Linear Function
Approximation [44.374427255708135]
We develop new algorithms for learning Markov Decision Processes in an infinite-horizon average-reward setting with linear function approximation.
Using the optimism principle and assuming that the MDP has a linear structure, we first propose a computationally inefficient algorithm with optimal $\widetilde{O}(\sqrt{T})$ regret.
Next, taking inspiration from adversarial linear bandits, we develop yet another efficient algorithm with $\widetilde{O}(\sqrt{T})$ regret.
arXiv Detail & Related papers (2020-07-23T08:23:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.