Learning Infinite-Horizon Average-Reward Linear Mixture MDPs of Bounded Span
- URL: http://arxiv.org/abs/2410.14992v1
- Date: Sat, 19 Oct 2024 05:45:50 GMT
- Title: Learning Infinite-Horizon Average-Reward Linear Mixture MDPs of Bounded Span
- Authors: Woojin Chae, Kihyuk Hong, Yufan Zhang, Ambuj Tewari, Dabeen Lee,
- Abstract summary: This paper proposes a computationally tractable algorithm for learning infinite-horizon average-reward linear mixture Markov decision processes (MDPs)
Our algorithm for linear mixture MDPs achieves a nearly minimax optimal regret upper bound of $widetildemathcalO(dsqrtmathrmsp(v*)T)$ over $T$ time steps.
- Score: 16.49229317664822
- License:
- Abstract: This paper proposes a computationally tractable algorithm for learning infinite-horizon average-reward linear mixture Markov decision processes (MDPs) under the Bellman optimality condition. Our algorithm for linear mixture MDPs achieves a nearly minimax optimal regret upper bound of $\widetilde{\mathcal{O}}(d\sqrt{\mathrm{sp}(v^*)T})$ over $T$ time steps where $\mathrm{sp}(v^*)$ is the span of the optimal bias function $v^*$ and $d$ is the dimension of the feature mapping. Our algorithm applies the recently developed technique of running value iteration on a discounted-reward MDP approximation with clipping by the span. We prove that the value iteration procedure, even with the clipping operation, converges. Moreover, we show that the associated variance term due to random transitions can be bounded even under clipping. Combined with the weighted ridge regression-based parameter estimation scheme, this leads to the nearly minimax optimal regret guarantee.
Related papers
- Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation [1.8416014644193066]
We propose an algorithm for learning linear Markov decision processes (MDPs) and linear mixture MDPs under the Bellman optimality condition.
Our algorithm for linear MDPs achieves the best-known regret upper bound of $widetildemathcalO(d3/2mathrmsp(v*)sqrtT)$ over $T$ time steps.
For linear mixture MDPs, our algorithm attains a regret bound of $widetildemathcalO(dcdotmathrm
arXiv Detail & Related papers (2024-09-16T23:13:42Z) - Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs via Approximation by Discounted-Reward MDPs [16.49229317664822]
We study the infinite-horizon average-reward reinforcement learning with linear MDPs.
In this paper, we propose that the regret bound of $widetildeO(sqrtT)$ achieves an algorithm that is computationally efficient.
arXiv Detail & Related papers (2024-05-23T20:58:33Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with a environment and seeks to reach certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision
Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $tilde O(dsqrtH3K)$.
Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
arXiv Detail & Related papers (2022-12-12T18:58:59Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best identification in discounted linear Markov+Delta Decision in the fixed confidence setting under a generative model.
The lower bound as the solution of an intricate non- optimization program can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function
Approximation [92.3161051419884]
We study the reinforcement learning for finite-horizon episodic Markov decision processes with adversarial reward and full information feedback.
We show that it can achieve $tildeO(dHsqrtT)$ regret, where $H$ is the length of the episode.
We also prove a matching lower bound of $tildeOmega(dHsqrtT)$ up to logarithmic factors.
arXiv Detail & Related papers (2021-02-17T18:54:08Z) - Learning Infinite-horizon Average-reward MDPs with Linear Function
Approximation [44.374427255708135]
We develop new algorithms for learning Markov Decision Processes in an infinite-horizon average-reward setting with linear function approximation.
Using the optimism principle and assuming that the MDP has a linear structure, we first propose a computationally inefficient algorithm with optimal $widetildeO(sqrtT)$ regret.
Next, taking inspiration from adversarial linear bandits, we develop yet another efficient algorithm with $widetildeO(sqrtT)$ regret.
arXiv Detail & Related papers (2020-07-23T08:23:44Z) - Provably Efficient Reinforcement Learning for Discounted MDPs with
Feature Mapping [99.59319332864129]
In this paper, we study reinforcement learning for discounted Decision (MDP)
We propose a novel algorithm that makes use of the feature mapping and obtains a $tilde O(dsqrtT/ (1-gamma)2)$ regret.
Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $ (1-gamma)-0.5$ factor.
arXiv Detail & Related papers (2020-06-23T17:08:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.