Learning Infinite-horizon Average-reward MDPs with Linear Function
Approximation
- URL: http://arxiv.org/abs/2007.11849v2
- Date: Mon, 26 Apr 2021 09:12:03 GMT
- Title: Learning Infinite-horizon Average-reward MDPs with Linear Function
Approximation
- Authors: Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Rahul Jain
- Abstract summary: We develop new algorithms for learning Markov Decision Processes in an infinite-horizon average-reward setting with linear function approximation.
Using the optimism principle and assuming that the MDP has a linear structure, we first propose a computationally inefficient algorithm with optimal $\widetilde{O}(\sqrt{T})$ regret.
Next, taking inspiration from adversarial linear bandits, we develop yet another efficient algorithm with $\widetilde{O}(\sqrt{T})$ regret.
- Score: 44.374427255708135
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We develop several new algorithms for learning Markov Decision Processes in
an infinite-horizon average-reward setting with linear function approximation.
Using the optimism principle and assuming that the MDP has a linear structure,
we first propose a computationally inefficient algorithm with optimal
$\widetilde{O}(\sqrt{T})$ regret and another computationally efficient variant
with $\widetilde{O}(T^{3/4})$ regret, where $T$ is the number of interactions.
Next, taking inspiration from adversarial linear bandits, we develop yet
another efficient algorithm with $\widetilde{O}(\sqrt{T})$ regret under a
different set of assumptions, improving the best existing result by Hao et al.
(2020) with $\widetilde{O}(T^{2/3})$ regret. Moreover, we draw a connection
between this algorithm and the Natural Policy Gradient algorithm proposed by
Kakade (2002), and show that our analysis improves the sample complexity bound
recently given by Agarwal et al. (2020).
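For context, the "linear structure" assumption and the regret notion referenced above can be sketched as follows; this uses the standard linear-MDP convention (feature map $\phi$, measure $\mu$, vector $\theta$, optimal average reward $J^*$), which is assumed notation rather than a quotation from the paper.

```latex
% Standard linear MDP assumption (notation assumed, not quoted from the paper):
% transitions and rewards are linear in a known d-dimensional feature map \phi(s,a).
P(s' \mid s, a) = \langle \phi(s, a), \mu(s') \rangle ,
\qquad
r(s, a) = \langle \phi(s, a), \theta \rangle .

% Average-reward regret after T interactions, measured against the optimal
% long-run average reward J^* of the MDP:
R_T = \sum_{t=1}^{T} \bigl( J^{*} - r(s_t, a_t) \bigr) .
```

Under such an assumption, the action-value function of any policy is itself linear in $\phi$, which is what makes least-squares-style optimistic algorithms and $\widetilde{O}(\sqrt{T})$-type bounds plausible in this setting.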
Related papers
- Learning Infinite-Horizon Average-Reward Linear Mixture MDPs of Bounded Span [16.49229317664822]
This paper proposes a computationally tractable algorithm for learning infinite-horizon average-reward linear mixture Markov decision processes (MDPs).
Our algorithm for linear mixture MDPs achieves a nearly minimax optimal regret upper bound of $\widetilde{\mathcal{O}}(d\sqrt{\mathrm{sp}(v^*)T})$ over $T$ time steps.
arXiv Detail & Related papers (2024-10-19T05:45:50Z)
- Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation [1.8416014644193066]
We propose an algorithm for learning linear Markov decision processes (MDPs) and linear mixture MDPs under the Bellman optimality condition.
Our algorithm for linear MDPs achieves the best-known regret upper bound of $\widetilde{\mathcal{O}}(d^{3/2}\mathrm{sp}(v^*)\sqrt{T})$ over $T$ time steps.
For linear mixture MDPs, our algorithm attains a regret bound of $\widetilde{\mathcal{O}}(d\cdot\ldots)$
arXiv Detail & Related papers (2024-09-16T23:13:42Z)
- Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs via Approximation by Discounted-Reward MDPs [16.49229317664822]
We study infinite-horizon average-reward reinforcement learning with linear MDPs.
In this paper, we propose a computationally efficient algorithm that achieves a regret bound of $\widetilde{O}(\sqrt{T})$.
arXiv Detail & Related papers (2024-05-23T20:58:33Z)
- Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z)
- Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve the regret to $\tilde{\mathcal{O}}(K^{2/3})$ in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z)
- Randomized Exploration for Reinforcement Learning with General Value
Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises (a minimal sketch of this perturbation idea appears after this list).
We complement the theory with an empirical evaluation across known difficult exploration tasks.
arXiv Detail & Related papers (2021-06-15T02:23:07Z)
- Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function
Approximation [92.3161051419884]
We study reinforcement learning for finite-horizon episodic Markov decision processes with adversarial rewards and full-information feedback.
We show that it can achieve $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode.
We also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors.
arXiv Detail & Related papers (2021-02-17T18:54:08Z)
- Nearly Minimax Optimal Regret for Learning Infinite-horizon
Average-reward MDPs with Linear Function Approximation [95.80683238546499]
We propose a new algorithm UCRL2-VTR, which can be seen as an extension of the UCRL2 algorithm with linear function approximation.
We show that UCRL2-VTR with Bernstein-type bonus can achieve a regret of $\tilde{O}(d\sqrt{DT})$, where $d$ is the dimension of the feature mapping.
We also prove a matching lower bound $\tilde{\Omega}(d\sqrt{DT})$, which suggests that the proposed UCRL2-VTR is minimax optimal up to logarithmic factors.
arXiv Detail & Related papers (2021-02-15T02:08:39Z)
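As a companion to the randomized-exploration entry above, here is a minimal, hypothetical sketch of the data-perturbation idea behind RLSVI-style algorithms; the function name and parameters below are illustrative assumptions, not taken from that paper.

```python
import numpy as np

def perturbed_least_squares(features, targets, noise_std=1.0, reg=1.0, rng=None):
    """RLSVI-style fit (illustrative sketch): ridge regression on targets
    perturbed with i.i.d. Gaussian noise.

    features: (n, d) array of state-action features phi(s, a).
    targets:  (n,) array of regression targets (e.g., reward + next-state value).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Adding i.i.d. scalar noise to each target is what drives exploration,
    # playing a role similar to sampling value-function parameters from a posterior.
    noisy_targets = targets + rng.normal(0.0, noise_std, size=targets.shape)
    d = features.shape[1]
    A = features.T @ features + reg * np.eye(d)
    b = features.T @ noisy_targets
    return np.linalg.solve(A, b)

# Toy usage with synthetic data, just to show the call.
rng = np.random.default_rng(0)
phi = rng.normal(size=(200, 4))
true_w = np.array([1.0, -0.5, 0.3, 2.0])
y = phi @ true_w + rng.normal(0.0, 0.1, size=200)
w_hat = perturbed_least_squares(phi, y, noise_std=0.5, rng=rng)
print(np.round(w_hat, 2))
```

In a full algorithm, a fit like this would typically be repeated every episode (or every value-iteration step) with fresh noise draws; choosing the noise scale carefully is where the "judiciously chosen" part of that abstract comes in.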