Nearly Minimax Optimal Reinforcement Learning with Linear Function
Approximation
- URL: http://arxiv.org/abs/2206.11489v1
- Date: Thu, 23 Jun 2022 06:04:21 GMT
- Title: Nearly Minimax Optimal Reinforcement Learning with Linear Function
Approximation
- Authors: Pihe Hu, Yu Chen, Longbo Huang
- Abstract summary: We study reinforcement learning with linear function approximation where the transition probability and reward functions are linear.
We propose a novel computation-efficient algorithm, LSVI-UCB$^+$, which achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound, where $H$ is the episode length, $d$ is the feature dimension, and $T$ is the number of steps.
- Score: 25.60689712525918
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study reinforcement learning with linear function approximation where the
transition probability and reward functions are linear with respect to a
feature mapping $\boldsymbol{\phi}(s,a)$. Specifically, we consider the
episodic inhomogeneous linear Markov Decision Process (MDP), and propose a
novel computation-efficient algorithm, LSVI-UCB$^+$, which achieves an
$\widetilde{O}(Hd\sqrt{T})$ regret bound where $H$ is the episode length, $d$
is the feature dimension, and $T$ is the number of steps. LSVI-UCB$^+$ builds
on weighted ridge regression and upper confidence value iteration with a
Bernstein-type exploration bonus. Our statistical results are obtained with
novel analytical tools, including a new Bernstein self-normalized bound with
conservatism on elliptical potentials, and refined analysis of the correction
term. To the best of our knowledge, this is the first minimax optimal algorithm
for linear MDPs up to logarithmic factors, which closes the $\sqrt{Hd}$ gap
between the best known upper bound of $\widetilde{O}(\sqrt{H^3d^3T})$ in
\cite{jin2020provably} and lower bound of $\Omega(Hd\sqrt{T})$ for linear MDPs.
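A minimal sketch (not the authors' implementation) of the two ingredients the abstract names: a variance-weighted ridge regression for the linear value estimate and an elliptical-potential exploration bonus whose radius would, in LSVI-UCB$^+$, come from a Bernstein-type concentration bound. The feature matrix `Phi`, the per-sample weights, and the scalar `beta` are illustrative assumptions.

```python
import numpy as np

def weighted_ridge_regression(Phi, y, weights, lam=1.0):
    """Variance-weighted ridge regression.

    Phi:     (n, d) feature matrix, rows phi(s_i, a_i)
    y:       (n,) regression targets, e.g. r_i + max_a' Q_next(s'_i, a')
    weights: (n,) per-sample weights, e.g. 1 / sigma_i^2 for an estimated
             target variance sigma_i^2 (illustrative choice)

    Returns the weighted Gram matrix Lambda and the weight vector w_hat with
        Lambda = lam * I + sum_i weights_i * phi_i phi_i^T
        w_hat  = Lambda^{-1} * sum_i weights_i * phi_i * y_i
    """
    d = Phi.shape[1]
    weighted_Phi = Phi * weights[:, None]
    Lambda = lam * np.eye(d) + weighted_Phi.T @ Phi
    w_hat = np.linalg.solve(Lambda, weighted_Phi.T @ y)
    return Lambda, w_hat

def optimistic_q_value(phi_sa, w_hat, Lambda, beta):
    """Optimistic Q-value: linear estimate plus the elliptical-potential
    bonus beta * sqrt(phi^T Lambda^{-1} phi); beta is taken as a given
    scalar here rather than the paper's Bernstein-type radius."""
    bonus = beta * np.sqrt(phi_sa @ np.linalg.solve(Lambda, phi_sa))
    return phi_sa @ w_hat + bonus
```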
Related papers
- Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation [1.8416014644193066]
We propose an algorithm for learning linear Markov decision processes (MDPs) and linear mixture MDPs under the Bellman optimality condition.
Our algorithm for linear MDPs achieves the best-known regret upper bound of $\widetilde{\mathcal{O}}(d^{3/2}\mathrm{sp}(v^*)\sqrt{T})$ over $T$ time steps.
For linear mixture MDPs, our algorithm attains a regret bound of $\widetilde{\mathcal{O}}(d \cdot \ldots)$
arXiv Detail & Related papers (2024-09-16T23:13:42Z) - Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit
Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound on the cost function or an upper bound on the expected length of the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Randomized Exploration for Reinforcement Learning with General Value
Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises (a minimal sketch of this idea appears after this list).
We complement the theory with an empirical evaluation across known difficult exploration tasks.
arXiv Detail & Related papers (2021-06-15T02:23:07Z) - Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function
Approximation [92.3161051419884]
We study the reinforcement learning for finite-horizon episodic Markov decision processes with adversarial reward and full information feedback.
We show that it can achieve $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode.
We also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors.
arXiv Detail & Related papers (2021-02-17T18:54:08Z) - Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov
Decision Processes [91.38793800392108]
We study reinforcement learning with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model.
We propose a new, computationally efficient algorithm with linear function approximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixture MDPs.
To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
arXiv Detail & Related papers (2020-12-15T18:56:46Z) - Logarithmic Regret for Reinforcement Learning with Linear Function
Approximation [99.59319332864129]
We show that logarithmic regret is attainable under two recently proposed linear MDP assumptions.
To the best of our knowledge, these are the first logarithmic regret bounds for RL with linear function approximation.
arXiv Detail & Related papers (2020-11-23T17:25:00Z)
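As a companion to the randomized-exploration entry above (RLSVI-style perturbation), here is a minimal sketch of the idea it describes: fitting a linear value estimate to regression targets perturbed with i.i.d. noise, so that exploration comes from randomization rather than an explicit bonus. The Gaussian noise, `noise_scale`, and the helper name are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def perturbed_lsvi_step(Phi, targets, noise_scale, lam=1.0, rng=None):
    """One regression step of RLSVI-style randomized exploration:
    fit a linear Q-estimate to targets perturbed by i.i.d. Gaussian noise.

    Phi:         (n, d) feature matrix phi(s_i, a_i)
    targets:     (n,) regression targets r_i + max_a' Q_next(s'_i, a')
    noise_scale: standard deviation of the i.i.d. perturbations
                 (illustrative choice, not the paper's tuned value)
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = targets + noise_scale * rng.standard_normal(len(targets))
    Lambda = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi
    w_tilde = np.linalg.solve(Lambda, Phi.T @ noisy)
    return w_tilde  # randomized weights; Q(s, a) is approx. phi(s, a) @ w_tilde
```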