Learning Near Optimal Policies with Low Inherent Bellman Error
- URL: http://arxiv.org/abs/2003.00153v3
- Date: Sun, 28 Jun 2020 23:46:53 GMT
- Title: Learning Near Optimal Policies with Low Inherent Bellman Error
- Authors: Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, Emma Brunskill
- Abstract summary: We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning.
We show that exploration is possible using only \emph{batch assumptions} with an algorithm that achieves the optimal statistical rate for the setting we consider.
- Score: 115.16037976819331
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the exploration problem with approximate linear action-value
functions in episodic reinforcement learning under the notion of low inherent
Bellman error, a condition normally employed to show convergence of approximate
value iteration. First we relate this condition to other common frameworks and
show that it is strictly more general than the low rank (or linear) MDP
assumption of prior work. Second we provide an algorithm with a high
probability regret bound $\widetilde O(\sum_{t=1}^H d_t \sqrt{K} + \sum_{t=1}^H \sqrt{d_t} \, \mathcal{I} K)$, where $H$ is the horizon, $K$ is the number of episodes, $\mathcal{I}$ is the value of the inherent Bellman error, and $d_t$ is the feature
dimension at timestep $t$. In addition, we show that the result is unimprovable
beyond constants and logs by showing a matching lower bound. This has two
important consequences: 1) it shows that exploration is possible using only
\emph{batch assumptions} with an algorithm that achieves the optimal
statistical rate for the setting we consider, which is more general than prior
work on low-rank MDPs; 2) the lack of closedness (measured by the inherent
Bellman error) is only amplified by $\sqrt{d_t}$ despite working in the online
setting. Finally, the algorithm reduces to the celebrated \textsc{LinUCB} when
$H=1$ but with a different choice of the exploration parameter that allows
handling misspecified contextual linear bandits. While computational
tractability questions remain open for the MDP setting, this enriches the class
of MDPs with a linear representation for the action-value function where
statistically efficient reinforcement learning is possible.
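The $H=1$ reduction to LinUCB mentioned in the abstract can be made concrete with a short sketch. The Python snippet below is a minimal, illustrative LinUCB variant whose confidence radius is inflated by a term proportional to an assumed misspecification level; the class name, parameter names, and the exact form of the inflated radius are assumptions made for this example and do not reproduce the paper's algorithm or constants.
```python
import numpy as np

# Minimal illustrative sketch (NOT the paper's algorithm): LinUCB whose
# confidence radius is inflated to tolerate a misspecification level `eps`,
# in the spirit of the H = 1 reduction described in the abstract.

class MisspecifiedLinUCB:
    def __init__(self, d, eps=0.0, lam=1.0, sigma=1.0, delta=0.05):
        self.d, self.eps, self.lam = d, eps, lam
        self.sigma, self.delta = sigma, delta
        self.V = lam * np.eye(d)   # regularized Gram matrix
        self.b = np.zeros(d)       # sum of reward-weighted features
        self.t = 0                 # number of observed rounds

    def _radius(self):
        # Standard self-normalized confidence radius plus an extra
        # eps * sqrt(t) term for the model error; this extra term is
        # what differs from vanilla LinUCB (exact constants are an
        # assumption for this sketch).
        _, logdet = np.linalg.slogdet(self.V)
        beta = (self.sigma * np.sqrt(logdet - self.d * np.log(self.lam)
                                     + 2.0 * np.log(1.0 / self.delta))
                + np.sqrt(self.lam))
        return beta + self.eps * np.sqrt(max(self.t, 1))

    def select(self, features):
        # features: (num_actions, d) array of action features for this round.
        theta_hat = np.linalg.solve(self.V, self.b)        # ridge estimate
        V_inv = np.linalg.inv(self.V)
        widths = np.sqrt(np.einsum("ad,de,ae->a", features, V_inv, features))
        ucb = features @ theta_hat + self._radius() * widths
        return int(np.argmax(ucb))

    def update(self, x, reward):
        self.V += np.outer(x, x)
        self.b += reward * x
        self.t += 1
```
A usage loop would draw a feature matrix each round, call select, observe a reward, and call update; the eps parameter here plays a role analogous to the inherent Bellman error term in the regret bound above.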
Related papers
- The Role of Inherent Bellman Error in Offline Reinforcement Learning with Linear Function Approximation [29.69428894587431]
In this paper, we study the offline RL problem with linear function approximation.
Our main structural assumption is that the MDP has low inherent Bellman error.
We show that the scaling of the suboptimality with $\sqrt{\varepsilon_{\mathrm{BE}}}$ cannot be improved for any algorithm.
arXiv Detail & Related papers (2024-06-17T16:04:06Z) - Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit
Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - A Doubly Robust Approach to Sparse Reinforcement Learning [19.68978899041642]
We propose a new regret-minimization algorithm for episodic sparse linear Markov decision processes (SMDPs).
The regret of the proposed algorithm is $\tilde{O}(\sigma^{-1}_{\min} s_\star H \sqrt{N})$, where $\sigma_{\min}$ denotes the restrictive minimum eigenvalue of the average Gram matrix of feature vectors.
arXiv Detail & Related papers (2023-10-23T18:52:17Z) - Provably Efficient Reinforcement Learning via Surprise Bound [66.15308700413814]
We propose a provably efficient reinforcement learning algorithm (both computationally and statistically) with general value function approximations.
Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting.
arXiv Detail & Related papers (2023-02-22T20:21:25Z) - Nearly Minimax Optimal Reinforcement Learning with Linear Function
Approximation [25.60689712525918]
We study reinforcement learning with linear function approximation where the transition probability and reward functions are linear.
We propose a novel efficient algorithm, LSVI-UCB$^+$, which achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound, where $H$ is the episode length, $d$ is the feature dimension, and $T$ is the number of steps.
arXiv Detail & Related papers (2022-06-23T06:04:21Z) - Agnostic Reinforcement Learning with Low-Rank MDPs and Rich Observations [79.66404989555566]
We consider the more realistic setting of agnostic RL with rich observation spaces and a fixed class of policies $\Pi$ that may not contain any near-optimal policy.
We provide an algorithm for this setting whose error is bounded in terms of the rank $d$ of the underlying MDP.
arXiv Detail & Related papers (2021-06-22T03:20:40Z) - Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov
Decision Processes [91.38793800392108]
We study reinforcement learning with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model.
We propose a new, computationally efficient algorithm with linear function approximation named $\text{UCRL-VTR}^+$ for the aforementioned linear mixture MDPs.
To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
arXiv Detail & Related papers (2020-12-15T18:56:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.