Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov
Decision Processes
- URL: http://arxiv.org/abs/2012.08507v2
- Date: Thu, 7 Jan 2021 18:57:11 GMT
- Title: Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov
Decision Processes
- Authors: Dongruo Zhou and Quanquan Gu and Csaba Szepesvari
- Abstract summary: We study reinforcement learning with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model.
We propose a new, computationally efficient algorithm with linear function approximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixture MDPs.
To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
- Score: 91.38793800392108
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study reinforcement learning (RL) with linear function approximation where
the underlying transition probability kernel of the Markov decision process
(MDP) is a linear mixture model (Jia et al., 2020; Ayoub et al., 2020; Zhou et
al., 2020) and the learning agent has access to either an integration or a
sampling oracle of the individual basis kernels. We propose a new
Bernstein-type concentration inequality for self-normalized martingales for
linear bandit problems with bounded noise. Based on the new inequality, we
propose a new, computationally efficient algorithm with linear function
approximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixture
MDPs in the episodic undiscounted setting. We show that $\text{UCRL-VTR}^{+}$
attains an $\tilde O(dH\sqrt{T})$ regret where $d$ is the dimension of feature
mapping, $H$ is the length of the episode and $T$ is the number of interactions
with the MDP. We also prove a matching lower bound $\Omega(dH\sqrt{T})$ for
this setting, which shows that $\text{UCRL-VTR}^{+}$ is minimax optimal up to
logarithmic factors. In addition, we propose the $\text{UCLK}^{+}$ algorithm
for the same family of MDPs under discounting and show that it attains an
$\tilde O(d\sqrt{T}/(1-\gamma)^{1.5})$ regret, where $\gamma\in [0,1)$ is the
discount factor. Our upper bound matches the lower bound
$\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$ proved by Zhou et al. (2020) up to
logarithmic factors, suggesting that $\text{UCLK}^{+}$ is nearly minimax
optimal. To the best of our knowledge, these are the first computationally
efficient, nearly minimax optimal algorithms for RL with linear function
approximation.
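For intuition, below is a minimal, hypothetical sketch of the value-targeted ridge-regression update that UCRL-VTR-style algorithms build on: the unknown transition parameter of the linear mixture MDP is estimated by regressing realized next-state values onto the oracle-provided features, and an exploration bonus of the form $\beta\|\phi\|_{\Sigma^{-1}}$ makes the value estimates optimistic. The class name, the fixed confidence radius `beta`, and the plain (non-Bernstein) bonus are illustrative assumptions, not the paper's actual construction.

```python
import numpy as np

# Minimal sketch (not the authors' implementation) of a value-targeted
# ridge-regression update with a UCB-style bonus, as used in
# UCRL-VTR-style algorithms for linear mixture MDPs.
# Assumption: phi = phi(s, a, V) is the feature returned by the
# integration/sampling oracle, and beta is a hand-picked confidence
# radius rather than the paper's Bernstein-type radius.

class ValueTargetedRegression:
    def __init__(self, d: int, lam: float = 1.0, beta: float = 1.0):
        self.Sigma = lam * np.eye(d)   # regularized Gram matrix
        self.b = np.zeros(d)           # accumulated feature * target vector
        self.beta = beta               # exploration (confidence) radius

    def update(self, phi: np.ndarray, target: float) -> None:
        """Add one transition: phi = phi(s_h, a_h, V_{h+1}), target = V_{h+1}(s_{h+1})."""
        self.Sigma += np.outer(phi, phi)
        self.b += phi * target

    def optimistic_value(self, phi: np.ndarray) -> float:
        """Point estimate <phi, theta_hat> plus a bonus beta * ||phi||_{Sigma^{-1}}."""
        theta_hat = np.linalg.solve(self.Sigma, self.b)
        bonus = self.beta * np.sqrt(phi @ np.linalg.solve(self.Sigma, phi))
        return float(phi @ theta_hat + bonus)
```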
Related papers
- Infinite-Horizon Reinforcement Learning with Multinomial Logistic Function Approximation [3.2703356989962518]
We study model-based reinforcement learning with non-linear function approximation.
We develop a provably efficient discounted value iteration-based algorithm that works for both infinite-horizon average-reward and discounted-reward settings.
arXiv Detail & Related papers (2024-06-19T15:29:14Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound on the cost function or an upper bound on the expected length of the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Sharper Model-free Reinforcement Learning for Average-reward Markov
Decision Processes [21.77276136591518]
We develop provably efficient model-free reinforcement learning (RL) algorithms for average-reward Markov Decision Processes (MDPs).
In the simulator setting, we propose a model-free RL algorithm that finds an $\epsilon$-optimal policy using $\widetilde{O}\left(\frac{SA\,\mathrm{sp}(h^*)}{\epsilon^2}+\frac{S^2A\,\mathrm{sp}(h^*)}{\epsilon^2}\right)$ samples.
arXiv Detail & Related papers (2023-06-28T17:43:19Z) - Nearly Minimax Optimal Reinforcement Learning with Linear Function
Approximation [25.60689712525918]
We study reinforcement learning with linear function approximation where the transition probability and reward functions are linear.
We propose a novel, computationally efficient algorithm, LSVI-UCB$^{+}$, which achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound, where $H$ is the episode length, $d$ is the feature dimension, and $T$ is the number of steps.
arXiv Detail & Related papers (2022-06-23T06:04:21Z) - Locally Differentially Private Reinforcement Learning for Linear Mixture
Markov Decision Processes [78.27542864367821]
Reinforcement learning (RL) algorithms can be used to provide personalized services, which rely on users' private and sensitive data.
To protect the users' privacy, privacy-preserving RL algorithms are in demand.
We propose a novel $(\varepsilon, \delta)$-LDP algorithm for learning a class of Markov decision processes (MDPs) dubbed linear mixture MDPs.
arXiv Detail & Related papers (2021-10-19T17:44:09Z) - Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function
Approximation [92.3161051419884]
We study the reinforcement learning for finite-horizon episodic Markov decision processes with adversarial reward and full information feedback.
We show that it can achieve $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode.
We also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors.
arXiv Detail & Related papers (2021-02-17T18:54:08Z) - Nearly Minimax Optimal Regret for Learning Infinite-horizon
Average-reward MDPs with Linear Function Approximation [95.80683238546499]
We propose a new algorithm UCRL2-VTR, which can be seen as an extension of the UCRL2 algorithm with linear function approximation.
We show that UCRL2-VTR with Bernstein-type bonus can achieve a regret of $\tilde{O}(d\sqrt{DT})$, where $d$ is the dimension of the feature mapping.
We also prove a matching lower bound $\tilde{\Omega}(d\sqrt{DT})$, which suggests that the proposed UCRL2-VTR is minimax optimal up to logarithmic factors.
arXiv Detail & Related papers (2021-02-15T02:08:39Z) - Near-optimal Representation Learning for Linear Bandits and Linear RL [41.33483293243257]
We first consider the setting where we play $M$ linear bandits with dimension $d$ concurrently.
These bandits share a common $k$-dimensional linear representation so that $k \ll d$ and $k \ll M$.
We propose a sample-efficient algorithm, MTLR-OFUL, which leverages the shared representation to achieve $\tilde{O}(M\sqrt{dkT} + d\sqrt{kMT})$ regret.
arXiv Detail & Related papers (2021-02-08T11:11:53Z) - Logarithmic Regret for Reinforcement Learning with Linear Function
Approximation [99.59319332864129]
We show that logarithmic regret is attainable under two recently proposed linear MDP assumptions.
To the best of our knowledge, these are the first logarithmic regret bounds for RL with linear function approximation.
arXiv Detail & Related papers (2020-11-23T17:25:00Z)