Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision
Processes
- URL: http://arxiv.org/abs/2212.06132v3
- Date: Sat, 4 Nov 2023 01:56:36 GMT
- Title: Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision
Processes
- Authors: Jiafan He and Heyang Zhao and Dongruo Zhou and Quanquan Gu
- Abstract summary: We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde O(d\sqrt{H^3K})$.
Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
- Score: 80.89852729380425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study reinforcement learning (RL) with linear function approximation. For
episodic time-inhomogeneous linear Markov decision processes (linear MDPs)
whose transition probability can be parameterized as a linear function of a
given feature mapping, we propose the first computationally efficient algorithm
that achieves the nearly minimax optimal regret $\tilde O(d\sqrt{H^3K})$, where
$d$ is the dimension of the feature mapping, $H$ is the planning horizon, and
$K$ is the number of episodes. Our algorithm is based on a weighted linear
regression scheme with a carefully designed weight, which depends on a new
variance estimator that (1) directly estimates the variance of the optimal
value function, (2) monotonically decreases with respect to the number of
episodes to ensure a better estimation accuracy, and (3) uses a rare-switching
policy to update the value function estimator to control the complexity of the
estimated value function class. Our work provides a complete answer to optimal
RL with linear MDPs, and the developed algorithm and theoretical tools may be
of independent interest.
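To make the abstract's weighted regression scheme concrete, here is a minimal Python sketch of variance-weighted ridge regression, the building block the algorithm is based on. It is illustrative only: the function name `weighted_ridge` is ours, and it assumes per-sample variance estimates are already available, whereas the paper's actual variance estimator and weight design are considerably more involved.

```python
import numpy as np

def weighted_ridge(phis, targets, sigma2, lam=1.0):
    """Variance-weighted ridge regression (sketch, not the paper's exact scheme).

    phis    : (n, d) array of feature vectors phi(s, a)
    targets : (n,)   regression targets (e.g., estimated optimal values)
    sigma2  : (n,)   per-sample variance estimates used as weights (assumed given)
    lam     : ridge regularization parameter
    """
    d = phis.shape[1]
    Sigma = lam * np.eye(d)              # regularized, weighted Gram matrix
    b = np.zeros(d)
    for phi, y, s2 in zip(phis, targets, sigma2):
        w = 1.0 / s2                     # low-variance samples get large weight
        Sigma += w * np.outer(phi, phi)
        b += w * y * phi
    theta = np.linalg.solve(Sigma, b)    # weighted least-squares estimate
    return theta, Sigma
```

The Gram matrix `Sigma` is also what drives optimism: a standard exploration bonus for a new feature vector `phi` is proportional to `sqrt(phi @ inv(Sigma) @ phi)`.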
Related papers
- A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning
with General Function Approximation [66.26739783789387]
We propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB), for reinforcement learning.
MQL-UCB achieves minimax optimal regret of $\tilde O(d\sqrt{HK})$ when $K$ is sufficiently large, as well as a near-optimal policy switching cost.
Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
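Both MQL-UCB's low switching cost and the rare-switching update in the paper above hinge on deciding when to recompute the value function. A common trigger in this line of work, sketched below for the linear case, is to switch only once the feature covariance matrix has grown by a constant factor; treat this as an assumption-laden illustration rather than MQL-UCB's exact rule.

```python
import numpy as np

def should_switch(Sigma, Sigma_last_switch, c=2.0):
    """Rare-switching trigger (sketch): recompute the value function only
    when det(Sigma) has grown by a factor of c since the last switch."""
    _, logdet = np.linalg.slogdet(Sigma)
    _, logdet_last = np.linalg.slogdet(Sigma_last_switch)
    return logdet >= logdet_last + np.log(c)
```

Because the determinant can only grow by a bounded total factor over $K$ episodes, such a rule yields on the order of $d \log K$ switches, which in turn caps the complexity of the class of value functions the algorithm can ever produce.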
arXiv Detail & Related papers (2023-11-26T08:31:57Z)
- Pessimistic Nonlinear Least-Squares Value Iteration for Offline Reinforcement Learning [53.97335841137496]
We propose an oracle-efficient algorithm, dubbed Pessimistic Nonlinear Least-Squares Value Iteration (PNLSVI), for offline RL with non-linear function approximation.
Our algorithm enjoys a regret bound that has a tight dependency on the function class complexity and achieves minimax optimal instance-dependent regret when specialized to linear function approximation.
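The "pessimistic" part amounts to subtracting an uncertainty bonus from the regression estimate rather than adding one. A minimal sketch in the linear special case (the paper handles general nonlinear function classes, and `beta` here is a placeholder confidence radius):

```python
import numpy as np

def pessimistic_value(phi, theta, Sigma_inv, beta, H):
    """Lower-confidence value estimate for offline RL (sketch).

    phi: (d,) feature vector; theta: (d,) regression weights;
    Sigma_inv: (d, d) inverse covariance; beta: confidence radius.
    """
    bonus = beta * np.sqrt(phi @ Sigma_inv @ phi)       # elliptical uncertainty
    return float(np.clip(phi @ theta - bonus, 0.0, H))  # pessimistic, clipped to [0, H]
```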
arXiv Detail & Related papers (2023-10-02T17:42:01Z)
- VO$Q$L: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation [43.193807443491814]
We study time-inhomogeneous episodic reinforcement learning (RL) under general function approximation and sparse rewards.
We design a new algorithm, Variance-weighted Optimistic $Q$-Learning (VO$Q$L), based on $Q$-learning, and bound its regret assuming completeness and bounded Eluder dimension for the regression function class.
arXiv Detail & Related papers (2022-12-12T17:37:00Z)
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for preference-based RL (PbRL) with general function approximation.
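For intuition on this feedback model, a standard way to turn pairwise comparisons into a learnable objective is the Bradley-Terry model, sketched below on trajectory returns. This is one concrete instance that a general function approximation setting can capture, not the paper's specific construction.

```python
import numpy as np

def preference_prob(return_1, return_2):
    """Bradley-Terry model (sketch): the probability that the human overseer
    prefers trajectory 1, as a logistic function of the return difference."""
    return 1.0 / (1.0 + np.exp(-(return_1 - return_2)))
```

The learner would then typically fit its reward or preference model by maximum likelihood on the observed comparisons.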
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
- Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function Approximation [92.3161051419884]
We study reinforcement learning for finite-horizon episodic Markov decision processes with adversarial rewards and full-information feedback.
We show that the proposed algorithm achieves $\tilde O(dH\sqrt{T})$ regret, where $H$ is the length of the episode.
We also prove a matching lower bound of $\tilde\Omega(dH\sqrt{T})$ up to logarithmic factors.
arXiv Detail & Related papers (2021-02-17T18:54:08Z)
- Nearly Minimax Optimal Regret for Learning Infinite-horizon Average-reward MDPs with Linear Function Approximation [95.80683238546499]
We propose a new algorithm, UCRL2-VTR, which can be seen as an extension of the UCRL2 algorithm with linear function approximation.
We show that UCRL2-VTR with a Bernstein-type bonus can achieve a regret of $\tilde O(d\sqrt{DT})$, where $d$ is the dimension of the feature mapping, $D$ is the diameter of the MDP, and $T$ is the number of time steps.
We also prove a matching lower bound of $\tilde\Omega(d\sqrt{DT})$, which suggests that the proposed UCRL2-VTR is minimax optimal up to logarithmic factors.
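A Bernstein-type bonus is a variance-aware exploration bonus: when the covariance matrix is built from variance-weighted features, the confidence width can scale with the estimated variance instead of the worst-case range, which is what tightens the regret. A hedged sketch follows; the constants and exact form differ from the paper's analysis.

```python
import numpy as np

def bernstein_bonus(phi, Sigma_inv, beta):
    """Bernstein-style exploration bonus (sketch): beta times the elliptical
    norm ||phi||_{Sigma^{-1}}. With variance-weighted regression, beta can be
    a variance-dependent (smaller) width rather than a Hoeffding-style one."""
    return beta * float(np.sqrt(phi @ Sigma_inv @ phi))
```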
arXiv Detail & Related papers (2021-02-15T02:08:39Z)
- On Query-efficient Planning in MDPs under Linear Realizability of the Optimal State-value Function [14.205660708980988]
We consider the problem of local planning in fixed-horizon Markov Decision Processes (MDPs) with a generative model.
A recent lower bound established that the related problem, in which the action-value function of the optimal policy is linearly realizable, requires an exponential number of queries.
In this work, we establish that poly$(H, d)$ learning is possible (with state-value function realizability) whenever the action set is small.
arXiv Detail & Related papers (2021-02-03T13:23:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.