A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning
with General Function Approximation
- URL: http://arxiv.org/abs/2311.15238v1
- Date: Sun, 26 Nov 2023 08:31:57 GMT
- Title: A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning
with General Function Approximation
- Authors: Heyang Zhao and Jiafan He and Quanquan Gu
- Abstract summary: We propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB) for reinforcement learning.
MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large and near-optimal policy switching cost.
Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
- Score: 66.26739783789387
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The exploration-exploitation dilemma has been a central challenge in
reinforcement learning (RL) with complex model classes. In this paper, we
propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound
(MQL-UCB) for RL with general function approximation. Our key algorithmic
design includes (1) a general deterministic policy-switching strategy that
achieves low switching cost, (2) a monotonic value function structure with
carefully controlled function class complexity, and (3) a variance-weighted
regression scheme that exploits historical trajectories with high data
efficiency. MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$
when $K$ is sufficiently large and near-optimal policy switching cost of
$\tilde{O}(dH)$, with $d$ being the eluder dimension of the function class, $H$
being the planning horizon, and $K$ being the number of episodes.
Our work sheds light on designing provably sample-efficient and
deployment-efficient Q-learning with nonlinear function approximation.
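For a more concrete picture of the algorithmic ingredients listed above, the following Python sketch illustrates two of them in the linear special case, where the eluder dimension reduces to the feature dimension $d$: a deterministic, determinant-based low-switching test and variance-weighted ridge regression. This is a minimal sketch under assumed names and constants (the doubling threshold, the per-sample variance estimates), not the paper's MQL-UCB implementation.

```python
# Hedged illustration (not the paper's code): two ingredients that low-switching,
# variance-weighted algorithms typically use in the *linear* special case:
#   (1) a deterministic switching rule that replans only when the weighted
#       covariance matrix has gained enough information (determinant test), and
#   (2) variance-weighted ridge regression for estimating the Q-function.
# Function names, the 2x threshold, and the variance estimates are assumptions.

import numpy as np


def should_switch_policy(cov_at_last_switch, cov_now, threshold=2.0):
    """Replan only when det(cov_now) >= threshold * det(cov_at_last_switch)."""
    # Use slogdet for numerical stability instead of raw determinants.
    _, logdet_old = np.linalg.slogdet(cov_at_last_switch)
    _, logdet_new = np.linalg.slogdet(cov_now)
    return logdet_new >= logdet_old + np.log(threshold)


def variance_weighted_ridge(features, targets, variances, lam=1.0):
    """Solve min_w sum_k (phi_k^T w - y_k)^2 / sigma_k^2 + lam * ||w||^2."""
    weights = 1.0 / np.asarray(variances)        # 1 / sigma_k^2 per sample
    X = np.asarray(features)                     # shape (K, d)
    y = np.asarray(targets)                      # shape (K,)
    cov = X.T @ (weights[:, None] * X) + lam * np.eye(X.shape[1])
    rhs = X.T @ (weights * y)
    return np.linalg.solve(cov, rhs), cov


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, K = 4, 200
    X = rng.normal(size=(K, d))
    y = X @ np.ones(d) + rng.normal(scale=0.1, size=K)
    sigma2 = rng.uniform(0.5, 2.0, size=K)       # assumed variance estimates
    w_hat, cov = variance_weighted_ridge(X, y, sigma2)
    print(should_switch_policy(np.eye(d), cov))  # True: enough new information
```

With a doubling threshold on the covariance determinant, the policy is recomputed only $O(d\log K)$ times over $K$ episodes, which is the flavor of the $\tilde{O}(dH)$ switching cost stated in the abstract; down-weighting high-variance transitions is the data-efficiency idea behind the variance-weighted regression scheme.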
Related papers
- Provably Efficient Reinforcement Learning via Surprise Bound [66.15308700413814]
We propose a provably efficient reinforcement learning algorithm (both computationally and statistically) with general value function approximations.
Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting.
arXiv Detail & Related papers (2023-02-22T20:21:25Z)
- Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde{O}(d\sqrt{H^3K})$.
Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
arXiv Detail & Related papers (2022-12-12T18:58:59Z)
- Near-Optimal Deployment Efficiency in Reward-Free Reinforcement Learning with Linear Function Approximation [16.871660060209674]
We study the problem of deployment-efficient reinforcement learning (RL) with linear function approximation under the reward-free exploration setting.
We propose a new algorithm that collects at most $\widetilde{O}(\frac{d^2H^5}{\epsilon^2})$ trajectories within $H$ deployments to identify an $\epsilon$-optimal policy for any (possibly data-dependent) choice of reward functions.
arXiv Detail & Related papers (2022-10-03T03:48:26Z)
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for preference-based RL (PbRL) with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
- Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z)
- On Function Approximation in Reinforcement Learning: Optimism in the Face of Large State Spaces [208.67848059021915]
We study the exploration-exploitation tradeoff at the core of reinforcement learning.
In particular, we prove that the complexity of the function class $\mathcal{F}$ characterizes the complexity of the learning problem.
Our regret bounds are independent of the number of states.
arXiv Detail & Related papers (2020-11-09T18:32:22Z)
- A Model-free Learning Algorithm for Infinite-horizon Average-reward MDPs with Near-optimal Regret [44.374427255708135]
We propose Exploration Enhanced Q-learning (EE-QL), a model-free algorithm for infinite-horizon average-reward Markov Decision Processes (MDPs).
EE-QL assumes that an online concentrating approximation of the optimal average reward is available.
This is the first model-free learning algorithm that achieves $O(\sqrt{T})$ regret without the ergodicity assumption, and it matches the lower bound in terms of $T$ up to logarithmic factors.
arXiv Detail & Related papers (2020-06-08T05:09:32Z)