Minimax Optimal Q Learning with Nearest Neighbors
- URL: http://arxiv.org/abs/2308.01490v2
- Date: Mon, 17 Jun 2024 05:41:41 GMT
- Title: Minimax Optimal Q Learning with Nearest Neighbors
- Authors: Puning Zhao, Lifeng Lai
- Abstract summary: We propose two new nearest neighbor $Q$ learning methods, one for the offline setting and the other for the online setting.
We show that the sample complexities of these two methods are $\tilde{O}(\frac{1}{\epsilon^{d+2}(1-\gamma)^{d+2}})$ and $\tilde{O}(\frac{1}{\epsilon^{d+2}(1-\gamma)^{d+3}})$ for the offline and online methods respectively.
- Score: 32.19733226959755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Analyzing the Markov decision process (MDP) with continuous state spaces is generally challenging. A recent interesting work \cite{shah2018q} solves MDP with bounded continuous state space by a nearest neighbor $Q$ learning approach, which has a sample complexity of $\tilde{O}(\frac{1}{\epsilon^{d+3}(1-\gamma)^{d+7}})$ for $\epsilon$-accurate $Q$ function estimation with discount factor $\gamma$. In this paper, we propose two new nearest neighbor $Q$ learning methods, one for the offline setting and the other for the online setting. We show that the sample complexities of these two methods are $\tilde{O}(\frac{1}{\epsilon^{d+2}(1-\gamma)^{d+2}})$ and $\tilde{O}(\frac{1}{\epsilon^{d+2}(1-\gamma)^{d+3}})$ for offline and online methods respectively, which significantly improve over existing results and have minimax optimal dependence over $\epsilon$. We achieve such improvement by utilizing the samples more efficiently. In particular, the method in \cite{shah2018q} clears up all samples after each iteration, thus these samples are somewhat wasted. On the other hand, our offline method does not remove any samples, and our online method only removes samples with time earlier than $\beta t$ at time $t$ with $\beta$ being a tunable parameter, thus our methods significantly reduce the loss of information. Apart from the sample complexity, our methods also have additional advantages of better computational complexity, as well as suitability to unbounded state spaces.
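To make the sample-reuse idea concrete, below is a minimal Python sketch of a nearest neighbor $Q$ learning loop with the online sample window described above (at time $t$, samples collected before time $\beta t$ are dropped). The class name NNQ, the $k$-nearest-neighbor averaging rule, the toy one-dimensional environment, and all parameter values are illustrative assumptions made for exposition; this is not the authors' algorithm or code.
```python
# Illustrative sketch only: simplified nearest-neighbor Q-learning with an
# online sample window. Names, constants, and the toy environment are
# assumptions for exposition, not the paper's exact method.
import numpy as np

class NNQ:
    def __init__(self, n_actions, k=5, gamma=0.9):
        self.n_actions, self.k, self.gamma = n_actions, k, gamma
        # per-action buffers of (time, state, bootstrapped target)
        self.buf = [[] for _ in range(n_actions)]

    def q(self, state, action):
        entries = self.buf[action]
        if not entries:
            return 0.0
        states = np.array([s for _, s, _ in entries])
        targets = np.array([y for _, _, y in entries])
        # Q(s, a) ~ mean target over the k nearest stored states for action a
        nearest = np.argsort(np.linalg.norm(states - state, axis=1))[: self.k]
        return float(targets[nearest].mean())

    def vmax(self, state):
        return max(self.q(state, a) for a in range(self.n_actions))

    def update(self, t, state, action, reward, next_state, beta=0.5):
        target = reward + self.gamma * self.vmax(next_state)
        self.buf[action].append((t, np.asarray(state, float), target))
        # online variant: discard samples collected before time beta * t
        self.buf[action] = [e for e in self.buf[action] if e[0] >= beta * t]


# Toy usage on a 1-D random-walk environment (purely illustrative).
rng = np.random.default_rng(0)
agent, state = NNQ(n_actions=2), np.array([0.0])
for t in range(1, 2001):
    action = rng.integers(2) if rng.random() < 0.2 else int(
        np.argmax([agent.q(state, a) for a in range(2)]))
    next_state = state + (0.1 if action == 1 else -0.1) + 0.05 * rng.normal(size=1)
    reward = float(-abs(next_state[0]))  # reward is highest near the origin
    agent.update(t, state, action, reward, next_state)
    state = next_state
print("Q(0, a) estimates:", [round(agent.q(np.array([0.0]), a), 3) for a in range(2)])
```
In the offline variant sketched in the abstract, one would simply skip the pruning step so that every stored sample keeps contributing to the nearest-neighbor estimates.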
Related papers
- Misspecified $Q$-Learning with Sparse Linear Function Approximation: Tight Bounds on Approximation Error [25.777423855881878]
We give a novel elimination-based algorithm showing that one can obtain an $O\left(H\epsilon\right)$-optimal policy.
We complement our upper bound with an $\widetilde{\Omega}\left(H\epsilon\right)$-optimality lower bound, giving a complete picture of this problem.
arXiv Detail & Related papers (2024-07-18T15:58:04Z)
- Truncated Variance Reduced Value Iteration [23.282578280033622]
We provide faster randomized algorithms for computing an $\epsilon$-optimal policy in a discounted Markov decision process.
Our results take a step to closing the sample-complexity gap between model-free and model-based methods.
arXiv Detail & Related papers (2024-05-21T17:28:06Z)
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z)
- Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes [21.77276136591518]
We develop provably efficient model-free reinforcement learning (RL) algorithms for Markov Decision Processes (MDPs)
In the simulator setting, we propose a model-free RL algorithm that finds an $\epsilon$-optimal policy using $\widetilde{O}\left(\frac{SA\,\mathrm{sp}(h^*)}{\epsilon^2}+\frac{S^2A\,\mathrm{sp}(h^*)}{\epsilon^2}\right)$ samples.
arXiv Detail & Related papers (2023-06-28T17:43:19Z)
- Near Sample-Optimal Reduction-based Policy Learning for Average Reward MDP [58.13930707612128]
This work considers the sample complexity of obtaining an $\varepsilon$-optimal policy in an average reward Markov Decision Process (AMDP)
We prove an upper bound of $\widetilde{O}(H \varepsilon^{-3} \ln \frac{1}{\delta})$ samples per state-action pair, where $H := \mathrm{sp}(h^*)$ is the span of bias of any optimal policy, $\varepsilon$ is the accuracy and $\delta$ is the failure probability.
arXiv Detail & Related papers (2022-12-01T15:57:58Z)
- Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs)
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z)
- Settling the Sample Complexity of Model-Based Offline Reinforcement Learning [50.5790774201146]
Offline reinforcement learning (RL) learns using pre-collected data without further exploration.
Prior algorithms or analyses either suffer from suboptimal sample complexities or incur high burn-in cost to reach sample optimality.
We demonstrate that the model-based (or "plug-in") approach achieves minimax-optimal sample complexity without burn-in cost.
arXiv Detail & Related papers (2022-04-11T17:26:19Z)
- Nearly Horizon-Free Offline Reinforcement Learning [97.36751930393245]
We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes with $S$ states, $A$ actions and planning horizon $H$.
We obtain the first set of nearly $H$-free sample complexity bounds for evaluation and planning using the empirical MDPs.
arXiv Detail & Related papers (2021-03-25T18:52:17Z)
- Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity [59.34067736545355]
Given an MDP with $S$ states, $A$ actions, the discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy.
For small enough $\epsilon$, we show an algorithm with improved sample complexity.
arXiv Detail & Related papers (2020-06-06T13:34:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.