Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes
- URL: http://arxiv.org/abs/2210.11604v3
- Date: Sun, 21 May 2023 21:06:44 GMT
- Title: Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes
- Authors: Runlong Zhou, Ruosong Wang, Simon S. Du
- Abstract summary: We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
- Score: 62.90204655228324
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study regret minimization for reinforcement learning (RL) in Latent Markov
Decision Processes (LMDPs) with context in hindsight. We design a novel
model-based algorithmic framework which can be instantiated with both a
model-optimistic and a value-optimistic solver. We prove an
$\tilde{O}(\sqrt{\mathsf{Var}^\star M \Gamma S A K})$ regret bound where
$\tilde{O}$ hides logarithmic factors, $M$ is the number of contexts, $S$ is the
number of states, $A$ is the number of actions, $K$ is the number of episodes,
$\Gamma \le S$ is the maximum transition degree of any state-action pair, and
$\mathsf{Var}^\star$ is a variance quantity describing the determinism of the
LMDP. The regret bound only scales logarithmically with the planning horizon,
thus yielding the first (nearly) horizon-free regret bound for LMDPs. This is
also the first problem-dependent regret bound for LMDPs. Key in our proof is an
analysis of the total variance of alpha vectors (a generalization of value
functions), which is handled with a truncation method. We complement our
positive result with a novel $\Omega(\sqrt{\mathsf{Var}^\star M S A K})$ regret
lower bound with $\Gamma = 2$, which shows that our upper bound is minimax optimal when
$\Gamma$ is a constant for the class of variance-bounded LMDPs. Our lower bound
relies on new constructions of hard instances and an argument inspired by the
symmetrization technique from theoretical computer science, both of which are
technically different from existing lower bound proofs for MDPs, and thus can be
of independent interest.
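The abstract's notion of alpha vectors (a per-context generalization of value functions) can be made concrete with a small sketch. The snippet below is only an illustration under our own assumptions (a tabular LMDP, a fixed deterministic policy, and hypothetical names such as `alpha_vectors` and `belief_value`); it is not the paper's algorithmic framework, which additionally involves optimism and the variance/truncation analysis.

```python
import numpy as np

# Minimal illustrative sketch (not the paper's algorithm): in an LMDP, the
# latent context m indexes one of M component MDPs. For a fixed policy, each
# context has its own value function ("alpha vector"), and the value under a
# belief over contexts is the belief-weighted combination of these vectors.
# All array shapes and function names here are assumptions for illustration.

def alpha_vectors(P, R, pi, H):
    """Backward recursion for per-context alpha vectors of a fixed policy.

    P  : (M, S, A, S) array, P[m, s, a, s'] = transition probability in context m.
    R  : (M, S, A) array,    R[m, s, a]     = expected reward in context m.
    pi : (H, S) int array,   pi[h, s]       = action the policy takes at step h in state s.
    Returns alpha of shape (H + 1, M, S): alpha[h, m, s] is the return of
    following pi from step h in state s when the latent context is m.
    """
    M, S, A = R.shape
    alpha = np.zeros((H + 1, M, S))
    for h in range(H - 1, -1, -1):
        for m in range(M):
            for s in range(S):
                a = pi[h, s]
                alpha[h, m, s] = R[m, s, a] + P[m, s, a] @ alpha[h + 1, m]
    return alpha

def belief_value(alpha, belief, s0):
    """Value from initial state s0 under a belief over contexts: the
    belief-weighted combination of alpha vectors (the LMDP analogue of V(s0))."""
    return belief @ alpha[0, :, s0]
```

For instance, with the context-mixing distribution as the belief, `belief_value` gives the expected episodic return of the policy before the context is revealed; this is the sense in which alpha vectors generalize value functions in the LMDP setting.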
Related papers
- Infinite-Horizon Reinforcement Learning with Multinomial Logistic Function Approximation [3.2703356989962518]
We study model-based reinforcement learning with non-linear function approximation.
We develop a provably efficient discounted value iteration-based algorithm that works for both infinite-horizon average-reward and discounted-reward settings.
arXiv Detail & Related papers (2024-06-19T15:29:14Z)
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z)
- A Doubly Robust Approach to Sparse Reinforcement Learning [19.68978899041642]
We propose a new regret minimization algorithm for episodic sparse linear Markov decision processes (SMDPs).
The regret of the proposed algorithm is $\tilde{O}(\sigma_{\min}^{-1} s_\star H \sqrt{N})$, where $\sigma_{\min}$ denotes the restrictive minimum eigenvalue of the average Gram matrix of feature vectors.
arXiv Detail & Related papers (2023-10-23T18:52:17Z)
- Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
We show how to achieve minimax-optimal regret without incurring any burn-in cost.
We extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances.
arXiv Detail & Related papers (2023-07-25T15:42:11Z)
- Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation [25.60689712525918]
We study reinforcement learning with linear function approximation where the transition probability and reward functions are linear.
We propose a novel, computationally efficient algorithm, LSVI-UCB$^+$, which achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound, where $H$ is the episode length, $d$ is the feature dimension, and $T$ is the number of steps.
arXiv Detail & Related papers (2022-06-23T06:04:21Z)
- Variance-Aware Confidence Set: Variance-Dependent Bound for Linear Bandits and Horizon-Free Bound for Linear Mixture MDP [76.94328400919836]
We show how to construct variance-aware confidence sets for linear bandits and linear mixture Markov decision processes (MDPs).
For linear bandits, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{1 + \sum_{i=1}^{K}\sigma_i^2})$ regret bound, where $d$ is the feature dimension.
For linear mixture MDPs, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{K})$ regret bound, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-29T18:57:52Z)
- Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes [91.38793800392108]
We study reinforcement learning with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model.
We propose a new, computationally efficient algorithm with linear function approximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixture MDPs.
To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
arXiv Detail & Related papers (2020-12-15T18:56:46Z)
- $Q$-learning with Logarithmic Regret [60.24952657636464]
We prove that an optimistic $Q$-learning algorithm enjoys an $\mathcal{O}\left(\frac{SA \cdot \mathrm{poly}(H)}{\Delta_{\min}} \log(SAT)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap.
arXiv Detail & Related papers (2020-06-16T13:01:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.