Model Selection with Near Optimal Rates for Reinforcement Learning with
General Model Classes
- URL: http://arxiv.org/abs/2107.05849v1
- Date: Tue, 13 Jul 2021 05:00:38 GMT
- Title: Model Selection with Near Optimal Rates for Reinforcement Learning with
General Model Classes
- Authors: Avishek Ghosh, Sayak Ray Chowdhury and Kannan Ramchandran
- Abstract summary: We address the problem of model selection for the finite horizon episodic Reinforcement Learning (RL) problem.
In the model selection framework, instead of $\mathcal{P}^*$, we are given $M$ nested families of transition kernels.
We show that \texttt{ARL-GEN} obtains a regret of $\Tilde{\mathcal{O}}(d_{\mathcal{E}}^*H^2+\sqrt{d_{\mathcal{E}}^* \mathbb{M}^* H^2 T})$.
- Score: 27.361399036211694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address the problem of model selection for the finite horizon episodic
Reinforcement Learning (RL) problem where the transition kernel $P^*$ belongs
to a family of models $\mathcal{P}^*$ with finite metric entropy. In the model
selection framework, instead of $\mathcal{P}^*$, we are given $M$ nested
families of transition kernels $\mathcal{P}_1 \subset \mathcal{P}_2 \subset \ldots \subset
\mathcal{P}_M$. We propose and analyze a novel algorithm, namely \emph{Adaptive
Reinforcement Learning (General)} (\texttt{ARL-GEN}) that adapts to the
smallest such family where the true transition kernel $P^*$ lies.
\texttt{ARL-GEN} uses the Upper Confidence Reinforcement Learning
(\texttt{UCRL}) algorithm with value targeted regression as a blackbox and puts
a model selection module at the beginning of each epoch. Under a mild
separability assumption on the model classes, we show that \texttt{ARL-GEN}
obtains a regret of
$\Tilde{\mathcal{O}}(d_{\mathcal{E}}^*H^2+\sqrt{d_{\mathcal{E}}^* \mathbb{M}^*
H^2 T})$, with high probability, where $H$ is the horizon length, $T$ is the
total number of steps, $d_{\mathcal{E}}^*$ is the Eluder dimension and
$\mathbb{M}^*$ is the metric entropy corresponding to $\mathcal{P}^*$. Note
that this regret scaling matches that of an oracle that knows $\mathcal{P}^*$
in advance. We show that the cost of model selection for \texttt{ARL-GEN} is an
additive term in the regret having a weak dependence on $T$. Subsequently, we
remove the separability assumption and consider the setup of linear mixture
MDPs, where the transition kernel $P^*$ has a linear function approximation.
With this low rank structure, we propose novel adaptive algorithms for model
selection, and obtain (order-wise) regret identical to that of an oracle with
knowledge of the true model class.
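To make the structure described in the abstract concrete, the following is a minimal Python sketch of an epoch-based model-selection scheme of the kind \texttt{ARL-GEN} uses: a model-selection module at the start of each epoch picks the smallest family whose empirical fit error passes a statistical test, and UCRL with value-targeted regression is then run as a blackbox within that family. The doubling epoch schedule, the function names (`run_ucrl_vtr`, `fit_error`, `threshold_fn`), and the concrete test are illustrative assumptions, not the paper's exact construction.

```python
# Illustrative sketch of epoch-based model selection over nested model
# classes P_1 ⊂ ... ⊂ P_M, with a UCRL + value-targeted-regression base
# learner treated as a blackbox. All names are hypothetical placeholders.

def arl_gen(model_classes, run_ucrl_vtr, fit_error, horizon, total_steps,
            threshold_fn):
    """model_classes : candidate transition-kernel families, smallest first.
    run_ucrl_vtr  : blackbox base learner; given a model class and an epoch
                    length (in steps), runs UCRL with value-targeted
                    regression and returns (trajectories, epoch_regret).
    fit_error     : empirical model-fit error of a class on collected data.
    threshold_fn  : data-dependent threshold; under a separability assumption
                    on the classes, comparing fit_error against it indicates
                    whether the current class is large enough to contain P*.
    """
    index = 0          # start from the smallest family
    data = []          # all transitions collected so far
    t, epoch, total_regret = 0, 0, 0.0
    while t < total_steps:
        # Doubling epoch lengths keep the number of selection rounds small.
        epoch_len = min(horizon * (2 ** epoch), total_steps - t)

        # Model-selection module at the beginning of the epoch: promote to a
        # larger class while the current class fits the data poorly.
        while (index + 1 < len(model_classes)
               and fit_error(model_classes[index], data)
                   > threshold_fn(len(data))):
            index += 1

        # Run the base learner as a blackbox within the selected class.
        trajectories, epoch_regret = run_ucrl_vtr(model_classes[index],
                                                  epoch_len)
        data.extend(trajectories)
        total_regret += epoch_regret
        t += epoch_len
        epoch += 1
    return index, total_regret
```

Because the families are nested and promotions happen only at epoch boundaries, the selection overhead enters the regret additively, consistent with the weakly $T$-dependent model-selection cost stated in the abstract.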
Related papers
- A Proximal Modified Quasi-Newton Method for Nonsmooth Regularized Optimization [0.7373617024876725]
arXiv Detail & Related papers (2024-09-28T18:16:32Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture, setting it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that, in the worst case, exponential dependence on $m$ is unavoidable. (A minimal sketch of a multi-head attention forward pass appears after this list.)
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - Regret-Optimal Federated Transfer Learning for Kernel Regression with Applications in American Option Pricing [8.723136784230906]
We propose an optimal iterative scheme for federated transfer learning, where a central planner has access to datasets.
Our objective is to minimize the cumulative deviation of the generated parameters $\{\theta_i(t)\}_{t=0}^{T}$ across all $T$ iterations.
By leveraging symmetries within the regret-optimal algorithm, we develop a nearly regret-optimal variant that runs with $\mathcal{O}(N p^2)$ fewer elementary operations.
arXiv Detail & Related papers (2023-09-08T19:17:03Z) - Sharper Model-free Reinforcement Learning for Average-reward Markov
Decision Processes [21.77276136591518]
We develop provably efficient model-free reinforcement learning (RL) algorithms for Markov Decision Processes (MDPs).
In the simulator setting, we propose a model-free RL algorithm that finds an $\epsilon$-optimal policy using $\widetilde{O}\left(\frac{SA\,\mathrm{sp}(h^*)}{\epsilon^2}+\frac{S^2A\,\mathrm{sp}(h^*)}{\epsilon^2}\right)$ samples.
arXiv Detail & Related papers (2023-06-28T17:43:19Z) - Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z) - Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs
with a Generative Model [3.749193647980305]
This paper considers a Markov decision process (MDP) that admits a set of state-action features.
We show that a model-based approach (resp. $Q$-learning) provably learns an $\varepsilon$-optimal policy with high probability.
arXiv Detail & Related papers (2021-05-28T17:49:39Z) - Variance-Aware Confidence Set: Variance-Dependent Bound for Linear
Bandits and Horizon-Free Bound for Linear Mixture MDP [76.94328400919836]
We show how to construct variance-aware confidence sets for linear bandits and linear mixture Markov Decision Processes (MDPs).
For linear bandits, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{1 + \sum_{i=1}^{K}\sigma_i^2})$ regret bound, where $d$ is the feature dimension.
For linear mixture MDPs, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{K})$ regret bound, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-29T18:57:52Z) - Model-Based Reinforcement Learning with Value-Targeted Regression [48.92439657407732]
We focus on finite-horizon episodic RL where the transition model $P$ belongs to a known family of models $\mathcal{P}$.
We derive a bound on the regret which, in the special case of linear mixtures, takes the form $\tilde{\mathcal{O}}(d\sqrt{H^3T})$.
arXiv Detail & Related papers (2020-06-01T17:47:53Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2})+\epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z)
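As a reference point for the "Provably learning a multi-head attention layer" entry above, here is a minimal NumPy sketch of the forward pass of such a layer: $m$ heads, scaled dot-product attention, and concatenated per-head outputs. It assumes $m$ is the number of heads in that summary; the weight shapes and the omission of masking and the output projection are simplifying assumptions, not the referenced paper's exact model.

```python
# Minimal NumPy sketch of a multi-head attention forward pass.
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(X, Wq, Wk, Wv):
    """X: (n, d) input sequence; Wq, Wk, Wv: (m, d, d_head) per-head projections.

    Returns the concatenated per-head outputs, shape (n, m * d_head).
    """
    heads = []
    for q, k, v in zip(Wq, Wk, Wv):
        Q, K, V = X @ q, X @ k, X @ v               # (n, d_head) each
        scores = Q @ K.T / np.sqrt(q.shape[-1])     # scaled dot-product scores
        heads.append(softmax(scores, axis=-1) @ V)  # attention-weighted values
    return np.concatenate(heads, axis=-1)


# Tiny usage example with m = 2 heads.
rng = np.random.default_rng(0)
n, d, d_head, m = 4, 8, 3, 2
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(m, d, d_head)) for _ in range(3))
print(multi_head_attention(X, Wq, Wk, Wv).shape)    # (4, 6)
```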