Adversarial Online Multi-Task Reinforcement Learning
- URL: http://arxiv.org/abs/2301.04268v1
- Date: Wed, 11 Jan 2023 02:18:26 GMT
- Title: Adversarial Online Multi-Task Reinforcement Learning
- Authors: Quan Nguyen and Nishant A. Mehta
- Abstract summary: We consider the adversarial online multi-task reinforcement learning setting.
In each of $K$ episodes the learner is given an unknown task taken from a finite set of $M$ unknown finite-horizon MDP models.
The learner's objective is to minimize its regret with respect to the optimal policy for each task.
- Score: 12.421997449847153
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the adversarial online multi-task reinforcement learning setting,
where in each of $K$ episodes the learner is given an unknown task taken from a
finite set $\mathcal{M}$ of $M$ unknown finite-horizon MDP models. The learner's objective is
to minimize its regret with respect to the optimal policy for each task. We
assume the MDPs in $\mathcal{M}$ are well-separated under a notion of
$\lambda$-separability, and show that this notion generalizes many
task-separability notions from previous works. We prove a minimax lower bound
of $\Omega(K\sqrt{DSAH})$ on the regret of any learning algorithm and an
instance-specific lower bound of $\Omega(\frac{K}{\lambda^2})$ in sample
complexity for a class of uniformly-good cluster-then-learn algorithms. We use
a novel construction called 2-JAO MDP for proving the instance-specific lower
bound. The lower bounds are complemented with a polynomial time algorithm that
obtains $\tilde{O}(\frac{K}{\lambda^2})$ sample complexity guarantee for the
clustering phase and $\tilde{O}(\sqrt{MK})$ regret guarantee for the learning
phase, indicating that the dependency on $K$ and $\frac{1}{\lambda^2}$ is
tight.
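
To make the two-phase structure described in the abstract concrete, here is a minimal cluster-then-learn sketch. It is not the paper's algorithm: the toy two-model task family, the greedy distance-threshold clustering rule, and every constant below (`P0`, `H`, `K`, `LAM`) are assumptions introduced purely to illustrate how a clustering phase over per-episode statistics can precede per-cluster learning when tasks are well-separated.

```python
# A minimal, illustrative cluster-then-learn sketch (NOT the paper's algorithm).
# The toy tasks, the clustering rule, and the threshold are assumptions made
# only to show the two-phase structure the abstract describes.
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: M = 2 latent "MDP models" over states {0, 1}.  Model m sends
# every transition to state 0 with probability P0[m]; the gap |0.9 - 0.1|
# plays the role of the separation parameter lambda.
P0 = [0.9, 0.1]
H, K, LAM = 20, 50, 0.8        # horizon, number of episodes, separation gap

def episode_statistic(model):
    """Roll out one episode and return the empirical frequency of state 0."""
    visits_to_0 = sum(rng.random() < P0[model] for _ in range(H))
    return visits_to_0 / H

# Phase 1 (cluster): greedily assign each episode to an existing cluster if its
# statistic lies within LAM / 2 of the cluster's running mean, else open a new
# cluster.  Phase 2 (learn) would then run a standard episodic RL algorithm
# separately on the episodes of each cluster.
clusters = []                               # list of [running_mean, count]
assignments = []
for _ in range(K):
    task = rng.integers(2)                  # adversary picks the latent task
    stat = episode_statistic(task)
    for idx, (mean, n) in enumerate(clusters):
        if abs(stat - mean) < LAM / 2:
            clusters[idx] = [(mean * n + stat) / (n + 1), n + 1]
            assignments.append(idx)
            break
    else:
        clusters.append([stat, 1])
        assignments.append(len(clusters) - 1)

print(f"recovered {len(clusters)} clusters for M = 2 latent tasks")
```

In this toy run the episode statistic concentrates around 0.9 or 0.1, well inside the LAM/2 threshold, so the greedy rule recovers both latent tasks. The paper's guarantees concern exactly this division of labor: how many samples such a clustering phase needs (the $\tilde{O}(\frac{K}{\lambda^2})$ term) before the learning phase can pay only $\tilde{O}(\sqrt{MK})$ regret.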
Related papers
- Infinite-Horizon Reinforcement Learning with Multinomial Logistic Function Approximation [3.2703356989962518]
We study model-based reinforcement learning with non-linear function approximation.
We develop a provably efficient discounted value iteration-based algorithm that works for both infinite-horizon average-reward and discounted-reward settings.
arXiv Detail & Related papers (2024-06-19T15:29:14Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Efficiently Learning One-Hidden-Layer ReLU Networks via Schur
Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon > 0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with constraint on the number of batches.
We design a computationally efficient algorithm to achieve near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs)
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement
Learning with Latent Low-Rank Structure [9.759209713196718]
We consider a class of MDPs for which the associated optimal $Q^*$ function is low rank, where the latent features are unknown.
We show that under stronger low rank structural assumptions, given access to a generative model, Low Rank Monte Carlo Policy Iteration (LR-MCPI) and Low Rank Empirical Value Iteration (LR-EVI) achieve the desired sample complexity of $\tilde{O}\left((|S|+|A|)\,\mathrm{poly}(d,H)/\epsilon^2\right)$ for a rank
arXiv Detail & Related papers (2022-06-07T20:39:51Z) - On the Power of Multitask Representation Learning in Linear MDP [61.58929164172968]
This paper presents analyses for the statistical benefit of multitask representation learning in linear Markov Decision Process (MDP)
We first discover a Least-Activated-Feature-Abundance (LAFA) criterion, denoted as $\kappa$, with which we prove that a straightforward least-square algorithm learns a policy which is $\tilde{O}(H^2\sqrt{\frac{\kappa\,\mathcal{C}(\Phi)^2\kappa d}{NT}+\frac{\kappa d}{n}})$
arXiv Detail & Related papers (2021-06-15T11:21:06Z) - Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation [30.137884459159107]
We consider the question of learning $Q$-function in a sample efficient manner for reinforcement learning with continuous state and action spaces.
We develop a simple, iterative learning algorithm that finds an $\epsilon$-Schmidt $Q$-function with sample complexity of $\widetilde{O}(\frac{1}{\epsilon^{\max(d_1, d_2)+2}})$ when the optimal $Q$-function has low rank $r$ and the discount factor is below a certain threshold.
arXiv Detail & Related papers (2020-06-11T00:55:35Z) - Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample
Complexity [59.34067736545355]
Given an MDP with $S$ states, $A$ actions, the discount factor $\gamma \in (0,1)$, and an approximation threshold $\epsilon > 0$, we provide a model-free algorithm to learn an $\epsilon$-optimal policy.
For small enough $\epsilon$, we give an algorithm with improved sample complexity.
arXiv Detail & Related papers (2020-06-06T13:34:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.