A Few Expert Queries Suffices for Sample-Efficient RL with Resets and
Linear Value Approximation
- URL: http://arxiv.org/abs/2207.08342v1
- Date: Mon, 18 Jul 2022 01:39:13 GMT
- Title: A Few Expert Queries Suffices for Sample-Efficient RL with Resets and
Linear Value Approximation
- Authors: Philip Amortila, Nan Jiang, Dhruv Madeka, Dean P. Foster
- Abstract summary: We study sample-efficient Reinforcement Learning (RL) in settings where only the optimal value function is assumed to be linearly-realizable.
We present a statistically and computationally efficient algorithm (Delphi) for blending exploration with expert queries.
Delphi requires $\tilde{\mathcal{O}}(d)$ expert queries and a $\texttt{poly}(d,H,|\mathcal{A}|,1/\varepsilon)$ amount of exploratory samples to provably recover an $\varepsilon$-suboptimal policy.
- Score: 16.29514743112387
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The current paper studies sample-efficient Reinforcement Learning (RL) in
settings where only the optimal value function is assumed to be
linearly-realizable. It has recently been understood that, even under this
seemingly strong assumption and access to a generative model, worst-case sample
complexities can be prohibitively (i.e., exponentially) large. We investigate
the setting where the learner additionally has access to interactive
demonstrations from an expert policy, and we present a statistically and
computationally efficient algorithm (Delphi) for blending exploration with
expert queries. In particular, Delphi requires $\tilde{\mathcal{O}}(d)$ expert
queries and a $\texttt{poly}(d,H,|\mathcal{A}|,1/\varepsilon)$ amount of
exploratory samples to provably recover an $\varepsilon$-suboptimal policy.
Compared to pure RL approaches, this corresponds to an exponential improvement
in sample complexity with surprisingly little expert input. Compared to prior
imitation learning (IL) approaches, our required number of expert
demonstrations is independent of $H$ and logarithmic in $1/\varepsilon$,
whereas all prior work required at least linear factors of both in addition to
the same dependence on $d$. Towards establishing the minimal amount of expert
queries needed, we show that, in the same setting, any learner whose
exploration budget is polynomially-bounded (in terms of $d,H,$ and
$|\mathcal{A}|$) will require at least $\tilde\Omega(\sqrt{d})$ oracle calls to
recover a policy competing with the expert's value function. Under the weaker
assumption that the expert's policy is linear, we show that the lower bound
increases to $\tilde\Omega(d)$.
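To make the query-efficiency idea concrete, here is a minimal sketch of an exploration loop that spends an expert query only where its linear value estimate fails a Bellman consistency check on the transition just taken. This illustrates the blending idea, not Delphi's actual pseudocode: the toy chain MDP, the feature map `phi`, the optimistic initialization, and the query rule are all assumptions made for the demo.

```python
# Hedged sketch: explore greedily under a linear value estimate; query the
# expert only when the estimate is inconsistent with what was just observed.
import numpy as np

H, A, d, GAMMA = 6, 2, 3, 0.9           # chain length, actions, features

def phi(s):                             # assumed feature map for V*
    return np.array([1.0, s / H, (s / H) ** 2])

def step(s, a):                         # toy chain: a=1 moves right
    return min(s + 1, H - 1) if a == 1 else max(s - 1, 0)

def reward(s, a):                       # single rewarding transition
    return 1.0 if (a == 1 and s == H - 2) else 0.0

def expert_action(s):                   # stand-in expert oracle
    return 1

theta = np.array([0.0, 1.0, 0.0])       # optimistic-ish init: explore rightward
X, y, queries = [], [], 0
for episode in range(30):
    s = 0                               # reset to the start state
    for h in range(H):
        # greedy one-step lookahead under the current linear estimate
        a = int(np.argmax([reward(s, b) + GAMMA * theta @ phi(step(s, b))
                           for b in range(A)]))
        s_next = step(s, a)
        # Bellman consistency check: a large residual means the estimate is
        # untrustworthy here, so spend one expert query at this state
        residual = theta @ phi(s) - (reward(s, a) + GAMMA * theta @ phi(s_next))
        if abs(residual) > 0.05:
            a = expert_action(s)        # expert query
            queries += 1
            s_next = step(s, a)
        X.append(phi(s))                # bootstrapped regression target
        y.append(reward(s, a) + GAMMA * theta @ phi(s_next))
        s = s_next
    theta, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)

print(f"expert queries spent: {queries}")
```

The point of the structure is that expert calls are gated behind a detected inconsistency, so once the linear estimate fits the data well, the query budget stops being consumed.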
Related papers
- Computational-Statistical Gaps for Improper Learning in Sparse Linear Regression [4.396860522241307]
We show that an efficient learning algorithm for sparse linear regression can be used to solve sparse PCA problems with a negative spike.
We complement our reduction with low-degree and statistical query lower bounds for the sparse problems from which we reduce.
arXiv Detail & Related papers (2024-02-21T19:55:01Z) - Learning Thresholds with Latent Values and Censored Feedback [18.129896050051432]
We study a problem where the unknown reward $g(\gamma, v)$ depends on the proposed threshold $\gamma$ and a latent value $v$, and the reward is only received if the threshold is lower than or equal to the unknown latent value.
This problem has broad applications in practical scenarios, e.g., reserve price optimization in online auctions, online task assignment in crowdsourcing, and setting recruiting bars in hiring.
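The censored-feedback protocol is easy to state in code: the learner posts a threshold $\gamma$, a latent value $v$ is drawn, and only the acceptance bit is observed, as with a reserve price that earns revenue only when met. Below is a hedged sketch; the uniform value distribution and the explore-then-exploit grid search are assumptions, not the paper's algorithm.

```python
# Hedged sketch of threshold learning under censored feedback.
import numpy as np

rng = np.random.default_rng(1)
grid = np.linspace(0.05, 0.95, 19)       # candidate thresholds gamma
pulls = np.zeros(len(grid))
accepts = np.zeros(len(grid))

for t in range(5000):
    if t < 20 * len(grid):               # round-robin exploration phase
        i = t % len(grid)
    else:                                # then exploit the empirical best
        i = int(np.argmax(grid * accepts / np.maximum(pulls, 1)))
    gamma = grid[i]
    v = rng.uniform()                    # latent value, never observed directly
    accepted = gamma <= v                # censored feedback: one bit only
    pulls[i] += 1
    accepts[i] += accepted

revenue = grid * accepts / np.maximum(pulls, 1)
print(f"estimated best threshold: {grid[np.argmax(revenue)]:.2f}")  # ~0.5 for U[0,1]
```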
arXiv Detail & Related papers (2023-12-07T19:30:08Z) - Demonstration-Regularized RL [39.96273388393764]
Using expert demonstrations, we identify an optimal policy at a sample complexity of order $\widetilde{O}(\mathrm{Poly}(S,A,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in finite and $\widetilde{O}(\mathrm{Poly}(d,H)/(\varepsilon^2 N^{\mathrm{E}}))$ in linear Markov decision processes, where $N^{\mathrm{E}}$ is the number of expert demonstrations.
We establish that demonstration-regularized methods are provably efficient for reinforcement learning from human feedback.
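A hedged sketch of the demonstration-regularization idea: fit a behavior-cloning policy on the expert demonstrations, then optimize expected return minus a KL penalty toward that policy. The 3-armed bandit, the penalty weight, and the learning rate below are illustrative assumptions, not the paper's construction.

```python
# Hedged sketch: policy gradient on E[reward] - lam * KL(pi || pi_BC).
import numpy as np

rng = np.random.default_rng(2)
means = np.array([0.2, 0.5, 0.8])                    # unknown arm means
demos = rng.choice(3, size=20, p=[0.05, 0.15, 0.8])  # expert mostly pulls arm 2

# behavior cloning: smoothed empirical distribution of expert actions
bc = (np.bincount(demos, minlength=3) + 0.1) / (len(demos) + 0.3)

logits, lam, lr = np.zeros(3), 0.1, 0.05
for t in range(2000):
    pi = np.exp(logits) / np.exp(logits).sum()
    a = int(rng.choice(3, p=pi))
    r = rng.normal(means[a], 0.1)
    # gradient of E[reward] - lam * KL(pi || bc) with respect to pi
    grad = np.zeros(3)
    grad[a] += r / pi[a]                 # score-function reward estimate
    grad -= lam * (np.log(pi / bc) + 1)  # KL-to-demonstrations penalty
    # chain rule through the softmax parameterization
    logits += lr * (pi * grad - pi * (pi @ grad))

print("learned policy:", np.round(np.exp(logits) / np.exp(logits).sum(), 3))
```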
arXiv Detail & Related papers (2023-10-26T10:54:47Z) - Fast Rates for Maximum Entropy Exploration [52.946307632704645]
We address the challenge of exploration in reinforcement learning (RL) when the agent operates in an unknown environment with sparse or no rewards.
We study the maximum entropy exploration problem of two different types.
For visitation entropy, we propose a game-theoretic algorithm that has $\widetilde{\mathcal{O}}(H^3S^2A/\varepsilon^2)$ sample complexity.
For trajectory entropy, we propose a simple algorithm that has a sample complexity of order $\widetilde{\mathcal{O}}(\mathrm{poly}(S,\ldots))$.
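As a concrete illustration of the visitation-entropy objective, the sketch below estimates the visitation distribution from counts and treats $-\log \hat{p}(s)$ as an intrinsic reward, which steers the agent toward rarely visited states. The random-walk chain and the greedy update rule are assumptions, not either of the paper's algorithms.

```python
# Hedged sketch: count-based intrinsic reward spreads visitation mass.
import numpy as np

S, H, episodes = 8, 10, 300
counts = np.ones(S)                      # smoothed visitation counts
for _ in range(episodes):
    s = 0
    for h in range(H):
        p_hat = counts / counts.sum()    # estimated visitation distribution
        nbrs = [max(s - 1, 0), min(s + 1, S - 1)]
        # intrinsic reward -log p_hat(s'): prefer the rarer neighbor
        s = nbrs[int(np.argmax([-np.log(p_hat[n]) for n in nbrs]))]
        counts[s] += 1

p = counts / counts.sum()
print(f"visitation entropy: {-(p * np.log(p)).sum():.3f}"
      f" (uniform would give {np.log(S):.3f})")
```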
arXiv Detail & Related papers (2023-03-14T16:51:14Z) - Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement
Learning with Latent Low-Rank Structure [9.759209713196718]
We consider a class of MDPs for which the associated optimal $Q^*$ function is low rank, where the latent features are unknown.
We show that under stronger low rank structural assumptions, given access to a generative model, Low Rank Monte Carlo Policy Iteration (LR-MCPI) and Low Rank Empirical Value Iteration (LR-EVI) achieve the desired sample complexity of $\tilde{O}\left((|S|+|A|)\,\mathrm{poly}(d,H)/\epsilon^2\right)$ for a rank...
arXiv Detail & Related papers (2022-06-07T20:39:51Z) - Agnostic Reinforcement Learning with Low-Rank MDPs and Rich Observations [79.66404989555566]
We consider the more realistic setting of agnostic RL with rich observation spaces and a fixed class of policies $\Pi$ that may not contain any near-optimal policy.
We provide an algorithm for this setting whose error is bounded in terms of the rank $d$ of the underlying MDP.
arXiv Detail & Related papers (2021-06-22T03:20:40Z) - Online Sub-Sampling for Reinforcement Learning with General Function
Approximation [111.01990889581243]
In this paper, we establish an efficient online sub-sampling framework that measures the information gain of data points collected by an RL algorithm.
For a value-based method with complexity-bounded function class, we show that the policy only needs to be updated a number of times proportional to $\operatorname{polylog}(K)$.
In contrast to existing approaches that update the policy at least $\Omega(K)$ times, our approach drastically reduces the number of optimization calls in solving for a policy.
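One standard way to get $\operatorname{polylog}(K)$ policy updates, shown below as a hedged stand-in for the paper's information-gain measure, is the determinant trick: track the feature covariance of collected data and recompute the policy only when its log-determinant has grown by a constant, which can happen only $O(d \log K)$ times over $K$ rounds. The random feature stream is an assumption.

```python
# Hedged sketch: rare policy switching via covariance determinant doubling.
import numpy as np

rng = np.random.default_rng(4)
d, K = 5, 10_000
cov = np.eye(d)                          # regularized feature covariance
last_logdet = 0.0                        # logdet of the identity
updates = 0
for k in range(K):
    x = rng.normal(size=d)               # feature of the k-th data point
    cov += np.outer(x, x)
    logdet = np.linalg.slogdet(cov)[1]
    if logdet > last_logdet + np.log(2): # determinant doubled: enough new
        updates += 1                     # information, so the policy would
        last_logdet = logdet             # be recomputed here
print(f"policy recomputations: {updates} over {K} rounds")  # ~ d*log K, not K
```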
arXiv Detail & Related papers (2021-06-14T07:36:25Z) - On Function Approximation in Reinforcement Learning: Optimism in the
Face of Large State Spaces [208.67848059021915]
We study the exploration-exploitation tradeoff at the core of reinforcement learning.
In particular, we prove that the complexity of the function class $\mathcal{F}$ characterizes the complexity of the learning problem.
Our regret bounds are independent of the number of states.
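A minimal sketch of the optimism principle behind such bounds, under the simplifying assumption of linear features (a special case of a general class $\mathcal{F}$): act greedily with respect to the fitted value plus an elliptical confidence bonus $\beta\sqrt{\phi^\top \Sigma^{-1} \phi}$, which shrinks along feature directions that have already been explored. The features, $\beta$, and data below are placeholders.

```python
# Hedged sketch: optimism via an elliptical confidence bonus.
import numpy as np

rng = np.random.default_rng(6)
d, beta = 4, 1.0
Sigma = np.eye(d)                        # regularized Gram matrix of features
theta_hat = rng.normal(size=d)           # stand-in fitted value parameters

def optimistic_value(phi_sa):
    # fitted value plus a bonus that is large in feature directions
    # the collected data has not yet covered
    bonus = beta * np.sqrt(phi_sa @ np.linalg.solve(Sigma, phi_sa))
    return theta_hat @ phi_sa + bonus

phis = [rng.normal(size=d) for _ in range(3)]   # candidate action features
a = int(np.argmax([optimistic_value(p) for p in phis]))
Sigma += np.outer(phis[a], phis[a])      # chosen direction's bonus now shrinks
print("optimistically chosen action:", a)
```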
arXiv Detail & Related papers (2020-11-09T18:32:22Z) - Linear-Sample Learning of Low-Rank Distributions [56.59844655107251]
We show that learning $k\times k$, rank-$r$ matrices to normalized $L_1$ distance requires $\Omega(\frac{kr}{\epsilon^2})$ samples.
We propose an algorithm that uses $\mathcal{O}(\frac{kr}{\epsilon^2}\log^2\frac{1}{\epsilon})$ samples, a number linear in the high dimension and nearly linear in the typically low rank.
arXiv Detail & Related papers (2020-09-30T19:10:32Z) - Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal
Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z) - Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation [30.137884459159107]
We consider the question of learning the $Q$-function in a sample-efficient manner for reinforcement learning with continuous state and action spaces.
We develop a simple, iterative learning algorithm that finds an $\epsilon$-Schmidt $Q$-function with a sample complexity of $\widetilde{O}(\frac{1}{\epsilon^{\max(d_1, d_2)+2}})$ when the optimal $Q$-function has low rank $r$ and the discounting factor $\gamma$ is below a certain threshold.
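To illustrate the low-rank estimation step such results rely on, here is a hedged sketch: observe a random subset of $Q$ entries and recover the full matrix with a rank-$r$ truncated SVD. Mean imputation plus a single SVD pass is a crude stand-in for the paper's iterative algorithm; the synthetic $Q$, sampling rate, and noise level are assumptions.

```python
# Hedged sketch: recover a low-rank Q matrix from partial noisy observations.
import numpy as np

rng = np.random.default_rng(5)
n_s, n_a, r = 50, 30, 2
Q_true = rng.normal(size=(n_s, r)) @ rng.normal(size=(r, n_a))  # rank-2 Q

mask = rng.uniform(size=Q_true.shape) < 0.3     # observe ~30% of entries
noisy = Q_true + 0.01 * rng.normal(size=Q_true.shape)
filled = np.where(mask, noisy, noisy[mask].mean())  # crude mean imputation

U, s, Vt = np.linalg.svd(filled, full_matrices=False)
Q_hat = (U[:, :r] * s[:r]) @ Vt[:r]             # rank-r truncation

err = np.linalg.norm(Q_hat - Q_true) / np.linalg.norm(Q_true)
print(f"relative recovery error: {err:.3f}")
```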
arXiv Detail & Related papers (2020-06-11T00:55:35Z)