Exploring and Learning in Sparse Linear MDPs without Computationally
Intractable Oracles
- URL: http://arxiv.org/abs/2309.09457v2
- Date: Tue, 19 Sep 2023 01:56:24 GMT
- Title: Exploring and Learning in Sparse Linear MDPs without Computationally
Intractable Oracles
- Authors: Noah Golowich and Ankur Moitra and Dhruv Rohatgi
- Abstract summary: In this paper we revisit linear MDPs from the perspective of feature selection.
Our main result is the first polynomial-time algorithm for this problem.
We show that polynomial-sized emulators exist and can be computed efficiently via convex programming.
- Score: 39.10180309328293
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The key assumption underlying linear Markov Decision Processes (MDPs) is that
the learner has access to a known feature map $\phi(x, a)$ that maps
state-action pairs to $d$-dimensional vectors, and that the rewards and
transitions are linear functions in this representation. But where do these
features come from? In the absence of expert domain knowledge, a tempting
strategy is to use the "kitchen sink" approach and hope that the true features
are included in a much larger set of potential features. In this paper we
revisit linear MDPs from the perspective of feature selection. In a $k$-sparse
linear MDP, there is an unknown subset $S \subset [d]$ of size $k$ containing
all the relevant features, and the goal is to learn a near-optimal policy in
only poly$(k,\log d)$ interactions with the environment. Our main result is the
first polynomial-time algorithm for this problem. In contrast, earlier works
either made prohibitively strong assumptions that obviated the need for
exploration, or required solving computationally intractable optimization
problems.
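To make the setting concrete, the following is a minimal NumPy sketch of a $k$-sparse linear MDP instance built from a latent-variable factorization; the construction, dimensions, and variable names are purely illustrative and are not taken from the paper. The learner would observe the full $d$-dimensional features $\phi(x,a)$ but not the support $S$.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 6, 3
d, k = 50, 4                                  # ambient feature dimension d, sparsity k
S = rng.choice(d, size=k, replace=False)      # unknown set of relevant coordinates

# Illustrative latent-variable construction (not the paper's):
#   P(x' | x, a) = sum_{j in S} phi_j(x, a) * mu_j(x'),   r(x, a) = <phi(x, a), theta>,
# where phi(x, a) restricted to S is a distribution over k latent factors,
# each mu_j is a distribution over next states, and theta is supported on S.
phi = 0.1 * rng.standard_normal((n_states, n_actions, d))   # irrelevant coordinates: noise
phi[:, :, S] = rng.dirichlet(np.ones(k), size=(n_states, n_actions))

mu = np.zeros((d, n_states))
mu[S] = rng.dirichlet(np.ones(n_states), size=k)            # one next-state distribution per factor
theta = np.zeros(d)
theta[S] = rng.uniform(size=k)

P = phi @ mu        # shape (n_states, n_actions, n_states); valid kernel by construction
r = phi @ theta     # shape (n_states, n_actions); depends only on the k coordinates in S

assert np.allclose(P.sum(axis=-1), 1.0) and (P >= 0).all()
```

Here the transition kernel is a convex combination of $k$ next-state distributions with weights given by the coordinates of $\phi(x,a)$ in $S$, so transitions and rewards depend on only $k$ of the $d$ features, matching the sparsity assumption.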
Along the way we introduce the notion of an emulator: a succinct approximate
representation of the transitions that suffices for computing certain Bellman
backups. Since linear MDPs are a non-parametric model, it is not even obvious
whether polynomial-sized emulators exist. We show that they do exist and can be
computed efficiently via convex programming.
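As background on why Bellman backups are the natural quantity to approximate (this is the standard linear MDP factorization, not the paper's emulator construction itself): in a linear MDP the transition kernel factors as $P(x' \mid x, a) = \langle \phi(x,a), \mu(x') \rangle$ for an unknown vector of measures $\mu$, so for any value function $V$,
$$[\mathbb{P}V](x,a) := \mathbb{E}_{x' \sim P(\cdot \mid x,a)}[V(x')] = \Big\langle \phi(x,a), \int V(x')\, d\mu(x') \Big\rangle.$$
Every Bellman backup is therefore linear in the features with some coefficient vector $\theta_V = \int V(x')\, d\mu(x') \in \mathbb{R}^d$, but $\mu$ itself is defined over the (possibly infinite) state space, which is what makes the model non-parametric and the existence of a polynomial-sized emulator non-obvious.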
As a corollary of our main result, we give an algorithm for learning a
near-optimal policy in block MDPs whose decoding function is a low-depth
decision tree; the algorithm runs in quasi-polynomial time and takes a
polynomial number of samples. This can be seen as a reinforcement learning
analogue of classic results in computational learning theory. Furthermore, it
gives a natural model where improving the sample complexity via representation
learning is computationally feasible.
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Efficiently Learning One-Hidden-Layer ReLU Networks via Schur
Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon > 0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - Exponential Hardness of Reinforcement Learning with Linear Function
Approximation [20.066210529358177]
We show a computational lower bound, which is exponential in the feature dimension and horizon, for linear reinforcement learning under the Exponential Time Hypothesis.
We also show a lower bound optimized for horizon dependence that almost matches the best known upper bound of $\exp(\sqrt{H})$.
arXiv Detail & Related papers (2023-02-25T00:19:49Z) - Randomized Exploration for Reinforcement Learning with General Value
Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises.
We complement the theory with an empirical evaluation across known difficult exploration tasks.
arXiv Detail & Related papers (2021-06-15T02:23:07Z) - Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function
Approximation [92.3161051419884]
We study the reinforcement learning for finite-horizon episodic Markov decision processes with adversarial reward and full information feedback.
We show that it can achieve $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode.
We also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors.
arXiv Detail & Related papers (2021-02-17T18:54:08Z) - Online learning in MDPs with linear function approximation and bandit
feedback [13.32560004325655]
We consider an online learning problem where the learner interacts with a Markov decision process in a sequence of episodes.
We show that MDP-LinExp3 is the first provably efficient algorithm for this problem setting.
arXiv Detail & Related papers (2020-07-03T11:06:38Z) - Provably Efficient Reinforcement Learning for Discounted MDPs with
Feature Mapping [99.59319332864129]
In this paper, we study reinforcement learning for discounted Markov Decision Processes (MDPs) with a feature mapping.
We propose a novel algorithm that makes use of the feature mapping and obtains a $\tilde{O}(d\sqrt{T}/(1-\gamma)^2)$ regret.
Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.
arXiv Detail & Related papers (2020-06-23T17:08:54Z) - Learning Polynomials of Few Relevant Dimensions [12.122488725065388]
Polynomial regression is a basic primitive in learning and statistics.
We give an algorithm that learns the polynomial within accuracy $\epsilon$ with sample complexity roughly $N = O_{r,d}(n \log^2(1/\epsilon) (\log n)^d)$ and runtime $O_{r,d}(N n^2)$.
arXiv Detail & Related papers (2020-04-28T18:00:18Z)