Exploring and Learning in Sparse Linear MDPs without Computationally
Intractable Oracles
- URL: http://arxiv.org/abs/2309.09457v2
- Date: Tue, 19 Sep 2023 01:56:24 GMT
- Title: Exploring and Learning in Sparse Linear MDPs without Computationally
Intractable Oracles
- Authors: Noah Golowich and Ankur Moitra and Dhruv Rohatgi
- Abstract summary: In this paper we revisit linear MDPs from the perspective of feature selection.
Our main result is the first polynomial-time algorithm for this problem.
We show that polynomial-sized emulators exist and can be computed efficiently via convex programming.
- Score: 39.10180309328293
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The key assumption underlying linear Markov Decision Processes (MDPs) is that
the learner has access to a known feature map $\phi(x, a)$ that maps
state-action pairs to $d$-dimensional vectors, and that the rewards and
transitions are linear functions in this representation. But where do these
features come from? In the absence of expert domain knowledge, a tempting
strategy is to use the "kitchen sink" approach and hope that the true features
are included in a much larger set of potential features. In this paper we
revisit linear MDPs from the perspective of feature selection. In a $k$-sparse
linear MDP, there is an unknown subset $S \subset [d]$ of size $k$ containing
all the relevant features, and the goal is to learn a near-optimal policy in
only poly$(k,\log d)$ interactions with the environment. Our main result is the
first polynomial-time algorithm for this problem. In contrast, earlier works
either made prohibitively strong assumptions that obviated the need for
exploration, or required solving computationally intractable optimization
problems.
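To make the setting concrete, the following is a minimal NumPy sketch of a $k$-sparse linear MDP instance built from a latent-variable factorization; the construction, dimensions, and variable names are purely illustrative and are not taken from the paper. The learner would observe the full $d$-dimensional features $\phi(x,a)$ but not the support $S$.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 6, 3
d, k = 50, 4                                  # ambient feature dimension d, sparsity k
S = rng.choice(d, size=k, replace=False)      # unknown set of relevant coordinates

# Illustrative latent-variable construction (not the paper's):
#   P(x' | x, a) = sum_{j in S} phi_j(x, a) * mu_j(x'),   r(x, a) = <phi(x, a), theta>,
# where phi(x, a) restricted to S is a distribution over k latent factors,
# each mu_j is a distribution over next states, and theta is supported on S.
phi = 0.1 * rng.standard_normal((n_states, n_actions, d))   # irrelevant coordinates: noise
phi[:, :, S] = rng.dirichlet(np.ones(k), size=(n_states, n_actions))

mu = np.zeros((d, n_states))
mu[S] = rng.dirichlet(np.ones(n_states), size=k)            # one next-state distribution per factor
theta = np.zeros(d)
theta[S] = rng.uniform(size=k)

P = phi @ mu        # shape (n_states, n_actions, n_states); valid kernel by construction
r = phi @ theta     # shape (n_states, n_actions); depends only on the k coordinates in S

assert np.allclose(P.sum(axis=-1), 1.0) and (P >= 0).all()
```

Here the transition kernel is a convex combination of $k$ next-state distributions with weights given by the coordinates of $\phi(x,a)$ in $S$, so transitions and rewards depend on only $k$ of the $d$ features, matching the sparsity assumption.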
Along the way we introduce the notion of an emulator: a succinct approximate
representation of the transitions that suffices for computing certain Bellman
backups. Since linear MDPs are a non-parametric model, it is not even obvious
whether polynomial-sized emulators exist. We show that they do exist and can be
computed efficiently via convex programming.
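As background on why Bellman backups are the natural quantity to approximate (this is the standard linear MDP factorization, not the paper's emulator construction itself): in a linear MDP the transition kernel factors as $P(x' \mid x, a) = \langle \phi(x,a), \mu(x') \rangle$ for an unknown vector of measures $\mu$, so for any value function $V$,
$$[\mathbb{P}V](x,a) := \mathbb{E}_{x' \sim P(\cdot \mid x,a)}[V(x')] = \Big\langle \phi(x,a), \int V(x')\, d\mu(x') \Big\rangle.$$
Every Bellman backup is therefore linear in the features with some coefficient vector $\theta_V = \int V(x')\, d\mu(x') \in \mathbb{R}^d$, but $\mu$ itself is defined over the (possibly infinite) state space, which is what makes the model non-parametric and the existence of a polynomial-sized emulator non-obvious.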
As a corollary of our main result, we give an algorithm for learning a
near-optimal policy in block MDPs whose decoding function is a low-depth
decision tree; the algorithm runs in quasi-polynomial time and takes a
polynomial number of samples. This can be seen as a reinforcement learning
analogue of classic results in computational learning theory. Furthermore, it
gives a natural model where improving the sample complexity via representation
learning is computationally feasible.
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Efficiently Learning One-Hidden-Layer ReLU Networks via Schur
Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon > 0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - Exponential Hardness of Reinforcement Learning with Linear Function
Approximation [20.066210529358177]
We show a computational lower bound, which is exponential in the feature dimension and horizon, for linear reinforcement learning under the Exponential Time Hypothesis.
We also show a lower bound optimized for horizon dependence that almost matches the best known upper bound of $\exp(\sqrt{H})$.
arXiv Detail & Related papers (2023-02-25T00:19:49Z) - Randomized Exploration for Reinforcement Learning with General Value
Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises.
We complement the theory with an empirical evaluation across known difficult exploration tasks.
arXiv Detail & Related papers (2021-06-15T02:23:07Z) - Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function
Approximation [92.3161051419884]
We study the reinforcement learning for finite-horizon episodic Markov decision processes with adversarial reward and full information feedback.
We show that it can achieve $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of the episode.
We also prove a matching lower bound of $\tilde{\Omega}(dH\sqrt{T})$ up to logarithmic factors.
arXiv Detail & Related papers (2021-02-17T18:54:08Z) - Online learning in MDPs with linear function approximation and bandit
feedback [13.32560004325655]
We consider an online learning problem where the learner interacts with a Markov decision process in a sequence of episodes.
We show that MDP-LinExp3 is the first provably efficient algorithm for this problem setting.
arXiv Detail & Related papers (2020-07-03T11:06:38Z) - Provably Efficient Reinforcement Learning for Discounted MDPs with
Feature Mapping [99.59319332864129]
In this paper, we study reinforcement learning for discounted Markov Decision Processes (MDPs) with a feature mapping.
We propose a novel algorithm that makes use of the feature mapping and obtains a $\tilde{O}(d\sqrt{T}/(1-\gamma)^2)$ regret.
Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.
arXiv Detail & Related papers (2020-06-23T17:08:54Z) - Learning Polynomials of Few Relevant Dimensions [12.122488725065388]
Polynomial regression is a basic primitive in learning and statistics.
We give an algorithm that learns the polynomial within accuracy $\epsilon$ with sample complexity roughly $N = O_{r,d}(n \log^2(1/\epsilon) (\log n)^d)$ and runtime $O_{r,d}(N n^2)$.
arXiv Detail & Related papers (2020-04-28T18:00:18Z)