Online RL in Linearly $q^\pi$-Realizable MDPs Is as Easy as in Linear
MDPs If You Learn What to Ignore
- URL: http://arxiv.org/abs/2310.07811v2
- Date: Wed, 20 Dec 2023 18:09:29 GMT
- Title: Online RL in Linearly $q^\pi$-Realizable MDPs Is as Easy as in Linear
MDPs If You Learn What to Ignore
- Authors: Gellért Weisz and András György and Csaba Szepesvári
- Abstract summary: We consider online reinforcement learning in episodic Markov decision processes (MDPs) under the linear $q^\pi$-realizability assumption.
We show that the difference between this class and linear MDPs is the presence of states in linearly $q^\pi$-realizable MDPs where, for any policy, all actions have approximately equal values.
We derive a novel (computationally inefficient) learning algorithm for linearly $q^\pi$-realizable MDPs.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider online reinforcement learning (RL) in episodic Markov decision
processes (MDPs) under the linear $q^\pi$-realizability assumption, where it is
assumed that the action-values of all policies can be expressed as linear
functions of state-action features. This class is known to be more general than
linear MDPs, where the transition kernel and the reward function are assumed to
be linear functions of the feature vectors. As our first contribution, we show
that the difference between the two classes is the presence of states in
linearly $q^\pi$-realizable MDPs where for any policy, all the actions have
approximately equal values, and skipping over these states by following an
arbitrarily fixed policy in those states transforms the problem to a linear
MDP. Based on this observation, we derive a novel (computationally inefficient)
learning algorithm for linearly $q^\pi$-realizable MDPs that simultaneously
learns what states should be skipped over and runs another learning algorithm
on the linear MDP hidden in the problem. The method returns an
$\epsilon$-optimal policy after $\text{polylog}(H, d)/\epsilon^2$ interactions
with the MDP, where $H$ is the time horizon and $d$ is the dimension of the
feature vectors, giving the first polynomial-sample-complexity online RL
algorithm for this setting. The results are proved for the misspecified case,
where the sample complexity is shown to degrade gracefully with the
misspecification error.
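To make the skipping idea concrete, here is a minimal, hypothetical Python sketch (not the paper's algorithm, which is computationally inefficient and learns online from interactions): assuming linear $q^\pi$-realizability, i.e. $q^\pi(s,a) \approx \langle \phi(s,a), \theta_\pi \rangle$, and given a finite candidate set of weight vectors standing in for the policies' action-value parameters, it flags states where every candidate assigns approximately equal value to all actions. In such states an arbitrary fixed policy can be followed, leaving a linear MDP on the remaining states. All names (`skippable_states`, `features`, `policy_params`) are illustrative only.

```python
import numpy as np

def skippable_states(features, policy_params, eps):
    """Illustrative sketch (not the paper's algorithm): flag states where,
    for every candidate weight vector theta, all actions have approximately
    equal estimated value q(s, a) = <phi(s, a), theta>.

    features:      dict mapping state -> array of shape (num_actions, d),
                   the feature vectors phi(s, a) for each action
    policy_params: iterable of d-dimensional weight vectors, one per
                   candidate policy (a hypothetical finite cover)
    eps:           threshold below which action values count as "equal"
    """
    skip = set()
    for s, phi in features.items():
        # Spread (max minus min) of action values under each candidate's weights.
        spreads = [np.ptp(phi @ theta) for theta in policy_params]
        if max(spreads) <= eps:
            # No candidate policy can gain more than eps here, so any fixed
            # action can be taken when passing through this state.
            skip.add(s)
    return skip


# Toy usage: 2 states, 2 actions, d = 2 features.
features = {
    "s0": np.array([[1.0, 0.0], [0.0, 1.0]]),  # actions differ
    "s1": np.array([[0.5, 0.5], [0.5, 0.5]]),  # identical features => equal values
}
policy_params = [np.array([1.0, 0.0]), np.array([0.3, 0.7])]
print(skippable_states(features, policy_params, eps=1e-6))  # {'s1'}
```

In the paper this set is not given but must be learned from data, simultaneously with running a linear-MDP learner on the non-skipped states; the sketch only illustrates the structural observation behind that reduction.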
Related papers
- Local Linearity: the Key for No-regret Reinforcement Learning in Continuous MDPs [56.237917407785545]
Existing solutions either work under very specific assumptions or achieve bounds that are vacuous in some regimes.
Many structural assumptions are known to suffer from a provably unavoidable exponential dependence on the time horizon $H$ in the regret.
We show how any mildly smooth MDP can be represented as a Locally Linearizable MDP by an appropriate choice of representation.
arXiv Detail & Related papers (2024-10-31T16:07:22Z) - Confident Natural Policy Gradient for Local Planning in $q_\pi$-realizable Constrained MDPs [44.69257217086967]
The constrained Markov decision process (CMDP) framework emerges as an important reinforcement learning approach for imposing safety or other critical objectives.
In this paper, we address the learning problem given linear function approximation with $q_\pi$-realizability.
arXiv Detail & Related papers (2024-06-26T17:57:13Z) - Trajectory Data Suffices for Statistically Efficient Learning in Offline RL with Linear $q^\pi$-Realizability and Concentrability [34.51093353030245]
We study offline reinforcement learning (RL) in $H$-horizon Markov decision processes (MDPs).
We prove that with trajectory data, a dataset of size $\text{poly}(d,H,C_\text{conc})/\epsilon^2$ is sufficient for deriving an $\epsilon$-optimal policy.
arXiv Detail & Related papers (2024-05-27T03:59:13Z) - Exploring and Learning in Sparse Linear MDPs without Computationally
Intractable Oracles [39.10180309328293]
In this paper we revisit linear MDPs from the perspective of feature selection.
Our main result is the first polynomial-time algorithm for this problem.
We show that they do exist and can be computed efficiently via convex programming.
arXiv Detail & Related papers (2023-09-18T03:35:48Z) - Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision
Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde{O}(d\sqrt{H^3 K})$.
Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
arXiv Detail & Related papers (2022-12-12T18:58:59Z) - Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov
Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that this separation does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z) - RL for Latent MDPs: Regret Guarantees and a Lower Bound [74.41782017817808]
We consider the regret minimization problem for reinforcement learning in latent Markov decision processes (LMDPs).
In an LMDP, an MDP is randomly drawn from a set of $M$ possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent.
We show that the key link is a notion of separation between the MDP system dynamics.
arXiv Detail & Related papers (2021-02-09T16:49:58Z) - Logarithmic Regret for Reinforcement Learning with Linear Function
Approximation [99.59319332864129]
We show that logarithmic regret is attainable under two recently proposed linear MDP assumptions.
To the best of our knowledge, these are the first logarithmic regret bounds for RL with linear function approximation.
arXiv Detail & Related papers (2020-11-23T17:25:00Z) - Online learning in MDPs with linear function approximation and bandit
feedback [13.32560004325655]
We consider an online learning problem where the learner interacts with a Markov decision process in a sequence of episodes.
We show that MDP-LinExp3 is the first provably efficient algorithm for this problem setting.
arXiv Detail & Related papers (2020-07-03T11:06:38Z) - Provably Efficient Reinforcement Learning for Discounted MDPs with
Feature Mapping [99.59319332864129]
In this paper, we study reinforcement learning for discounted Markov decision processes (MDPs) with feature mapping.
We propose a novel algorithm that makes use of the feature mapping and obtains a $\tilde{O}(d\sqrt{T}/(1-\gamma)^2)$ regret.
Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.
arXiv Detail & Related papers (2020-06-23T17:08:54Z)