Local Linearity: the Key for No-regret Reinforcement Learning in Continuous MDPs
- URL: http://arxiv.org/abs/2410.24071v1
- Date: Thu, 31 Oct 2024 16:07:22 GMT
- Title: Local Linearity: the Key for No-regret Reinforcement Learning in Continuous MDPs
- Authors: Davide Maran, Alberto Maria Metelli, Matteo Papini, Marcello Restelli
- Abstract summary: Existing solutions either work under very specific assumptions or achieve bounds that are vacuous in some regimes.
Many structural assumptions are known to suffer from a provably unavoidable exponential dependence on the time horizon $H$ in the regret.
We show how any mildly smooth MDP can be represented as a Locally Linearizable MDP by an appropriate choice of representation.
- Score: 56.237917407785545
- Abstract: Achieving the no-regret property for Reinforcement Learning (RL) problems in continuous state and action-space environments is one of the major open problems in the field. Existing solutions either work under very specific assumptions or achieve bounds that are vacuous in some regimes. Furthermore, many structural assumptions are known to suffer from a provably unavoidable exponential dependence on the time horizon $H$ in the regret, which makes any possible solution infeasible in practice. In this paper, we identify local linearity as the feature that makes Markov Decision Processes (MDPs) both learnable (sublinear regret) and feasible (regret that is polynomial in $H$). We define a novel MDP representation class, namely Locally Linearizable MDPs, generalizing other representation classes like Linear MDPs and MDPs with low inherent Bellman error. Then, i) we introduce Cinderella, a no-regret algorithm for this general representation class, and ii) we show that all known learnable and feasible MDP families are representable in this class. We first show that all known feasible MDPs belong to a family that we call Mildly Smooth MDPs. Then, we show how any Mildly Smooth MDP can be represented as a Locally Linearizable MDP by an appropriate choice of representation. In this way, Cinderella is shown to achieve state-of-the-art regret bounds for all previously known (and some new) continuous MDPs for which RL is learnable and feasible.
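For reference, and using standard definitions rather than text from the paper: "learnable" asks that the cumulative regret over $K$ episodes be sublinear in $K$, and "feasible" asks that the bound scale only polynomially with the horizon $H$, i.e.
$$\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K}\Big(V^{\star}_{1}(s^{k}_{1}) - V^{\pi_{k}}_{1}(s^{k}_{1})\Big) \;\le\; \mathrm{poly}(H)\, K^{\alpha} \quad \text{for some } \alpha < 1.$$
For comparison, a Linear MDP posits a globally linear structure in a known feature map, $P_h(s' \mid s,a) = \langle \phi(s,a), \mu_h(s')\rangle$ and $r_h(s,a) = \langle \phi(s,a), \theta_h\rangle$; as the abstract indicates, Locally Linearizable MDPs relax this so that a linear representation is only required to hold locally, with the exact construction given in the paper.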
Related papers
- No-Regret Reinforcement Learning in Smooth MDPs [24.249446550171307]
We introduce a novel structural assumption on Markov decision processes (MDPs) that generalizes most of the settings proposed so far.
We propose two algorithms for regret minimization under $\nu$-smoothness.
We compare our results with state-of-the-art ones from RL theory, showing that our algorithms achieve the best guarantees.
arXiv Detail & Related papers (2024-02-06T08:18:14Z)
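As an illustration of the smoothness assumption in the entry above (a common formalization in this literature, not necessarily the exact definition used in that paper), a function $f$ of the state-action pair $z = (s,a)$ is $\nu$-smooth if it belongs to the Hölder class $C^{\nu}$, i.e. its derivatives up to order $\lfloor \nu \rfloor$ exist and satisfy
$$\big| D^{\alpha} f(z) - D^{\alpha} f(z') \big| \;\le\; L\,\|z - z'\|^{\,\nu - \lfloor \nu \rfloor} \qquad \text{for every multi-index } \alpha \text{ with } |\alpha| = \lfloor \nu \rfloor,$$
with the condition imposed, for instance, on the reward function and on the transition kernel.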
- Online RL in Linearly $q^\pi$-Realizable MDPs Is as Easy as in Linear MDPs If You Learn What to Ignore [0.0]
We consider online reinforcement learning in episodic Markov decision processes (MDPs) under the linear $q^\pi$-realizability assumption.
We show that the difference between linearly $q^\pi$-realizable MDPs and linear MDPs is the presence of states where, for any policy, all actions have approximately equal values.
We derive a novel (computationally inefficient) learning algorithm for linearly $q^\pi$-realizable MDPs.
arXiv Detail & Related papers (2023-10-11T18:50:25Z)
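For orientation on the entry above (standard definition, not quoted from that paper): linear $q^\pi$-realizability requires the action-value function of every policy to be linear in a known feature map,
$$Q^{\pi}_{h}(s,a) \;=\; \big\langle \phi(s,a), \theta^{\pi}_{h} \big\rangle \qquad \text{for all policies } \pi \text{ and stages } h,$$
which is a weaker requirement than the Linear MDP assumption, since only the value functions, and not the transition kernel itself, need to be linear in $\phi$.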
- Reinforcement Learning in Linear MDPs: Constant Regret and Representation Selection [136.4014229319618]
We study the role of the representation of state-action value functions in regret minimization in finite-horizon Markov Decision Processes (MDPs) with linear structure.
We first derive a necessary condition on the representation, called universally spanning optimal features (UNISOFT), to achieve constant regret in any MDP with linear reward function.
arXiv Detail & Related papers (2021-10-27T22:07:08Z)
- A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs [117.82903457289584]
We derive a novel problem-dependent lower bound for regret in finite-horizon Markov Decision Processes (MDPs).
We show that our lower bound is considerably smaller than in the general case and that it does not scale with the minimum action gap at all.
We show that this last result is attainable (up to $\mathrm{poly}(H)$ terms, where $H$ is the horizon) by providing a regret upper bound based on policy gaps for an optimistic algorithm.
arXiv Detail & Related papers (2021-06-24T13:46:09Z)
- Sample Efficient Reinforcement Learning In Continuous State Spaces: A Perspective Beyond Linearity [50.38337893712897]
We introduce the Effective Planning Window (EPW) condition, a structural condition on MDPs that makes no linearity assumptions.
We demonstrate that the EPW condition permits sample efficient RL, by providing an algorithm which provably solves MDPs satisfying this condition.
We additionally show the necessity of conditions like EPW, by demonstrating that simple MDPs with slight nonlinearities cannot be solved sample efficiently.
arXiv Detail & Related papers (2021-06-15T00:06:59Z)
- Reward is enough for convex MDPs [30.478950691312715]
We study convex MDPs in which goals are expressed as convex functions of the stationary distribution.
We propose a meta-algorithm for solving this problem and show that it unifies many existing algorithms in the literature.
arXiv Detail & Related papers (2021-06-01T17:46:25Z)
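To make the convex-MDP objective in the entry above concrete (standard formulation, stated for context rather than quoted from that paper): the goal is a convex function $f$ of the state-action occupancy measure $d^{\pi}$ induced by the policy,
$$\min_{\pi}\; f\big(d^{\pi}\big),$$
with ordinary reward maximization recovered as the linear special case $f(d^{\pi}) = -\langle r, d^{\pi}\rangle$.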
- RL for Latent MDPs: Regret Guarantees and a Lower Bound [74.41782017817808]
We consider the regret minimization problem for reinforcement learning in latent Markov Decision Processes (LMDPs).
In an LMDP, an MDP is randomly drawn from a set of $M$ possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent.
We show that the key link is a notion of separation between the MDP system dynamics.
arXiv Detail & Related papers (2021-02-09T16:49:58Z)
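To spell out the benchmark in the latent-MDP entry above (one standard way to write it, offered as an illustration rather than the paper's exact statement): because the identity of the drawn MDP stays hidden, the value of a history-dependent policy $\pi$ is a mixture over the $M$ components,
$$V^{\pi} \;=\; \sum_{m=1}^{M} w_{m}\, V^{\pi}_{m},$$
where $w_{m}$ is the probability that component $m$ is drawn at the start of the episode, and regret compares the learner against the best such policy.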
- FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs [53.710405006523274]
This work focuses on the representation learning question: how can we learn the features needed for efficient RL?
Under the assumption that the underlying (unknown) dynamics correspond to a low rank transition matrix, we show how the representation learning question is related to a particular non-linear matrix decomposition problem.
We develop FLAMBE, which engages in exploration and representation learning for provably efficient RL in low rank transition models.
arXiv Detail & Related papers (2020-06-18T19:11:18Z)
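For the low-rank transition models in the last entry (standard formulation, stated for context rather than quoted from the paper), the dynamics factorize through embeddings of dimension $d$,
$$P(s' \mid s, a) \;=\; \big\langle \phi^{\star}(s,a),\, \mu^{\star}(s') \big\rangle, \qquad \phi^{\star}(s,a),\, \mu^{\star}(s') \in \mathbb{R}^{d},$$
and, unlike in Linear MDPs where the feature map is known, here $\phi^{\star}$ is unknown, so representation learning amounts to estimating it from data while exploring.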
This list is automatically generated from the titles and abstracts of the papers on this site.