Reinforcement Learning in Linear MDPs: Constant Regret and
Representation Selection
- URL: http://arxiv.org/abs/2110.14798v1
- Date: Wed, 27 Oct 2021 22:07:08 GMT
- Title: Reinforcement Learning in Linear MDPs: Constant Regret and
Representation Selection
- Authors: Matteo Papini, Andrea Tirinzoni, Aldo Pacchiano, Marcello Restelli,
Alessandro Lazaric and Matteo Pirotta
- Abstract summary: We study the role of the representation of state-action value functions in regret minimization in finite-horizon Markov Decision Processes (MDPs) with linear structure.
We first derive a necessary condition on the representation, called universally spanning optimal features (UNISOFT), to achieve constant regret in any MDP with linear reward function.
- Score: 136.4014229319618
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the role of the representation of state-action value functions in
regret minimization in finite-horizon Markov Decision Processes (MDPs) with
linear structure. We first derive a necessary condition on the representation,
called universally spanning optimal features (UNISOFT), to achieve constant
regret in any MDP with linear reward function. This result encompasses the
well-known settings of low-rank MDPs and, more generally, zero inherent Bellman
error (also known as the Bellman closure assumption). We then demonstrate that
this condition is also sufficient for these classes of problems by deriving a
constant regret bound for two optimistic algorithms (LSVI-UCB and ELEANOR).
Finally, we propose an algorithm for representation selection and we prove that
it achieves constant regret when one of the given representations, or a
suitable combination of them, satisfies the UNISOFT condition.
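A minimal numerical sketch of the span test underlying the UNISOFT condition is given below: it checks whether the optimal Q-values at one stage lie in the column span of a candidate feature map. This assumes a small, enumerable set of state-action pairs and known optimal values; the function name spans_optimal_q and the toy features are illustrative assumptions, not the paper's algorithm.

import numpy as np

def spans_optimal_q(phi: np.ndarray, q_star: np.ndarray, tol: float = 1e-8) -> bool:
    """Test whether q_star lies (numerically) in the column span of phi.

    phi    : (n, d) feature matrix, one row per enumerated (state, action) pair.
    q_star : (n,) optimal state-action values at a single stage h.
    Returns True if some theta satisfies phi @ theta ~= q_star.
    """
    theta, *_ = np.linalg.lstsq(phi, q_star, rcond=None)
    residual = np.linalg.norm(phi @ theta - q_star)
    return residual <= tol * max(1.0, np.linalg.norm(q_star))

# Toy example: 3 state-action pairs, a 2-dimensional representation.
phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])
print(spans_optimal_q(phi, np.array([0.3, 0.5, 0.8])))  # True: theta = (0.3, 0.5) fits exactly
print(spans_optimal_q(phi, np.array([0.3, 0.5, 0.0])))  # False: no theta fits all three rows

Loosely, the necessary condition stated in the abstract says that a representation failing such a span test for the optimal value functions rules out constant regret.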
Related papers
- Performative Reinforcement Learning with Linear Markov Decision Process [14.75815792682734]
We study the setting of performative reinforcement learning, where the deployed policy affects both the reward and the transition of the underlying Markov decision process.
We generalize the results to linear Markov decision processes, which are the primary theoretical model of large-scale MDPs.
arXiv Detail & Related papers (2024-11-07T23:04:48Z)
- Local Linearity: the Key for No-regret Reinforcement Learning in Continuous MDPs [56.237917407785545]
Existing solutions either work under very specific assumptions or achieve bounds that are vacuous in some regimes.
Many structural assumptions are known to suffer from a provably unavoidable exponential dependence on the time horizon $H$ in the regret.
We show how any mildly smooth MDP can be represented as a Locally Linearizable MDP by an appropriate choice of representation.
arXiv Detail & Related papers (2024-10-31T16:07:22Z)
- Regularized Q-Learning with Linear Function Approximation [2.765106384328772]
We consider a bi-level optimization formulation of regularized Q-learning with linear function approximation.
We show that, under certain assumptions, the proposed algorithm converges to a stationary point in the presence of Markovian noise.
arXiv Detail & Related papers (2024-01-26T20:45:40Z)
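As context for the regularized Q-learning entry above, the following is a generic single-step sketch of Q-learning with linear function approximation and an L2 regularizer; it is not the bi-level formulation proposed in that paper, and the names (regularized_q_step, phi_sa, reg) are illustrative assumptions.

import numpy as np

def regularized_q_step(theta, phi_sa, reward, phi_next_all, gamma=0.99, lr=0.1, reg=1e-2):
    """One semi-gradient Q-learning update with linear features and L2 regularization.

    theta        : (d,) weight vector, so Q(s, a) = phi(s, a) @ theta.
    phi_sa       : (d,) features of the visited (state, action) pair.
    reward       : observed scalar reward.
    phi_next_all : (num_actions, d) features of every action at the next state.
    """
    q_next = phi_next_all @ theta                   # Q-values of all next-state actions
    td_target = reward + gamma * np.max(q_next)     # greedy bootstrap target
    td_error = td_target - phi_sa @ theta
    # Step along the TD error, shrunk by the L2 penalty on the weights.
    return theta + lr * (td_error * phi_sa - reg * theta)

# Tiny usage example with d = 3 features and 2 next-state actions.
theta = np.zeros(3)
theta = regularized_q_step(theta,
                           phi_sa=np.array([1.0, 0.0, 0.5]),
                           reward=1.0,
                           phi_next_all=np.array([[0.2, 0.1, 0.0],
                                                  [0.0, 0.3, 0.7]]))
print(theta)  # [0.1  0.   0.05]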
- Provably Efficient UCB-type Algorithms For Learning Predictive State Representations [55.00359893021461]
A sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs).
This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models.
In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational tractability, last-iterate guaranteed near-optimal policy, and guaranteed model accuracy.
arXiv Detail & Related papers (2023-07-01T18:35:21Z)
- GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP, and Beyond [101.5329678997916]
We study sample efficient reinforcement learning (RL) under the general framework of interactive decision making.
We propose a novel complexity measure, generalized eluder coefficient (GEC), which characterizes the fundamental tradeoff between exploration and exploitation.
We show that RL problems with low GEC form a remarkably rich class, which subsumes low Bellman eluder dimension problems, the bilinear class, low witness rank problems, the PO-bilinear class, and generalized regular PSRs.
arXiv Detail & Related papers (2022-11-03T16:42:40Z)
- Sparsest Univariate Learning Models Under Lipschitz Constraint [31.28451181040038]
We propose continuous-domain formulations for one-dimensional regression problems.
We control the Lipschitz constant explicitly using a user-defined upper-bound.
We show that both problems admit global minimizers that are continuous and piecewise-linear.
arXiv Detail & Related papers (2021-12-27T07:03:43Z)
- A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs [117.82903457289584]
We derive a novel problem-dependent lower bound for the regret in finite-horizon Markov Decision Processes (MDPs).
We show that our lower bound is considerably smaller than in the general case and that it does not scale with the minimum action gap at all.
We show that this last result is attainable (up to $\mathrm{poly}(H)$ terms, where $H$ is the horizon) by providing a regret upper bound based on policy gaps for an optimistic algorithm.
arXiv Detail & Related papers (2021-06-24T13:46:09Z)
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.