No-Regret Reinforcement Learning in Smooth MDPs
- URL: http://arxiv.org/abs/2402.03792v1
- Date: Tue, 6 Feb 2024 08:18:14 GMT
- Title: No-Regret Reinforcement Learning in Smooth MDPs
- Authors: Davide Maran, Alberto Maria Metelli, Matteo Papini, Marcello Restelli
- Abstract summary: We introduce a novel structural assumption on Markov decision processes (MDPs) that generalizes most of the settings proposed so far.
We propose two algorithms for regret minimization in $\nu$-smooth MDPs.
We compare our results with state-of-the-art ones from RL theory, showing that our algorithms achieve the best guarantees.
- Score: 24.249446550171307
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Obtaining no-regret guarantees for reinforcement learning (RL) in the case of
problems with continuous state and/or action spaces is still one of the major
open challenges in the field. Recently, a variety of solutions have been
proposed, but besides very specific settings, the general problem remains
unsolved. In this paper, we introduce a novel structural assumption on the
Markov decision processes (MDPs), namely $\nu$-smoothness, that generalizes
most of the settings proposed so far (e.g., linear MDPs and Lipschitz MDPs). To
face this challenging scenario, we propose two algorithms for regret
minimization in $\nu$-smooth MDPs. Both algorithms build upon the idea of
constructing an MDP representation through an orthogonal feature map based on
Legendre polynomials. The first algorithm, \textsc{Legendre-Eleanor}, achieves
the no-regret property under weaker assumptions but is computationally
inefficient, whereas the second one, \textsc{Legendre-LSVI}, runs in polynomial
time, although for a smaller class of problems. After analyzing their regret
properties, we compare our results with state-of-the-art ones from RL theory,
showing that our algorithms achieve the best guarantees.
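As a concrete illustration of the feature-construction idea above, the following Python sketch builds a tensor-product Legendre feature map on a state-action space rescaled to $[-1, 1]^d$; the degree cutoff, the uniform-measure normalization, and all function names are illustrative assumptions rather than the paper's exact construction.

import numpy as np
from numpy.polynomial import legendre
from itertools import product

def legendre_features(z, max_degree):
    # Evaluate a tensor-product Legendre feature map at z in [-1, 1]^d.
    # One feature per multi-index (n_1, ..., n_d) with each n_i <= max_degree;
    # the univariate polynomials are normalized to be orthonormal in
    # L^2([-1, 1]) under the Lebesgue measure (an illustrative choice).
    z = np.asarray(z, dtype=float)
    d = z.shape[0]
    vals = np.empty((d, max_degree + 1))
    for n in range(max_degree + 1):
        coeffs = np.zeros(n + 1)
        coeffs[n] = 1.0  # select the degree-n Legendre polynomial P_n
        vals[:, n] = legendre.legval(z, coeffs) * np.sqrt((2 * n + 1) / 2.0)
    # Multiply univariate values across coordinates for every multi-index.
    feats = [np.prod([vals[i, n] for i, n in enumerate(idx)])
             for idx in product(range(max_degree + 1), repeat=d)]
    return np.array(feats)

# Example: features for a (state, action) pair rescaled to [-1, 1]^2.
phi = legendre_features(np.array([0.3, -0.7]), max_degree=3)
print(phi.shape)  # (16,) = (max_degree + 1) ** d

A downstream regression step, e.g., a least-squares value iteration scheme in the spirit of \textsc{Legendre-LSVI}, would then treat phi as the feature vector of the corresponding state-action pair.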
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Truly No-Regret Learning in Constrained MDPs [61.78619476991494]
We propose a model-based primal-dual algorithm to learn in an unknown CMDP.
We prove that our algorithm achieves sublinear regret without error cancellations.
arXiv Detail & Related papers (2024-02-24T09:47:46Z) - Learning Adversarial Low-rank Markov Decision Processes with Unknown
Transition and Full-information Feedback [30.23951525723659]
We study the low-rank MDPs with adversarially changed losses in the full-information feedback setting.
We propose a policy optimization-based algorithm POLO, and we prove that it attains the $\widetilde{O}(K^{5/6} A^{1/2} d \ln(1+M) / (1-\gamma)^2)$ regret guarantee.
arXiv Detail & Related papers (2023-11-14T03:12:43Z) - A Doubly Robust Approach to Sparse Reinforcement Learning [19.68978899041642]
We propose a new regret minimization algorithm for episodic sparse linear Markov decision processes (SMDPs).
The regret of the proposed algorithm is $\tilde{O}(\sigma_{\min}^{-1} s_{\star} H \sqrt{N})$, where $\sigma_{\min}$ denotes the restrictive minimum eigenvalue of the average Gram matrix of feature vectors.
arXiv Detail & Related papers (2023-10-23T18:52:17Z) - Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Markov decision process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve the regret to $\tilde{\mathcal{O}}(K^{2/3})$ in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z) - Efficient and Optimal Algorithms for Contextual Dueling Bandits under
Realizability [59.81339109121384]
We study the $K$-armed contextual dueling bandit problem, a sequential decision-making setting in which the learner uses contextual information to make two decisions, but only observes preference-based feedback suggesting that one decision was better than the other.
We provide a new algorithm that achieves the optimal regret rate for a new notion of best response regret, which is a strictly stronger performance measure than those considered in prior works.
arXiv Detail & Related papers (2021-11-24T07:14:57Z) - A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs [117.82903457289584]
We derive a novel problem-dependent lower-bound for regret in finite-horizon Markov Decision Processes (MDPs).
We show that our lower-bound is considerably smaller than in the general case and it does not scale with the minimum action gap at all.
We show that this last result is attainable (up to $poly(H)$ terms, where $H$ is the horizon) by providing a regret upper-bound based on policy gaps for an optimistic algorithm.
arXiv Detail & Related papers (2021-06-24T13:46:09Z) - Provably Efficient Reinforcement Learning for Discounted MDPs with
Feature Mapping [99.59319332864129]
In this paper, we study reinforcement learning for discounted Markov decision processes (MDPs).
We propose a novel algorithm that makes use of the feature mapping and obtains a $\tilde{O}(d\sqrt{T}/(1-\gamma)^2)$ regret.
Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.
arXiv Detail & Related papers (2020-06-23T17:08:54Z)