Provably Adaptive Average Reward Reinforcement Learning for Metric Spaces
- URL: http://arxiv.org/abs/2410.19919v1
- Date: Fri, 25 Oct 2024 18:14:42 GMT
- Title: Provably Adaptive Average Reward Reinforcement Learning for Metric Spaces
- Authors: Avik Kar, Rahul Singh,
- Abstract summary: We develop an algorithm ZoRL that discretizes the state-action space adaptively and zooms into promising regions of the state-action space.
ZoRL outperforms other state-of-the-art algorithms in experiments.
- Score: 2.2984209387877628
- License:
- Abstract: We study infinite-horizon average-reward reinforcement learning (RL) for Lipschitz MDPs and develop an algorithm ZoRL that discretizes the state-action space adaptively and zooms into promising regions of the state-action space. We show that its regret can be bounded as $\mathcal{\tilde{O}}\big(T^{1 - d_{\text{eff.}}^{-1}}\big)$, where $d_{\text{eff.}} = 2d_\mathcal{S} + d_z + 3$, $d_\mathcal{S}$ is the dimension of the state space, and $d_z$ is the zooming dimension. $d_z$ is a problem-dependent quantity, which allows us to conclude that if MDP is benign, then its regret will be small. We note that the existing notion of zooming dimension for average reward RL is defined in terms of policy coverings, and hence it can be huge when the policy class is rich even though the underlying MDP is simple, so that the regret upper bound is nearly $O(T)$. The zooming dimension proposed in the current work is bounded above by $d$, the dimension of the state-action space, and hence is truly adaptive, i.e., shows how to capture adaptivity gains for infinite-horizon average-reward RL. ZoRL outperforms other state-of-the-art algorithms in experiments; thereby demonstrating the gains arising due to adaptivity.
Related papers
- The Limits of Transfer Reinforcement Learning with Latent Low-rank Structure [9.631640936820126]
Many reinforcement learning algorithms are too costly to use in practice due to the large sizes $S, A$ of the problem's state and action space.
We consider the problem of transferring a latent low rank representation when the source and target MDPs have transition kernels.
Our algorithm learns latent representations in each source MDP and then exploits the linear structure to remove the dependence on $S, A $, or $S A$ in the target MDP regret bound.
arXiv Detail & Related papers (2024-10-28T23:12:08Z) - Policy Zooming: Adaptive Discretization-based Infinite-Horizon Average-Reward Reinforcement Learning [2.2984209387877628]
We develop an algorithm PZRL that discretizes the state-action space adaptively and zooms in to promising regions of the "policy space"
We show that the regret of PZRL can be bounded as $tildemathcalObig(T1 - d_texteff.-1big)$.
arXiv Detail & Related papers (2024-05-29T06:18:09Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit
Feedback and Unknown Transition [71.33787410075577]
We study reinforcement learning with linear function approximation, unknown transition, and adversarial losses.
We propose a new algorithm that attains an $widetildeO(dsqrtHS3K + sqrtHSAK)$ regret with high probability.
arXiv Detail & Related papers (2024-03-07T15:03:50Z) - Learning Adversarial Low-rank Markov Decision Processes with Unknown
Transition and Full-information Feedback [30.23951525723659]
We study the low-rank MDPs with adversarially changed losses in the full-information feedback setting.
We propose a policy optimization-based algorithm POLO, and we prove that it attains the $widetildeO(Kfrac56Afrac12dln (1+M)/ (1-gamma)2)$ regret guarantee.
arXiv Detail & Related papers (2023-11-14T03:12:43Z) - Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $tildeObig((d+log (|mathcalS|2 |mathcalA|))sqrtKbig)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, $|mathcalS|$ and $|mathcalA|$ are the cardinalities of the state and action spaces
arXiv Detail & Related papers (2023-05-15T05:37:32Z) - Variance-aware robust reinforcement learning with linear function
approximation with heavy-tailed rewards [6.932056534450556]
We present two algorithms, AdaOFUL and VARA, for online sequential decision-making in the presence of heavy-tailed rewards.
AdaOFUL achieves a state-of-the-art regret bound of $widetildemathcalObig.
VarA achieves a tighter variance-aware regret bound of $widetildemathcalO(dsqrtHmathcalG*K)$.
arXiv Detail & Related papers (2023-03-09T22:16:28Z) - Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED)
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - Provably Efficient Reinforcement Learning for Discounted MDPs with
Feature Mapping [99.59319332864129]
In this paper, we study reinforcement learning for discounted Decision (MDP)
We propose a novel algorithm that makes use of the feature mapping and obtains a $tilde O(dsqrtT/ (1-gamma)2)$ regret.
Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $ (1-gamma)-0.5$ factor.
arXiv Detail & Related papers (2020-06-23T17:08:54Z) - Zooming for Efficient Model-Free Reinforcement Learning in Metric Spaces [26.297887542066505]
We consider episodic reinforcement learning with a continuous state-action space which is assumed to be equipped with a natural metric.
We propose ZoomRL, an online algorithm that leverages ideas from continuous bandits to learn an adaptive discretization of the joint space.
We show that ZoomRL achieves a worst-case regret $tildeO(Hfrac52 Kfracd+1d+2)$ where $H$ is the planning horizon, $K$ is the number of episodes and $d$ is the covering dimension of the space.
arXiv Detail & Related papers (2020-03-09T12:32:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.