Variance-aware robust reinforcement learning with linear function
approximation with heavy-tailed rewards
- URL: http://arxiv.org/abs/2303.05606v1
- Date: Thu, 9 Mar 2023 22:16:28 GMT
- Title: Variance-aware robust reinforcement learning with linear function
approximation with heavy-tailed rewards
- Authors: Xiang Li, Qiang Sun
- Abstract summary: We present two algorithms, AdaOFUL and VARA, for online sequential decision-making in the presence of heavy-tailed rewards.
AdaOFUL achieves a state-of-the-art regret bound of $\widetilde{\mathcal{O}}\big(d\big(\sum_{t=1}^T \nu_{t}^2\big)^{1/2}+d\big)$.
VARA achieves a tighter variance-aware regret bound of $\widetilde{\mathcal{O}}(d\sqrt{H\mathcal{G}^*K})$.
- Score: 6.932056534450556
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents two algorithms, AdaOFUL and VARA, for online sequential
decision-making in the presence of heavy-tailed rewards with only finite
variances. For linear stochastic bandits, we address the issue of heavy-tailed
rewards by modifying the adaptive Huber regression and proposing AdaOFUL.
AdaOFUL achieves a state-of-the-art regret bound of
$\widetilde{\mathcal{O}}\big(d\big(\sum_{t=1}^T \nu_{t}^2\big)^{1/2}+d\big)$ as
if the rewards were uniformly bounded, where $\nu_{t}^2$ is the observed
conditional variance of the reward at round $t$, $d$ is the feature dimension,
and $\widetilde{\mathcal{O}}(\cdot)$ hides logarithmic dependence. Building
upon AdaOFUL, we propose VARA for linear MDPs, which achieves a tighter
variance-aware regret bound of
$\widetilde{\mathcal{O}}(d\sqrt{H\mathcal{G}^*K})$. Here, $H$ is the length of
episodes, $K$ is the number of episodes, and $\mathcal{G}^*$ is a smaller
instance-dependent quantity that can be bounded by other instance-dependent
quantities when additional structural conditions on the MDP are satisfied. Our
regret bound is superior to the current state-of-the-art bounds in three ways:
(1) it depends on a tighter instance-dependent quantity and has optimal
dependence on $d$ and $H$, (2) we can obtain further instance-dependent bounds
of $\mathcal{G}^*$ under additional structural conditions on the MDP, and (3)
our regret bound is valid even when rewards have only finite variances,
achieving a level of generality unmatched by previous works. Overall, our
modified adaptive Huber regression algorithm may serve as a useful building
block in the design of algorithms for online problems with heavy-tailed
rewards.
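The key primitive in both algorithms is Huber regression with an adaptive truncation level: the squared loss in ridge regression is replaced by the Huber loss $\ell_\tau(x) = x^2/2$ for $|x|\le\tau$ and $\tau|x|-\tau^2/2$ otherwise, so large heavy-tailed residuals are down-weighted rather than squared. Below is a minimal, illustrative sketch of that idea in an OFUL-style bandit loop, not the authors' AdaOFUL: the function names and the fixed choices of the truncation level `tau`, confidence width `beta`, and regularizer `lam` are assumptions for illustration, and the sketch omits the variance-dependent adaptive truncation and weighting that AdaOFUL calibrates.

```python
# Illustrative sketch only: Huber-loss linear bandit with UCB action selection.
# Hyperparameters (tau, lam, beta) are hypothetical, not the paper's calibrated choices.
import numpy as np

def huber_regression(X, y, tau, lam, n_iters=50):
    """Ridge-regularized Huber regression via iteratively reweighted least squares."""
    d = X.shape[1]
    theta = np.zeros(d)
    for _ in range(n_iters):
        residuals = y - X @ theta
        # Huber weights: 1 inside the truncation level, tau/|r| outside it.
        w = np.where(np.abs(residuals) <= tau, 1.0,
                     tau / np.maximum(np.abs(residuals), 1e-12))
        A = lam * np.eye(d) + (X * w[:, None]).T @ X
        b = (X * w[:, None]).T @ y
        theta = np.linalg.solve(A, b)
    return theta

def run_huber_ucb(arms, reward_fn, T, tau=2.0, lam=1.0, beta=1.0):
    """OFUL-style loop: play the arm with the largest upper confidence bound."""
    d = arms.shape[1]
    X_hist, y_hist = [], []
    V = lam * np.eye(d)
    theta = np.zeros(d)
    total_reward = 0.0
    for _ in range(T):
        V_inv = np.linalg.inv(V)
        ucb = arms @ theta + beta * np.sqrt(np.einsum("ij,jk,ik->i", arms, V_inv, arms))
        a = int(np.argmax(ucb))
        x = arms[a]
        r = reward_fn(a)              # possibly heavy-tailed reward
        X_hist.append(x)
        y_hist.append(r)
        V += np.outer(x, x)
        theta = huber_regression(np.array(X_hist), np.array(y_hist), tau, lam)
        total_reward += r
    return total_reward

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    arms = rng.normal(size=(20, 5))
    theta_star = rng.normal(size=5)
    # Student-t noise with 3 degrees of freedom: finite variance, heavy tails.
    reward = lambda a: arms[a] @ theta_star + rng.standard_t(df=3)
    print(run_huber_ucb(arms, reward, T=500))
```

The IRLS weights are exactly where robustness enters: rounds whose residuals exceed the truncation level contribute only linearly to the loss, which is what allows regret guarantees under finite-variance (rather than bounded or sub-Gaussian) rewards.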
Related papers
- Provably Adaptive Average Reward Reinforcement Learning for Metric Spaces [2.2984209387877628]
We develop an algorithm ZoRL that discretizes the state-action space adaptively and zooms into promising regions of the state-action space.
ZoRL outperforms other state-of-the-art algorithms in experiments.
arXiv Detail & Related papers (2024-10-25T18:14:42Z) - Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism [1.4999444543328293]
We propose a model-based algorithm based on novel cost and reward function estimators.
Our algorithm achieves a regret upper bound of $\widetilde{\mathcal{O}}((\bar{C} - \bar{C}_b)^{-1}H^{2.5} S\sqrt{AK})$ where $\bar{C}$ is the cost budget for an episode.
arXiv Detail & Related papers (2024-10-14T04:51:06Z) - Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted $\text{OFUL}^{+}$ and Restarted $\text{SAVE}^{+}$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z) - Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $\tilde{O}\big((d+\log (|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
arXiv Detail & Related papers (2023-05-15T05:37:32Z) - Towards Theoretical Understanding of Inverse Reinforcement Learning [45.3190496371625]
Inverse reinforcement learning (IRL) denotes a powerful family of algorithms for recovering a reward function justifying the behavior demonstrated by an expert agent.
In this paper, we make a step towards closing the theory gap of IRL in the case of finite-horizon problems with a generative model.
arXiv Detail & Related papers (2023-04-25T16:21:10Z) - Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z) - Causal Bandits for Linear Structural Equation Models [58.2875460517691]
This paper studies the problem of designing an optimal sequence of interventions in a causal graphical model.
It is assumed that the graph's structure is known and has $N$ nodes.
Two algorithms are proposed for the frequentist (UCB-based) and Bayesian settings.
arXiv Detail & Related papers (2022-08-26T16:21:31Z) - Improved Regret Analysis for Variance-Adaptive Linear Bandits and
Horizon-Free Linear Mixture MDPs [12.450760567361531]
In online learning problems, exploiting low variance plays an important role in obtaining tight performance guarantees.
We present novel analyses that improve their regret bounds significantly.
Our analysis critically relies on a novel 'elliptical potential count' lemma.
arXiv Detail & Related papers (2021-11-05T06:47:27Z) - Variance-Aware Confidence Set: Variance-Dependent Bound for Linear
Bandits and Horizon-Free Bound for Linear Mixture MDP [76.94328400919836]
We show how to construct variance-aware confidence sets for linear bandits and linear mixture Markov Decision Processes (MDPs).
For linear bandits, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{1 + \sum_{i=1}^K \sigma_i^2})$ regret bound, where $d$ is the feature dimension.
For linear mixture MDPs, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{K})$ regret bound, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-29T18:57:52Z) - Curse of Dimensionality on Randomized Smoothing for Certifiable
Robustness [151.67113334248464]
We show that extending the smoothing technique to defend against other attack models can be challenging.
We present experimental results on CIFAR to validate our theory.
arXiv Detail & Related papers (2020-02-08T22:02:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.