Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both
Worlds in Stochastic and Deterministic Environments
- URL: http://arxiv.org/abs/2301.13446v3
- Date: Sun, 21 May 2023 20:44:06 GMT
- Title: Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both
Worlds in Stochastic and Deterministic Environments
- Authors: Runlong Zhou, Zihan Zhang, Simon S. Du
- Abstract summary: We study variance-dependent regret bounds for Markov decision processes (MDPs)
We propose two new environment norms to characterize the fine-grained variance properties of the environment.
For model-based methods, we design a variant of the MVP algorithm.
In particular, this bound is simultaneously minimax optimal for both and deterministic MDPs.
- Score: 48.96971760679639
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study variance-dependent regret bounds for Markov decision processes
(MDPs). Algorithms with variance-dependent regret guarantees can automatically
exploit environments with low variance (e.g., enjoying constant regret on
deterministic MDPs). The existing algorithms are either variance-independent or
suboptimal. We first propose two new environment norms to characterize the
fine-grained variance properties of the environment. For model-based methods,
we design a variant of the MVP algorithm (Zhang et al., 2021a). We apply new
analysis techniques to demonstrate that this algorithm enjoys
variance-dependent bounds with respect to the norms we propose. In particular,
this bound is simultaneously minimax optimal for both stochastic and
deterministic MDPs, the first result of its kind. We further initiate the study
on model-free algorithms with variance-dependent regret bounds by designing a
reference-function-based algorithm with a novel capped-doubling reference
update schedule. Lastly, we also provide lower bounds to complement our upper
bounds.
Related papers
- A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning [74.80956524812714]
We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning.
These problems are often formalized as Bi-Level optimizations (BLO)
We introduce a novel perspective by turning a given BLO problem into a ii optimization, where the inner loss function becomes a smooth distribution, and the outer loss becomes an expected loss over the inner distribution.
arXiv Detail & Related papers (2024-10-14T12:10:06Z) - Finite Time Analysis of Temporal Difference Learning for Mean-Variance in a Discounted MDP [1.0923877073891446]
We consider the problem of policy evaluation for variance in a discounted reward Markov decision process.
For this problem, a temporal difference (TD) type learning algorithm with linear function approximation (LFA) exists in the literature.
We derive finite sample bounds that hold (i) in the mean-squared sense; and (ii) with high probability, when tail iterate averaging is employed.
arXiv Detail & Related papers (2024-06-12T05:49:53Z) - Constrained Online Two-stage Stochastic Optimization: Algorithm with
(and without) Predictions [19.537289123577022]
We consider an online two-stage optimization with long-term constraints over a finite horizon of $T$ periods.
We develop online algorithms for the online two-stage problem from adversarial learning algorithms.
arXiv Detail & Related papers (2024-01-02T07:46:33Z) - Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement
Learning: Adaptivity and Computational Efficiency [90.40062452292091]
We present the first computationally efficient algorithm for linear bandits with heteroscedastic noise.
Our algorithm is adaptive to the unknown variance of noise and achieves an $tildeO(d sqrtsum_k = 1K sigma_k2 + d)$ regret.
We also propose a variance-adaptive algorithm for linear mixture Markov decision processes (MDPs) in reinforcement learning.
arXiv Detail & Related papers (2023-02-21T00:17:24Z) - Constrained Online Two-stage Stochastic Optimization: Near Optimal Algorithms via Adversarial Learning [1.994307489466967]
We consider an online two-stage optimization with long-term constraints over a finite horizon of $T$ periods.
We develop online algorithms for the online two-stage problem from adversarial learning algorithms.
arXiv Detail & Related papers (2023-02-02T10:33:09Z) - A unified algorithm framework for mean-variance optimization in
discounted Markov decision processes [7.510742715895749]
This paper studies the risk-averse mean-variance optimization in infinite-horizon discounted Markov decision processes (MDPs)
We introduce a pseudo mean to transform the untreatable MDP to a standard one with a redefined reward function in standard form.
We propose a unified algorithm framework with a bilevel optimization structure for the discounted mean-variance optimization.
arXiv Detail & Related papers (2022-01-15T02:19:56Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis [102.29671176698373]
We address the problem of policy evaluation in discounted decision processes, and provide Markov-dependent guarantees on the $ell_infty$error under a generative model.
We establish both and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms.
arXiv Detail & Related papers (2020-03-16T17:15:28Z) - Thompson Sampling Algorithms for Mean-Variance Bandits [97.43678751629189]
We develop Thompson Sampling-style algorithms for mean-variance MAB.
We also provide comprehensive regret analyses for Gaussian and Bernoulli bandits.
Our algorithms significantly outperform existing LCB-based algorithms for all risk tolerances.
arXiv Detail & Related papers (2020-02-01T15:33:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.