Thompson Sampling Algorithms for Mean-Variance Bandits
- URL: http://arxiv.org/abs/2002.00232v3
- Date: Mon, 3 Aug 2020 13:34:15 GMT
- Title: Thompson Sampling Algorithms for Mean-Variance Bandits
- Authors: Qiuyu Zhu and Vincent Y. F. Tan
- Abstract summary: We develop Thompson Sampling-style algorithms for mean-variance MAB.
We also provide comprehensive regret analyses for Gaussian and Bernoulli bandits.
Our algorithms significantly outperform existing LCB-based algorithms for all risk tolerances.
- Score: 97.43678751629189
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The multi-armed bandit (MAB) problem is a classical learning task that
exemplifies the exploration-exploitation tradeoff. However, standard
formulations do not take into account risk. In online decision making
systems, risk is a primary concern. In this regard, the mean-variance risk
measure is one of the most common objective functions. Existing algorithms for
mean-variance optimization in the context of MAB problems have unrealistic
assumptions on the reward distributions. We develop Thompson Sampling-style
algorithms for mean-variance MAB and provide comprehensive regret analyses for
Gaussian and Bernoulli bandits with fewer assumptions. Our algorithms achieve
the best known regret bounds for mean-variance MABs and also attain the
information-theoretic bounds in some parameter regimes. Empirical simulations
show that our algorithms significantly outperform existing LCB-based algorithms
for all risk tolerances.
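As a concrete point of reference, below is a minimal sketch of a Thompson Sampling loop for a Gaussian mean-variance bandit: each round samples a precision and a mean from a conjugate Normal-Gamma posterior per arm and pulls the arm with the largest sampled mean-variance index. The prior hyperparameters, function names, and the convention that arm quality is rho * mean - variance (to be maximized, with rho the risk tolerance) are illustrative assumptions, not the paper's exact algorithm specification.

```python
import numpy as np

# Minimal sketch of Thompson Sampling for a Gaussian mean-variance bandit.
# Illustrative Normal-Gamma posterior sampling; priors and the index
# rho * mean - variance (maximized) are assumptions, not the paper's exact method.

def mean_variance_ts(true_means, true_stds, horizon=2000, rho=1.0, seed=0):
    rng = np.random.default_rng(seed)
    k = len(true_means)
    mu0, kappa0, alpha0, beta0 = 0.0, 1.0, 1.0, 1.0  # Normal-Gamma prior
    rewards = [[] for _ in range(k)]
    counts = np.zeros(k, dtype=int)

    def posterior(obs):
        # Conjugate Normal-Gamma posterior for one arm's observations.
        n = len(obs)
        if n == 0:
            return mu0, kappa0, alpha0, beta0
        xbar = np.mean(obs)
        ssq = np.sum((np.asarray(obs) - xbar) ** 2)
        kappa_n = kappa0 + n
        mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
        alpha_n = alpha0 + n / 2.0
        beta_n = beta0 + 0.5 * ssq + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n)
        return mu_n, kappa_n, alpha_n, beta_n

    for _ in range(horizon):
        sampled_mv = np.empty(k)
        for i in range(k):
            mu_n, kappa_n, alpha_n, beta_n = posterior(rewards[i])
            tau = rng.gamma(alpha_n, 1.0 / beta_n)                   # sampled precision
            theta = rng.normal(mu_n, 1.0 / np.sqrt(kappa_n * tau))   # sampled mean
            sampled_mv[i] = rho * theta - 1.0 / tau                  # sampled mean-variance
        arm = int(np.argmax(sampled_mv))
        rewards[arm].append(rng.normal(true_means[arm], true_stds[arm]))
        counts[arm] += 1
    return counts

# Arm 0 has the better mean-variance tradeoff here, so it should dominate the pulls.
print(mean_variance_ts(true_means=[0.5, 0.6], true_stds=[0.2, 0.8]))
```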
Related papers
- Best Arm Identification with Fixed Budget: A Large Deviation Perspective [54.305323903582845]
We present sred, a truly adaptive algorithm that can reject arms in any round based on the observed empirical gaps between the rewards of various arms.
arXiv Detail & Related papers (2023-12-19T13:17:43Z) - Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z) - Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency [90.40062452292091]
We present the first computationally efficient algorithm for linear bandits with heteroscedastic noise.
Our algorithm is adaptive to the unknown variance of noise and achieves an $\tilde{O}(d\sqrt{\sum_{k=1}^{K}\sigma_k^2} + d)$ regret.
We also propose a variance-adaptive algorithm for linear mixture Markov decision processes (MDPs) in reinforcement learning.
arXiv Detail & Related papers (2023-02-21T00:17:24Z) - Risk-aware linear bandits with convex loss [0.0]
We propose an optimistic UCB algorithm to learn optimal risk-aware actions, with regret guarantees similar to those of generalized linear bandits.
This approach requires solving a convex problem at each round of the algorithm, a requirement that can be relaxed by allowing only an approximate solution obtained via online gradient descent.
arXiv Detail & Related papers (2022-09-15T09:09:53Z) - A Unifying Theory of Thompson Sampling for Continuous Risk-Averse Bandits [91.3755431537592]
This paper unifies the analysis of risk-averse Thompson sampling algorithms for the multi-armed bandit problem.
Using the contraction principle in the theory of large deviations, we prove novel concentration bounds for continuous risk functionals.
We show that a wide class of risk functionals as well as "nice" functions of them satisfy the continuity condition.
arXiv Detail & Related papers (2021-08-25T17:09:01Z) - Thompson Sampling for Gaussian Entropic Risk Bandits [0.0]
Risk complicates the basic reward-maximising objective, in part because there is no universally agreed-upon definition of risk.
We consider an entropic risk (ER) measure and explore the performance of a Thompson sampling-based algorithm ERTS under this risk measure.
arXiv Detail & Related papers (2021-05-14T17:01:02Z) - Risk-Constrained Thompson Sampling for CVaR Bandits [82.47796318548306]
We consider a popular risk measure in quantitative finance known as the Conditional Value at Risk (CVaR); see the estimation sketch after this list.
We explore the performance of a Thompson Sampling-based algorithm CVaR-TS under this risk measure.
arXiv Detail & Related papers (2020-11-16T15:53:22Z)
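Several of the papers above optimize different risk functionals of the reward distribution. For concreteness, here is a minimal sketch of how the three measures recurring in this list (mean-variance, CVaR, and entropic risk) can be estimated from i.i.d. reward samples; the sign and tail conventions (rewards to be maximized, lower-tail CVaR, the particular entropic-risk sign) are assumptions, since each paper fixes its own.

```python
import numpy as np

# Empirical estimators for three risk measures that recur in the papers above.
# Conventions are assumptions: rewards are maximized, CVaR averages the worst
# alpha-fraction, and entropic risk is -(1/gamma) * log E[exp(-gamma * X)].

def mean_variance(x, rho=1.0):
    # Mean-variance index rho * mean - variance (larger is better).
    x = np.asarray(x, dtype=float)
    return rho * x.mean() - x.var()

def cvar(x, alpha=0.05):
    # Average of the worst alpha-fraction of rewards (lower tail).
    x = np.sort(np.asarray(x, dtype=float))
    k = max(1, int(np.ceil(alpha * len(x))))
    return x[:k].mean()

def entropic_risk(x, gamma=1.0):
    # Entropic risk certainty equivalent; gamma > 0 penalizes variability.
    x = np.asarray(x, dtype=float)
    return -np.log(np.mean(np.exp(-gamma * x))) / gamma

samples = np.random.default_rng(0).normal(loc=1.0, scale=0.5, size=10_000)
print(mean_variance(samples), cvar(samples), entropic_risk(samples))
```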
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.