Risk-Constrained Thompson Sampling for CVaR Bandits
- URL: http://arxiv.org/abs/2011.08046v4
- Date: Thu, 4 Feb 2021 05:43:04 GMT
- Title: Risk-Constrained Thompson Sampling for CVaR Bandits
- Authors: Joel Q. L. Chang, Qiuyu Zhu and Vincent Y. F. Tan
- Abstract summary: We consider a popular risk measure in quantitative finance known as the Conditional Value at Risk (CVaR).
We explore the performance of a Thompson Sampling-based algorithm, CVaR-TS, under this risk measure.
- Score: 82.47796318548306
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The multi-armed bandit (MAB) problem is a ubiquitous decision-making problem
that exemplifies the exploration-exploitation tradeoff. Standard formulations
exclude risk in decision making. Risk notably complicates the basic
reward-maximising objective, in part because there is no universally agreed
definition of it. In this paper, we consider a popular risk measure in
quantitative finance known as the Conditional Value at Risk (CVaR). We explore
the performance of a Thompson Sampling-based algorithm, CVaR-TS, under this
risk measure. We provide comprehensive comparisons of our regret bounds with
those of state-of-the-art L/UCB-based algorithms in comparable settings and
demonstrate a clear improvement in performance. We also include numerical
simulations to empirically verify that CVaR-TS outperforms these L/UCB-based
algorithms.
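For intuition, CVaR at level alpha in (0, 1] is the expected reward conditioned on landing in the worst alpha-fraction of outcomes, so maximising it favours arms with benign lower tails. Below is a minimal, illustrative sketch of a CVaR-driven Thompson Sampling loop for arms with finitely supported rewards and a Dirichlet posterior; the function names, the uniform prior, and the multinomial reward model are assumptions for the sketch, not the paper's exact CVaR-TS specification.

```python
import numpy as np

def cvar(values, probs, alpha):
    """CVaR at level alpha of a discrete reward distribution: the expected
    reward conditioned on falling in the worst alpha-fraction of outcomes
    (lower tail, since larger rewards are better)."""
    order = np.argsort(values)                      # sort outcomes, worst first
    v = np.asarray(values, dtype=float)[order]
    p = np.asarray(probs, dtype=float)[order]
    cum = np.cumsum(p)
    k = min(int(np.searchsorted(cum, alpha)), len(p) - 1)
    w = np.zeros_like(p)
    w[:k] = p[:k]                                   # full mass of atoms below the quantile atom
    w[k] = alpha - (cum[k - 1] if k > 0 else 0.0)   # partial mass of the quantile atom
    return float(w @ v) / alpha

def cvar_ts(pull, n_arms, support, alpha, horizon, seed=0):
    """Dirichlet-posterior Thompson Sampling: each round, sample a reward
    pmf per arm from its posterior and play the arm with the highest CVaR."""
    rng = np.random.default_rng(seed)
    counts = np.ones((n_arms, len(support)))        # Dirichlet(1, ..., 1) prior
    for _ in range(horizon):
        scores = [cvar(support, rng.dirichlet(counts[a]), alpha)
                  for a in range(n_arms)]
        a = int(np.argmax(scores))
        outcome = pull(a)                           # observed index into `support`
        counts[a, outcome] += 1
    return counts
```

With alpha = 1 the score reduces to the posterior mean and the loop is ordinary Thompson Sampling; smaller alpha makes the policy increasingly averse to heavy lower tails. The paper's analysis covers specific distribution families rather than this generic discrete model.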
Related papers
- Robust Risk-Sensitive Reinforcement Learning with Conditional Value-at-Risk [23.63388546004777]
We analyze the robustness of CVaR-based risk-sensitive RL under Robust Markov Decision Processes.
Motivated by the existence of decision-dependent uncertainty in real-world problems, we study problems with state-action-dependent ambiguity sets.
arXiv Detail & Related papers (2024-05-02T20:28:49Z)
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC), which can be applied to either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
- Contextual bandits with concave rewards, and an application to fair ranking [108.48223948875685]
We present the first algorithm with provably vanishing regret for Contextual Bandits with Concave Rewards (CBCR).
We derive a novel reduction from the CBCR regret to the regret of a scalar-reward problem.
Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives.
arXiv Detail & Related papers (2022-10-18T16:11:55Z)
- Risk-aware linear bandits with convex loss [0.0]
We propose an optimistic UCB algorithm to learn optimal risk-aware actions, with regret guarantees similar to those of generalized linear bandits.
This approach requires solving a convex problem at each round, which can be relaxed by accepting approximate solutions obtained via online gradient descent (a minimal sketch follows below).
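As a hedged illustration of that relaxation, one projected online-gradient-descent step, replacing the exact convex solve, could look like the sketch below; the l2-ball constraint set and all names are assumptions for the example, not the paper's construction.

```python
import numpy as np

def ogd_step(theta, grad, step_size, radius=1.0):
    """One projected online gradient descent step: move against the loss
    gradient, then project back onto the l2 ball of the given radius."""
    theta = theta - step_size * grad
    norm = np.linalg.norm(theta)
    if norm > radius:                  # Euclidean projection onto the ball
        theta = theta * (radius / norm)
    return theta
```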
arXiv Detail & Related papers (2022-09-15T09:09:53Z)
- A Survey of Risk-Aware Multi-Armed Bandits [84.67376599822569]
We review various risk measures of interest, and comment on their properties.
We consider algorithms for the regret minimization setting, where the exploration-exploitation trade-off manifests.
We conclude by commenting on persisting challenges and fertile areas for future research.
arXiv Detail & Related papers (2022-05-12T02:20:34Z)
- A Unifying Theory of Thompson Sampling for Continuous Risk-Averse Bandits [91.3755431537592]
This paper unifies the analysis of risk-averse Thompson sampling algorithms for the multi-armed bandit problem.
Using the contraction principle in the theory of large deviations, we prove novel concentration bounds for continuous risk functionals.
We show that a wide class of risk functionals as well as "nice" functions of them satisfy the continuity condition.
arXiv Detail & Related papers (2021-08-25T17:09:01Z)
- Thompson Sampling for Gaussian Entropic Risk Bandits [0.0]
Risk complicates the basic reward-maximising objective, in part because there is no universally agreed definition of it.
We consider an entropic risk (ER) measure and explore the performance of a Thompson sampling-based algorithm, ERTS, under this risk measure (a closed-form sketch of ER follows below).
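For reference, the entropic risk of a reward X with risk parameter gamma is (1/gamma) * log E[exp(gamma * X)], which for Gaussian rewards has a simple closed form. The sketch below is illustrative; the sign convention and function name are assumptions, not necessarily those used by ERTS.

```python
def entropic_risk_gaussian(mu, sigma2, gamma):
    """Entropic risk (1/gamma) * log E[exp(gamma * X)] for X ~ N(mu, sigma2).
    The Gaussian moment generating function gives the closed form
    mu + gamma * sigma2 / 2; gamma < 0 penalises variance (risk-averse),
    gamma > 0 rewards it (risk-seeking)."""
    if gamma == 0.0:
        return mu                      # limiting case: plain expectation
    return mu + gamma * sigma2 / 2.0
```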
arXiv Detail & Related papers (2021-05-14T17:01:02Z)
- Near-Optimal MNL Bandits Under Risk Criteria [13.251377915797674]
We study MNL bandits, a variant of the traditional multi-armed bandit problem, under risk criteria.
We design algorithms for a broad class of risk criteria, including but not limited to the well-known conditional value-at-risk, Sharpe ratio and entropic risk, and prove that they attain near-optimal regret (the underlying MNL choice model is sketched below).
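For context, in the multinomial logit (MNL) model underlying MNL bandits, a user offered an assortment picks item i with probability proportional to its attraction weight, with an outside "no purchase" option whose weight is normalised to 1. A minimal sketch (names are illustrative):

```python
def mnl_choice_probs(weights):
    """Multinomial logit choice probabilities for an offered assortment.
    `weights` holds the attraction parameters v_i > 0 of the offered items;
    the outside option (buying nothing) has weight normalised to 1."""
    denom = 1.0 + sum(weights)
    item_probs = [v / denom for v in weights]
    no_purchase_prob = 1.0 / denom
    return item_probs, no_purchase_prob
```

For example, mnl_choice_probs([1.0, 0.5]) yields item probabilities [0.4, 0.2] and a no-purchase probability of 0.4.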
arXiv Detail & Related papers (2020-09-26T03:24:40Z)
- Thompson Sampling Algorithms for Mean-Variance Bandits [97.43678751629189]
We develop Thompson Sampling-style algorithms for mean-variance MAB.
We also provide comprehensive regret analyses for Gaussian and Bernoulli bandits.
Our algorithms significantly outperform existing LCB-based algorithms for all risk tolerances.
arXiv Detail & Related papers (2020-02-01T15:33:50Z)
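As a hedged illustration of the mean-variance criterion in the entry above, the sketch below scores arms by a sampled mean penalised by rho times a sampled variance and plays the maximiser; the index form and all names are assumptions for the example, not the paper's exact algorithm (sign and weighting conventions for mean-variance objectives differ across papers).

```python
import numpy as np

def pick_mean_variance_arm(mu_samples, var_samples, rho):
    """Choose the arm maximising a sampled mean-variance index
    mu - rho * sigma^2; larger rho means more risk aversion."""
    index = np.asarray(mu_samples) - rho * np.asarray(var_samples)
    return int(np.argmax(index))
```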
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.