Risk-aware linear bandits with convex loss
- URL: http://arxiv.org/abs/2209.07154v2
- Date: Mon, 27 Mar 2023 09:49:45 GMT
- Title: Risk-aware linear bandits with convex loss
- Authors: Patrick Saux (Inria Scool, CRIStAL, Univ. Lille), Odalric-Ambrym
Maillard (Inria Scool, CRIStAL, Univ. Lille)
- Abstract summary: We propose an optimistic UCB algorithm to learn optimal risk-aware actions, with regret guarantees similar to those of generalized linear bandits.
This approach requires solving a convex problem at each round of the algorithm, which we relax by allowing only approximate solutions obtained by online gradient descent.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In decision-making problems such as the multi-armed bandit, an agent learns
sequentially by optimizing a certain feedback. While the mean reward criterion
has been extensively studied, other measures that reflect an aversion to
adverse outcomes, such as mean-variance or conditional value-at-risk (CVaR),
can be of interest for critical applications (healthcare, agriculture).
Algorithms have been proposed for such risk-aware measures under bandit
feedback without contextual information. In this work, we study contextual
bandits where such risk measures can be elicited as linear functions of the
contexts through the minimization of a convex loss. A typical example that fits
within this framework is the expectile measure, which is obtained as the
solution of an asymmetric least-square problem. Using the method of mixtures
for supermartingales, we derive confidence sequences for the estimation of such
risk measures. We then propose an optimistic UCB algorithm to learn optimal
risk-aware actions, with regret guarantees similar to those of generalized
linear bandits. This approach requires solving a convex problem at each round
of the algorithm, which we can relax by allowing only approximate solutions
obtained by online gradient descent, at the cost of slightly higher regret. We
conclude by evaluating the resulting algorithms in numerical experiments.
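To make the expectile example concrete, here is a minimal sketch (illustrative only, not the authors' implementation; function names, step sizes, and iteration counts are assumptions) of estimating a tau-expectile by minimizing the asymmetric least-squares loss with gradient descent, together with an online variant in the spirit of the OGD relaxation mentioned above:

```python
# The tau-expectile of a sample is the minimizer of the asymmetric
# least-squares loss L(theta) = mean(|tau - 1{x <= theta}| * (x - theta)^2).
import numpy as np

def expectile(x, tau=0.5, lr=0.1, n_iter=2000):
    """Batch gradient descent on the asymmetric least-squares loss."""
    x = np.asarray(x, dtype=float)
    theta = x.mean()                                # the 0.5-expectile is the mean
    for _ in range(n_iter):
        w = np.where(x > theta, tau, 1.0 - tau)     # asymmetric weights
        grad = -2.0 * np.mean(w * (x - theta))      # dL/dtheta
        theta -= lr * grad
    return theta

def expectile_ogd(stream, tau=0.5, lr0=0.5):
    """Online variant: one stochastic gradient step per observation,
    in the spirit of the paper's OGD relaxation (illustrative only)."""
    theta = 0.0
    for t, x in enumerate(stream, start=1):
        w = tau if x > theta else 1.0 - tau
        theta += (lr0 / np.sqrt(t)) * 2.0 * w * (x - theta)
    return theta

rng = np.random.default_rng(0)
sample = rng.normal(loc=1.0, scale=2.0, size=10_000)
print(expectile(sample, tau=0.5))       # ~1.0: recovers the mean
print(expectile(sample, tau=0.2))       # below the mean: weights the lower tail
print(expectile_ogd(sample, tau=0.2))   # rough online estimate of the same
```

For tau = 0.5 the loss is symmetric and the minimizer is the sample mean; other values of tau tilt the estimate toward one tail, which is what makes the expectile a risk-aware statistic.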
Related papers
- Best Arm Identification with Fixed Budget: A Large Deviation Perspective [54.305323903582845]
We present sred, a truly adaptive algorithm that can reject arms in any round based on the observed empirical gaps between the rewards of various arms.
arXiv Detail & Related papers (2023-12-19T13:17:43Z)
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
- A Unifying Theory of Thompson Sampling for Continuous Risk-Averse Bandits [91.3755431537592]
This paper unifies the analysis of risk-averse Thompson sampling algorithms for the multi-armed bandit problem.
Using the contraction principle in the theory of large deviations, we prove novel concentration bounds for continuous risk functionals.
We show that a wide class of risk functionals as well as "nice" functions of them satisfy the continuity condition.
arXiv Detail & Related papers (2021-08-25T17:09:01Z)
- A Full Characterization of Excess Risk via Empirical Risk Landscape [8.797852602680445]
In this paper, we provide a unified analysis of the risk of the model trained by a proper algorithm with both smooth convex and non-convex loss functions.
arXiv Detail & Related papers (2020-12-04T08:24:50Z)
- Risk-Constrained Thompson Sampling for CVaR Bandits [82.47796318548306]
We consider a popular risk measure in quantitative finance known as the Conditional Value at Risk (CVaR).
We explore the performance of a Thompson Sampling-based algorithm CVaR-TS under this risk measure.
arXiv Detail & Related papers (2020-11-16T15:53:22Z)
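As background on the risk measure in the entry above, here is a hedged sketch (not CVaR-TS itself; names are illustrative) of the empirical CVaR of a reward sample, i.e. the average of the worst alpha-fraction of outcomes:

```python
# Empirical CVaR at level alpha for rewards (lower tail): the mean of the
# worst ceil(alpha * n) observations. A sketch for illustration only.
import numpy as np

def empirical_cvar(rewards, alpha=0.05):
    x = np.sort(np.asarray(rewards, dtype=float))
    k = max(1, int(np.ceil(alpha * len(x))))   # size of the worst tail
    return x[:k].mean()

rng = np.random.default_rng(1)
r = rng.normal(size=100_000)
print(empirical_cvar(r, alpha=0.05))   # about -2.06 for a standard normal
```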
- Large-Scale Methods for Distributionally Robust Optimization [53.98643772533416]
We prove that our algorithms require a number of gradient evaluations independent of the training set size and the number of parameters.
Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9-36 times more efficient than full-batch methods.
arXiv Detail & Related papers (2020-10-12T17:41:44Z)
- Constrained regret minimization for multi-criterion multi-armed bandits [5.349852254138086]
We study the problem of regret minimization over a given time horizon, subject to a risk constraint.
We propose a Risk-Constrained Lower Confidence Bound algorithm that guarantees logarithmic regret.
We prove lower bounds on the performance of any risk-constrained regret minimization algorithm.
arXiv Detail & Related papers (2020-06-17T04:23:18Z)
- Thompson Sampling Algorithms for Mean-Variance Bandits [97.43678751629189]
We develop Thompson Sampling-style algorithms for mean-variance MAB.
We also provide comprehensive regret analyses for Gaussian and Bernoulli bandits.
Our algorithms significantly outperform existing LCB-based algorithms for all risk tolerances.
arXiv Detail & Related papers (2020-02-01T15:33:50Z)
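For context on the entry above, here is a sketch of one common form of the mean-variance criterion such algorithms optimize; the exact convention and the role of the risk parameter rho vary across papers, so this form and all names are assumptions for illustration:

```python
# Illustrative mean-variance criterion MV = mean - rho * variance, where a
# larger rho means stronger risk aversion. Not the paper's exact notation.
import numpy as np

def mean_variance(rewards, rho=1.0):
    r = np.asarray(rewards, dtype=float)
    return r.mean() - rho * r.var()

# Two arms with the same mean reward; the risk-averse criterion prefers
# the low-variance arm:
arm_a = [1.0, 1.1, 0.9, 1.0]           # mean 1.0, near-zero variance
arm_b = [3.0, -1.0, 3.2, -1.2]         # mean 1.0, large variance
print(mean_variance(arm_a))            # about 0.995
print(mean_variance(arm_b))            # about -3.42
```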
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.