Contextual bandits with concave rewards, and an application to fair
ranking
- URL: http://arxiv.org/abs/2210.09957v1
- Date: Tue, 18 Oct 2022 16:11:55 GMT
- Title: Contextual bandits with concave rewards, and an application to fair
ranking
- Authors: Virginie Do, Elvis Dohmatob, Matteo Pirotta, Alessandro Lazaric and
Nicolas Usunier
- Abstract summary: We present the first algorithm with provably vanishing regret for Contextual Bandits with Concave Rewards (CBCR).
We derive a novel reduction from the CBCR regret to the regret of a scalar-reward problem.
Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives.
- Score: 108.48223948875685
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider Contextual Bandits with Concave Rewards (CBCR), a multi-objective
bandit problem where the desired trade-off between the rewards is defined by a
known concave objective function, and the reward vector depends on an observed
stochastic context. We present the first algorithm with provably vanishing
regret for CBCR without restrictions on the policy space, whereas prior works
were restricted to finite policy spaces or tabular representations. Our
solution is based on a geometric interpretation of CBCR algorithms as
optimization algorithms over the convex set of expected rewards spanned by all
stochastic policies. Building on Frank-Wolfe analyses in constrained convex
optimization, we derive a novel reduction from the CBCR regret to the regret of
a scalar-reward bandit problem. We illustrate how to apply the reduction
off-the-shelf to obtain algorithms for CBCR with both linear and general reward
functions, in the case of non-combinatorial actions. Motivated by fairness in
recommendation, we describe a special case of CBCR with rankings and
fairness-aware objectives, leading to the first algorithm with regret
guarantees for contextual combinatorial bandits with fairness of exposure.
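To make the reduction concrete, here is a minimal toy sketch (our illustration, not the authors' implementation): at each round, the concave objective is linearized at the running average of observed reward vectors, Frank-Wolfe style, and the resulting scalarized reward is handed to an off-the-shelf scalar bandit, here plain UCB1. The environment, the objective $f(z) = \sum_i \log(z_i + \epsilon)$, and all constants are assumptions for illustration; the paper's setting additionally handles contexts and rankings.

```python
import numpy as np

# Toy instantiation of the Frank-Wolfe reduction sketched in the abstract:
# linearize the concave objective f at the running average of observed reward
# vectors, then feed the scalarized reward to an off-the-shelf scalar bandit
# (plain UCB1 here). Everything below is an illustrative assumption; note the
# induced scalar subproblem is nonstationary, which UCB1 ignores here.
rng = np.random.default_rng(0)
K, D, T = 5, 2, 5000                        # arms, reward dimensions, horizon
true_means = rng.uniform(0.2, 0.8, (K, D))  # unknown mean reward vectors

def grad_f(z, eps=1e-3):
    # Gradient of f(z) = sum_i log(z_i + eps), a concave trade-off of the
    # D reward components (fairness-flavored: it favors balanced rewards).
    return 1.0 / (z + eps)

counts = np.zeros(K)
scalar_sums = np.zeros(K)   # per-arm sums of scalarized rewards
z_bar = np.zeros(D)         # running average of observed reward vectors

for t in range(1, T + 1):
    w = grad_f(z_bar)
    w /= w.sum()                            # Frank-Wolfe direction as weights
    if t <= K:                              # pull each arm once to initialize
        a = t - 1
    else:                                   # UCB1 on the scalarized rewards
        a = int(np.argmax(scalar_sums / counts
                          + np.sqrt(2.0 * np.log(t) / counts)))
    r = rng.binomial(1, true_means[a])      # observe a vector-valued reward
    counts[a] += 1
    scalar_sums[a] += w @ r                 # scalar reward seen by UCB1
    z_bar += (r - z_bar) / t                # update the average reward

print("final average reward vector:", z_bar)
```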
Related papers
- Optimizing Sharpe Ratio: Risk-Adjusted Decision-Making in Multi-Armed Bandits [3.5502600490147196]
We consider the Sharpe Ratio (SR) as a critical parameter in characterizing financial time series.
We propose a novel algorithm for optimizing the SR, called UCB-RSSR, for Regret Minimization (RM) and Best Arm Identification (BAI).
We demonstrate that UCB-RSSR outperforms the only other known SR-optimizing bandit algorithm, U-UCB (Cassel et al., 2023).
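For context, the Sharpe ratio in its textbook form (background, not quoted from this abstract) is $\mathrm{SR} = (\mu - r_f)/\sigma$: the mean reward $\mu$ in excess of a risk-free baseline $r_f$, normalized by the standard deviation $\sigma$. The bandit problem is then to identify, from noisy samples, the arm maximizing this ratio.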
arXiv Detail & Related papers (2024-05-28T14:24:36Z) - $\alpha$-Fair Contextual Bandits [10.74025233418392]
Contextual bandit algorithms are at the core of many applications, including recommender systems, clinical trials, and optimal portfolio selection.
One of the most popular problems studied in the contextual bandit literature is to maximize the sum of rewards over all rounds.
In this paper, we consider the $\alpha$-Fair Contextual Bandits problem, where the objective is to maximize the global $\alpha$-fair utility function.
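For reference, the $\alpha$-fair utility in its standard form (assumed here; the summary does not spell it out) is $u_\alpha(x) = x^{1-\alpha}/(1-\alpha)$ for $\alpha \ge 0$, $\alpha \neq 1$, and $u_1(x) = \log x$; $\alpha = 0$ recovers the plain sum-of-rewards objective, while larger $\alpha$ weights the worst-off components more heavily.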
arXiv Detail & Related papers (2023-10-22T03:42:59Z) - Contextual Bandits with Packing and Covering Constraints: A Modular Lagrangian Approach via Regression [65.8785736964253]
We consider contextual bandits with linear constraints (CBwLC), a variant of contextual bandits in which the algorithm consumes multiple resources subject to linear constraints on total consumption.
This problem generalizes contextual bandits with knapsacks (CBwK), allowing for packing and covering constraints, as well as positive and negative resource consumption.
We provide the first algorithm for CBwLC (or CBwK) that is based on regression oracles. The algorithm is simple, computationally efficient, and statistically optimal under mild assumptions.
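One way to picture the Lagrangian approach named in the title (our gloss, assuming linear packing constraints $\mathbb{E}[c_t] \le b$ on per-round consumption $c_t$): a primal player, a regression-based contextual bandit, maximizes $\mathcal{L}(\pi, \lambda) = \mathbb{E}_\pi[r_t] - \lambda^\top(\mathbb{E}_\pi[c_t] - b)$ over policies while a dual player updates the multipliers $\lambda \ge 0$ online; covering constraints enter with the opposite sign.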
arXiv Detail & Related papers (2022-11-14T16:08:44Z) - Risk-aware linear bandits with convex loss [0.0]
We propose an optimistic UCB algorithm to learn optimal risk-aware actions, with regret guarantees similar to those of generalized linear bandits.
This approach requires solving a convex problem at each round of the algorithm, which we can relax by allowing only approximate solutions obtained by online gradient descent.
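The relaxation mentioned above replaces the exact per-round solve of a convex problem $\min_{x \in \mathcal{X}} \ell_t(x)$ with the standard projected online-gradient-descent step (generic form, assumed rather than quoted from the paper): $x_{t+1} = \Pi_{\mathcal{X}}(x_t - \eta \nabla \ell_t(x_t))$, trading per-round exactness for a single gradient step.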
arXiv Detail & Related papers (2022-09-15T09:09:53Z) - Risk-Aware Algorithms for Combinatorial Semi-Bandits [7.716156977428555]
We study the multi-armed bandit problem under semi-bandit feedback.
We consider the problem of maximizing the Conditional Value-at-Risk (CVaR), a risk measure that takes into account only the worst-case rewards.
We propose new algorithms that maximize the CVaR of the rewards obtained from the super arms of the bandit.
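For readers new to the risk measure, here is a small self-contained sketch of the standard empirical CVaR estimator (our illustration, not code from the paper): it averages the worst $\alpha$-fraction of reward samples.

```python
import numpy as np

def empirical_cvar(rewards, alpha):
    # Mean of the worst alpha-fraction of rewards (standard estimator).
    # For rewards (higher is better), this is the expected reward conditional
    # on landing in the lowest alpha-tail; maximizing it targets worst cases.
    x = np.sort(np.asarray(rewards, dtype=float))  # ascending: worst first
    k = max(1, int(np.ceil(alpha * len(x))))       # size of the worst tail
    return x[:k].mean()

samples = np.random.default_rng(1).normal(1.0, 2.0, size=10_000)
print(empirical_cvar(samples, alpha=0.05))         # mean of the worst 5%
```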
arXiv Detail & Related papers (2021-12-02T11:29:43Z) - Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus.
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
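The elliptical bonus being approximated is, in its usual linear form (standard background, assumed here), $b_t(x) = \beta \sqrt{x^\top A_t^{-1} x}$ with design matrix $A_t = \lambda I + \sum_{s<t} x_s x_s^\top$; maintaining and inverting $A_t$ is what becomes costly at scale, and is what the anti-concentrated approximation avoids.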
arXiv Detail & Related papers (2021-10-21T15:25:15Z) - Bias-Robust Bayesian Optimization via Dueling Bandit [57.82422045437126]
We consider Bayesian optimization in settings where observations can be adversarially biased.
We propose a novel approach for dueling bandits based on information-directed sampling (IDS).
Thereby, we obtain the first efficient kernelized algorithm for dueling bandits that comes with cumulative regret guarantees.
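Generic information-directed sampling (background, not a quote from this paper) selects $a_t \in \arg\min_a \hat{\Delta}_t(a)^2 / I_t(a)$, the ratio of squared estimated regret to expected information gain, which is what makes it a natural fit for trading robustness to biased observations against exploration.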
arXiv Detail & Related papers (2021-05-25T10:08:41Z) - A One-Size-Fits-All Solution to Conservative Bandit Problems [32.907883501286]
We study a family of conservative bandit problems (CBPs) with sample-path reward constraints.
We propose a One-Size-Fits-All solution to CBPs and present its applications to three encompassed problems.
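The sample-path constraint characteristic of conservative bandits (standard formulation, assumed here) requires that at every round $t$ the cumulative reward stay above a $(1-\alpha)$ fraction of a baseline policy's: $\sum_{s=1}^{t} r_s \ge (1-\alpha)\sum_{s=1}^{t} r_s^{\mathrm{base}}$.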
arXiv Detail & Related papers (2020-12-14T08:50:23Z) - Risk-Constrained Thompson Sampling for CVaR Bandits [82.47796318548306]
We consider a popular risk measure in quantitative finance known as the Conditional Value at Risk (CVaR).
We explore the performance of a Thompson Sampling-based algorithm CVaR-TS under this risk measure.
arXiv Detail & Related papers (2020-11-16T15:53:22Z) - Optimistic Policy Optimization with Bandit Feedback [70.75568142146493]
We propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $\tilde{O}(\sqrt{S^2 A H^4 K})$ regret for stochastic rewards.
To the best of our knowledge, the two results are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.
arXiv Detail & Related papers (2020-02-19T15:41:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.