Incentive-compatible Bandits: Importance Weighting No More
- URL: http://arxiv.org/abs/2405.06480v1
- Date: Fri, 10 May 2024 13:57:13 GMT
- Title: Incentive-compatible Bandits: Importance Weighting No More
- Authors: Julian Zimmert, Teodor V. Marinov
- Abstract summary: We study the problem of incentive-compatible online learning with bandit feedback.
In this work we propose the first incentive-compatible algorithms that enjoy $O(\sqrt{KT})$ regret bounds.
- Score: 14.344759978208957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of incentive-compatible online learning with bandit feedback. In this class of problems, the experts are self-interested agents who might misrepresent their preferences with the goal of being selected most often. The goal is to devise algorithms which are simultaneously incentive-compatible, that is, the experts are incentivised to report their true preferences, and have no regret with respect to the preferences of the best fixed expert in hindsight. Freeman et al. (2020) propose an algorithm in the full information setting with optimal $O(\sqrt{T \log(K)})$ regret and $O(T^{2/3}(K\log(K))^{1/3})$ regret in the bandit setting. In this work we propose the first incentive-compatible algorithms that enjoy $O(\sqrt{KT})$ regret bounds. We further demonstrate how simple loss-biasing allows the algorithm proposed in Freeman et al. (2020) to enjoy $\tilde O(\sqrt{KT})$ regret. As a byproduct of our approach we obtain the first bandit algorithm with nearly optimal regret bounds in the adversarial setting which works entirely on the observed loss sequence without the need for importance-weighted estimators. Finally, we provide an incentive-compatible algorithm that enjoys asymptotically optimal best-of-both-worlds regret guarantees, i.e., logarithmic regret in the stochastic regime as well as worst-case $O(\sqrt{KT})$ regret.
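To make the byproduct claim concrete, here is a minimal Python sketch of what "working entirely on the observed loss sequence" means: the pulled arm's raw loss is fed into an exponential-weights update without dividing by the sampling probability. The `bias` argument is a hypothetical placeholder for the paper's loss-biasing term; with `bias=0.0` this naive variant is not a no-regret algorithm on its own, and the sketch does not reproduce the paper's actual construction or analysis. Contrast it with the importance-weighted EXP3 sketched under the sleeping-bandits entry in the related papers below.

```python
import numpy as np

def observed_loss_exp_weights(losses, eta=0.1, bias=0.0, seed=0):
    """Exponential weights driven by raw observed losses (no importance
    weighting). `bias` is a hypothetical stand-in for the paper's
    loss-biasing term; bias=0.0 gives the naive variant, which is NOT
    a no-regret algorithm by itself."""
    rng = np.random.default_rng(seed)
    T, K = losses.shape
    cum = np.zeros(K)            # cumulative fed-back losses per arm
    total = 0.0
    for t in range(T):
        w = np.exp(-eta * (cum - cum.min()))   # stabilized weights
        p = w / w.sum()
        arm = rng.choice(K, p=p)
        obs = losses[t, arm]     # bandit feedback: only the pulled arm's loss
        total += obs
        cum[arm] += obs + bias   # raw loss fed back, not obs / p[arm]
    return total
```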
Related papers
- On the price of exact truthfulness in incentive-compatible online learning with bandit feedback: A regret lower bound for WSU-UX [8.861995276194435]
In one view of the classical game of prediction with expert advice with binary outcomes, each expert maintains an adversarially chosen belief.
We consider a recently introduced, strategic variant of this problem with selfish (reputation-seeking) experts.
We show, via an explicit construction of loss sequences, that WSU-UX suffers a worst-case $\Omega(T^{2/3})$ regret lower bound.
arXiv Detail & Related papers (2024-04-08T02:41:32Z) - Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novel Stochastic-Greedy Bandit (SGB) algorithm for combinatorial multi-armed bandit problems when no extra information other than the joint reward of the selected set of $n$ arms at each time $t \in [T]$ is observed.
SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
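Since the summary names the mechanism (an optimized explore-then-commit over a large set of base arms), a minimal sketch of the generic explore-then-commit pattern may help; `etc_topn`, the singleton exploration pulls, and the additive-reward assumption are all simplifications for illustration, not the paper's SGB.

```python
import numpy as np

def etc_topn(pull, K, n, T, m):
    """Generic explore-then-commit for choosing a set of n arms -- a
    simplified stand-in for the SGB idea, not the paper's algorithm.
    Assumes singleton pulls are allowed during exploration and that the
    joint reward is roughly additive, so a greedy top-n commit makes sense."""
    means = np.array([np.mean([pull([a]) for _ in range(m)])
                      for a in range(K)])          # explore each arm m times
    chosen = np.argsort(means)[-n:].tolist()       # greedy top-n by estimate
    reward = sum(pull(chosen) for _ in range(T - K * m))
    return chosen, reward

# Toy usage: noisy additive set reward over K=6 base arms.
rng = np.random.default_rng(0)
base = rng.random(6)
pull = lambda S: sum(base[a] for a in S) + rng.normal(0, 0.1)
print(etc_topn(pull, K=6, n=2, T=1000, m=20))
```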
arXiv Detail & Related papers (2023-12-13T11:08:25Z) - Private Online Prediction from Experts: Separations and Faster Rates [74.52487417350221]
Online prediction from experts is a fundamental problem in machine learning and several works have studied this problem under privacy constraints.
We propose and analyze new algorithms for this problem that improve over the regret bounds of the best existing algorithms for non-adaptive adversaries.
arXiv Detail & Related papers (2022-10-24T18:40:19Z) - Leveraging Initial Hints for Free in Stochastic Linear Bandits [38.58449930961883]
We study the setting of optimization with bandit feedback, where the learner is additionally provided prior knowledge in the form of an initial hint of the optimal action.
We present a novel algorithm for linear bandits that uses this hint to improve its regret to $\tilde O(\sqrt{T})$ when the hint is accurate.
Perhaps surprisingly, our work shows that leveraging a hint shows provable gains without sacrificing worst-case performance.
arXiv Detail & Related papers (2022-03-08T18:48:55Z) - Optimal and Efficient Dynamic Regret Algorithms for Non-Stationary Dueling Bandits [27.279654173896372]
We study the problem of dynamic regret minimization in $K$-armed Dueling Bandits under non-stationary or time-varying preferences.
This is an online learning setup where the agent chooses a pair of items at each round and observes only a relative binary 'win-loss' feedback for this pair.
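The interaction protocol described here is concrete enough to sketch. Below is a minimal simulation of one dueling round under an assumed preference matrix `P`; the matrix values are made up, and in the paper's non-stationary setting `P` would drift over time rather than stay fixed.

```python
import numpy as np

def duel(P, i, j, rng):
    """One round of the dueling-bandit protocol: the learner proposes the
    pair (i, j) and observes only a binary outcome -- 1 if item i beats
    item j. P[i, j] is the probability that i beats j."""
    return int(rng.random() < P[i, j])

rng = np.random.default_rng(0)
P = np.array([[0.5, 0.7, 0.8],    # toy preference matrix, made up;
              [0.3, 0.5, 0.6],    # note P[i, j] + P[j, i] = 1
              [0.2, 0.4, 0.5]])
wins = sum(duel(P, 0, 1, rng) for _ in range(100))
print(f"item 0 beat item 1 in {wins}/100 duels")
```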
arXiv Detail & Related papers (2021-11-06T16:46:55Z) - Multinomial Logit Contextual Bandits: Provable Optimality and Practicality [15.533842336139063]
We consider a sequential assortment selection problem where the user choice is given by a multinomial logit (MNL) choice model whose parameters are unknown.
We propose upper-confidence-bound-based algorithms for this MNL contextual bandit.
We show that a simple variant of the algorithm achieves the optimal regret for a broad class of important applications.
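The MNL choice model itself is standard, so a short sketch of its choice probabilities may ground the summary; `mnl_choice_probs`, the feature values, and `theta` below are illustrative only, and in the bandit problem `theta` is unknown and must be estimated online.

```python
import numpy as np

def mnl_choice_probs(theta, X, assortment):
    """Choice probabilities under a multinomial logit (MNL) model: the
    user picks item i from the offered assortment with probability
    exp(x_i @ theta) / (1 + sum_j exp(x_j @ theta)), where the 1 in the
    denominator is the no-purchase (outside) option."""
    expu = np.exp(np.array([X[i] @ theta for i in assortment]))
    denom = 1.0 + expu.sum()
    return expu / denom, 1.0 / denom   # per-item probs, no-purchase prob

# Toy usage with made-up features.
theta = np.array([0.5, -0.2])
X = np.array([[1.0, 0.0], [0.5, 1.0], [0.0, 1.0]])
probs, p_none = mnl_choice_probs(theta, X, [0, 2])
```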
arXiv Detail & Related papers (2021-03-25T15:42:25Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence-bound algorithm for this problem, called optimistic-pessimistic linear bandit (OPLB).
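As a rough illustration of the upper-confidence-bound machinery involved, here is a generic LinUCB-style selection step under standard ridge-regression assumptions. This is not the paper's OPLB, which pairs an optimistic reward estimate with a pessimistic estimate used to keep the constraint satisfied; the sketch shows only the optimistic half.

```python
import numpy as np

def linucb_step(A, b, actions, alpha=1.0):
    """One step of a generic LinUCB-style rule: estimate theta by ridge
    regression, then pick the action maximizing the optimistic score
    estimate + alpha * uncertainty."""
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    scores = [x @ theta_hat + alpha * np.sqrt(x @ A_inv @ x)
              for x in actions]
    return int(np.argmax(scores))

# Running state: A = I + sum of x x^T, b = sum of r * x. After observing
# reward r for feature x, update with A += np.outer(x, x); b += r * x.
d = 3
A, b = np.eye(d), np.zeros(d)
actions = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.5])]
choice = linucb_step(A, b, actions)
```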
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - Corralling Stochastic Bandit Algorithms [54.10645564702416]
We show that the regret of the corralling algorithms is no worse than that of the best algorithm containing the arm with the highest reward.
We further show that the total regret of the corralling algorithm scales with the gap between the highest reward and the rewards of the other arms.
arXiv Detail & Related papers (2020-06-16T15:33:12Z) - Improved Sleeping Bandits with Stochastic Action Sets and Adversarial Rewards [59.559028338399855]
We consider the problem of sleeping bandits with stochastic action sets and adversarial rewards.
In this paper, we provide a new computationally efficient algorithm inspired by EXP3 that achieves a regret of order $O(\sqrt{T})$.
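For reference, a minimal textbook EXP3 in its loss-based form is sketched below; this is not the paper's sleeping-bandit variant, which additionally restricts sampling and updates to the currently available arms. Note the importance-weighting line, which is exactly the step the headline paper above shows how to dispense with.

```python
import numpy as np

def exp3(losses, eta=0.1, seed=0):
    """Textbook loss-based EXP3: the observed loss is divided by the
    probability of the pulled arm to form an unbiased estimator of the
    full loss vector (importance weighting)."""
    rng = np.random.default_rng(seed)
    T, K = losses.shape
    cum = np.zeros(K)            # cumulative importance-weighted losses
    for t in range(T):
        w = np.exp(-eta * (cum - cum.min()))   # stabilized weights
        p = w / w.sum()
        arm = rng.choice(K, p=p)
        cum[arm] += losses[t, arm] / p[arm]    # importance-weighted estimate
    return cum
```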
arXiv Detail & Related papers (2020-04-14T00:41:26Z) - Model Selection in Contextual Stochastic Bandit Problems [51.94632035240787]
We develop a meta-algorithm that selects between base algorithms.
We show through a lower bound that even when one of the base algorithms has $O(\sqrt{T})$ regret, in general it is impossible to get better than $\Omega(\sqrt{T})$ regret.
arXiv Detail & Related papers (2020-03-03T18:46:34Z) - Regret Minimization in Stochastic Contextual Dueling Bandits [40.17224226373741]
We consider the problem of $K$-armed dueling bandit in the contextual setting.
We present two algorithms for this setup, each with regret guarantees.
arXiv Detail & Related papers (2020-02-20T06:36:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.