Adversarial Bandits with Multi-User Delayed Feedback: Theory and
Application
- URL: http://arxiv.org/abs/2310.11188v2
- Date: Sun, 26 Nov 2023 17:18:23 GMT
- Title: Adversarial Bandits with Multi-User Delayed Feedback: Theory and
Application
- Authors: Yandi Li, Jianxiong Guo, Yupeng Li, Tian Wang, Weijia Jia
- Abstract summary: In this paper, we consider delayed feedback that comes from multiple users, with no restriction on its internal distribution. We formulate an adversarial MAB problem with multi-user delayed feedback and design a modified EXP3 algorithm, MUD-EXP3.
- Score: 17.64363983613468
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-armed bandit (MAB) models have attracted significant research
attention due to their applicability and effectiveness in various real-world
scenarios such as resource allocation, online advertising, and dynamic pricing.
As an important branch, adversarial MAB problems with delayed feedback have
been proposed and studied by many researchers recently: a conceptual adversary
strategically selects the reward distributions associated with each arm to
challenge the learning algorithm, while the agent experiences a delay between
taking an action and receiving the corresponding reward feedback.
However, existing models restrict the feedback to a single user, which makes
them inapplicable to the prevailing scenarios of multiple users (e.g., ad
recommendation for a group of users). In this paper, we consider delayed
feedback that comes from multiple users, with no restriction on its internal
distribution. Moreover, the feedback delays are arbitrary and unknown to the
player in advance, and no latent correlation is assumed among the delays of
different users within a round. Thus,
we formulate an adversarial MAB problem with multi-user delayed feedback and
design a modified EXP3 algorithm, MUD-EXP3, which makes a decision at each
round by considering the importance-weighted estimator of the received
feedback from different users. Given the terminal round index $T$, the number
of users $M$, the number of arms $N$, and an upper bound on the delay
$d_{\max}$, we prove a regret bound of
$\mathcal{O}(\sqrt{TM^2\ln{N}(N\mathrm{e}+4d_{\max})})$.
Furthermore, for the more common case of unknown $T$, an adaptive algorithm
AMUD-EXP3 is proposed that attains sublinear regret with respect to $T$.
Finally, extensive experiments are conducted to demonstrate the correctness
and effectiveness of our algorithms.
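The abstract describes MUD-EXP3 only at a high level. As a concrete illustration, here is a minimal Python sketch of an EXP3-style update that folds in multi-user delayed feedback via importance weighting; the environment hook `collect_feedback`, the fixed learning rate `eta`, and the `1/M` tempering are illustrative assumptions, not the paper's exact specification.

```python
import math
import random

def mud_exp3_sketch(T, N, M, collect_feedback, eta=0.05, gamma=0.05):
    """Minimal sketch of an EXP3-style learner with multi-user delayed
    feedback (an illustration, not the paper's exact MUD-EXP3).

    `collect_feedback(t, arm)` is a hypothetical environment hook that
    returns (s, reward) pairs: rewards generated by the M users at
    earlier rounds s that happen to arrive at round t."""
    weights = [1.0] * N
    prob_history = {}   # round -> sampling distribution used that round
    arm_history = {}    # round -> arm pulled that round
    for t in range(1, T + 1):
        total = sum(weights)
        # Exponential weights mixed with uniform exploration, as in EXP3.
        probs = [(1 - gamma) * w / total + gamma / N for w in weights]
        arm = random.choices(range(N), weights=probs)[0]
        prob_history[t], arm_history[t] = probs, arm
        # Feedback for (possibly) earlier rounds arrives now, with
        # arbitrary, unknown delays and up to M users per round.
        for s, reward in collect_feedback(t, arm):
            played = arm_history[s]
            # Importance-weighted estimator: divide the observed reward
            # by the probability the played arm had at round s.
            est = reward / prob_history[s][played]
            # Tempering by 1/M (an assumption here) keeps the exponent
            # bounded when many users report in the same round.
            weights[played] *= math.exp(eta * est / M)
    return weights
```

For the unknown-$T$ case, a standard device is the doubling trick (restarting the learner with a doubled horizon guess once the current guess is exceeded); whether AMUD-EXP3 uses this particular scheme is not stated in the abstract, so the sketch above should be read as one plausible realization.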
Related papers
- Merit-based Fair Combinatorial Semi-Bandit with Unrestricted Feedback Delays [25.757803459592104]
We study the semi-bandit problem with unrestricted feedback delays under merit-based fairness constraints.
This is motivated by applications such as crowdsourcing and online advertising, where feedback is not immediately available.
We present new bandit algorithms to select arms under unrestricted feedback delays based on their merits.
arXiv Detail & Related papers (2024-07-22T07:36:27Z)
- Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback [58.66941279460248]
Learning from human feedback plays an important role in aligning generative models such as large language models (LLMs).
We study a model within this problem domain: contextual dueling bandits with adversarial feedback, where the true preference label can be flipped by an adversary.
We propose a robust contextual dueling bandit algorithm based on uncertainty-weighted maximum likelihood estimation.
arXiv Detail & Related papers (2024-04-16T17:59:55Z)
- Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505]
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM).
We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound of $\tilde{O}\big(d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d\big)$.
Our regret bound naturally aligns with the intuitive expectation: in scenarios where the comparison is deterministic, the algorithm only suffers an $\tilde{O}(d)$ regret.
arXiv Detail & Related papers (2023-10-02T08:15:52Z)
- A Best-of-both-worlds Algorithm for Bandits with Delayed Feedback with Robustness to Excessive Delays [24.998122268199797]
We propose a new best-of-both-worlds algorithm for bandits with variably delayed feedback.
Our algorithm can tolerate arbitrary excessive delays up to order $T$.
arXiv Detail & Related papers (2023-08-21T12:17:40Z)
- Combinatorial Bandits for Maximum Value Reward Function under Max Value-Index Feedback [9.771002043127728]
We consider a multi-armed bandit problem for maximum value reward function under maximum value and index feedback.
We propose an algorithm and provide a regret bound for problem instances with arm outcomes according to arbitrary distributions with finite supports.
Our algorithm achieves an $O((k/\Delta)\log(T))$ distribution-dependent and a $\tilde{O}(\sqrt{T})$ distribution-independent regret.
arXiv Detail & Related papers (2023-05-25T14:02:12Z)
- Stochastic Submodular Bandits with Delayed Composite Anonymous Bandit Feedback [39.12903814606534]
This paper investigates the problem of multi-armed bandits with submodular (in expectation) rewards and full-bandit delayed feedback.
The delayed feedback is composed of components of rewards from past actions, with unknown division among the sub-components.
The proposed algorithm is shown to outperform other full-bandit approaches with delayed composite anonymous feedback.
arXiv Detail & Related papers (2023-03-23T18:38:33Z)
- Multi-Armed Bandit Problem with Temporally-Partitioned Rewards: When Partial Feedback Counts [53.579515853222986]
We study a novel bandit setting, namely Multi-Armed Bandit with Temporally-Partitioned Rewards (TP-MAB).
This setting is a natural extension of delayed-feedback bandits to the case in which rewards may be dilated over a finite-time span after the pull.
We provide two algorithms to address TP-MAB problems, namely, TP-UCB-FR and TP-UCB-EW.
arXiv Detail & Related papers (2022-06-01T15:56:59Z)
- The price of unfairness in linear bandits with biased feedback [62.25313751895011]
We study the problem of sequential decision making with biased linear bandit feedback.
We show that the worst-case regret is higher than the $dT^{1/2}\log(T)$ regret rate obtained under unbiased feedback.
Interestingly, the gap-dependent rates reveal the existence of non-trivial instances where the problem is no more difficult than its unbiased counterpart.
arXiv Detail & Related papers (2022-03-18T08:03:20Z)
- Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback [67.63049551992816]
We study online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback.
We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^{K} d_k$ is the total delay.
arXiv Detail & Related papers (2022-01-31T12:34:26Z)
- Combinatorial Multi-armed Bandits for Resource Allocation [23.869080705146395]
Motivating examples include allocating limited computing time or wireless spectrum bands to multiple users.
The decision maker should learn the value of the resources allocated to each user from feedback on each user's received reward.
arXiv Detail & Related papers (2021-05-10T13:55:30Z)
- Partial Bandit and Semi-Bandit: Making the Most Out of Scarce Users' Feedback [62.997667081978825]
We present a novel approach for considering user feedback and evaluate it using three distinct strategies.
Despite the limited amount of feedback returned by users (as low as 20% of the total), our approach obtains results similar to those of state-of-the-art approaches.
arXiv Detail & Related papers (2020-09-16T07:32:51Z)