Coordinated Attacks against Contextual Bandits: Fundamental Limits and
  Defense Mechanisms
- URL: http://arxiv.org/abs/2201.12700v1
- Date: Sun, 30 Jan 2022 01:45:13 GMT
- Title: Coordinated Attacks against Contextual Bandits: Fundamental Limits and
  Defense Mechanisms
- Authors: Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, Shie Mannor
- Abstract summary: Motivated by online recommendation systems, we propose the problem of finding the optimal policy in multitask contextual bandits when a small fraction of users are adversarial.
The goal is to robustly learn the policy that maximizes rewards for good users with as few user interactions as possible.
We show we can achieve an $\tilde{O}(\min(S,A)\cdot \alpha/\epsilon^2)$ upper-bound, by employing efficient robust mean estimators.
- Score: 75.17357040707347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   Motivated by online recommendation systems, we propose the problem of finding
the optimal policy in multitask contextual bandits when a small fraction
$\alpha < 1/2$ of tasks (users) are arbitrary and adversarial. The remaining
fraction of good users share the same instance of contextual bandits with $S$
contexts and $A$ actions (items). Naturally, whether a user is good or
adversarial is not known in advance. The goal is to robustly learn the policy
that maximizes rewards for good users with as few user interactions as
possible. Without adversarial users, established results in collaborative
filtering show that $O(1/\epsilon^2)$ per-user interactions suffice to learn a
good policy, precisely because information can be shared across users. This
parallelization gain is fundamentally altered by the presence of adversarial
users: unless there is a super-polynomial number of users, we show a lower bound
of $\tilde{\Omega}(\min(S,A) \cdot \alpha^2 / \epsilon^2)$ {\it per-user}
interactions to learn an $\epsilon$-optimal policy for the good users. We then
show we can achieve an $\tilde{O}(\min(S,A)\cdot \alpha/\epsilon^2)$
upper-bound, by employing efficient robust mean estimators for both uni-variate
and high-dimensional random variables. We also show that this can be improved
depending on the distributions of contexts.
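
To make the role of robust mean estimation concrete, here is a minimal Python sketch of the uni-variate case. It is not the paper's algorithm: it swaps in a plain trimmed mean, a standard robust estimator, and the simulation setup (function names, Bernoulli rewards, adversaries that all report the same extreme value) is a hypothetical illustration. Each user reports an empirical reward for one fixed context-action pair, and an $\alpha$ fraction of the reports is adversarial.

```python
import random
import statistics


def trimmed_mean(values, alpha):
    """Drop the alpha-fraction smallest and largest values, then average the rest."""
    n = len(values)
    k = int(alpha * n)  # number of points trimmed from each end
    if 2 * k >= n:
        raise ValueError("alpha must leave at least one value after trimming")
    kept = sorted(values)[k:n - k]
    return sum(kept) / len(kept)


def simulate_user_reports(n_users, alpha, true_mean, pulls_per_user, rng):
    """Hypothetical setup (an assumption, not from the paper).

    Good users report the average of `pulls_per_user` Bernoulli(true_mean)
    rewards. Adversarial users coordinate and all report the value 1.0.
    """
    n_bad = int(alpha * n_users)
    reports = []
    for _ in range(n_users - n_bad):
        rewards = [1.0 if rng.random() < true_mean else 0.0
                   for _ in range(pulls_per_user)]
        reports.append(sum(rewards) / pulls_per_user)
    reports.extend([1.0] * n_bad)  # coordinated adversarial reports
    return reports


rng = random.Random(0)
reports = simulate_user_reports(
    n_users=1000, alpha=0.1, true_mean=0.3, pulls_per_user=50, rng=rng
)
naive = statistics.fmean(reports)          # pulled toward the adversarial value
robust = trimmed_mean(reports, alpha=0.1)  # discards the extreme reports

print(f"naive mean:   {naive:.3f}")
print(f"trimmed mean: {robust:.3f}")
```

In this sketch the naive average is shifted by roughly $\alpha$ times the gap between the adversarial value and the true mean, while the trimmed mean stays much closer to the true mean. A small bias remains because trimming also removes the lowest honest reports, which is one intuition for why per-user sample requirements still depend on $\alpha$.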
 
      
        Related papers
        - Fast Rates for Bandit PAC Multiclass Classification [73.17969992976501]
 We study multiclass PAC learning with bandit feedback, where inputs are classified into one of $K$ possible labels and feedback is limited to whether or not the predicted labels are correct.
Our main contribution is in designing a novel learning algorithm for the agnostic $(\varepsilon,\delta)$-PAC version of the problem.
 arXiv  Detail & Related papers  (2024-06-18T08:54:04Z)
- Nearly Optimal Algorithms for Contextual Dueling Bandits from   Adversarial Feedback [58.66941279460248]
 Learning from human feedback plays an important role in aligning generative models, such as large language models (LLM).
We study a model within this domain--contextual dueling bandits with adversarial feedback, where the true preference label can be flipped by an adversary.
We propose an algorithm, robust contextual dueling bandits (RCDB), which is based on uncertainty-weighted maximum likelihood estimation.
 arXiv  Detail & Related papers  (2024-04-16T17:59:55Z)
- Context-lumpable stochastic bandits [49.024050919419366]
 We consider a contextual bandit problem with $S$ contexts and $K$ actions.
We give an algorithm that outputs an $\epsilon$-optimal policy after using at most $\widetilde{O}(r(S+K)/\epsilon^2)$ samples.
In the regret setting, we give an algorithm whose cumulative regret up to time $T$ is bounded by $\widetilde{O}(\sqrt{r^3(S+K)T})$.
 arXiv  Detail & Related papers  (2023-06-22T17:20:30Z)
- Disincentivizing Polarization in Social Networks [10.758115514959593]
 We present a model for content curation and personalization that avoids filter bubbles.
We provide algorithmic guarantees for optimizing recommendations.
Using real-world preference data, we verify that under our model, users share the burden of diversification with only minor utility loss.
 arXiv  Detail & Related papers  (2023-05-23T21:47:31Z)
- Optimal Algorithms for Latent Bandits with Cluster Structure [50.44722775727619]
 We consider the problem of latent bandits with cluster structure where there are multiple users, each with an associated multi-armed bandit problem.
We propose LATTICE, which allows exploitation of the latent cluster structure to provide the minimax optimal regret of $\widetilde{O}(\sqrt{(\mathsf{M}+\mathsf{N})\mathsf{T}})$.
 arXiv  Detail & Related papers  (2023-01-17T17:49:04Z)
- Tractable Optimality in Episodic Latent MABs [75.17357040707347]
 We consider a multi-armed bandit problem with $M$ latent contexts, where an agent interacts with the environment for an episode of $H$ time steps.
 Depending on the length of the episode, the learner may not be able to estimate accurately the latent context.
We design a procedure that provably learns a near-optimal policy with $O(\texttt{poly}(A) + \texttt{poly}(M,H)^{\min(M,H)})$ interactions.
 arXiv  Detail & Related papers  (2022-10-05T22:53:46Z)
- Instance-optimal PAC Algorithms for Contextual Bandits [20.176752818200438]
 In this work, we focus on the bandit problem in the $(\epsilon,\delta)$-$\textit{PAC}$ setting.
We show that no algorithm can be simultaneously minimax-optimal for regret minimization and instance-dependent PAC for best-arm identification.
 arXiv  Detail & Related papers  (2022-07-05T23:19:43Z)
- Modeling Attrition in Recommender Systems with Departing Bandits [84.85560764274399]
 We propose a novel multi-armed bandit setup that captures policy-dependent horizons.
We first address the case where all users share the same type, demonstrating that a recent UCB-based algorithm is optimal.
We then move forward to the more challenging case, where users are divided among two types.
 arXiv  Detail & Related papers  (2022-03-25T02:30:54Z)
- Contextual Bandits with Side-Observations [10.248045133793287]
 We investigate contextual bandits in the presence of side-observations across arms in order to design recommendation algorithms for users connected via social networks.
We show that a naive application of existing learning algorithms results in $O\left(N\log T\right)$ regret, where $N$ is the number of users.
 arXiv  Detail & Related papers  (2020-06-06T19:34:50Z)