Coordinated Attacks against Contextual Bandits: Fundamental Limits and
Defense Mechanisms
- URL: http://arxiv.org/abs/2201.12700v1
- Date: Sun, 30 Jan 2022 01:45:13 GMT
- Title: Coordinated Attacks against Contextual Bandits: Fundamental Limits and
Defense Mechanisms
- Authors: Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, Shie Mannor
- Abstract summary: Motivated by online recommendation systems, we propose the problem of finding the optimal policy in multitask contextual bandits when a small fraction of users are adversarial.
The goal is to robustly learn the policy that maximizes rewards for good users with as few user interactions as possible.
We show we can achieve an $\tilde{O}(\min(S,A)\cdot \alpha/\epsilon^2)$ upper-bound, by employing efficient robust mean estimators.
- Score: 75.17357040707347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motivated by online recommendation systems, we propose the problem of finding
the optimal policy in multitask contextual bandits when a small fraction
$\alpha < 1/2$ of tasks (users) are arbitrary and adversarial. The remaining
fraction of good users share the same instance of contextual bandits with $S$
contexts and $A$ actions (items). Naturally, whether a user is good or
adversarial is not known in advance. The goal is to robustly learn the policy
that maximizes rewards for good users with as few user interactions as
possible. Without adversarial users, established results in collaborative
filtering show that $O(1/\epsilon^2)$ per-user interactions suffice to learn a
good policy, precisely because information can be shared across users. This
parallelization gain is fundamentally altered by the presence of adversarial
users: unless there is a super-polynomial number of users, we show a lower bound
of $\tilde{\Omega}(\min(S,A) \cdot \alpha^2 / \epsilon^2)$ {\it per-user}
interactions to learn an $\epsilon$-optimal policy for the good users. We then
show we can achieve an $\tilde{O}(\min(S,A)\cdot \alpha/\epsilon^2)$
upper-bound, by employing efficient robust mean estimators for both uni-variate
and high-dimensional random variables. We also show that this can be improved
depending on the distributions of contexts.
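The upper bound rests on aggregating reward estimates across users so that the $\alpha$ fraction of adversarial users cannot skew the learned policy. As a minimal sketch of the kind of estimator the abstract refers to (not the authors' exact construction; all names and parameters here are illustrative), the Python snippet below implements two textbook robust mean estimators: a trimmed mean for univariate rewards and a coordinate-wise median-of-means for higher-dimensional statistics.

```python
import numpy as np

def trimmed_mean(samples, alpha):
    """Univariate robust mean: discard the most extreme alpha fraction
    of points on each side, then average the remainder. Tolerates up to
    an alpha < 1/2 fraction of arbitrarily corrupted samples."""
    x = np.sort(np.asarray(samples, dtype=float))
    k = int(np.ceil(alpha * len(x)))
    kept = x[k:len(x) - k] if len(x) > 2 * k else x
    return kept.mean()

def median_of_means(samples, num_blocks):
    """Robust mean for (possibly high-dimensional) samples: split the
    data into blocks, average within each block, then take the
    coordinate-wise median of the block means."""
    blocks = np.array_split(np.asarray(samples, dtype=float), num_blocks)
    return np.median(np.stack([b.mean(axis=0) for b in blocks]), axis=0)

# Toy check: 10% of users report coordinated adversarial rewards.
rng = np.random.default_rng(0)
good = rng.normal(0.5, 0.1, size=90)     # good users' reward reports
bad = np.full(10, 100.0)                 # adversarial reports
reports = np.concatenate([good, bad])
print(np.mean(reports))                  # naive mean, badly skewed
print(trimmed_mean(reports, alpha=0.1))  # robust estimate, close to 0.5
```

Roughly, replacing the naive per-(context, action) sample mean with a robust aggregate of this kind is what makes a guarantee like the $\tilde{O}(\min(S,A)\cdot \alpha/\epsilon^2)$ bound possible despite the corrupted user population.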
Related papers
- Fast Rates for Bandit PAC Multiclass Classification [73.17969992976501]
We study multiclass PAC learning with bandit feedback, where inputs are classified into one of $K$ possible labels and feedback is limited to whether or not the predicted labels are correct.
Our main contribution is a novel learning algorithm for the agnostic $(\varepsilon,\delta)$-PAC version of the problem.
arXiv Detail & Related papers (2024-06-18T08:54:04Z)
- Context-lumpable stochastic bandits [49.024050919419366]
We consider a contextual bandit problem with $S$ contexts and $K$ actions.
We give an algorithm that outputs an $\epsilon$-optimal policy after using at most $\widetilde{O}(r(S+K)/\epsilon^2)$ samples.
In the regret setting, we give an algorithm whose cumulative regret up to time $T$ is bounded by $\widetilde{O}(\sqrt{r^3(S+K)T})$.
arXiv Detail & Related papers (2023-06-22T17:20:30Z)
- Disincentivizing Polarization in Social Networks [10.758115514959593]
We present a model for content curation and personalization that avoids filter bubbles.
We provide algorithmic guarantees for optimizing recommendations.
Using real-world preference data, we verify that under our model, users share the burden of diversification with only minor utility loss.
arXiv Detail & Related papers (2023-05-23T21:47:31Z)
- Optimal Algorithms for Latent Bandits with Cluster Structure [50.44722775727619]
We consider the problem of latent bandits with cluster structure where there are multiple users, each with an associated multi-armed bandit problem.
We propose LATTICE, which exploits the latent cluster structure to provide the minimax optimal regret of $\widetilde{O}(\sqrt{(\mathsf{M}+\mathsf{N})\mathsf{T}})$.
arXiv Detail & Related papers (2023-01-17T17:49:04Z)
- Tractable Optimality in Episodic Latent MABs [75.17357040707347]
We consider a multi-armed bandit problem with $M$ latent contexts, where an agent interacts with the environment for an episode of $H$ time steps.
Depending on the length of the episode, the learner may not be able to accurately estimate the latent context.
We design a procedure that provably learns a near-optimal policy with $O(\texttt{poly}(A) + \texttt{poly}(M,H)^{\min(M,H)})$ interactions.
arXiv Detail & Related papers (2022-10-05T22:53:46Z)
- Instance-optimal PAC Algorithms for Contextual Bandits [20.176752818200438]
In this work, we focus on the bandit problem in the $(\epsilon,\delta)$-PAC setting.
We show that no algorithm can simultaneously achieve minimax-optimal regret minimization and instance-dependent PAC best-arm identification.
arXiv Detail & Related papers (2022-07-05T23:19:43Z)
- Modeling Attrition in Recommender Systems with Departing Bandits [84.85560764274399]
We propose a novel multi-armed bandit setup that captures policy-dependent horizons.
We first address the case where all users share the same type, demonstrating that a recent UCB-based algorithm is optimal.
We then turn to the more challenging case, where users are divided between two types.
arXiv Detail & Related papers (2022-03-25T02:30:54Z)
- Contextual Bandits with Side-Observations [10.248045133793287]
We investigate contextual bandits in the presence of side-observations across arms in order to design recommendation algorithms for users connected via social networks.
We show that a naive application of existing learning algorithms results in $O\left(N\log T\right)$ regret, where $N$ is the number of users.
arXiv Detail & Related papers (2020-06-06T19:34:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.