Preference Poisoning Attacks on Reward Model Learning
- URL: http://arxiv.org/abs/2402.01920v1
- Date: Fri, 2 Feb 2024 21:45:24 GMT
- Title: Preference Poisoning Attacks on Reward Model Learning
- Authors: Junlin Wu, Jiongxiao Wang, Chaowei Xiao, Chenguang Wang, Ning Zhang,
Yevgeniy Vorobeychik
- Abstract summary: We show how an attacker can flip a small subset of preference comparisons with the goal of either promoting or demoting a target outcome.
We find that the best attacks are often highly successful, achieving in the most extreme case 100% success rate with only 0.3% of the data poisoned.
We also show that several state-of-the-art defenses against other classes of poisoning attacks exhibit, at best, limited efficacy in our setting.
- Score: 49.806139447922526
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Learning utility, or reward, models from pairwise comparisons is a
fundamental component in a number of application domains. These approaches
inherently entail collecting preference information from people, with feedback
often provided anonymously. Since preferences are subjective, there is no gold
standard to compare against; yet the reliance of high-impact systems on preference
learning creates a strong motivation for malicious actors to skew data
collected in this fashion to their ends. We investigate the nature and extent
of this vulnerability systematically by considering a threat model in which an
attacker can flip a small subset of preference comparisons with the goal of
either promoting or demoting a target outcome. First, we propose two classes of
algorithmic approaches for these attacks: a principled gradient-based
framework, and several variants of rank-by-distance methods. Next, we
demonstrate the efficacy of the best attacks in both classes in achieving
malicious goals on datasets from three diverse domains: autonomous control,
recommendation systems, and textual prompt-response preference
learning. We find that the best attacks are often highly successful, achieving
in the most extreme case 100% success rate with only 0.3% of the data poisoned.
However, which attack is best can vary significantly across domains,
demonstrating the value of our comprehensive vulnerability analysis that
involves several classes of attack algorithms. In addition, we observe that the
simpler and more scalable rank-by-distance approaches are often competitive
with the best, and on occasion significantly outperform gradient-based methods.
Finally, we show that several state-of-the-art defenses against other classes
of poisoning attacks exhibit, at best, limited efficacy in our setting.
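To make the threat model concrete, below is a minimal, self-contained sketch of label-flip poisoning against a linear Bradley-Terry reward model on synthetic data. The distance-based flip selection is an assumed stand-in in the spirit of the rank-by-distance attacks, not the paper's algorithm; all names and parameters are hypothetical.
```python
# Illustrative sketch only: flip a small budget of pairwise preference labels to
# promote a target item under a linear Bradley-Terry reward model. The distance-based
# selection below is an assumed heuristic, not a reproduction of the paper's methods.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic items with a hidden linear "true" reward.
n_items, dim = 50, 5
X = rng.normal(size=(n_items, dim))
w_true = rng.normal(size=dim)

# Pairwise comparisons: label y=1 means the first item of the pair is preferred.
pairs = np.array([(i, j) for i in range(n_items) for j in range(i + 1, n_items)])
pairs = rng.permutation(pairs)[:600]
logits = (X[pairs[:, 0]] - X[pairs[:, 1]]) @ w_true
y = (rng.random(len(pairs)) < 1.0 / (1.0 + np.exp(-logits))).astype(float)

def fit_reward(pairs, y, lr=0.5, epochs=2000):
    """Fit a linear Bradley-Terry reward model by gradient ascent on the log-likelihood."""
    w = np.zeros(dim)
    d = X[pairs[:, 0]] - X[pairs[:, 1]]
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-d @ w))
        w += lr * d.T @ (y - p) / len(y)
    return w

target = 0                        # item the attacker wants to promote
budget = int(0.05 * len(pairs))   # attacker may flip 5% of the comparisons

# Assumed rank-by-distance-style heuristic: among comparisons where the item closer
# to the target currently loses, flip the closest ones so "target-like" items win.
d0 = np.linalg.norm(X[pairs[:, 0]] - X[target], axis=1)
d1 = np.linalg.norm(X[pairs[:, 1]] - X[target], axis=1)
desired = (d0 < d1).astype(float)             # label favouring the target-like side
wrong = np.where(y != desired)[0]
flip_idx = wrong[np.argsort(np.minimum(d0, d1)[wrong])][:budget]
y_poisoned = y.copy()
y_poisoned[flip_idx] = desired[flip_idx]

for name, labels in [("clean", y), ("poisoned", y_poisoned)]:
    w = fit_reward(pairs, labels)
    rank = int((X @ w > X[target] @ w).sum()) + 1
    print(f"{name:9s} reward model ranks the target item {rank}/{n_items}")
```
Running the sketch simply compares the target item's rank under reward models fit to the clean and the poisoned labels.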
Related papers
- AttackBench: Evaluating Gradient-based Attacks for Adversarial Examples [26.37278338032268]
Adversarial examples are typically optimized with gradient-based attacks.
Each is shown to outperform its predecessors using different experimental setups.
This provides overly optimistic and even biased evaluations.
arXiv Detail & Related papers (2024-04-30T11:19:05Z)
- Alternating Objectives Generates Stronger PGD-Based Adversarial Attacks [78.2700757742992]
Projected Gradient Descent (PGD) is one of the most effective and conceptually simple algorithms to generate such adversaries.
We experimentally verify this assertion on a synthetic-data example and by evaluating our proposed method across 25 different $\ell_\infty$-robust models and 3 datasets.
Our strongest adversarial attack outperforms all of the white-box components of the AutoAttack ensemble.
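For reference, a generic untargeted $\ell_\infty$ PGD attack in PyTorch is sketched below; this is the textbook baseline form, not the alternating-objectives variant proposed in that paper, and `model` is assumed to be any classifier over inputs in [0, 1].
```python
# Textbook untargeted l_inf PGD (baseline form, not the paper's alternating-objectives variant).
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Maximise cross-entropy within an eps-ball around x; inputs assumed in [0, 1]."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()           # signed ascent step
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project onto the eps-ball
        x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()
```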
arXiv Detail & Related papers (2022-12-15T17:44:31Z)
- Universal Distributional Decision-based Black-box Adversarial Attack with Reinforcement Learning [5.240772699480865]
We propose a pixel-wise decision-based attack algorithm that finds a distribution of adversarial perturbations through reinforcement learning.
Experiments show that the proposed approach outperforms state-of-the-art decision-based attacks with a higher attack success rate and greater transferability.
arXiv Detail & Related papers (2022-11-15T18:30:18Z)
- A Tale of HodgeRank and Spectral Method: Target Attack Against Rank Aggregation Is the Fixed Point of Adversarial Game [153.74942025516853]
The intrinsic vulnerability of rank aggregation methods is not well studied in the literature.
In this paper, we focus on a purposeful adversary who aims to control the aggregated results by modifying the pairwise data.
The effectiveness of the suggested target attack strategies is demonstrated by a series of toy simulations and several real-world data experiments.
arXiv Detail & Related papers (2022-09-13T05:59:02Z)
- Resisting Adversarial Attacks in Deep Neural Networks using Diverse Decision Boundaries [12.312877365123267]
Deep learning systems are vulnerable to crafted adversarial examples, which may be imperceptible to the human eye, but can lead the model to misclassify.
We develop a new ensemble-based solution that constructs defender models with diverse decision boundaries with respect to the original model.
We present extensive experiments on standard image classification datasets, namely MNIST, CIFAR-10, and CIFAR-100, against state-of-the-art adversarial attacks.
arXiv Detail & Related papers (2022-08-18T08:19:26Z)
- Adversarial Robustness of Deep Reinforcement Learning based Dynamic Recommender Systems [50.758281304737444]
We propose to explore adversarial examples and attack detection on reinforcement learning-based interactive recommendation systems.
We first craft different types of adversarial examples by adding perturbations to the input and intervening on the causal factors.
Then, we augment recommendation systems by detecting potential attacks with a deep learning-based classifier trained on the crafted data.
arXiv Detail & Related papers (2021-12-02T04:12:24Z)
- Towards A Conceptually Simple Defensive Approach for Few-shot classifiers Against Adversarial Support Samples [107.38834819682315]
We study a conceptually simple approach to defend few-shot classifiers against adversarial attacks.
We propose a simple attack-agnostic detection method, using the concept of self-similarity and filtering.
Our evaluation on the miniImagenet (MI) and CUB datasets exhibits good attack detection performance.
arXiv Detail & Related papers (2021-10-24T05:46:03Z)
- Adversarial Attack and Defense in Deep Ranking [100.17641539999055]
We propose two attacks against deep ranking systems that can raise or lower the rank of chosen candidates by adversarial perturbations.
Conversely, an anti-collapse triplet defense is proposed to improve the ranking model robustness against all proposed attacks.
Our adversarial ranking attacks and defenses are evaluated on MNIST, Fashion-MNIST, CUB200-2011, CARS196 and Stanford Online Products datasets.
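As a rough illustration of the candidate-raising idea (an assumed generic gradient attack on embedding distance, not that paper's exact candidate/query attacks or its anti-collapse triplet defense):
```python
# Generic candidate-raising sketch for a deep ranking model (illustrative assumption):
# nudge a candidate image so its embedding moves closer to the query embeddings,
# which pushes it up the ranking.
import torch

def raise_candidate(embed, candidate, queries, eps=8 / 255, alpha=2 / 255, steps=10):
    """Perturb `candidate` (shape [1, C, H, W], values in [0, 1]) within an l_inf ball."""
    with torch.no_grad():
        q_emb = embed(queries)                          # query embeddings, held fixed
    c_adv = candidate.clone().detach()
    for _ in range(steps):
        c_adv.requires_grad_(True)
        dist = torch.cdist(embed(c_adv), q_emb).mean()  # mean distance to the queries
        grad = torch.autograd.grad(dist, c_adv)[0]
        c_adv = c_adv.detach() - alpha * grad.sign()    # descend: closer => higher rank
        c_adv = torch.min(torch.max(c_adv, candidate - eps), candidate + eps).clamp(0, 1)
    return c_adv.detach()
```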
arXiv Detail & Related papers (2021-06-07T13:41:45Z)