ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned
Samples in NLP
- URL: http://arxiv.org/abs/2308.02122v2
- Date: Fri, 27 Oct 2023 04:51:56 GMT
- Title: ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned
Samples in NLP
- Authors: Lu Yan, Zhuo Zhang, Guanhong Tao, Kaiyuan Zhang, Xuan Chen, Guangyu
Shen, Xiangyu Zhang
- Abstract summary: We propose an innovative test-time poisoned sample detection framework that hinges on the interpretability of model predictions.
We employ ChatGPT, a state-of-the-art large language model, as our paraphraser and formulate the trigger-removal task as a prompt engineering problem.
- Score: 29.375957205348115
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Backdoor attacks have emerged as a prominent threat to natural language
processing (NLP) models, where the presence of specific triggers in the input
can lead poisoned models to misclassify these inputs to predetermined target
classes. Current detection mechanisms are limited by their inability to address
more covert backdoor strategies, such as style-based attacks. In this work, we
propose an innovative test-time poisoned sample detection framework that hinges
on the interpretability of model predictions, grounded in the semantic meaning
of inputs. We contend that triggers (e.g., infrequent words) should not
fundamentally alter the underlying semantics of poisoned samples, since the
attack must remain stealthy. Based on this observation, we hypothesize that
while the model's predictions for paraphrased clean samples remain stable,
predictions for poisoned samples revert to their true labels once paraphrasing
mutates the triggers. We
employ ChatGPT, a state-of-the-art large language model, as our paraphraser and
formulate the trigger-removal task as a prompt engineering problem. We adopt
fuzzing, a technique commonly used for unearthing software vulnerabilities, to
discover optimal paraphrase prompts that can effectively eliminate triggers
while concurrently maintaining input semantics. Experiments on 4 types of
backdoor attacks, including subtle style-based backdoors, and 4 distinct datasets
demonstrate that our approach surpasses baseline methods, including STRIP, RAP,
and ONION, in precision and recall.
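To make the pipeline concrete, below is a minimal Python sketch of the test-time check and the prompt-fuzzing loop described above. The callables `classify`, `paraphrase`, `paraphrase_with`, and `mutate`, as well as the labeled validation set used as the fuzzing fitness, are placeholders assumed for illustration; this is not the authors' released implementation.

```python
from typing import Callable, Iterable, List, Tuple

def is_poisoned(text: str,
                classify: Callable[[str], int],
                paraphrase: Callable[[str], str]) -> bool:
    """Flag a sample as poisoned if its prediction changes after paraphrasing."""
    return classify(text) != classify(paraphrase(text))

def detection_f1(prompt: str,
                 val_set: Iterable[Tuple[str, bool]],   # (text, is_poisoned label)
                 classify: Callable[[str], int],
                 paraphrase_with: Callable[[str, str], str]) -> float:
    """F1 of the paraphrase-based detector on a small labeled validation set;
    used as the fitness for ranking candidate paraphrase prompts."""
    tp = fp = fn = 0
    for text, truly_poisoned in val_set:
        flagged = classify(text) != classify(paraphrase_with(prompt, text))
        if flagged and truly_poisoned:
            tp += 1
        elif flagged and not truly_poisoned:
            fp += 1
        elif (not flagged) and truly_poisoned:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def fuzz_prompts(seed_prompts: List[str],
                 mutate: Callable[[str], str],
                 fitness: Callable[[str], float],
                 rounds: int = 20) -> str:
    """Greedy fuzzing loop: mutate the best prompt found so far, keep improvements."""
    best = max(seed_prompts, key=fitness)
    best_score = fitness(best)
    for _ in range(rounds):
        candidate = mutate(best)
        candidate_score = fitness(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best
```

In practice, `paraphrase_with` would wrap a ChatGPT call driven by the candidate prompt, and `mutate` would reword the paraphrasing instruction between fuzzing rounds.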
Related papers
- A Study of Backdoors in Instruction Fine-tuned Language Models [16.10608633005216]
Backdoor data poisoning is a serious security concern due to the evasive nature of such attacks.
Such backdoor attacks can alter response sentiment, violate censorship, over-refuse (i.e., invoke censorship for legitimate queries), inject false content, or trigger nonsense responses (hallucinations).
arXiv Detail & Related papers (2024-06-12T00:01:32Z)
- PSBD: Prediction Shift Uncertainty Unlocks Backdoor Detection [57.571451139201855]
Prediction Shift Backdoor Detection (PSBD) is a novel method for identifying backdoor samples in deep neural networks.
PSBD is motivated by an intriguing Prediction Shift (PS) phenomenon, where poisoned models' predictions on clean data often shift away from true labels towards certain other labels.
PSBD identifies backdoor training samples by computing the Prediction Shift Uncertainty (PSU), the variance in probability values when dropout layers are toggled on and off during model inference.
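As a rough illustration of that statistic, the sketch below computes a PSU-like score in PyTorch via Monte-Carlo dropout; the function name, number of passes, and the choice to track the reference-class probability are assumptions rather than the authors' implementation.

```python
import torch

@torch.no_grad()
def prediction_shift_uncertainty(model: torch.nn.Module,
                                 x: torch.Tensor,
                                 n_passes: int = 10) -> torch.Tensor:
    """Variance of the predicted-class probability across stochastic dropout passes."""
    model.eval()                                # dropout off: reference prediction
    ref_prob = torch.softmax(model(x), dim=-1)
    target = ref_prob.argmax(dim=-1, keepdim=True)

    model.train()                               # dropout on (note: this also flips batch-norm;
                                                # a real implementation would toggle only dropout)
    samples = []
    for _ in range(n_passes):
        prob = torch.softmax(model(x), dim=-1)
        samples.append(prob.gather(1, target).squeeze(1))
    model.eval()

    # High variance of the reference-class probability -> candidate backdoor sample.
    return torch.stack(samples, dim=0).var(dim=0)
```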
arXiv Detail & Related papers (2024-06-09T15:31:00Z)
- Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots [68.84056762301329]
Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks.
We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively.
Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
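A toy sketch of what such a module could look like is shown below: an auxiliary head over lower-layer hidden states intended to soak up easily learned backdoor features. The layer choice, pooling, and the reweighting idea in the comment are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class HoneypotHead(nn.Module):
    """Auxiliary classifier over lower-layer hidden states of a PLM, meant to
    absorb easily learned (backdoor-like) features before the main head sees them."""

    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, lower_hidden: torch.Tensor) -> torch.Tensor:
        # lower_hidden: (batch, seq_len, hidden) taken from an early transformer layer.
        pooled = lower_hidden.mean(dim=1)
        return self.classifier(pooled)

# During fine-tuning, samples that this shallow head already fits with very low loss
# could be down-weighted for the main classifier, since they are backdoor suspects.
```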
arXiv Detail & Related papers (2023-10-28T08:21:16Z)
- Exploring Model Dynamics for Accumulative Poisoning Discovery [62.08553134316483]
We propose a novel information measure, Memorization Discrepancy, to explore defenses using model-level information.
By implicitly transferring changes in the data manipulation to changes in the model outputs, Memorization Discrepancy can discover imperceptible poison samples.
We thoroughly explore its properties and propose Discrepancy-aware Sample Correction (DSC) to defend against accumulative poisoning attacks.
arXiv Detail & Related papers (2023-06-06T14:45:24Z) - Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in
Language Models [41.1058288041033]
We propose ProAttack, a novel and efficient method for performing clean-label backdoor attacks based on the prompt.
Our method does not require external triggers and ensures correct labeling of poisoned samples, improving the stealthy nature of the backdoor attack.
arXiv Detail & Related papers (2023-05-02T06:19:36Z) - Backdoor Attacks with Input-unique Triggers in NLP [34.98477726215485]
Backdoor attacks aim to induce neural models to make incorrect predictions on poisoned data while keeping predictions on clean data unchanged.
In this paper, we propose an input-unique backdoor attack (NURA), where we generate backdoor triggers unique to each input.
arXiv Detail & Related papers (2023-03-25T01:41:54Z) - A Unified Evaluation of Textual Backdoor Learning: Frameworks and
Benchmarks [72.7373468905418]
We develop an open-source toolkit, OpenBackdoor, to foster the implementation and evaluation of textual backdoor learning.
We also propose CUBE, a simple yet strong clustering-based defense baseline.
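As a hedged sketch of what a clustering-based defense in this spirit can look like (not necessarily CUBE's actual algorithm), the snippet below clusters hidden representations of training samples and drops clusters whose members largely disagree with their assigned labels; the embedding source, cluster count, and purity threshold are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def filter_suspected_poison(embeddings: np.ndarray,   # (n, d) hidden representations
                            labels: np.ndarray,        # (n,) integer training labels
                            n_clusters: int = 10,
                            purity_threshold: float = 0.6) -> np.ndarray:
    """Return a boolean mask over samples (True = keep / likely clean)."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    keep = np.ones(len(labels), dtype=bool)
    for c in range(n_clusters):
        idx = np.where(clusters == c)[0]
        if idx.size == 0:
            continue
        # A cluster whose members largely disagree with their assigned labels is
        # suspicious: mislabeled (poisoned) samples tend to cluster by true semantics.
        majority = np.bincount(labels[idx]).argmax()
        purity = float((labels[idx] == majority).mean())
        if purity < purity_threshold:
            keep[idx] = False
    return keep
```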
arXiv Detail & Related papers (2022-06-17T02:29:23Z) - Triggerless Backdoor Attack for NLP Tasks with Clean Labels [31.308324978194637]
A standard strategy to construct poisoned data in backdoor attacks is to insert triggers into selected sentences and alter the original label to a target label.
This strategy comes with a severe flaw of being easily detected from both the trigger and the label perspectives.
We propose a new strategy for textual backdoor attacks that requires no external trigger and keeps the poisoned samples correctly labeled.
arXiv Detail & Related papers (2021-11-15T18:36:25Z) - Hidden Backdoor Attack against Semantic Segmentation Models [60.0327238844584]
The backdoor attack intends to embed hidden backdoors in deep neural networks (DNNs) by poisoning training data.
We propose a novel attack paradigm, the fine-grained attack, where we treat the target label at the object level instead of the image level.
Experiments show that the proposed methods can successfully attack semantic segmentation models by poisoning only a small proportion of training data.
arXiv Detail & Related papers (2021-03-06T05:50:29Z) - Poisoned classifiers are not only backdoored, they are fundamentally
broken [84.67778403778442]
Under a commonly-studied backdoor poisoning attack against classification models, an attacker adds a small trigger to a subset of the training data.
It is often assumed that the poisoned classifier is vulnerable exclusively to the adversary who possesses the trigger.
In this paper, we show empirically that this view of backdoored classifiers is incorrect.
arXiv Detail & Related papers (2020-10-18T19:42:44Z) - Backdoor Smoothing: Demystifying Backdoor Attacks on Deep Neural
Networks [25.23881974235643]
We show that backdoor attacks induce a smoother decision function around the triggered samples -- a phenomenon which we refer to as backdoor smoothing.
Our experiments show that smoothness increases when the trigger is added to the input samples, and that this phenomenon is more pronounced for more successful attacks.
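One simple proxy for this kind of smoothness (an illustrative assumption, not the paper's exact measure) is how little the predicted-class probability moves under small random perturbations of the input:

```python
import torch

@torch.no_grad()
def local_smoothness(model: torch.nn.Module,
                     x: torch.Tensor,        # a single input tensor
                     sigma: float = 0.05,
                     n_samples: int = 50) -> float:
    """Average absolute change of the predicted-class probability under Gaussian noise."""
    model.eval()
    base = torch.softmax(model(x.unsqueeze(0)), dim=-1)
    cls = base.argmax(dim=-1)
    diffs = []
    for _ in range(n_samples):
        noisy = x + sigma * torch.randn_like(x)
        prob = torch.softmax(model(noisy.unsqueeze(0)), dim=-1)
        diffs.append((prob[0, cls] - base[0, cls]).abs().item())
    # Smaller average deviation = smoother decision function around x.
    return float(sum(diffs) / len(diffs))
```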
arXiv Detail & Related papers (2020-06-11T18:28:54Z)