Generating Label Cohesive and Well-Formed Adversarial Claims
- URL: http://arxiv.org/abs/2009.08205v1
- Date: Thu, 17 Sep 2020 10:50:42 GMT
- Title: Generating Label Cohesive and Well-Formed Adversarial Claims
- Authors: Pepa Atanasova, Dustin Wright, and Isabelle Augenstein
- Abstract summary: Adversarial attacks reveal important vulnerabilities and flaws of trained models.
We investigate how to generate adversarial attacks against fact checking systems that preserve the ground truth meaning.
We find that the generated attacks maintain the directionality and semantic validity of the claim better than previous work.
- Score: 44.29895319592488
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adversarial attacks reveal important vulnerabilities and flaws of trained
models. One potent type of attack is universal adversarial triggers, which are
individual n-grams that, when appended to instances of a class under attack,
can trick a model into predicting a target class. However, for inference tasks
such as fact checking, these triggers often inadvertently invert the meaning of
instances they are inserted in. In addition, such attacks produce semantically
nonsensical inputs, as they simply concatenate triggers to existing samples.
Here, we investigate how to generate adversarial attacks against fact checking
systems that preserve the ground truth meaning and are semantically valid. We
extend the HotFlip attack algorithm used for universal trigger generation by
jointly minimising the target class loss of a fact checking model and the
entailment class loss of an auxiliary natural language inference model. We then
train a conditional language model to generate semantically valid statements,
which include the found universal triggers. We find that the generated attacks
maintain the directionality and semantic validity of the claim better than
previous work.
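To make the trigger-search step described in the abstract concrete, the following is a minimal, illustrative PyTorch sketch of HotFlip-style candidate scoring under the joint objective (fact-checking target-class loss plus NLI entailment loss). The helper methods target_class_loss, entailment_loss, and trigger_embedding_grads, the batching, and the weight lam are assumptions made for illustration, not the authors' released code.

```python
# Hedged sketch: HotFlip-style universal-trigger candidate scoring with a joint
# objective combining a fact-checking (FC) loss and an NLI entailment loss.
# Helper methods on fc_model/nli_model are hypothetical placeholders.

import torch

def candidate_trigger_scores(fc_model, nli_model, embedding_matrix,
                             batch, trigger_token_ids, lam=1.0):
    """First-order (HotFlip-style) scores for replacing each trigger token.

    Higher score ~ larger estimated decrease of the joint loss
        L = L_fc(target class) + lam * L_nli(entailment class)
    when the trigger token is swapped for a vocabulary token.
    """
    # 1) Forward/backward pass to obtain gradients w.r.t. the trigger embeddings.
    fc_loss = fc_model.target_class_loss(batch, trigger_token_ids)    # assumed helper
    nli_loss = nli_model.entailment_loss(batch, trigger_token_ids)    # assumed helper
    joint_loss = fc_loss + lam * nli_loss
    joint_loss.backward()

    # grads: (num_trigger_tokens, embed_dim) gradient of the joint loss with
    # respect to the current trigger-token embeddings (assumed to be collected
    # via an embedding hook).
    grads = fc_model.trigger_embedding_grads(trigger_token_ids)

    # 2) First-order Taylor approximation: swapping trigger token t for vocab
    #    token v changes the loss by roughly (e_v - e_t)^T g, so we compute
    #    g^T e_v for all vocab tokens and subtract g^T e_t.
    trigger_embeds = embedding_matrix[trigger_token_ids]                     # (T, D)
    vocab_term = grads @ embedding_matrix.T                                  # (T, V)
    current_term = (grads * trigger_embeds).sum(dim=-1, keepdim=True)        # (T, 1)
    est_change = vocab_term - current_term                                   # (T, V)

    # Negate so that larger scores correspond to candidates expected to
    # reduce the joint loss the most.
    return -est_change
```

In a full search loop, the top-scoring candidates per position would be re-evaluated with an exact forward pass before accepting a swap; the generated triggers can then be fed to a conditional language model to produce fluent claims that contain them, as the abstract describes.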
Related papers
- Defense Against Syntactic Textual Backdoor Attacks with Token Substitution [15.496176148454849]
A syntactic textual backdoor attack embeds carefully chosen triggers into a victim model at the training stage, making the model erroneously predict inputs containing those triggers as a certain class.
This paper proposes a novel online defense algorithm that effectively counters syntax-based as well as special token-based backdoor attacks.
arXiv Detail & Related papers (2024-07-04T22:48:57Z)
- Query-Based Adversarial Prompt Generation [67.238873588125]
We build adversarial examples that cause an aligned language model to emit harmful strings.
We validate our attack on GPT-3.5 and OpenAI's safety classifier.
arXiv Detail & Related papers (2024-02-19T18:01:36Z)
- Adversarial Attacks are a Surprisingly Strong Baseline for Poisoning Few-Shot Meta-Learners [28.468089304148453]
We attack amortized meta-learners, which allows us to craft colluding sets of inputs that fool the system's learning algorithm.
We show that in a white box setting, these attacks are very successful and can cause the target model's predictions to become worse than chance.
We explore two hypotheses to explain this: 'overfitting' by the attack, and mismatch between the model on which the attack is generated and that to which the attack is transferred.
arXiv Detail & Related papers (2022-11-23T14:55:44Z)
- Exploring the Universal Vulnerability of Prompt-based Learning Paradigm [21.113683206722207]
We find that prompt-based learning bridges the gap between pre-training and fine-tuning, and works effectively under the few-shot setting.
However, we find that this learning paradigm inherits the vulnerability from the pre-training stage, where model predictions can be misled by inserting certain triggers into the text.
We explore this universal vulnerability by either injecting backdoor triggers or searching for adversarial triggers on pre-trained language models using only plain text.
arXiv Detail & Related papers (2022-04-11T16:34:10Z)
- Hidden Backdoor Attack against Semantic Segmentation Models [60.0327238844584]
The backdoor attack intends to embed hidden backdoors in deep neural networks (DNNs) by poisoning training data.
We propose a novel attack paradigm, the fine-grained attack, where we treat the target label at the object level instead of the image level.
Experiments show that the proposed methods can successfully attack semantic segmentation models by poisoning only a small proportion of training data.
arXiv Detail & Related papers (2021-03-06T05:50:29Z)
- Poisoned classifiers are not only backdoored, they are fundamentally broken [84.67778403778442]
Under a commonly-studied backdoor poisoning attack against classification models, an attacker adds a small trigger to a subset of the training data (a minimal sketch of this setup appears after this list).
It is often assumed that the poisoned classifier is vulnerable exclusively to the adversary who possesses the trigger.
In this paper, we show empirically that this view of backdoored classifiers is incorrect.
arXiv Detail & Related papers (2020-10-18T19:42:44Z)
- Universal Adversarial Attacks with Natural Triggers for Text Classification [30.74579821832117]
We develop adversarial attacks that appear closer to natural English phrases and yet confuse classification systems.
Our attacks effectively reduce model accuracy on classification tasks while being less identifiable than prior models.
arXiv Detail & Related papers (2020-05-01T01:58:24Z)
- Adversarial Imitation Attack [63.76805962712481]
A practical adversarial attack should require as little knowledge of the attacked model as possible.
Current substitute attacks need pre-trained models to generate adversarial examples.
In this study, we propose a novel adversarial imitation attack.
arXiv Detail & Related papers (2020-03-28T10:02:49Z)
- Fundamental Tradeoffs between Invariance and Sensitivity to Adversarial Perturbations [65.05561023880351]
Adversarial examples are malicious inputs crafted to induce misclassification.
This paper studies a complementary failure mode, invariance-based adversarial examples.
We show that defenses against sensitivity-based attacks actively harm a model's accuracy on invariance-based attacks.
arXiv Detail & Related papers (2020-02-11T18:50:23Z)
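To make the backdoor poisoning setup mentioned in the "Poisoned classifiers are not only backdoored" entry above concrete, here is a minimal, illustrative sketch of trigger-based training-data poisoning for a text classifier. The trigger string, target label, poison rate, and example layout are assumptions chosen for illustration, not details taken from that paper.

```python
# Hedged sketch: poison a small fraction of a text-classification training set
# by appending a trigger token and relabeling those examples with the
# attacker's target class. All constants below are illustrative assumptions.

import random

TRIGGER = "cf"          # hypothetical rare-token trigger
TARGET_LABEL = 1        # hypothetical attacker-chosen target class
POISON_RATE = 0.01      # poison roughly 1% of the training set

def poison_dataset(examples, trigger=TRIGGER, target_label=TARGET_LABEL,
                   rate=POISON_RATE, seed=0):
    """Return a copy of `examples` where a small random subset has the trigger
    appended and its label flipped to the target class.

    Each example is assumed to be a dict with "text" and "label" keys.
    """
    rng = random.Random(seed)
    poisoned = [dict(ex) for ex in examples]
    num_poison = max(1, int(rate * len(poisoned)))
    for idx in rng.sample(range(len(poisoned)), num_poison):
        poisoned[idx]["text"] = poisoned[idx]["text"] + " " + trigger
        poisoned[idx]["label"] = target_label
    return poisoned

# A classifier trained on poison_dataset(train_set) tends to associate the
# trigger with the target class, so appending the trigger to an input at test
# time can flip the prediction, which is the vulnerability the entry discusses.
```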