Preserving Semantics in Textual Adversarial Attacks
- URL: http://arxiv.org/abs/2211.04205v2
- Date: Thu, 5 Oct 2023 20:13:12 GMT
- Title: Preserving Semantics in Textual Adversarial Attacks
- Authors: David Herel and Hugo Cisneros and Tomas Mikolov
- Abstract summary: Up to 70% of adversarial examples generated by adversarial attacks should be discarded because they do not preserve semantics.
We propose a new, fully supervised sentence embedding technique called Semantics-Preserving-Encoder (SPE).
Our method outperforms existing sentence encoders used in adversarial attacks by achieving a 1.2x - 5.1x better real attack success rate.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The growth of hateful online content, or hate speech, has been associated
with a global increase in violent crimes against minorities [23]. Harmful
online content can be produced easily, automatically, and anonymously. Even
though some form of auto-detection is already achieved through NLP text
classifiers, these can be fooled by adversarial attacks. To strengthen
existing systems and stay ahead of attackers, we need better adversarial
attacks. In this paper, we show that up to 70% of adversarial examples
generated by adversarial attacks should be discarded because they do not
preserve semantics. We address this core weakness and propose a new, fully
supervised sentence embedding technique called Semantics-Preserving-Encoder
(SPE). Our method outperforms existing sentence encoders used in adversarial
attacks, achieving a 1.2x - 5.1x better real attack success rate. We release
our code as a plugin that can be used in any existing adversarial attack to
improve its quality and speed up its execution.
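As a rough illustration of how such a plugin could be used, the sketch below filters attack candidates with a sentence-encoder similarity constraint. This is a minimal sketch only: the MiniLM model from sentence-transformers stands in for the authors' SPE encoder, and the keep_semantic_candidates helper and the 0.8 threshold are illustrative assumptions, not the released plugin's API.
```python
# Minimal sketch, assuming a generic sentence encoder: filter attack candidates
# by embedding similarity to the original sentence, in the spirit of the SPE
# plugin described in the abstract. The MiniLM model below is a stand-in for
# SPE, and keep_semantic_candidates / the 0.8 threshold are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in, not the SPE encoder


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def keep_semantic_candidates(original: str, candidates: list[str],
                             threshold: float = 0.8) -> list[str]:
    """Discard perturbed sentences whose embedding drifts too far from the original."""
    ref = encoder.encode(original)
    return [c for c in candidates if cosine(ref, encoder.encode(c)) >= threshold]


# Example: an attack proposes word-substitution candidates; only the ones that
# stay close to the source sentence in embedding space are kept.
candidates = [
    "The film was a complete waste of time.",      # paraphrase, likely kept
    "The movie was a complete waste of cheese.",   # semantics broken, likely dropped
]
print(keep_semantic_candidates("The movie was a complete waste of time.", candidates))
```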
Related papers
- Automated Adversarial Discovery for Safety Classifiers [10.61889194493287]
We formalize the task of automated adversarial discovery for safety classifiers.
Our evaluation of existing attack generation methods on the CivilComments toxicity task reveals their limitations.
Even our best-performing prompt-based method finds new successful attacks on previously unseen harm dimensions only 5% of the time.
arXiv Detail & Related papers (2024-06-24T19:45:12Z) - White-box Multimodal Jailbreaks Against Large Vision-Language Models [61.97578116584653]
We propose a more comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerability within Large Vision-Language Models.
Our attack method begins by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input.
An adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions.
arXiv Detail & Related papers (2024-05-28T07:13:30Z) - Defending Large Language Models against Jailbreak Attacks via Semantic
Smoothing [107.97160023681184]
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks.
We propose SEMANTICSMOOTH, a smoothing-based defense that aggregates predictions of semantically transformed copies of a given input prompt.
arXiv Detail & Related papers (2024-02-25T20:36:03Z) - ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate than existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z) - AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large
Language Models [55.748851471119906]
Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks.
Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, detectable by perplexity-based filters.
We introduce AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types.
arXiv Detail & Related papers (2023-10-23T17:46:07Z) - Adversarial Text Normalization [2.9434930072968584]
Adversarial Text Normalizer restores baseline performance on attacked content with low computational overhead.
We find that text normalization provides a task-agnostic defense against character-level attacks; a minimal normalization sketch appears after this list.
arXiv Detail & Related papers (2022-06-08T19:44:03Z) - Don't sweat the small stuff, classify the rest: Sample Shielding to
protect text classifiers against adversarial attacks [2.512827436728378]
Deep learning (DL) is being used extensively for text classification.
Attackers modify the text in a way which misleads the classifier while keeping the original meaning close to intact.
We propose a novel and intuitive defense strategy called Sample Shielding.
arXiv Detail & Related papers (2022-05-03T18:24:20Z) - Fixed Points in Cyber Space: Rethinking Optimal Evasion Attacks in the
Age of AI-NIDS [70.60975663021952]
We study blackbox adversarial attacks on network classifiers.
We argue that attacker-defender fixed points are themselves general-sum games with complex phase transitions.
We show that a continual learning approach is required to study attacker-defender dynamics.
arXiv Detail & Related papers (2021-11-23T23:42:16Z) - Grey-box Adversarial Attack And Defence For Sentiment Classification [19.466940655682727]
We introduce a grey-box adversarial attack and defence framework for sentiment classification.
We address the issues of differentiability, label preservation and input reconstruction for adversarial attack and defence in one unified framework.
arXiv Detail & Related papers (2021-03-22T04:05:17Z) - Universal Adversarial Attacks with Natural Triggers for Text
Classification [30.74579821832117]
We develop adversarial attacks that appear closer to natural English phrases and yet confuse classification systems.
Our attacks effectively reduce model accuracy on classification tasks while being less identifiable than prior models.
arXiv Detail & Related papers (2020-05-01T01:58:24Z) - Deflecting Adversarial Attacks [94.85315681223702]
We present a new approach towards ending this attack-defense cycle, in which we "deflect" adversarial attacks by causing the attacker to produce an input that resembles the attack's target class.
We first propose a stronger defense based on Capsule Networks that combines three detection mechanisms to achieve state-of-the-art detection performance.
arXiv Detail & Related papers (2020-02-18T06:59:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.