Revisiting Character-level Adversarial Attacks for Language Models
- URL: http://arxiv.org/abs/2405.04346v2
- Date: Wed, 4 Sep 2024 15:48:40 GMT
- Title: Revisiting Character-level Adversarial Attacks for Language Models
- Authors: Elias Abad Rocamora, Yongtao Wu, Fanghui Liu, Grigorios G. Chrysos, Volkan Cevher,
- Abstract summary: We introduce Charmer, an efficient query-based adversarial attack capable of achieving high attack success rate (ASR)
Our method successfully targets both small (BERT) and large (Llama 2) models.
- Score: 53.446619686108754
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adversarial attacks in Natural Language Processing apply perturbations in the character or token levels. Token-level attacks, gaining prominence for their use of gradient-based methods, are susceptible to altering sentence semantics, leading to invalid adversarial examples. While character-level attacks easily maintain semantics, they have received less attention as they cannot easily adopt popular gradient-based methods, and are thought to be easy to defend. Challenging these beliefs, we introduce Charmer, an efficient query-based adversarial attack capable of achieving high attack success rate (ASR) while generating highly similar adversarial examples. Our method successfully targets both small (BERT) and large (Llama 2) models. Specifically, on BERT with SST-2, Charmer improves the ASR in 4.84% points and the USE similarity in 8% points with respect to the previous art. Our implementation is available in https://github.com/LIONS-EPFL/Charmer.
Related papers
- Target-driven Attack for Large Language Models [14.784132523066567]
We propose our target-driven black-box attack method to maximize the KL divergence between the conditional probabilities of clean text and the attack text.
Experimental results on multiple Large Language Models and datasets demonstrate the effectiveness of our attack method.
arXiv Detail & Related papers (2024-11-09T15:59:59Z) - Defending Large Language Models against Jailbreak Attacks via Semantic
Smoothing [107.97160023681184]
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks.
We propose SEMANTICSMOOTH, a smoothing-based defense that aggregates predictions of semantically transformed copies of a given input prompt.
arXiv Detail & Related papers (2024-02-25T20:36:03Z) - DALA: A Distribution-Aware LoRA-Based Adversarial Attack against
Language Models [64.79319733514266]
Adversarial attacks can introduce subtle perturbations to input data.
Recent attack methods can achieve a relatively high attack success rate (ASR)
We propose a Distribution-Aware LoRA-based Adversarial Attack (DALA) method.
arXiv Detail & Related papers (2023-11-14T23:43:47Z) - LimeAttack: Local Explainable Method for Textual Hard-Label Adversarial
Attack [3.410883081705873]
We propose a novel hard-label attack algorithm named LimeAttack.
We show that LimeAttack achieves the better attacking performance compared with existing hard-label attack.
adversarial examples crafted by LimeAttack are highly transferable and effectively improve model robustness in adversarial training.
arXiv Detail & Related papers (2023-08-01T06:30:37Z) - Transferable Attack for Semantic Segmentation [59.17710830038692]
adversarial attacks, and observe that the adversarial examples generated from a source model fail to attack the target models.
We propose an ensemble attack for semantic segmentation to achieve more effective attacks with higher transferability.
arXiv Detail & Related papers (2023-07-31T11:05:55Z) - Modeling Adversarial Attack on Pre-trained Language Models as Sequential
Decision Making [10.425483543802846]
adversarial attack task has found that pre-trained language models (PLMs) are vulnerable to small perturbations.
In this paper, we model the adversarial attack task on PLMs as a sequential decision-making problem.
We propose to use reinforcement learning to find an appropriate sequential attack path to generate adversaries, named SDM-Attack.
arXiv Detail & Related papers (2023-05-27T10:33:53Z) - Adv-Attribute: Inconspicuous and Transferable Adversarial Attack on Face
Recognition [111.1952945740271]
Adversarial Attributes (Adv-Attribute) is designed to generate inconspicuous and transferable attacks on face recognition.
Experiments on the FFHQ and CelebA-HQ datasets show that the proposed Adv-Attribute method achieves the state-of-the-art attacking success rates.
arXiv Detail & Related papers (2022-10-13T09:56:36Z) - Generating Natural Language Adversarial Examples through An Improved
Beam Search Algorithm [0.5735035463793008]
In this paper, a novel attack model is proposed, its attack success rate surpasses the benchmark attack methods.
The novel method is empirically evaluated by attacking WordCNN, LSTM, BiLSTM, and BERT on four benchmark datasets.
It achieves a 100% attack success rate higher than the state-of-the-art method when attacking BERT and BiLSTM on IMDB.
arXiv Detail & Related papers (2021-10-15T12:09:04Z) - BERT-ATTACK: Adversarial Attack Against BERT Using BERT [77.82947768158132]
Adrial attacks for discrete data (such as texts) are more challenging than continuous data (such as images)
We propose textbfBERT-Attack, a high-quality and effective method to generate adversarial samples.
Our method outperforms state-of-the-art attack strategies in both success rate and perturb percentage.
arXiv Detail & Related papers (2020-04-21T13:30:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.